[ https://issues.apache.org/jira/browse/HBASE-9779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack updated HBASE-9779: ------------------------- Priority: Critical (was: Major) > IntegrationTestLoadAndVerify fails deleting IntegrationTestLoadAndVerify > table > ------------------------------------------------------------------------------- > > Key: HBASE-9779 > URL: https://issues.apache.org/jira/browse/HBASE-9779 > Project: HBase > Issue Type: Bug > Components: test > Affects Versions: 0.96.0 > Reporter: stack > Assignee: stack > Priority: Critical > Attachments: 9779part.txt > > > As part of the test, we want to delete the created table to restore cluster > state. Interestingly we can disable the table successfully but then > immediately after we fail the delete because we cannot get the table > descriptor -- getting the file descriptor is used to test if table is present. > The test for getDescriptor is kinda broke because it throws base IOE which > causes clients to retry over and over again as though the descriptor was > going to come back. > This bug is kinda ugly because in at least one case it caused our > long-running hbase-it suite run to fail so would be good to fix. > Here is sample from a test run: > {code} > Disabling table IntegrationTestLoadAndVerify 2013-10-11 18:27:53,485 INFO > [main] client.HBaseAdmin: Started disable of IntegrationTestLoadAndVerify > 2013-10-11 18:27:53,526 INFO [main] zookeeper.ZooKeeper: Initiating client > connection, connectString=a1805.halxg.cloudera.com:2181 sessionTimeout=90000 > watcher=catalogtracker-on-hconnection-0x5a7e666f > 2013-10-11 18:27:53,527 INFO [main] zookeeper.RecoverableZooKeeper: Process > identifier=catalogtracker-on-hconnection-0x5a7e666f connecting to ZooKeeper > ensemble=a1805.halxg.cloudera.com:2181 > 2013-10-11 18:27:53,527 INFO > [main-SendThread(a1805.halxg.cloudera.com:2181)] zookeeper.ClientCnxn: > Opening socket connection to server > a1805.halxg.cloudera.com/10.20.200.105:2181. Will not attempt to authenticate > using SASL (unknown error) > 2013-10-11 18:27:53,527 DEBUG [main] catalog.CatalogTracker: Starting catalog > tracker org.apache.hadoop.hbase.catalog.CatalogTracker@4ace08a5 > 2013-10-11 18:27:53,529 INFO > [main-SendThread(a1805.halxg.cloudera.com:2181)] zookeeper.ClientCnxn: Socket > connection established to a1805.halxg.cloudera.com/10.20.200.105:2181, > initiating session > 2013-10-11 18:27:53,539 INFO > [main-SendThread(a1805.halxg.cloudera.com:2181)] zookeeper.ClientCnxn: > Session establishment complete on server > a1805.halxg.cloudera.com/10.20.200.105:2181, sessionid = 0x1412d47f53a5c70, > negotiated timeout = 40000 > 2013-10-11 18:27:53,602 DEBUG [main] catalog.CatalogTracker: Stopping catalog > tracker org.apache.hadoop.hbase.catalog.CatalogTracker@4ace08a5 > 2013-10-11 18:27:53,662 INFO [main] zookeeper.ZooKeeper: Session: > 0x1412d47f53a5c70 closed > 2013-10-11 18:27:53,662 INFO [main-EventThread] zookeeper.ClientCnxn: > EventThread shut down > .2013-10-11 18:27:54,666 INFO [main] zookeeper.ZooKeeper: Initiating client > connection, connectString=a1805.halxg.cloudera.com:2181 sessionTimeout=90000 > watcher=catalogtracker-on-hconnection-0x5a7e666f > 2013-10-11 18:27:54,667 INFO [main] zookeeper.RecoverableZooKeeper: Process > identifier=catalogtracker-on-hconnection-0x5a7e666f connecting to ZooKeeper > ensemble=a1805.halxg.cloudera.com:2181 > 2013-10-11 18:27:54,667 INFO > [main-SendThread(a1805.halxg.cloudera.com:2181)] zookeeper.ClientCnxn: > Opening socket connection to server > a1805.halxg.cloudera.com/10.20.200.105:2181. Will not attempt to authenticate > using SASL (unknown error) > 2013-10-11 18:27:54,667 DEBUG [main] catalog.CatalogTracker: Starting catalog > tracker org.apache.hadoop.hbase.catalog.CatalogTracker@692c0c5d > 2013-10-11 18:27:54,667 INFO > [main-SendThread(a1805.halxg.cloudera.com:2181)] zookeeper.ClientCnxn: Socket > connection established to a1805.halxg.cloudera.com/10.20.200.105:2181, > initiating session > 2013-10-11 18:27:54,696 INFO > [main-SendThread(a1805.halxg.cloudera.com:2181)] zookeeper.ClientCnxn: > Session establishment complete on server > a1805.halxg.cloudera.com/10.20.200.105:2181, sessionid = 0x1412d47f53a5c71, > negotiated timeout = 40000 > 2013-10-11 18:27:54,821 DEBUG [main] catalog.CatalogTracker: Stopping catalog > tracker org.apache.hadoop.hbase.catalog.CatalogTracker@692c0c5d > 2013-10-11 18:27:54,871 INFO [main] zookeeper.ZooKeeper: Session: > 0x1412d47f53a5c71 closed > 2013-10-11 18:27:54,871 INFO [main-EventThread] zookeeper.ClientCnxn: > EventThread shut down > .2013-10-11 18:27:55,890 INFO [main] zookeeper.ZooKeeper: Initiating client > connection, connectString=a1805.halxg.cloudera.com:2181 sessionTimeout=90000 > watcher=catalogtracker-on-hconnection-0x5a7e666f > 2013-10-11 18:27:55,891 INFO [main] zookeeper.RecoverableZooKeeper: Process > identifier=catalogtracker-on-hconnection-0x5a7e666f connecting to ZooKeeper > ensemble=a1805.halxg.cloudera.com:2181 > 2013-10-11 18:27:55,891 INFO > [main-SendThread(a1805.halxg.cloudera.com:2181)] zookeeper.ClientCnxn: > Opening socket connection to server > a1805.halxg.cloudera.com/10.20.200.105:2181. Will not attempt to authenticate > using SASL (unknown error) > 2013-10-11 18:27:55,891 DEBUG [main] catalog.CatalogTracker: Starting catalog > tracker org.apache.hadoop.hbase.catalog.CatalogTracker@74234057 > 2013-10-11 18:27:55,892 INFO > [main-SendThread(a1805.halxg.cloudera.com:2181)] zookeeper.ClientCnxn: Socket > connection established to a1805.halxg.cloudera.com/10.20.200.105:2181, > initiating session > 2013-10-11 18:27:55,896 INFO > [main-SendThread(a1805.halxg.cloudera.com:2181)] zookeeper.ClientCnxn: > Session establishment complete on server > a1805.halxg.cloudera.com/10.20.200.105:2181, sessionid = 0x1412d47f53a5c72, > negotiated timeout = 40000 > 2013-10-11 18:27:55,908 DEBUG [main] catalog.CatalogTracker: Stopping catalog > tracker org.apache.hadoop.hbase.catalog.CatalogTracker@74234057 > 2013-10-11 18:27:55,912 INFO [main] zookeeper.ZooKeeper: Session: > 0x1412d47f53a5c72 closed > 2013-10-11 18:27:55,912 INFO [main-EventThread] zookeeper.ClientCnxn: > EventThread shut down > 2389 ms > Deleting table IntegrationTestLoadAndVerify > 2013-10-11 18:33:02,029 INFO [main] hbase.HBaseCluster: Restoring cluster - > started > 2013-10-11 18:33:02,032 INFO [main] hbase.HBaseCluster: Added new HBaseAdmin > 2013-10-11 18:33:02,032 INFO [main] hbase.HBaseCluster: Restoring cluster - > done > 2013-10-11 18:33:02,033 ERROR [main] util.AbstractHBaseTool: Error running > command-line tool > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after > attempts=35, exceptions: > Fri Oct 11 18:27:55 PDT 2013, > org.apache.hadoop.hbase.client.RpcRetryingCaller@48c6f1e8, > org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(java.io.IOException): > java.io.IOException: HTableDescriptor missing for IntegrationTestLoadAndVerify > at > org.apache.hadoop.hbase.master.handler.TableEventHandler.getTableDescriptor(TableEventHandler.java:231) > at > org.apache.hadoop.hbase.master.handler.DeleteTableHandler.prepareWithTableLock(DeleteTableHandler.java:58) > at > org.apache.hadoop.hbase.master.handler.TableEventHandler.prepare(TableEventHandler.java:93) > at org.apache.hadoop.hbase.master.HMaster.deleteTable(HMaster.java:1816) > at org.apache.hadoop.hbase.master.HMaster.deleteTable(HMaster.java:1826) > at > org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:38213) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2146) > at > org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1851) > .... > {code} > Looking on master logs, it looks fine: > {code} > 1 2013-10-11 18:27:55,541 DEBUG [MASTER_TABLE_OPERATIONS-a1805:60000-0] > handler.DisableTableHandler: Disable waiting until done; 298996 ms remaining; > [] > 0 2013-10-11 18:27:55,546 INFO [MASTER_TABLE_OPERATIONS-a1805:60000-0] > handler.DisableTableHandler: Disabled table is done=true > 1 2013-10-11 18:27:55,555 DEBUG [MASTER_TABLE_OPERATIONS-a1805:60000-0] > lock.ZKInterProcessLockBase: Released > /hbase/table-lock/IntegrationTestLoadAndVerify/write-master:600000000000001 > 2 2013-10-11 18:27:55,923 INFO [RpcServer.handler=14,port=60000] > master.HMaster: Client=hbase//10.20.200.112 delete > IntegrationTestLoadAndVerify > 3 2013-10-11 18:27:55,931 DEBUG [RpcServer.handler=14,port=60000] > lock.ZKInterProcessLockBase: Acquired a lock for > /hbase/table-lock/IntegrationTestLoadAndVerify/write-master:600000000000002 > 4 2013-10-11 18:27:55,954 DEBUG [RpcServer.handler=14,port=60000] > lock.ZKInterProcessLockBase: Released > /hbase/table-lock/IntegrationTestLoadAndVerify/write-master:600000000000002 > 5 2013-10-11 18:27:56,167 INFO [RpcServer.handler=15,port=60000] > master.HMaster: Client=hbase//10.20.200.112 delete > IntegrationTestLoadAndVerify > ... > {code} > Oddly no exceptions other than corrupted snapshot because mssing hfile. All > seems to have been operating fine. Will look more. -- This message was sent by Atlassian JIRA (v6.1#6144)