[ https://issues.apache.org/jira/browse/HBASE-12847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275433#comment-14275433 ]
Rajeshbabu Chintaguntla commented on HBASE-12847: ------------------------------------------------- I will check this [~apurtell]. Thanks. > TestZKLessSplitOnCluster frequently times out in 0.98 builds > ------------------------------------------------------------ > > Key: HBASE-12847 > URL: https://issues.apache.org/jira/browse/HBASE-12847 > Project: HBase > Issue Type: Bug > Reporter: Andrew Purtell > Fix For: 0.98.10 > > Attachments: test.log.bad.gz, test.log.good.gz > > > Gets hung up in testSSHCleanupDaugtherRegionsOfAbortedSplit waiting on > deleteTable > {noformat} > "Thread-334" prio=10 tid=0x00007f15382da800 nid=0x40ae in Object.wait() > [0x00007 > f1315f5d000] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x00000007e1b525b8> (a > org.apache.hadoop.hbase.ipc.RpcClie > nt$Call) > at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1452) > - locked <0x00000007e1b525b8> (a > org.apache.hadoop.hbase.ipc.RpcClient$C > all) > at > org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.ja > va:1661) > at > org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementatio > n.callBlockingMethod(RpcClient.java:1719) > at > org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService > $BlockingStub.disableTable(MasterProtos.java:43749) > at > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplemen > tation$5.disableTable(HConnectionManager.java:1995) > at > org.apache.hadoop.hbase.client.HBaseAdmin$5.call(HBaseAdmin.java:947) > at > org.apache.hadoop.hbase.client.HBaseAdmin$5.call(HBaseAdmin.java:942) > at > org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcR > etryingCaller.java:117) > - locked <0x00000007dfffe938> (a > org.apache.hadoop.hbase.client.RpcRetry > ingCaller) > at > org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcR > etryingCaller.java:93) > - locked <0x00000007dfffe938> (a > org.apache.hadoop.hbase.client.RpcRetry > ingCaller) > at > org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin. > java:3398) > at > org.apache.hadoop.hbase.client.HBaseAdmin.disableTableAsync(HBaseAdmi > n.java:942) > at > org.apache.hadoop.hbase.client.HBaseAdmin.disableTable(HBaseAdmin.jav > a:974) > at > org.apache.hadoop.hbase.HBaseTestingUtility.deleteTable(HBaseTestingUtility.java:1532) > at > org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster.testSSHCleanupDaugtherRegionsOfAbortedSplit(TestSplitTransactionOnCluster.java:1172) > {noformat} > See attached test.log.good.gz and test.log.bad.gz. > In test.log.bad at 2015-01-13 08:02:45,947 we acquire a lock on > testSSHCleanupDaugtherRegionsOfAbortedSplit to start a disable but do nothing > afterward. Nothing happens for one minute. Then looks like the client makes > another request but it can't get the table lock. There's no progress until > timeout. > {noformat} > 2015-01-13 08:02:45,947 INFO [Thread-334] client.HBaseAdmin$5(945): Started > disable of testSSHCleanupDaugtherRegionsOfAbortedSplit > 2015-01-13 08:02:45,948 INFO [FifoRpcScheduler.handler1-thread-4] > master.HMaster(2213): Client=apurtell//10.40.8.95 disable > testSSHCleanupDaugtherRegionsOfAbortedSplit > 2015-01-13 08:02:45,950 DEBUG [FifoRpcScheduler.handler1-thread-4] > lock.ZKInterProcessLockBase(226): Acquired a lock for > /hbase/table-lock/testSSHCleanupDaugtherRegionsOfAbortedSplit/write-master:367740000000001 > 2015-01-13 08:03:44,585 DEBUG > [ip-10-40-8-95.us-west-2.compute.internal,36774,1421136162827-BalancerChore] > master.HMaster(1553): Not running balancer because 3 region(s) in transition: > {e4f79ac2b4711f7d906a291d94302d6e={e4f79ac2b4711f7d906a291d94302d6e > state=SPLITTING, ts=1421136165837, > server=ip-10-40-8-95.us-west-2.compute.internal,47769,1421136163047}, > ebbcf66ec09960d42a2a49252599bb6d={ebbcf66ec09960d42a2a49252599bb6d > state=SPLITTI... > 2015-01-13 08:03:46,210 INFO [Thread-334] client.HBaseAdmin$5(945): Started > disable of testSSHCleanupDaugtherRegionsOfAbortedSplit > 2015-01-13 08:03:46,211 INFO [FifoRpcScheduler.handler1-thread-3] > master.HMaster(2213): Client=apurtell//10.40.8.95 disable > testSSHCleanupDaugtherRegionsOfAbortedSplit > 2015-01-13 08:03:46,214 DEBUG [FifoRpcScheduler.handler1-thread-3] > master.TableLockManager$ZKTableLockManager$1(242): Table is locked by > [tableName= > default+testSSHCleanupDaugtherRegionsOfAbortedSplit, > lockOwner=ip-10-40-8-95.us-west-2.compute.internal,36774,1421136162827, > threadId=486, purpose=C_M_DISABLE_TABLE, isShared=false, > createTime=1421136165948] > {noformat} > If we look at test.log.good where the same sequence of events begin, the > table is quickly disabled and deleted: > {noformat} > 2015-01-13 08:01:47,923 INFO [Thread-333] client.HBaseAdmin$5(945): Started > disable of testSSHCleanupDaugtherRegionsOfAbortedSplit > 2015-01-13 08:01:47,924 INFO [FifoRpcScheduler.handler1-thread-5] > master.HMaster(2213): Client=apurtell//10.40.8.95 disable > testSSHCleanupDaugtherRegionsOfAbortedSplit > 2015-01-13 08:01:47,926 DEBUG [FifoRpcScheduler.handler1-thread-5] > lock.ZKInterProcessLockBase(226): Acquired a lock for > /hbase/table-lock/testSSHCleanupDaugtherRegionsOfAbortedSplit/write-master:383350000000001 > 2015-01-13 08:01:47,932 INFO [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] > handler.DisableTableHandler(130): Attempting to disable table > testSSHCleanupDaugtherRegionsOfAbortedSplit > 2015-01-13 08:01:47,933 INFO [Thread-333] > zookeeper.RecoverableZooKeeper(121): Process > identifier=catalogtracker-on-hconnection-0x60cf9dfa connecting to ZooKeeper > ensemble=localhost:59024 > 2015-01-13 08:01:47,933 DEBUG [Thread-333] catalog.CatalogTracker(198): > Starting catalog tracker > org.apache.hadoop.hbase.catalog.CatalogTracker@610b218c > 2015-01-13 08:01:47,935 DEBUG [Thread-333-EventThread] > zookeeper.ZooKeeperWatcher(313): catalogtracker-on-hconnection-0x60cf9dfa0x0, > quorum=localhost:59024, baseZNode=/hbase Received ZooKeeper Event, type=None, > state=SyncConnected, path=null > 2015-01-13 08:01:47,935 DEBUG [Thread-333] zookeeper.ZKUtil(430): > catalogtracker-on-hconnection-0x60cf9dfa0x0, quorum=localhost:59024, > baseZNode=/hbase Set watcher on existing znode=/hbase/meta-region-server > 2015-01-13 08:01:47,936 INFO [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] > handler.DisableTableHandler(190): Disabled table, > testSSHCleanupDaugtherRegionsOfAbortedSplit, is done=true > 2015-01-13 08:01:47,936 DEBUG [Thread-333-EventThread] > zookeeper.ZooKeeperWatcher(390): > catalogtracker-on-hconnection-0x60cf9dfa-0x14ae2500c300017 connected > 2015-01-13 08:01:47,938 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] > lock.ZKInterProcessLockBase(328): Released > /hbase/table-lock/testSSHCleanupDaugtherRegionsOfAbortedSplit/write-master:383350000000001 > 2015-01-13 08:01:47,941 DEBUG [Thread-333] catalog.CatalogTracker(222): > Stopping catalog tracker > org.apache.hadoop.hbase.catalog.CatalogTracker@610b218c > 2015-01-13 08:01:47,945 INFO [Thread-333] client.HBaseAdmin(999): Disabled > testSSHCleanupDaugtherRegionsOfAbortedSplit > 2015-01-13 08:01:47,946 INFO [FifoRpcScheduler.handler1-thread-2] > master.HMaster(2047): Client=apurtell//10.40.8.95 delete > testSSHCleanupDaugtherRegionsOfAbortedSplit > 2015-01-13 08:01:47,948 DEBUG [FifoRpcScheduler.handler1-thread-2] > lock.ZKInterProcessLockBase(226): Acquired a lock for > /hbase/table-lock/testSSHCleanupDaugtherRegionsOfAbortedSplit/write-master:383350000000002 > 2015-01-13 08:01:47,953 INFO [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] > handler.TableEventHandler(123): Handling table operation C_M_DELETE_TABLE on > table testSSHCleanupDaugtherRegionsOfAbortedSplit > 2015-01-13 08:01:47,958 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] > handler.DeleteTableHandler(93): Deleting regions from META > 2015-01-13 08:01:47,962 INFO [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] > catalog.MetaEditor(496): Deleted [{ENCODED => > bc81e04ede70636e30fd1224df8bc70b, NAME => > 'testSSHCleanupDaugtherRegionsOfAbortedSplit,,1421136107549.bc81e04ede70636e30fd1224df8bc70b.', > STARTKEY => '', ENDKEY => ''}] > 2015-01-13 08:01:47,964 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] > handler.DeleteTableHandler(104): Archiving region > testSSHCleanupDaugtherRegionsOfAbortedSplit,,1421136107549.bc81e04ede70636e30fd1224df8bc70b. > from FS > 2015-01-13 08:01:47,965 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] > backup.HFileArchiver(93): ARCHIVING > hdfs://localhost:43493/user/apurtell/hbase/.tmp/data/default/testSSHCleanupDaugtherRegionsOfAbortedSplit/bc81e04ede70636e30fd1224df8bc70b > 2015-01-13 08:01:47,969 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] > backup.HFileArchiver(134): Archiving [class > org.apache.hadoop.hbase.backup.HFileArchiver$FileablePath, > file:hdfs://localhost:43493/user/apurtell/hbase/.tmp/data/default/testSSHCleanupDaugtherRegionsOfAbortedSplit/bc81e04ede70636e30fd1224df8bc70b/f] > 2015-01-13 08:01:47,978 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] > backup.HFileArchiver(438): Finished archiving from class > org.apache.hadoop.hbase.backup.HFileArchiver$FileablePath, > file:hdfs://localhost:43493/user/apurtell/hbase/.tmp/data/default/testSSHCleanupDaugtherRegionsOfAbortedSplit/bc81e04ede70636e30fd1224df8bc70b/f/9b6aaa5d9dd44f3ab113a32c029d50ae, > to > hdfs://localhost:43493/user/apurtell/hbase/archive/data/default/testSSHCleanupDaugtherRegionsOfAbortedSplit/bc81e04ede70636e30fd1224df8bc70b/f/9b6aaa5d9dd44f3ab113a32c029d50ae > 2015-01-13 08:01:47,979 INFO [IPC Server handler 6 on 43493] > blockmanagement.BlockManager(1009): BLOCK* addToInvalidates: > blk_1073741839_1015 127.0.0.1:49933 127.0.0.1:55797 127.0.0.1:57090 > 2015-01-13 08:01:47,979 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] > backup.HFileArchiver(453): Deleted all region files in: > hdfs://localhost:43493/user/apurtell/hbase/.tmp/data/default/testSSHCleanupDaugtherRegionsOfAbortedSplit/bc81e04ede70636e30fd1224df8bc70b > 2015-01-13 08:01:47,980 INFO [IPC Server handler 2 on 43493] > blockmanagement.BlockManager(1009): BLOCK* addToInvalidates: > blk_1073741838_1014 127.0.0.1:57090 127.0.0.1:49933 127.0.0.1:55797 > 2015-01-13 08:01:47,980 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] > handler.DeleteTableHandler(114): Table > 'testSSHCleanupDaugtherRegionsOfAbortedSplit' archived! > 2015-01-13 08:01:47,980 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] > handler.DeleteTableHandler(117): Removing > 'testSSHCleanupDaugtherRegionsOfAbortedSplit' descriptor. > 2015-01-13 08:01:47,981 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] > handler.DeleteTableHandler(121): Removing > 'testSSHCleanupDaugtherRegionsOfAbortedSplit' from region states. > 2015-01-13 08:01:47,981 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] > handler.DeleteTableHandler(125): Marking > 'testSSHCleanupDaugtherRegionsOfAbortedSplit' as deleted. > 2015-01-13 08:01:47,984 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] > lock.ZKInterProcessLockBase(328): Released > /hbase/table-lock/testSSHCleanupDaugtherRegionsOfAbortedSplit/write-master:383350000000002 > {noformat} > The first interesting line missing in test.log.bad is: > {noformat} > 2015-01-13 08:01:47,932 INFO [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] > handler.DisableTableHandler(130): Attempting to disable table > testSSHCleanupDaugtherRegionsOfAbortedSplit > {noformat} > Nowhere in test.log.bad does "Attempting to disable table > testSSHCleanupDaugtherRegionsOfAbortedSplit" appear. -- This message was sent by Atlassian JIRA (v6.3.4#6332)