[jira] [Created] (HBASE-12847) TestZKLessSplitOnCluster frequently times out in 0.98 builds

Andrew Purtell (JIRA) Tue, 13 Jan 2015 00:45:08 -0800

Andrew Purtell created HBASE-12847:
--------------------------------------

             Summary: TestZKLessSplitOnCluster frequently times out in 0.98 
builds
                 Key: HBASE-12847
                 URL: https://issues.apache.org/jira/browse/HBASE-12847
             Project: HBase
          Issue Type: Bug
            Reporter: Andrew Purtell
             Fix For: 0.98.10



Gets hung up in testSSHCleanupDaugtherRegionsOfAbortedSplit waiting on 
deleteTable
{noformat}
"Thread-334" prio=10 tid=0x00007f15382da800 nid=0x40ae in Object.wait() [0x00007
f1315f5d000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000007e1b525b8> (a org.apache.hadoop.hbase.ipc.RpcClie
nt$Call)
        at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1452)
        - locked <0x00000007e1b525b8> (a org.apache.hadoop.hbase.ipc.RpcClient$C
all)
        at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.ja
va:1661)
        at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementatio
n.callBlockingMethod(RpcClient.java:1719)
        at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService
$BlockingStub.disableTable(MasterProtos.java:43749)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplemen
tation$5.disableTable(HConnectionManager.java:1995)
        at org.apache.hadoop.hbase.client.HBaseAdmin$5.call(HBaseAdmin.java:947)
        at org.apache.hadoop.hbase.client.HBaseAdmin$5.call(HBaseAdmin.java:942)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcR
etryingCaller.java:117)
        - locked <0x00000007dfffe938> (a org.apache.hadoop.hbase.client.RpcRetry
ingCaller)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcR
etryingCaller.java:93)
        - locked <0x00000007dfffe938> (a org.apache.hadoop.hbase.client.RpcRetry
ingCaller)
        at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.
java:3398)
        at org.apache.hadoop.hbase.client.HBaseAdmin.disableTableAsync(HBaseAdmi
n.java:942)
        at org.apache.hadoop.hbase.client.HBaseAdmin.disableTable(HBaseAdmin.jav
a:974)
        at 
org.apache.hadoop.hbase.HBaseTestingUtility.deleteTable(HBaseTestingUtility.java:1532)
        at 
org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster.testSSHCleanupDaugtherRegionsOfAbortedSplit(TestSplitTransactionOnCluster.java:1172)
{noformat}

See attached test.log.good.gz and test.log.bad.gz.

In test.log.bad at 2015-01-13 08:02:45,947 we acquire a lock on 
testSSHCleanupDaugtherRegionsOfAbortedSplit to start a disable but do nothing 
afterward. Nothing happens for one minute. Then looks like the client makes 
another request but it can't get the table lock. There's no progress until 
timeout.

{noformat}
2015-01-13 08:02:45,947 INFO  [Thread-334] client.HBaseAdmin$5(945): Started 
disable of testSSHCleanupDaugtherRegionsOfAbortedSplit
2015-01-13 08:02:45,948 INFO  [FifoRpcScheduler.handler1-thread-4] 
master.HMaster(2213): Client=apurtell//10.40.8.95 disable 
testSSHCleanupDaugtherRegionsOfAbortedSplit
2015-01-13 08:02:45,950 DEBUG [FifoRpcScheduler.handler1-thread-4] 
lock.ZKInterProcessLockBase(226): Acquired a lock for 
/hbase/table-lock/testSSHCleanupDaugtherRegionsOfAbortedSplit/write-master:367740000000001
2015-01-13 08:03:44,585 DEBUG 
[ip-10-40-8-95.us-west-2.compute.internal,36774,1421136162827-BalancerChore] 
master.HMaster(1553): Not running balancer because 3 region(s) in transition: 
{e4f79ac2b4711f7d906a291d94302d6e={e4f79ac2b4711f7d906a291d94302d6e 
state=SPLITTING, ts=1421136165837, 
server=ip-10-40-8-95.us-west-2.compute.internal,47769,1421136163047}, 
ebbcf66ec09960d42a2a49252599bb6d={ebbcf66ec09960d42a2a49252599bb6d 
state=SPLITTI...
2015-01-13 08:03:46,210 INFO  [Thread-334] client.HBaseAdmin$5(945): Started 
disable of testSSHCleanupDaugtherRegionsOfAbortedSplit
2015-01-13 08:03:46,211 INFO  [FifoRpcScheduler.handler1-thread-3] 
master.HMaster(2213): Client=apurtell//10.40.8.95 disable 
testSSHCleanupDaugtherRegionsOfAbortedSplit
2015-01-13 08:03:46,214 DEBUG [FifoRpcScheduler.handler1-thread-3] 
master.TableLockManager$ZKTableLockManager$1(242): Table is locked by 
[tableName=
default+testSSHCleanupDaugtherRegionsOfAbortedSplit, 
lockOwner=ip-10-40-8-95.us-west-2.compute.internal,36774,1421136162827, 
threadId=486, purpose=C_M_DISABLE_TABLE, isShared=false, 
createTime=1421136165948]
{noformat}

If we look at test.log.good where the same sequence of events begin, the table 
is quickly disabled and deleted:

{noformat}
2015-01-13 08:01:47,923 INFO  [Thread-333] client.HBaseAdmin$5(945): Started 
disable of testSSHCleanupDaugtherRegionsOfAbortedSplit
2015-01-13 08:01:47,924 INFO  [FifoRpcScheduler.handler1-thread-5] 
master.HMaster(2213): Client=apurtell//10.40.8.95 disable 
testSSHCleanupDaugtherRegionsOfAbortedSplit
2015-01-13 08:01:47,926 DEBUG [FifoRpcScheduler.handler1-thread-5] 
lock.ZKInterProcessLockBase(226): Acquired a lock for 
/hbase/table-lock/testSSHCleanupDaugtherRegionsOfAbortedSplit/write-master:383350000000001
2015-01-13 08:01:47,932 INFO  [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
handler.DisableTableHandler(130): Attempting to disable table 
testSSHCleanupDaugtherRegionsOfAbortedSplit
2015-01-13 08:01:47,933 INFO  [Thread-333] zookeeper.RecoverableZooKeeper(121): 
Process identifier=catalogtracker-on-hconnection-0x60cf9dfa connecting to 
ZooKeeper ensemble=localhost:59024
2015-01-13 08:01:47,933 DEBUG [Thread-333] catalog.CatalogTracker(198): 
Starting catalog tracker org.apache.hadoop.hbase.catalog.CatalogTracker@610b218c
2015-01-13 08:01:47,935 DEBUG [Thread-333-EventThread] 
zookeeper.ZooKeeperWatcher(313): catalogtracker-on-hconnection-0x60cf9dfa0x0, 
quorum=localhost:59024, baseZNode=/hbase Received ZooKeeper Event, type=None, 
state=SyncConnected, path=null
2015-01-13 08:01:47,935 DEBUG [Thread-333] zookeeper.ZKUtil(430): 
catalogtracker-on-hconnection-0x60cf9dfa0x0, quorum=localhost:59024, 
baseZNode=/hbase Set watcher on existing znode=/hbase/meta-region-server
2015-01-13 08:01:47,936 INFO  [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
handler.DisableTableHandler(190): Disabled table, 
testSSHCleanupDaugtherRegionsOfAbortedSplit, is done=true
2015-01-13 08:01:47,936 DEBUG [Thread-333-EventThread] 
zookeeper.ZooKeeperWatcher(390): 
catalogtracker-on-hconnection-0x60cf9dfa-0x14ae2500c300017 connected
2015-01-13 08:01:47,938 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
lock.ZKInterProcessLockBase(328): Released 
/hbase/table-lock/testSSHCleanupDaugtherRegionsOfAbortedSplit/write-master:383350000000001
2015-01-13 08:01:47,941 DEBUG [Thread-333] catalog.CatalogTracker(222): 
Stopping catalog tracker org.apache.hadoop.hbase.catalog.CatalogTracker@610b218c
2015-01-13 08:01:47,945 INFO  [Thread-333] client.HBaseAdmin(999): Disabled 
testSSHCleanupDaugtherRegionsOfAbortedSplit
2015-01-13 08:01:47,946 INFO  [FifoRpcScheduler.handler1-thread-2] 
master.HMaster(2047): Client=apurtell//10.40.8.95 delete 
testSSHCleanupDaugtherRegionsOfAbortedSplit
2015-01-13 08:01:47,948 DEBUG [FifoRpcScheduler.handler1-thread-2] 
lock.ZKInterProcessLockBase(226): Acquired a lock for 
/hbase/table-lock/testSSHCleanupDaugtherRegionsOfAbortedSplit/write-master:383350000000002
2015-01-13 08:01:47,953 INFO  [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
handler.TableEventHandler(123): Handling table operation C_M_DELETE_TABLE on 
table testSSHCleanupDaugtherRegionsOfAbortedSplit
2015-01-13 08:01:47,958 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
handler.DeleteTableHandler(93): Deleting regions from META
2015-01-13 08:01:47,962 INFO  [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
catalog.MetaEditor(496): Deleted [{ENCODED => bc81e04ede70636e30fd1224df8bc70b, 
NAME => 
'testSSHCleanupDaugtherRegionsOfAbortedSplit,,1421136107549.bc81e04ede70636e30fd1224df8bc70b.',
 STARTKEY => '', ENDKEY => ''}]
2015-01-13 08:01:47,964 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
handler.DeleteTableHandler(104): Archiving region 
testSSHCleanupDaugtherRegionsOfAbortedSplit,,1421136107549.bc81e04ede70636e30fd1224df8bc70b.
 from FS
2015-01-13 08:01:47,965 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
backup.HFileArchiver(93): ARCHIVING 
hdfs://localhost:43493/user/apurtell/hbase/.tmp/data/default/testSSHCleanupDaugtherRegionsOfAbortedSplit/bc81e04ede70636e30fd1224df8bc70b
2015-01-13 08:01:47,969 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
backup.HFileArchiver(134): Archiving [class 
org.apache.hadoop.hbase.backup.HFileArchiver$FileablePath, 
file:hdfs://localhost:43493/user/apurtell/hbase/.tmp/data/default/testSSHCleanupDaugtherRegionsOfAbortedSplit/bc81e04ede70636e30fd1224df8bc70b/f]
2015-01-13 08:01:47,978 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
backup.HFileArchiver(438): Finished archiving from class 
org.apache.hadoop.hbase.backup.HFileArchiver$FileablePath, 
file:hdfs://localhost:43493/user/apurtell/hbase/.tmp/data/default/testSSHCleanupDaugtherRegionsOfAbortedSplit/bc81e04ede70636e30fd1224df8bc70b/f/9b6aaa5d9dd44f3ab113a32c029d50ae,
 to 
hdfs://localhost:43493/user/apurtell/hbase/archive/data/default/testSSHCleanupDaugtherRegionsOfAbortedSplit/bc81e04ede70636e30fd1224df8bc70b/f/9b6aaa5d9dd44f3ab113a32c029d50ae
2015-01-13 08:01:47,979 INFO  [IPC Server handler 6 on 43493] 
blockmanagement.BlockManager(1009): BLOCK* addToInvalidates: 
blk_1073741839_1015 127.0.0.1:49933 127.0.0.1:55797 127.0.0.1:57090 
2015-01-13 08:01:47,979 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
backup.HFileArchiver(453): Deleted all region files in: 
hdfs://localhost:43493/user/apurtell/hbase/.tmp/data/default/testSSHCleanupDaugtherRegionsOfAbortedSplit/bc81e04ede70636e30fd1224df8bc70b
2015-01-13 08:01:47,980 INFO  [IPC Server handler 2 on 43493] 
blockmanagement.BlockManager(1009): BLOCK* addToInvalidates: 
blk_1073741838_1014 127.0.0.1:57090 127.0.0.1:49933 127.0.0.1:55797 
2015-01-13 08:01:47,980 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
handler.DeleteTableHandler(114): Table 
'testSSHCleanupDaugtherRegionsOfAbortedSplit' archived!
2015-01-13 08:01:47,980 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
handler.DeleteTableHandler(117): Removing 
'testSSHCleanupDaugtherRegionsOfAbortedSplit' descriptor.
2015-01-13 08:01:47,981 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
handler.DeleteTableHandler(121): Removing 
'testSSHCleanupDaugtherRegionsOfAbortedSplit' from region states.
2015-01-13 08:01:47,981 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
handler.DeleteTableHandler(125): Marking 
'testSSHCleanupDaugtherRegionsOfAbortedSplit' as deleted.
2015-01-13 08:01:47,984 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
lock.ZKInterProcessLockBase(328): Released 
/hbase/table-lock/testSSHCleanupDaugtherRegionsOfAbortedSplit/write-master:383350000000002
{noformat}

The first interesting line missing in test.log.bad is:
{noformat}
2015-01-13 08:01:47,932 INFO  [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
handler.DisableTableHandler(130): Attempting to disable table 
testSSHCleanupDaugtherRegionsOfAbortedSplit
{noformat}

Nowhere in test.log.bad does "Attempting to disable table 
testSSHCleanupDaugtherRegionsOfAbortedSplit" appear.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (HBASE-12847) TestZKLessSplitOnCluster frequently times out in 0.98 builds

Reply via email to