[ https://issues.apache.org/jira/browse/HBASE-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13803498#comment-13803498 ]
Ted Yu commented on HBASE-9812: ------------------------------- Looks like there was some issue with zookeeper. Following the above log I saw: {code} 2013-10-21 11:52:24,809 INFO [dummy-master,1,1.splitLogManagerTimeoutMonitor] hbase.Chore(93): dummy-master,1,1.splitLogManagerTimeoutMonitor exiting 2013-10-21 11:52:24,908 DEBUG [pool-1-thread-1-EventThread] zookeeper.ZooKeeperWatcher(310): split-log-manager-tests145fa180-5cc5-4165-9d67-4073ab3f921b-0x141dadbb9320000 Received ZooKeeper Event, type=None, state=Disconnected, path=null 2013-10-21 11:52:24,908 DEBUG [pool-1-thread-1-EventThread] zookeeper.ZooKeeperWatcher(392): split-log-manager-tests145fa180-5cc5-4165-9d67-4073ab3f921b-0x141dadbb9320000 Received Disconnected from ZooKeeper, ignoring 2013-10-21 11:52:33,628 ERROR [SyncThread:0] server.NIOServerCnxn(180): Unexpected Exception: java.nio.channels.CancelledKeyException at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55) at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59) at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153) at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076) at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404) at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167) at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101) 2013-10-21 11:52:33,629 INFO [pool-1-thread-1] zookeeper.MiniZooKeeperCluster(249): Shutdown MiniZK cluster with all ZK servers 2013-10-21 11:52:33,641 INFO [pool-1-thread-1] hbase.ResourceChecker(171): after: master.TestSplitLogManager#testMultipleResubmits Thread=9 (was 8) Potentially hanging thread: pool-1-thread-1-SendThread(localhost:64465) java.lang.Thread.sleep(Native Method) org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:86) org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:937) org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:993) - Thread LEAK? -, OpenFileDescriptor=143 (was 137) - OpenFileDescriptor LEAK? -, MaxFileDescriptor=4096 (was 4096), SystemLoadAverage=595 (was 572) - SystemLoadAverage LEAK? -, ProcessCount=124 (was 124), AvailableMemoryMB=641 (was 751), ConnectionCount=0 (was 0) {code} I wonder if SplitLogManager should utilize RecoverableZooKeeper#setData(): {code} public Stat setData(String path, byte[] data, int version) {code} > Intermittent TestSplitLogManager#testMultipleResubmits failure > -------------------------------------------------------------- > > Key: HBASE-9812 > URL: https://issues.apache.org/jira/browse/HBASE-9812 > Project: HBase > Issue Type: Test > Reporter: Ted Yu > Priority: Minor > > From > https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/801/testReport/org.apache.hadoop.hbase.master/TestSplitLogManager/testMultipleResubmits/ > : > {code} > junit.framework.AssertionFailedError: Waiting timed out after [9,600] msec > at junit.framework.Assert.fail(Assert.java:57) > at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193) > at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:146) > at > org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3220) > at > org.apache.hadoop.hbase.master.TestSplitLogManager.waitForCounter(TestSplitLogManager.java:164) > at > org.apache.hadoop.hbase.master.TestSplitLogManager.waitForCounter(TestSplitLogManager.java:157) > at > org.apache.hadoop.hbase.master.TestSplitLogManager.testMultipleResubmits(TestSplitLogManager.java:280) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > ... > 2013-10-21 11:52:15,148 DEBUG [pool-1-thread-1] zookeeper.ZKUtil(430): > split-log-manager-tests145fa180-5cc5-4165-9d67-4073ab3f921b-0x141dadbb9320000 > Set watcher on znode that does not yet exist, /hbase/splitWAL/foo%2F1 > 2013-10-21 11:52:15,148 DEBUG [pool-1-thread-1] > master.TestSplitLogManager(186): waiting for task node creation > 2013-10-21 11:52:15,164 DEBUG [pool-1-thread-1-EventThread] > zookeeper.ZooKeeperWatcher(310): > split-log-manager-tests145fa180-5cc5-4165-9d67-4073ab3f921b-0x141dadbb9320000 > Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, > path=/hbase/splitWAL/foo%2F1 > 2013-10-21 11:52:15,164 DEBUG [pool-1-thread-1-EventThread] > regionserver.TestMasterAddressManager$NodeCreationListener(107): > nodeCreated(/hbase/splitWAL/foo%2F1) > 2013-10-21 11:52:15,164 DEBUG [pool-1-thread-1] > master.TestSplitLogManager(188): task created > 2013-10-21 11:52:15,164 DEBUG [pool-1-thread-1-EventThread] > master.SplitLogManager(711): put up splitlog task at znode > /hbase/splitWAL/foo%2F1 > 2013-10-21 11:52:15,166 DEBUG [pool-1-thread-1-EventThread] > master.SplitLogManager(753): task not yet acquired /hbase/splitWAL/foo%2F1 > ver = 0 > 2013-10-21 11:52:15,193 DEBUG [pool-1-thread-1-EventThread] > zookeeper.ZooKeeperWatcher(310): > split-log-manager-tests145fa180-5cc5-4165-9d67-4073ab3f921b-0x141dadbb9320000 > Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, > path=/hbase/splitWAL/foo%2F1 > 2013-10-21 11:52:15,194 INFO [pool-1-thread-1] hbase.Waiter(174): Waiting up > to [3,200] milli-secs(wait.for.ratio=[1]) > 2013-10-21 11:52:15,194 INFO [pool-1-thread-1-EventThread] > master.SplitLogManager(826): task /hbase/splitWAL/foo%2F1 acquired by > worker1,1,1 > 2013-10-21 11:52:15,204 INFO [pool-1-thread-1] hbase.Waiter(174): Waiting up > to [9,600] milli-secs(wait.for.ratio=[1]) > 2013-10-21 11:52:15,247 INFO > [dummy-master,1,1.splitLogManagerTimeoutMonitor] > master.SplitLogManager$TimeoutMonitor(1408): total tasks = 1 unassigned = 0 > tasks={/hbase/splitWAL/foo%2F1=last_update = 1382356335195 last_version = 1 > cur_worker_name = worker1,1,1 status = in_progress incarnation = 0 resubmits > = 0 batch = installed = 1 done = 0 error = 0} > 2013-10-21 11:52:20,250 INFO > [dummy-master,1,1.splitLogManagerTimeoutMonitor] > master.SplitLogManager$TimeoutMonitor(1408): total tasks = 1 unassigned = 0 > tasks={/hbase/splitWAL/foo%2F1=last_update = 1382356335195 last_version = 1 > cur_worker_name = worker1,1,1 status = in_progress incarnation = 0 resubmits > = 0 batch = installed = 1 done = 0 error = 0} > 2013-10-21 11:52:21,251 INFO > [dummy-master,1,1.splitLogManagerTimeoutMonitor] master.SplitLogManager(875): > resubmitting task /hbase/splitWAL/foo%2F1 > 2013-10-21 11:52:24,808 DEBUG > [dummy-master,1,1.splitLogManagerTimeoutMonitor] > zookeeper.ZooKeeperWatcher(458): > split-log-manager-tests145fa180-5cc5-4165-9d67-4073ab3f921b-0x141dadbb9320000 > Received InterruptedException, doing nothing here > java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1309) > at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1264) > at > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:414) > at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:881) > at > org.apache.hadoop.hbase.master.SplitLogManager.resubmit(SplitLogManager.java:880) > at > org.apache.hadoop.hbase.master.SplitLogManager.access$400(SplitLogManager.java:107) > at > org.apache.hadoop.hbase.master.SplitLogManager$TimeoutMonitor.chore(SplitLogManager.java:1400) > at org.apache.hadoop.hbase.Chore.run(Chore.java:80) > at java.lang.Thread.run(Thread.java:662) > 2013-10-21 11:52:24,809 DEBUG > [dummy-master,1,1.splitLogManagerTimeoutMonitor] master.SplitLogManager(881): > failed to resubmit task /hbase/splitWAL/foo%2F1 version changed > {code} > Looks like the version change caused some resubmission to fail. -- This message was sent by Atlassian JIRA (v6.1#6144)