[ https://issues.apache.org/jira/browse/HBASE-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16342526#comment-16342526 ]
Chia-Ping Tsai edited comment on HBASE-19870 at 1/28/18 11:06 AM:
------------------------------------------------------------------

{quote}And maybe the testNotCloseZkWhenPending is enough for testing the problem? Just add a assert to make sure that the thread is still alive, and try reading from the ROZKClient to make sure that it still works?
{quote}
I don't think so. Reading data from ROZKClient adds two tasks to the queue: 1) calling the async API of zk (++pendingRequests) and 2) handling the zk callback (--pendingRequests). The NPE happens when pendingRequests is not zero and no task exists in the queue. Specifically, the NPE is caused by the following sequence of events.
# add the first task (number of tasks = 1, pendingRequests = 0)
# ROZKClient#run executes the first task (number of tasks = 0, pendingRequests => 1, the callback is registered with zk)
# zk is too busy to run the callback (pendingRequests => 1)
# ROZKClient#run gets a null task while pendingRequests is not zero. ROZKClient SHOULD wait for the next task, but it tries to process the null task...

To reproduce the error, we must make sure that ROZKClient#run executes before the second task is added. testNotCloseZkWhenPending adds a blocker to the first task, hence it also blocks ROZKClient#run.
{code:java}
doAnswer(new Answer<Object>() {
  @Override
  public Object answer(InvocationOnMock invocation) throws Throwable {
    latch.await();
    return invocation.callRealMethod();
  }
}).when(mockedZK).exists(anyString(), anyBoolean(), any(StatCallback.class), any());
RO_ZK.zookeeper = mockedZK;
CompletableFuture<Stat> future = RO_ZK.exists(PATH);
// 2 * keep alive time to ensure that we will not close the zk when there are pending requests
Thread.sleep(6000);
{code}
I guess testNotCloseZkWhenPending tried to create the same concurrent contention as this issue, but it didn't. [~Apache9] WDYT?
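The four events above can be sketched with a simplified model of the worker loop. This is only an illustration under assumptions: the class RunLoopSketch, its fields, and the buggyStep/fixedStep methods are hypothetical names, not the actual ReadOnlyZKClient code or the attached patch, and a non-blocking poll() stands in for the timed poll on the real task queue.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified model of the worker loop; NOT the real ReadOnlyZKClient.
class RunLoopSketch {
  final Queue<Runnable> tasks = new ArrayDeque<>();
  final AtomicInteger pendingRequests = new AtomicInteger();

  // Buggy step: a null task is only handled when nothing is pending.
  // Returns false when the loop decides it is idle.
  boolean buggyStep() {
    Runnable task = tasks.poll(); // null when the queue is empty
    if (task == null && pendingRequests.get() == 0) {
      return false;
    }
    task.run(); // throws NullPointerException if task == null and pendingRequests != 0
    return true;
  }

  // Guarded step: a null task never reaches run(); keep going while a callback is outstanding.
  boolean fixedStep() {
    Runnable task = tasks.poll();
    if (task == null) {
      return pendingRequests.get() != 0;
    }
    task.run();
    return true;
  }

  public static void main(String[] args) {
    RunLoopSketch loop = new RunLoopSketch();
    // Event 1: add the first task; it issues the async call (++pendingRequests).
    loop.tasks.add(() -> loop.pendingRequests.incrementAndGet());
    // Event 2: the loop executes the first task.
    loop.buggyStep();
    // Event 3: the callback task is never queued because zk is busy.
    // Event 4: the loop polls a null task while pendingRequests is still 1.
    try {
      loop.buggyStep();
    } catch (NullPointerException e) {
      System.out.println("buggy step hit the NPE");
    }
    System.out.println("guarded step keeps running: " + loop.fixedStep());
  }
}
```

The point of the sketch is the ordering: the NPE only shows up if the loop polls again after the first task ran and before the callback task is queued, which is why blocking the first task itself cannot trigger it.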
> Fix the NPE in ReadOnlyZKClient#run
> -----------------------------------
>
>                 Key: HBASE-19870
>                 URL: https://issues.apache.org/jira/browse/HBASE-19870
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Chia-Ping Tsai
>            Assignee: Chia-Ping Tsai
>            Priority: Major
>             Fix For: 2.0.0-beta-2
>
>         Attachments: HBASE-19870.v1.patch
>
>
> I noticed an NPE on my Jenkins.
> {code}
> 2018-01-26 17:26:41,078 DEBUG [M:0;8546d406e429:40557-EventThread] zookeeper.ZKWatcher(443): replicationLogCleaner-0x161337ddc090004, quorum=localhost:56060, baseZNode=/hbase Received ZooKeeper Event, type=None, state=Disconnected, path=null
> java.lang.NullPointerException
>         at org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient.run(ReadOnlyZKClient.java:322)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
> If any zk task invokes #onComplete late, the count of current requests will not be zero, and the null returned from the task queue will then kill the worker thread in ReadOnlyZKClient.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)