[ 
https://issues.apache.org/jira/browse/HBASE-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16342526#comment-16342526
 ] 

Chia-Ping Tsai edited comment on HBASE-19870 at 1/28/18 11:06 AM:
------------------------------------------------------------------

{quote}And maybe the testNotCloseZkWhenPending is enough for testing the 
problem? Just add a assert to make sure that the thread is still alive, and try 
reading from the ROZKClient to make sure that it still works?
{quote}
I don't think so. Reading data from ROZKClient adds two tasks to the 
queue: 1) calling the async API of zk (++pendingRequests) and 2) handling the 
callback from zk (--pendingRequests). The NPE happens when pendingRequests is 
not zero and no task exists in the queue. Specifically, the NPE is caused by 
the following sequence of events.
 # add the first task (number of tasks = 1, pendingRequests = 0)
 # ROZKClient#run executes the first task (number of tasks = 0, pendingRequests 
=> 1, registers the callback with zk)
 # zk is too busy to run the callback (pendingRequests => 1)
 # ROZKClient#run gets a null task while pendingRequests isn't zero. ROZKClient 
SHOULD wait for the next task, but it tries to process the null task...
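
The four steps above can be replayed in a small single-threaded sketch. This is not the real ReadOnlyZKClient code: the class NullTaskSketch, the Outcome enum, and the methods stepBuggy and stepGuarded are hypothetical names made up for illustration, and the poll timeout of the real work loop is replaced by a non-blocking poll so the sequence is deterministic.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical stand-in for the work loop; NOT the actual ReadOnlyZKClient.
class NullTaskSketch {
  enum Outcome { RAN_TASK, WAITING_FOR_CALLBACK, IDLE }

  final Queue<Runnable> tasks = new ArrayDeque<>();
  int pendingRequests = 0;

  // One loop iteration that only treats "null task AND no pending requests"
  // as the idle case, then runs whatever it polled.
  Outcome stepBuggy() {
    Runnable task = tasks.poll(); // null when the queue is empty
    if (task == null && pendingRequests == 0) {
      return Outcome.IDLE;
    }
    task.run(); // throws NullPointerException if task == null and pendingRequests > 0
    return Outcome.RAN_TASK;
  }

  // One loop iteration that checks for a null task before running it.
  Outcome stepGuarded() {
    Runnable task = tasks.poll();
    if (task == null) {
      // Nothing to run: either truly idle, or a callback is still outstanding.
      return pendingRequests == 0 ? Outcome.IDLE : Outcome.WAITING_FOR_CALLBACK;
    }
    task.run();
    return Outcome.RAN_TASK;
  }

  public static void main(String[] args) {
    NullTaskSketch loop = new NullTaskSketch();
    // Step 1: add the first task; running it registers a pending request.
    loop.tasks.add(() -> loop.pendingRequests++);
    // Step 2: the loop executes the first task (tasks = 0, pendingRequests = 1).
    loop.stepBuggy();
    // Steps 3 and 4: the callback task has not been queued yet, so poll() returns null.
    try {
      loop.stepBuggy();
    } catch (NullPointerException e) {
      System.out.println("buggy step threw NPE");
    }
    System.out.println(loop.stepGuarded());
  }
}
```

In this sketch the guarded step reports that it is still waiting instead of touching the null task, which matches the behavior described in step 4 as what ROZKClient SHOULD do.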

To reproduce the error, we must make sure that ROZKClient#run runs its next 
iteration before the second task is added. testNotCloseZkWhenPending adds a 
blocker to the first task, hence it also blocks ROZKClient#run. 
{code:java}
doAnswer(new Answer<Object>() {

  @Override
  public Object answer(InvocationOnMock invocation) throws Throwable {
    latch.await();
    return invocation.callRealMethod();
  }
}).when(mockedZK).exists(anyString(), anyBoolean(), any(StatCallback.class), 
any());
RO_ZK.zookeeper = mockedZK;
CompletableFuture<Stat> future = RO_ZK.exists(PATH);
// 2 * keep alive time to ensure that we will not close the zk when there are
// pending requests
Thread.sleep(6000);{code}
I guess testNotCloseZkWhenPending tried to create the same concurrent contention 
as this issue, but it doesn't. [~Apache9] WDYT?

 


> Fix the NPE in ReadOnlyZKClient#run
> -----------------------------------
>
>                 Key: HBASE-19870
>                 URL: https://issues.apache.org/jira/browse/HBASE-19870
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Chia-Ping Tsai
>            Assignee: Chia-Ping Tsai
>            Priority: Major
>             Fix For: 2.0.0-beta-2
>
>         Attachments: HBASE-19870.v1.patch
>
>
> I noticed an NPE on my Jenkins.
> {code}
> 2018-01-26 17:26:41,078 DEBUG [M:0;8546d406e429:40557-EventThread] 
> zookeeper.ZKWatcher(443): replicationLogCleaner-0x161337ddc090004, 
> quorum=localhost:56060, baseZNode=/hbase Received ZooKeeper Event, type=None, 
> state=Disconnected, path=null
> java.lang.NullPointerException
>       at 
> org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient.run(ReadOnlyZKClient.java:322)
>       at java.lang.Thread.run(Thread.java:748)
> {code}
> If any zk task invokes #onComplete late, the count of current requests 
> will not be zero, and the null returned from the task queue will then kill 
> the worker thread in ReadOnlyZKClient.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
