[ https://issues.apache.org/jira/browse/HBASE-8365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635488#comment-13635488 ]

Jeffrey Zhong commented on HBASE-8365:
--------------------------------------

{quote}
nodeDataChangeEvent only will give the latest data because it will not be able 
to read the old data
{quote}
ZooKeeper intentionally sends out notifications without passing the original state 
that triggered the notification; it relies on clients to fetch the latest state. In 
addition, a ZooKeeper watcher is a one-time trigger, which means it fires only once 
and the client needs to re-set the watcher on the same znode to get the next 
notification.
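
As a side note on that one-time-trigger behavior, here is a minimal sketch (class and 
znode names are made up, this is not HBase code) of how a client has to read the znode 
again both to learn the latest state and to re-register the watch:
{code}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class RegionZnodeWatcher implements Watcher {
  private final ZooKeeper zk;
  private final String znode;

  public RegionZnodeWatcher(ZooKeeper zk, String znode) throws Exception {
    this.zk = zk;
    this.znode = znode;
    readAndRewatch();                      // set the initial watch
  }

  @Override
  public void process(WatchedEvent event) {
    if (event.getType() == Event.EventType.NodeDataChanged
        && znode.equals(event.getPath())) {
      try {
        readAndRewatch();                  // the watch is consumed; re-register it
      } catch (Exception e) {
        // real code must handle connection loss / deleted znode here
      }
    }
  }

  private void readAndRewatch() throws KeeperException, InterruptedException {
    Stat stat = new Stat();
    // The event carries no payload; we always fetch the latest data, so
    // intermediate states (e.g. opening->opening) can be missed entirely.
    byte[] data = zk.getData(znode, this, stat);
    System.out.println(znode + " now at version " + stat.getVersion()
        + ", " + (data == null ? 0 : data.length) + " bytes");
  }
}
{code}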

In our case, from the log, the related updates with a watcher set on the region znode 
are: 1) opening->opening 2) opening->failed_open 3) failed_open->offline 4) 
offline->opening

The first notification (when we got FAILED_OPEN) was triggered by the 
opening->opening update. By the time the Master got the notification, the znode had 
already changed to failed_open; that's the first nodeDataChanged trace.

The thing that puzzles me is that the ZooKeeper watcher is re-set on the failed_open 
state after receiving the first failed_open notification, so we should only get 
another notification when the failed_open state changes. Yet we still get one more 
failed_open later from the same znode, and the data has the same version as the data 
we received with the first notification. My guess is that either the ZK client reads 
stale cached data when the node state changes from failed_open -> offline, or a race 
condition on the ZK side causes the duplicate notifications.
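
For what it's worth, a guard along these lines (my own sketch, not existing AM code) 
would at least make the duplicate explicit, by re-reading the znode and comparing its 
version against the last one handled for that path:
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Sketch only: remember the last znode version handled per path and report a
// notification as a duplicate when the current data still carries that version.
public class DupNotificationGuard {
  private final ZooKeeper zk;
  private final Map<String, Integer> lastHandledVersion =
      new ConcurrentHashMap<String, Integer>();

  public DupNotificationGuard(ZooKeeper zk) {
    this.zk = zk;
  }

  /** Returns true if the notification should be handled, false if it is a dup. */
  public boolean shouldHandle(String path) throws Exception {
    Stat stat = new Stat();
    zk.getData(path, false, stat);               // re-read current state + version
    Integer last = lastHandledVersion.put(path, stat.getVersion());
    // Same znode, same data version as the previous notification: exactly the
    // duplicate pattern seen in the testRetrying log.
    return last == null || last != stat.getVersion();
  }
}
{code}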
 
> Duplicated ZK notifications cause Master abort (or other unknown issues)
> ------------------------------------------------------------------------
>
>                 Key: HBASE-8365
>                 URL: https://issues.apache.org/jira/browse/HBASE-8365
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.6
>            Reporter: Jeffrey Zhong
>         Attachments: TestResult.txt
>
>
> The duplicated ZK notifications should happen in trunk as well. Since the way 
> we handle ZK notifications is different in trunk, we don't see the issue 
> there. I'll explain later.
> The issue is causing TestMetaReaderEditor.testRetrying to be flaky with the error 
> message {code}reader: count=2, t=null{code} A related link is at 
> https://builds.apache.org/job/HBase-0.94/941/testReport/junit/org.apache.hadoop.hbase.catalog/TestMetaReaderEditor/testRetrying/
> The test case failure is due to an IllegalStateException that aborts the master, 
> so the rest of the test cases after testRetrying fail as well.
> Below are the steps showing why the issue happens (region 
> fa0e7a5590feb69bd065fbc99c228b36 is the region of interest):
> 1) Got first notification event RS_ZK_REGION_FAILED_OPEN at 2013-04-04 
> 17:39:01,197
> {code} DEBUG [pool-1-thread-1-EventThread] master.AssignmentManager(744): 
> Handling transition=RS_ZK_REGION_FAILED_OPEN, 
> server=janus.apache.org,42093,1365097126155, 
> region=fa0e7a5590feb69bd065fbc99c228b36{code}
> In this step, the AM tries to open the region on another RS in a separate thread
> 2) Got second notification event RS_ZK_REGION_FAILED_OPEN at 2013-04-04 
> 17:39:01,200 
> {code}DEBUG [pool-1-thread-1-EventThread] master.AssignmentManager(744): 
> Handling transition=RS_ZK_REGION_FAILED_OPEN, 
> server=janus.apache.org,42093,1365097126155, 
> region=fa0e7a5590feb69bd065fbc99c228b36{code}
> 3) Later got the OPENING notification event resulting from step 1 at 2013-04-04 
> 17:39:01,288 
> {code} DEBUG [pool-1-thread-1-EventThread] master.AssignmentManager(744): 
> Handling transition=RS_ZK_REGION_OPENING, 
> server=janus.apache.org,54833,1365097126175, 
> region=fa0e7a5590feb69bd065fbc99c228b36{code}
> In step 2, ClosedRegionHandler throws an IllegalStateException because it "Cannot 
> transit it to OFFLINE" (the state is already OPENING from notification 3) and 
> aborts the Master. This can happen in 0.94 because we handle notifications using 
> an executorService, which opens the door to handling events out of order even 
> though we receive them in the order of the updates.
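>
> As a toy illustration of that out-of-order hazard (my own sketch, not actual AM 
> code): events are received in order, but once each handler is handed to a 
> multi-threaded executor, the handlers may complete in any interleaving, so the 
> FAILED_OPEN from step 2 can be processed after the later OPENING from step 3.
> {code}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
>
> // Toy demo: submission order != completion order on a multi-threaded pool.
> public class OutOfOrderDemo {
>   public static void main(String[] args) {
>     ExecutorService pool = Executors.newFixedThreadPool(2);
>     String[] events = {"FAILED_OPEN #1", "FAILED_OPEN #2", "OPENING #3"};
>     for (final String ev : events) {
>       pool.submit(new Runnable() {
>         @Override
>         public void run() {
>           // Simulated handler work; completion order across threads is not
>           // guaranteed.
>           System.out.println(Thread.currentThread().getName() + " handled " + ev);
>         }
>       });
>     }
>     pool.shutdown();
>     // A single-threaded executor (Executors.newSingleThreadExecutor()) would keep
>     // submission order, at the cost of serializing every handler.
>   }
> }
> {code}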
> I've confirmed that we don't have duplicated AM listeners and that both events 
> were triggered by the same ZK data of the exact same version. The issue can be 
> reproduced about once per 20 runs of the testRetrying test case in a loop.
> There are several issues behind the failure:
> 1) Duplicated ZK notifications. Since a ZK watcher is a one-time trigger, duplicate 
> notifications from the same data of the same version should not happen in the 
> first place.
> 2) ZooKeeper watcher handling is wrong in both 0.94 and trunk, as follows:
> a) 0.94 handles notifications in an async way, which may lead to handling 
> notifications out of the order in which the events happened.
> b) In trunk, we handle ZK notifications synchronously, which slows down other 
> components such as SSH, LogSplitting etc. because we have a single notification 
> queue.
> c) In trunk & 0.94, we could use stale event data because we have a long listener 
> list, and the ZK node state could have changed by the time the event is handled. 
> If a listener needs to act upon the latest state, it should re-fetch the data to 
> verify that the data which triggered the handler hasn't changed (see the sketch 
> below).
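>
> For c), and for the version check suggested below, the guard could look roughly 
> like this sketch (illustrative names, assuming the handler records the version of 
> the data that triggered it):
> {code}
> import org.apache.zookeeper.KeeperException;
> import org.apache.zookeeper.ZooKeeper;
> import org.apache.zookeeper.data.Stat;
>
> // Sketch only: bail out of the handler if the znode has moved past the version
> // of the data that originally triggered the event.
> public final class StaleEventCheck {
>   private StaleEventCheck() {}
>
>   /** Returns true only if the znode still carries the expected data version. */
>   public static boolean stillCurrent(ZooKeeper zk, String path, int expectedVersion)
>       throws KeeperException, InterruptedException {
>     Stat stat = zk.exists(path, false);   // just read the Stat, no new watch
>     return stat != null && stat.getVersion() == expectedVersion;
>   }
> }
> {code}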
> Suggestions:
> For 0.94, we can band-aid the ClosedRegionHandler to pass in the expected ZK data 
> version and skip event handling on stale data, with minimal impact.
> For trunk, I'll open an improvement JIRA on ZK notification handling to provide 
> more parallelism when handling unrelated notifications.
> For the duplicated ZK notifications, we need to bring in some ZK experts to take a 
> look at this.
> Please let me know what you think, or if you have a better idea.
> Thanks!
