[ https://issues.apache.org/jira/browse/HADOOP-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239641#comment-13239641 ]
Todd Lipcon commented on HADOOP-8220:
-------------------------------------

I'll add a new test to the ActiveStandbyElector-specific code for this. I was testing it via the "integration test", but you're right that adding to the unit tests makes sense too.

bq. How does NPE occur when the elector makes sure the client is recreated upon rejoining the election? Which zkClient are you talking about?

The NPE occurred in the previous code because we had the following sequence:
- createNode succeeded
- called ZKFC becomeActive() callback
-- becomeActive() throws exception
-- ZKFC had a catch() clause which called quitElection() (it turned out this wasn't the right behavior)
--- quitElection() nulled out zkClient
- ActiveStandbyElector called monitorNode(), which tried to use zkClient, which had just been nulled out.

The new behavior avoids this, since the error handling patch is in ActiveStandbyElector itself. This makes it easier to get the right semantics.

bq. What is the purpose of adding the sleep? Could you please elaborate?

Without the sleep, it will do a tight loop retrying to become active. This generates a lot of log spew and has little actual benefit. If instead we retry only once a second, then (a) the logs are more readable, and (b) if there is another StandbyNode in the cluster, it will get a chance to try to become active. I will add a comment to this effect in the code.

> ZKFailoverController doesn't handle failure to become active correctly
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-8220
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8220
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 0.23.3, 0.24.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>         Attachments: hadoop-8220.txt
>
>
> The ZKFC doesn't properly handle the case where the monitored service fails to become active.
> Currently, it catches the exception and logs a warning, but then continues on after calling quitElection(). This causes an NPE when it later tries to use the same zkClient instance while handling that same request. There is a test case, but it doesn't ensure that the node that had the failure is later able to recover properly.
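
For readers following the sequence in the comment above, here is a minimal sketch of the idea, not the actual patch. It assumes a simplified elector that handles a becomeActive() failure itself: it drops the client, sleeps, recreates the client, and skips the monitor step for the failed attempt. The names ElectorRetrySketch, Callback, FakeZkClient, joinElection, and rejoinAfterSleep are hypothetical and only stand in for the real ActiveStandbyElector and ZooKeeper client.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified elector; NOT the real ActiveStandbyElector. */
class ElectorRetrySketch {

    /** Hypothetical stand-in for the ZKFC side of the callback. */
    interface Callback {
        void becomeActive() throws Exception;
    }

    /** Hypothetical stand-in for the ZooKeeper client handle. */
    static final class FakeZkClient {
        void monitorNode() { /* a real client would set a watch on the lock node */ }
    }

    private final Callback callback;
    private final long sleepMillis;
    private FakeZkClient zkClient = new FakeZkClient();
    final List<String> events = new ArrayList<>();

    ElectorRetrySketch(Callback callback, long sleepMillis) {
        this.callback = callback;
        this.sleepMillis = sleepMillis;
    }

    /** Tries to become active up to maxAttempts times; returns true on success. */
    boolean joinElection(int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Assume createNode succeeded, so this node may try to become active.
            try {
                callback.becomeActive();
            } catch (Exception e) {
                events.add("failed:" + attempt);
                // The elector owns the error handling: it backs off and rejoins,
                // and never reaches monitorNode() with a nulled-out client.
                if (!rejoinAfterSleep()) {
                    return false;
                }
                continue;
            }
            zkClient.monitorNode(); // safe: the client was not torn down on this path
            events.add("active:" + attempt);
            return true;
        }
        return false;
    }

    /** Drops the client, sleeps, then recreates it. Returns false if interrupted. */
    private boolean rejoinAfterSleep() {
        zkClient = null;
        try {
            // The pause avoids a tight retry loop and gives another standby a chance.
            Thread.sleep(sleepMillis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            return false;
        }
        zkClient = new FakeZkClient();
        return true;
    }
}
```

With a callback that fails twice and then succeeds, the recorded events show two failed attempts followed by a successful one, and monitorNode() is only ever called on a live client. The bounded maxAttempts loop is a choice made for the sketch so that it terminates; it is not taken from the patch.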