[ 
https://issues.apache.org/jira/browse/HADOOP-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239641#comment-13239641
 ] 

Todd Lipcon commented on HADOOP-8220:
-------------------------------------

I'll add a new test to the ActiveStandbyElector-specific code for this. I was 
testing it via the "integration test", but you're right that adding to the unit 
tests makes sense too.

bq. How does NPE occur when the elector makes sure the client is recreated upon 
rejoining the election? Which zkClient are you talking about?

The NPE occurred in the previous code because we had the following sequence:
- createNode succeeded
- called ZKFC becomeActive() callback
-- becomeActive() throws exception
-- ZKFC had a catch clause which called quitElection() (it turned out this 
wasn't the right behavior)
--- quitElection() nulled out zkClient
- ActiveStandbyElector called monitorNode(), which tried to use zkClient, which 
had just been nulled out.

The new behavior avoids this, since the patch moves the error handling into 
ActiveStandbyElector itself. This makes it easier to get the right semantics.
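
To make the sequence above concrete, here is a toy model of it. This is only a 
sketch: ToyElector, its Callback interface, and the onNodeCreatedOld / 
onNodeCreatedNew methods are hypothetical stand-ins, not the actual 
ActiveStandbyElector or ZKFC code, and a plain Object stands in for zkClient.

```java
/** Toy model of the sequence above. Hypothetical names; not the Hadoop classes. */
class ToyElector {
    /** Stand-in for the callback that ZKFC registers with the elector. */
    interface Callback {
        void becomeActive() throws Exception;
    }

    private Object zkClient = new Object(); // stand-in for the ZooKeeper client
    private Callback callback;
    boolean monitoring = false;

    void setCallback(Callback cb) {
        this.callback = cb;
    }

    /** Mirrors the behavior described above: quitting drops the client. */
    void quitElection() {
        zkClient = null;
    }

    private void monitorNode() {
        zkClient.hashCode(); // NullPointerException if zkClient was nulled out
        monitoring = true;
    }

    /** Old flow: the elector assumes becomeActive() worked and keeps going. */
    void onNodeCreatedOld() throws Exception {
        callback.becomeActive(); // the callback may have called quitElection()
        monitorNode();           // ...in which case this dereferences null
    }

    /** New flow: the elector handles the failure itself and skips monitorNode(). */
    boolean onNodeCreatedNew() {
        try {
            callback.becomeActive();
        } catch (Exception e) {
            return false; // the caller would rejoin the election after a pause
        }
        monitorNode();
        return true;
    }
}
```

With a callback that swallows its own failure and calls quitElection(), 
onNodeCreatedOld() hits the NPE in monitorNode(). With a callback that simply 
throws, onNodeCreatedNew() returns false and never touches the client.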

bq. What is the purpose of adding the sleep? Could you please elaborate?

Without the sleep, the elector retries becoming active in a tight loop. This 
generates a lot of log spew and has little actual benefit. If instead we retry 
only once a second, then (a) the logs are more readable, and (b) if there is 
another StandbyNode in the cluster, it will get a chance to try to become 
active.

I will add a comment to this effect in the code.
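
As a rough illustration of the retry-with-pause idea (not the code in the 
patch), a loop like the following sleeps between failed attempts instead of 
spinning. RetryWithPause, its retry() method, and the maxAttempts bound are 
assumptions made for this sketch; the pause would be about one second in the 
scenario described above.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper; illustrates pausing between retries, nothing more. */
class RetryWithPause {
    /**
     * Runs attempt until it succeeds or maxAttempts is reached, sleeping
     * pauseMillis after each failure. Returns the number of attempts used,
     * or -1 if every attempt failed or the thread was interrupted.
     */
    static int retry(BooleanSupplier attempt, int maxAttempts, long pauseMillis) {
        for (int i = 1; i <= maxAttempts; i++) {
            if (attempt.getAsBoolean()) {
                return i;
            }
            if (i < maxAttempts) {
                try {
                    // Pause so the log is not flooded and so another
                    // candidate gets a chance to become active first.
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return -1;
                }
            }
        }
        return -1;
    }
}
```

For example, RetryWithPause.retry(() -> tryBecomeActive(), 10, 1000L) would 
make at most ten attempts, one second apart, where tryBecomeActive() is a 
placeholder for whatever performs the transition.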
                
> ZKFailoverController doesn't handle failure to become active correctly
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-8220
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8220
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 0.23.3, 0.24.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>         Attachments: hadoop-8220.txt
>
>
> The ZKFC doesn't properly handle the case where the monitored service fails 
> to become active. Currently, it catches the exception and logs a warning, but 
> then continues on, after calling quitElection(). This causes an NPE when it 
> later tries to use the same zkClient instance while handling that same 
> request. There is a test case, but the test case doesn't ensure that the node 
> that had the failure is later able to recover properly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
