[ https://issues.apache.org/jira/browse/HADOOP-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239725#comment-13239725 ]

Bikas Saha commented on HADOOP-8220:
------------------------------------

bq. The new behavior avoids this, since the error handling patch is in 
ActiveStandbyElector itself. This makes it easier to get the right semantics.
Ah. Now I get it. The elector should be robust against failures in the client code 
(the ZKFC in this case). I like Hari's proposal of using a return value to report 
success/failure of becoming active. I am not that familiar with standard practice 
in Java - are return values preferred, or exceptions?
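For illustration, a minimal sketch of the two signaling styles (the interface and 
exception names here are hypothetical, not the actual elector API):

{code:java}
// Illustrative only; these interfaces are placeholders, not Hadoop's actual API.

// Style 1: a return value reports whether the transition succeeded.
interface ElectorCallbackWithReturn {
  /** @return true if the monitored service successfully became active */
  boolean becomeActive();
}

// Style 2: a checked exception reports the failure (and can carry its cause).
interface ElectorCallbackWithException {
  /** @throws ServiceFailedException if the transition to active fails */
  void becomeActive() throws ServiceFailedException;
}

class ServiceFailedException extends Exception {
  ServiceFailedException(String msg, Throwable cause) {
    super(msg, cause);
  }
}
{code}

Either style can work: an exception can carry the underlying cause for logging, 
while a boolean keeps the callback simple when the caller only needs a yes/no answer.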

bq. This generates a lot of log spew and has little actual benefit. If instead 
we retry only once a second, then (a) the logs are more readable, and (b) if 
there is another StandbyNode in the cluster, it will get a chance to try to 
become active.
I did not understand where the tight loop is. Do you mean (elector gets the 
lock <-> ZKFC fails to become active)?
I do not have any data on the trade-off between 1) letting the last active 
become active again, with the log spew, and 2) letting another standby become 
active by making the last active sleep. But for argument's sake I would prefer 
1). IMO, keeping continuity in the active node would reduce the overhead of 
client/datanode failover, etc.
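To make 2) concrete, a rough sketch of the "retry once a second" idea; 
quitElection()/joinElection() stand in for whatever the elector exposes, this is 
not the actual Hadoop code:

{code:java}
// Illustrative sketch only; method names are placeholders, not Hadoop's API.
class FailedActiveRetry {
  private static final long RETRY_SLEEP_MS = 1000;

  void onFailedToBecomeActive() {
    quitElection();                  // release the ZK lock so another standby can contend
    try {
      Thread.sleep(RETRY_SLEEP_MS);  // back off instead of re-grabbing the lock immediately
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
      return;
    }
    joinElection();                  // re-enter the election and try again
  }

  void quitElection() { /* delete/close the ephemeral lock znode */ }
  void joinElection() { /* try to create the ephemeral lock znode again */ }
}
{code}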

bq. becomeActive() should be protected by a timeout also. If NN is taking far 
too long to return, FC should declare failure and give up the lock. Otherwise, 
it is a deadlock.
Hari, this seems similar to the alternative proposed in HADOOP-8205 of trying to 
ensure that the transition to active is short.
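For concreteness, one way such a timeout could look; this is an illustrative 
sketch only, and the executor plumbing, names, and timeout value are assumptions, 
not Hadoop's actual implementation:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch: bound the transition to active with a deadline and
// report failure so the caller can quit the election and release the lock.
class TimedTransition {
  private static final long TRANSITION_TIMEOUT_MS = 60_000; // assumed value
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  boolean tryBecomeActive(Runnable transitionToActive) {
    Future<?> f = executor.submit(transitionToActive);
    try {
      f.get(TRANSITION_TIMEOUT_MS, TimeUnit.MILLISECONDS);
      return true;                 // NN became active within the deadline
    } catch (TimeoutException te) {
      f.cancel(true);              // abandon the slow transition
      return false;                // caller should give up the lock
    } catch (Exception e) {
      return false;                // the transition itself failed
    }
  }
}
{code}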

                
> ZKFailoverController doesn't handle failure to become active correctly
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-8220
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8220
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 0.23.3, 0.24.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>         Attachments: hadoop-8220.txt
>
>
> The ZKFC doesn't properly handle the case where the monitored service fails 
> to become active. Currently, it catches the exception and logs a warning, but 
> then continues on, after calling quitElection(). This causes an NPE when it 
> later tries to use the same zkClient instance while handling that same 
> request. There is a test case, but the test case doesn't ensure that the node 
> that had the failure is later able to recover properly.
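To illustrate the failure mode described above, a minimal sketch; the field and 
method names are illustrative placeholders, not the actual ZKFC code:

{code:java}
// Illustrative sketch of the reported bug, not the actual ZKFC implementation.
class BuggyBecomeActiveHandling {
  private ZkHandle zkClient = new ZkHandle();  // placeholder for the ZooKeeper client handle

  void onElectionWon(Runnable transitionToActive) {
    try {
      transitionToActive.run();                // monitored service fails to become active
    } catch (RuntimeException e) {
      // Bug: log a warning and quit the election, but keep handling the request.
      quitElection();                          // tears down zkClient
    }
    zkClient.ping();                           // later use of the same client -> NullPointerException
  }

  void quitElection() {
    zkClient = null;                           // simplified: the real code closes the ZK session
  }

  static class ZkHandle {
    void ping() { }
  }
}
{code}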

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
