[ 
https://issues.apache.org/jira/browse/HADOOP-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881687#comment-15881687
 ] 

Daniel Templeton commented on HADOOP-10584:
-------------------------------------------

Resetting the counts isn't the answer.  I can now reproduce this issue reliably 
by setting a breakpoint in {{processWatchEvent()}} and shutting down ZK before 
continuing.  The issue is a race between the disconnected event from the ZK 
client and the calls that create/stat the ZK node.  If the disconnected event 
is processed first, all is well.  If not, the elector retries the operation a 
few times and then fails the RM.

To echo earlier comments, why does ZK connection loss necessitate stopping the 
RM in this case?  It doesn't in any other case.  My proposal would be to remove 
the fatal error completely.  We could instead either transition to standby 
explicitly or just ignore the error (and hence skip the retries) on connection 
loss and wait for the ZK event to trigger the transition.  I kinda like the 
latter.  Any opinions?
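
To make the second option concrete, here is a rough sketch.  It is a toy model, 
not the actual {{ActiveStandbyElector}} code: the class, its methods, the 
{{ignoreConnectionLoss}} flag, and the retry budget of 3 are all made up for 
illustration, and it assumes that a connection-loss result arriving after the 
disconnected event is already harmless, as described above.

```java
/** Toy model of the elector's error handling; NOT the real ActiveStandbyElector. */
class ElectorSketch {
    enum State { ACTIVE, STANDBY, FAILED }
    enum Result { OK, CONNECTION_LOSS }

    private final int maxRetries;                // hypothetical retry budget
    private final boolean ignoreConnectionLoss;  // true = proposed behavior
    private State state = State.ACTIVE;
    private boolean disconnectedSeen = false;
    private int retries = 0;

    ElectorSketch(int maxRetries, boolean ignoreConnectionLoss) {
        this.maxRetries = maxRetries;
        this.ignoreConnectionLoss = ignoreConnectionLoss;
    }

    /** Models the result of a create/stat call on the lock node. */
    void onOperationResult(Result result) {
        if (state == State.FAILED) {
            return;
        }
        if (result == Result.OK) {
            retries = 0;
            return;
        }
        // Connection loss: skip the retry accounting if the disconnected
        // event already arrived (assumed harmless), or if we are modeling
        // the proposal of ignoring connection loss outright.
        if (disconnectedSeen || ignoreConnectionLoss) {
            return;
        }
        retries++;
        if (retries > maxRetries) {
            // Stands in for the fatal error that takes down the RM.
            state = State.FAILED;
        }
    }

    /** Models the disconnected event delivered by the ZK client. */
    void onDisconnected() {
        disconnectedSeen = true;
        if (state != State.FAILED) {
            state = State.STANDBY;
        }
    }

    State state() {
        return state;
    }

    /** Replays the losing side of the race: results first, event last. */
    static State losingRace(boolean ignoreConnectionLoss) {
        ElectorSketch elector = new ElectorSketch(3, ignoreConnectionLoss);
        for (int i = 0; i < 4; i++) {
            elector.onOperationResult(Result.CONNECTION_LOSS);
        }
        elector.onDisconnected();
        return elector.state();
    }

    public static void main(String[] args) {
        System.out.println("current:  " + losingRace(false)); // FAILED
        System.out.println("proposed: " + losingRace(true));  // STANDBY
    }
}
```

In this model the only thing that changes the outcome is whether connection 
loss counts against the retry budget.  With the proposal, the transition to 
standby is driven solely by the ZK event, so the ordering of the two no longer 
matters.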

> ActiveStandbyElector goes down if ZK quorum become unavailable
> --------------------------------------------------------------
>
>                 Key: HADOOP-10584
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10584
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Priority: Critical
>         Attachments: hadoop-10584-prelim.patch, rm.log
>
>
> ActiveStandbyElector retries operations a few times. If the ZK quorum 
> itself is down, it goes down and the daemons will have to be brought up 
> again. 
> Instead, it should log the fact that it is unable to talk to ZK, call 
> becomeStandby on its client, and continue to attempt connecting to ZK.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
