[ https://issues.apache.org/jira/browse/HADOOP-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881687#comment-15881687 ]
Daniel Templeton commented on HADOOP-10584:
-------------------------------------------

Resetting the counts isn't the answer. I can now reproduce this issue reliably by setting a breakpoint in {{processWatchEvent()}} and shutting down ZK before continuing. The issue is a race condition between the events from the ZK client and creating/statting the ZK node. If the disconnected update event comes first, all is well. If not, the elector will retry a few times and then fail the RM.

To echo earlier comments, why does ZK connection loss necessitate stopping the RM in this case? It doesn't in any other case. My proposal would be to remove the fatal error completely. We could instead either transition to standby explicitly or just ignore the error (and hence the retries) on connection loss and wait for the ZK event to trigger the transition. I kinda like the latter. Any opinion?

> ActiveStandbyElector goes down if ZK quorum become unavailable
> --------------------------------------------------------------
>
>                 Key: HADOOP-10584
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10584
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Priority: Critical
>         Attachments: hadoop-10584-prelim.patch, rm.log
>
>
> ActiveStandbyElector retries operations for a few times. If the ZK quorum
> itself is down, it goes down and the daemons will have to be brought up
> again.
> Instead, it should log the fact that it is unable to talk to ZK, call
> becomeStandby on its client, and continue to attempt connecting to ZK.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
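
To make the ordering problem in the comment concrete, here is a toy model of the race and of the second option (ignore connection loss and let the disconnect event drive the transition). It is only a sketch under stated assumptions: {{ElectorSketch}}, {{onOperationResult()}}, {{onDisconnectedEvent()}} and the {{disconnected}} flag are hypothetical names invented for illustration, not the ActiveStandbyElector or ZooKeeper API, and the way the sketch makes "event first" harmless is a modelling shortcut rather than a description of the real code.

```java
// Toy model only: every name here is hypothetical and none of this is
// Hadoop's ActiveStandbyElector or the ZooKeeper client API.
final class ElectorSketch {
    enum State { ACTIVE, STANDBY, FAILED }
    enum Result { OK, CONNECTION_LOSS }

    private final int maxRetries;
    private final boolean ignoreConnectionLoss; // true = the proposed behaviour
    private State state = State.ACTIVE;
    private boolean disconnected = false;
    private int retries = 0;

    ElectorSketch(int maxRetries, boolean ignoreConnectionLoss) {
        this.maxRetries = maxRetries;
        this.ignoreConnectionLoss = ignoreConnectionLoss;
    }

    /** Result of a create/stat style call on the ZK node. */
    void onOperationResult(Result result) {
        if (state == State.FAILED) {
            return; // the daemon is already down in this model
        }
        if (result == Result.OK) {
            retries = 0;
            return;
        }
        // Assumption: once the disconnect event has been seen, or when the
        // proposed policy is on, connection loss is not counted at all.
        if (ignoreConnectionLoss || disconnected) {
            return;
        }
        retries++;
        if (retries > maxRetries) {
            state = State.FAILED; // stands in for the fatal error that stops the RM
        }
    }

    /** Session-level "disconnected" event from the ZK client. */
    void onDisconnectedEvent() {
        if (state == State.FAILED) {
            return; // too late, the fatal error already fired
        }
        disconnected = true;
        state = State.STANDBY;
    }

    State state() {
        return state;
    }

    /** Replays one ordering of the race and returns the final state. */
    static State simulate(boolean ignoreConnectionLoss, boolean eventFirst) {
        ElectorSketch elector = new ElectorSketch(3, ignoreConnectionLoss);
        if (eventFirst) {
            elector.onDisconnectedEvent();
        }
        for (int i = 0; i < 4; i++) {
            elector.onOperationResult(Result.CONNECTION_LOSS);
        }
        if (!eventFirst) {
            elector.onDisconnectedEvent();
        }
        return elector.state();
    }

    public static void main(String[] args) {
        System.out.println("retry-then-fail, event first:   " + simulate(false, true));
        System.out.println("retry-then-fail, results first: " + simulate(false, false));
        System.out.println("ignore-loss, event first:       " + simulate(true, true));
        System.out.println("ignore-loss, results first:     " + simulate(true, false));
    }
}
```

In this model the retry-then-fail policy ends in FAILED only when the connection-loss results arrive before the disconnect event, which is the order-dependence described above. With connection loss ignored, both orderings end in STANDBY and the outcome no longer depends on which callback wins the race.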