[ https://issues.apache.org/jira/browse/HADOOP-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039564#comment-14039564 ]
Karthik Kambatla commented on HADOOP-10584:
-------------------------------------------

Logs from when we saw this error:
{noformat}
zzzz-yy-xx 06:01:30,039 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3335ms for sessionid 0x2459abcbfd0027f, closing socket connection and attempting reconnect
zzzz-yy-xx 06:01:30,144 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
zzzz-yy-xx 06:01:30,233 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server MASKED-1/10.1.128.51:2181. Will not attempt to authenticate using SASL (unknown error)
zzzz-yy-xx 06:01:30,233 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to MASKED-1/10.1.128.51:2181, initiating session
zzzz-yy-xx 06:01:31,901 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 1667ms for sessionid 0x2459abcbfd0027f, closing socket connection and attempting reconnect
zzzz-yy-xx 06:01:32,405 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server MASKED-2/10.1.128.48:2181. Will not attempt to authenticate using SASL (unknown error)
zzzz-yy-xx 06:01:32,406 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to MASKED-2/10.1.128.48:2181, initiating session
zzzz-yy-xx 06:01:32,409 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server MASKED-2/10.1.128.48:2181, sessionid = 0x2459abcbfd0027f, negotiated timeout = 5000
zzzz-yy-xx 06:01:32,412 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
zzzz-yy-xx 06:01:35,742 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3334ms for sessionid 0x2459abcbfd0027f, closing socket connection and attempting reconnect
zzzz-yy-xx 06:01:35,850 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
zzzz-yy-xx 06:01:35,966 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server MASKED-3/10.1.128.49:2181. Will not attempt to authenticate using SASL (unknown error)
zzzz-yy-xx 06:01:35,967 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to MASKED-3/10.1.128.49:2181, initiating session
zzzz-yy-xx 06:01:35,968 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server MASKED-3/10.1.128.49:2181, sessionid = 0x2459abcbfd0027f, negotiated timeout = 5000
zzzz-yy-xx 06:01:35,972 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
zzzz-yy-xx 06:01:39,303 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3335ms for sessionid 0x2459abcbfd0027f, closing socket connection and attempting reconnect
zzzz-yy-xx 06:01:39,411 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
zzzz-yy-xx 06:01:39,904 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server MASKED-1/10.1.128.51:2181. Will not attempt to authenticate using SASL (unknown error)
zzzz-yy-xx 06:01:39,904 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to MASKED-1/10.1.128.51:2181, initiating session
zzzz-yy-xx 06:01:41,572 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 1668ms for sessionid 0x2459abcbfd0027f, closing socket connection and attempting reconnect
zzzz-yy-xx 06:01:41,678 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
zzzz-yy-xx 06:01:41,926 INFO org.apache.zookeeper.ZooKeeper: Session: 0x2459abcbfd0027f closed
zzzz-yy-xx 06:01:41,927 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
zzzz-yy-xx 06:01:41,927 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x2459abcbfd0027f
zzzz-yy-xx 06:01:41,927 INFO org.apache.hadoop.ipc.Server: Stopping server on 8018
zzzz-yy-xx 06:01:41,927 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
zzzz-yy-xx 06:01:41,928 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
zzzz-yy-xx 06:01:41,928 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8018
zzzz-yy-xx 06:01:41,928 INFO org.apache.hadoop.ha.HealthMonitor: Stopping HealthMonitor thread
zzzz-yy-xx 06:01:41,928 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
{noformat}

> ActiveStandbyElector goes down if ZK quorum become unavailable
> --------------------------------------------------------------
>
>                 Key: HADOOP-10584
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10584
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Critical
>         Attachments: hadoop-10584-prelim.patch
>
>
> ActiveStandbyElector retries operations a few times. If the ZK quorum
> itself is down, it goes down and the daemons will have to be brought up
> again.
> Instead, it should log the fact that it is unable to talk to ZK, call
> becomeStandby on its client, and continue to attempt connecting to ZK.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
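
For illustration only, the behaviour proposed in the issue description (log the failure, call becomeStandby on the client, keep trying to reach ZK) could be sketched roughly as below. This is not the attached patch and not the real ActiveStandbyElector code: the class ZkReconnectSketch, the Client interface, the tryConnect supplier, the backoff delays, and the maxAttempts bound are all hypothetical names and parameters introduced here. The bound exists only so the sketch terminates in a test; a daemon would presumably pass a very large value.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; not the actual ActiveStandbyElector implementation. */
final class ZkReconnectSketch {

    /** Stand-in for the elector's client callback; only becomeStandby is modeled. */
    interface Client {
        void becomeStandby();
    }

    /**
     * Keeps calling tryConnect until it reports success or maxAttempts is used up.
     * On the first failure the client is moved to standby; every failure is logged.
     *
     * @return the 1-based attempt that succeeded, or -1 if none did
     */
    static int reconnect(BooleanSupplier tryConnect, Client client,
                         long initialDelayMs, long maxDelayMs, int maxAttempts) {
        boolean demoted = false;
        long delayMs = initialDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (tryConnect.getAsBoolean()) {
                return attempt;
            }
            System.err.println("Unable to talk to ZK (attempt " + attempt + ")");
            if (!demoted) {
                // Give up the active role once instead of terminating the daemon.
                client.becomeStandby();
                demoted = true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException e) {
                    // Preserve the interrupt for the caller and stop retrying.
                    Thread.currentThread().interrupt();
                    return -1;
                }
                // Exponential backoff, capped at maxDelayMs (an assumed policy).
                delayMs = Math.min(delayMs * 2, maxDelayMs);
            }
        }
        return -1;
    }
}
```

The point of the sketch is that a connection loss demotes the client rather than ending the process, so the daemon does not have to be restarted once the quorum comes back.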