[ 
https://issues.apache.org/jira/browse/HADOOP-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039564#comment-14039564
 ] 

Karthik Kambatla commented on HADOOP-10584:
-------------------------------------------

Logs from when we saw this error:

{noformat}
zzzz-yy-xx 06:01:30,039 INFO org.apache.zookeeper.ClientCnxn: Client session 
timed out, have not heard from server in 3335ms for sessionid 
0x2459abcbfd0027f, closing socket connection and attempting reconnect
zzzz-yy-xx 06:01:30,144 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
disconnected. Entering neutral mode...
zzzz-yy-xx 06:01:30,233 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server MASKED-1/10.1.128.51:2181. Will not attempt to 
authenticate using SASL (unknown error)
zzzz-yy-xx 06:01:30,233 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to MASKED-1/10.1.128.51:2181, initiating session
zzzz-yy-xx 06:01:31,901 INFO org.apache.zookeeper.ClientCnxn: Client session 
timed out, have not heard from server in 1667ms for sessionid 
0x2459abcbfd0027f, closing socket connection and attempting reconnect
zzzz-yy-xx 06:01:32,405 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server MASKED-2/10.1.128.48:2181. Will not attempt to 
authenticate using SASL (unknown error)
zzzz-yy-xx 06:01:32,406 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to MASKED-2/10.1.128.48:2181, initiating session
zzzz-yy-xx 06:01:32,409 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server MASKED-2/10.1.128.48:2181, sessionid = 
0x2459abcbfd0027f, negotiated timeout = 5000
zzzz-yy-xx 06:01:32,412 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
connected.
zzzz-yy-xx 06:01:35,742 INFO org.apache.zookeeper.ClientCnxn: Client session 
timed out, have not heard from server in 3334ms for sessionid 
0x2459abcbfd0027f, closing socket connection and attempting reconnect
zzzz-yy-xx 06:01:35,850 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
disconnected. Entering neutral mode...
zzzz-yy-xx 06:01:35,966 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server MASKED-3/10.1.128.49:2181. Will not attempt to 
authenticate using SASL (unknown error)
zzzz-yy-xx 06:01:35,967 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to MASKED-3/10.1.128.49:2181, initiating session
zzzz-yy-xx 06:01:35,968 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server MASKED-3/10.1.128.49:2181, sessionid = 
0x2459abcbfd0027f, negotiated timeout = 5000
zzzz-yy-xx 06:01:35,972 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
connected.
zzzz-yy-xx 06:01:39,303 INFO org.apache.zookeeper.ClientCnxn: Client session 
timed out, have not heard from server in 3335ms for sessionid 
0x2459abcbfd0027f, closing socket connection and attempting reconnect
zzzz-yy-xx 06:01:39,411 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
disconnected. Entering neutral mode...
zzzz-yy-xx 06:01:39,904 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server MASKED-1/10.1.128.51:2181. Will not attempt to 
authenticate using SASL (unknown error)
zzzz-yy-xx 06:01:39,904 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to MASKED-1/10.1.128.51:2181, initiating session
zzzz-yy-xx 06:01:41,572 INFO org.apache.zookeeper.ClientCnxn: Client session 
timed out, have not heard from server in 1668ms for sessionid 
0x2459abcbfd0027f, closing socket connection and attempting reconnect
zzzz-yy-xx 06:01:41,678 FATAL org.apache.hadoop.ha.ActiveStandbyElector: 
Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further 
znode monitoring connection errors.
zzzz-yy-xx 06:01:41,926 INFO org.apache.zookeeper.ZooKeeper: Session: 
0x2459abcbfd0027f closed
zzzz-yy-xx 06:01:41,927 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal 
error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not 
retrying further znode monitoring connection errors.
zzzz-yy-xx 06:01:41,927 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
Ignoring stale result from old client with sessionId 0x2459abcbfd0027f
zzzz-yy-xx 06:01:41,927 INFO org.apache.hadoop.ipc.Server: Stopping server on 
8018
zzzz-yy-xx 06:01:41,927 INFO org.apache.zookeeper.ClientCnxn: EventThread shut 
down
zzzz-yy-xx 06:01:41,928 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
Yielding from election
zzzz-yy-xx 06:01:41,928 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server 
listener on 8018
zzzz-yy-xx 06:01:41,928 INFO org.apache.hadoop.ha.HealthMonitor: Stopping 
HealthMonitor thread
zzzz-yy-xx 06:01:41,928 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server 
Responder
{noformat}
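
A note on the intervals in the log above: they appear consistent with how the ZooKeeper client is generally understood to derive its timeouts from the negotiated session timeout, namely a read timeout of two thirds of the session timeout and a per-server connect timeout of the session timeout divided by the number of servers in the connect string. The following is only a minimal sketch of that arithmetic under that assumption; the class and method names are hypothetical, and the server count of 3 is taken from the three MASKED hosts in the log.

```java
public class ZkTimeoutSketch {
    // Assumption: read timeout is 2/3 of the negotiated session timeout.
    static int readTimeoutMs(int negotiatedSessionTimeoutMs) {
        return negotiatedSessionTimeoutMs * 2 / 3;
    }

    // Assumption: connect timeout is the session timeout split evenly
    // across the servers in the connect string.
    static int connectTimeoutMs(int negotiatedSessionTimeoutMs, int serverCount) {
        return negotiatedSessionTimeoutMs / serverCount;
    }

    public static void main(String[] args) {
        int negotiated = 5000; // "negotiated timeout = 5000" in the log
        int servers = 3;       // MASKED-1, MASKED-2, MASKED-3
        System.out.println("read timeout ms:    " + readTimeoutMs(negotiated));
        System.out.println("connect timeout ms: " + connectTimeoutMs(negotiated, servers));
    }
}
```

Under these assumptions the values come out at roughly 3333 ms and 1666 ms, which lines up with the "have not heard from server in 3335ms" and "in 1667ms" messages: the longer interval follows an established session, the shorter one follows a connection attempt that never completed session establishment.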

> ActiveStandbyElector goes down if ZK quorum becomes unavailable
> ---------------------------------------------------------------
>
>                 Key: HADOOP-10584
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10584
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Critical
>         Attachments: hadoop-10584-prelim.patch
>
>
> ActiveStandbyElector retries failed operations only a few times. If the ZK 
> quorum itself is down, the elector gives up and shuts down, and the daemons 
> have to be restarted. 
> Instead, it should log that it is unable to talk to ZK, call 
> becomeStandby on its client, and keep trying to reconnect to ZK.
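
The behaviour proposed in the description could be sketched roughly as follows. This is not the attached patch and does not use the real ActiveStandbyElector API: the `Callback` interface, the `tryConnect` supplier, and the fixed backoff are hypothetical stand-ins chosen so the example runs on its own.

```java
import java.util.function.BooleanSupplier;

public class ElectorRetrySketch {
    /** Hypothetical stand-in for the elector's client callback. */
    interface Callback {
        void becomeStandby();
    }

    /**
     * Keeps trying to connect instead of giving up after a few attempts.
     * On the first failure it logs the problem and demotes the client to
     * standby once; it then retries until a connection succeeds.
     * Returns the number of attempts that were needed.
     */
    static int reconnectUntilAvailable(BooleanSupplier tryConnect,
                                       Callback callback,
                                       long backoffMs) {
        int attempts = 0;
        boolean demoted = false;
        while (true) {
            attempts++;
            if (tryConnect.getAsBoolean()) {
                return attempts;
            }
            if (!demoted) {
                System.err.println("Unable to talk to ZK; becoming standby and retrying");
                callback.becomeStandby();
                demoted = true;
            }
            try {
                Thread.sleep(backoffMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt and stop retrying.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while reconnecting", e);
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        int[] standbyCalls = {0};
        // Simulated quorum that becomes reachable on the third attempt.
        int attempts = reconnectUntilAvailable(
                () -> ++calls[0] >= 3,
                () -> standbyCalls[0]++,
                10L);
        System.out.println(attempts + " attempts, becomeStandby calls: " + standbyCalls[0]);
    }
}
```

The point of the sketch is the control flow: a connection loss is treated as a reason to step down and wait, not as a fatal error that stops the daemon. A real fix would also need to decide on a backoff policy and on how to rejoin the election after reconnecting, which this example leaves out.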



--
This message was sent by Atlassian JIRA
(v6.2#6252)
