[jira] [Commented] (HADOOP-10584) ActiveStandbyElector goes down if ZK quorum become unavailable

2024-01-04 Thread Shilun Fan (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802967#comment-17802967
 ] 

Shilun Fan commented on HADOOP-10584:
-

Bulk update: moved all non-blocker issues out of 3.4.0; please move this one 
back if it is a blocker. Retargeting to 3.5.0.

> ActiveStandbyElector goes down if ZK quorum become unavailable
> --
>
> Key: HADOOP-10584
> URL: https://issues.apache.org/jira/browse/HADOOP-10584
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ha
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>Assignee: Daniel Templeton
>Priority: Major
> Attachments: HADOOP-10584.prelim.patch, hadoop-10584-prelim.patch, 
> rm.log
>
>
> ActiveStandbyElector retries operations for a few times. If the ZK quorum 
> itself is down, it goes down and the daemons will have to be brought up 
> again. 
> Instead, it should log the fact that it is unable to talk to ZK, call 
> becomeStandby on its client, and continue to attempt connecting to ZK.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-10584) ActiveStandbyElector goes down if ZK quorum become unavailable

2018-03-01 Thread SammiChen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383045#comment-16383045
 ] 

SammiChen commented on HADOOP-10584:


Hi [~templedf], is this still targeted for 2.9.1? If not, can we push it out 
to the next release, 2.9.2? 




[jira] [Commented] (HADOOP-10584) ActiveStandbyElector goes down if ZK quorum become unavailable

2017-02-23 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881687#comment-15881687
 ] 

Daniel Templeton commented on HADOOP-10584:
---

Resetting the counts isn't the answer.  I can now reproduce this issue reliably 
by setting a breakpoint in {{processWatchEvent()}} and shutting down ZK before 
continuing.  The issue is a race condition between the events from the ZK 
client and creating/statting the ZK node.  If the disconnected event arrives 
first, all is well.  If not, the elector will retry a few times and then fail 
the RM.

To echo earlier comments, why does ZK connection loss necessitate stopping the 
RM in this case?  It doesn't in any other case.  My proposal would be to remove 
the fatal error completely.  We could instead either transition to standby 
explicitly or just ignore the error (and hence the retries) on connection loss 
and wait for the ZK event to trigger the transition.  I kinda like the latter.  
Any opinion?
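
A minimal sketch of the second option (ignore connection loss in the result 
callback and let the watch event drive the transition). All names below are 
hypothetical stand-ins for illustration, not the real ActiveStandbyElector 
code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the elector; not the real class.
class ConnectionLossSketch {
  enum Code { OK, CONNECTIONLOSS, OTHER }
  enum Event { DISCONNECTED, SYNC_CONNECTED }

  // Records what the elector decided to do, in order.
  final List<String> actions = new ArrayList<>();
  boolean wantToBeInElection = true;

  // Result callback for the create/stat operation on the lock znode.
  void processResult(Code rc) {
    if (rc == Code.OK) {
      actions.add("becomeActive");
    } else if (rc == Code.CONNECTIONLOSS) {
      // Proposed behavior: no retries and no fatal error. The Disconnected
      // watch event is expected to drive the state change instead.
      actions.add("ignored connection loss");
    } else {
      actions.add("fatalError");
    }
  }

  // Watch-event path: connection state changes arrive here.
  void processWatchEvent(Event e) {
    if (e == Event.DISCONNECTED) {
      actions.add("enterNeutralMode");
    } else if (e == Event.SYNC_CONNECTED && wantToBeInElection) {
      actions.add("rejoinElection");
    }
  }
}
```

With this shape, the order in which the connection-loss result and the 
disconnected event arrive no longer matters, since only the event path 
changes state.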

> ActiveStandbyElector goes down if ZK quorum become unavailable
> --
>
> Key: HADOOP-10584
> URL: https://issues.apache.org/jira/browse/HADOOP-10584
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ha
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>Priority: Critical
> Attachments: hadoop-10584-prelim.patch, rm.log
>
>
> ActiveStandbyElector retries operations for a few times. If the ZK quorum 
> itself is down, it goes down and the daemons will have to be brought up 
> again. 
> Instead, it should log the fact that it is unable to talk to ZK, call 
> becomeStandby on its client, and continue to attempt connecting to ZK.






[jira] [Commented] (HADOOP-10584) ActiveStandbyElector goes down if ZK quorum become unavailable

2017-02-17 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872106#comment-15872106
 ] 

Daniel Templeton commented on HADOOP-10584:
---

I'm looking at this issue now, and it seems to me that the issue could be 
resolved by resetting the retry counts when the session is reconnected.  If 
we've lost the session, then whatever retry counts we had previously don't 
really apply anymore, so we should reset them on reconnect.  It looks like this 
issue happens only when the ZK connection is flaky.
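
As a rough sketch of the reset-on-reconnect idea, assuming a simplified 
counter class (the class and method names are hypothetical, not taken from 
ActiveStandbyElector):

```java
// Hypothetical model: retry counters that are cleared when the session
// reconnects, so earlier failures do not count against the new connection.
class RetryCounters {
  private final int maxRetries;
  private int createRetryCount;
  private int statRetryCount;

  RetryCounters(int maxRetries) {
    this.maxRetries = maxRetries;
  }

  /** Returns true and counts the attempt if another create retry is allowed. */
  boolean shouldRetryCreate() {
    if (createRetryCount >= maxRetries) {
      return false;
    }
    createRetryCount++;
    return true;
  }

  /** Returns true and counts the attempt if another stat retry is allowed. */
  boolean shouldRetryStat() {
    if (statRetryCount >= maxRetries) {
      return false;
    }
    statRetryCount++;
    return true;
  }

  /** Called when the session is (re)connected: start counting from zero. */
  void onSessionConnected() {
    createRetryCount = 0;
    statRetryCount = 0;
  }
}
```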




[jira] [Commented] (HADOOP-10584) ActiveStandbyElector goes down if ZK quorum become unavailable

2015-11-13 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004828#comment-15004828
 ] 

Karthik Kambatla commented on HADOOP-10584:
---

Based on my recollection from a while ago and a brief look at the attached 
prelim patch, there are a couple of issues here:
# When the RM loses its connection while executing an operation, the operation 
just fails without enough retries. The patch adds a retry loop to handle this.
# When the RM loses its connection to ZK, it doesn't give up being Active, so 
it continues to serve the apps and nodes connected to it. The patch, in 
addition to rejoining the election, has the client (ZKFC/RM) enter neutral 
mode. Today, the RM doesn't do anything on {{enterNeutralMode}}, but of course 
this can be improved going forward. 
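
The two points above can be sketched roughly as follows. This is not the 
prelim patch itself; the interfaces and the exception type are hypothetical 
stand-ins used only to show a retry loop that falls back to neutral mode:

```java
// Hypothetical sketch; none of these names come from the actual patch.
class RetryThenNeutral {
  /** Stand-in for a ZooKeeper connection-loss failure. */
  static class ConnectionLossException extends Exception {}

  /** An operation against ZK that may fail with connection loss. */
  interface ZkOp<T> {
    T run() throws ConnectionLossException;
  }

  /** Stand-in for the elector's client (ZKFC/RM). */
  interface Client {
    void enterNeutralMode();
  }

  /**
   * Runs op up to maxAttempts times. Returns its result on success. If every
   * attempt loses the connection, puts the client into neutral mode and
   * returns null instead of raising a fatal error.
   */
  static <T> T runWithRetries(ZkOp<T> op, int maxAttempts, Client client) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return op.run();
      } catch (ConnectionLossException e) {
        // Connection lost; try again until the attempts run out.
      }
    }
    client.enterNeutralMode();
    return null;
  }
}
```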

I won't be able to work on this for the next month or so. If anyone has cycles, 
please feel free to take it up. 

> ActiveStandbyElector goes down if ZK quorum become unavailable
> --
>
> Key: HADOOP-10584
> URL: https://issues.apache.org/jira/browse/HADOOP-10584
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ha
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Critical
> Attachments: hadoop-10584-prelim.patch, rm.log
>
>
> ActiveStandbyElector retries operations for a few times. If the ZK quorum 
> itself is down, it goes down and the daemons will have to be brought up 
> again. 
> Instead, it should log the fact that it is unable to talk to ZK, call 
> becomeStandby on its client, and continue to attempt connecting to ZK.





[jira] [Commented] (HADOOP-10584) ActiveStandbyElector goes down if ZK quorum become unavailable

2015-06-18 Thread Rakesh R (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14591365#comment-14591365
 ] 

Rakesh R commented on HADOOP-10584:
---

Sorry for pitching in late. After looking at the logic, I also feel this case 
can occur in production clusters. On ZooKeeper connection loss, 
ActiveStandbyElector does a certain number of retries and finally calls 
{{ActiveStandbyElectorCallback#notifyFatalError()}}. I can see that the 
{{EmbeddedElectorService#notifyFatalError}} implementation handles this case 
by immediately terminating the service. I think we have room to improve this 
logic instead of terminating immediately.

About the proposed patch: IIUC, it is not required to add extra handling of 
ZooKeeper exceptions and re-election in the ActiveStandbyElector class. 
We already have the {{ActiveStandbyElector#processWatchEvent}} logic to handle 
ZK connection state changes. On a connection state change, the ZooKeeper 
client notifies the registered ZK watcher of the new state, such as 
SyncConnected, Disconnected, or Expired. Based on that state, 
{{ActiveStandbyElector}} notifies the registered 
{{ActiveStandbyElectorCallback}} and performs the state transitions. Please see 
[ActiveStandbyElector.java#L550|https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ActiveStandbyElector.java#L550]

What I mean is that the ZooKeeper client stays alive and internally keeps 
re-establishing the connection indefinitely. IMHO, we could think of 
implementing {{EmbeddedElectorService#enterNeutralMode}} to handle the NEUTRAL 
transition of the RM. Also, {{ActiveStandbyElectorCallback#notifyFatalError()}} 
has to be refined. Any thoughts?

{code}
  public void enterNeutralMode() {
    /**
     * Possibly due to transient connection issues. Do nothing.
     * TODO: Might want to keep track of how long in this state and transition
     * to standby.
     */
  }
{code}
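
One way the TODO could be approached is sketched below, assuming a small 
helper that records when neutral mode began. The class, its methods, and the 
grace-period parameter are hypothetical; the current time is passed in 
explicitly so the logic is easy to test:

```java
// Hypothetical helper for the TODO above: remember when neutral mode started
// and report when a configurable grace period has elapsed.
class NeutralModeTracker {
  private final long graceMillis;
  private long neutralSinceMillis = -1; // -1 means "not in neutral mode"

  NeutralModeTracker(long graceMillis) {
    this.graceMillis = graceMillis;
  }

  /** Records the start of neutral mode; repeated calls keep the first time. */
  void enterNeutralMode(long nowMillis) {
    if (neutralSinceMillis < 0) {
      neutralSinceMillis = nowMillis;
    }
  }

  /** Clears the state, for example once the session is connected again. */
  void leaveNeutralMode() {
    neutralSinceMillis = -1;
  }

  /** True once neutral mode has lasted at least the grace period. */
  boolean shouldTransitionToStandby(long nowMillis) {
    return neutralSinceMillis >= 0
        && nowMillis - neutralSinceMillis >= graceMillis;
  }
}
```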



[jira] [Commented] (HADOOP-10584) ActiveStandbyElector goes down if ZK quorum become unavailable

2015-06-15 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587287#comment-14587287
 ] 

Xuan Gong commented on HADOOP-10584:


[~vinodkv] [~kasha]

bq. From my previous investigation, the patch I posted should help.

It does help. But the patch calls reJoinElection(0), which in turn calls 
joinElectionInternal:
{code}
  private void joinElectionInternal() {
    Preconditions.checkState(appData != null,
        "trying to join election without any app data");
    if (zkClient == null) {
      if (!reEstablishSession()) {
        fatalError("Failed to reEstablish connection with ZooKeeper");
        return;
      }
    }
    createRetryCount = 0;
    wantToBeInElection = true;
    createLockNodeAsync();
  }
{code}
Since the ZK quorum is unavailable, we still have the same issue. The 
difference is that with the patch we will retry for 45s more (with the default 
configuration).

So if we want to use the retry-then-exit pattern, I think that both the current 
code and the current code + the patch are fine. We also need to tune the 
configuration for the cluster.

Or, if we do not expect the RM to exit for this reason (ZK quorum is 
unavailable), instead of doing
{code}
public void handle(RMFatalEvent event) {
  LOG.fatal("Received a " + RMFatalEvent.class.getName() + " of type " +
      event.getType().name() + ". Cause:\n" + event.getCause());

  ExitUtil.terminate(1, event.getCause());
}
{code}

We could check the event type, transition the RM to standby, and then rejoin 
the electorService.
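
A minimal sketch of that dispatch, assuming simplified stand-in enums. Only 
EMBEDDED_ELECTOR_FAILED is taken from the RM log in this thread; OTHER is a 
placeholder for every remaining event type, and the action names are 
hypothetical:

```java
// Hypothetical decision logic for the fatal-event handler; not the RM's code.
class FatalEventHandlerSketch {
  // EMBEDDED_ELECTOR_FAILED appears in the RM log; OTHER is a placeholder.
  enum EventType { EMBEDDED_ELECTOR_FAILED, OTHER }

  enum Action { TRANSITION_TO_STANDBY_AND_REJOIN, TERMINATE }

  /** Picks an action by event type instead of always terminating. */
  static Action decide(EventType type) {
    switch (type) {
      case EMBEDDED_ELECTOR_FAILED:
        // Elector failure (for example, ZK quorum unavailable): stay up,
        // step down to standby, and rejoin the election.
        return Action.TRANSITION_TO_STANDBY_AND_REJOIN;
      default:
        // Everything else keeps today's behavior.
        return Action.TERMINATE;
    }
  }
}
```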



[jira] [Commented] (HADOOP-10584) ActiveStandbyElector goes down if ZK quorum become unavailable

2015-06-15 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586986#comment-14586986
 ] 

Vinod Kumar Vavilapalli commented on HADOOP-10584:
--

Does seem critical to me, but haven't seen any activity in a while. [~kasha] / 
[~xgong], can one of you comment on a possible fix? If it's uncertain, I'd like 
to move this to 2.7.2. Tx.



[jira] [Commented] (HADOOP-10584) ActiveStandbyElector goes down if ZK quorum become unavailable

2015-06-15 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587180#comment-14587180
 ] 

Karthik Kambatla commented on HADOOP-10584:
---

We run into this occasionally on our test clusters. From my previous 
investigation, the patch I posted should help. However, I couldn't test because 
I couldn't find a way to reproduce the problem. It should be okay to punt to 
2.7.2. 



[jira] [Commented] (HADOOP-10584) ActiveStandbyElector goes down if ZK quorum become unavailable

2015-01-04 Thread Peng Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263809#comment-14263809
 ] 

Peng Zhang commented on HADOOP-10584:
-

I hit a similar error on a YARN RM with HA automatic failover enabled. 

{noformat}
2015-01-04,12:42:30,682 FATAL org.apache.hadoop.ha.ActiveStandbyElector: 
Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further 
znode monitoring connection errors.
2015-01-04,12:42:30,886 INFO org.apache.zookeeper.ZooKeeper: Session: 
0x2498936f2a8c448 closed
2015-01-04,12:42:30,888 INFO org.apache.zookeeper.ClientCnxn: EventThread shut 
down
2015-01-04,12:42:30,888 FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
EMBEDDED_ELECTOR_FAILED. Cause:
Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further 
znode monitoring connection errors.
2015-01-04,12:42:30,891 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
status 1
{noformat}

 ActiveStandbyElector goes down if ZK quorum become unavailable
 --

 Key: HADOOP-10584
 URL: https://issues.apache.org/jira/browse/HADOOP-10584
 Project: Hadoop Common
  Issue Type: Bug
  Components: ha
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Critical
 Attachments: hadoop-10584-prelim.patch


 ActiveStandbyElector retries operations for a few times. If the ZK quorum 
 itself is down, it goes down and the daemons will have to be brought up 
 again. 
 Instead, it should log the fact that it is unable to talk to ZK, call 
 becomeStandby on its client, and continue to attempt connecting to ZK.





[jira] [Commented] (HADOOP-10584) ActiveStandbyElector goes down if ZK quorum become unavailable

2014-06-20 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039564#comment-14039564
 ] 

Karthik Kambatla commented on HADOOP-10584:
---

Logs from when we saw this error:

{noformat}
-yy-xx 06:01:30,039 INFO org.apache.zookeeper.ClientCnxn: Client session 
timed out, have not heard from server in 3335ms for sessionid 
0x2459abcbfd0027f, closing socket connection and attempting reconnect
-yy-xx 06:01:30,144 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
disconnected. Entering neutral mode...
-yy-xx 06:01:30,233 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server MASKED-1/10.1.128.51:2181. Will not attempt to 
authenticate using SASL (unknown error)
-yy-xx 06:01:30,233 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to MASKED-1/10.1.128.51:2181, initiating session
-yy-xx 06:01:31,901 INFO org.apache.zookeeper.ClientCnxn: Client session 
timed out, have not heard from server in 1667ms for sessionid 
0x2459abcbfd0027f, closing socket connection and attempting reconnect
-yy-xx 06:01:32,405 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server MASKED-2/10.1.128.48:2181. Will not attempt to 
authenticate using SASL (unknown error)
-yy-xx 06:01:32,406 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to MASKED-2/10.1.128.48:2181, initiating session
-yy-xx 06:01:32,409 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server MASKED-2/10.1.128.48:2181, sessionid = 
0x2459abcbfd0027f, negotiated timeout = 5000
-yy-xx 06:01:32,412 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
connected.
-yy-xx 06:01:35,742 INFO org.apache.zookeeper.ClientCnxn: Client session 
timed out, have not heard from server in 3334ms for sessionid 
0x2459abcbfd0027f, closing socket connection and attempting reconnect
-yy-xx 06:01:35,850 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
disconnected. Entering neutral mode...
-yy-xx 06:01:35,966 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server MASKED-3/10.1.128.49:2181. Will not attempt to 
authenticate using SASL (unknown error)
-yy-xx 06:01:35,967 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to MASKED-3/10.1.128.49:2181, initiating session
-yy-xx 06:01:35,968 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server MASKED-3/10.1.128.49:2181, sessionid = 
0x2459abcbfd0027f, negotiated timeout = 5000
-yy-xx 06:01:35,972 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
connected.
-yy-xx 06:01:39,303 INFO org.apache.zookeeper.ClientCnxn: Client session 
timed out, have not heard from server in 3335ms for sessionid 
0x2459abcbfd0027f, closing socket connection and attempting reconnect
-yy-xx 06:01:39,411 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
disconnected. Entering neutral mode...
-yy-xx 06:01:39,904 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server MASKED-1/10.1.128.51:2181. Will not attempt to 
authenticate using SASL (unknown error)
-yy-xx 06:01:39,904 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to MASKED-1/10.1.128.51:2181, initiating session
-yy-xx 06:01:41,572 INFO org.apache.zookeeper.ClientCnxn: Client session 
timed out, have not heard from server in 1668ms for sessionid 
0x2459abcbfd0027f, closing socket connection and attempting reconnect
-yy-xx 06:01:41,678 FATAL org.apache.hadoop.ha.ActiveStandbyElector: 
Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further 
znode monitoring connection errors.
-yy-xx 06:01:41,926 INFO org.apache.zookeeper.ZooKeeper: Session: 
0x2459abcbfd0027f closed
-yy-xx 06:01:41,927 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal 
error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not 
retrying further znode monitoring connection errors.
-yy-xx 06:01:41,927 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
Ignoring stale result from old client with sessionId 0x2459abcbfd0027f
-yy-xx 06:01:41,927 INFO org.apache.hadoop.ipc.Server: Stopping server on 
8018
-yy-xx 06:01:41,927 INFO org.apache.zookeeper.ClientCnxn: EventThread shut 
down
-yy-xx 06:01:41,928 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
Yielding from election
-yy-xx 06:01:41,928 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server 
listener on 8018
-yy-xx 06:01:41,928 INFO org.apache.hadoop.ha.HealthMonitor: Stopping 
HealthMonitor thread
-yy-xx 06:01:41,928 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server 
Responder
{noformat}

 ActiveStandbyElector goes down if ZK quorum become unavailable
 --

 Key: HADOOP-10584
 URL: 

[jira] [Commented] (HADOOP-10584) ActiveStandbyElector goes down if ZK quorum become unavailable

2014-05-15 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993080#comment-13993080
 ] 

Karthik Kambatla commented on HADOOP-10584:
---

More background: We saw this when ZK became inaccessible for a few minutes. 
ZKFC went down and the corresponding master was transitioned to Standby. 

bq. You mean instead of calling fatalError() like its doing now?
Yes. Or, we should have two retry modes: the retries we have today, followed 
by a call to becomeStandby, all within an outer retry-forever loop that sleeps 
for a shorter time between inner loops.



 ActiveStandbyElector goes down if ZK quorum become unavailable
 --

 Key: HADOOP-10584
 URL: https://issues.apache.org/jira/browse/HADOOP-10584
 Project: Hadoop Common
  Issue Type: Bug
  Components: ha
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Critical

 ActiveStandbyElector retries operations for a few times. If the ZK quorum 
 itself is down, it goes down and the daemons will have to be brought up 
 again. 
 Instead, it should log the fact that it is unable to talk to ZK, call 
 becomeStandby on its client, and continue to attempt connecting to ZK.


