[jira] [Commented] (ZOOKEEPER-1678) Server fails to join quorum when a peer is unreachable (5 ZK server setup)

Flavio Junqueira (JIRA) Tue, 09 Jul 2013 02:44:15 -0700

    [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13703093#comment-13703093
 ]


Flavio Junqueira commented on ZOOKEEPER-1678:
---------------------------------------------

[~juliolopez] FLE pushes notifications to other servers, but it could happen 
that we start a server and there is no one else around, so instead of sending 
an overwhelming number of messages, the server backs off and caps at 60 seconds, 
as you say, if I remember correctly. I don't mind making that cap value 
configurable if it helps with your case. 
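
For reference, the back-off described above could be sketched roughly as 
follows. This is only an illustration, not the actual FLE code: the class 
name, the method name, and the 200 ms starting interval are assumptions made 
for the example; only the 60-second cap comes from the discussion. The cap is 
a constructor argument here to show what a configurable value might look like.

```java
/**
 * Hypothetical helper (not a ZooKeeper class): doubles the wait between
 * notification rounds until it reaches a cap, then stays at the cap.
 */
final class NotificationBackoff {
    private final int capMs;   // upper bound on the wait, e.g. 60000 ms
    private int currentMs;     // interval that the next call will return

    NotificationBackoff(int initialMs, int capMs) {
        if (initialMs <= 0 || capMs < initialMs) {
            throw new IllegalArgumentException("need 0 < initialMs <= capMs");
        }
        this.currentMs = initialMs;
        this.capMs = capMs;
    }

    /** Returns the interval to wait now and doubles it (up to the cap) for next time. */
    int nextIntervalMs() {
        int interval = currentMs;
        // Use long arithmetic so doubling cannot overflow before the cap applies.
        long doubled = (long) currentMs * 2;
        currentMs = (int) Math.min(doubled, (long) capMs);
        return interval;
    }
}
```

With an assumed 200 ms start and a 60000 ms cap, successive calls return 200, 
400, 800, and so on, reaching 51200 on the ninth call and 60000 from the 
tenth call onward.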

I'm not convinced that the randomization of notifications will make any 
difference, but perhaps I'm not understanding your proposal correctly.

I think you're referring to SendWorker in QCM, is that right? We do have one per 
server, no?

As [~abranzyck] pointed out, it sounds like the problem reported here could be 
solved by the fix of ZOOKEEPER-900. Could you guys make sure that solution 
works here and possibly provide an updated patch?
                
> Server fails to join quorum when a peer is unreachable (5 ZK server setup)
> --------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1678
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1678
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.4.5
>         Environment: java version "1.6.0_32"
> Java(TM) SE Runtime Environment (build 1.6.0_32-b05)
> Java HotSpot(TM) 64-Bit Server VM (build 20.7-b02, mixed mode)
> Distributor ID:       Ubuntu
> Description:  Ubuntu 12.04.1 LTS
> Release:      12.04
> Codename:     precise
> uname -a Linux ha-vani3-0 3.2.0-23-virtual #36-Ubuntu SMP Tue Apr 10 22:29:03 
> UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: Julio Lopez
>
> In a 5-node ZK cluster setup, in the following state:
> * 1 host is down / not reachable.
> * 4 hosts are up.
> * 3 ZK servers are in quorum.
> * a 4th ZK server was restarted and is trying to re-join the quorum.
> The 4th server is not able to rejoin the quorum because the connection to the 
> unreachable host is not established and apparently takes too long to time out.
> Stack traces and additional information coming.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
