[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiafu Jiang updated ZOOKEEPER-2930:
-----------------------------------
    Description: 
I deployed a ZooKeeper cluster with three nodes:

ofs_zk1:20.10.11.101, 30.10.11.101
ofs_zk2:20.10.11.102, 30.10.11.102
ofs_zk3:20.10.11.103, 30.10.11.103

I shut down the network interfaces of ofs_zk2 with the command "ifdown eth0 eth1".

A new Leader is supposed to be elected within a few seconds, but in fact ofs_zk1 and ofs_zk3 just keep holding elections again and again, and neither of them ever becomes the new Leader.

I changed the log level to DEBUG (the default is INFO) and restarted the ZooKeeper servers on ofs_zk1 and ofs_zk3, but that did not fix the problem.

I read the logs and the ZooKeeper source code, and I think I have found the cause.

When the potential leader (say ofs_zk3) begins the election (FastLeaderElection.lookForLeader()), it sends notifications to all the servers.
If it receives no notification within a timeout, it resends the notifications and doubles the timeout. This repeats until a notification is received or the timeout reaches a maximum value.
FastLeaderElection.sendNotifications() merely puts the notification messages into a queue and returns; the WorkerSender thread is responsible for actually sending them.
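
For reference, the retry loop in lookForLeader() looks roughly like this (a condensed sketch of the 3.4 branch; vote handling and error paths are omitted):

{code:java}
// Condensed sketch of FastLeaderElection.lookForLeader() (3.4 branch).
int notTimeout = finalizeWait; // starts small (200 ms)

while ((self.getPeerState() == ServerState.LOOKING) && (!stop)) {
    // Wait for a notification from any peer, up to the current timeout.
    Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);

    if (n == null) {
        // Nothing arrived: re-enqueue notifications for ALL peers ...
        if (manager.haveDelivered()) {
            sendNotifications();
        } else {
            manager.connectAll();
        }
        // ... and double the timeout, capped at maxNotificationInterval (60 s).
        int tmpTimeOut = notTimeout * 2;
        notTimeout = (tmpTimeOut < maxNotificationInterval ?
                tmpTimeOut : maxNotificationInterval);
    } else {
        // ... process the vote and try to reach a decision ...
    }
}
{code}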

The WorkerSender simply processes the notifications one by one, passing each of them to QuorumCnxManager. Here comes the problem: QuorumCnxManager.toSend() blocks for a long time when a notification is sent to ofs_zk2 (whose network is down), so the notifications that belong to ofs_zk1 are blocked behind it for a long time as well. The notifications repeated by FastLeaderElection.sendNotifications() just make things worse.

Here is the related source code:

{code:java}
    public void toSend(Long sid, ByteBuffer b) {
        /*
         * If sending message to myself, then simply enqueue it (loopback).
         */
        if (this.mySid == sid) {
            b.position(0);
            addToRecvQueue(new Message(b.duplicate(), sid));
            /*
             * Otherwise send to the corresponding thread to send.
             */
        } else {
            /*
             * Start a new connection if doesn't have one already.
             */
            ArrayBlockingQueue<ByteBuffer> bq =
                    new ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY);
            ArrayBlockingQueue<ByteBuffer> bqExisting =
                    queueSendMap.putIfAbsent(sid, bq);
            if (bqExisting != null) {
                addToSendQueue(bqExisting, b);
            } else {
                addToSendQueue(bq, b);
            }

            // This may block!!!
            connectOne(sid);
        }
    }
{code}
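
The blocking happens inside connectOne(): it opens the connection synchronously on the caller's thread, which here is the single WorkerSender thread. A condensed sketch of QuorumCnxManager.connectOne() in the 3.4 branch (cnxTO is the connection timeout field, 5000 ms by default):

{code:java}
// Condensed sketch of QuorumCnxManager.connectOne() (3.4 branch).
synchronized void connectOne(long sid) {
    if (senderWorkerMap.get(sid) == null) {
        InetSocketAddress electionAddr = self.getView().get(sid).electionAddr;
        try {
            Socket sock = new Socket();
            setSockOpts(sock);
            // Synchronous connect on the WorkerSender thread: when the
            // peer's network is down, every attempt can block for up to
            // cnxTO (5000 ms by default), stalling the messages queued
            // for all the other peers behind it.
            sock.connect(electionAddr, cnxTO);
            initiateConnection(sock, sid);
        } catch (IOException e) {
            LOG.warn("Cannot open channel to " + sid, e);
        }
    }
}
{code}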

Therefore, when ofs_zk3 believes it is the leader, it starts waiting for the epoch ACK, but in fact ofs_zk1 never receives the notification announcing ofs_zk3 as the leader, because ofs_zk3 has not actually sent it (the notification may still be sitting in the send queue of the WorkerSender). In the end, the potential leader ofs_zk3 fails to receive the epoch ACK within the timeout, so it gives up leadership and begins a new election.
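
The leader-side wait is bounded by initLimit * tickTime; roughly (a condensed sketch of Leader.getEpochToPropose() in the 3.4 branch):

{code:java}
// Condensed sketch of the epoch wait in Leader.getEpochToPropose()
// (3.4 branch): the prospective leader waits for a quorum of followers
// to report their accepted epoch, for at most initLimit * tickTime ms.
synchronized (connectingFollowers) {
    long start = System.currentTimeMillis();
    long cur = start;
    long end = start + self.getInitLimit() * self.getTickTime();
    while (waitingForNewEpoch && cur < end) {
        connectingFollowers.wait(end - cur);
        cur = System.currentTimeMillis();
    }
    if (waitingForNewEpoch) {
        // No quorum within the timeout: the prospective leader gives up
        // and FastLeaderElection starts over.
        throw new InterruptedException(
                "Timeout while waiting for epoch from quorum");
    }
}
{code}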

The log files of ofs_zk1 and ofs_zk3 are attached.


> Leader cannot be elected due to network timeout of some members.
> ----------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2930
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2930
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.4.10
>         Environment: Java 8
> ZooKeeper 3.4.11 (from GitHub)
> CentOS 6.5
>            Reporter: Jiafu Jiang
>            Priority: Major
>         Attachments: zoo.cfg, zookeeper1.log, zookeeper2.log
>


