[
https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888834#action_12888834
]
Flavio Paiva Junqueira commented on ZOOKEEPER-790:
--------------------------------------------------
Thank you both for all the information. I have been able to reproduce and find
the source of the bug, but I don't have a patch yet. The problem is deeper than
I thought originally. Let me show you what is going on:
I'm including an excerpt of logs from two runs: one good and one bad. The first
run is good (see the excerpt below). I have killed the follower and restarted
it as Vishal suggested. When it comes back, it declares itself as leader, also
as Vishal and Travis observed. However, different from what Vishal and Travis
observed, it drops leadership and follows happily the leader right after.
{noformat}
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:quorump...@654] - LEADING
INFO - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 3
(n.leader), 0 (n.zxid), 1 (n.round), LEADING (n.state), 3 (n.sid), LOOKING (my
state)
INFO - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 3
(n.leader), 0 (n.zxid), 1 (n.round), LEADING (n.state), 3 (n.sid), LEADING (my
state)
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:lea...@54] - TCP NoDelay set to: true
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:zookeeper.version=3.4.0--1, built on 07/15/2010 10:36 GMT
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:host.name=XXXXXXXXXX.com
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:java.version=1.6.0_04
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:java.vendor=Sun Microsystems Inc.
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:java.home=/usr/java/jdk1.6.0_04/jre
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:java.class.path=.XXXXXXXXX
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:java.library.path= XXXXXXXXX
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:java.io.tmpdir=/tmp
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:java.compiler=<NA>
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:os.name=Linux
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:os.arch=amd64
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:os.version=2.6.18-53.1.21.el5
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:user.name=XXXXXXXXX
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:user.home=XXXXXXXXX
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:user.dir=XXXXXXXXX
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:zookeeperser...@151] - Created
server with tickTime 3000 minSessionTimeout 6000 maxSessionTimeout 60000
datadir /XXXXXXXXX/zookeeper/version-2 snapdir /XXXXXXXX/zookeeper/version-2
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:files...@82] - Reading snapshot
/XXXXXXXX/zookeeper/version-2/snapshot.100113340
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:files...@82] - Reading snapshot
/XXXXXXXX/zookeeper/version-2/snapshot.100113340
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:filetxnsnap...@208] - Snapshotting:
10011f748
INFO - [SessionTracker:zookeeperser...@315] - Expiring session
0x229d6a9e0ca0000, timeout of 10000ms exceeded
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:lea...@394] - Shutdown called
java.lang.Exception: shutdown Leader! reason: Waiting for a quorum of
followers, only synced with: 2:
at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:394)
at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:317)
at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:657)
INFO - [Thread-10:leader$learnercnxaccep...@243] - exception while shutting
down acceptor: java.net.SocketException: Socket closed
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:finalrequestproces...@378] -
shutdown of request processor complete
INFO - [SyncThread:2:syncrequestproces...@151] - SyncRequestProcessor exited!
INFO - [CommitProcessor:2:commitproces...@148] - CommitProcessor exited loop!
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:quorump...@620] - LOOKING
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:files...@82] - Reading snapshot
/XXXXXXXXX/zookeeper/version-2/snapshot.10011f748
INFO - [SessionTracker:sessiontrackeri...@158] - SessionTrackerImpl exited
loop!
INFO - [ProcessThread:-1:preprequestproces...@385] - Processed session
termination for sessionid: 0x229d6a9e0ca0000
ERROR - [ProcessThread:-1:nioservercnxn$factor...@87] - Thread
Thread[ProcessThread:-1,5,main] died
java.lang.NullPointerException
at
org.apache.zookeeper.server.quorum.ProposalRequestProcessor.processRequest(ProposalRequestProcessor.java:71)
at
org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:435)
at
org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:fastleaderelect...@663] - New
election. My id = 2, Proposed zxid = 4296144712
INFO - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 2
(n.leader), 4296144712 (n.zxid), 2 (n.round), LOOKING (n.state), 2 (n.sid),
LOOKING (my state)
INFO - [WorkerSender Thread:quorumcnxmana...@162] - Have smaller server
identifier, so dropping the connection: (3, 2)
INFO - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 3
(n.leader), 0 (n.zxid), 1 (n.round), LEADING (n.state), 3 (n.sid), LOOKING (my
state)
INFO - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 3
(n.leader), 0 (n.zxid), 1 (n.round), FOLLOWING (n.state), 1 (n.sid), LOOKING
(my state)
INFO - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 3
(n.leader), 0 (n.zxid), 1 (n.round), FOLLOWING (n.state), 1 (n.sid), LOOKING
(my state)
INFO - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 3
(n.leader), 0 (n.zxid), 1 (n.round), FOLLOWING (n.state), 1 (n.sid), LOOKING
(my state)
INFO - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 3
(n.leader), 0 (n.zxid), 1 (n.round), LEADING (n.state), 3 (n.sid), LOOKING (my
state)
INFO - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 3
(n.leader), 0 (n.zxid), 1 (n.round), LEADING (n.state), 3 (n.sid), LOOKING (my
state)
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:quorump...@642] - FOLLOWING
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:lear...@72] - TCP NoDelay set to:
true
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:zookeeperser...@151] - Created
server with tickTime 3000 minSessionTimeout 6000 maxSessionTimeout 60000
datadir /XXXXXXXXX/zookeeper/version-2 snapdir /XXXXXXXXX/zookeeper/version-2
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:lear...@285] - Getting a snapshot
from leader
FATAL - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:lear...@314] - Setting leader epoch 1
WARN - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:follo...@116] - Got zxid 0x100162711
expected 0x1
WARN - [SyncThread:2:filetxn...@196] - Creating new log file: 100162711
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:filetxnsnap...@208] - Snapshotting:
100164c43
{noformat}
Now, the following run is bad. I followed the same procedure, but this time we
get the epoch problem. Looking carefully at the logs, you'll see that the main
difference is that this time the zookeeper server created a new log file:
log.200000001. This is bad because the zookeeper server will use the zxid in
the latest txn log file to determine its last epoch. Given that the current
leader is in epoch 1, it will correctly refuse to talk to the leader as we
observe below.
{noformat}
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:quorump...@654] - LEADING
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:lea...@54] - TCP NoDelay set to: true
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:zookeeper.version=3.4.0--1, built on 07/15/2010 10:36 GMT
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:host.name=XXXXXXXX.com
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:java.version=1.6.0_05
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:java.vendor=Sun Microsystems Inc.
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:java.home=/usr/java/jdk1.6.0_05/jre
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:java.class.path=XXXXXXXXX
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:java.library.path=XXXXXXXXX
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:java.io.tmpdir=/tmp
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:java.compiler=<NA>
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:os.name=Linux
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:os.arch= amd64
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:os.version=2.6.18-92.1.13.el5.perfctr.2.6.36
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:user.name=XXXXXXXXX
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:user.home=XXXXXXXX
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server
environment:user.dir=XXXXXXXXX
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:zookeeperser...@151] - Created
server with tickTime 3000 minSessionTimeout 6000 maxSessionTimeout 60000
datadir /xxxxxxx/zookeeper/version-2 snapdir /XXXXXXX/zookeeper/version-2
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:files...@82] - Reading snapshot
/XXXXXXX/zookeeper/version-2/snapshot.1000469b4
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:files...@82] - Reading snapshot
/XXXXXXX/zookeeper/version-2/snapshot.1000469b4
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:filetxnsnap...@208] - Snapshotting:
100055cec
INFO - [SessionTracker:zookeeperser...@315] - Expiring session
0x129d6b61b5b0000, timeout of 10000ms exceeded
INFO - [ProcessThread:-1:preprequestproces...@385] - Processed session
termination for sessionid: 0x129d6b61b5b0000
WARN - [SyncThread:1:filetxn...@196] - Creating new log file: log.200000001
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:lea...@394] - Shutdown called
java.lang.Exception: shutdown Leader! reason: Waiting for a quorum of
followers, only synced with: 1:
at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:394)
at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:317)
at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:657)
INFO - [ProcessThread:-1:preprequestproces...@119] - PrepRequestProcessor
exited loop!
INFO - [CommitProcessor:1:commitproces...@148] - CommitProcessor exited loop!
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:finalrequestproces...@378] -
shutdown of request processor complete
INFO - [SyncThread:1:syncrequestproces...@151] - SyncRequestProcessor exited!
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:quorump...@620] - LOOKING
INFO - [Thread-8:leader$learnercnxaccep...@243] - exception while shutting
down acceptor: java.net.SocketException: Socket closed
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:files...@82] - Reading snapshot
/XXXXXXXX/zookeeper/version-2/snapshot.100055cec
INFO - [SessionTracker:sessiontrackeri...@158] - SessionTrackerImpl exited
loop!
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:fastleaderelect...@663] - New
election. My id = 1, Proposed zxid = 8589934593
INFO - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 1
(n.leader), 8589934593 (n.zxid), 2 (n.round), LOOKING (n.state), 1 (n.sid),
LOOKING (my state)
INFO - [WorkerSender Thread:quorumcnxmana...@162] - Have smaller server
identifier, so dropping the connection: (2, 1)
INFO - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 3
(n.leader), 0 (n.zxid), 1 (n.round), FOLLOWING (n.state), 2 (n.sid), LOOKING
(my state)
INFO - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 3
(n.leader), 0 (n.zxid), 1 (n.round), LEADING (n.state), 3 (n.sid), LOOKING (my
state)
INFO - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 3
(n.leader), 0 (n.zxid), 1 (n.round), FOLLOWING (n.state), 2 (n.sid), LOOKING
(my state)
INFO - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 3
(n.leader), 0 (n.zxid), 1 (n.round), FOLLOWING (n.state), 2 (n.sid), LOOKING
(my state)
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:quorump...@642] - FOLLOWING
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:lear...@72] - TCP NoDelay set to:
true
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:zookeeperser...@151] - Created
server with tickTime 3000 minSessionTimeout 6000 maxSessionTimeout 60000
datadir /xxxxxxx/zookeeper/version-2 snapdir /xxxxxxx/zookeeper/version-2
FATAL - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:follo...@71] - Leader epoch 1 is
less than our epoch 2
WARN - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:follo...@82] - Exception when
following the leader
java.io.IOException: Error: Epoch of leader is lower
at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
INFO - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:follo...@166] - shutdown called
java.lang.Exception: shutdown Follower
at
org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
{noformat}
Fixing it is not as simple as setting the current zxid to 200000001 after we
hear from a quorum because the request processor pipeline starts complaining
about an attempt to commit a future proposal.
> Last processed zxid set prematurely while establishing leadership
> -----------------------------------------------------------------
>
> Key: ZOOKEEPER-790
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-790
> Project: Zookeeper
> Issue Type: Bug
> Components: quorum
> Affects Versions: 3.3.1
> Reporter: Flavio Paiva Junqueira
> Assignee: Flavio Paiva Junqueira
> Priority: Blocker
> Fix For: 3.3.2, 3.4.0
>
> Attachments: ZOOKEEPER-790.patch, ZOOKEEPER-790.travis.log.bz2
>
>
> The leader code is setting the last processed zxid to the first of the new
> epoch even before connecting to a quorum of followers. Because the leader
> code sets this value before connecting to a quorum of followers
> (Leader.java:281) and the follower code throws an IOException
> (Follower.java:73) if the leader epoch is smaller, we have that when the
> false leader drops leadership and becomes a follower, it finds a smaller
> epoch and kills itself.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.