Hi Flavio, I have 3 sets of logs and they all seem to indicate two problems on the misbehaving follower:

Problem 1: Expected zxid is incorrect

=0 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x300000002 expected 0x1
=0 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x300000002 expected 0x1
=2495 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x400000001 expected 0x1
=2495 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x400000001 expected 0x1
=191617 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x500000001 expected 0x1
=191617 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x500000001 expected 0x1
=0 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x600000001 expected 0x1
=0 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x600000001 expected 0x1
=245016 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x700000001 expected 0x1
=245016 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x700000001 expected 0x1

Note that the expected zxid is always 0x1 (is lastQueued always 0?)

Problem 2: While joining the cluster, the expected epoch is 1 higher than seen earlier

=14991 [QuorumPeer:/0.0.0.0:2181] FATAL org.apache.zookeeper.server.quorum.Learner - Leader epoch 7 is less than our epoch 8

-Vishal

On Fri, Jun 18, 2010 at 6:33 PM, Vishal K <vishalm...@gmail.com> wrote:

> Nevermind. I am on the wrong track. Flavio's earlier mail did clarify that
> the follower received the epoch before restart.
>
>
> On Fri, Jun 18, 2010 at 6:20 PM, Vishal K <vishalm...@gmail.com> wrote:
>
>> I might be wrong here, but let me try to chip in my few cents.
>>
>> I think the problem is in LearnerHandler.java at the leader of this
>> Follower.
>>
>>     /* see what other packets from the proposal
>>      * and tobeapplied queues need to be sent
>>      * and then decide if we can just send a DIFF
>>      * or we actually need to send the whole snapshot
>>      */
>>     long leaderLastZxid = leader.startForwarding(this, updates);
>>     ---> this leaderLastZxid returned is probably incorrect.
>>     // a special case when both the ids are the same
>>     if (peerLastZxid == leaderLastZxid) {
>>         packetToSend = Leader.DIFF;
>>         zxidToSend = leaderLastZxid;
>>     }
>>
>>     QuorumPacket newLeaderQP = new QuorumPacket(Leader.NEWLEADER,
>>             leaderLastZxid, null, null);
>>     oa.writeRecord(newLeaderQP, "packet");
>>     bufferedOutput.flush();
>>
>>
>>
>> On Fri, Jun 18, 2010 at 4:49 PM, Flavio Paiva Junqueira (JIRA) <
>> j...@apache.org> wrote:
>>
>>>
>>> [
>>> https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880320#action_12880320]
>>>
>>> Flavio Paiva Junqueira commented on ZOOKEEPER-335:
>>> --------------------------------------------------
>>>
>>> Guys, I don't see enough information in these logs to determine what's
>>> going on. Let me tell you what I'm seeing so that perhaps other folks can
>>> help me out here.
>>>
>>> One part of the log that is suspicious is this one:
>>>
>>> {noformat}
>>> =6693 [QuorumPeer:/0.0.0.0:2181] WARN
>>> org.apache.zookeeper.server.quorum.Learner - Got zxid 0x300000001 expected
>>> 0x1
>>> =6693 [QuorumPeer:/0.0.0.0:2181] WARN
>>> org.apache.zookeeper.server.quorum.Learner - Got zxid 0x300000001 expected
>>> 0x1
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor30]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor27]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor22]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor23]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor18]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor20]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor19]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor31]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor21]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor26]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor25]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor33]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor29]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor28]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor24]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor32]
>>>
>>> ************* NODE RESTARTED HERE **********************
>>> {noformat}
>>>
>>> Before being restarted, the bad node receives a proposal with zxid <3,1>
>>> and it expects <0,1>. Next in the logs after being restarted, I can see that
>>> it is complaining that it has epoch 4 and the leader 3. Something strange
>>> apparently happened during the restart.
>>> It also seems to be the case that the node was able to talk to the
>>> others (first entries in the log before the excerpt above).
>>>
>>> Do you guys see anything I'm overlooking?
>>>
>>> > zookeeper servers should commit the new leader txn to their logs.
>>> > -----------------------------------------------------------------
>>> >
>>> >                 Key: ZOOKEEPER-335
>>> >                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-335
>>> >             Project: Zookeeper
>>> >          Issue Type: Bug
>>> >          Components: server
>>> >    Affects Versions: 3.1.0
>>> >            Reporter: Mahadev konar
>>> >            Assignee: Mahadev konar
>>> >            Priority: Blocker
>>> >             Fix For: 3.4.0
>>> >
>>> >         Attachments: zk.log.gz, zklogs.tar.gz
>>> >
>>> >
>>> > currently the zookeeper followers do not commit the new leader
>>> > election. This will cause problems in failure scenarios with a follower
>>> > acking to the same leader txn id twice, which might be two different
>>> > intermittent leaders and allowing them to propose two different txns of
>>> > the same zxid.
>>>
>>> --
>>> This message is automatically generated by JIRA.
>>> -
>>> You can reply to this email to add a comment to the issue online.
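
P.S. For anyone following the zxids in this thread: as far as I understand, a zxid is a 64-bit value with the epoch in the high 32 bits and a counter in the low 32 bits, which is how 0x300000001 reads as <3,1> in Flavio's comment. Below is a minimal sketch for decoding the values from the logs. The class and method names (ZxidDecode, epochOf, counterOf, format) are made up for illustration and are not ZooKeeper API.

```java
// Illustrative only: ZxidDecode and its methods are hypothetical helpers,
// not part of ZooKeeper. They assume the <epoch, counter> layout described
// above (epoch in the high 32 bits, counter in the low 32 bits).
class ZxidDecode {

    // Epoch: the upper 32 bits of the zxid.
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    // Counter: the lower 32 bits of the zxid.
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    // Render a zxid in the <epoch,counter> notation used in the JIRA comment.
    static String format(long zxid) {
        return "<" + epochOf(zxid) + "," + counterOf(zxid) + ">";
    }

    public static void main(String[] args) {
        // "Got zxid" values copied from the WARN lines in this mail.
        long[] got = {
            0x300000002L, 0x400000001L, 0x500000001L,
            0x600000001L, 0x700000001L
        };
        long expected = 0x1L; // the "expected" value in every WARN line

        for (long zxid : got) {
            System.out.println("got 0x" + Long.toHexString(zxid)
                    + " = " + format(zxid)
                    + ", expected " + format(expected));
        }
    }
}
```

Read this way, every "expected 0x1" in Problem 1 is <0,1>: the follower is still expecting the first transaction of epoch 0 while the leader is sending proposals from epochs 3 through 7.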