[
https://issues.apache.org/jira/browse/ZOOKEEPER-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Camille Fournier resolved ZOOKEEPER-1118.
-----------------------------------------
Resolution: Duplicate
> Inconsistent data after server crashes several times
> ----------------------------------------------------
>
> Key: ZOOKEEPER-1118
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1118
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum
> Affects Versions: 3.3.2
> Environment: Redhat RHEL5
> Reporter: Kurt Young
> Priority: Critical
>
> I think there is a bug when Follower try to sync data with Leader.
> Assume there are some operations committed during one server had been
> crashed. When the server restart, it will receive a NEWLEADER packet which
> include the last zxid of leader and the server will set its own
> lastProcessZxid to the leader's.
> {code:title=Follower.java|borderStyle=solid}
> void followLeader() throws InterruptedException {
> fzk.registerJMX(new FollowerBean(this, zk), self.jmxLocalPeerBean);
> try {
> InetSocketAddress addr = findLeader();
> try {
> connectToLeader(addr);
> long newLeaderZxid = registerWithLeader(Leader.FOLLOWERINFO); //
> get the last zxid from leader
> //check to see if the leader zxid is lower than ours
>
> //this should never happen but is just a safety check
>
> long lastLoggedZxid = self.getLastLoggedZxid();
> if ((newLeaderZxid >> 32L) < (lastLoggedZxid >> 32L)) {
> LOG.fatal("Leader epoch " + Long.toHexString(newLeaderZxid >>
> 32L)
> + " is less than our epoch " +
> Long.toHexString(lastLoggedZxid >> 32L));
> throw new IOException("Error: Epoch of leader is lower");
> }
> syncWithLeader(newLeaderZxid); // set its own lastProcessZxid
> to leader's last zxid
> {code}
> Then, some COMMIT packets will be received by the server in order to sync the
> data with leader. And then, the leader will send an UPTODATE packet to server
> to take a snapshot.
> {code:title=Follower.java|borderStyle=solid}
> protected void processPacket(QuorumPacket qp) throws IOException{
> switch (qp.getType()) {
> case Leader.PING:
> ping(qp);
> break;
> case Leader.PROPOSAL:
> TxnHeader hdr = new TxnHeader();
> BinaryInputArchive ia = BinaryInputArchive
> .getArchive(new ByteArrayInputStream(qp.getData()));
> Record txn = SerializeUtils.deserializeTxn(ia, hdr);
> if (hdr.getZxid() != lastQueued + 1) {
> LOG.warn("Got zxid 0x"
> + Long.toHexString(hdr.getZxid())
> + " expected 0x"
> + Long.toHexString(lastQueued + 1));
> }
> lastQueued = hdr.getZxid();
> fzk.logRequest(hdr, txn);
> break;
> case Leader.COMMIT:
> fzk.commit(qp.getZxid());
> break;
> case Leader.UPTODATE:
> fzk.takeSnapshot();
> self.cnxnFactory.setZooKeeperServer(fzk);
> break;
> case Leader.REVALIDATE:
> revalidate(qp);
> break;
> case Leader.SYNC:
> fzk.sync();
> break;
> }
> }
> {code}
> Notice the different way the Follower treat the COMMIT and the UPTODATE
> packets. When receives a COMMIT packet, the follower will give this to a
> processor to deal with. But if receives a UPTODATE packet, the follower will
> take a snapshot immediately. So it is possible that the server will take
> snapshot before it commits all the operations it missed. Then if the server
> crashed again and recovered, it will recover its data from the snapshot, so
> the date inconsistent with the leader now, but its last zxid is the same.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira