[ https://issues.apache.org/jira/browse/ZOOKEEPER-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799388#comment-17799388 ]
zhanglu153 edited comment on ZOOKEEPER-4781 at 12/21/23 11:56 AM: ------------------------------------------------------------------ In ZOOKEEPER-1291 AcceptedEpoch not updated at leader before it proposes the epoch to followers - ASF JIRA (apache.org), update the acceptedEpoch file in the org.apache.zookeeper.server.quorum.Leader#getEpochToPropose method. However, when encountering exceptions such as a full disk, writing to the acceptedEpoch file will fail. There is an issue in the code where the value of waitingForNewEpoch will be set to false regardless of whether the write to the acceptedEpoch file is successful or not, resulting in no attempt to update the acceptedEpoch file on the leader server again. If the value of acceptedEpoch is less than the value of currentEpoch after the leader server restarts, the zk server will never be able to start. {code:java} if (connectingFollowers.contains(self.getId()) && verifier.containsQuorum(connectingFollowers)) { self.setAcceptedEpoch(epoch); waitingForNewEpoch = false; connectingFollowers.notifyAll(); }{code} To solve this problem, we should first write the acceptedEpoch file and then update the value of waitingForNewEpoch. A patch [^0001-zk-leader-acceptedEpoch-zk-zk.patch] has been submitted to resolve this issue. was (Author: JIRAUSER298392): In [ZOOKEEPER-1291] AcceptedEpoch not updated at leader before it proposes the epoch to followers - ASF JIRA (apache.org), update the acceptedEpoch file in the org.apache.zookeeper.server.quorum.Leader#getEpochToPropose method. However, when encountering exceptions such as a full disk, writing to the acceptedEpoch file will fail. There is an issue in the code where the value of waitingForNewEpoch will be set to false regardless of whether the write to the acceptedEpoch file is successful or not, resulting in no attempt to update the acceptedEpoch file on the leader server again. If the value of acceptedEpoch is less than the value of currentEpoch after the leader server restarts, the zk server will never be able to start. {code:java} if (connectingFollowers.contains(self.getId()) && verifier.containsQuorum(connectingFollowers)) { self.setAcceptedEpoch(epoch); waitingForNewEpoch = false; connectingFollowers.notifyAll(); }{code} To solve this problem, we should first write the acceptedEpoch file and then update the value of waitingForNewEpoch. A patch has been submitted to resolve this issue. > ZooKeeper not starting because the accepted epoch is less than the current > epoch. > --------------------------------------------------------------------------------- > > Key: ZOOKEEPER-4781 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4781 > Project: ZooKeeper > Issue Type: Bug > Components: server > Affects Versions: 3.5.10, 3.4.14, 3.6.4, 3.7.2, 3.8.3, 3.9.1 > Reporter: zhanglu153 > Priority: Major > Attachments: 0001-zk-leader-acceptedEpoch-zk-zk.patch > > > This issue occurred in our abnormal testing environment, where the disk was > injected with anomalies and frequently filled up. > The the scenario is as follows: > # Configure three node ZooKeeper cluster, lets say nodes are A, B and C. > # Start the cluster, and node C becomes the leader. > # The disk of Node C is injected with a full disk exception. > # Node C called the org.apache.zookeeper.server.quorum.Leader#lead method. > {code:java} > void lead() throws IOException, InterruptedException { > self.end_fle = Time.currentElapsedTime(); > long electionTimeTaken = self.end_fle - self.start_fle; > self.setElectionTimeTaken(electionTimeTaken); > LOG.info("LEADING - LEADER ELECTION TOOK - {}", electionTimeTaken); > self.start_fle = 0; > self.end_fle = 0; > zk.registerJMX(new LeaderBean(this, zk), self.jmxLocalPeerBean); > try { > self.tick.set(0); > zk.loadData(); > > leaderStateSummary = new StateSummary(self.getCurrentEpoch(), > zk.getLastProcessedZxid()); > // Start thread that waits for connection requests from > // new followers. > cnxAcceptor = new LearnerCnxAcceptor(); > cnxAcceptor.start(); > > readyToStart = true; > long epoch = getEpochToPropose(self.getId(), self.getAcceptedEpoch()); > > zk.setZxid(ZxidUtils.makeZxid(epoch, 0)); > > synchronized(this){ > lastProposed = zk.getZxid(); > } > > newLeaderProposal.packet = new QuorumPacket(NEWLEADER, zk.getZxid(), > null, null); > if ((newLeaderProposal.packet.getZxid() & 0xffffffffL) != 0) { > LOG.info("NEWLEADER proposal has Zxid of " > + Long.toHexString(newLeaderProposal.packet.getZxid())); > } > > waitForEpochAck(self.getId(), leaderStateSummary); > self.setCurrentEpoch(epoch); > ... {code} > # Node C, as the leader, will start the LearnerCnxAcceptor thread, and then > call the org.apache.zookeeper.server.quorum.Leader#getEpochToPropose method. > At this time, the value of waitingForNewEpoch is true, and the size of > connectingFollowers is not greater than n/2. Node C directly calls > connectingFollowers.wait to wait. The maximum waiting time is > self.getInitLimit()*self.getTickTime() ms. > {code:java} > public long getEpochToPropose(long sid, long lastAcceptedEpoch) throws > InterruptedException, IOException { > synchronized(connectingFollowers) { > if (!waitingForNewEpoch) { > return epoch; > } > if (lastAcceptedEpoch >= epoch) { > epoch = lastAcceptedEpoch+1; > } > if (isParticipant(sid)) { > connectingFollowers.add(sid); > } > QuorumVerifier verifier = self.getQuorumVerifier(); > if (connectingFollowers.contains(self.getId()) && > > verifier.containsQuorum(connectingFollowers)) { > self.setAcceptedEpoch(epoch); > waitingForNewEpoch = false; > connectingFollowers.notifyAll(); > } else { > long start = Time.currentElapsedTime(); > long cur = start; > long end = start + self.getInitLimit()*self.getTickTime(); > while(waitingForNewEpoch && cur < end) { > connectingFollowers.wait(end - cur); > cur = Time.currentElapsedTime(); > } > if (waitingForNewEpoch) { > throw new InterruptedException("Timeout while waiting for > epoch from quorum"); > } > } > return epoch; > } > } {code} > # Node B connects to the 2888 communication port of node C and starts a new > LeanerHandler thread. > # Node A connects to the 2888 communication port of node C and starts a new > LeanerHandler thread. > # After node B connects to node C, call the > org.apache.zookeeper.server.quorum.Leader#getEpochToPropose method in the > LearnerHandler thread.At this point, the value of waitingForNewEpoch is true, > and the size of connectingFollowers is greater than n/2. Then, set the value > of waitingForNewEpoch to false. Due to the disk of node C being full, calling > setAcceptedEpoch to write the acceptedEpoch value failed with an IO > exception. Node C fails to update the acceptedEpoch file and did not > successfully call the connectingFollowers.notifyAll() method. This will cause > node C to wait at connectingFollowers.wait, with a maximum wait of > self.getInitLimit()*self.getTickTime() ms. Due to an IO exception thrown, the > socket connection between node B and node C was disconnected, and the > LeanerHandler thread exited. > {code:java} > 2023-12-15 15:08:35,572 [myid:3] - ERROR > [LearnerHandler-/192.168.174.227:41696:LearnerHandler@648] - Unexpected > exception causing shutdown while sock still open2023-12-15 15:08:35,572 > [myid:3] - ERROR [LearnerHandler-/192.168.174.227:41696:LearnerHandler@648] - > Unexpected exception causing shutdown while sock still > openjava.io.FileNotFoundException: > /cloud/data/zookeeper/data/version-2/acceptedEpoch.tmp (No space left on > device) at java.io.FileOutputStream.open0(Native Method) at > java.io.FileOutputStream.open(FileOutputStream.java:270) at > java.io.FileOutputStream.<init>(FileOutputStream.java:213) at > java.io.FileOutputStream.<init>(FileOutputStream.java:162) at > org.apache.zookeeper.common.AtomicFileOutputStream.<init>(AtomicFileOutputStream.java:59) > at > org.apache.zookeeper.server.quorum.QuorumPeer.writeLongToFile(QuorumPeer.java:1393) > at > org.apache.zookeeper.server.quorum.QuorumPeer.setAcceptedEpoch(QuorumPeer.java:1437) > at > org.apache.zookeeper.server.quorum.Leader.getEpochToPropose(Leader.java:885) > at > org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:358) > {code} > # After node A connects to node C, the > org.apache.zookeeper.server.quorum.Leader#getEpochToPropose method is called > in the LearnerHandler thread. At this time, the value of waitingForNewEpoch > is false, and epoch is returned directly. > # Node B reconnects to the 2888 cluster communication port of node C and > starts a new LeanerHandler thread. Call the > org.apache.zookeeper.server.quorum.Leader#getEpochToPropose method in the > LearnerHandler thread. At this time, the value of waitingForNewEpoch is > false, and epoch is returned directly. > # After node A connects to node C, the LeanerHandler thread will then call > the org.apache.zookeeper.server.quorum.Leader#waitForEpochAck method. At this > point, selectingFollowers does not include node C, Call the > selectingFollowers.wait method to wait. > # After node B connects to node C, the LeanerHandler thread will then call > the org.apache.zookeeper.server.quorum.Leader#waitForEpochAck method. At this > point, selectingFollowers does not include node C, Call the > selectingFollowers.wait method to wait. > {code:java} > public void waitForEpochAck(long id, StateSummary ss) throws IOException, > InterruptedException { > synchronized(electingFollowers) { > if (electionFinished) { > return; > } > if (ss.getCurrentEpoch() != -1) { > if (ss.isMoreRecentThan(leaderStateSummary)) { > throw new IOException("Follower is ahead of the leader, > leader summary: " > + > leaderStateSummary.getCurrentEpoch() > + " (current epoch), " > + > leaderStateSummary.getLastZxid() > + " (last zxid)"); > } > if (isParticipant(id)) { > electingFollowers.add(id); > } > } > QuorumVerifier verifier = self.getQuorumVerifier(); > if (electingFollowers.contains(self.getId()) && > verifier.containsQuorum(electingFollowers)) { > electionFinished = true; > electingFollowers.notifyAll(); > } else { > long start = Time.currentElapsedTime(); > long cur = start; > long end = start + self.getInitLimit()*self.getTickTime(); > while(!electionFinished && cur < end) { > electingFollowers.wait(end - cur); > cur = Time.currentElapsedTime(); > } > if (!electionFinished) { > throw new InterruptedException("Timeout while waiting for > epoch to be acked by quorum"); > } > } > } > } {code} > # Node C waits for self.getInitLimit()*self.getTickTime() ms in the > connectingFollowers.wait method. However, since the value of > waitingForNewEpoch is false, it does not throw a timeout exception and > returns the epoch directly. > # Node C calls the org.apache.zookeeper.server.quorum.Leader#waitForEpochAck > method. At this time, selectingFollowers contains node C and the size of > selectingFollowers is greater than n/2. Then call the > selectingFollowers.notifyAll() method to release the lock. > # The disk space of node C has been partially released, so subsequent calls > to setCurrentEpoch have successfully updated the currentEpoch file. > # The cluster provides services normally for a period of time. > # Stop C. > # Start C, bellow exception with message "The accepted epoch, c is less than > the current epoch, d" is thrown. > {code:java} > 2023-12-15 15:57:00,051 [myid:3] - ERROR [main:QuorumPeer@698] - Unable to > load database on disk > java.io.IOException: The accepted epoch, c is less than the current epoch, d > at > org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:695) > at > org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:636) > at > org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170) > at > org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114) > at > org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81) > 2023-12-15 15:57:00,132 [myid:3] - ERROR [main:QuorumPeerMain@92] - > Unexpected exception, exiting abnormally > java.lang.RuntimeException: Unable to run quorum server > at > org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:699) > at > org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:636) > at > org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170) > at > org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114) > at > org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81) > Caused by: java.io.IOException: The accepted epoch, c is less than the > current epoch, d > at > org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:695) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)