[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805844#comment-17805844
 ] 

Li Wang commented on ZOOKEEPER-4785:
------------------------------------

The issue can be fixed by persisting the uncommitted txns from the leader 
synchronously, so that the following order is guaranteed when processing 
the NEWLEADER msg in the Learner.syncWithLeader() method.

1. Persisting all the txns/proposals to disk
2. Writing current epoch to disk
3. Sending ACK of NEWLEADER to leader
4. Sending ACK of proposals to leader
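
The ordering above could be sketched roughly as follows. This is only an 
illustration, not the actual patch: the class and every method in it 
(persistSynchronously, writeCurrentEpoch, sendAck, onNewLeader) are 
hypothetical stand-ins for the real Learner/QuorumPeer internals, and the 
"disk" and "network" are just an in-memory event list so the order can be 
inspected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the proposed ordering; none of these names are
// real ZooKeeper APIs.
public class NewLeaderOrderingSketch {
    // Records the order in which durable writes and ACKs happen.
    final List<String> events = new ArrayList<>();

    // Stand-in for a blocking append + fsync of one proposal to the txn log.
    void persistSynchronously(long zxid) {
        events.add("persist:" + Long.toHexString(zxid));
    }

    // Stand-in for writing the currentEpoch file.
    void writeCurrentEpoch(long epoch) {
        events.add("epoch:" + epoch);
    }

    // Stand-in for sending an ACK packet to the leader.
    void sendAck(String what) {
        events.add("ack:" + what);
    }

    // Handles NEWLEADER in the order listed above.
    void onNewLeader(long newEpoch, List<Long> uncommittedZxids) {
        // 1. Persist every uncommitted txn/proposal before anything else.
        for (long zxid : uncommittedZxids) {
            persistSynchronously(zxid);
        }
        // 2. Only then record the new epoch on disk.
        writeCurrentEpoch(newEpoch);
        // 3. ACK the NEWLEADER message.
        sendAck("NEWLEADER");
        // 4. ACK the individual proposals.
        for (long zxid : uncommittedZxids) {
            sendAck("proposal:" + Long.toHexString(zxid));
        }
    }

    public static void main(String[] args) {
        NewLeaderOrderingSketch s = new NewLeaderOrderingSketch();
        // Made-up epoch and zxids, for illustration only.
        s.onNewLeader(2L, Arrays.asList(0x100000005L, 0x100000006L));
        for (String e : s.events) {
            System.out.println(e);
        }
    }
}
```

With this order, a crash at any point before step 2 leaves the follower 
with its old epoch, so it cannot win an election on epoch alone while 
missing txns.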




> Txn loss due to race condition when follower DIFF sync with leader
> ------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4785
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4785
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.8.0, 3.7.1, 3.8.1, 3.7.2, 3.8.2, 3.9.1
>            Reporter: Li Wang
>            Priority: Major
>
> We had a txn loss incident in production recently. After investigation, we 
> found it was caused by a race condition: the follower writes the current 
> epoch and sends the ACK_LD before successfully persisting all the txns from 
> the DIFF sync in the Learner.syncWithLeader() method.
> case Leader.NEWLEADER:
>         ...
>         // Problem: the current epoch is written here, before the
>         // uncommitted txns below have reached the disk.
>         self.setCurrentEpoch(newEpoch);
>         writeToTxnLog = true;
>         // Anything after this needs to go to the transaction log, not
>         // applied directly in memory
>         isPreZAB1_0 = false;
>         // ZOOKEEPER-3911: make sure sync the uncommitted logs before commit
>         // them (ACK NEWLEADER).
>         sock.setSoTimeout(self.tickTime * self.syncLimit);
>         self.setSyncMode(QuorumPeer.SyncMode.NONE);
>         zk.startupWithoutServing();
>         if (zk instanceof FollowerZooKeeperServer) {
>             FollowerZooKeeperServer fzk = (FollowerZooKeeperServer) zk;
>             for (PacketInFlight p : packetsNotCommitted) {
>                 // Problem: the txns are written to disk asynchronously.
>                 fzk.logRequest(p.hdr, p.rec, p.digest);
>             }
>             packetsNotCommitted.clear();
>         }
>         writePacket(new QuorumPacket(Leader.ACK, newLeaderZxid, null, null),
>                 true);
>         break;
>     }
> In this method, when the follower receives the NEWLEADER msg, the current 
> epoch is updated before the uncommitted txns are written to disk, and the 
> txns are written asynchronously by the SyncThread. If the follower crashes 
> after setting the current epoch and sending the ACK_LD but before all 
> transactions are successfully written to disk, transaction loss can happen.
> This is because leader election is based on epoch first and then transaction 
> id. When the follower becomes the leader because it has the highest epoch, 
> it will ask the other followers to truncate txns even if they have been 
> written to disk, causing data loss.
> The following is the scenario:
> 1. Leader election happened
> 2. A follower synced with the leader via DIFF, received committed proposals 
> from the leader and kept them in memory
> 3. The follower received the NEWLEADER message
> 4. The follower updated its current epoch to newEpoch
> 5. The follower was bounced before writing all the uncommitted txns to disk
> 6. The leader shut down and a new election was triggered
> 7. The follower became the new leader because it had the largest currentEpoch
> 8. The new leader asked the other followers to truncate their committed txns, 
> and the transactions were lost
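
To illustrate why step 7 of the scenario happens, here is a minimal sketch 
of an "epoch first, then transaction id" vote comparison. It is modeled on 
the description above rather than copied from the ZooKeeper source: the 
Vote class, the beats() method, the server-id tie-break and all the 
numbers are assumptions made for the example.

```java
// Hypothetical sketch of an epoch-first vote ordering; not ZooKeeper code.
public class ElectionOrderSketch {
    static final class Vote {
        final long sid;          // server id (assumed final tie-breaker)
        final long currentEpoch; // epoch persisted on disk
        final long lastZxid;     // last transaction id persisted on disk

        Vote(long sid, long currentEpoch, long lastZxid) {
            this.sid = sid;
            this.currentEpoch = currentEpoch;
            this.lastZxid = lastZxid;
        }
    }

    // Returns true if vote a wins over vote b.
    static boolean beats(Vote a, Vote b) {
        if (a.currentEpoch != b.currentEpoch) {
            return a.currentEpoch > b.currentEpoch; // epoch decides first
        }
        if (a.lastZxid != b.lastZxid) {
            return a.lastZxid > b.lastZxid;         // then transaction id
        }
        return a.sid > b.sid;                       // then server id
    }

    public static void main(String[] args) {
        // Made-up values: the bounced follower persisted the new epoch (2)
        // but not the txns; the other follower has more txns on disk but
        // still the old epoch (1).
        Vote bounced = new Vote(1L, 2L, 0x100000003L);
        Vote other = new Vote(2L, 1L, 0x100000007L);
        System.out.println("bounced follower wins: " + beats(bounced, other));
    }
}
```

Because the epoch is compared before the transaction id, the server with 
fewer txns on disk wins here, which is how the truncation in step 8 
becomes possible.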



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
