[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Wang updated ZOOKEEPER-4785:
-------------------------------
    Description: 
We recently had a txn loss incident in production. After investigation, we 
found it was caused by a race condition in the Learner.syncWithLeader() method: 
the follower writes the current epoch and sends the ACK_LD before it has 
successfully persisted all the txns from the DIFF sync.

{code:java}
case Leader.NEWLEADER: 
        ...
        self.setCurrentEpoch(newEpoch);
        writeToTxnLog = true;
        // Anything after this needs to go to the transaction log, not applied directly in memory
        isPreZAB1_0 = false;

        // ZOOKEEPER-3911: make sure sync the uncommitted logs before commit them (ACK NEWLEADER).
        sock.setSoTimeout(self.tickTime * self.syncLimit);
        self.setSyncMode(QuorumPeer.SyncMode.NONE);
        zk.startupWithoutServing();
        if (zk instanceof FollowerZooKeeperServer) {
            FollowerZooKeeperServer fzk = (FollowerZooKeeperServer) zk;
            for (PacketInFlight p : packetsNotCommitted) {
                fzk.logRequest(p.hdr, p.rec, p.digest);
            }
            packetsNotCommitted.clear();
        }

        writePacket(new QuorumPacket(Leader.ACK, newLeaderZxid, null, null), true);
        break;
    }
{code}



In this method, when the follower receives the NEWLEADER msg, the current epoch 
is updated before the uncommitted txns are written to disk, and the txns are 
written asynchronously by the SyncThread. If the follower crashes after setting 
the current epoch and sending ACK_LD but before all transactions are 
successfully written to disk, transaction loss can happen.
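
To make the window concrete, here is a minimal toy model of the ordering 
described above. It is not ZooKeeper code: ToyFollower and all of its fields 
and methods are hypothetical names, and "disk" is just a field that survives 
the simulated crash. It only assumes what the description states, namely that 
the epoch is persisted right away while the txns are handed to a background 
writer.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the race window; hypothetical names, not ZooKeeper classes.
final class ToyFollower {
    // State that survives a crash (stands in for what is on disk).
    long durableEpoch;
    final List<Long> durableTxnLog = new ArrayList<>();

    // Txns handed to the background writer but not yet flushed.
    private final List<Long> pendingTxns = new ArrayList<>();

    ToyFollower(long epoch) {
        this.durableEpoch = epoch;
    }

    // Same order as in the excerpt: epoch first, then queue the txns, then ACK.
    void onNewLeader(long newEpoch, List<Long> packetsNotCommitted) {
        durableEpoch = newEpoch;                 // persisted immediately
        pendingTxns.addAll(packetsNotCommitted); // only queued for async write
        // The ACK would go out here, before flush() has run.
    }

    // What the background writer does some time later.
    void flush() {
        durableTxnLog.addAll(pendingTxns);
        pendingTxns.clear();
    }

    // A crash discards everything that has not been flushed yet.
    void crash() {
        pendingTxns.clear();
    }
}
```

If crash() runs between onNewLeader() and flush(), the model is left with the 
new epoch but an empty txn log, which is the state the scenario below starts 
from.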

This is because leader election compares the epoch first and then the 
transaction id. When the follower becomes leader because it has the highest 
epoch, it asks the other followers to truncate txns even though they have 
already been written to disk, causing data loss.
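
The ordering can be sketched as a small comparison function. This is a 
simplified illustration of "epoch first, then transaction id", with the server 
id assumed as a final tie-breaker; VoteOrder and newVoteWins are hypothetical 
names, and ZooKeeper's real election logic has more conditions than this.

```java
// Simplified vote ordering: epoch, then zxid, then server id (assumed tie-breaker).
final class VoteOrder {
    private VoteOrder() {}

    /** Returns true if the "new" vote should replace the "current" one. */
    static boolean newVoteWins(long newId, long newZxid, long newEpoch,
                               long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch; // the epoch dominates everything else
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;   // then the last logged transaction id
        }
        return newId > curId;           // finally the server id
    }
}
```

With this ordering, a server with epoch 2 and last zxid 0x100000005 beats a 
server with epoch 1 and last zxid 0x10000000a, even though the second server 
has more txns on disk.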

The following is the scenario:

1. Leader election happened.
2. A follower synced with the leader via DIFF, received committed proposals 
from the leader and kept them in memory.
3. The follower received the NEWLEADER message.
4. The follower updated its currentEpoch to newEpoch.
5. The follower was bounced before writing all the uncommitted txns to disk.
6. The leader shut down and a new election was triggered.
7. The follower became the new leader because it had the largest currentEpoch.
8. The new leader asked the other followers to truncate their committed txns, 
and the transactions were lost.













> Txn loss due to race condition in Learner.syncWithLeader() when follower DIFF 
> sync with leader
> ----------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4785
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4785
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.8.0, 3.7.1, 3.8.1, 3.7.2, 3.8.2, 3.9.1
>            Reporter: Li Wang
>            Assignee: Li Wang
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
