[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ZOOKEEPER-4766:
--------------------------------------
    Labels: pull-request-available  (was: )

> Ensure leader election time does not unnecessarily scale with tree size due 
> to snapshotting
> -------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4766
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4766
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: leaderElection
>    Affects Versions: 3.5.9, 3.8.3
>         Environment: General behavior, should occur in all environments
>            Reporter: Rishabh Rai
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.9, 3.8.3
>
>   Original Estimate: 24h
>          Time Spent: 10m
>  Remaining Estimate: 23h 50m
>
> Hi ZK community, this is regarding a fix for a behavior that is causing the 
> leader election time to unnecessarily scale with the amount of data in the ZK 
> data tree.
> *tl;dr:* During leader election, the leader always saves a snapshot when 
> loading its data tree. This snapshot seems unnecessary, even in the case 
> where the leader needs to send an updated SNAP to a learner, since it 
> serializes the tree before sending anyway. Snapshotting slows down leader 
> election and increases ZK downtime significantly as more data is stored in 
> the tree. This improvement is to avoid taking a snapshot so that this 
> unnecessary downtime is avoided.
> During leader election, when the [data is 
> loaded|https://github.com/apache/zookeeper/blob/79f1f71a9a76689065c14d0846a69d0d71d3586e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L601]
>  by the tentatively elected (i.e. pre-finalized quorum) leader server, a 
> [snapshot of the tree is always 
> taken|https://github.com/apache/zookeeper/blob/79f1f71a9a76689065c14d0846a69d0d71d3586e/zookeeper-server/src/main/java/org/apache/zookeeper/server/ZooKeeperServer.java#L540].
>  The loadData method is called from multiple places, but specifically in the 
> context of leader election, it seems like the snapshotting step is 
> unnecessary for the leader when loading data:
>  * Because it has loaded the tree at this point, we know that if the leader 
> were to go down again, it would still be able to recover back to the current 
> state at which we are snapshotting without using the snapshot that we are 
> taking in loadData()
>  * There are no ongoing transactions until leader election is completed and 
> the ZK ensemble is back up, so no data would be lost after the point at which 
> the data tree is loaded
>  * Once the ensemble is healthy and the leader is handling transactions 
> again, new transactions are logged and the log is rolled over when needed 
> anyway. So if the leader later recovers from a failure, the snapshot taken 
> during loadData() offers no additional benefit over the initial snapshot (if 
> one existed) and the transaction log that the leader originally loaded its 
> data from in loadData()
>  * When the leader is deciding whether to send a SNAP or a DIFF to a learner, a [SNAP 
> is 
> serialized|https://github.com/apache/zookeeper/blob/79f1f71a9a76689065c14d0846a69d0d71d3586e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerHandler.java#L582]
>  and sent [if and only if it is 
> needed|https://github.com/apache/zookeeper/blob/79f1f71a9a76689065c14d0846a69d0d71d3586e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerHandler.java#L562].
>  The snapshot taken in loadData() again does not seem to be beneficial here.
> The PR for this fix only skips this snapshotting step in loadData() during 
> leader election. The behavior of the function remains the same for other 
> usages. With this change, during leader election the data tree would only be 
> serialized when sending a SNAP to a learner. In other scenarios, no data tree 
> serialization would be needed at all. In both cases, there is a significant 
> reduction in the time spent in leader election.
> If my understanding of any of this is incorrect, or if I'm failing to 
> consider some other aspect of the process, please let me know. The PR for the 
> change can also be changed to enable/disable this behavior via a java 
> property.
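
The proposal above can be pictured with a small, self-contained model. This is only a sketch under stated assumptions, not the code in the PR: the class LoadDataSketch, its methods, and the property name sketch.snapshotDuringLeaderElection are all hypothetical and are not ZooKeeper APIs or settings. It models loadData() as "restore the tree, then optionally write a snapshot", and shows how a Java system property could gate the snapshot only on the leader-election path while every other caller keeps the current behavior.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model only; none of these names are real ZooKeeper APIs. */
final class LoadDataSketch {
    /** Hypothetical property name, not an existing ZooKeeper setting. */
    static final String SNAPSHOT_ON_ELECTION_PROP = "sketch.snapshotDuringLeaderElection";

    /** Records the expensive steps taken, so the effect of the flag is visible. */
    final List<String> steps = new ArrayList<>();

    /** Reads the toggle; defaults to true so the existing behavior is preserved. */
    static boolean snapshotDuringElection() {
        return Boolean.parseBoolean(System.getProperty(SNAPSHOT_ON_ELECTION_PROP, "true"));
    }

    /** Models loadData(): restore the tree, then snapshot unless skipped. */
    void loadData(boolean calledForLeaderElection) {
        steps.add("restore tree from snapshot and txn log");
        // Only the leader-election caller may skip; other callers are unchanged.
        boolean skip = calledForLeaderElection && !snapshotDuringElection();
        if (!skip) {
            steps.add("write snapshot to disk");
        }
    }

    /** Models learner sync: the tree is serialized only when a SNAP is needed. */
    void syncLearner(boolean learnerNeedsSnap) {
        steps.add(learnerNeedsSnap ? "serialize tree and send SNAP" : "send DIFF");
    }

    public static void main(String[] args) {
        System.setProperty(SNAPSHOT_ON_ELECTION_PROP, "false");
        LoadDataSketch leader = new LoadDataSketch();
        leader.loadData(true);     // election path: snapshot skipped
        leader.syncLearner(false); // learner only needs a DIFF
        System.out.println(leader.steps);
    }
}
```

In this model, with the property set to false, the election path serializes the tree at most once (when a learner needs a SNAP) and not at all when a DIFF suffices. Defaulting the property to true is one possible design choice that would keep existing deployments unaffected unless they opt in.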



--
This message was sent by Atlassian Jira
(v8.20.10#820010)