[
https://issues.apache.org/jira/browse/ZOOKEEPER-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated ZOOKEEPER-4766:
--------------------------------------
Labels: pull-request-available (was: )
> Ensure leader election time does not unnecessarily scale with tree size due
> to snapshotting
> -------------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-4766
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4766
> Project: ZooKeeper
> Issue Type: Improvement
> Components: leaderElection
> Affects Versions: 3.5.9, 3.8.3
> Environment: General behavior, should occur in all environments
> Reporter: Rishabh Rai
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.5.9, 3.8.3
>
> Original Estimate: 24h
> Time Spent: 10m
> Remaining Estimate: 23h 50m
>
> Hi ZK community, this is regarding a fix for a behavior that is causing the
> leader election time to unnecessarily scale with the amount of data in the ZK
> data tree.
> *tl;dr:* During leader election, the leader always saves a snapshot when
> loading its data tree. This snapshot seems unnecessary, even in the case
> where the leader needs to send an updated SNAP to a learner, since it
> serializes the tree before sending anyway. Snapshotting slows down leader
> election and increases ZK downtime significantly as more data is stored in
> the tree. This improvement is to avoid taking a snapshot so that this
> unnecessary downtime is avoided.
> During leader election, when the [data is
> loaded|https://github.com/apache/zookeeper/blob/79f1f71a9a76689065c14d0846a69d0d71d3586e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L601]
> by the tentatively elected (i.e. pre-finalized quorum) leader server, a
> [snapshot of the tree is always
> taken|https://github.com/apache/zookeeper/blob/79f1f71a9a76689065c14d0846a69d0d71d3586e/zookeeper-server/src/main/java/org/apache/zookeeper/server/ZooKeeperServer.java#L540].
> The loadData method is called from multiple places, but specifically in the
> context of leader election, it seems like the snapshotting step is
> unnecessary for the leader when loading data:
> * Because the leader has loaded the tree at this point, we know that if it
> were to go down again, it could still recover to this same state without
> using the snapshot that we are taking in loadData()
> * There are no ongoing transactions until leader election is completed and
> the ZK ensemble is back up, so no data would be lost after the point at which
> the data tree is loaded
> * Once the ensemble is healthy and the leader is handling transactions
> again, any new transactions are logged and the log is rolled over when
> needed anyway, so if the leader is recovering from a failure, the snapshot
> taken during loadData() does not afford us any additional benefits over the
> initial snapshot (if it existed) and the transaction log that the leader
> loaded its data from in loadData()
> * When the leader is deciding to send a SNAP or a DIFF to a learner, a [SNAP
> is
> serialized|https://github.com/apache/zookeeper/blob/79f1f71a9a76689065c14d0846a69d0d71d3586e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerHandler.java#L582]
> and sent [if and only if it is
> needed|https://github.com/apache/zookeeper/blob/79f1f71a9a76689065c14d0846a69d0d71d3586e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerHandler.java#L562].
> The snapshot taken in loadData() again does not seem to be beneficial here.
> The PR for this fix only skips this snapshotting step in loadData() during
> leader election. The behavior of the function remains the same for other
> usages. With this change, during leader election the data tree would only be
> serialized when sending a SNAP to a learner. In other scenarios, no data tree
> serialization would be needed at all. In both cases, there is a significant
> reduction in the time spent in leader election.
> If my understanding of any of this is incorrect, or if I'm failing to
> consider some other aspect of the process, please let me know. The PR for the
> change can also be adjusted to enable/disable this behavior via a Java
> property.
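
The reasoning in the description can be illustrated with a small toy model. This is only a sketch under stated assumptions, not ZooKeeper's code and not the PR: the class LoadDataSketch, its nested Server, the duringLeaderElection parameter, and the property name sketch.skipSnapshotOnLeaderLoad are all hypothetical names invented here. It counts how often the full tree is serialized (the step whose cost grows with tree size) with and without the snapshot taken while loading data.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the proposal. These are NOT ZooKeeper's real classes. */
public class LoadDataSketch {

    /** Hypothetical property name, invented for this sketch only. */
    public static final String SKIP_PROPERTY = "sketch.skipSnapshotOnLeaderLoad";

    /** Reads the hypothetical toggle; false unless the property equals "true". */
    public static boolean skipFromProperty() {
        return Boolean.getBoolean(SKIP_PROPERTY);
    }

    public static class Server {
        private final boolean skipSnapshotDuringElection;
        private final List<String> tree = new ArrayList<>();
        private int serializations = 0;
        private int snapshotsTaken = 0;

        public Server(boolean skipSnapshotDuringElection) {
            this.skipSnapshotDuringElection = skipSnapshotDuringElection;
        }

        /** Rebuilds the in-memory tree, then optionally writes a fresh snapshot. */
        public void loadData(boolean duringLeaderElection) {
            tree.clear();
            tree.add("/a"); // stands in for restoring the last snapshot
            tree.add("/b"); // stands in for replaying the transaction log
            boolean skip = duringLeaderElection && skipSnapshotDuringElection;
            if (!skip) {
                takeSnapshot();
            }
        }

        private void takeSnapshot() {
            serialize();
            snapshotsTaken++;
        }

        /** Walks the whole tree, so its cost scales with the amount of data. */
        private String serialize() {
            serializations++;
            return String.join(",", tree);
        }

        /** The tree is serialized for a learner only when a SNAP is needed. */
        public String syncLearner(boolean learnerNeedsSnap) {
            return learnerNeedsSnap ? "SNAP:" + serialize() : "DIFF";
        }

        public int serializations() { return serializations; }
        public int snapshotsTaken() { return snapshotsTaken; }
    }

    public static void main(String[] args) {
        Server current = new Server(false);
        current.loadData(true);
        current.syncLearner(true);

        Server proposed = new Server(skipFromProperty());
        proposed.loadData(true);
        proposed.syncLearner(true);

        System.out.println("current serializations:  " + current.serializations());
        System.out.println("proposed serializations: " + proposed.serializations());
    }
}
```

Running it with -Dsketch.skipSnapshotOnLeaderLoad=true exercises the proposed path; without the property both servers behave the same. Passing duringLeaderElection=false keeps the snapshot in either mode, which mirrors the statement that other callers of loadData() are unaffected.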
--
This message was sent by Atlassian Jira
(v8.20.10#820010)