Rishabh Rai created ZOOKEEPER-4766:
--------------------------------------
Summary: Ensure leader election time does not unnecessarily scale
with tree size due to snapshotting
Key: ZOOKEEPER-4766
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4766
Project: ZooKeeper
Issue Type: Improvement
Components: leaderElection
Affects Versions: 3.8.3, 3.5.9
Environment: General behavior, should occur in all environments
Reporter: Rishabh Rai
Fix For: 3.8.3, 3.5.9
Hi ZK community, this is regarding a fix for a behavior that is causing the
leader election time to unnecessarily scale with the amount of data in the ZK
data tree.
*tl;dr:* During leader election, the leader always saves a snapshot when
loading its data tree. This snapshot seems unnecessary, even in the case where
the leader needs to send an updated SNAP to a learner, since the leader
serializes the tree again before sending it anyway. Snapshotting slows down
leader election and significantly increases ZK downtime as more data is stored
in the tree. This improvement skips that snapshot so the unnecessary downtime
is avoided.
During leader election, when the [data is
loaded|https://github.com/apache/zookeeper/blob/79f1f71a9a76689065c14d0846a69d0d71d3586e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L601]
by the tentatively elected (i.e. pre-finalized quorum) leader server, a
[snapshot of the tree is always
taken|https://github.com/apache/zookeeper/blob/79f1f71a9a76689065c14d0846a69d0d71d3586e/zookeeper-server/src/main/java/org/apache/zookeeper/server/ZooKeeperServer.java#L540].
The loadData method is called from multiple places, but specifically in the
context of leader election, it seems like the snapshotting step is unnecessary
for the leader when loading data:
* Because the leader has already loaded the tree at this point, we know that if
it were to go down again, it could still recover to the current state without
using the snapshot that we are taking in loadData()
* There are no ongoing transactions until leader election is completed and the
ZK ensemble is back up, so no data would be lost after the point at which the
data tree is loaded
* Once the ensemble is healthy and the leader is handling transactions again,
any new transactions are logged and the log is rolled over when needed anyway.
So if the leader is recovering from a failure, the snapshot taken during
loadData() does not afford us any additional benefit over the initial snapshot
(if it existed) and the transaction log that the leader loaded its data from in
loadData()
* When the leader is deciding to send a SNAP or a DIFF to a learner, a [SNAP
is
serialized|https://github.com/apache/zookeeper/blob/79f1f71a9a76689065c14d0846a69d0d71d3586e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerHandler.java#L582]
and sent [if and only if it is
needed|https://github.com/apache/zookeeper/blob/79f1f71a9a76689065c14d0846a69d0d71d3586e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerHandler.java#L562].
The snapshot taken in loadData() again does not seem to be beneficial here.
The PR for this fix only skips this snapshotting step in loadData() during
leader election. The behavior of the function remains the same for other
usages. With this change, during leader election the data tree would only be
serialized when sending a SNAP to a learner. In other scenarios, no data tree
serialization would be needed at all. In both cases, there is a significant
reduction in the time spent in leader election.
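To make the proposed behavior concrete, here is a minimal, self-contained
sketch of the idea. It is not the actual ZooKeeperServer code or the PR diff:
the class LoadDataSketch, the LoadReason enum, and the restoreFromDisk() and
takeSnapshot() helpers are all hypothetical stand-ins, assumed only for
illustration. The only point is that the snapshot step is skipped when the
load happens as part of leader election and is kept for every other caller.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for the data-loading path.
// None of these names are the real ZooKeeper API.
class LoadDataSketch {

    /** Why loadData() is being called (hypothetical enum). */
    enum LoadReason { LEADER_ELECTION, OTHER }

    /** Records which steps ran, so the behavior can be inspected. */
    final List<String> steps = new ArrayList<>();

    /** Stand-in for restoring the tree from the latest snapshot plus txn log. */
    void restoreFromDisk() {
        steps.add("restore");
    }

    /** Stand-in for serializing the whole tree to disk (cost grows with tree size). */
    void takeSnapshot() {
        steps.add("snapshot");
    }

    /** Loads the tree; snapshots afterwards unless this is a leader-election load. */
    void loadData(LoadReason reason) {
        restoreFromDisk();
        if (reason != LoadReason.LEADER_ELECTION) {
            takeSnapshot();
        }
    }
}
```

In this sketch, a call with LoadReason.LEADER_ELECTION records only the
restore step, while any other reason records the restore followed by the
snapshot, which matches the "same behavior for other usages" intent above.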
If my understanding of any of this is incorrect, or if I'm failing to consider
some other aspect of the process, please let me know. The PR for the change can
also be changed to enable/disable this behavior via a java property.
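As a rough illustration of what gating this on a java property could look
like, here is a small sketch. The property name
"example.leader.skipSnapshotOnLoad" and the class SnapshotToggleSketch are
made up for this example and are not existing ZooKeeper settings; the sketch
also assumes the current behavior (always snapshot) stays the default unless
the property is explicitly set to "true".

```java
// Hypothetical toggle; the property name below is an assumption for
// illustration and is not an existing ZooKeeper configuration key.
class SnapshotToggleSketch {

    static final String SKIP_PROP = "example.leader.skipSnapshotOnLoad";

    /**
     * Returns true if a snapshot should be taken after loading the tree.
     * The snapshot is skipped only when the load happens during leader
     * election AND the property is set to "true".
     */
    static boolean shouldSnapshotOnLoad(boolean duringLeaderElection) {
        // Boolean.getBoolean reads a system property; false when unset.
        boolean skipEnabled = Boolean.getBoolean(SKIP_PROP);
        return !(duringLeaderElection && skipEnabled);
    }
}
```

With this shape, starting the JVM with
-Dexample.leader.skipSnapshotOnLoad=true would opt in to the new behavior,
and leaving the property unset would keep the existing one.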
--
This message was sent by Atlassian Jira
(v8.20.10#820010)