[jira] [Created] (ZOOKEEPER-4766) Ensure leader election time does not unnecessarily scale with tree size due to snapshotting

Rishabh Rai (Jira) Mon, 30 Oct 2023 15:35:05 -0700

Rishabh Rai created ZOOKEEPER-4766:
--------------------------------------

             Summary: Ensure leader election time does not unnecessarily scale 
with tree size due to snapshotting
                 Key: ZOOKEEPER-4766
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4766
             Project: ZooKeeper
          Issue Type: Improvement
          Components: leaderElection
    Affects Versions: 3.8.3, 3.5.9
         Environment: General behavior, should occur in all environments
            Reporter: Rishabh Rai
             Fix For: 3.8.3, 3.5.9



Hi ZK community, this is regarding a fix for a behavior that is causing the 
leader election time to unnecessarily scale with the amount of data in the ZK 
data tree.



*tl;dr:* During leader election, the leader always saves a snapshot when 
loading its data tree. This snapshot seems unnecessary, even in the case where 
the leader needs to send an updated SNAP to a learner, since it serializes the 
tree before sending anyway. Snapshotting slows down leader election and 
increases ZK downtime significantly as more data is stored in the tree. This 
improvement is to avoid taking a snapshot so that this unnecessary downtime is 
avoided.


During leader election, when the [data is 
loaded|https://github.com/apache/zookeeper/blob/79f1f71a9a76689065c14d0846a69d0d71d3586e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L601]
 by the tentatively elected (i.e. pre-finalized quorum) leader server, a 
[snapshot of the tree is always 
taken|https://github.com/apache/zookeeper/blob/79f1f71a9a76689065c14d0846a69d0d71d3586e/zookeeper-server/src/main/java/org/apache/zookeeper/server/ZooKeeperServer.java#L540].
 The loadData method is called from multiple places, but specifically in the 
context of leader election, it seems like the snapshotting step is unnecessary 
for the leader when loading data:
 * Because it has loaded the tree at this point, we know that if the leader 
were to go down again, it would still be able to recover back to the current 
state at which we are snapshotting without using the snapshot that we are 
taking in loadData()
 * There are no ongoing transactions until leader election is completed and the 
ZK ensemble is back up, so no data would be lost after the point at which the 
data tree is loaded
 * Once the ensemble is healthy and the leader is handling transactions again, 
any new transactions are being logged and the log is being rolled over when 
needed anyway, so if the leader is recovering from a failure, the 
snapshot taken during loadData() does not afford us any additional benefits 
over the initial snapshot (if it existed) and transaction log that the leader 
used to load its data from in loadData()
 * When the leader is deciding to send a SNAP or a DIFF to a learner, a [SNAP 
is 
serialized|https://github.com/apache/zookeeper/blob/79f1f71a9a76689065c14d0846a69d0d71d3586e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerHandler.java#L582]
 and sent [if and only if it is 
needed|https://github.com/apache/zookeeper/blob/79f1f71a9a76689065c14d0846a69d0d71d3586e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerHandler.java#L562].
 The snapshot taken in loadData() again does not seem to be beneficial here.
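
The points above can be illustrated with a minimal toy sketch. It is not the actual ZooKeeperServer code: ToyServer, the boolean takeSnapshot parameter and the step names are assumptions made up for illustration. It only shows that restoring the tree is always required, while the tree-sized snapshot write is the optional step.

```java
import java.util.ArrayList;
import java.util.List;

public class LoadDataSketch {
    /** Toy stand-in for a server (hypothetical); it only records which steps ran. */
    static final class ToyServer {
        final List<String> steps = new ArrayList<>();

        /** takeSnapshot is a hypothetical flag, not an existing ZooKeeper parameter. */
        void loadData(boolean takeSnapshot) {
            // Always needed: rebuild the in-memory tree from the latest
            // on-disk snapshot plus the transaction log.
            steps.add("restore");
            if (takeSnapshot) {
                // Serializes the whole tree to disk, so the cost grows with tree size.
                steps.add("snapshot");
            }
        }
    }

    public static void main(String[] args) {
        ToyServer other = new ToyServer();
        other.loadData(true);    // callers outside leader election keep the current behavior
        ToyServer leader = new ToyServer();
        leader.loadData(false);  // leader election path skips the snapshot
        System.out.println("other caller: " + other.steps);
        System.out.println("leader:       " + leader.steps);
    }
}
```

In this sketch the leader still ends up with the same restored tree; the only difference is that the serialization step is not paid during election.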

The PR for this fix only skips this snapshotting step in loadData() during 
leader election. The behavior of the function remains the same for other 
usages. With this change, during leader election the data tree would only be 
serialized when sending a SNAP to a learner. In other scenarios, no data tree 
serialization would be needed at all. In both cases, there is a significant 
reduction in the time spent in leader election.

If my understanding of any of this is incorrect, or if I'm failing to consider 
some other aspect of the process, please let me know. The PR for the change can 
also be changed to enable/disable this behavior via a java property.
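
As a rough sketch of what such a Java property toggle could look like: the property name below is purely hypothetical (it is not an existing ZooKeeper setting), and the default is assumed to keep the current behavior of taking the snapshot.

```java
public class SnapshotToggleSketch {
    // Hypothetical property name, invented for this sketch only.
    static final String SKIP_PROP = "zookeeper.sketch.skipLeaderLoadSnapshot";

    /** Returns true unless the (hypothetical) property is set to "true". */
    static boolean shouldSnapshotOnLeaderLoad() {
        // Boolean.getBoolean is true only when the system property exists
        // and equals "true" (case-insensitive), so the default is to snapshot.
        return !Boolean.getBoolean(SKIP_PROP);
    }

    public static void main(String[] args) {
        System.out.println("snapshot on leader load: " + shouldSnapshotOnLeaderLoad());
    }
}
```

Running with -Dzookeeper.sketch.skipLeaderLoadSnapshot=true would make the check return false in this sketch; leaving the property unset keeps the existing behavior.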



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
