[
https://issues.apache.org/jira/browse/HDDS-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699605#comment-17699605
]
Ivan Andika commented on HDDS-8131:
-----------------------------------
[~Nibiruxu] Thank you for your continued feedback.
> The influential purge is the leader's purge, and so is the leader's
> commitIndex.
Agreed.
> One slow follower will not affect the leader's commitIndex.
Agreed since there is already an up-to-date follower which forms the majority.
I think I did not make a distinction between the *local commitIndex*, which is
stored by each raft server (including the leader), and the {*}leader's commitIndex{*},
which is the leader's view of the followers' commitIndex, sort of like matchIndex[] in
the Raft paper.
> there is a chance where the leader's snapshotIndex is larger than the
> commitIndex and nextIndex of the slow follower
> I did not get what this means or implies.
This implies that if purgeUptoSnapshotIndex is true, there is a chance that
the leader has already purged its logs up to the snapshotIndex.
Therefore, the log indices between the late follower's commitIndex and the
leader's snapshotIndex (late follower's commitIndex < *purged indices* <=
leader's snapshotIndex) will not exist in either the leader's or the late
follower's logs. In this case, when the leader tries to replicate its logs to
the late/slow follower, it cannot replicate the purged entries (since they no
longer exist), and therefore it forces the follower to install a snapshot.
Going back to the example where purgeUptoSnapshotIndex is true.
The leader's snapshotIndex is 1_500_000 and the leader's log has been purged up
to index 1_500_000 (meaning the leader's {{StartIndex}} is 1_500_001). The late
follower's commitIndex is 1_000_000 (its {{NextIndex}} is 1_000_001). When the
leader tries to send AppendEntries to the late follower, it needs to send
entries from index 1_000_001 onwards, but the leader realizes that log indices
1_000_001 to 1_500_000 have already been purged although they still need to be
replicated to the late follower. So it has no choice but to force the late
follower to install the snapshot, since {{follower's NextIndex < leader's
StartIndex}}.
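To make the comparison concrete, here is a minimal sketch in Python using the numbers from this example. It is only an illustration, not Ratis code: the function name {{needs_snapshot_install}} and its parameters are hypothetical.

```python
def needs_snapshot_install(follower_next_index: int, leader_start_index: int) -> bool:
    """Return True if the entry the follower needs next is no longer in the leader's log.

    Hypothetical helper for illustration; it only encodes the comparison
    follower's NextIndex < leader's StartIndex.
    """
    return follower_next_index < leader_start_index


# Numbers from the example.
leader_snapshot_index = 1_500_000
leader_start_index = leader_snapshot_index + 1    # log purged up to the snapshot index
follower_commit_index = 1_000_000
follower_next_index = follower_commit_index + 1

# Number of entries the follower still needs that the leader has already purged.
missing = leader_start_index - follower_next_index

print(needs_snapshot_install(follower_next_index, leader_start_index))  # True
print(missing)                                                          # 500000
```

A follower whose NextIndex equals the leader's StartIndex would still be served by AppendEntries in this sketch, since the first entry it needs is the first one the leader still holds.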
> turning off the purgeUptoSnapshotIndex seems to trigger the purge more easily
If there is a late follower and purgeUptoSnapshotIndex is disabled, the maximum
index the leader can pass to purge is min(leader's snapshotIndex, leader's view
of the commitIndex of all peers).
So the leader can only purge *up to* {*}min(leader's snapshotIndex, commitIndex
of all peers) = slowest follower's commitIndex{*}.
In the extreme case where the follower is dead and the matchIndex of
the follower is never updated, the leader will never purge any log until the
follower is back online.
That is why, in my opinion, purges will happen less often if we disable
purgeUptoSnapshotIndex.
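The bound described above can be sketched as follows. This is a simplified model in Python, not the actual Ratis logic; {{max_purge_index}} and its parameters are hypothetical names, and the peer indices stand for the leader's view of each follower's commitIndex.

```python
def max_purge_index(snapshot_index, peer_commit_indices, purge_upto_snapshot_index):
    """Upper bound (inclusive) on the log index the leader may purge.

    Simplified, hypothetical model of the two modes discussed in this comment.
    """
    if purge_upto_snapshot_index:
        # Enabled: purge everything covered by the leader's own snapshot.
        return snapshot_index
    # Disabled: the slowest peer caps the purge.
    return min([snapshot_index, *peer_commit_indices])


peers = [1_500_000, 1_000_000]  # one up-to-date follower, one late follower
print(max_purge_index(1_500_000, peers, purge_upto_snapshot_index=True))   # 1500000
print(max_purge_index(1_500_000, peers, purge_upto_snapshot_index=False))  # 1000000

# A dead follower: its index stays put while the leader keeps taking snapshots,
# so the bound does not move.
print(max_purge_index(2_000_000, [2_000_000, 1_000_000],
                      purge_upto_snapshot_index=False))                    # 1000000
```

In this model the bound with the flag disabled is never larger than the bound with it enabled, which is why I expect less purging in that mode.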
> Add Configuration for OM Ratis Log Purge Tuning Parameters
> ----------------------------------------------------------
>
> Key: HDDS-8131
> URL: https://issues.apache.org/jira/browse/HDDS-8131
> Project: Apache Ozone
> Issue Type: Improvement
> Components: Ozone Manager
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.4.0
>
>
> Currently Ozone Manager enables {{raft.server.log.purge.upto.snapshot.index}}
> by default.
> However, for an OM cluster with a large metadata store, there might be a case
> where the OM leader purges its Ratis logs before a slow follower has
> replicated them to its log. This means that the follower needs to download
> the whole metadata store from the OM leader. This can be problematic if the
> metadata store in the leader is too large.
> We should add two configurations in OM to enable/disable Ratis purge
> parameters:
> * {{raft.server.log.purge.upto.snapshot.index}}
> ** Disabling this would guarantee that the OM leader will not purge its
> Ratis log unless all the logs have been replicated to all the followers
> (through {{{}commitIndex{}}}).
> ** This would effectively mean that there shouldn't be a case where the
> slow follower needs to download the full metadata from the leader, so there
> would be no snapshot download by the follower. For small OM metadata, it can
> be faster for the follower to download the leader's metadata snapshot than to
> replicate and apply the outstanding logs normally.
> ** For a very slow follower / downed follower, the OM leader cannot purge
> the log until the follower catches up to it. This might increase the disk
> space usage for the OM leader.
> ** Default would be {{true}} to preserve the current OM snapshot behavior
> * {{raft.server.log.purge.preservation.log.num}}
> ** RATIS-1626 introduces logic to preserve the latest n won't-be-purged logs
> ** Setting n > 0 while still enabling
> {{raft.server.log.purge.upto.snapshot.index}} should strike a balance between
> the cost of preserving & transferring logs and the cost of transferring a
> snapshot.
> ** Default would be 0 to preserve the current OM snapshot behavior