[ 
https://issues.apache.org/jira/browse/HDDS-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699605#comment-17699605
 ] 

Ivan Andika commented on HDDS-8131:
-----------------------------------

[~Nibiruxu] Thank you for your continued feedback.

> The influential purge is the leader's purge, and so is the leader's 
> commitIndex.

Agreed.

> One slow follower will not affect the leader's commitIndex.

Agreed, since there is already an up-to-date follower that forms the majority. 
I think I did not make a distinction between the *local commitIndex*, which is 
stored by each Raft server (including the leader), and the {*}leader's 
commitIndex{*}, which is the leader's view of the followers' commitIndex, sort 
of like matchIndex[] in the Raft paper.

> there is a chance where the leader's snapshotIndex is larger than the 
> commitIndex and nextIndex of the slow follower

> I did not get what this means or implies.

This implies that if purgeUptoSnapshotIndex is true, there is a chance that 
the leader has already purged its log up to the snapshotIndex.

Therefore, the log indices between the late follower's commitIndex and the 
leader's snapshotIndex (late follower's commitIndex < *purged indices* <= 
leader's snapshotIndex) will not exist in either the leader's or the late 
follower's log. In this case, when the leader tries to replicate its log to 
the late/slow follower, it cannot replicate these purged entries (since they 
no longer exist), and therefore it forces the follower to install a 
snapshot.

Going back to the example where purgeUptoSnapshotIndex is true.

The leader's snapshotIndex is 1_500_000 and the leader's log has been purged up 
to index 1_500_000 (meaning the leader's {{StartIndex}} is 1_500_001). The late 
follower's commitIndex is 1_000_000 (its {{NextIndex}} is 1_000_001). When the 
leader tries to send AppendEntries to the late follower, it needs to send 
entries for index 1_000_001 onwards, but the leader realizes that log indices 
1_000_001 to 1_500_000 have already been purged although they still need to be 
replicated to the late follower. So it has no choice but to force the late 
follower to install the snapshot since {{follower's NextIndex < leader's 
StartIndex.}}
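
The check in this example can be sketched as below. This is a simplified Python model of the reasoning, not Ratis code; {{needs_install_snapshot}} and its parameter names are hypothetical and chosen only for illustration.

```python
def needs_install_snapshot(follower_next_index: int, leader_start_index: int) -> bool:
    """Return True when the leader no longer holds the entry the follower needs next.

    Hypothetical helper: models 'follower's NextIndex < leader's StartIndex'.
    """
    return follower_next_index < leader_start_index


# Numbers from the example above.
leader_snapshot_index = 1_500_000
leader_start_index = leader_snapshot_index + 1   # log purged up to the snapshot index
follower_commit_index = 1_000_000
follower_next_index = follower_commit_index + 1

# Entries the follower still needs but the leader has already purged.
missing = leader_start_index - follower_next_index

print(needs_install_snapshot(follower_next_index, leader_start_index), missing)
```

Here {{missing}} covers indices 1_000_001 through 1_500_000, which is exactly the range the leader can no longer send through AppendEntries.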

> turning off the purgeUptoSnapshotIndex seems to trigger the purge more easily

If there is a late follower and purgeUptoSnapshotIndex is disabled, the leader 
will not be able to purge up to its snapshotIndex: the maximum index passed to 
purge is min(leader's snapshotIndex, the leader's view of the commitIndex of 
all peers).

So the leader can only purge *up to* {*}min(leader's snapshotIndex, commitIndex 
of all peers) = slowest follower's commitIndex{*}.

In the extreme case where the follower is dead and the matchIndex of 
the follower is never updated, the leader will never purge any log until the 
follower is back online.

That is why, in my opinion, purges will be less frequent if we disable 
purgeUptoSnapshotIndex.
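
The purge bound described above can be modelled as follows. This is a sketch in Python of the rule as stated in this comment, not the Ratis implementation; {{max_purge_index}} and its arguments are hypothetical names.

```python
def max_purge_index(snapshot_index, peer_commit_indices, purge_upto_snapshot_index):
    """Upper bound (inclusive) on the log index the leader may purge.

    Hypothetical helper modelling the rule described in this comment.
    """
    if purge_upto_snapshot_index:
        # Followers are ignored; a slow one may later need a snapshot install.
        return snapshot_index
    # Never purge past what the slowest peer is known to have committed.
    return min([snapshot_index, *peer_commit_indices])


# Leader snapshot at 1_500_000, one healthy peer, one slow peer.
peers = [1_500_000, 1_000_000]
enabled = max_purge_index(1_500_000, peers, purge_upto_snapshot_index=True)
disabled = max_purge_index(1_500_000, peers, purge_upto_snapshot_index=False)

# A dead follower: its index stays put while the leader keeps taking snapshots,
# so the bound never moves.
stuck = [
    max_purge_index(s, [s, 1_000_000], False)
    for s in (1_500_000, 2_000_000, 2_500_000)
]
print(enabled, disabled, stuck)
```

With the flag disabled, the bound follows the slowest peer rather than the snapshot, so the leader's log keeps growing for as long as that peer stays behind.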

> Add Configuration for OM Ratis Log Purge Tuning Parameters
> ----------------------------------------------------------
>
>                 Key: HDDS-8131
>                 URL: https://issues.apache.org/jira/browse/HDDS-8131
>             Project: Apache Ozone
>          Issue Type: Improvement
>          Components: Ozone Manager
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.4.0
>
>
> Currently Ozone Manager enables {{raft.server.log.purge.upto.snapshot.index}} 
> by default.
> However, for an OM cluster with a large metadata store, there might be a case 
> where the OM leader purges its Ratis logs before a slow follower has 
> replicated them to its log. This means that the follower needs to download 
> the whole metadata store from the OM leader. This can be problematic if the 
> metadata store in the leader is too large.
> We should add two configurations in OM to enable/disable Ratis purge 
> parameters:
>  * {{raft.server.log.purge.upto.snapshot.index}}
>  ** Disabling this would guarantee that the OM leader will not purge its 
> Ratis log unless all the logs have been replicated to all the followers 
> (through {{{}commitIndex{}}}).
>  ** This would effectively mean that there shouldn't be a case where the 
> slow follower needs to download the full metadata from the leader, so the 
> follower never downloads a snapshot. For small OM metadata, however, it can 
> be faster for a follower to download the leader's metadata snapshot than to 
> replicate and apply the outstanding logs normally.
>  ** For a very slow follower / downed follower, the OM leader cannot purge 
> the log until the follower catches up to it. This might increase the disk 
> space usage for the OM leader.
>  ** Default would be {{true}} to preserve the current OM snapshot behavior
>  * {{raft.server.log.purge.preservation.log.num}}
>  ** RATIS-1626 introduces logic to preserve the latest n won't-be-purged logs
>  ** Setting n > 0 while still enabling 
> {{raft.server.log.purge.upto.snapshot.index}} should strike a balance between 
> the cost of preserving & transferring logs and the cost of transferring a 
> snapshot.
>  ** Default would be 0 to preserve the current OM snapshot behavior
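
The trade-off in the second setting can be illustrated with a small Python model. It assumes that preserving n logs means the newest n entries are never purged; that reading is an assumption based on the description above, not taken from the RATIS-1626 code. All function names and the sample indices are hypothetical.

```python
def purge_index_with_preservation(suggested_index, last_log_index, preservation_log_num):
    """Cap the purge index so the newest `preservation_log_num` entries are kept.

    Assumption: preservation is modelled as 'never purge the last n entries'.
    """
    if preservation_log_num <= 0:
        return suggested_index
    return min(suggested_index, last_log_index - preservation_log_num)


def start_index_after_purge(purge_index):
    """First index still present in the log once everything up to purge_index is gone."""
    return purge_index + 1


# Illustrative values only.
snapshot_index = 1_500_000
last_log_index = 1_600_000
follower_next_index = 1_450_000

outcomes = {}
for n in (0, 200_000):
    purge_index = purge_index_with_preservation(snapshot_index, last_log_index, n)
    start = start_index_after_purge(purge_index)
    # The follower needs a snapshot only if its next entry has been purged.
    outcomes[n] = "install snapshot" if follower_next_index < start else "append entries"

print(outcomes)
```

In this model, n = 0 leaves a follower that is only slightly behind the snapshot with no option but a snapshot install, while n = 200_000 keeps enough of the log to serve it through normal replication, at the cost of more retained log on the leader.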



