[
https://issues.apache.org/jira/browse/RATIS-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tsz-wo Sze updated RATIS-1411:
------------------------------
Component/s: server
> Alleviate slow follower issue
> -----------------------------
>
> Key: RATIS-1411
> URL: https://issues.apache.org/jira/browse/RATIS-1411
> Project: Ratis
> Issue Type: Improvement
> Components: server
> Reporter: Sammi Chen
> Assignee: Sammi Chen
> Priority: Major
> Time Spent: 2.5h
> Remaining Estimate: 0h
>
> A slow follower issue was observed in our stress test. For example, when
> intensively writing 1TB of data, the next_index of the leader and of one
> follower is 1,000,000+ (100w+), while the next_index of the slow follower is
> only 500,000+ (50w+). This huge gap causes a lot of WatchForCommit timeout
> exceptions.
> After rerunning the test and investigating, the Ozone
> stateMachineDataCache turned out to be the key point. With
> stateMachineDataCache set to 1024 or more, as soon as a majority (the leader
> and one follower) has committed the write request index, the write request
> data is removed from stateMachineDataCache. The leader then has to fetch that
> chunk of data from the on-disk chunk file when the grpcLogAppender of the
> second follower wants to send that write request out.
> Reading from the chunk file is much more expensive than reading from the
> cache. Once a follower can no longer get the data from
> stateMachineDataCache, it never catches up until the write finishes.
> I tried using Guava Cache to replace the
> ResourceLimitCache (stateMachineDataCache). It doesn't make an obvious
> difference since the cache size is limited. As soon as the entry for the
> follower's next_index is evicted from the cache, the follower becomes
> slower and slower.
> Then I tried using the PriorityBlockingList to replace the
> LinkedBlockingDeque in chunkExecutors, to put the readStatemachine tasks
> ahead of the write tasks of other blocks and to execute the tasks in
> entryIndex order. Although the readStatemachine tasks get priority and
> execute first, there are so many of them that the overall effect is less
> than expected.
> So the key to resolving the slow follower issue is to make sure that all of
> its data stays in the cache as long as possible.
> My solution is to set a threshold on the gap between the majority committed
> index and the slow follower's committed index, to guarantee that the data is
> still in the cache. I used 0.75 as the ratio in my test. It works very well:
> I wrote 2TB of data to a 3 DN cluster, each DN with 10 HDDs. The task
> finished in 40 minutes without any watchForCommit timeout.
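
The threshold idea described above could be sketched roughly as follows. This
is only an illustration under stated assumptions: the class FollowerGapGate,
its methods, and the choice to apply the ratio to the cache capacity (counted
in entries) are hypothetical and are not taken from the Ratis or Ozone code.
The description does not say what the 0.75 ratio is measured against.

```java
/**
 * Hypothetical helper, not part of Ratis or Ozone.
 * Decides whether the leader should hold back new writes because the
 * slowest follower is about to fall outside the cached range of entries.
 */
final class FollowerGapGate {
    // Largest tolerated distance (in log entries) between the majority
    // commit index and the slowest follower's index.
    private final long maxGap;

    /**
     * @param cacheCapacity assumed number of entries the cache can hold
     * @param ratio         fraction of the capacity allowed as a gap, e.g. 0.75
     */
    FollowerGapGate(long cacheCapacity, double ratio) {
        if (cacheCapacity <= 0 || ratio <= 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException(
                "cacheCapacity must be > 0 and ratio must be in (0, 1]");
        }
        this.maxGap = (long) (cacheCapacity * ratio);
    }

    long maxGap() {
        return maxGap;
    }

    /** Returns true when the gap exceeds the threshold. */
    boolean shouldThrottle(long majorityCommitIndex, long slowestFollowerIndex) {
        return majorityCommitIndex - slowestFollowerIndex > maxGap;
    }
}
```

Under these assumptions, a cache of 1024 entries with a ratio of 0.75 gives a
maximum gap of 768 entries. A leader using such a check would delay new writes
once the slow follower lags by more than that, so the entries the follower
still needs are less likely to have been evicted.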
--
This message was sent by Atlassian Jira
(v8.3.4#803005)