[
https://issues.apache.org/jira/browse/RATIS-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sammi Chen updated RATIS-1411:
------------------------------
Description:
There is a slow follower issue observed in our stress test. For example, when
intensively writing 1TB of data, the next_index of the leader and one follower
reaches 1,000,000+, while the slow follower's next_index is only around 500,000.
The gap is huge, and it causes a lot of WatchForCommit timeout exceptions.
After rerunning the test and investigating, the Ozone stateMachineDataCache
turned out to be the key factor. With the stateMachineDataCache size set to 1024
or more, as soon as a majority (the leader and one follower) has committed a
write request index, that write request's data is removed from the
stateMachineDataCache. The leader then has to fetch that chunk of data from the
on-disk chunk file when the grpcLogAppender of the second follower wants to send
that write request out.
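The eviction behavior described above can be sketched as follows. This is an illustrative stand-in, not the actual Ratis/Ozone code; the class and method names are assumptions made up for the example:

```java
import java.util.concurrent.ConcurrentSkipListMap;

/**
 * Hypothetical sketch of the current behavior: once a majority commits up to
 * an index, all cached entries at or below it are dropped, even though a slow
 * follower may still need them. Names here are illustrative only.
 */
public class MajorityEvictCache {
    private final ConcurrentSkipListMap<Long, byte[]> cache = new ConcurrentSkipListMap<>();

    void put(long index, byte[] data) {
        cache.put(index, data);
    }

    /** Called when the majority commit index advances. */
    void onMajorityCommit(long majorityIndex) {
        // headMap(inclusive = true) covers every entry with index <= majorityIndex
        cache.headMap(majorityIndex, true).clear();
    }

    /** Returns null once evicted, forcing a read from the on-disk chunk file. */
    byte[] get(long index) {
        return cache.get(index);
    }
}
```

Once `get` returns null for the slow follower's next_index, every subsequent append for that follower pays the chunk-file read cost.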
The time cost of reading from the on-disk chunk file is much higher than
reading from the stateMachineDataCache. Once a follower cannot get its data
from the stateMachineDataCache, it never catches up until the write finishes.
I tried using a Guava Cache to replace the
ResourceLimitCache (stateMachineDataCache). It doesn't make an obvious
difference, since the cache size is still limited: as soon as the entry at the
follower's next_index is evicted from the cache, the follower becomes slower
and slower.
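The reason a size-limited cache cannot help can be shown with a minimal stand-in. Rather than pulling in Guava, this sketch uses a capacity-bounded `LinkedHashMap` from the JDK; the class name is made up for the example, but the eviction effect is the same as any `maximumSize`-style cache:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch of a size-limited data cache: a capacity-bounded map that evicts the
 * eldest (lowest-index) entry on overflow. Once the index a slow follower
 * needs next has been evicted, every append for it falls back to disk.
 */
public class BoundedDataCache extends LinkedHashMap<Long, byte[]> {
    private final int capacity;

    public BoundedDataCache(int capacity) {
        super(16, 0.75f, false);  // insertion order, like appended log entries
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
        return size() > capacity;  // drop the oldest index once over capacity
    }
}
```

With fast writers filling the cache, the oldest indexes (exactly the ones the slow follower still needs) are always the first to go.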
Then I tried using a PriorityBlockingQueue to replace the
LinkedBlockingDeque in chunkExecutors, to put readStateMachineData tasks ahead
of other blocks' write tasks and execute tasks in entryIndex order. Although
the readStateMachineData tasks do get priority, there are so many of them that
the overall effect is less than expected.
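The queue-ordering experiment above can be sketched like this. The task class and its fields are invented for illustration (the real chunkExecutor tasks in Ozone look different); the point is only how a PriorityBlockingQueue orders reads before writes and both by entryIndex:

```java
import java.util.concurrent.PriorityBlockingQueue;

/** Illustrative sketch, not the actual Ozone chunkExecutor task classes. */
public class ChunkTaskOrdering {
    static class IndexedTask implements Runnable, Comparable<IndexedTask> {
        final long entryIndex;
        final boolean isRead;  // readStateMachineData tasks jump the queue

        IndexedTask(long entryIndex, boolean isRead) {
            this.entryIndex = entryIndex;
            this.isRead = isRead;
        }

        @Override public void run() { /* read or write the chunk here */ }

        @Override public int compareTo(IndexedTask o) {
            if (isRead != o.isRead) {
                return isRead ? -1 : 1;             // reads before writes
            }
            return Long.compare(entryIndex, o.entryIndex);  // then entryIndex order
        }
    }

    public static void main(String[] args) throws InterruptedException {
        PriorityBlockingQueue<IndexedTask> queue = new PriorityBlockingQueue<>();
        queue.add(new IndexedTask(7, false));  // a write
        queue.add(new IndexedTask(9, true));   // a read, enqueued later
        queue.add(new IndexedTask(3, false));  // another write
        // The read task is taken first despite its higher entryIndex:
        System.out.println(queue.take().isRead);      // true
        System.out.println(queue.take().entryIndex);  // 3
        System.out.println(queue.take().entryIndex);  // 7
    }
}
```

Note that if such tasks are fed to a ThreadPoolExecutor, they must be passed via execute() rather than submit(), since submit() wraps them in a FutureTask that is not Comparable.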
So the key point in resolving the slow follower issue is to make sure that all
of its data stays in the cache as long as possible.
My solution is to set a threshold between the majority committed index and the
slow follower's committed index, to guarantee the data stays in the cache. I
used 0.75 as the ratio in my test, and the effect is very good: I wrote 2TB of
data to a 3-DataNode cluster, each node with 10 HDDs, and the task finished in
40 minutes without any WatchForCommit timeout.
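One possible reading of the threshold rule is sketched below. This is my interpretation of the description, not the patch itself; in particular, treating the ratio as "ratio x cache capacity = maximum lag tolerated before eviction resumes" is an assumption:

```java
/**
 * Hypothetical sketch of the proposed retention rule (an interpretation, not
 * the actual patch): cached entries are retained for the slow follower unless
 * it lags more than ratio * cacheCapacity entries behind the majority.
 */
public class ThresholdEvictPolicy {
    private final int cacheCapacity;
    private final double ratio;  // 0.75 in the test described above

    public ThresholdEvictPolicy(int cacheCapacity, double ratio) {
        this.cacheCapacity = cacheCapacity;
        this.ratio = ratio;
    }

    /** Highest log index whose cached data may safely be dropped. */
    public long evictUpTo(long majorityCommitIndex, long slowFollowerIndex) {
        long maxLag = (long) (ratio * cacheCapacity);
        // Keep data for the slow follower while its lag stays under the
        // threshold; once it falls too far behind, eviction resumes so the
        // cache cannot grow without bound.
        return Math.max(slowFollowerIndex, majorityCommitIndex - maxLag);
    }
}
```

For a cache of 1024 entries and ratio 0.75, a follower at index 900 with the majority at 1000 loses nothing it still needs; a follower at index 100 with the majority at 2000 has fallen past the threshold, so eviction proceeds up to index 1232.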
> Alleviate slow follower issue
> -----------------------------
>
> Key: RATIS-1411
> URL: https://issues.apache.org/jira/browse/RATIS-1411
> Project: Ratis
> Issue Type: Improvement
> Reporter: Sammi Chen
> Assignee: Sammi Chen
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)