[ https://issues.apache.org/jira/browse/RATIS-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sammi Chen updated RATIS-1411:
------------------------------
    Description: 
We observed a slow follower issue in our stress test. For example, when
intensively writing 1TB of data, the next_index of the leader and of one
follower is over 1,000,000 (100w+), while the next_index of the slow follower
is only over 500,000 (50w+). This huge gap causes many WatchForCommit timeout
exceptions.

After rerunning the test and investigating, I found that the Ozone
stateMachineDataCache is the key point. With stateMachineDataCache set to 1024
or more, as soon as a majority (the leader and one follower) has committed the
index of a write request, the write request data is removed from
stateMachineDataCache. The leader then has to fetch that chunk of data from the
on-disk chunk file when the grpcLogAppender of the second follower wants to
send that write request out.

Reading from the chunk file is much more expensive than reading from the
cache. Once a follower can no longer get the data from stateMachineDataCache,
it never catches up until the write finishes.
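
The eviction behaviour described above can be illustrated with a toy model.
This is only a sketch under stated assumptions: MajorityEvictingCache and its
methods are hypothetical names, not Ozone or Ratis classes, and the "disk read"
is simulated with a counter.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model (hypothetical): entries are dropped once a majority has committed them. */
final class MajorityEvictingCache {
  private final Map<Long, String> cache = new HashMap<>();
  private int diskReads; // number of simulated fallbacks to the chunk file

  void put(long index, String data) {
    cache.put(index, data);
  }

  /** Drops every entry whose index is at or below the majority committed index. */
  void onMajorityCommit(long majorityCommittedIndex) {
    cache.keySet().removeIf(index -> index <= majorityCommittedIndex);
  }

  /** Returns cached data if present; otherwise counts a simulated disk read. */
  String read(long index) {
    String cached = cache.get(index);
    if (cached != null) {
      return cached;
    }
    diskReads++;
    return "disk:" + index;
  }

  int diskReads() {
    return diskReads;
  }

  public static void main(String[] args) {
    MajorityEvictingCache c = new MajorityEvictingCache();
    for (long i = 1; i <= 3; i++) {
      c.put(i, "data" + i);
    }
    c.onMajorityCommit(2);         // leader and fast follower have committed up to 2
    System.out.println(c.read(1)); // the slow follower still needs index 1
    System.out.println(c.diskReads());
  }
}
```

In this model, any follower whose next index is at or below the majority
committed index always takes the slow path.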

I tried replacing the ResourceLimitCache (stateMachineDataCache) with a Guava
Cache. It does not make an obvious difference, since the cache size is still
limited. As soon as the request at the follower's next_index is evicted from
the cache, the follower becomes slower and slower.

Then I tried replacing the LinkedBlockingDeque in chunkExecutors with a
PriorityBlockingList, to put readStatemachine tasks ahead of the write tasks of
other blocks and to execute tasks in entryIndex order. Although the
readStatemachine tasks get priority and execute first, there are so many of
them that the overall effect is less than expected.
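
The ordering described above (reads first, then by entryIndex) could look
roughly like the sketch below. It uses the JDK's
java.util.concurrent.PriorityBlockingQueue as a stand-in for the queue named in
this description. ChunkTaskOrdering, ChunkTask and Kind are hypothetical names
for illustration, not the actual Ozone task types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;

final class ChunkTaskOrdering {
  /** Hypothetical task kinds; declaration order puts reads before writes. */
  enum Kind { READ_STATE_MACHINE, WRITE_CHUNK }

  /** Hypothetical task descriptor ordered by kind, then by log entry index. */
  static final class ChunkTask implements Comparable<ChunkTask> {
    final Kind kind;
    final long entryIndex;

    ChunkTask(Kind kind, long entryIndex) {
      this.kind = kind;
      this.entryIndex = entryIndex;
    }

    @Override
    public int compareTo(ChunkTask other) {
      int byKind = kind.compareTo(other.kind); // enum ordinal: reads sort first
      return byKind != 0 ? byKind : Long.compare(entryIndex, other.entryIndex);
    }

    @Override
    public String toString() {
      return kind + "@" + entryIndex;
    }
  }

  /** Returns the order in which a priority queue would hand out the tasks. */
  static List<String> drainOrder(List<ChunkTask> submitted) {
    PriorityBlockingQueue<ChunkTask> queue = new PriorityBlockingQueue<>(submitted);
    List<String> order = new ArrayList<>();
    for (ChunkTask t = queue.poll(); t != null; t = queue.poll()) {
      order.add(t.toString());
    }
    return order;
  }

  public static void main(String[] args) {
    System.out.println(drainOrder(Arrays.asList(
        new ChunkTask(Kind.WRITE_CHUNK, 1),
        new ChunkTask(Kind.READ_STATE_MACHINE, 7),
        new ChunkTask(Kind.WRITE_CHUNK, 2),
        new ChunkTask(Kind.READ_STATE_MACHINE, 5))));
  }
}
```

If such a queue backs a ThreadPoolExecutor, the tasks need to be Comparable
Runnables passed through execute(), because submit() wraps them in a FutureTask
that is not Comparable.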

So the key to resolving the slow follower issue is to make sure that all of its
data stays in the cache as long as possible.

My solution is to set a threshold on the gap between the majority committed
index and the slow follower's committed index, to guarantee that the data stays
in the cache. I use 0.75 as the ratio in my test. It works very well: I wrote
2TB of data to a 3-DN cluster, each DN with 10 HDDs, and the task finished in
40 minutes without any watchForCommit timeout.
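
One possible reading of the threshold idea is sketched below. The description
does not say what the 0.75 ratio is relative to, nor how the leader reacts
when the gap is reached. This sketch assumes the ratio is applied to the cache
capacity and that the committed index is capped so the gap never exceeds the
threshold. FollowerGapPolicy and its methods are hypothetical names, not the
actual patch.

```java
/** Hypothetical policy: bound the gap between majority and slow-follower commit index. */
final class FollowerGapPolicy {
  private final long maxGap;

  /** Assumption: the threshold is ratio * cacheCapacity, e.g. 0.75 * 1024 = 768. */
  FollowerGapPolicy(int cacheCapacity, double ratio) {
    if (cacheCapacity <= 0 || ratio <= 0 || ratio > 1) {
      throw new IllegalArgumentException("need cacheCapacity > 0 and 0 < ratio <= 1");
    }
    this.maxGap = (long) (cacheCapacity * ratio);
  }

  long maxGap() {
    return maxGap;
  }

  /** True if the slow follower lags the majority by no more than the threshold. */
  boolean withinThreshold(long majorityCommitIndex, long slowFollowerCommitIndex) {
    return majorityCommitIndex - slowFollowerCommitIndex <= maxGap;
  }

  /** Commit index capped so the slow follower never lags by more than maxGap. */
  long cappedCommitIndex(long majorityCommitIndex, long slowFollowerCommitIndex) {
    return Math.min(majorityCommitIndex, slowFollowerCommitIndex + maxGap);
  }

  public static void main(String[] args) {
    FollowerGapPolicy policy = new FollowerGapPolicy(1024, 0.75);
    System.out.println(policy.maxGap());
    System.out.println(policy.cappedCommitIndex(1_000_000L, 500_000L));
  }
}
```

Holding the committed index back like this trades some write latency for cache
hits on the slow follower's reads. Whether that trade-off matches the tested
change is an assumption of this sketch.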






> Alleviate slow follower issue
> -----------------------------
>
>                 Key: RATIS-1411
>                 URL: https://issues.apache.org/jira/browse/RATIS-1411
>             Project: Ratis
>          Issue Type: Improvement
>            Reporter: Sammi Chen
>            Assignee: Sammi Chen
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
