[ 
https://issues.apache.org/jira/browse/RATIS-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-1411:
------------------------------
    Component/s: server

> Alleviate slow follower issue
> -----------------------------
>
>                 Key: RATIS-1411
>                 URL: https://issues.apache.org/jira/browse/RATIS-1411
>             Project: Ratis
>          Issue Type: Improvement
>          Components: server
>            Reporter: Sammi Chen
>            Assignee: Sammi Chen
>            Priority: Major
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> There is slow follower issue observed in our stress test.  For example, when 
> intensively writing 1TB data, the leader and one follower next_index is 
> 100w+, the slow follower next_index is 50w+.  The gap is huge.  Which will 
> cause a lot of WatchForCommit timeout exception. 
> After rerun the test and do the investigation,  the Ozone 
> stateMachineDataCache is the key point.  With stateMachineDataCache set to 
> 1024 or more, as long as majority(leader and one follower) have committed the 
> write request index, write request data is removed from 
> stateMachineDataCache. Leader has to fetch that chunk of data from on-disk 
> chunk file when grpcLogAppender of the second follower want to send that 
> write request out. 
> The time cost of reading from chunk file is much more expensive than reading 
> from chunk file.  Once one follower cannot get the data from 
> stateMachineDataCache, it will never catch up with,  till the write finishes. 
> I tried using Guava Cache to replace the 
> ResourceLimitCache(stateMachineDataCache). It doesn't make an obvious 
> difference since the Cache size is limited.  As long as the follower 
> next_index request be evicted out of the cache,  the follower start to become 
> more and more slower. 
> Then I tried using the   PriorityBlockingList to replace the 
> LinkedBlockingDeque in chunkExecutors, to put the readStatemachine task ahead 
> of other block's write task,  execute the task by entryIndex order.  Although 
> the readStatemachine will get the priority to execute first, but since there 
> are so many readStatemachines tasks, the overall effect is less than 
> expected. 
> So the key point to resolve the slow follwer is to make sure that all its' 
> data stay in the cache as long as possbile. 
> My solution is set a threshold between the majority commited index and slow 
> follwer's commited index to guarantee the data in cache.  I use 0.75 as the 
> ratio in my test. The effect is very well.  I write 2TB data with a 3 DN 
> cluster, each with 10 HDD. The task finisehd in 40mins without any 
> watchForCommit timeout. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to