[ 
https://issues.apache.org/jira/browse/KAFKA-10690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503785#comment-17503785
 ] 

Jun Rao commented on KAFKA-10690:
---------------------------------

[~ocadaruma] : Thanks for filing the jira. Have you tried enabling replication 
throttling? This will help prevent the out-of-sync replicas from pulling data 
too aggressively. 
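
A rough way to reason about a throttle value, as a sketch only: a throttled follower catches up only if the throttle rate is higher than the rate at which new data is produced to the leader. The function name, the 500 GB backlog, and the rates below are hypothetical and are not taken from this issue. In a real cluster the limit would typically be applied through the `--throttle` option of kafka-reassign-partitions.sh or the `leader.replication.throttled.rate` / `follower.replication.throttled.rate` broker configs.

```python
def catch_up_seconds(backlog_bytes, throttle_bytes_per_sec, inbound_bytes_per_sec):
    """Estimate how long a throttled follower needs to catch up.

    Returns None when the throttle does not exceed the inbound produce
    rate, because the follower would then never close the gap.
    """
    headroom = throttle_bytes_per_sec - inbound_bytes_per_sec
    if headroom <= 0:
        return None
    return backlog_bytes / headroom


MB = 1_000_000
backlog = 500_000 * MB  # hypothetical: 500 GB still to replicate

# Hypothetical rates: 50 MB/s throttle against 10 MB/s of incoming produce traffic.
print(catch_up_seconds(backlog, 50 * MB, 10 * MB))  # 12500.0 seconds
# A throttle equal to the inbound rate never catches up.
print(catch_up_seconds(backlog, 10 * MB, 10 * MB))  # None
```

This ignores that the throttle is shared by all throttled replicas on a broker, so it is a lower bound on the catch-up time rather than a prediction.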

> Produce-response delay caused by lagging replica fetch which affects in-sync 
> one
> --------------------------------------------------------------------------------
>
>                 Key: KAFKA-10690
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10690
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 2.4.1
>            Reporter: Haruki Okada
>            Priority: Major
>         Attachments: image-2020-11-06-11-15-21-781.png, 
> image-2020-11-06-11-15-38-390.png, image-2020-11-06-11-17-09-910.png
>
>
> h2. Our environment
>  * Kafka version: 2.4.1
> h2. Phenomenon
>  * The 99th percentile produce response time (remote scope) degraded to 500ms, 
> which is 20 times worse than usual
>  ** At the time, the cluster was running a replica reassignment to bring a new 
> machine into service, recovering replicas that had been held by a broker 
> machine that failed (hardware issue)
> !image-2020-11-06-11-15-21-781.png|width=292,height=166!
> h2. Analysis
> Let's say
>  * broker-X: The broker on which we observed the produce latency degradation
>  * broker-Y: The broker being brought into service
> broker-Y was catching up on replicas of these partitions:
>  * partition-A: has a relatively small log size
>  * partition-B: has a large log size
> (Actually, broker-Y was catching up on many other partitions. I mention only 
> two partitions here to keep the explanation simple.)
> broker-X was the leader for both partition-A and partition-B.
> We found that both partition-A and partition-B were assigned to the same 
> ReplicaFetcherThread of broker-Y, and produce latency started to degrade 
> right after broker-Y finished catching up partition-A.
> !image-2020-11-06-11-17-09-910.png|width=476,height=174!
> We also observed disk reads on broker-X during the service-in. (This is 
> expected, since old segments are unlikely to be in the page cache.)
> !image-2020-11-06-11-15-38-390.png|width=292,height=193!
> So we suspected that:
>  * The in-sync replica fetch (partition-A) was held up by the lagging replica 
> fetch (partition-B), which is expected to be slow because it causes actual 
> disk reads
>  ** Since ReplicaFetcherThread sends fetch requests in a blocking manner, the 
> next fetch request can't be sent until the current fetch request completes
>  ** => This delays the in-sync replica fetch for partitions assigned to the 
> same replica fetcher thread
>  ** => This degrades remote scope produce latency
> h2. Possible fix
> We think this issue can be addressed by dedicating some of the 
> ReplicaFetcherThreads (or creating another thread pool) to lagging replica 
> catch-up, but we are not sure this is the appropriate way.
> Please share your opinions on this issue.
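
The blocking effect suspected in the quoted analysis can be illustrated with a toy model. This is a sketch under stated assumptions, not Kafka's actual fetcher code: each fetcher thread issues one blocking fetch covering all of its partitions, and the response is ready only after the leader has read every partition. The class and function names and the read costs (2 ms for a page-cache hit, 400 ms for a disk read) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionRead:
    name: str
    read_ms: int  # hypothetical leader-side read cost for one fetch


def fetch_round_ms(partitions):
    """One blocking fetch: the response is ready only after every partition is read."""
    return sum(p.read_ms for p in partitions)


def replication_delay_ms(fetchers, partition_name):
    """Delay seen by a partition: the round time of the fetcher thread that owns it."""
    for partitions in fetchers:
        if any(p.name == partition_name for p in partitions):
            return fetch_round_ms(partitions)
    raise KeyError(partition_name)


partition_a = PartitionRead("partition-A", read_ms=2)    # in sync, served from page cache
partition_b = PartitionRead("partition-B", read_ms=400)  # lagging, needs disk reads

shared = [[partition_a, partition_b]]       # both partitions on one fetcher thread
dedicated = [[partition_a], [partition_b]]  # lagging partition on its own fetcher thread

print(replication_delay_ms(shared, "partition-A"))     # 402
print(replication_delay_ms(dedicated, "partition-A"))  # 2
```

In this model, moving the lagging partition to its own fetcher removes the slow read from partition-A's fetch round, which is the intuition behind the possible fix described in the issue.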



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
