[DISCUSS] [KIP worthy?] Anyone with large clusters facing Produce Response Time degradation as in KAFKA-10690?

マテュアルン Wed, 09 Mar 2022 07:17:43 -0800

Hello,
This is Arun from LINE Corporation.
We have a cluster with a large number of brokers (200+), so node failures are 
bound to happen relatively often. When a machine recovers, or when the replicas 
on a failed node are reassigned, a large number of lagging replicas often have 
to catch up. Multiple replicas (re-)assigned to a target broker can start 
fetching from the same source broker, the one holding the leader replicas. 
This occasionally leads to Produce Response Time degradation, as illustrated in 
https://issues.apache.org/jira/browse/KAFKA-10690 .

I wanted to check whether anyone else is facing this, and whether a solution 
would merit a KIP.

With Regards
マテュアルン Mathew Arun
LINE Corporation
