[ https://issues.apache.org/jira/browse/KAFKA-12225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Beard updated KAFKA-12225:
--------------------------------
    Priority: Major  (was: Minor)

> Unexpected broker bottleneck when scaling producers
> ---------------------------------------------------
>
>                 Key: KAFKA-12225
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12225
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>         Environment: AWS Based
> 5-node cluster running on k8s with EBS attached disks (HDD)
> Kafka Version 2.5.0
> Multiple Producers (KafkaStreams, Akka Streams, golang Sarama)
>            Reporter: Harel Ben Attia
>            Priority: Major
>
> *TLDR*: There appears to be major lock contention on *{{Log.lock}}* during producer scaling when produce-request sending is time-based ({{linger.ms}}) rather than data-size based (max batch size).
>
> Hi,
>
> We're running a 5-node Kafka cluster on one of our production systems on AWS. Recently we have started to notice that as our producer services scale out, the brokers' idle percentage drops abruptly from ~70% to 0% on all brokers, even though none of the brokers' physical resources are exhausted.
>
> Initially we realised that our {{io.thread}} count was too low, causing high request queuing and the low idle percentage, so we increased it, expecting to see one of the physical resources max out. After the change we still saw abrupt drops of the idle percentage to 0% (with no physical resource maxing out), so we continued to investigate.
>
> The investigation showed a direct relation to {{linger.ms}} being the controlling factor for sending out produce requests. Whenever messages are sent out by the producer because the {{linger.ms}} threshold is reached, scaling out the service increases the number of produce requests in a way that is not proportional to the traffic increase, bringing all the brokers to a near-halt in terms of their ability to process requests and, as mentioned, without exhausting any physical resource.
>
> After some more experiments and profiling a broker with Java Flight Recorder, we found that the cause of the issue is lock contention on a *{{java.lang.Object}}*, wasting a lot of time on all the {{data-plane-kafka-request-handler}} threads. 90% of the locks were on Log's *{{lock: Object}}* instance, inside the *{{Log.append()}}* method, and the stack traces show that these locks are taken during the {{handleProduceRequest}} method. We have ruled out replication as the source of the issue, as there were no replication problems and the control plane has a separate thread pool, so this focused us back on the actual producers and the behaviour of our producer service when scaling out.
>
> At that point we thought the issue might be related to the number of partitions of the topic (currently 60), and that increasing it would reduce the lock contention on each {{Log}} instance. However, since each producer writes to all partitions (data is evenly spread and not skewed), increasing the number of partitions would only cause each producer to generate more produce requests, without alleviating the lock contention. Likewise, adding brokers would increase the idle percentage per broker but would not reduce produce-request latency, since it would not change the rate of produce requests per {{Log}}.
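
For illustration, a rough sketch of the two producer configurations involved (the bootstrap address, serializers, and concrete values here are illustrative assumptions, not the actual service configuration). The first block corresponds to the linger-controlled behaviour described above; the commented-out values correspond to the size-controlled workaround described below:

{code:java}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerBatchingSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // linger.ms as the controlling factor: every producer instance flushes its
        // (mostly small) batches roughly every 5 ms regardless of throughput, so the
        // per-broker produce-request rate grows with the number of producer instances
        // rather than with the traffic volume.
        props.put(ProducerConfig.LINGER_MS_CONFIG, "5");
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "16384"); // default 16 KB

        // Workaround direction: raise linger.ms far enough that batch.size becomes the
        // controlling factor, so the request rate is coupled to the data volume instead.
        // props.put(ProducerConfig.LINGER_MS_CONFIG, "200");
        // props.put(ProducerConfig.BATCH_SIZE_CONFIG, "131072"); // 128 KB

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // produce as usual; only the batching behaviour configured above differs
        }
    }
}
{code}
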
> Eventually, we worked around the issue by making the {{linger.ms}} value high enough that it stopped being the controlling factor for sending messages (i.e. produce requests became coupled to the size of the traffic, because the max batch size became the controlling factor). This allowed us to utilise the cluster better without upscaling it.
>
> From our analysis, it seems that this lock behaviour limits Kafka's robustness to producer configuration and scaling, and hurts the ability to do efficient capacity planning for the cluster, increasing the risk of an unexpected bottleneck when traffic grows.
>
> It would be great if you could validate these conclusions, or provide any further information that would help us understand the issue better or work around it in a more efficient way.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)