[ 
https://issues.apache.org/jira/browse/KAFKA-12225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Beard updated KAFKA-12225:
--------------------------------
    Priority: Major  (was: Minor)

> Unexpected broker bottleneck when scaling producers
> ---------------------------------------------------
>
>                 Key: KAFKA-12225
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12225
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>         Environment: AWS Based
> 5-node cluster running on k8s with EBS attached disks (HDD)
> Kafka Version 2.5.0
> Multiple Producers (KafkaStreams, Akka Streams, golang Sarama)
>            Reporter: Harel Ben Attia
>            Priority: Major
>
>  
> *TLDR*: There seems to be major lock contention that can occur on 
> *{{Log.lock}}* during producer scaling when produce-request sending is 
> time-based ({{linger.ms}}) rather than data-size based (max batch size).
>
> Hi,
>
> We're running a 5-node Kafka cluster on one of our production systems on AWS. 
> Recently, we have noticed that as our producer services scale out, the Kafka 
> idle percentage drops abruptly from ~70% to 0% on all brokers, even though 
> none of the brokers' physical resources are exhausted.
>
> Initially, we realised that our {{num.io.threads}} count was too low, causing 
> high request queuing and the low idle percentage, so we increased it, hoping 
> to see one of the physical resources max out. After the change we still saw 
> abrupt drops of the idle percentage to 0% (with no physical resource maxing 
> out), so we continued to investigate.
>
> The investigation showed a direct relation to {{linger.ms}} being the 
> controlling factor for sending out produce requests. When messages are sent 
> from the producer because the {{linger.ms}} threshold expires, scaling out 
> the service increases the number of produce requests disproportionately to 
> the increase in traffic, bringing all the brokers to a near-halt in terms of 
> request processing and, as mentioned, without exhausting any physical 
> resource.
>
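> To make "controlling factor" concrete: a producer batch is flushed either 
> when it fills up ({{batch.size}}) or when {{linger.ms}} expires, whichever 
> comes first. When per-producer throughput is low, the timer always wins, so 
> each producer's request rate is fixed by {{linger.ms}} regardless of how much 
> data each request carries. A simplified sketch (not the actual Kafka Sender 
> code, and the numbers are illustrative):
> {code:java}
> public class BatchTriggerSketch {
>
>     /** True when the accumulated batch should be sent out. */
>     static boolean shouldSend(int batchBytes, int batchSizeBytes,
>                               long waitedMs, long lingerMs) {
>         boolean full = batchBytes >= batchSizeBytes;   // size-based trigger
>         boolean expired = waitedMs >= lingerMs;        // time-based trigger
>         return full || expired;
>     }
>
>     public static void main(String[] args) {
>         // Illustrative numbers: a producer that accumulates only 4 KB within a
>         // 5 ms linger against a 16 KB batch.size always hits the time-based
>         // trigger, so its request rate is independent of its data volume.
>         System.out.println(shouldSend(4 * 1024, 16 * 1024, 5, 5));   // true  (linger expired)
>         System.out.println(shouldSend(4 * 1024, 16 * 1024, 2, 5));   // false (neither trigger)
>     }
> }
> {code}
>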
> After some more experiments and profiling a broker through flight recorder, 
> we found that the cause of the issue is lock contention on a 
> *{{java.lang.Object}}*, wasting a lot of time on all the 
> {{data-plane-kafka-request-handler}} threads. 90% of the locks were on Log's 
> *{{lock: Object}}* instance, inside the *{{Log.append()}}* method. The stack 
> traces show that these locks occur during the {{handleProduceRequest}} 
> method. We ruled out replication as the source of the issue, as there were no 
> replication issues and the control plane has a separate thread pool, so this 
> focused us back on the actual producers and the behaviour of our producer 
> service when scaling out.
>
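> For clarity, the contention pattern we believe we are hitting looks roughly 
> like the simplified sketch below (illustrative only, not the actual 
> {{kafka.log.Log}} implementation): every handler thread appending to the same 
> partition serialises on that partition's single lock object inside append().
> {code:java}
> public class LogLockSketch {
>     private final Object lock = new Object();  // analogous to Log's "lock: Object"
>     private long nextOffset = 0;
>
>     long append(byte[] recordBatch) {
>         synchronized (lock) {            // all data-plane handler threads queue here
>             long baseOffset = nextOffset;
>             nextOffset += 1;             // assign offsets, write to the active segment, etc.
>             return baseOffset;
>         }
>     }
>
>     public static void main(String[] args) {
>         LogLockSketch partitionLog = new LogLockSketch();
>         System.out.println("appended at offset " + partitionLog.append(new byte[]{1, 2, 3}));
>     }
> }
> {code}
>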
> At that point we thought that the issue might be related to the number of 
> partitions of the topic (60 currently), and that increasing it would reduce 
> the lock contention on each {{Log}} instance. However, since each producer 
> writes to all partitions (data is evenly spread and not skewed), increasing 
> the number of partitions would only cause each producer to generate more 
> produce requests, without alleviating the lock contention. Also, increasing 
> the number of brokers would increase the idle percentage per broker, but 
> would not help reduce the produce-request latency, since it would not change 
> the rate of produce requests per Log.
>
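> As a back-of-the-envelope illustration (all numbers below are assumptions, 
> not measurements from our cluster): if each producer flushes to every broker 
> roughly once per {{linger.ms}} interval, the per-broker produce-request rate 
> scales with the producer count rather than with the traffic volume.
> {code:java}
> public class RequestRateSketch {
>     public static void main(String[] args) {
>         long lingerMs = 5;   // assumed producer linger.ms (illustrative value)
>
>         for (int producers : new int[]{20, 60}) {
>             // With time-based sending, each producer flushes to every broker it
>             // has data for roughly once per linger interval, regardless of how
>             // full the batches are.
>             double requestsPerSecPerBroker = producers * (1000.0 / lingerMs);
>             System.out.printf("producers=%d -> ~%.0f produce requests/sec per broker%n",
>                     producers, requestsPerSecPerBroker);
>         }
>         // Tripling the producer count triples the per-broker request rate (and
>         // the contention on each Log's lock), even if total traffic barely grows.
>     }
> }
> {code}
>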
> Eventually, we worked around the issue by raising the {{linger.ms}} value 
> high enough that it stopped being the controlling factor for sending messages 
> (i.e. produce requests became coupled to the traffic volume, because the max 
> batch size became the controlling factor). This allowed us to utilise the 
> cluster better without upscaling it.
>
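> Concretely, the change was along these lines (a sketch with illustrative 
> values, not our exact production settings):
> {code:java}
> import java.util.Properties;
>
> public class WorkaroundConfigSketch {
>     public static void main(String[] args) {
>         Properties props = new Properties();
>         props.put("bootstrap.servers", "broker-1:9092");  // placeholder address
>         props.put("batch.size", "65536");                 // size-based trigger (64 KB)
>         // Before: linger.ms=5   -> the timer fired before batches filled up.
>         // After:  linger.ms=500 -> batches usually fill up first, so the number
>         // of produce requests tracks the data volume, not the producer count.
>         props.put("linger.ms", "500");
>         System.out.println(props);
>     }
> }
> {code}
>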
> From our analysis, it seems that this lock behaviour limits Kafka's 
> robustness to producer configuration and scaling, and hurts the ability to do 
> efficient capacity planning for the cluster, increasing the risk of an 
> unexpected bottleneck when traffic grows.
>
> It would be great if you could validate these conclusions, or provide any 
> further information that will help us understand the issue better or work 
> around it in a more efficient way.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
