Harel Ben Attia created KAFKA-12225:
---------------------------------------

             Summary: Unexpected broker bottleneck when scaling producers
                 Key: KAFKA-12225
                 URL: https://issues.apache.org/jira/browse/KAFKA-12225
             Project: Kafka
          Issue Type: Improvement
          Components: core
         Environment: AWS Based
5-node cluster running on k8s with EBS attached disks (HDD)
Kafka Version 2.5.0
Multiple Producers (KafkaStreams, Akka Streams, golang Sarama)
            Reporter: Harel Ben Attia

*TLDR*: There seems to be major lock contention on *{{Log.lock}}* that can occur 
during producer scaling when produce-request sending is time-based 
({{linger.ms}}) rather than data-size based (max batch size).

Hi,

We're running a 5-node Kafka cluster on one of our production systems on AWS. 
Recently, we have started to notice that as our producer services scale out, the 
idle percentage drops abruptly from ~70% to 0% on all brokers, even though none 
of the brokers' physical resources are exhausted.

Initially, we realised that our {{io.thread}} count was too low, causing high 
request queuing and the low idle percentage, so we increased it, hoping to see 
one of the physical resources max out. After the change we still saw abrupt 
drops of the idle percentage to 0% (with no physical resource maxing out), so we 
continued to investigate.

The investigation showed a direct relation to {{linger.ms}} being the 
controlling factor for sending out produce requests. Whenever messages were sent 
out by the producers because of the {{linger.ms}} threshold, scaling out the 
service increased the number of produce requests disproportionately to the 
increase in traffic, bringing all the brokers to a near-halt in their ability to 
process requests and, as mentioned, without exhausting any physical resource.
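
For illustration, here is a rough Java sketch of a producer configuration in 
which {{linger.ms}} ends up as the controlling factor. The broker address and 
the numeric values are placeholders for the example, not taken from our 
services:

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class LingerBoundProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // With a low linger.ms and a batch.size that is rarely filled within that
        // window, batches are flushed on the timer, so every producer instance
        // sends produce requests at a rate set by linger.ms rather than by traffic.
        props.put(ProducerConfig.LINGER_MS_CONFIG, "5");       // example value
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "262144"); // 256 KB, rarely reached

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // ... send records as usual; each partition's batch is flushed roughly
            // once per linger.ms window.
        }
    }
}
{code}

Under this shape of configuration, adding more producer instances multiplies the 
number of produce requests even when the total traffic in bytes/sec stays the 
same.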

After some more experiments and profiling a broker through flight recorder, we 
found that the cause of the issue is lock contention on a 
*{{java.lang.Object}}*, wasting a lot of time across all the 
{{data-plane-kafka-request-handler}} threads. 90% of the locks were on a Log's 
*{{lock: Object}}* instance, inside the *{{Log.append()}}* method. The stack 
traces show that these locks occur during the {{handleProduceRequest}} method. 
We ruled out replication as the source of the issue, since there were no 
replication problems and the control plane has a separate thread pool, so this 
focused us back on the actual producers, leading back to the behaviour of our 
producer service when scaling out.
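
As a rough mental model of what the profile shows (a simplified Java analogue, 
not the actual broker code, which keeps this logic in Scala inside 
{{Log.append()}}), every append to the same partition log synchronizes on a 
single lock object, so the request-handler threads serialize there:

{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Simplified analogue of a per-partition log. This only illustrates the
// contention pattern we saw in the profile; it is not Kafka's real Log class.
public class PartitionLogSketch {
    private final Object lock = new Object();        // one lock per partition log
    private final AtomicLong nextOffset = new AtomicLong();

    public long append(byte[] recordBatch) {
        synchronized (lock) {
            // validate the batch, assign offsets, write to the active segment...
            // Every produce request touching this partition has to pass through
            // here, so many small requests mean many short critical sections.
            return nextOffset.getAndIncrement();
        }
    }
}
{code}

The more small produce requests arrive per second for the same partition, the 
more time the handler threads spend queued on that lock.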

At that point we thought the issue might be related to the number of partitions 
of the topic (currently 60), and that increasing it would reduce the lock 
contention on each {{Log}} instance. However, since each producer writes to all 
partitions (the data is evenly spread, not skewed), increasing the number of 
partitions would only cause each producer to generate more produce requests, not 
alleviate the lock contention. Likewise, adding brokers would raise the idle 
percentage per broker but would not help reduce produce-request latency, since 
it would not change the rate of produce requests per Log.
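
To make the reasoning concrete, here is a back-of-the-envelope calculation in 
Java. The numbers are illustrative, not measurements from our cluster:

{code:java}
public class ProduceRateEstimate {
    public static void main(String[] args) {
        // Illustrative figures only.
        int producers = 40;      // producer instances after scaling out
        int partitions = 60;     // partitions of the topic
        int brokers = 5;         // brokers sharing partition leadership evenly
        double lingerMs = 5.0;   // linger.ms when it is the controlling factor

        // When linger.ms is the controlling factor, each producer flushes a batch
        // to every partition roughly once per linger window, regardless of traffic.
        double appendsPerSecPerPartition = producers * (1000.0 / lingerMs);
        double appendsPerSecPerBroker =
                appendsPerSecPerPartition * partitions / (double) brokers;

        System.out.printf("appends/sec per partition: %.0f%n", appendsPerSecPerPartition);
        System.out.printf("appends/sec per broker:    %.0f%n", appendsPerSecPerBroker);
        // Doubling the producer count doubles both rates even if the total
        // bytes/sec stays the same, and adding partitions only adds more batches
        // per producer, which matches the behaviour described above.
    }
}
{code}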

Eventually, we worked around the issue by raising the {{linger.ms}} value high 
enough that it stopped being the controlling factor for sending messages (i.e. 
produce requests became coupled to the volume of traffic, because the max batch 
size became the controlling factor). This allowed us to utilise the cluster 
better without upscaling it.
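
Roughly, the shape of the workaround looks like this (the values are 
placeholders, not our production settings): {{linger.ms}} is raised far enough 
that batches fill up first and {{batch.size}} becomes the controlling factor.

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class SizeBoundProducerConfig {
    public static Properties workaroundProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder

        // Raise linger.ms high enough that batches fill before the timer fires,
        // so batch.size (i.e. the traffic volume) controls when requests are sent.
        props.put(ProducerConfig.LINGER_MS_CONFIG, "200");    // placeholder value
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "65536"); // 64 KB placeholder

        return props;
    }
}
{code}

With this kind of configuration the produce-request rate grows with the traffic 
itself rather than with the number of producer instances.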

From our analysis, it seems that this lock behaviour limits Kafka's ability to 
be robust to producer configuration and scaling, and hurts the ability to do 
efficient capacity planning for the cluster, increasing the risk of an 
unexpected bottleneck when traffic increases.

It would be great if you could validate these conclusions, or provide any more 
information that would help us understand the issue better or work around it in 
a more efficient way.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
