Haoze Wu created KAFKA-14886:
--------------------------------

             Summary: Broker request handler thread pool is full due to single request slowdown
                 Key: KAFKA-14886
                 URL: https://issues.apache.org/jira/browse/KAFKA-14886
             Project: Kafka
          Issue Type: Improvement
    Affects Versions: 2.8.0
            Reporter: Haoze Wu


In Kafka 2.8.0, we found that the pool of data-plane Kafka request handler 
threads can quickly reach its limit when only a single request is stuck. As a 
result, all other requests that need a data-plane request handler thread will 
be stuck as well.

When there is a slowdown inside the storeOffsets call at line 777 due to an I/O 
operation, the handler thread keeps holding the consumer group lock acquired at 
line 754.

 
{code:java}
  private def doCommitOffsets(group: GroupMetadata,
                              memberId: String,
                              groupInstanceId: Option[String],
                              generationId: Int,
                              offsetMetadata: immutable.Map[TopicPartition, OffsetAndMetadata],
                              responseCallback: immutable.Map[TopicPartition, Errors] => Unit): Unit = {
    group.inLock { // Line 754
      ..
      groupManager.storeOffsets(....) // Line 777
      ..
  }
} {code}
Its call stack is:

 
{code:java}
kafka.coordinator.group.GroupMetadata,inLock,227
kafka.coordinator.group.GroupCoordinator,handleCommitOffsets,755
kafka.server.KafkaApis,handleOffsetCommitRequest,515
kafka.server.KafkaApis,handle,175
kafka.server.KafkaRequestHandler,run,74
java.lang.Thread,run,748 {code}
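The inLock frame at the top of this stack is, roughly, a plain blocking acquisition of the group's ReentrantLock (GroupMetadata.inLock delegates to a CoreUtils helper). The sketch below is simplified and only meant to show that a waiter blocks with no timeout, which is what lets a single slow lock holder park other handler threads:
{code:java}
// Simplified sketch (not the exact Kafka source): the group lock is a plain
// ReentrantLock acquired with lock(), so a caller of inLock blocks indefinitely
// while another thread holds the lock.
import java.util.concurrent.locks.ReentrantLock

class GroupMetadataSketch {
  private val lock = new ReentrantLock

  def inLock[T](fun: => T): T = {
    lock.lock() // blocks with no timeout until the current holder releases the lock
    try fun
    finally lock.unlock()
  }
} {code}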
This happens when the broker is handling a commit offset request from a 
consumer. When the slowdown mentioned above leaves the consumer with no 
response, the consumer automatically resends the request to the broker. Note 
that each request from the consumer is handled by a 
data-plane-kafka-request-handler thread. Therefore, another 
data-plane-kafka-request-handler thread will also get stuck at line 754 when 
handling the retried request, because it tries to acquire the very same 
consumer group lock. The retries occur repeatedly and none of them can succeed, 
so the pool of data-plane-kafka-request-handler threads fills up. Note that 
this pool of threads is responsible for handling such requests from all 
producers and consumers, so all producers and consumers would be affected.
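
To make the exhaustion dynamic concrete, here is a small standalone sketch (not Kafka code) that mimics the situation: one task holds a shared ReentrantLock during a long simulated I/O, and a fixed pool of 8 "handler" threads (matching the default num.io.threads) all block on the same lock, leaving no thread free for unrelated work:
{code:java}
// Standalone illustration only: one long lock holder plus retried requests that
// all need the same group lock exhaust a fixed-size handler pool.
import java.util.concurrent.Executors
import java.util.concurrent.locks.ReentrantLock

object HandlerPoolExhaustion extends App {
  val groupLock = new ReentrantLock
  val handlerPool = Executors.newFixedThreadPool(8) // default num.io.threads is 8

  // Request 1: holds the group lock while a slow I/O (simulated by sleep) runs.
  handlerPool.submit(new Runnable {
    def run(): Unit = { groupLock.lock(); try Thread.sleep(60000L) finally groupLock.unlock() }
  })

  // Retries / heartbeats for the same group: each one parks another handler thread.
  for (i <- 1 to 10) handlerPool.submit(new Runnable {
    def run(): Unit = { groupLock.lock(); try println(s"request $i done") finally groupLock.unlock() }
  })

  // Any unrelated produce/fetch request is now queued behind the stuck handlers.
  handlerPool.submit(new Runnable { def run(): Unit = println("unrelated request handled") })
  handlerPool.shutdown()
} {code}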

However, the client-side backoff mechanism might mitigate this issue by 
reducing the number of requests sent in a short time and thereby leaving more 
slots free in the thread pool. We therefore increased the consumer backoff 
config "retry.backoff.ms" from its 100ms default to 1000ms to see whether the 
issue disappears. However, the thread pool still filled up, because multiple 
heartbeat requests took up the slots instead. All of that heartbeat handling is 
stuck acquiring the same consumer group lock, which is already held at line 754 
as mentioned. Specifically, the heartbeat handling is stuck at 
GroupCoordinator.handleHeartbeat@624:
{code:java}
  def handleHeartbeat(groupId: String,
                      memberId: String,
                      groupInstanceId: Option[String],
                      generationId: Int,
                      responseCallback: Errors => Unit): Unit = {
..
      case Some(group) => group.inLock { // Line 624
..
      }
..
} {code}
Heartbeat requests are sent by the consumer at an interval of 3000ms (by 
default) and have no backoff mechanism, so the data-plane-kafka-request-handler 
thread pool still fills up quickly.
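
For reference, the consumer-side settings involved in this experiment look roughly like the following (a sketch; the bootstrap server and group id are placeholders, and only retry.backoff.ms was changed from its default):
{code:java}
// Sketch of the consumer configuration used in the experiment. Broker address
// and group id are placeholders; heartbeat.interval.ms stays at its 3000ms
// default and has no backoff.
import java.util.Properties
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.serialization.StringDeserializer

object ConsumerBackoffConfigSketch {
  def buildConsumer(): KafkaConsumer[String, String] = {
    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")  // placeholder
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "test-group")               // placeholder
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
    props.put(ConsumerConfig.RETRY_BACKOFF_MS_CONFIG, "1000")             // raised from the 100ms default
    props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "3000")        // default; heartbeats have no backoff
    new KafkaConsumer[String, String](props)
  }
} {code}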

Fix: 

Instead of blocking on the lock, we could simply try to acquire it (probably 
with a time limit). If the acquisition fails, the request can be discarded so 
that other requests (including a later retry of the discarded one) can be 
processed. However, we feel this fix could affect the semantics of many 
operations, and we would like to hear suggestions from the community.
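
A minimal sketch of the idea, assuming a hypothetical tryInLock helper on the group metadata (the helper name, the timeout, and how the failure is surfaced to the client are all illustrative, not existing Kafka APIs):
{code:java}
// Illustrative sketch only: "tryInLock" and the timeout are hypothetical.
import java.util.concurrent.TimeUnit
import java.util.concurrent.locks.ReentrantLock

class GroupLockSketch {
  private val lock = new ReentrantLock

  // Try to run `fun` under the group lock; give up after `timeoutMs` instead of
  // parking the request handler thread forever.
  def tryInLock[T](timeoutMs: Long)(fun: => T): Option[T] = {
    if (lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
      try Some(fun)
      finally lock.unlock()
    } else {
      None // caller discards the request; the client will retry later
    }
  }
} {code}
A caller such as doCommitOffsets or handleHeartbeat would then map the None case to a retriable error in the response callback instead of blocking the handler thread.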


