Haoze Wu created KAFKA-14886:
--------------------------------
Summary: Broker request handler thread pool is full due to single request slowdown
Key: KAFKA-14886
URL: https://issues.apache.org/jira/browse/KAFKA-14886
Project: Kafka
Issue Type: Improvement
Affects Versions: 2.8.0
Reporter: Haoze Wu
In Kafka 2.8.0, we found that the pool of data plane Kafka request handler threads can quickly reach its limit when only one request is stuck. As a result, all other requests that need a data plane request handler are blocked as well.
When an I/O operation causes a slowdown inside the storeOffsets call at line 777, the handler thread keeps holding the consumer group lock acquired at line 754:
{code:java}
private def doCommitOffsets(group: GroupMetadata,
                            memberId: String,
                            groupInstanceId: Option[String],
                            generationId: Int,
                            offsetMetadata: immutable.Map[TopicPartition, OffsetAndMetadata],
                            responseCallback: immutable.Map[TopicPartition, Errors] => Unit): Unit = {
  group.inLock { // Line 754
    ..
    groupManager.storeOffsets(....) // Line 777
    ..
  }
} {code}
Its call stack is:
{code:java}
kafka.coordinator.group.GroupMetadata,inLock,227
kafka.coordinator.group.GroupCoordinator,handleCommitOffsets,755
kafka.server.KafkaApis,handleOffsetCommitRequest,515
kafka.server.KafkaApis,handle,175
kafka.server.KafkaRequestHandler,run,74
java.lang.Thread,run,748 {code}
This happens while the broker is handling an OffsetCommit request from a consumer. Because the slowdown above keeps the consumer from getting a response, the consumer automatically resends the request to the broker. Each request from the consumer is handled by a data-plane-kafka-request-handler thread, so every retry occupies another data-plane-kafka-request-handler thread that also gets stuck at line 754, because it tries to acquire the same consumer group lock. The retries keep coming and none of them can succeed, so the pool of data-plane-kafka-request-handler threads eventually fills up. Since this pool is responsible for handling such requests from all producers and consumers, all producers and consumers on the broker are affected.
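For context, group.inLock ultimately performs a plain, untimed ReentrantLock.lock(), which is why every handler thread that touches the same group simply parks until the lock is released. The following is only a rough paraphrase of the relevant Kafka 2.8.0 helpers (GroupMetadata.inLock and CoreUtils.inLock) to illustrate the blocking behavior; the exact code may differ slightly.
{code:java}
import java.util.concurrent.locks.{Lock, ReentrantLock}

// GroupMetadata (paraphrased): every operation on a consumer group funnels through this lock.
private val lock = new ReentrantLock

def inLock[T](fun: => T): T = CoreUtils.inLock(lock)(fun)

// CoreUtils.inLock (paraphrased): an unconditional lock() with no timeout, so a
// request handler thread that arrives while the lock is held blocks indefinitely.
def inLock[T](lock: Lock)(fun: => T): T = {
  lock.lock()
  try {
    fun
  } finally {
    lock.unlock()
  }
} {code}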
The retry backoff mechanism might mitigate this issue by reducing the number of requests sent in a short time, leaving more free slots in the thread pool. Therefore, we increased the consumer backoff config retry.backoff.ms from its default of 100ms to 1000ms to see whether the issue disappears. However, the thread pool still fills up, because multiple heartbeat requests take up its slots: their handlers are all blocked trying to acquire the same consumer group lock, which was acquired at line 754 as mentioned above. Specifically, the heartbeat handling is stuck at GroupCoordinator.handleHeartbeat@624:
{code:java}
def handleHeartbeat(groupId: String,
                    memberId: String,
                    groupInstanceId: Option[String],
                    generationId: Int,
                    responseCallback: Errors => Unit): Unit = {
  ..
  case Some(group) => group.inLock { // Line 624
    ..
  }
  ..
} {code}
Heartbeat requests are sent by the consumer every 3000ms by default (heartbeat.interval.ms) and have no backoff mechanism, so the data-plane-kafka-request-handler thread pool fills up again quickly.
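To make the experiment concrete, a minimal consumer configuration along these lines reproduces the setup described above; the bootstrap servers, group id, and deserializers below are placeholders, and only retry.backoff.ms is changed from its default.
{code:java}
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // placeholder
props.put("group.id", "test-group")              // placeholder
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

// Raised from the 100ms default; this only slows down the OffsetCommit retries.
props.put("retry.backoff.ms", "1000")
// Heartbeats are still sent every 3000ms (the default) and have no backoff,
// so blocked heartbeat handlers keep occupying request handler threads.
props.put("heartbeat.interval.ms", "3000")

val consumer = new KafkaConsumer[String, String](props) {code}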
Fix:
Instead of blocking on the lock, we could just try to acquire it (possibly with a time limit). If the acquisition fails, the request can be discarded so that other requests (including the retry of the discarded one) can be processed. However, we feel this fix could affect the semantics of many operations, and we would like to hear suggestions from the community.
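As a rough sketch of this idea (not a patch), a timed tryLock-style helper next to the existing CoreUtils.inLock could look like the following; tryInLock is a hypothetical name and the timeout is arbitrary.
{code:java}
import java.util.concurrent.TimeUnit
import java.util.concurrent.locks.Lock

// Hypothetical variant of CoreUtils.inLock: try to take the lock for a bounded
// time instead of parking forever. If the lock cannot be acquired, return None
// so the caller can reject the request (e.g. with a retriable error) and free
// the data-plane-kafka-request-handler thread.
def tryInLock[T](lock: Lock, timeoutMs: Long)(fun: => T): Option[T] = {
  if (lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
    try Some(fun)
    finally lock.unlock()
  } else {
    None
  }
} {code}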