[
https://issues.apache.org/jira/browse/KAFKA-10134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17148244#comment-17148244
]
Neo Wu commented on KAFKA-10134:
--------------------------------
Hi, [~guozhang]
for "What's still puzzling me is that, even in the second branch, since we
always keep calling `timer.update` then we should still eventually exit the
while loop with `timer.expired`. So why would we observe that it blocks inside
the while-loop forever is not clear to me."
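(To make the shape of the loop being discussed concrete, here is a minimal, self-contained sketch of a timer-bounded retry loop — this is only an illustration of the pattern, not the actual ConsumerCoordinator code; all names in it are made up.)
{code:java}
import java.time.Duration;

// Illustration only: a timer-bounded retry loop of the shape discussed above.
// It always exits once the deadline passes, but if every attempt fails
// immediately (e.g. the broker is unreachable), nothing blocks in between,
// so the thread spins at ~100% CPU until the timer expires.
public class TimerBoundedLoopSketch {
    public static void main(String[] args) {
        long deadlineMs = System.currentTimeMillis() + Duration.ofSeconds(30).toMillis();
        boolean done = false;
        while (!done && System.currentTimeMillis() < deadlineMs) {
            done = tryFindCoordinator(); // stand-in: fails fast while Kafka is down
            // no sleep/backoff here -> tight loop for the full 30 seconds
        }
    }

    private static boolean tryFindCoordinator() {
        return false; // hypothetical stand-in for the real coordinator lookup
    }
}
{code}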
Yes, it will exit the while loop eventually, but the application code usually looks like the following (at least in my case):
{code:java}
while (!shutdown) {
    try {
        ConsumerRecords<byte[], byte[]> records =
                consumer.poll(Duration.ofSeconds(30));
        if (records.isEmpty()) continue;
        processRecords(records);
    } catch (Throwable e) {
        if (shutdown) break;
        logger.error("failed to pull message, retry in 10 seconds", e);
        Threads.sleepRoughly(Duration.ofSeconds(10));
    }
}
{code}
So during those 30 seconds, if Kafka goes down in the middle, fetchablePartitions returns
non-empty and consumer.poll busy-loops; and as soon as it exits, the application typically
calls poll again immediately.
At the application level I can of course add a delay when poll returns empty (a rough
sketch follows below), but that still causes high CPU from time to time, and the timeout
passed into consumer.poll can't be very large. Say I use 5 seconds: in that case the
thread behaves like 100% CPU for 5s -> sleep 5s -> 100% CPU for 5s, which is still not
healthy behavior.
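For reference, this is roughly what that application-level delay would look like — a minimal, self-contained sketch with placeholder bootstrap/topic/group values and an illustrative 5s poll / 5s back-off; it only softens the symptom, since the CPU still spikes during each poll while the broker is unreachable:
{code:java}
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PollWithBackoffSketch {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "example-group");           // placeholder
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("example-topic")); // placeholder topic
            while (!Thread.currentThread().isInterrupted()) {
                // A shorter poll timeout bounds each burst of busy-looping...
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(5));
                if (records.isEmpty()) {
                    // ...and the back-off keeps the thread from re-entering poll
                    // immediately, but while Kafka is down the result is still
                    // "5s at 100% CPU, then 5s idle", repeated.
                    Thread.sleep(Duration.ofSeconds(5).toMillis());
                    continue;
                }
                // processRecords(records); // application-specific handling
            }
        }
    }
}
{code}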
> High CPU issue during rebalance in Kafka consumer after upgrading to 2.5
> ------------------------------------------------------------------------
>
> Key: KAFKA-10134
> URL: https://issues.apache.org/jira/browse/KAFKA-10134
> Project: Kafka
> Issue Type: Bug
> Components: clients
> Affects Versions: 2.5.0
> Reporter: Sean Guo
> Assignee: Guozhang Wang
> Priority: Blocker
> Fix For: 2.6.0, 2.5.1
>
>
> We want to utilize the new rebalance protocol to mitigate the stop-the-world
> effect during rebalances, as our tasks are long-running.
> But after the upgrade, when we kill an instance to trigger a rebalance while there
> is some load (including long-running tasks >30s), the CPU goes sky-high. It reads
> ~700% in our metrics, so several threads must be in a tight loop. We have several
> consumer threads consuming from different partitions during the rebalance. This is
> reproducible with both the new CooperativeStickyAssignor and the old eager rebalance
> protocol. The difference is that with the old eager rebalance protocol the high CPU
> usage drops after the rebalance is done. But with the cooperative one, the consumer
> threads seem stuck on something and can't finish the rebalance, so the high CPU
> usage won't drop until we stop our load. A small load without long-running tasks
> also doesn't cause continuously high CPU usage, as the rebalance can finish in
> that case.
>
> "executor.kafka-consumer-executor-4" #124 daemon prio=5 os_prio=0
> cpu=76853.07ms elapsed=841.16s tid=0x00007fe11f044000 nid=0x1f4 runnable
> [0x00007fe119aab000]"executor.kafka-consumer-executor-4" #124 daemon prio=5
> os_prio=0 cpu=76853.07ms elapsed=841.16s tid=0x00007fe11f044000 nid=0x1f4
> runnable [0x00007fe119aab000] java.lang.Thread.State: RUNNABLE at
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:467)
> at
> org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1275)
> at
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1241)
> at
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1216)
> at
>
> By debugging into the code, we found that the clients look to be in a loop
> trying to find the coordinator.
> I also tried the old rebalance protocol on the new version; the issue still
> exists, but the CPU goes back to normal once the rebalance is done.
> I also tried the same on 2.4.1, which doesn't seem to have this issue, so it
> seems related to something that changed between 2.4.1 and 2.5.0.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)