[
https://issues.apache.org/jira/browse/KAFKA-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
A. Sophie Blee-Goldman updated KAFKA-13126:
---
Description:
In older versions of Kafka Streams, the {{max.poll.interval.ms}} config was
overridden by default to {{Integer.MAX_VALUE}}. Even after we removed this
override, users of both the plain consumer client and Kafka Streams still set
the poll interval to {{Integer.MAX_VALUE}} fairly often. Unfortunately, this
causes an integer overflow when computing the {{joinGroupTimeoutMs}}, which
results in it being set to the {{request.timeout.ms}} instead, a much lower
value.
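To make the overflow concrete, here is a minimal sketch of the arithmetic. It assumes the timeout is computed as the larger of {{request.timeout.ms}} and {{max.poll.interval.ms}} plus the 5s JOIN_GROUP_TIMEOUT_LAPSE; the class and method names are hypothetical and are not taken from the Kafka code base.

```java
public class JoinGroupTimeoutOverflow {
    // 5s lapse, as stated in the issue description.
    static final int JOIN_GROUP_TIMEOUT_LAPSE_MS = 5000;

    // Assumed shape of the computation: the int addition wraps around to a
    // negative number when maxPollIntervalMs is close to Integer.MAX_VALUE,
    // so Math.max falls back to the request timeout.
    static int naiveJoinGroupTimeout(int requestTimeoutMs, int maxPollIntervalMs) {
        return Math.max(requestTimeoutMs, maxPollIntervalMs + JOIN_GROUP_TIMEOUT_LAPSE_MS);
    }

    public static void main(String[] args) {
        // A typical poll interval (5 minutes): the sum fits in an int.
        System.out.println(naiveJoinGroupTimeout(30_000, 300_000));
        // Integer.MAX_VALUE: the sum overflows and the 30s request timeout wins.
        System.out.println(naiveJoinGroupTimeout(30_000, Integer.MAX_VALUE));
    }
}
```

Under these assumptions, a poll interval of {{Integer.MAX_VALUE}} yields a join timeout equal to the 30s request timeout rather than a very large value.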
This can easily make consumers drop out of the group: they must now rejoin
within 30s (by default), yet with such a high {{max.poll.interval.ms}} they are
under almost no obligation to call poll(). In practice they will only do so
after processing the last record from the previously polled batch. So in heavy
processing cases, where each record takes a long time to process, or when using
a very large {{max.poll.records}}, it can be difficult to make any progress at
all before dropping out and needing to rejoin. And of course, the rebalance
that is kicked off when this member rejoins can cause many of the other
members in the group to drop out as well, leading to an endless cycle of
missed rebalances.
We just need to check for overflow and clamp the value to {{Integer.MAX_VALUE}}
when it occurs. Until then, the workaround is to set {{max.poll.interval.ms}}
to {{Integer.MAX_VALUE - 5000}} (5s is the JOIN_GROUP_TIMEOUT_LAPSE).
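
The proposed check and the workaround could look roughly like the following. This is only a sketch under the assumption that the timeout is {{max.poll.interval.ms}} plus the 5s lapse, bounded below by {{request.timeout.ms}}; the class and method names are hypothetical and the actual patch may take a different form.

```java
import java.util.Properties;

public class JoinGroupTimeoutClamp {
    // 5s lapse, as stated in the issue description.
    static final int JOIN_GROUP_TIMEOUT_LAPSE_MS = 5000;

    // Do the addition in long arithmetic, then clamp to Integer.MAX_VALUE
    // so the result can never wrap around to a negative number.
    static int clampedJoinGroupTimeout(int requestTimeoutMs, int maxPollIntervalMs) {
        long sum = (long) maxPollIntervalMs + JOIN_GROUP_TIMEOUT_LAPSE_MS;
        int clamped = (int) Math.min(sum, Integer.MAX_VALUE);
        return Math.max(requestTimeoutMs, clamped);
    }

    // Workaround on the user side: leave exactly enough headroom for the lapse.
    static Properties workaroundConfig() {
        Properties props = new Properties();
        props.setProperty("max.poll.interval.ms",
                String.valueOf(Integer.MAX_VALUE - JOIN_GROUP_TIMEOUT_LAPSE_MS));
        return props;
    }

    public static void main(String[] args) {
        System.out.println(clampedJoinGroupTimeout(30_000, Integer.MAX_VALUE));
        System.out.println(workaroundConfig().getProperty("max.poll.interval.ms"));
    }
}
```

With the workaround value, adding the lapse lands exactly on {{Integer.MAX_VALUE}}, so even the unclamped int addition does not overflow.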
> Overflow in joinGroupTimeoutMs when max.poll.interval.ms is MAX_VALUE leads
> to missing rebalances
> -
>
> Key: KAFKA-13126
> URL: https://issues.apache.org/jira/browse/KAFKA-13126
> Project: Kafka
> Issue Type: Bug
> Components: consumer
> Reporter: A. Sophie Blee-Goldman
> Assignee: A. Sophie Blee-Goldman
> Priority: Major
> Fix For: 3.1.0