[
https://issues.apache.org/jira/browse/KAFKA-17232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17898035#comment-17898035
]
Greg Harris commented on KAFKA-17232:
-------------------------------------
Hi [~kakhramanov], thanks for the bug report!
That error appearing multiple times is the intended behavior of the patch; it
should eventually resolve once the initial load of consumer groups finishes,
and this log message is printed:
[https://github.com/apache/kafka/blob/b6b2c9ebc45bd60572c24355886620dbdc406ce9/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorCheckpointConnector.java#L217]
Are you seeing the task configuration error appearing continuously without ever
resolving? Are you seeing the log messages indicating loading has finished?
Also, are you seeing any Scheduler logs mentioning `loading initial consumer
groups`? That would tell you if the load was actually timing out. The
RetriableException implies that the load operation timed out, but perhaps there
is a mistake that causes the exception to be thrown even if the timeout has not
elapsed yet.
I also wonder if these log messages could be coming from cancelled tasks by
accident.
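To make the expected behavior concrete, here is a minimal sketch of the idea as I understand it: task config generation fails with a retriable error until the initial load has stored a result, and succeeds afterwards. This is a simplified model, not the actual MirrorCheckpointConnector code. The class names, the null sentinel, the exception message, and the "groups" config key are all assumptions made for illustration, and LoadNotFinishedException is only a local stand-in for the RetriableException mentioned above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a retriable exception, defined locally so the
// sketch has no Kafka dependency.
class LoadNotFinishedException extends RuntimeException {
    LoadNotFinishedException(String message) {
        super(message);
    }
}

// Simplified, hypothetical model of the patched behavior.
class GuardedTaskConfigSketch {
    // null means "initial load has not finished"; an empty set means
    // "load finished and found no consumer groups".
    private volatile Set<String> knownConsumerGroups = null;

    // Called when the (possibly slow) initial load completes.
    void finishInitialLoad(Set<String> groups) {
        knownConsumerGroups = new TreeSet<>(groups);
    }

    // Called by the framework; it is expected to retry after a retriable failure.
    List<Map<String, String>> taskConfigs(int maxTasks) {
        Set<String> groups = knownConsumerGroups;
        if (groups == null) {
            // Refuse to produce an empty config set while the load is still pending.
            throw new LoadNotFinishedException("initial consumer group load has not finished");
        }
        if (groups.isEmpty() || maxTasks <= 0) {
            return Collections.emptyList();
        }
        int numTasks = Math.min(maxTasks, groups.size());
        List<List<String>> buckets = new ArrayList<>();
        for (int t = 0; t < numTasks; t++) {
            buckets.add(new ArrayList<>());
        }
        // Round-robin assignment of groups to tasks.
        int next = 0;
        for (String group : groups) {
            buckets.get(next % numTasks).add(group);
            next++;
        }
        List<Map<String, String>> configs = new ArrayList<>();
        for (List<String> bucket : buckets) {
            Map<String, String> config = new HashMap<>();
            config.put("groups", String.join(",", bucket)); // hypothetical key
            configs.add(config);
        }
        return configs;
    }
}
```

In this model, repeated errors before finishInitialLoad() are expected, and a single error after it would point to a bug.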
> MirrorCheckpointConnector does not generate task configs if initial consumer
> group load times out
> -------------------------------------------------------------------------------------------------
>
> Key: KAFKA-17232
> URL: https://issues.apache.org/jira/browse/KAFKA-17232
> Project: Kafka
> Issue Type: Bug
> Components: mirrormaker
> Affects Versions: 3.9.0
> Reporter: Greg Harris
> Assignee: TengYao Chi
> Priority: Major
> Fix For: 3.9.0
>
>
> The MirrorCheckpointConnector has two operations that read the source
> consumer groups:
> * loadInitialConsumerGroups
> * refreshConsumerGroups
> loadInitialConsumerGroups blocks the start() method of the connector, while
> refreshConsumerGroups is asynchronous and runs periodically while the
> connector is running.
> loadInitialConsumerGroups may take a long time to execute, and may exceed the
> configured "admin.timeout.ms" used by the Scheduler. This timeout is logged
> and the start() method returns normally. If this happens, the framework will
> generate task configs immediately after start(), before
> loadInitialConsumerGroups can finish, and will generate an empty set of task
> configs:
> [https://github.com/apache/kafka/blob/e2494e6ffb89f8288ed2aeb9b5596c755210bffd/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorCheckpointConnector.java#L118-L121].
> Later, when loadInitialConsumerGroups completes, it will not request task
> reconfiguration, believing it is the initial load operation.
> Later still, when refreshConsumerGroups completes, it will not request task
> reconfiguration, as the set of consumer groups has not changed since the
> initial load:
> [https://github.com/apache/kafka/blob/e2494e6ffb89f8288ed2aeb9b5596c755210bffd/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorCheckpointConnector.java#L173-L180]
>
> This leads to a situation where the MirrorCheckpointConnector believes it has
> converged with nothing to update, but actually has consumer groups that are
> not allocated to tasks.
> This happens particularly for large, stable Kafka clusters with many consumer
> groups that are not being actively created or deleted.
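For reference, the sequence in the quoted description can be modeled roughly as follows. This is a hypothetical, simplified sketch of the pre-patch behavior, not the actual connector code: the class, the method names, and the counter are invented for illustration.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Simplified, hypothetical model of the behavior described in the issue.
class StaleTaskConfigSketch {
    // Starts out empty, so "load not finished" and "no groups" look the same.
    private Set<String> knownConsumerGroups = Collections.emptySet();

    // Counts how often the connector would ask the framework to reconfigure tasks.
    int reconfigurationRequests = 0;

    // The framework calls this right after start() returns.
    int taskConfigCount() {
        return knownConsumerGroups.isEmpty() ? 0 : 1;
    }

    // The initial load completes late: it stores the groups but, being the
    // "initial" load, never requests reconfiguration.
    void finishInitialLoad(Set<String> groups) {
        knownConsumerGroups = new HashSet<>(groups);
    }

    // The periodic refresh only requests reconfiguration when the set changed.
    void refresh(Set<String> currentGroups) {
        if (!currentGroups.equals(knownConsumerGroups)) {
            knownConsumerGroups = new HashSet<>(currentGroups);
            reconfigurationRequests++;
        }
    }
}
```

If taskConfigCount() is called before finishInitialLoad(), it returns 0, and a later refresh() with the same groups leaves reconfigurationRequests at 0, so the framework is never prompted to generate task configs again.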