[
https://issues.apache.org/jira/browse/KAFKA-17232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17898172#comment-17898172
]
Asker commented on KAFKA-17232:
-------------------------------
Hello [~gharris1727],
Thank you very much for responding to our comment; it is highly valuable to our
team.
*“Are you seeing the task configuration error appearing continuously without
ever resolving?”*
Yes, we are seeing this ERROR log constantly. Last Friday, I upgraded from
Kafka 3.6.0 to 3.9.0 and thought the error might be temporary. However, when I
returned to work on Monday, I saw that the error continued to appear
persistently, even though the mirroring itself is functioning: I can see that
messages from cluster A to cluster B are appearing in the topics.
*“Are you seeing the log messages indicating loading has finished?”*
No, we did not see such log messages.
*“Also, are you seeing any Scheduler logs mentioning loading initial consumer
groups?”*
No, we did not find such logs. Running the following command on the server
where the Kafka broker and MirrorMaker 2 service are running returned no
results:
{code:bash}
[root@kafka-analytics-3a~]# journalctl -u kafka-mirror-maker.service -xe | grep
"loading initial consumer groups"
{code}
*“I also wonder if these log messages could be coming from cancelled tasks by
accident.”*
I think you asked a very pertinent question! I’m glad you brought it up. I hope
we’re thinking along the same lines, but even if not, it’s still worth
discussing.
We have three clusters in our configuration:
{code:bash}
clusters=analytics-dev, app-dev, telemetry-dev
{code}
MirrorMaker is enabled from app-dev to analytics-dev and from telemetry-dev to
analytics-dev. However, the Timeout while loading consumer groups error always
references clientIds of clusters between which MirrorMaker should not be active:
- analytics-dev->app-dev
{code:bash}
[2024-11-11 12:41:44,943] ERROR [Worker clientId=analytics-dev->app-dev,
groupId=analytics-dev-mm2] Failed to reconfigure connector's tasks
(MirrorCheckpointConnector), retrying after backoff.
(org.apache.kafka.connect.runtime.distributed.DistributedHerder:2195)
{code}
- telemetry-dev->app-dev
{code:bash}
[2024-11-11 12:41:44,497] ERROR [Worker clientId=telemetry-dev->app-dev,
groupId=telemetry-dev-mm2] Failed to reconfigure connector's tasks
(MirrorCheckpointConnector), retrying after backoff.
(org.apache.kafka.connect.runtime.distributed.DistributedHerder:2195)
{code}
- app-dev->telemetry-dev
{code:bash}
[2024-11-11 12:41:44,943] ERROR [Worker clientId=app-dev->telemetry-dev,
groupId=app-dev-mm2] Failed to reconfigure connector's tasks
(MirrorCheckpointConnector), retrying after backoff.
(org.apache.kafka.connect.runtime.distributed.DistributedHerder:2195)
{code}
And so on.
This means that between analytics-dev->app-dev, there are no topics configured
for MirrorMaker, and the same applies for telemetry-dev->app-dev, etc. In other
words, MirrorMaker is attempting to interact between clusters where it is not
supposed to be active.
Additionally, it’s noticeable that the cluster app-dev always appears in these
errors. I’m not sure why this is happening. The only distinguishing feature of
this cluster is that it has ACLs, but analytics-dev also has ACLs.
Our connect-mirror-maker.properties file looks like this:
{code:bash}
clusters=analytics-dev, app-dev, telemetry-dev
# Analytics-dev cluster configuration
analytics-dev.bootstrap.servers=kafka-analytics-1a:9092,
kafka-analytics-2a:9092, kafka-analytics-3a:9092
analytics-dev.security.protocol=...
analytics-dev.sasl.mechanism=...
analytics-dev.sasl.jaas.config=...
analytics-dev.checkpoints.topic.replication.factor=2
analytics-dev.heartbeats.topic.replication.factor=2
analytics-dev.offset-syncs.topic.replication.factor=2
analytics-dev.offset.storage.replication.factor=2
analytics-dev.status.storage.replication.factor=2
analytics-dev.config.storage.replication.factor=2
# App-dev cluster configuration
app-dev.bootstrap.servers=kafka-app-1a:9092, kafka-app-2a:9092,
kafka-app-3a:9092
app-dev.security.protocol=...
app-dev.sasl.mechanism=...
app-dev.sasl.jaas.config=...
app-dev.checkpoints.topic.replication.factor=2
app-dev.heartbeats.topic.replication.factor=2
app-dev.offset-syncs.topic.replication.factor=2
app-dev.offset.storage.replication.factor=2
app-dev.status.storage.replication.factor=2
app-dev.config.storage.replication.factor=2
# Telemetry-dev cluster configuration
telemetry-dev.bootstrap.servers=kafka-telemetry-1a:9092
telemetry-dev.security.protocol=...
telemetry-dev.checkpoints.topic.replication.factor=1
telemetry-dev.heartbeats.topic.replication.factor=1
telemetry-dev.offset-syncs.topic.replication.factor=1
telemetry-dev.offset.storage.replication.factor=1
telemetry-dev.status.storage.replication.factor=1
telemetry-dev.config.storage.replication.factor=1
# Replication flows
app-dev->analytics-dev.enabled=true
app-dev->analytics-dev.topics=...
analytics-dev->app-dev.enabled=false
telemetry-dev->analytics-dev.enabled=true
telemetry-dev->analytics-dev.topics=...
analytics-dev->telemetry-dev.enabled=false
replication.factor=2
replication.policy.class=org.apache.kafka.connect.mirror.IdentityReplicationPolicy
dedicated.mode.enable.internal.rest=true
num.streams=3
tasks.max=2
{code}
We are eagerly awaiting your response!
Best regards,
Asker Kakhramanov
> MirrorCheckpointConnector does not generate task configs if initial consumer
> group load times out
> -------------------------------------------------------------------------------------------------
>
> Key: KAFKA-17232
> URL: https://issues.apache.org/jira/browse/KAFKA-17232
> Project: Kafka
> Issue Type: Bug
> Components: mirrormaker
> Affects Versions: 3.9.0
> Reporter: Greg Harris
> Assignee: TengYao Chi
> Priority: Major
> Fix For: 3.9.0
>
>
> The MirrorCheckpointConnector has two operations that read the source
> consumer groups:
> * loadInitialConsumerGroups
> * refreshConsumerGroups
> loadInitialConsumerGroups blocks the start() method of the connector, while
> refreshConsumerGroups is asynchronous and runs periodically while the
> connector is running.
> loadInitialConsumerGroups may take a long time to execute, and may exceed the
> configured "admin.timeout.ms" used by the Scheduler. This timeout is logged
> and the start() method returns normally. If this happens, the framework will
> generate task configs immediately after start(), before
> loadInitialConsumerGroups can finish, and will generate an empty set of task
> configs:
> [https://github.com/apache/kafka/blob/e2494e6ffb89f8288ed2aeb9b5596c755210bffd/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorCheckpointConnector.java#L118-L121].
> Later, when loadInitialConsumerGroups completes, it will not request task
> reconfiguration, believing it is the initial load operation.
> Later still, when refreshConsumerGroups completes, it will not request task
> reconfiguration, as the set of consumer groups has not changed since the
> initial load:
> [https://github.com/apache/kafka/blob/e2494e6ffb89f8288ed2aeb9b5596c755210bffd/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorCheckpointConnector.java#L173-L180]
>
> This leads to a situation where the MirrorCheckpointConnector believes it has
> converged with nothing to update, but actually has consumer groups that are
> not allocated to tasks.
> This happens particularly for large, stable Kafka clusters with many consumer
> groups that are not being actively created or deleted.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)