[jira] [Commented] (KAFKA-17232) MirrorCheckpointConnector does not generate task configs if initial consumer group load times out

Asker (Jira) Thu, 14 Nov 2024 00:06:08 -0800


    [ 
https://issues.apache.org/jira/browse/KAFKA-17232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17898172#comment-17898172
 ]


Asker commented on KAFKA-17232:
-------------------------------

Hello [~gharris1727],
Thank you very much for responding to our comment; it is highly valuable to our 
team.

*“Are you seeing the task configuration error appearing continuously without 
ever resolving?”*
Yes, we are seeing this ERROR log constantly. Last Friday, I upgraded from 
Kafka 3.6.0 to 3.9.0 and thought the error might be temporary. However, when I 
returned to work on Monday, I saw that the error continued to appear 
persistently, even though the mirroring itself is functioning: I can see that 
messages from cluster A to cluster B are appearing in the topics.

*“Are you seeing the log messages indicating loading has finished?”*
No, we did not see such log messages.

*“Also, are you seeing any Scheduler logs mentioning loading initial consumer 
groups?”*
No, we did not find such logs. Running the following command on the server 
where the Kafka broker and MirrorMaker 2 service are running returned no 
results:
{code:bash}
[root@kafka-analytics-3a~]# journalctl -u kafka-mirror-maker.service -xe | grep 
"loading initial consumer groups"
{code}

*“I also wonder if these log messages could be coming from cancelled tasks by 
accident.”*
I think you asked a very pertinent question! I’m glad you brought it up. I hope 
we’re thinking along the same lines, but even if not, it’s still worth 
discussing.

We have three clusters in our configuration:
{code:bash}
clusters=analytics-dev, app-dev, telemetry-dev
{code}
MirrorMaker is enabled from app-dev to analytics-dev and from telemetry-dev to 
analytics-dev. However, the Timeout while loading consumer groups error always 
references clientIds of clusters between which MirrorMaker should not be active:

- analytics-dev->app-dev
{code:bash}
[2024-11-11 12:41:44,943] ERROR [Worker clientId=analytics-dev->app-dev, 
groupId=analytics-dev-mm2] Failed to reconfigure connector's tasks 
(MirrorCheckpointConnector), retrying after backoff. 
(org.apache.kafka.connect.runtime.distributed.DistributedHerder:2195)
{code}

- telemetry-dev->app-dev
{code:bash}
[2024-11-11 12:41:44,497] ERROR [Worker clientId=telemetry-dev->app-dev, 
groupId=telemetry-dev-mm2] Failed to reconfigure connector's tasks 
(MirrorCheckpointConnector), retrying after backoff. 
(org.apache.kafka.connect.runtime.distributed.DistributedHerder:2195)
{code}

- app-dev->telemetry-dev
{code:bash}
[2024-11-11 12:41:44,943] ERROR [Worker clientId=app-dev->telemetry-dev, 
groupId=app-dev-mm2] Failed to reconfigure connector's tasks 
(MirrorCheckpointConnector), retrying after backoff. 
(org.apache.kafka.connect.runtime.distributed.DistributedHerder:2195)
{code}

And so on.

This means that between analytics-dev->app-dev, there are no topics configured 
for MirrorMaker, and the same applies for telemetry-dev->app-dev, etc. In other 
words, MirrorMaker is attempting to interact between clusters where it is not 
supposed to be active.

Additionally, it’s noticeable that the cluster app-dev always appears in these 
errors. I’m not sure why this is happening. The only distinguishing feature of 
this cluster is that it has ACLs, but analytics-dev also has ACLs.

Our connect-mirror-maker.properties file looks like this:
{code:bash}
clusters=analytics-dev, app-dev, telemetry-dev

# Analytics-dev cluster configuration
analytics-dev.bootstrap.servers=kafka-analytics-1a:9092, 
kafka-analytics-2a:9092, kafka-analytics-3a:9092
analytics-dev.security.protocol=...
analytics-dev.sasl.mechanism=...
analytics-dev.sasl.jaas.config=...
analytics-dev.checkpoints.topic.replication.factor=2
analytics-dev.heartbeats.topic.replication.factor=2
analytics-dev.offset-syncs.topic.replication.factor=2
analytics-dev.offset.storage.replication.factor=2
analytics-dev.status.storage.replication.factor=2
analytics-dev.config.storage.replication.factor=2

# App-dev cluster configuration
app-dev.bootstrap.servers=kafka-app-1a:9092, kafka-app-2a:9092, 
kafka-app-3a:9092
app-dev.security.protocol=...
app-dev.sasl.mechanism=...
app-dev.sasl.jaas.config=...
app-dev.checkpoints.topic.replication.factor=2
app-dev.heartbeats.topic.replication.factor=2
app-dev.offset-syncs.topic.replication.factor=2
app-dev.offset.storage.replication.factor=2
app-dev.status.storage.replication.factor=2
app-dev.config.storage.replication.factor=2

# Telemetry-dev cluster configuration
telemetry-dev.bootstrap.servers=kafka-telemetry-1a:9092
telemetry-dev.security.protocol=...
telemetry-dev.checkpoints.topic.replication.factor=1
telemetry-dev.heartbeats.topic.replication.factor=1
telemetry-dev.offset-syncs.topic.replication.factor=1
telemetry-dev.offset.storage.replication.factor=1
telemetry-dev.status.storage.replication.factor=1
telemetry-dev.config.storage.replication.factor=1

# Replication flows
app-dev->analytics-dev.enabled=true
app-dev->analytics-dev.topics=...
analytics-dev->app-dev.enabled=false

telemetry-dev->analytics-dev.enabled=true
telemetry-dev->analytics-dev.topics=...
analytics-dev->telemetry-dev.enabled=false

replication.factor=2
replication.policy.class=org.apache.kafka.connect.mirror.IdentityReplicationPolicy

dedicated.mode.enable.internal.rest=true

num.streams=3
tasks.max=2
{code}

We are eagerly awaiting your response!

Best regards,
Asker Kakhramanov

> MirrorCheckpointConnector does not generate task configs if initial consumer 
> group load times out
> -------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-17232
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17232
>             Project: Kafka
>          Issue Type: Bug
>          Components: mirrormaker
>    Affects Versions: 3.9.0
>            Reporter: Greg Harris
>            Assignee: TengYao Chi
>            Priority: Major
>             Fix For: 3.9.0
>
>
> The MirrorCheckpointConnector has two operations that read the source 
> consumer groups:
>  * loadInitialConsumerGroups
>  * refreshConsumerGroups
> loadInitialConsumerGroups blocks the start() method of the connector, while 
> refreshConsumerGroups is asynchronous and runs periodically while the 
> connector is running.
> loadInitialConsumerGroups may take a long time to execute, and may exceed the 
> configured "admin.timeout.ms" used by the Scheduler. This timeout is logged 
> and the start() method returns normally. If this happens, the framework will 
> generate task configs immediately after start(), before 
> loadInitialConsumerGroups can finish, and will generate an empty set of task 
> configs: 
> [https://github.com/apache/kafka/blob/e2494e6ffb89f8288ed2aeb9b5596c755210bffd/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorCheckpointConnector.java#L118-L121].
> Later, when loadInitialConsumerGroups completes, it will not request task 
> reconfiguration, believing it is the initial load operation.
> Later still, when refreshConsumerGroups completes, it will not request task 
> reconfiguration, as the set of consumer groups has not changed since the 
> initial load: 
> [https://github.com/apache/kafka/blob/e2494e6ffb89f8288ed2aeb9b5596c755210bffd/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorCheckpointConnector.java#L173-L180]
>  
> This leads to a situation where the MirrorCheckpointConnector believes it has 
> converged with nothing to update, but actually has consumer groups that are 
> not allocated to tasks.
> This happens particularly for large, stable Kafka clusters with many consumer 
> groups that are not being actively created or deleted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (KAFKA-17232) MirrorCheckpointConnector does not generate task configs if initial consumer group load times out

Reply via email to