Asker created KAFKA-18007:
-----------------------------
Summary: MirrorCheckpointConnector fails with “Timeout while
loading consumer groups” after upgrading to Kafka 3.9.0
Key: KAFKA-18007
URL: https://issues.apache.org/jira/browse/KAFKA-18007
Project: Kafka
Issue Type: Bug
Components: mirrormaker
Affects Versions: 3.9.0
Environment: - Kafka Version: Upgraded sequentially from 3.6.0 to 3.9.0
- Clusters: Three clusters named A, B, and C
- Clusters A and B mirror topics to cluster C using MirrorMaker 2
- Number of Consumer Groups: Approximately 200
- Number of Topics: Approximately 2000
- Operating System: Ubuntu 20.04.5 LTS (GNU/Linux 5.4.0-135-generic x86_64)
Reporter: Asker
After upgrading our Kafka clusters from version 3.6.0 to 3.9.0, we started
experiencing repeated errors with the MirrorCheckpointConnector in MirrorMaker
2. The connector fails with a RetriableException stating “Timeout while loading
consumer groups.” This issue persists despite several attempts to resolve it.
Error Message:
{code:bash}
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: [2024-11-11 12:21:57,342] ERROR [Worker clientId=analytics-dev->app-dev, groupId=analytics-dev-mm2] Failed to reconfigure connector's tasks (MirrorCheckpointConnector), retrying after backoff. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:2195)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: org.apache.kafka.connect.errors.RetriableException: Timeout while loading consumer groups.
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.mirror.MirrorCheckpointConnector.taskConfigs(MirrorCheckpointConnector.java:138)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.Worker.connectorTaskConfigs(Worker.java:398)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.reconfigureConnector(DistributedHerder.java:2243)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.reconfigureConnectorTasksWithExponentialBackoffRetries(DistributedHerder.java:2183)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.lambda$null$47(DistributedHerder.java:2199)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.runRequest(DistributedHerder.java:2402)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.tick(DistributedHerder.java:498)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:383)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.lang.Thread.run(Thread.java:840)
{code}
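For anyone triaging the trace above, the following is a minimal sketch of how the journald prefix could be stripped and the exception and stack frames extracted. The function name parse_trace and both regular expressions are illustrative assumptions, not part of Kafka or of the original report, and the input is assumed to hold one log entry per line.

```python
import re

# journald prefix, e.g. "Nov 11 12:21:57 host connect-mirror-maker.sh[2526630]: "
JOURNAL_PREFIX = re.compile(r"^\w{3}\s+\d+ \d{2}:\d{2}:\d{2} \S+ \S+\[\d+\]:\s*")
# Java stack frame, e.g. "at pkg.Class.method(Class.java:138)"
FRAME = re.compile(r"^at (?P<method>[\w.$/]+)\((?P<file>[\w.]+):(?P<line>\d+)\)$")
# Exception header, e.g. "pkg.SomeException: message"
EXCEPTION = re.compile(r"^[\w.$]+(?:Exception|Error)(?::.*)?$")


def parse_trace(log_lines):
    """Return (exception_line, frames) from journald-prefixed log lines.

    frames is a list of (method, file, line_number) tuples in trace order.
    """
    exception = None
    frames = []
    for raw in log_lines:
        text = JOURNAL_PREFIX.sub("", raw.strip())
        frame = FRAME.match(text)
        if frame:
            frames.append((frame["method"], frame["file"], int(frame["line"])))
        elif exception is None and EXCEPTION.match(text):
            exception = text
    return exception, frames
```

Applied to the log in this report, the first frame returned would be the taskConfigs call in MirrorCheckpointConnector, which is where the RetriableException originates.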
Steps to Reproduce:
1. Upgrade Kafka clusters sequentially from 3.6.0 to 3.9.0.
2. Configure MirrorMaker 2 to mirror topics from clusters A and B to cluster C.
3. Start MirrorMaker 2.
4. Observe the logs for the MirrorCheckpointConnector.
What We Tried:
{*}Checked ACLs and Authentication{*}:
- Ensured that the mirror_maker user has the necessary permissions and can
authenticate successfully.
- Verified that we could list consumer groups using kafka-consumer-groups.sh
with the mirror_maker user.
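The listing check above could be timed to compare against the configured timeouts. This is a rough sketch only: the helper names, the bootstrap address, and the client properties file are hypothetical placeholders, while --bootstrap-server, --command-config, and --list are standard kafka-consumer-groups.sh options.

```python
import subprocess
import time


def list_groups_command(bootstrap_server, command_config):
    """Build the kafka-consumer-groups.sh invocation that lists all groups."""
    return [
        "kafka-consumer-groups.sh",
        "--bootstrap-server", bootstrap_server,
        "--command-config", command_config,  # client properties for mirror_maker
        "--list",
    ]


def timed_list_groups(command, run=subprocess.run):
    """Run the command and return (group_names, elapsed_seconds)."""
    start = time.monotonic()
    result = run(command, capture_output=True, text=True, check=True)
    elapsed = time.monotonic() - start
    groups = [line for line in result.stdout.splitlines() if line.strip()]
    return groups, elapsed


# Hypothetical address and file name; substitute real values before running:
# groups, seconds = timed_list_groups(
#     list_groups_command("cluster-a:9092", "mirror_maker.properties"))
```

If the CLI lists the groups well within the timeout while the connector still times out, that would point away from ACLs or raw broker latency as the cause.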
{*}Increased Timeouts{*}:
- Increased admin.timeout.ms to 300000 (5 minutes) and even higher values.
- Adjusted admin.request.timeout.ms and admin.retry.backoff.ms accordingly.
{*}Enabled Detailed Logging{*}:
- Set the logging level to DEBUG for org.apache.kafka.connect.mirror to gain
more insights.
- No additional information that could help resolve the issue was found.
{*}Temporary Workarounds{*}:
- Set emit.checkpoints.enabled and sync.group.offsets.enabled to false to
prevent the MirrorCheckpointConnector from running.
- This is not a viable long-term solution as we need to synchronize consumer
group offsets.
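As an illustration of the workaround above, the sketch below renders the two settings as per-flow lines for an MM2 properties file. The report does not say whether the settings were applied per flow or globally, so the source->target prefix form and the helper name checkpoint_workaround are assumptions; the cluster aliases A, B, and C come from the Environment section.

```python
def checkpoint_workaround(source, target):
    """Render per-flow properties lines that switch off checkpointing."""
    flow = f"{source}->{target}"
    settings = {
        "emit.checkpoints.enabled": "false",
        "sync.group.offsets.enabled": "false",
    }
    return [f"{flow}.{key} = {value}" for key, value in settings.items()]


# Both flows described in the report mirror into cluster C.
lines = checkpoint_workaround("A", "C") + checkpoint_workaround("B", "C")
print("\n".join(lines))
```

With these values in place, consumer group offsets are not translated to cluster C, which is why the report treats this as a stopgap only.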
Resolution:
{*}Rolled Back to Kafka 3.8.1{*}:
- As a test, we downgraded our Kafka clusters back to version 3.8.1.
- After the downgrade, the error disappeared, and the
MirrorCheckpointConnector functioned correctly.
- This suggests that the issue was introduced in version 3.9.0.
Analysis:
{*}Possible Relation to KAFKA-17232{*}:
- We found the JIRA issue KAFKA-17232 titled “MirrorCheckpointConnector does
not generate task configs if initial consumer group load times out.”
- It appears that changes introduced in Kafka 3.9.0 related to this issue may
have inadvertently caused our problem.
- However, our clusters are not particularly large, and the initial consumer
group load should not exceed the timeouts.
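A back-of-the-envelope check of the claim above, using the figures from this report (roughly 200 consumer groups, admin.timeout.ms raised to 300000). It assumes, purely for illustration, that groups are loaded one at a time; how the connector actually issues its admin requests is not established here.

```python
def per_group_budget_ms(timeout_ms, group_count):
    """Average time available per group if groups were loaded sequentially."""
    return timeout_ms / group_count


# 300000 ms timeout spread over about 200 groups
budget = per_group_budget_ms(300_000, 200)
print(budget)  # 1500.0
```

Even under that pessimistic assumption each group would have about 1.5 seconds, so a timeout caused by cluster size alone seems unlikely.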
Request:
{*}Assistance in Resolving the Issue{*}:
- Is there a known workaround or configuration change that can prevent this
error in Kafka 3.9.0?
- Could the changes made in KAFKA-17232 have unintentionally caused this
problem?
- Are there plans to address this issue in an upcoming release?
{*}Guidance on Next Steps{*}:
- Should we avoid upgrading to versions beyond 3.8.1 until this issue is
resolved?
- Is it advisable to apply any patches or pull requests manually?
Thank you for your attention to this matter. Please let me know if I can
provide any additional information to help resolve this issue.
Best regards,
Asker Kakhramanov