[ https://issues.apache.org/jira/browse/KAFKA-15467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769707#comment-17769707 ]
Steve Jacobs commented on KAFKA-15467:
--------------------------------------

Rolling back to MirrorMaker2 3.4.1 to see if that fixes the issue. No one else has any ideas? This seems like a pretty major bug, if it is in fact a bug...

On reconnect mm2 cannot find offset that exists in remote cluster and re-syncs the entire topic
-----------------------------------------------------------------------------------------------

                 Key: KAFKA-15467
                 URL: https://issues.apache.org/jira/browse/KAFKA-15467
             Project: Kafka
          Issue Type: Bug
          Components: mirrormaker
    Affects Versions: 3.5.1
            Reporter: Steve Jacobs
            Priority: Major

I'm encountering an odd issue with mirrormaker2 in our remote replication setup to high-latency remote sites (satellite links).

Every few days several topics get completely re-replicated. This appears to happen after a network connectivity outage; it doesn't matter whether the outage is long (hours) or short (minutes), and it only seems to affect a few topics.

I was finally able to track down some logs showing the issue, from after an outage of roughly an hour during which connectivity was down. There were lots of logs about connection timeouts, etc. Here is the relevant part from when the connection came back up:

{code:java}
2023-09-08 16:52:45,380 INFO [scbi->gcp.MirrorSourceConnector|worker] [AdminClient clientId=mm2-admin-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector|replication-source-admin] Disconnecting from node 0 due to socket connection setup timeout. The timeout value is 63245 ms. (org.apache.kafka.clients.NetworkClient) [kafka-admin-client-thread | mm2-admin-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector|replication-source-admin]
2023-09-08 16:52:45,380 INFO [scbi->gcp.MirrorSourceConnector|worker] [AdminClient clientId=mm2-admin-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector|replication-source-admin] Metadata update failed (org.apache.kafka.clients.admin.internals.AdminMetadataManager) [kafka-admin-client-thread | mm2-admin-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector|replication-source-admin]
2023-09-08 16:52:47,029 INFO [scbi->gcp.MirrorSourceConnector|task-1] [Consumer clientId=mm2-consumer-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector-1|replication-consumer, groupId=null] Disconnecting from node 0 due to socket connection setup timeout. The timeout value is 52624 ms. (org.apache.kafka.clients.NetworkClient) [task-thread-scbi->gcp.MirrorSourceConnector-1]
2023-09-08 16:52:47,029 INFO [scbi->gcp.MirrorSourceConnector|task-1] [Consumer clientId=mm2-consumer-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector-1|replication-consumer, groupId=null] Error sending fetch request (sessionId=460667411, epoch=INITIAL) to node 0: (org.apache.kafka.clients.FetchSessionHandler) [task-thread-scbi->gcp.MirrorSourceConnector-1]
2023-09-08 16:52:47,336 INFO [scbi->gcp.MirrorSourceConnector|worker] refreshing topics took 67359 ms (org.apache.kafka.connect.mirror.Scheduler) [Scheduler for MirrorSourceConnector: scbi->gcp|scbi->gcp.MirrorSourceConnector-refreshing topics]
2023-09-08 16:52:48,413 INFO [scbi->gcp.MirrorSourceConnector|task-1] [Consumer clientId=mm2-consumer-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector-1|replication-consumer, groupId=null] Fetch position FetchPosition{offset=4918131, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[kafka.scbi.eng.neoninternal.org:9094 (id: 0 rack: null)], epoch=0}} is out of range for partition reading.sensor.hfp01sc-0, resetting offset (org.apache.kafka.clients.consumer.internals.AbstractFetch) [task-thread-scbi->gcp.MirrorSourceConnector-1]
(Repeats for 11 more topics)
2023-09-08 16:52:48,479 INFO [scbi->gcp.MirrorSourceConnector|task-1] [Consumer clientId=mm2-consumer-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector-1|replication-consumer, groupId=null] Resetting offset for partition reading.sensor.hfp01sc-0 to position FetchPosition{offset=3444977, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[kafka.scbi.eng.neoninternal.org:9094 (id: 0 rack: null)], epoch=0}}. (org.apache.kafka.clients.consumer.internals.SubscriptionState) [task-thread-scbi->gcp.MirrorSourceConnector-1]
(Repeats for 11 more topics)
{code}

The consumer reports that offset 4918131 is out of range for this topic/partition, but that offset still exists on the remote cluster; I can go pull it up with a consumer right now. The earliest offset that still exists in that topic is 3444977 as of yesterday.
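For reference, this is roughly what I mean by pulling it up with a consumer (a minimal sketch: the broker address, topic, partition, and offset are taken from the logs above, and any security settings the cluster needs are omitted):

{code:java}
// Minimal verification sketch: seek straight to the offset MM2 claimed was
// out of range and read it back from the source cluster.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class VerifyOffsetExists {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka.scbi.eng.neoninternal.org:9094");
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());
        // "none" makes poll() throw OffsetOutOfRangeException if the seek
        // target really is gone, instead of silently resetting the position.
        props.put("auto.offset.reset", "none");

        TopicPartition tp = new TopicPartition("reading.sensor.hfp01sc", 0);
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, 4918131L); // the offset MM2 said was out of range
            for (ConsumerRecord<byte[], byte[]> record : consumer.poll(Duration.ofSeconds(10))) {
                System.out.printf("offset %d exists (timestamp %d)%n",
                        record.offset(), record.timestamp());
                break;
            }
        }
    }
}
{code}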
We have 30-day retention configured, so pulling in 30 days of duplicate data (roughly 1.47 million records for this one partition alone, from offset 3444977 back up to 4918131) is a serious problem. It almost seems like a race condition: we replicate 38 topics, but only 12 were affected on this occurrence.

The number of topics affected seems to vary each time. Today I see one site re-sending 2 topics, and another re-sending 13.
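Since the set of affected topics changes per occurrence, a check like the following can show which saved fetch positions are still in range on the source cluster after an outage. This is a hypothetical helper, not something MM2 provides: only the broker address and the one example offset come from the logs above, and the map of last-fetched offsets is a placeholder for whatever offsets MM2 had committed.

{code:java}
// Hypothetical helper: compare saved fetch offsets against the earliest and
// latest offsets the source cluster still has, to see which partitions would
// trip an "out of range" reset.
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.common.TopicPartition;

public class CheckFetchPositions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka.scbi.eng.neoninternal.org:9094");

        // Placeholder: the positions MM2 was fetching from, e.g. recovered
        // from its offsets topic or from the "out of range" log lines.
        Map<TopicPartition, Long> lastFetched = new HashMap<>();
        lastFetched.put(new TopicPartition("reading.sensor.hfp01sc", 0), 4918131L);

        Map<TopicPartition, OffsetSpec> earliestQuery = new HashMap<>();
        Map<TopicPartition, OffsetSpec> latestQuery = new HashMap<>();
        for (TopicPartition tp : lastFetched.keySet()) {
            earliestQuery.put(tp, OffsetSpec.earliest());
            latestQuery.put(tp, OffsetSpec.latest());
        }

        try (Admin admin = Admin.create(props)) {
            ListOffsetsResult earliest = admin.listOffsets(earliestQuery);
            ListOffsetsResult latest = admin.listOffsets(latestQuery);
            for (Map.Entry<TopicPartition, Long> e : lastFetched.entrySet()) {
                long lo = earliest.partitionResult(e.getKey()).get().offset();
                long hi = latest.partitionResult(e.getKey()).get().offset();
                boolean inRange = e.getValue() >= lo && e.getValue() <= hi;
                System.out.printf("%s earliest=%d latest=%d fetch=%d inRange=%b%n",
                        e.getKey(), lo, hi, e.getValue(), inRange);
            }
        }
    }
}
{code}

In the incident above this check would report inRange=true for reading.sensor.hfp01sc-0 (4918131 is well above the earliest retained offset 3444977), which is exactly what makes the broker's out-of-range response so puzzling.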