OmniaGM commented on PR #12366: URL: https://github.com/apache/kafka/pull/12366#issuecomment-1307124622
> Thanks @OmniaGM, good idea. I've updated the README and added an integration test that verifies that MM2 can still run with exactly-once support enabled. > > I should note that the `testReplication` test case is currently failing locally with timeout issues, but succeeds when I bump the timeout. Going to see how the Jenkins build goes; we may choose to increase the timeout in these tests if they also fail during CI. I had similar issue with the timeouts > It turns out that the `testReplication` flakiness persisted in Jenkins, and was not solved by increasing timeouts. > > Instead, the root of the problem was a change in the Connect framework's behavior when exactly-once support is enabled. Without exactly-once support, `SourceTask::commitRecord` is invoked [as soon as the record is ack'd](https://github.com/apache/kafka/blob/c1bb307a361b7b6e17261cc84ea2b108bacca84d/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSourceTask.java#L148) by the Kafka cluster, which usually causes those calls to be spread out over time. With exactly-once support, `SourceTask::commitRecord` is invoked [for every record in a transaction](https://github.com/apache/kafka/blob/c1bb307a361b7b6e17261cc84ea2b108bacca84d/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/ExactlyOnceWorkerSourceTask.java#L312) once that transaction is committed, which causes a rapid series of calls to take place one after the other. > > MirrorMaker 2 triggers a (potential) offset sync after every call to `commitRecord`, but it has [logic to prevent too many outstanding offset syncs from accruing](https://github.com/apache/kafka/blob/c1bb307a361b7b6e17261cc84ea2b108bacca84d/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorSourceTask.java#L208-L211). The exact limit on the number of outstanding offset requests [is ten](https://github.com/apache/kafka/blob/c1bb307a361b7b6e17261cc84ea2b108bacca84d/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorSourceTask.java#L53), which is less than the total number of topic partitions being replicated during the integration test. As a result, the test became flaky, since sometimes MM2 would drop an offset sync for partition 0 of the `test-topic-1` and then fail when [checking for offset syncs for that topic partition](https://github.com/apache/kafka/blob/c1bb307a361b7b6e17261cc84ea2b108bacca84d/connect/mirror/src/test/java/org/apache/kafka/co nnect/mirror/integration/MirrorConnectorsIntegrationBaseTest.java#L319-L320). > > Since the behavior change in the Connect framework may have an impact on MirrorMaker 2 outside of testing environments, I've tweaked the offset sync limit to apply on a per-topic-partition basis. This way, if a flurry of calls to `commitRecord` takes place when a transaction is committed, every topic partition should still get a chance for an offset sync, but there is still an upper bound on the number of outstanding offset syncs (although that bound is now proportional to the number of topic partitions being replicated, instead of a constant). Good finding. Should we add logs in `sendOffsetSync` to make it easier to find out if MirrorMaker is hitting the limit of `MAX_OUTSTANDING_OFFSET_SYNCS` in the future? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org