OmniaGM commented on PR #12366:
URL: https://github.com/apache/kafka/pull/12366#issuecomment-1307124622

   > Thanks @OmniaGM, good idea. I've updated the README and added an 
integration test that verifies that MM2 can still run with exactly-once support 
enabled.
   > 
   > I should note that the `testReplication` test case is currently failing 
locally with timeout issues, but succeeds when I bump the timeout. Going to see 
how the Jenkins build goes; we may choose to increase the timeout in these 
tests if they also fail during CI.
   
   I had similar issue with the timeouts 
   
   > It turns out that the `testReplication` flakiness persisted in Jenkins, 
and was not solved by increasing timeouts.
   > 
   > Instead, the root of the problem was a change in the Connect framework's 
behavior when exactly-once support is enabled. Without exactly-once support, 
`SourceTask::commitRecord` is invoked [as soon as the record is 
ack'd](https://github.com/apache/kafka/blob/c1bb307a361b7b6e17261cc84ea2b108bacca84d/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSourceTask.java#L148)
 by the Kafka cluster, which usually causes those calls to be spread out over 
time. With exactly-once support, `SourceTask::commitRecord` is invoked [for 
every record in a 
transaction](https://github.com/apache/kafka/blob/c1bb307a361b7b6e17261cc84ea2b108bacca84d/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/ExactlyOnceWorkerSourceTask.java#L312)
 once that transaction is committed, which causes a rapid series of calls to 
take place one after the other.
   > 
   > MirrorMaker 2 triggers a (potential) offset sync after every call to 
`commitRecord`, but it has [logic to prevent too many outstanding offset syncs 
from 
accruing](https://github.com/apache/kafka/blob/c1bb307a361b7b6e17261cc84ea2b108bacca84d/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorSourceTask.java#L208-L211).
 The exact limit on the number of outstanding offset requests [is 
ten](https://github.com/apache/kafka/blob/c1bb307a361b7b6e17261cc84ea2b108bacca84d/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorSourceTask.java#L53),
 which is less than the total number of topic partitions being replicated 
during the integration test. As a result, the test became flaky, since 
sometimes MM2 would drop an offset sync for partition 0 of the `test-topic-1` 
and then fail when [checking for offset syncs for that topic 
partition](https://github.com/apache/kafka/blob/c1bb307a361b7b6e17261cc84ea2b108bacca84d/connect/mirror/src/test/java/org/apache/kafka/co
 nnect/mirror/integration/MirrorConnectorsIntegrationBaseTest.java#L319-L320).
   > 
   > Since the behavior change in the Connect framework may have an impact on 
MirrorMaker 2 outside of testing environments, I've tweaked the offset sync 
limit to apply on a per-topic-partition basis. This way, if a flurry of calls 
to `commitRecord` takes place when a transaction is committed, every topic 
partition should still get a chance for an offset sync, but there is still an 
upper bound on the number of outstanding offset syncs (although that bound is 
now proportional to the number of topic partitions being replicated, instead of 
a constant).
   
   Good finding. Should we add logs in `sendOffsetSync` to make it easier to 
find out if MirrorMaker is hitting the limit of `MAX_OUTSTANDING_OFFSET_SYNCS` 
in the future? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to