[ https://issues.apache.org/jira/browse/KAFKA-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876318#comment-16876318 ]
Randall Hauch commented on KAFKA-8391: -------------------------------------- Finally got a chance to look at this. Here's what I think is going on: The integration test does the following: # start Connect distributed (3 workers) # start the connector and wait for the connector and all tasks to start completely # verify at least N records were produced on the output topic # change the configuration of the connector to use a different topic # waits for the connector and all tasks to be running # verify at least N records were produced on the new output topic Logs show that after the test changes the connector configuration (step 4), the distributed worker immediately updates the connector config, restarts the connector with the new configuration, gets the latest task configurations and writes them to the internal config topic -- all before the `PUT` request returns and before step 4 completes. However, the tasks actually don't get restarted until the workers running those tasks detect the new task configurations, and this could take a few seconds. Note that step 5 occurs immediately after step 4 ends, so it's possible that step 5 starts before the workers all see the updates and stop their tasks. IOW, step 5 might actually succeed *before* any of the tasks are restarted. And because restarting the tasks takes a bit of time, the test's self-imposed constraint of completing step 6 within 15 seconds might not actually work. This could be corrected a few ways. One option would be to just increase the maximum time required to complete step 6, but this could still result in flaky tests. Another option is to sleep for some fixed period of time after step 4 before starting step 5, but this might still be flaky if the sleep time is too short or result in the test sleeping much longer than needed if the sleep time is too long. The best way to fix this is to actually wait for the tasks to restart. Because we're testing the Connect framework, the integration tests already use `MonitorableSourceConnector` and `MonitorableSinkConnector`, which are special connectors that make use of the statically-accessible connector and task handles to record various activities within the connector and tasks. We can modify these handles to allow the tests to set an expectation that the connector and tasks will be restarted, so that the tests can subsequently cause the restart and then wait for the restart to complete. This requires a bit of code, but it will be the most reliable way of waiting until the connectors and tasks are restarted without being flaky and without waiting too long. See https://github.com/apache/kafka/pull/7019 for a proposed solution / fix. > Flaky Test RebalanceSourceConnectorsIntegrationTest#testDeleteConnector > ----------------------------------------------------------------------- > > Key: KAFKA-8391 > URL: https://issues.apache.org/jira/browse/KAFKA-8391 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect > Affects Versions: 2.3.0 > Reporter: Matthias J. Sax > Assignee: Randall Hauch > Priority: Critical > Labels: flaky-test > Fix For: 2.4.0 > > > [https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/4747/testReport/junit/org.apache.kafka.connect.integration/RebalanceSourceConnectorsIntegrationTest/testDeleteConnector/] > {quote}java.lang.AssertionError: Condition not met within timeout 30000. > Connector tasks did not stop in time. at > org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:375) at > org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:352) at > org.apache.kafka.connect.integration.RebalanceSourceConnectorsIntegrationTest.testDeleteConnector(RebalanceSourceConnectorsIntegrationTest.java:166){quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005)