[ 
https://issues.apache.org/jira/browse/KAFKA-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876318#comment-16876318
 ] 

Randall Hauch commented on KAFKA-8391:
--------------------------------------

Finally got a chance to look at this. Here's what I think is going on:

The integration test does the following:
# start Connect distributed (3 workers)
# start the connector and wait for the connector and all tasks to start 
completely 
# verify at least N records were produced on the output topic
# change the configuration of the connector to use a different topic
# wait for the connector and all tasks to be running
# verify at least N records were produced on the new output topic

Logs show that after the test changes the connector configuration (step 4), the 
distributed worker immediately updates the connector config, restarts the 
connector with the new configuration, gets the latest task configurations, and 
writes them to the internal config topic -- all before the `PUT` request 
returns and step 4 completes. However, the tasks are not actually restarted 
until the workers running them detect the new task configurations, which can 
take a few seconds. Step 5 begins immediately after step 4 ends, so it can 
start before the workers have all seen the updates and stopped their tasks. In 
other words, step 5 might succeed *before* any of the tasks are restarted. And 
because restarting the tasks takes some time, step 6 might not finish within 
the test's self-imposed limit of 15 seconds.
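
To make the race concrete, here is a minimal sketch of why a plain "is it 
running?" check in step 5 can pass against the old tasks. `SimulatedTask` and 
`StaleStatusRaceSketch` are hypothetical stand-ins written for this comment; 
they are not Kafka Connect classes and do not model the real worker.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a task's observable state; not a Connect class. */
final class SimulatedTask {
    private volatile boolean running = true;
    // The task has been started once by the time step 4 runs.
    private final AtomicInteger startCount = new AtomicInteger(1);

    boolean isRunning() {
        return running;
    }

    int startCount() {
        return startCount.get();
    }

    /** What a worker does once it finally sees the new task configuration. */
    void restart() {
        running = false;
        startCount.incrementAndGet();
        running = true;
    }
}

final class StaleStatusRaceSketch {
    /** Step 5 as described: true even if the task was never restarted. */
    static boolean naiveCheck(SimulatedTask task) {
        return task.isRunning();
    }

    /** Only true once the task has been started again since the snapshot. */
    static boolean restartAwareCheck(SimulatedTask task, int startsBefore) {
        return task.isRunning() && task.startCount() > startsBefore;
    }
}
```

With a snapshot of `startCount()` taken before step 4, `naiveCheck` returns 
true immediately, while `restartAwareCheck` stays false until `restart()` has 
run.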

This could be corrected a few ways. One option would be to just increase the 
maximum time allowed for step 6, but the test could still be flaky. Another 
option is to sleep for some fixed period of time after step 4 before starting 
step 5, but the test might still be flaky if the sleep time is too short, or 
it might sleep much longer than needed if the sleep time is too long.

The best way to fix this is to actually wait for the tasks to restart. Because 
we're testing the Connect framework, the integration tests already use 
`MonitorableSourceConnector` and `MonitorableSinkConnector`, which are special 
connectors that make use of the statically-accessible connector and task 
handles to record various activities within the connector and tasks. We can 
modify these handles to allow the tests to set an expectation that the 
connector and tasks will be restarted, so that the tests can subsequently cause 
the restart and then wait for the restart to complete. This requires a bit of 
code, but it will be the most reliable way of waiting until the connectors and 
tasks are restarted without being flaky and without waiting too long.
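
As a rough illustration of the idea (not the code in the pull request), the 
expectation could be built on `java.util.concurrent.CountDownLatch`. The names 
`RestartTrackingHandle`, `RestartExpectation`, `expectRestarts`, 
`recordTaskStop` and `recordTaskStart` are made up for this sketch.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical handle shared by the test and the monitorable tasks. */
final class RestartTrackingHandle {
    // Zero-count latches until a test registers an expectation.
    private volatile CountDownLatch stopLatch = new CountDownLatch(0);
    private volatile CountDownLatch startLatch = new CountDownLatch(0);

    /** The test calls this BEFORE triggering the configuration change. */
    synchronized RestartExpectation expectRestarts(int taskCount) {
        CountDownLatch stops = new CountDownLatch(taskCount);
        CountDownLatch starts = new CountDownLatch(taskCount);
        this.stopLatch = stops;
        this.startLatch = starts;
        return new RestartExpectation(stops, starts);
    }

    /** Called from a task's stop() method. */
    void recordTaskStop() {
        stopLatch.countDown();
    }

    /** Called from a task's start() method. */
    void recordTaskStart() {
        startLatch.countDown();
    }
}

/** Lets the test block until every expected stop and start has happened. */
final class RestartExpectation {
    private final CountDownLatch stops;
    private final CountDownLatch starts;

    RestartExpectation(CountDownLatch stops, CountDownLatch starts) {
        this.stops = stops;
        this.starts = starts;
    }

    /** Returns true only if all stops and all starts occur within the timeout. */
    boolean await(long timeout, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        try {
            if (!stops.await(timeout, unit)) {
                return false;
            }
            // Spend only the time that is left on waiting for the starts.
            long remaining = deadline - System.nanoTime();
            return starts.await(remaining, TimeUnit.NANOSECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

In a test, the order would be: call `expectRestarts(numTasks)`, issue the 
config change, then call `await(...)` on the returned expectation before 
checking the new output topic. Because the expectation is registered before 
the change is made, a restart cannot be missed, and the wait ends as soon as 
the last task has started.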

See https://github.com/apache/kafka/pull/7019 for a proposed solution / fix.

> Flaky Test RebalanceSourceConnectorsIntegrationTest#testDeleteConnector
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-8391
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8391
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 2.3.0
>            Reporter: Matthias J. Sax
>            Assignee: Randall Hauch
>            Priority: Critical
>              Labels: flaky-test
>             Fix For: 2.4.0
>
>
> [https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/4747/testReport/junit/org.apache.kafka.connect.integration/RebalanceSourceConnectorsIntegrationTest/testDeleteConnector/]
> {quote}java.lang.AssertionError: Condition not met within timeout 30000. 
> Connector tasks did not stop in time. at 
> org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:375) at 
> org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:352) at 
> org.apache.kafka.connect.integration.RebalanceSourceConnectorsIntegrationTest.testDeleteConnector(RebalanceSourceConnectorsIntegrationTest.java:166){quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
