Greg Harris created KAFKA-10286:
-----------------------------------

             Summary: Connect system tests should wait for workers to join group
                 Key: KAFKA-10286
                 URL: https://issues.apache.org/jira/browse/KAFKA-10286
             Project: Kafka
          Issue Type: Test
          Components: KafkaConnect
    Affects Versions: 2.6.0
            Reporter: Greg Harris
            Assignee: Greg Harris

There are a few flaky test failures for {{connect_distributed_test}} in which one of the workers does not join the group quickly, and the test fails in the following manner:

# The test starts each of the Connect workers and waits for their REST APIs to become available.
# All workers start up, complete plugin scanning, and start their REST APIs.
# At least one worker kicks off an asynchronous job to join the group, and that job hangs for an as-yet-unknown reason (30s timeout).
# The test continues without all of the members having joined.
# The test makes a call to the REST API that it expects to succeed, and gets an error.
# The test fails without the worker ever joining the group.

Instead of allowing the test to fail in this manner, we could wait for each worker to join the group, using the existing 60s startup timeout. This change would take effect for all system tests that use the {{ConnectDistributedService}}, currently just {{connect_distributed_test}} and {{connect_rest_test}}.

Alternatively, we could retry the operation that failed, or ensure that we use a known-good worker to continue the test, but both options would require more involved code changes. The existing wait-for-startup logic is the most natural place to fix this issue.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
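The wait proposed in the description could be sketched roughly as follows. This is a minimal illustration, not the actual {{ConnectDistributedService}} code: {{wait_until}}, {{start_and_wait}}, and the {{rest_api_up}} and {{joined_group}} predicates are hypothetical names introduced here, and how a test would actually detect that a worker has joined the group is left as an assumption. Only the 60s startup timeout comes from the ticket.

```python
import time
from typing import Callable, Iterable


def wait_until(condition: Callable[[], bool], timeout_sec: float,
               backoff_sec: float = 0.1, err_msg: str = "") -> None:
    """Poll condition() until it returns True or timeout_sec elapses.

    Hypothetical helper written for this sketch.
    """
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(err_msg)
        time.sleep(backoff_sec)


def start_and_wait(workers: Iterable[str],
                   rest_api_up: Callable[[str], bool],
                   joined_group: Callable[[str], bool],
                   startup_timeout_sec: float = 60.0) -> None:
    """Block until every worker serves its REST API and has joined the group.

    rest_api_up and joined_group are assumed predicates supplied by the test
    harness; this sketch does not prescribe how they are implemented.
    """
    for worker in workers:
        wait_until(
            lambda: rest_api_up(worker) and joined_group(worker),
            timeout_sec=startup_timeout_sec,
            err_msg=f"{worker} did not join the group within {startup_timeout_sec}s",
        )
```

With a check like this in the startup path, a worker whose join job hangs would cause a clear startup timeout naming that worker, rather than an unrelated REST error later in the test.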