Greg Harris created KAFKA-10286:
-----------------------------------

             Summary: Connect system tests should wait for workers to join group
                 Key: KAFKA-10286
                 URL: https://issues.apache.org/jira/browse/KAFKA-10286
             Project: Kafka
          Issue Type: Test
          Components: KafkaConnect
    Affects Versions: 2.6.0
            Reporter: Greg Harris
            Assignee: Greg Harris


There are a few flaky test failures for {{connect_distributed_test}} in which 
one of the workers does not join the group quickly, and the test fails in the 
following manner:
 # The test starts each of the connect workers, and waits for their REST APIs 
to become available
 # All workers start up, complete plugin scanning, and start their REST API
 # At least one worker kicks off an asynchronous job to join the group, which 
hangs for an as-yet-unknown reason (30s timeout)
 # The test continues without all of the members joined
 # The test makes a call to the REST API that it expects to succeed, and gets 
an error
 # The test fails without the worker ever joining the group

Instead of allowing the test to fail in this manner, we could wait for each 
worker to join the group with the existing 60s startup timeout. This change 
would go into effect for all system tests using the 
{{ConnectDistributedService}}, currently just {{connect_distributed_test}} and 
{{connect_rest_test}}. 
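
As a rough illustration of the proposed wait, here is a minimal Python sketch. It is not the actual {{ConnectDistributedService}} code: {{wait_until}} is a self-contained polling helper written for this example, {{has_joined_group}} is a hypothetical predicate supplied by the caller (how membership is actually detected is not specified here), and applying the 60s timeout to each worker separately is an assumption.

```python
import time


def wait_until(condition, timeout_sec, backoff_sec=0.1, err_msg="condition not met"):
    """Poll `condition` until it returns True, or raise once `timeout_sec` elapses."""
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(err_msg)
        time.sleep(backoff_sec)


def wait_for_workers_to_join(workers, has_joined_group, timeout_sec=60, backoff_sec=0.1):
    """Block until every worker reports that it has joined the group.

    `has_joined_group(worker)` is a hypothetical check provided by the caller.
    The timeout is applied to each worker in turn (an assumption of this sketch).
    """
    for worker in workers:
        wait_until(
            lambda w=worker: has_joined_group(w),
            timeout_sec=timeout_sec,
            backoff_sec=backoff_sec,
            err_msg="worker %s did not join the group within %s seconds"
                    % (worker, timeout_sec),
        )
```

With a check like this in the startup path, a worker that never joins would fail the test during startup with a clear timeout message, instead of failing later on an unrelated REST call.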

Alternatively, we could retry the operation that failed, or ensure that we use a 
known-good worker to continue the test, but these would require more involved 
code changes. The existing wait-for-startup logic is the most natural place to 
fix this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
