[ 
https://issues.apache.org/jira/browse/KAFKA-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764676#comment-16764676
 ] 

ASF GitHub Bot commented on KAFKA-7909:
---------------------------------------

wicknicks commented on pull request #6251: KAFKA-7909: Coordinator changes to 
fix flakiness of Connect Test
URL: https://github.com/apache/kafka/pull/6251
 
 
   KAFKA-7909: Coordinator changes to fix flakiness of Connect Integration Test
   
   This commit introduces one change and reverts a recent one. First, in
   GroupCoordinator, we attempt to complete the JoinGroup when the last
   member joins. Second, we revert a recent change in AbstractCoordinator
   that altered how a Generation is determined to be valid. The revert is
   needed in the following situation: a group of consumers is killed in
   the midst of a JoinGroup operation. The consumers that had correctly
   initiated the JoinGroup request, and were just waiting for the last
   member(s) to join, try to LeaveGroup but are unable to do so because
   they are not considered valid. When these consumers are restarted, the
   old members are still in the group, and the new JoinGroup never
   succeeds.
   
   Signed-off-by: Arjun Satish <ar...@confluent.io>
   
   After these changes, we are able to run the flaky
   [test](https://github.com/apache/kafka/blob/dc935c4/connect/runtime/src/test/java/org/apache/kafka/connect/integration/ExampleConnectIntegrationTest.java#L105)
   more than 300 times without seeing any failures.
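
As a rough illustration of the first change described above (attempting to complete the JoinGroup as soon as the last member joins), here is a minimal sketch. It is a toy model, not Kafka's GroupCoordinator: the class `ToyGroup` and all of its methods are hypothetical names invented for this example, and timeouts, heartbeats and the delayed-operation purgatory are left out entirely.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical toy model of a group during a rebalance; not Kafka code.
class ToyGroup {
    private final Set<String> expected;                 // members the rebalance waits for
    private final Set<String> joined = new HashSet<>(); // members that sent JoinGroup
    private boolean joinCompleted = false;
    private int generation = 0;

    ToyGroup(Set<String> expectedMembers) {
        this.expected = new HashSet<>(expectedMembers);
    }

    // Records a JoinGroup from memberId, then immediately tries to complete
    // the join instead of waiting for a separate timer to fire.
    boolean join(String memberId) {
        if (!expected.contains(memberId)) {
            throw new IllegalArgumentException("unknown member: " + memberId);
        }
        joined.add(memberId);
        return maybeCompleteJoin();
    }

    // Number of expected members that have not joined yet.
    int pendingMembers() {
        return expected.size() - joined.size();
    }

    int generation() {
        return generation;
    }

    // Completes the join exactly once, when no pending members remain.
    private boolean maybeCompleteJoin() {
        if (!joinCompleted && joined.containsAll(expected)) {
            joinCompleted = true;
            generation++;
        }
        return joinCompleted;
    }
}
```

In this sketch the call to `join` made by the last expected member is the one that returns `true`, so no member is left waiting once `pendingMembers()` reaches zero.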
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 


> Coordinator changes cause Connect integration test to fail
> ----------------------------------------------------------
>
>                 Key: KAFKA-7909
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7909
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer, core
>    Affects Versions: 2.2.0
>            Reporter: Arjun Satish
>            Assignee: Arjun Satish
>            Priority: Blocker
>             Fix For: 2.2.0
>
>
> We recently introduced integration tests in Connect. These tests spin up one 
> or more Connect workers along with a Kafka broker and ZooKeeper in a single 
> process and attempt to move records using a Connector. In the [Example Integration 
> Test|https://github.com/apache/kafka/blob/3c73633/connect/runtime/src/test/java/org/apache/kafka/connect/integration/ExampleConnectIntegrationTest.java#L105],
>  we spin up three workers, each hosting a Connector task that consumes records 
> from a Kafka topic. When the connector starts up, it may go through multiple 
> rounds of rebalancing. We have noticed the following two problems in the last 
> few days:
>  # After members join a group, there are no pendingMembers remaining, but the 
> join group method does not complete and does not signal these members that 
> they are now ready to start consuming from their respective partitions.
>  # Because of quick rebalances, a consumer might have started a group, but 
> Connect then starts a rebalance, after which we create three new instances of 
> the consumer (one from each worker/task). The group coordinator, however, seems 
> to have 4 members in the group, which causes the JoinGroup to stall indefinitely. 
> Even though this ticket is described in the context of Connect, it may be 
> applicable to general consumers.
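
To make the second problem in the quoted description more concrete, the following is a small sketch of how one stale member can keep a rebalance from finishing when three new consumers rejoin. It is a deliberately simplified model under stated assumptions: `ToyMembership` and its methods are hypothetical names, the rebalance is assumed to wait for every known member, and session and rebalance timeouts are ignored.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical toy model of group membership; not Kafka code.
class ToyMembership {
    private final Set<String> members = new LinkedHashSet<>();  // all known members
    private final Set<String> rejoined = new LinkedHashSet<>(); // members that rejoined

    // Registers a member from an earlier generation without rejoining it.
    void add(String memberId) {
        members.add(memberId);
    }

    // A member (new or known) sends JoinGroup for the current rebalance.
    void rejoin(String memberId) {
        members.add(memberId);
        rejoined.add(memberId);
    }

    // LeaveGroup: drop the member so the rebalance no longer waits for it.
    void leave(String memberId) {
        members.remove(memberId);
        rejoined.remove(memberId);
    }

    // Assumption of this sketch: the rebalance finishes only when every
    // known member has rejoined.
    boolean rebalanceComplete() {
        return !members.isEmpty() && rejoined.containsAll(members);
    }

    int size() {
        return members.size();
    }
}
```

With one leftover member plus three rejoining consumers, `size()` is 4 and `rebalanceComplete()` stays `false` until the leftover member is removed via `leave`, which is the step the old consumers were unable to perform.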



