Hello.

We have an open issue, KAFKA-8104 "Consumer cannot rejoin to the group after 
rebalancing" [1].
It reproduces in many production environments.

I have prepared a reproducer and a fix [2] for this issue.
However, I need assistance with a "fair" reproducer.

Please help me with the review and with a "fair" reproducer.

The PR fixes a race condition between the "consumer thread" and the 
"consumer coordinator heartbeat thread".

Conditions for reproducing:

1. The consumer thread initiates a rejoin to the group because of a commit 
timeout. It calls `AbstractCoordinator#joinGroupIfNeeded`, which leads to 
`sendJoinGroupRequest`.
2. `JoinGroupResponseHandler` writes the new generation data to 
`AbstractCoordinator.this.generation` and leaves the `synchronized` section.
3. The heartbeat thread executes `maybeLeaveGroup` and clears the generation 
data via `resetGenerationOnLeaveGroup`.
4. The consumer thread executes `onJoinComplete(generation.generationId, 
generation.memberId, generation.protocol, memberAssignment);` with the cleared 
generation data. This leads to the corresponding exception.
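
The steps above can be modelled without Kafka at all. The following is only a 
minimal sketch, not the actual `AbstractCoordinator` code: `RaceModel`, its 
`Generation` class, `NO_GENERATION` and `generationForJoinComplete` are 
hypothetical stand-ins. Steps 2-4 are run sequentially in the problematic 
order to show what the consumer thread ends up reading.

```java
public class RaceModel {
    // Hypothetical stand-in for the coordinator's generation data.
    static final class Generation {
        final int generationId;
        final String memberId;

        Generation(int generationId, String memberId) {
            this.generationId = generationId;
            this.memberId = memberId;
        }
    }

    static final Generation NO_GENERATION = new Generation(-1, "");

    private Generation generation = NO_GENERATION;

    // Step 2: the join response handler stores the new generation under the lock.
    synchronized void onJoinResponse(int generationId, String memberId) {
        generation = new Generation(generationId, memberId);
    }

    // Step 3: the heartbeat thread resets the generation while leaving the group.
    synchronized void maybeLeaveGroup() {
        generation = NO_GENERATION;
    }

    // Step 4: the consumer thread reads the generation after the lock was released.
    synchronized Generation generationForJoinComplete() {
        return generation;
    }

    // Runs steps 2-4 in the problematic order and returns what step 4 observes.
    static Generation runProblematicInterleaving() {
        RaceModel coordinator = new RaceModel();
        coordinator.onJoinResponse(5, "member-1");
        coordinator.maybeLeaveGroup();
        return coordinator.generationForJoinComplete();
    }

    public static void main(String[] args) {
        Generation seen = runProblematicInterleaving();
        System.out.println("generationId seen in step 4: " + seen.generationId);
    }
}
```

Each method is individually synchronized, yet step 4 still sees the reset 
value, because the lock is released between steps 2 and 4.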

The race is fixed by a condition in `maybeLeaveGroup`: if a rejoin is in 
progress in the consumer thread, there is no reason to reset the generation 
data and send a `LeaveGroupRequest` from the heartbeat thread.
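
To illustrate the idea of the guard, here is a minimal sketch. It assumes a 
`rejoinInProgress` flag; that flag, the `GuardedCoordinator` class and every 
method name except `maybeLeaveGroup` are hypothetical and may not match what 
the PR actually checks.

```java
public class GuardedCoordinator {
    static final int NO_GENERATION_ID = -1;

    private int generationId = NO_GENERATION_ID;

    // Hypothetical flag: true while the consumer thread is between sending
    // the join request and finishing onJoinComplete.
    private boolean rejoinInProgress = false;

    synchronized void startRejoin() {
        rejoinInProgress = true;
    }

    synchronized void onJoinResponse(int newGenerationId) {
        generationId = newGenerationId;
    }

    synchronized void completeRejoin() {
        rejoinInProgress = false;
    }

    // Returns true if the generation was reset (this is where a leave
    // request would be sent), false if the call was skipped by the guard.
    synchronized boolean maybeLeaveGroup() {
        if (rejoinInProgress) {
            return false;
        }
        generationId = NO_GENERATION_ID;
        return true;
    }

    synchronized int generationId() {
        return generationId;
    }

    public static void main(String[] args) {
        GuardedCoordinator coordinator = new GuardedCoordinator();
        coordinator.startRejoin();
        coordinator.onJoinResponse(7);
        System.out.println("reset during rejoin: " + coordinator.maybeLeaveGroup());
        System.out.println("generationId: " + coordinator.generationId());
    }
}
```

With the guard in place, a `maybeLeaveGroup` call that arrives between the 
join response and `onJoinComplete` leaves the new generation untouched.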

The PR contains an "unfair" reproducer.
It is implemented with a `CountDownLatch` that imitates the described race in 
the `AbstractCoordinator` code.
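
For readers unfamiliar with the technique, here is a rough sketch of how a 
`CountDownLatch` can force such an interleaving between two threads. It is not 
the test from the PR: `LatchedRace` and all of its members are hypothetical, 
and the shared field only imitates the generation data.

```java
import java.util.concurrent.CountDownLatch;

public class LatchedRace {
    static final int NO_GENERATION_ID = -1;

    private int generationId = NO_GENERATION_ID;

    synchronized void setGeneration(int id) {
        generationId = id;
    }

    synchronized void resetGeneration() {
        generationId = NO_GENERATION_ID;
    }

    synchronized int generationId() {
        return generationId;
    }

    // Forces the order: consumer writes, heartbeat resets, consumer reads.
    // Returns the generation id that the "consumer" thread observes.
    static int reproduce() {
        LatchedRace coordinator = new LatchedRace();
        CountDownLatch generationWritten = new CountDownLatch(1);
        CountDownLatch generationReset = new CountDownLatch(1);
        int[] observed = {Integer.MIN_VALUE};

        Thread consumer = new Thread(() -> {
            coordinator.setGeneration(5);      // imitates step 2
            generationWritten.countDown();     // let the heartbeat thread run
            try {
                generationReset.await();       // wait until the reset happened
                observed[0] = coordinator.generationId(); // imitates step 4
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "consumer");

        Thread heartbeat = new Thread(() -> {
            try {
                generationWritten.await();     // run only after step 2
                coordinator.resetGeneration(); // imitates step 3
                generationReset.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "heartbeat");

        consumer.start();
        heartbeat.start();
        try {
            consumer.join();
            heartbeat.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return observed[0];
    }

    public static void main(String[] args) {
        System.out.println("observed generationId: " + reproduce());
    }
}
```

The latches make the outcome deterministic, which is exactly why such a 
reproducer is "unfair": the ordering is imposed by the test rather than 
arising from real timing.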



[1] https://issues.apache.org/jira/browse/KAFKA-8104
[2] https://github.com/apache/kafka/pull/7460
