Hello. We have an open issue, KAFKA-8104 "Consumer cannot rejoin to the group after rebalancing" [1]. It reproduces in many production environments.
I have prepared a reproducer and a fix [2] for this issue, but I need assistance with a "fair" reproducer. Please help me with the review and with the "fair" reproducer.

The PR fixes a race condition between the consumer thread and the consumer coordinator heartbeat thread.

Conditions for reproducing:

1. The consumer thread initiates a rejoin to the group because of a commit timeout: it calls `AbstractCoordinator#joinGroupIfNeeded`, which leads to `sendJoinGroupRequest`.
2. `JoinGroupResponseHandler` writes the new generation data to `AbstractCoordinator.this.generation` and leaves the `synchronized` section.
3. The heartbeat thread executes `maybeLeaveGroup` and clears the generation data via `resetGenerationOnLeaveGroup`.
4. The consumer thread executes `onJoinComplete(generation.generationId, generation.memberId, generation.protocol, memberAssignment);` with the cleared generation data. This leads to the corresponding exception.

The race is fixed with a condition in `maybeLeaveGroup`: if a rejoin is in progress in the consumer thread, there is no reason to reset the generation data and send a `LeaveGroupRequest` from the heartbeat thread.

The PR currently contains an "unfair" reproducer: it is implemented with a `CountDownLatch` that imitates the described race inside the `AbstractCoordinator` code.

[1] https://issues.apache.org/jira/browse/KAFKA-8104
[2] https://github.com/apache/kafka/pull/7460
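To make the interleaving in steps 1-4 easier to discuss, here is a minimal standalone sketch of the race, written outside of Kafka. It is not the code from the PR: the class `GenerationRaceSketch`, the `rejoining` flag, and the `guardEnabled` switch are hypothetical names used only for illustration, and the real `AbstractCoordinator` is considerably more involved. A pair of `CountDownLatch` instances forces the heartbeat thread to run exactly between the generation write and the read that stands in for `onJoinComplete`.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical, simplified stand-in for AbstractCoordinator; not Kafka code.
final class GenerationRaceSketch {

    // Simplified stand-in for the coordinator's generation data.
    static final class Generation {
        static final Generation NO_GENERATION = new Generation(-1, "");
        final int generationId;
        final String memberId;

        Generation(int generationId, String memberId) {
            this.generationId = generationId;
            this.memberId = memberId;
        }
    }

    private Generation generation = Generation.NO_GENERATION;
    private boolean rejoining = false;   // assumed flag: a rejoin is in progress
    private final boolean guardEnabled;  // toggles the guard described in the mail

    GenerationRaceSketch(boolean guardEnabled) {
        this.guardEnabled = guardEnabled;
    }

    // Heartbeat thread: resets the generation unless the guard skips it.
    synchronized void maybeLeaveGroup() {
        if (guardEnabled && rejoining) {
            return; // rejoin in progress: keep the freshly written generation
        }
        generation = Generation.NO_GENERATION;
    }

    synchronized void setRejoining(boolean value) { rejoining = value; }

    // Imitates the join response handler writing the new generation (step 2).
    synchronized void onJoinResponse(Generation g) { generation = g; }

    synchronized Generation currentGeneration() { return generation; }

    // Forces the interleaving and returns the generation id that the
    // "onJoinComplete" step (step 4) would observe.
    static int run(boolean guardEnabled) {
        GenerationRaceSketch c = new GenerationRaceSketch(guardEnabled);
        CountDownLatch generationWritten = new CountDownLatch(1);
        CountDownLatch heartbeatDone = new CountDownLatch(1);

        Thread heartbeat = new Thread(() -> {
            try {
                generationWritten.await(); // wait until step 2 has happened
                c.maybeLeaveGroup();       // step 3
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                heartbeatDone.countDown();
            }
        });
        heartbeat.start();

        try {
            c.setRejoining(true);                            // step 1
            c.onJoinResponse(new Generation(5, "member-1")); // step 2
            generationWritten.countDown();
            heartbeatDone.await();                           // let step 3 finish
            int seen = c.currentGeneration().generationId;   // step 4
            c.setRejoining(false);
            heartbeat.join();
            return seen;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("without guard: " + run(false)); // prints -1
        System.out.println("with guard: " + run(true));     // prints 5
    }
}
```

Without the guard, the consumer side sees the cleared generation (id -1); with the guard, it sees the generation written by the join response. A "fair" reproducer would have to provoke the same ordering without latches placed in the coordinator code, which is the part I need help with.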