Hello, Guozhang. Got it, thanks for the help with the PR. Will wait for your review.
On Mon, 14/10/2019 at 13:40 -0700, Guozhang Wang wrote:
> Hello Nikolay,
>
> I'm still on your PR, but was swamped with some other issues as the release
> code freeze date's approaching, will try to make another pass on it asap.
>
>
> Guozhang
>
> On Mon, Oct 14, 2019 at 12:46 PM Nikolay Izhikov <nizhi...@apache.org>
> wrote:
>
> > Hello.
> >
> > I got very helpful advice from Guozhang.
> > And now we have a ready fix and a reproducer.
> >
> > This PR fixes a very long-lived Kafka Consumer bug.
> > Please join the review.
> >
> > [1] https://issues.apache.org/jira/browse/KAFKA-8104
> > [2] https://github.com/apache/kafka/pull/7460
> >
> > On Mon, 07/10/2019 at 21:37 +0300, Nikolay Izhikov wrote:
> > > Hello.
> > >
> > > We have the KAFKA-8104 "Consumer cannot rejoin to the group after
> > > rebalancing" [1] issue.
> > > It reproduces in many production environments.
> > >
> > > I prepared a reproducer and a fix [2] for this issue.
> > > But I need assistance with the "fair" reproducer.
> > >
> > > Please help me with the review and the "fair" reproducer:
> > >
> > > The PR contains the fix for a race condition between the "consumer
> > > thread" and the "consumer coordinator heartbeat thread". It reproduces
> > > in many production environments.
> > >
> > > Conditions for reproducing:
> > >
> > > 1. The consumer thread initiates a rejoin to the group because of a
> > > commit timeout. It calls `AbstractCoordinator#joinGroupIfNeeded`, which
> > > leads to `sendJoinGroupRequest`.
> > > 2. `JoinGroupResponseHandler` writes the new generation data to
> > > `AbstractCoordinator.this.generation` and leaves the `synchronized`
> > > section.
> > > 3. The heartbeat thread executes `maybeLeaveGroup` and clears the
> > > generation data via `resetGenerationOnLeaveGroup`.
> > > 4. The consumer thread executes `onJoinComplete(generation.generationId,
> > > generation.memberId, generation.protocol, memberAssignment);` with the
> > > cleared generation data. This leads to the corresponding exception.
> > >
> > > The race is fixed with a condition in `maybeLeaveGroup`: if there is an
> > > ongoing rejoin process in the consumer thread, there is no reason to
> > > reset the generation data and send a `LeaveGroupRequest` from the
> > > heartbeat thread.
> > >
> > > This PR contains an unfair "reproducer".
> > > It is implemented with a `CountDownLatch` that imitates the described
> > > race in the `AbstractCoordinator` code.
> > >
> > > [1] https://issues.apache.org/jira/browse/KAFKA-8104
> > > [2] https://github.com/apache/kafka/pull/7460
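
For readers following the thread, the four steps and the guard described in the quoted message can be sketched roughly as below. This is not the code from the PR or from Kafka itself: `RejoinRaceSketch`, `Coordinator`, `rejoinInProgress`, `guardEnabled` and `runScenario` are hypothetical names made up for illustration, and the `CountDownLatch` only forces the "unfair" interleaving in the same spirit as the reproducer mentioned above.

```java
import java.util.concurrent.CountDownLatch;

public class RejoinRaceSketch {

    // Hypothetical stand-in for the coordinator's generation data.
    static final class Generation {
        static final Generation NO_GENERATION = new Generation(-1, "");
        final int generationId;
        final String memberId;

        Generation(int generationId, String memberId) {
            this.generationId = generationId;
            this.memberId = memberId;
        }
    }

    // Hypothetical, heavily simplified coordinator; not Kafka's AbstractCoordinator.
    static final class Coordinator {
        private final boolean guardEnabled;
        private Generation generation = Generation.NO_GENERATION;
        private boolean rejoinInProgress = false;

        Coordinator(boolean guardEnabled) {
            this.guardEnabled = guardEnabled;
        }

        // Step 1: the consumer thread starts a rejoin.
        synchronized void beginRejoin() {
            rejoinInProgress = true;
        }

        // Step 2: the join response stores the new generation, then the lock is released.
        synchronized void onJoinResponse(int generationId, String memberId) {
            generation = new Generation(generationId, memberId);
        }

        // Step 3: the heartbeat thread tries to leave the group.
        // With the guard on, an ongoing rejoin keeps the generation data intact.
        synchronized void maybeLeaveGroup() {
            if (guardEnabled && rejoinInProgress) {
                return;
            }
            generation = Generation.NO_GENERATION;
        }

        // Step 4: the consumer thread reads the generation to complete the join.
        synchronized int completeJoin() {
            rejoinInProgress = false;
            return generation.generationId;
        }
    }

    // Forces the interleaving 1 -> 2 -> 3 -> 4 and returns the generation id seen in step 4.
    public static int runScenario(boolean guardEnabled) {
        Coordinator coordinator = new Coordinator(guardEnabled);
        CountDownLatch generationWritten = new CountDownLatch(1);
        CountDownLatch heartbeatDone = new CountDownLatch(1);

        Thread heartbeat = new Thread(() -> {
            try {
                generationWritten.await();      // wait until step 2 has happened
                coordinator.maybeLeaveGroup();  // step 3
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                heartbeatDone.countDown();
            }
        });
        heartbeat.start();

        try {
            coordinator.beginRejoin();                  // step 1
            coordinator.onJoinResponse(5, "member-1");  // step 2
            generationWritten.countDown();
            heartbeatDone.await();                      // let step 3 run before step 4
            int seen = coordinator.completeJoin();      // step 4
            heartbeat.join();
            return seen;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running the scenario", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("without guard: " + runScenario(false)); // prints "without guard: -1"
        System.out.println("with guard: " + runScenario(true));     // prints "with guard: 5"
    }
}
```

Without the guard, step 4 sees the cleared generation (id -1 in this sketch); with the guard, the heartbeat thread leaves the data alone and step 4 sees the generation written in step 2.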