Hello, Guozhang.

Got it, thanks for the help with the PR.
Will wait for your review.

On Mon, 14/10/2019 at 13:40 -0700, Guozhang Wang wrote:
> Hello Nikolay,
> 
> I'm still on your PR, but was swamped with some other issues as the release
> code freeze date's approaching, will try to make another pass on it asap.
> 
> 
> Guozhang
> 
> On Mon, Oct 14, 2019 at 12:46 PM Nikolay Izhikov <nizhi...@apache.org>
> wrote:
> 
> > Hello.
> > 
> > I got very helpful advice from Guozhang.
> > And now we have a ready fix and a reproducer.
> > 
> > This PR fixes a long-standing Kafka Consumer bug.
> > Please join the review.
> > 
> > [1] https://issues.apache.org/jira/browse/KAFKA-8104
> > [2] https://github.com/apache/kafka/pull/7460
> > 
> > On Mon, 07/10/2019 at 21:37 +0300, Nikolay Izhikov wrote:
> > > Hello.
> > > 
> > > We have the issue KAFKA-8104 "Consumer cannot rejoin to the group after rebalancing" [1].
> > > It reproduces in many production environments.
> > > 
> > > I prepared a reproducer and a fix [2] for this issue.
> > > But I need assistance with the "fair" reproducer.
> > > 
> > > Please help me with the review and the "fair" reproducer:
> > > 
> > > The PR contains the fix for a race condition between the "consumer thread" and the "consumer coordinator heartbeat thread". It reproduces in many production environments.
> > > 
> > > Conditions for reproducing:
> > > 
> > > 1. The consumer thread initiates a rejoin to the group because of a commit timeout. It calls `AbstractCoordinator#joinGroupIfNeeded`, which leads to `sendJoinGroupRequest`.
> > > 2. `JoinGroupResponseHandler` writes the new generation data to `AbstractCoordinator.this.generation` and leaves the `synchronized` section.
> > > 3. The heartbeat thread executes `maybeLeaveGroup` and clears the generation data via `resetGenerationOnLeaveGroup`.
> > > 4. The consumer thread executes `onJoinComplete(generation.generationId, generation.memberId, generation.protocol, memberAssignment);` with the cleared generation data. This leads to the corresponding exception.
> > > 
> > > The race is fixed with a condition in `maybeLeaveGroup`: if a rejoin is in progress in the consumer thread, there is no reason to reset the generation data and send a `LeaveGroupRequest` from the heartbeat thread.
> > > 
> > > This PR contains an unfair "reproducer".
> > > It is implemented with a `CountDownLatch` that imitates the described race in the `AbstractCoordinator` code.
> > > 
> > > 
> > > 
> > > [1] https://issues.apache.org/jira/browse/KAFKA-8104
> > > [2] https://github.com/apache/kafka/pull/7460
> 
> 
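
For readers following the thread, the interleaving in steps 1-4 of the quoted message and the guard in `maybeLeaveGroup` can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the code from the PR: `ToyCoordinator`, `Generation`, `rejoinInProgress`, `guardEnabled`, and both latches are hypothetical names, and the real condition used in `AbstractCoordinator` may differ. The `CountDownLatch` pair forces the heartbeat thread to run exactly between step 2 and step 4, in the same spirit as the "unfair" reproducer.

```java
import java.util.concurrent.CountDownLatch;

public class RejoinRaceSketch {
    // Simplified stand-in for the generation data; not Kafka's actual class.
    static final class Generation {
        static final Generation NO_GENERATION = new Generation(-1, "");
        final int generationId;
        final String memberId;

        Generation(int generationId, String memberId) {
            this.generationId = generationId;
            this.memberId = memberId;
        }
    }

    // Hypothetical toy model of the coordinator; only the race is modelled.
    static final class ToyCoordinator {
        private final boolean guardEnabled;
        private Generation generation = Generation.NO_GENERATION;
        private boolean rejoinInProgress = false;

        // Latches that pin the interleaving described in steps 2-4.
        final CountDownLatch generationWritten = new CountDownLatch(1);
        final CountDownLatch heartbeatDone = new CountDownLatch(1);

        ToyCoordinator(boolean guardEnabled) {
            this.guardEnabled = guardEnabled;
        }

        // Runs on the "consumer thread".
        int joinGroup() throws InterruptedException {
            synchronized (this) {
                rejoinInProgress = true;
                // Step 2: new generation data is written under the lock.
                generation = new Generation(1, "member-1");
            } // ...and the synchronized section is left.
            generationWritten.countDown();
            heartbeatDone.await(); // let the heartbeat thread run step 3 here
            synchronized (this) {
                rejoinInProgress = false;
                // Step 4: the generation is read again for onJoinComplete.
                return generation.generationId;
            }
        }

        // Runs on the "heartbeat thread" (step 3).
        synchronized void maybeLeaveGroup() {
            if (guardEnabled && rejoinInProgress) {
                return; // assumed form of the fix: skip the reset during a rejoin
            }
            generation = Generation.NO_GENERATION;
        }
    }

    // Returns the generation id the consumer thread observes in step 4.
    static int run(boolean guardEnabled) throws InterruptedException {
        ToyCoordinator coordinator = new ToyCoordinator(guardEnabled);
        Thread heartbeat = new Thread(() -> {
            try {
                coordinator.generationWritten.await();
                coordinator.maybeLeaveGroup();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                coordinator.heartbeatDone.countDown();
            }
        });
        heartbeat.start();
        int observed = coordinator.joinGroup();
        heartbeat.join();
        return observed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("without guard: " + run(false)); // -1, generation was cleared
        System.out.println("with guard:    " + run(true));  // 1, generation survived
    }
}
```

Without the guard, the consumer thread sees the cleared generation (id -1 in this model); with the guard, it sees the generation it just joined. The latches make the outcome deterministic, which is also why such a test is "unfair": it imposes the interleaving rather than provoking it under real timing.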
