Hello Nikolay,

I'm still on your PR, but I've been swamped with other issues as the release
code freeze date approaches. I will try to make another pass on it ASAP.


Guozhang

On Mon, Oct 14, 2019 at 12:46 PM Nikolay Izhikov <nizhi...@apache.org>
wrote:

> Hello.
>
> I got very helpful advice from Guozhang.
> Now we have a ready fix and reproducer.
>
> This PR fixes a very long-lived Kafka Consumer bug.
> Please join the review.
>
> [1] https://issues.apache.org/jira/browse/KAFKA-8104
> [2] https://github.com/apache/kafka/pull/7460
>
> On Mon, 07/10/2019 at 21:37 +0300, Nikolay Izhikov wrote:
> > Hello.
> >
> > We have issue KAFKA-8104, "Consumer cannot rejoin to the group after
> > rebalancing" [1].
> > It reproduces in many production environments.
> >
> > I prepared a reproducer and a fix [2] for this issue,
> > but I need assistance with a "fair" reproducer.
> >
> > Please help me with the review and the "fair" reproducer:
> >
> > The PR fixes a race condition between the "consumer thread" and the
> > "consumer coordinator heartbeat thread". It reproduces in many production
> > environments.
> >
> > Conditions for reproducing:
> >
> > 1. The consumer thread initiates a rejoin to the group because of a commit
> > timeout: it calls `AbstractCoordinator#joinGroupIfNeeded`, which leads to
> > `sendJoinGroupRequest`.
> > 2. `JoinGroupResponseHandler` writes the new generation data to
> > `AbstractCoordinator.this.generation` and leaves the `synchronized`
> > section.
> > 3. The heartbeat thread executes `maybeLeaveGroup` and clears the
> > generation data via `resetGenerationOnLeaveGroup`.
> > 4. The consumer thread executes `onJoinComplete(generation.generationId,
> > generation.memberId, generation.protocol, memberAssignment);` with the
> > cleared generation data. This leads to the corresponding exception.
> >
> > The race is fixed with a condition in `maybeLeaveGroup`: if a rejoin is
> > in progress in the consumer thread, there is no reason to reset the
> > generation data and send `LeaveGroupRequest` from the heartbeat thread.
> >
> > This PR contains an unfair "reproducer".
> > It is implemented with a `CountDownLatch` that imitates the described race
> > in `AbstractCoordinator` code.
> >
> >
> >
> > [1] https://issues.apache.org/jira/browse/KAFKA-8104
> > [2] https://github.com/apache/kafka/pull/7460
>
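
For readers following the thread, the four-step interleaving quoted above and
the proposed guard can be imitated in isolation. This is only a rough sketch
and not the code from the PR: `Coordinator`, `Generation`, `rejoinInProgress`
and `guardEnabled` are hypothetical stand-ins, and the two `CountDownLatch`
instances force the ordering in the same "unfair" way the quoted message
describes.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical, simplified model of the race; none of this is Kafka's real code.
final class RejoinRaceSketch {

    // Stand-in for the coordinator's generation data.
    static final class Generation {
        static final Generation NO_GENERATION = new Generation(-1, "");
        final int generationId;
        final String memberId;

        Generation(int generationId, String memberId) {
            this.generationId = generationId;
            this.memberId = memberId;
        }
    }

    // Stand-in for the coordinator shared by the two threads.
    static final class Coordinator {
        private final boolean guardEnabled;      // assumed switch for the proposed check
        private Generation generation = Generation.NO_GENERATION;
        private boolean rejoinInProgress = false;

        Coordinator(boolean guardEnabled) {
            this.guardEnabled = guardEnabled;
        }

        // Step 1: the consumer thread starts a rejoin.
        synchronized void beginRejoin() {
            rejoinInProgress = true;
        }

        // Step 2: the join response handler stores the new generation.
        synchronized void onJoinResponse(Generation newGeneration) {
            generation = newGeneration;
        }

        // Step 3: the heartbeat thread wants to leave the group.
        synchronized void maybeLeaveGroup() {
            if (guardEnabled && rejoinInProgress) {
                return; // a rejoin is ongoing, so keep the generation data
            }
            generation = Generation.NO_GENERATION;
        }

        // Step 4: the consumer thread reads the generation to complete the join.
        synchronized Generation completeJoin() {
            rejoinInProgress = false;
            return generation;
        }
    }

    // Runs the forced interleaving once and returns the generation id seen in step 4.
    static int run(boolean guardEnabled) {
        Coordinator coordinator = new Coordinator(guardEnabled);
        CountDownLatch joinResponseHandled = new CountDownLatch(1);
        CountDownLatch heartbeatDone = new CountDownLatch(1);

        Thread heartbeat = new Thread(() -> {
            try {
                joinResponseHandled.await();   // wait until step 2 has happened
                coordinator.maybeLeaveGroup(); // step 3
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                heartbeatDone.countDown();
            }
        });
        heartbeat.start();

        try {
            coordinator.beginRejoin();                                  // step 1
            coordinator.onJoinResponse(new Generation(5, "member-1"));  // step 2
            joinResponseHandled.countDown();
            heartbeatDone.await();                                      // let step 3 finish
            int seen = coordinator.completeJoin().generationId;         // step 4
            heartbeat.join();
            return seen;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while forcing the interleaving", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("without guard: generationId=" + run(false)); // prints -1
        System.out.println("with guard:    generationId=" + run(true));  // prints 5
    }
}
```

Without the guard, step 4 sees the cleared generation (id -1 in this model);
with the guard, it sees the generation written in step 2. A "fair" reproducer
would have to trigger the same ordering without latches placed in the
coordinator's path.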


-- 
-- Guozhang
