Re: [DISCUSS] KIP-333 Consider a faster form of rebalancing

Becket Qin Tue, 07 Aug 2018 20:05:57 -0700

Hi Richard,

Sorry for the late response. As discussed in the other offline thread, I am
still not sure if this use case is common enough to have a built-in
rebalance policy.


I think usually the time to detect the consumer failure and rebalance would
be the longer than the catching up time as the catch up usually happens in
parallel by all the other consumers in a group. If the there is a
bottleneck of consuming a single hot partition, this problem will exist
regardless of rebalance. In any case, the approach of having an ad-hoc
hidden consumer seems a little hacky.

Thanks,

Jiangjie (Becket) Qin

On Wed, Jul 18, 2018 at 2:39 PM, Richard Yu <[email protected]>
wrote:

>  Hi Becket,
> I made some changes and clarified the motivation for this KIP. :)It should
> be easier to understand now since I included a diagram.
> Thanks,Richard Yu
>     On Tuesday, July 17, 2018, 4:38:11 PM GMT+8, Richard Yu
> <[email protected]> wrote:
>
>   Hi Becket,
> Thanks for reviewing this KIP. :)
> I probably did not explicitly state what we were trying to avoid by
> introducing this mode. As mentioned in the KIP, there is a offset lag which
> could result after a crash. Our main goal is to avoid this lag (i.e. the
> latency in terms of time that results from the crash, not to reduce the
> number of records reprocessed).
> I could provide a couple of diagrams with what I am envisioning because
> some points in my KIP might otherwise be hard to grasp (I will also include
> some diagrams to give you a better idea of an use case). As for your
> questions, I could provide a couple of answers:
> 1. Yes, the two consumers will in fact be processing in parallel. We do
> this because we want to accelerate the processing speed of the records to
> make up for the latency caused by the crash.
> 2. After the recovery point, records will not be processed twice. Let me
> describe the scenario I was envisioning: we would let the consumer that
> crashed seek to the end of the log using KafkaConsumer#seekToEnd.
> Meanwhile, a secondary consumer will start processing from the latest
> checkpointed offset and continue until it  has hit the place where the
> first consumer that crashed began processing after seekToEnd was first
> called. Since the consumer that crashed skipped from the recovery point to
> the end of the log, the intermediate offsets will be processed only by the
> secondary consumer. So it is important to note that the offset ranges which
> the two threads process will not overlap. (This is important as it prevents
> offsets from being processed more than once)
>
> 3. As for the committed offsets, the possibility of rewinding is not
> likely. If my understanding is correct, you are probably worried that after
> the crash, offsets that has already been previously committed will be
> committed again. The current design prevents that from happening, as the
> policy of where to start processing after a crash is universal across all
> Consumer instances -- we will begin processing from the latest offset
> committed.
>
> I hope that you at least got some of your questions answered. I will
> update the KIP soon, so please stay tuned.
>
> Thanks,Richard Yu
>     On Tuesday, July 17, 2018, 2:14:07 PM GMT+8, Becket Qin <
> [email protected]> wrote:
>
>  Hi Richard,
>
> Thanks for the KIP. I am a little confused on what is proposed. The KIP
> suggests that after recovery from a consumer crash, there will be two
> consumers consuming from the same partition. One consumes starting from the
> log end offset at the point of recovery, and another consumes starting from
> the last committed offset and keeping consuming with the first consumer in
> parallel? Does that mean the messages after the recovery point will be
> consumed twice? If those two consumer commits offsets, does that mean the
> committed offsets may rewind?
>
> The proposal sounds a little hacky and introduce some non-deterministic
> behavior. It would be useful to have a concrete use case example to explain
> what is actually needed. If the goal is to reduce the number of records
> that are reprocessed when consume crashes, maybe we can have an auto commit
> interval based on number of messages. If the application just wants to read
> from the end of the log after recovery from crash, would calling seekToEnd
> explicitly work?
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
> On Thu, Jul 5, 2018 at 6:46 PM, Richard Yu <[email protected]>
> wrote:
>
> > Hi all,
> >
> > I would like to discuss KIP-333 (which proposes a faster mode of
> > rebalancing).
> > Here is the link for the KIP:
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > 333%3A+Add+faster+mode+of+rebalancing
> >
> > Thanks,
> > Richard Yu
> >
>
>

Re: [DISCUSS] KIP-333 Consider a faster form of rebalancing

Reply via email to