Hi Stevo,

Thank you for your response. Yes, I understand there can only be one active consumer per partition, and I understand that partitions should be spread evenly. However, there will almost always be cases when they are somewhat unbalanced.
That said, how consumers are distributed among partitions is not fixed by design. Let me clarify: my question was whether there has been any discussion on 1) having the rebalancing algorithm take lag into account, assigning partitions to consumers in such a way that the sum of lag per consumer is spread as evenly as possible, and 2) possibly triggering the rebalancing algorithm when lag could be considerably improved.

I have two specific use cases:

1. I recently added new partitions to a topic and reset offsets to the beginning. This meant that the old partitions had much more data than the new ones, so the rebalance when adding new nodes was definitely not "fair" in terms of lag. I had to scale up to have as many consumers as partitions to be sure that a single consumer was not assigned two of the old partitions.
2. Autoscaling. When consumers come and go based on CPU load or whatever, they will invariably have different numbers of partitions assigned to them. A lag-aware rebalancer would definitely be a small optimisation that could do quite a lot here.

If this hasn't been discussed before, I'll consider writing a KIP for it.

Cheers,
Jens

On Mon, Sep 12, 2016 at 12:25 AM Stevo Slavić <ssla...@gmail.com> wrote:

> Hello Jens,
>
> By design there can be only one active consumer per consumer group per
> partition at a time; only one thread, after being assigned (acquiring a
> lock), moves the offset for the consumer group for that partition, so
> there are no concurrency problems. Typically that assignment lasts until
> rebalancing gets triggered, e.g. because a consumer instance joined or
> left the group.
>
> To have all active consumers evenly loaded, one has to have publishing
> evenly distribute messages across the partitions with an appropriate
> partitioning strategy. Check which one is being used with which settings.
> It may be tuned well already, but maybe not for your test: e.g. one that
> publishes 20k messages to one partition before publishing to the next
> (it assumes lots of messages will be published, so it batches writes,
> trading off temporarily uneven balancing for better throughput), so if
> the test publishes only 40k messages, only two partitions will actually
> get the data.
>
> Kind regards,
> Stevo Slavic.
>
> On Sun, Sep 11, 2016, 22:49 Jens Rantil <jens.ran...@tink.se> wrote:
>
> > Hi,
> >
> > We have a partition which has many more messages than all other
> > partitions. Does anyone know if there has been any discussion on
> > having a partition balancer that tries to balance consumers based on
> > consumer group lag?
> >
> > Example:
> >
> > [
> >   { partition: 0, consumers: "192.168.1.2", lag: 20000 },
> >   { partition: 1, consumers: "192.168.1.2", lag: 20000 },
> >   { partition: 2, consumers: "192.168.1.3", lag: 0 },
> > ]
> >
> > Clearly, it would be more optimal if "192.168.1.3" also took care of
> > partition 1.
> >
> > Cheers,
> > Jens
> >
> >
> > --
> > Jens Rantil
> > Backend engineer
> > Tink AB
> >
> > Email: jens.ran...@tink.se
> > Phone: +46 708 84 18 32
> > Web: www.tink.se
> >
> > Facebook <https://www.facebook.com/#!/tink.se> Linkedin
> > <http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_photo&trkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary>
> > Twitter <https://twitter.com/tink>
>

--
Jens Rantil
Backend Developer @ Tink

Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden

For urgent matters you can reach me at +46-708-84 18 32.
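P.S. The lag-aware assignment I have in mind could be sketched as a greedy longest-processing-time heuristic: sort partitions by lag (largest first), then repeatedly hand the next partition to the consumer with the smallest total assigned lag. This is only an illustrative sketch in Python, not Kafka's actual assignor API; the function name `assign_by_lag`, the consumer names, and the lag figures just reuse the example from the thread:

```python
def assign_by_lag(partition_lags, consumers):
    """Greedy lag-aware assignment sketch: walk partitions from most to
    least lagged, giving each one to the consumer with the smallest sum
    of lag so far (ties broken by fewest partitions, then list order)."""
    totals = {c: 0 for c in consumers}
    assignment = {c: [] for c in consumers}
    for partition, lag in sorted(partition_lags.items(), key=lambda kv: -kv[1]):
        target = min(consumers, key=lambda c: (totals[c], len(assignment[c])))
        assignment[target].append(partition)
        totals[target] += lag
    return assignment

# The example from the thread: partitions 0 and 1 each carry 20k lag.
lags = {0: 20000, 1: 20000, 2: 0}
print(assign_by_lag(lags, ["192.168.1.2", "192.168.1.3"]))
# → {'192.168.1.2': [0, 2], '192.168.1.3': [1]}
```

With these figures the greedy pass leaves each consumer with 20,000 total lag: 192.168.1.3 picks up one of the heavy partitions instead of sitting idle, which is exactly the outcome argued for above.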