Jun,

I hear you say "partitions are evenly distributed among all consumers in
the same group", yet I ran into a case where launching a process with
X high level consumer API threads took over all the partitions, leaving the
existing consumers unemployed.

If I understand that claim correctly, then on a topic T with 12 partitions
and 3 consumers C1-C3 in the same group, each running 4 threads,
adding a new consumer C4 with 12 threads should yield the following balance:
C1-C3 each relinquish a single partition, keeping only 3 partitions each.
C4 holds the 3 partitions relinquished by C1-C3.
Yet in the case I described, what happened is that C4 gained all 12
partitions and put C1-C3 out of business with 0 partitions each.
Maybe I overlooked something, but I am fairly sure that is what I saw.
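
For reference, here is a small standalone sketch of how I read the
range-style assignment described in the documentation. The thread ids below
are made up; as far as I understand, the real ids include host name and
timestamp, so the sort order is effectively arbitrary. It does seem to
reproduce what I saw once the new process brings more threads than there
are partitions:

import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RangeAssignmentSketch {

    // Sketch of the range assignment as I read it from the docs:
    // sort the consumer thread ids, sort the partitions, and hand each
    // thread a contiguous slice of partitions.
    static Map<String, List<Integer>> assign(List<String> threadIds, int numPartitions) {
        List<String> threads = new ArrayList<>(threadIds);
        Collections.sort(threads);                      // lexicographic order decides who gets what

        int perThread = numPartitions / threads.size(); // whole partitions per thread
        int extra = numPartitions % threads.size();     // the first `extra` threads get one more

        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        int next = 0;
        for (int i = 0; i < threads.size(); i++) {
            int count = perThread + (i < extra ? 1 : 0);
            List<Integer> parts = new ArrayList<>();
            for (int p = 0; p < count; p++) {
                parts.add(next++);
            }
            assignment.put(threads.get(i), parts);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // 12 partitions; C1-C3 run 4 threads each, C4 runs 12 threads = 24 threads in total.
        // Real thread ids contain the host name and a timestamp, so the sort order is
        // effectively arbitrary; in this made-up example C4's ids happen to sort first.
        List<String> threads = new ArrayList<>();
        for (int t = 0; t < 12; t++) {
            threads.add(String.format("group_hostA-C4-%02d", t));
        }
        for (int c = 1; c <= 3; c++) {
            for (int t = 0; t < 4; t++) {
                threads.add(String.format("group_hostB-C%d-%d", c, t));
            }
        }

        // Only the first 12 sorted thread ids get a partition; the remaining 12 get nothing,
        // so C4 ends up holding all 12 partitions and C1-C3 hold none.
        assign(threads, 12).forEach((thread, parts) ->
                System.out.println(thread + " -> " + parts));
    }
}

Please correct me if that sketch misses how the actual rebalance works.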

BTW:
What key is used to distinguish one consumer from another? "consumer.id"?
The docs for "consumer.id" only say "Generated automatically if not set."
What is the best practice for setting its value? Leave it empty? Is the
server host name good enough? What are the considerations?
When using the high level consumer API, are all threads identified as the
same consumer? I guess they are, right?
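
For context, each of my consumer processes is set up more or less like the
sketch below (the ZooKeeper address, group name and topic are placeholders),
which is why I am wondering whether the 4 streams count as one consumer or
as four:

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class HighLevelConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181"); // placeholder ZooKeeper address
        props.put("group.id", "my-group");          // every process uses the same group
        // "consumer.id" is left unset, so it is generated automatically.

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // One connector per process, asking for 4 streams (threads) on topic "T".
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap("T", 4));

        // Each KafkaStream in streams.get("T") is then handed to its own worker thread.
    }
}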

Thanks,
Shlomi


On Tue, Oct 28, 2014 at 4:21 AM, Jun Rao <jun...@gmail.com> wrote:

> You can take a look at the "consumer rebalancing algorithm" part in
> http://kafka.apache.org/documentation.html. Basically, partitions are
> evenly distributed among all consumers in the same group. If there are more
> consumers in a group than partitions, some consumers will never get any
> data.
>
> Thanks,
>
> Jun
>
> On Mon, Oct 27, 2014 at 4:14 AM, Shlomi Hazan <shl...@viber.com> wrote:
>
> > Hi All,
> >
> > Using Kafka's high level consumer API I have bumped into a situation
> > where launching a consumer process P1 with X consuming threads on a topic
> > with X partitions kicks out all the other consumer threads that were
> > consuming prior to launching the process P1.
> > That is, consumer process P1 is stealing all partitions from all other
> > consumer processes.
> >
> > While understandable, this makes it hard to size and deploy a cluster
> > with a number of partitions that, on the one hand, allows consumption to
> > be balanced across the consuming processes by giving each consumer its
> > share of the topic's partitions, and on the other hand leaves room for
> > growth and the addition of new consumers to help with increasing traffic
> > into the cluster and the topic.
> >
> > This stealing effect forces me either to have more partitions than are
> > really needed at the moment, planning for future growth, or to stick to
> > what I need now and rely on adding partitions later, which comes at a
> > price in terms of restarting consumers, running into out-of-order
> > messages (with hash partitioning), etc.
> >
> > Is this stealing policy intended, or did I just jump to conclusions?
> > What is the way to cope with the sizing question?
> >
> > Shlomi
> >
>
