I understand your point of view... My requirement is *exact* balancing:
parts of our current flow have a consumption processing time of around 5
minutes (and this is an important/expensive part, because it's CPU and
memory intensive and we'd like to avoid queueing), so we need EQUAL load
balancing, and we need to know when to scale/descale.
If you pay attention, I'm always saying equal load balancing with multiple
producers.
By that I mean: if I have 10 partitions in a topic and send 10 messages
from different producers, I expect the load to be exactly divided: 1
message in each partition.

I thought about some possible solutions using Kafka as-is, although each
has a drawback.

What Kafka offers out of the box:
A) RoundRobinPartitioner - a cyclic round robin internal to each producer.
Because every producer keeps its own counter, in the worst case a single
partition can receive N messages before all partitions have been used
(where N is the number of producers). Drawback: unequal balance over short
periods of time (depending on the number of producers, which producer the
messages come from, etc.).
B) DefaultPartitioner - hash of the key modulo the total number of
partitions. If a random key is used, mathematically (over a large number of
messages) the load should be equally distributed. Drawback: unequal balance
over short periods of time.
Is that correct / do you agree?
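To make the short-window imbalance in option A concrete, here is a minimal
simulation (hypothetical class and method names, not Kafka code): each
producer runs its own round-robin counter, mirroring how
RoundRobinPartitioner keeps per-producer state, so producers that happen to
be in sync all hit the same partition first.

```java
// Sketch only (hypothetical names): each producer holds its OWN
// round-robin counter, as RoundRobinPartitioner does. Producers that
// start in sync all pick partition 0 first.
class IndependentRoundRobinSim {
    // Returns how many messages land in each partition when `producers`
    // producers each send `messagesEach` messages round-robin, independently.
    static int[] simulate(int producers, int messagesEach, int partitions) {
        int[] load = new int[partitions];
        for (int p = 0; p < producers; p++) {
            int counter = 0; // per-producer counter, NOT shared
            for (int m = 0; m < messagesEach; m++) {
                load[counter++ % partitions]++;
            }
        }
        return load;
    }

    public static void main(String[] args) {
        // 10 producers, 1 message each, 10 partitions: worst case, all 10
        // messages land in partition 0 and the other 9 partitions stay empty.
        int[] load = simulate(10, 1, 10);
        System.out.println(java.util.Arrays.toString(load));
        // prints [10, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    }
}
```

In the long run each producer's round robin is uniform, so the sum evens
out; but over a 5-minute consumption window this worst case is exactly the
"producers in sync" behaviour we observed in our tests.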

Possible options we could think of:

1) A custom partitioner using shared memory between producers to decide the
next partition. Drawback: all producers would need to be within the shared
memory boundary.
2) A single dummy consumer/producer with RoundRobinPartitioner sitting
between two topics: an "in" topic (single partition) that the real
producers send to, and an "out" topic with multiple partitions that the
real consumers listen to. Drawbacks: a single point of failure (granted, we
could add an extra consumer in the same consumer group on the
single-partition "in" topic to take over if the dummy consumer/producer
fails), and it makes design, maintainability, monitoring, etc. worse from
an architecture point of view.
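For reference, the partition-choice logic in option 1 could be as small as
a shared atomic counter. This is only a sketch under the assumption that
all producers run inside one JVM; the class and method names are made up,
and in a real deployment this logic would live inside a custom
org.apache.kafka.clients.producer.Partitioner implementation:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of option 1 (hypothetical names): one counter shared by every
// producer in the same JVM guarantees an exact cycle over the partitions.
class SharedCounterPartitioner {
    private static final AtomicLong COUNTER = new AtomicLong();

    // Thread-safe: getAndIncrement() hands concurrent producers consecutive
    // values, so every cycle of N calls hits each of the N partitions
    // exactly once, no matter which producer made each call.
    static int nextPartition(int numPartitions) {
        return (int) (COUNTER.getAndIncrement() % numPartitions);
    }
}
```

Ten calls with numPartitions = 10 yield each partition exactly once,
regardless of which producer thread made each call; that is the exact
balance described above, at the cost of confining all producers to the
shared-memory boundary.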

I didn't fully understand the atomic counter suggestion - I guess it would
look like option 1?
And the fanout, like option 2?

I believe we should be able to do perfect load balancing: 10 messages
received in a topic being distributed to 10 partitions, 20 messages to 20
partitions, no matter who generated them.
The thing is that currently the broker receives messages at the partition
level only. There is no way to send them at the topic level and have the
broker redistribute them.

We are currently paying for extra idle machines - my ideas are:
i) make sure we are not missing something (maybe some of our assumptions
are wrong and there are easy out-of-the-box options)
ii) if we are not missing anything, go with option 1 (and limit our
producers to be within the shared memory boundary)
iii) check the feasibility (how hard would it be?) and the community's
acceptance of doing this in Kafka by submitting a KIP

Thanks once again!



On Mon, Jun 15, 2020 at 9:09 PM Colin McCabe <cmcc...@apache.org> wrote:

> This is a bit frustrating since you keep saying that the load is not
> balanced, but the load actually is balanced, it's just balanced in an
> approximate fashion.  If you need exact balancing (for example, because
> you're creating a job scheduler or something), then you need to use a
> different strategy.  One example would be using an external atomic counter
> to determine what partition the producers should send the messages to.
> Another would be using a single consumer with fanout.  I think this is
> outside the scope of Kafka, at least if I understand the problem here (?)
>
> best,
> Colin
>
> On Mon, Jun 15, 2020, at 11:32, Vinicius Scheidegger wrote:
> > Hi Collin,
> >
> > One producer shouldn't need to know about the other to distribute the
> load
> > equally, but what Kafka has now is roughly equal...
> > If you have a single producer, RoundRobinPartitioner works fine; if you
> have
> > 10 producers you can have 7/8 messages in one partition while another
> > partition has none (producers are in sync - which happened a couple times
> > in our tests).
> >
> > Producer0 getNext() = partition0
> > Producer1 getNext() = partition0
> > Producer2 getNext() = partition0
> >
> > A link to some of our test data prints:
> > https://imgur.com/a/ha9OQMj
> >
> > This, depending on how intensive (slow) your consumption rate is, may be
> a
> > problem as it will generate enqueuing.
> > We use Kafka as a messaging protocol in a big (and in some points heavy
> > load) machine learning flow - for high throughput (lightweight
> processing)
> > enqueuing is not an issue - although we saw it happening. But for heavy
> > processes we are unable to do equal load balance.
> >
> > We currently use the DefaultPartitioner and Kafka algorithm (murmur2 hash
> > of the key) to decide the partition.
> > We noticed enqueuing and timeouts while several consumers were idle -
> which
> > made us take a better look on how the load is balanced.
> >
> > I believe the only way to perform equal load balance without having to
> know
> > other producers would be to do it on the Broker side. Do you agree?
> >
> > Thanks,
> >
> >
> >
> > On Mon, Jun 15, 2020 at 7:32 PM Colin McCabe <cmcc...@apache.org> wrote:
> >
> > > Hi Vinicius,
> > >
> > > It's actually not necessary for one producer to know about the others
> to
> > > get an even distribution across partitions, right?  All that's really
> > > required is that all producers produce a roughly equal amount of data
> to
> > > each partition, which is what RoundRobinPartitioner is designed to
> do.  In
> > > mathematical terms, the sum of several uniform random variables is
> itself
> > > uniformly random.
> > >
> > > (There is a bug in RRP right now, KAFKA-9965, but it's not related to
> what
> > > we're talking about now and we have a fix ready.)
> > >
> > > cheers,
> > > Colin
> > >
> > >
> > > On Sun, Jun 14, 2020, at 14:26, Vinicius Scheidegger wrote:
> > > > Hi Collin,
> > > >
> > > > Thanks for the reply. Actually the RoundRobinPartitioner won't do an
> > > equal
> > > > distribution when working with multiple producers. One producer does
> not
> > > > know the others. If you consider that producers are randomly
> producing
> > > > messages, in the worst case scenario all producers can be synced and
> one
> > > > could have as many messages in a single partition as the number of
> > > > producers.
> > > > It's easy to generate evidence of this.
> > > >
> > > > I have asked this question on the users mail list too (and on Slack
> and
> > > on
> > > > Stackoverflow).
> > > >
> > > > Kafka currently does not have means to do a round robin across
> multiple
> > > > producers or on the broker side.
> > > >
> > > > This means there is currently NO GUARANTEE of equal distribution
> across
> > > > partitions as the partition election is decided by the producer.
> > > >
> > > > The result is unbalanced consumption when working with consumer
> > > groups
> > > > and the options are: creating a custom shared partitioner, relying on
> > > Kafka
> > > > random partition or introducing a middle man between topics (all of
> them
> > > > having big cons).
> > > >
> > > > I thought of asking here to see whether this is a topic that could
> > > concern
> > > > other developers (and maybe understand whether this could be a KIP
> > > > discussion)
> > > >
> > > > Maybe I'm missing something... I would like to know.
> > > >
> > > > According to my interpretation of the code (I just read through some
> > > > classes), there is currently no way to do partition balancing on
> the
> > > > broker - the producer sends messages directly to partition leaders so
> > > > partition currently needs to be defined on the producer.
> > > >
> > > > I understand that in order to perform round robin across partitions
> of a
> > > > topic when working with multiple producers, some development needs
> to be
> > > > done. Am I right?
> > > >
> > > >
> > > > Thanks
> > > >
> > > >
> > > > On Fri, Jun 12, 2020, 10:57 PM Colin McCabe <cmcc...@apache.org>
> wrote:
> > > >
> > > > > Hi Vinicius,
> > > > >
> > > > > This question seems like a better fit for the user mailing list
> rather
> > > > > than the developer mailing list.
> > > > >
> > > > > Anyway, if I understand correctly, you are asking if the producer
> can
> > > > > choose to assign partitions in a round-robin fashion rather than
> based
> > > on
> > > > > the key.  The answer is, you can, by using RoundRobinPartitioner.
> > > (again,
> > > > > if I'm understanding the question correctly).
> > > > >
> > > > > best,
> > > > > Colin
> > > > >
> > > > > On Tue, Jun 9, 2020, at 00:48, Vinicius Scheidegger wrote:
> > > > > > Anyone?
> > > > > >
> > > > > > On Fri, Jun 5, 2020 at 2:42 PM Vinicius Scheidegger <
> > > > > > vinicius.scheideg...@gmail.com> wrote:
> > > > > >
> > > > > > > Does anyone know how could I perform a load balance to
> distribute
> > > > > equally
> > > > > > > the messages to all consumers within the same consumer group
> having
> > > > > > > multiple producers?
> > > > > > >
> > > > > > > Is this a conceptual flaw on Kafka, wasn't it thought for equal
> > > > > > > distribution with multiple producers or am I missing something?
> > > > > > > I've asked on Stack Overflow, on Kafka users mailing group,
> here
> > > (on
> > > > > Kafka
> > > > > > > Devs) and on Slack - and still have no definitive answer
> (actually
> > > > > most of
> > > > > > > the time I got no answer at all)
> > > > > > >
> > > > > > > Would something like this even be possible in the way Kafka is
> > > > > currently
> > > > > > > designed?
> > > > > > > How does proposing for a KIP work?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Thu, May 28, 2020, 3:44 PM Vinicius Scheidegger <
> > > > > > > vinicius.scheideg...@gmail.com> wrote:
> > > > > > >
> > > > > > >> Hi,
> > > > > > >>
> > > > > > >> I'm trying to understand a little bit more about how Kafka
> works.
> > > > > > >> I have a design with multiple producers writing to a single
> topic
> > > and
> > > > > > >> multiple consumers in a single Consumer Group consuming
> message
> > > from
> > > > > this
> > > > > > >> topic.
> > > > > > >>
> > > > > > >> My idea is to distribute the messages from all producers
> equally.
> > > From
> > > > > > >> reading the documentation I understood that the partition is
> > > always
> > > > > > >> selected by the producer. Is that correct?
> > > > > > >>
> > > > > > >> I'd also like to know if there is an out of the box option to
> > > assign
> > > > > the
> > > > > > >> partition via a round robin *on the broker side *to guarantee
> > > equal
> > > > > > >> distribution of the load - if possible to each consumer, but
> if
> > > not
> > > > > > >> possible, at least to each partition.
> > > > > > >>
> > > > > > >> If my understanding is correct, it looks like in a multiple
> > > producer
> > > > > > >> scenario there is lack of support from Kafka regarding load
> > > balancing
> > > > > and
> > > > > > >> customers have to either stick to the hash of the key (random
> > > > > distribution,
> > > > > > >> although it would guarantee same key goes to the same
> partition)
> > > or
> > > > > they
> > > > > > >> have to create their own logic on the producer side (i.e. by
> > > sharing
> > > > > memory)
> > > > > > >>
> > > > > > >> Am I missing something?
> > > > > > >>
> > > > > > >> Thank you,
> > > > > > >>
> > > > > > >> Vinicius Scheidegger
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
