Jack,

ZooKeeper is likely the bottleneck if rebalancing takes a very long time.
As Jay said, this will be addressed in the consumer rewrite planned for
0.9. A couple more workarounds were tried at LinkedIn: 1) deploying
ZooKeeper on SSDs, and 2) turning off the sync on every write
(zookeeper.forceSync). I'm not sure whether #2 ever negatively affected
the consistency of the ZooKeeper data, but it did help speed up
rebalancing.
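
For #2, if I remember right it's the forceSync JVM system property on the
ZooKeeper servers — something along these lines (exact file depends on how
you launch ZooKeeper; treat this as a sketch, not a recipe):

```shell
# e.g. in conf/zookeeper-env.sh, or wherever the ZK server JVM flags are set.
# forceSync=no skips the fsync on every transaction-log write, which speeds
# up writes (and hence rebalancing) but means a crash can lose acknowledged
# updates — use with care.
SERVER_JVMFLAGS="-Dzookeeper.forceSync=no"
```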

Thanks,
Neha

On Thu, Nov 6, 2014 at 11:31 AM, Jay Kreps <jay.kr...@gmail.com> wrote:

> Unfortunately the performance of the consumer balancing scales poorly with
> the number of partitions. This is one of the things the consumer rewrite
> project is meant to address, however that is not complete yet. A reasonable
> workaround may be to decouple your application parallelism from the number
> of partitions. I.e. have the processing of each partition happen in a
> threadpool. I'm assuming that you don't actually have 2,500 machines, just
> that you need that much parallelism since each message takes a bit of time
> to process. This does weaken the delivery ordering, but you may be able to
> shard the processing by key to solve that problem.
>
> -Jay
>
> On Thu, Nov 6, 2014 at 10:59 AM, Jack Foy <j...@whitepages.com> wrote:
>
> > Hi all,
> >
> > We are building a system that will carry a high volume of traffic (on the
> > order of 2 billion messages in each batch), which we need to process at a
> > rate of 50,000 messages per second. We need to guarantee at-least-once
> > delivery for each message. The system we are feeding has a latency of
> 50ms
> > per message, and can absorb many concurrent requests.
> >
> > We have a Kafka 0.8.1.1 cluster with three brokers and a Zookeeper 3.4.5
> > cluster with 5 nodes, each on physical hardware.
> >
> > We intend to deploy a consumer group of 2500 consumers against a single
> > topic, with a partition for each consumer. We expect our consumers to be
> > stable over the course of the run, so we expect rebalancing to be rare.
> In
> > testing, we have successfully run 512 high-level consumers against 1024
> > partitions, but beyond 512 consumers the rebalance at startup doesn’t
> > complete within 10 minutes. Is this a workable strategy with high-level
> > consumers? Can we actually deploy a consumer group with this many
> consumers
> > and partitions?
> >
> > We see throughput of more than 500,000 messages per second with our 512
> > consumers, but we need greater parallelism to meet our performance needs.
> >
> > --
> > Jack Foy <j...@whitepages.com>
> >
> >
> >
> >
>
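
To make Jay's suggestion concrete — decoupling parallelism from partition
count by handing records to a threadpool, sharded by key so per-key ordering
survives — a rough self-contained sketch might look like this. The class and
method names are illustrative, not part of Kafka's API; your consumer thread
would call submit() for each record it pulls:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: one consumer thread feeds a pool of single-threaded executors.
// Messages with the same key always land on the same executor, so they
// are processed in arrival order relative to each other, even though
// different keys run concurrently.
public class ShardedProcessor {
    private final ExecutorService[] shards;

    public ShardedProcessor(int numThreads) {
        shards = new ExecutorService[numThreads];
        for (int i = 0; i < numThreads; i++) {
            shards[i] = Executors.newSingleThreadExecutor();
        }
    }

    // Route by key hash so a given key is always handled by one thread.
    public void submit(String key, Runnable work) {
        int shard = Math.floorMod(key.hashCode(), shards.length);
        shards[shard].execute(work);
    }

    public void shutdown() throws InterruptedException {
        for (ExecutorService s : shards) s.shutdown();
        for (ExecutorService s : shards) s.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```

With something like this you could run far fewer consumers (and partitions)
and still get the ~2,500-way parallelism the 50ms-per-message sink needs.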
