I've been trying to use Kafka to feed data into a computing cluster (e.g.
500 servers).  The basic design is one 'job submitter' server is a Producer
into a Topic with 1000 partitions.  I then have 500 servers each running an
instance of a multithreaded High Level Consumer all with a shared
group.idthat asynchronously process incoming messages against a CPU
intensive
workload.  My expectation was that the Kafka would use server side logic to
map the topic partitions into the different consumer instances in the
shared group.  My goal is to be able to join and leave consumer instances
over the lifetime of the processing and have Kafka automatically rebalance
the partitions to the set of live Consumer instances.

This hasn't been working well for me -- in practice I've seen one or two of
my cluster servers pick up messages and the others sit idle.  I suspect
that each High Level Consumer is picking up partition 0 and ZooKeeper is
getting confused about which instance/socket to map the messages into.
 After reading through the docs a few more times, I think the partition ->
group mapping logic is client side rather than server side -- if this is
the case I think my scenario is fundamentally broken unless I implement an
independent service for partition -> client mapping.  I've looked through
the Simple Consumer example and it looks like the partition mapping logic
is handled client side there so it seems to lead me back down the path of
writing my own partitioning service.

Can you confirm my understanding that partition -> consumer mapping is
client side logic?  Is there an established pattern I should be following
to use Kafka in a 1 Producer -> Many Consumers Instances in a Shared Group
scenario?

Thanks in advance for your advice,

 - jcb

Reply via email to