I've been trying to use Kafka to feed data into a computing cluster (~500 servers). The basic design is a single 'job submitter' server acting as a Producer into a Topic with 1000 partitions. Each of the 500 servers then runs an instance of a multithreaded High Level Consumer, all sharing the same group.id, that asynchronously processes incoming messages through a CPU-intensive workload. My expectation was that Kafka would use server-side logic to map the topic's partitions onto the different consumer instances in the shared group. My goal is to be able to add and remove consumer instances over the lifetime of the processing and have Kafka automatically rebalance the partitions across the set of live consumer instances.
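For reference, the consumer configuration on each worker looks roughly like the following (the ZooKeeper hosts and group name are placeholders); the important part is that every instance uses the identical group.id:

```properties
# 0.8-era high-level consumer properties (hosts/names are placeholders)
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
group.id=cluster-workers
zookeeper.session.timeout.ms=6000
auto.commit.interval.ms=1000
```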
This hasn't been working well for me -- in practice one or two of my cluster servers pick up messages while the others sit idle. I suspect that each High Level Consumer is picking up partition 0 and ZooKeeper is getting confused about which instance/socket to deliver the messages to. After reading through the docs a few more times, I now think the partition -> consumer mapping logic is client side rather than server side -- if that's the case, my scenario is fundamentally broken unless I implement an independent service to map partitions to clients. I've also looked through the Simple Consumer example, and the partition mapping logic is handled client side there too, which again seems to lead me back toward writing my own partitioning service.

Can you confirm my understanding that partition -> consumer mapping is client-side logic? Is there an established pattern I should be following for a 1 Producer -> many Consumer instances in a shared group scenario?

Thanks in advance for your advice,

- jcb
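P.S. For what it's worth, my reading of the high-level consumer's client-side range assignment is roughly the sketch below. This is my own illustration, not Kafka's actual code (the class and method names are mine): each consumer fetches the sorted list of live group members from ZooKeeper and deterministically claims a contiguous range of partitions, so all clients compute the same answer independently.

```java
import java.util.*;

// Hypothetical sketch of 0.8-style client-side range assignment.
// Every consumer runs this same deterministic computation locally.
public class RangeAssignSketch {
    static Map<String, List<Integer>> assign(int numPartitions, List<String> consumers) {
        List<String> sorted = new ArrayList<>(consumers);
        Collections.sort(sorted);                    // every client sorts identically
        int per = numPartitions / sorted.size();     // base partitions per consumer
        int extra = numPartitions % sorted.size();   // first 'extra' consumers get one more
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        int next = 0;
        for (int i = 0; i < sorted.size(); i++) {
            int count = per + (i < extra ? 1 : 0);
            List<Integer> parts = new ArrayList<>();
            for (int j = 0; j < count; j++) parts.add(next++);
            out.put(sorted.get(i), parts);
        }
        return out;
    }

    public static void main(String[] args) {
        // 10 partitions across 3 consumers -> a 4/3/3 contiguous split
        System.out.println(assign(10, Arrays.asList("c1", "c2", "c3")));
    }
}
```

If this mental model is right, then with 1000 partitions and 500 consumers each instance should claim 2 partitions -- which makes the "only one or two servers busy" behavior I'm seeing all the more confusing.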
