I have a scenario where, for a given topic, I'll have 500 consumers (1
consumer per instance of an app). I've set up the topic with 500
partitions, ensuring each consumer will eventually get work (the data
produced into Kafka uses the default partitioning strategy).
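
For reference, each app instance is doing roughly the following (a sketch
using the Java client; the broker address, topic name, and group id are
placeholders rather than our real values):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class WorkConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka:9092");  // placeholder broker address
            props.put("group.id", "work-consumers");       // all 500 instances share this group
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // With 500 partitions and 500 group members, each instance ends up
                // owning roughly one partition once the group settles.
                consumer.subscribe(Collections.singletonList("work-topic"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record.value());  // application-specific work (30s to 2 hours)
                    }
                }
            }
        }

        private static void process(String message) {
            // application-specific work
        }
    }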

Note: these consumer app instances run in containers via Marathon (using
Mesos).

Several times a day the consumer apps are intentionally restarted (to
upgrade the app, etc.). When a rolling restart occurs, Kafka begins its
rebalancing process, which can take 10 minutes or so since the rolling
restart itself takes a few minutes. As a result, I've seen a consumer get
its partitions reassigned, consume a new message, start working on it, and
then get hit by another reassignment. Since a single message takes anywhere
from 30 seconds to 2 hours to process, the work in flight when that
reassignment occurs is effectively lost.
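
To make the failure mode concrete, here is a sketch of a rebalance listener
that could be passed to the subscribe() call above just to log what we see
during a rolling restart (purely illustrative):

    import java.util.Collection;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.common.TopicPartition;

    // Inside the consumer class above; passed as the second argument to
    // consumer.subscribe(...).
    ConsumerRebalanceListener listener = new ConsumerRebalanceListener() {
        @Override
        public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
            // Fires repeatedly over the ~10 minute rolling restart; any
            // long-running job in flight on these partitions is wasted.
            System.out.println("Revoked: " + partitions);
        }

        @Override
        public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
            System.out.println("Assigned: " + partitions);
        }
    };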

One suggestion was to create a separate "app" in Marathon for each
instance, so I'd have 500 apps in Marathon and could pin each one to a
specific partition number instead of letting Kafka assign partitions to the
consumers automatically. That's problematic because I need to be able to
increase/decrease the number of instances of the app based on the demand
coming into the system.
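
For clarity, that suggestion amounts to replacing the subscribe() call
above with a manual assignment along these lines (sketch; the environment
variable name is made up):

    import java.util.Collections;
    import org.apache.kafka.common.TopicPartition;

    // Each Marathon "app" would be launched with its own partition number.
    int partition = Integer.parseInt(System.getenv("ASSIGNED_PARTITION"));
    consumer.assign(
        Collections.singletonList(new TopicPartition("work-topic", partition)));
    // assign() bypasses the group protocol, so there is no rebalancing, but
    // there is also no automatic redistribution of partitions when instances
    // are added or removed, which is why this doesn't fit our scaling needs.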

To work around this, we have a custom component that consumes the Kafka
topics and pushes the messages onto Redis lists (one list per Kafka topic).
Our consumer apps then do a BLPOP (blocking pop), which ensures each
message is processed only once and also avoids triggering a Kafka rebalance
when the consumer apps are restarted.
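
The consumer side of that workaround looks roughly like this (using Jedis;
the Redis host and list name are placeholders):

    import java.util.List;
    import redis.clients.jedis.Jedis;

    try (Jedis jedis = new Jedis("redis-host", 6379)) {
        while (true) {
            // BLPOP blocks until an element is available and removes it
            // atomically, so each message goes to exactly one app instance,
            // and restarting the apps never touches Kafka group membership.
            List<String> popped = jedis.blpop(0, "work-topic"); // 0 = block forever
            if (popped != null) {
                String message = popped.get(1); // [0] is the list name, [1] the value
                process(message);               // the same long-running work as before
            }
        }
    }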

I'm considering switching to a different queueing system such as ActiveMQ
or RabbitMQ to avoid this Kafka-to-Redis setup. Is Kafka the right fit
here? Is there a better approach to doing this with Kafka?

Thanks in advance,
Craig
