Hello everybody,

I am trying to understand how Kafka Direct Stream works. I want a production-ready Spark Streaming application that consumes a Kafka topic, but I need to guarantee (almost) no downtime, especially during deploys (spark-submit) of new versions. The best solution seems to be to submit the new version without shutting down the previous one, wait for the new application to start consuming events, and then shut down the previous one.
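For context, here is roughly what the job looks like; the topic name, group id, and broker list are placeholders, and I'm assuming the spark-streaming-kafka 0.8 integration throughout:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object MyStreamingApp {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("my-streaming-app"), Seconds(5))

    // Direct approach: Spark computes the offset ranges for each batch
    // itself, so the "group.id" below is not used to balance partitions
    // between two running applications.
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "broker1:9092,broker2:9092",
      "group.id" -> "my-consumer-group")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))

    stream.foreachRDD { rdd => /* process events */ }

    ssc.start()
    ssc.awaitTermination()
  }
}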
What I would expect is that events get distributed between the two applications in a balanced fashion, via the consumer group id, split by the partition key I have previously set on my Kafka producer. However, Kafka Direct Stream does not seem to support this functionality; as far as I can tell, the direct approach tracks offsets itself instead of relying on Kafka's consumer-group rebalancing, so a second application in the same group never takes over any partitions.

I have achieved this behaviour with the receiver-based approach (by the way, I've used "kafka" for the "offsets.storage" Kafka property [2]; see the sketch at the end of this message). However, that approach comes with the technical difficulties described in the documentation [1] (e.g. around exactly-once semantics), and even so it doesn't seem very failsafe.

Does anyone know a way to safely deploy new versions of a streaming application of this kind without downtime?

Thanks in advance,
Mariano

[1] http://spark.apache.org/docs/latest/streaming-kafka-integration.html
[2] http://kafka.apache.org/documentation.html#oldconsumerconfigs
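P.S. Here is a sketch of the receiver-based setup I mentioned above, assuming the same StreamingContext (ssc) as in the first snippet; the ZooKeeper addresses, topic name, and group id are again placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

// Receiver-based approach: the high-level consumer rebalances partitions
// across all consumers sharing the same group.id, which is what lets the
// old and new application instances overlap during a deploy.
val kafkaParams = Map[String, String](
  "zookeeper.connect" -> "zk1:2181,zk2:2181",
  "group.id" -> "my-consumer-group",
  // store offsets in Kafka instead of ZooKeeper (old consumer config [2])
  "offsets.storage" -> "kafka")

// Map of topic -> number of consumer threads for this receiver.
val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)

stream.foreachRDD { rdd => /* process events */ }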