Hi there, I have a problem fulfilling all of my requirements when using Spark Streaming on Kafka. Let me enumerate them:

1. I want at-least-once/exactly-once processing.
2. I want my application to tolerate both failures and plain controlled stops; the Kafka offsets need to be tracked between restarts.
3. I want to be able to upgrade my application's code without losing Kafka offsets.
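For context, requirement 2 is why I'm looking at the direct approach in the first place: it exposes the offset ranges each batch was built from, so they can be persisted externally. Roughly what I have in mind, against the spark-streaming-kafka 0.8 direct API (the broker address and topic are placeholders, and `saveOffsets` is a hypothetical helper, not a real API):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

object DirectStreamSketch {
  // Hypothetical helper: atomically persist the consumed offset ranges to an
  // external store (ZooKeeper, a DB, ...). Placeholder only, not implemented.
  def saveOffsets(ranges: Array[OffsetRange]): Unit = ???

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("direct-stream-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker:9092") // placeholder
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic")) // placeholder topic

    stream.foreachRDD { rdd =>
      // RDDs produced by the direct stream carry their Kafka offset ranges.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process rdd here ...
      saveOffsets(offsetRanges) // persist only after processing succeeds
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Saving the offsets only after the batch's output action completes is what would give me the at-least-once guarantee from requirement 1.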
Now, what my requirements imply according to my knowledge:

1. implies using the new Kafka DirectStream.
2. implies using checkpointing; the Kafka DirectStream writes its offsets to the checkpoint as well.
3. implies that checkpoints can't be used across controlled restarts with upgraded code, since a checkpoint contains serialized code. So I need to install a shutdown hook that calls ssc.stop(stopGracefully = true) (here is a description of how: https://metabroadcast.com/blog/stop-your-spark-streaming-application-gracefully ).

Now my problems are:

1. If I cannot use checkpoints between code upgrades, does it mean that Spark does not help me at all with keeping my Kafka offsets? Does it mean I have to implement my own storing of offsets to, and initialization from, ZooKeeper?
2. When I set up the shutdown hook and any of my executors throws an exception, the application does not fail but gets stuck in the running state. Is that because stopGracefully deadlocks on the exception? How can I overcome this problem? Maybe I can avoid setting a shutdown hook and there is another way to stop the application gracefully?
3. If I somehow overcome 2., is it enough to just stop my application gracefully in order to upgrade the code and not lose Kafka offsets?

Thank you a lot for your answers,
Krzysztof Zarzycki
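To make question 1 concrete: if I do have to manage offsets myself, I believe the restore side would look something like this sketch (again the 0.8 direct API; `loadOffsets` is a hypothetical helper for whatever store I'd use):

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

object RestoreOffsetsSketch {
  // Hypothetical helper: read the last committed offsets per topic-partition
  // from an external store. Placeholder only, not implemented.
  def loadOffsets(): Map[TopicAndPartition, Long] = ???

  def createStream(ssc: StreamingContext, kafkaParams: Map[String, String]) = {
    val fromOffsets = loadOffsets()
    // The message handler decides what each record is mapped to; here (key, value).
    val messageHandler =
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
    // This overload starts consuming exactly at the supplied offsets, so the
    // checkpoint is no longer the source of truth for where to resume.
    KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets, messageHandler)
  }
}
```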
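And for reference on problem 2, the hook I installed follows the pattern from the linked post, roughly like this (a sketch; stopSparkContext and stopGracefully are the parameters of StreamingContext.stop in Spark 1.x):

```scala
import org.apache.spark.streaming.StreamingContext

object GracefulStopSketch {
  // Sketch of the graceful-stop hook from the linked post: on JVM shutdown
  // (e.g. SIGTERM), let in-flight batches finish instead of killing mid-batch.
  def installShutdownHook(ssc: StreamingContext): Unit = {
    sys.addShutdownHook {
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }
  }
}
```

This is exactly the code path that seems to hang when an executor has already thrown.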