Hi there,
I'm having trouble meeting all of my requirements when running Spark
Streaming on Kafka. Let me enumerate them:
1. I want to have at-least-once/exactly-once processing.
2. I want my application to tolerate both failures and plain controlled
stops; the Kafka offsets need to be tracked between restarts.
3. I want to be able to upgrade my application's code without losing the
Kafka offsets.

To my knowledge, here is what my requirements imply:
1. implies using the new Kafka DirectStream.
2. implies using checkpointing; the Kafka DirectStream writes its offsets
into the checkpoint as well.
3. implies that checkpoints can't be reused across restarts that upgrade
the code, because the checkpoint contains serialized application code. So I
need to install a shutdown hook that calls ssc.stop(stopGracefully = true)
(here is a description how:
https://metabroadcast.com/blog/stop-your-spark-streaming-application-gracefully
). A sketch of this whole setup follows below.
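
For reference, here is a minimal sketch of what I mean, assuming the Spark
1.3+ direct Kafka API; the app name, broker list, topic name and checkpoint
path are placeholders for my real values:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val checkpointDir = "hdfs:///checkpoints/my-app"  // placeholder

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("my-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)

  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
  val topics = Set("my-topic")

  // Direct stream: Spark tracks the offsets itself and saves them
  // into the checkpoint together with the rest of the DStream state
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)

  stream.foreachRDD { rdd =>
    println(s"processing ${rdd.count()} records")  // placeholder processing
  }
  ssc
}

// Recover from the checkpoint if one exists, otherwise build a fresh context
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)

// Controlled stop: let the in-flight batches finish, then shut down
sys.addShutdownHook {
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}

ssc.start()
ssc.awaitTermination()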

Now my problems are:
1. If I can't use checkpoints across code upgrades, does that mean Spark
does not help me at all with keeping my Kafka offsets? Do I have to
implement my own storing of offsets to, and initialization from, Zookeeper
(see the first sketch after this list)?
2. When I set up the shutdown hook and any of my executors throws an
exception, the application does not fail but seems to get stuck in the
running state. Is that because stopGracefully deadlocks when there are
exceptions? How can I overcome this problem? Maybe I can avoid setting a
shutdown hook altogether and there is another way to stop an application
gracefully (see the second sketch after this list)?

3. If I somehow overcome 2., is stopping my app gracefully enough to let me
upgrade the code without losing Kafka offsets?
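
Regarding problem 1, this is the kind of manual offset tracking I imagine I
would have to write myself. It assumes the fromOffsets variant of
createDirectStream; loadOffsets/saveOffsets are hypothetical stand-ins
(here a naive in-memory map) for a real ZooKeeper-backed store, and ssc and
kafkaParams are the ones from the first sketch:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

// Naive in-memory stand-in for a real ZooKeeper-backed offset store
var stored = Map(TopicAndPartition("my-topic", 0) -> 0L)
def loadOffsets(): Map[TopicAndPartition, Long] = stored
def saveOffsets(ranges: Array[OffsetRange]): Unit =
  stored = ranges.map(r => TopicAndPartition(r.topic, r.partition) -> r.untilOffset).toMap

// Variant of createDirectStream that starts from explicitly supplied offsets
val stream = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, loadOffsets(),
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

stream.foreachRDD { rdd =>
  // Must be read off the RDD that comes straight from the stream,
  // before any transformation loses the offset information
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  println(s"processing ${rdd.count()} records")  // placeholder processing
  // Persist the offsets only after the batch succeeded -> at-least-once
  saveOffsets(offsetRanges)
}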

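And regarding problem 2, one alternative to the JVM shutdown hook that I'm
considering: replace awaitTermination from the first sketch with a polling
loop that watches for an external stop marker and calls ssc.stop from the
main thread; the marker path is a placeholder:

import java.io.File

val stopMarker = new File("/tmp/stop-my-app")  // placeholder path

ssc.start()
var stopped = false
while (!stopped) {
  // true = the context already terminated (e.g. on error), false = timeout
  stopped = ssc.awaitTerminationOrTimeout(10000)
  if (!stopped && stopMarker.exists()) {
    // graceful stop initiated from the main thread, not a shutdown hook
    ssc.stop(stopSparkContext = true, stopGracefully = true)
    stopped = true
  }
}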

Thanks a lot for your answers,
Krzysztof Zarzycki
