I am using Structured Streaming with Spark 2.2.  We are using Kafka as our
source and rely on checkpoints for failure recovery and end-to-end
exactly-once guarantees.  I would like some more information on how to
handle updates to the application when there is a change in stateful
operations and/or the output schema.

As some sources suggest, I can run the updated application in parallel
with the old application until it catches up with the old application in
terms of data, and then kill the old one.  But then the new application
will have to re-read/re-process all the data in Kafka, which could take a
long time.

I want to AVOID this re-processing of the data in the newly deployed
updated application.

One way I can think of is for the application to keep writing its Kafka
offsets to an external store in addition to the checkpoint directory, for
example ZooKeeper or HDFS.  Then, on an update of the application, I would
tell the Kafka readStream() to start reading from the offsets stored in
this external location, since the updated application can't read from the
checkpoint directory, which is now deemed incompatible.
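
For illustration, here is a rough Scala sketch of what I have in mind.  It
records each micro-batch's end offsets through a StreamingQueryListener
(the Kafka source reports them as a JSON string in the same format that
the startingOffsets option accepts) and, when started with a fresh
checkpoint directory after an upgrade, seeds the source from those stored
offsets.  The file-based "store", broker address, topic name, and paths
are just placeholders; a real deployment would write to ZooKeeper or HDFS:

    import java.nio.charset.StandardCharsets
    import java.nio.file.{Files, Paths, StandardOpenOption}

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    object OffsetTrackingSketch {

      // Placeholder "external store": a local file.  A real deployment
      // would write to a ZooKeeper znode or an HDFS path instead.
      private val offsetFile = Paths.get("/tmp/app-kafka-offsets.json")

      def saveOffsets(json: String): Unit =
        Files.write(offsetFile, json.getBytes(StandardCharsets.UTF_8),
          StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)

      def loadOffsets(): Option[String] =
        if (Files.exists(offsetFile))
          Some(new String(Files.readAllBytes(offsetFile),
            StandardCharsets.UTF_8))
        else None

      def main(args: Array[String]): Unit = {
        val spark =
          SparkSession.builder.appName("offset-tracking").getOrCreate()

        // After every micro-batch, record the Kafka source's end offsets.
        // endOffset is a JSON string such as {"my-topic":{"0":42,"1":17}},
        // the same format the startingOffsets option accepts.
        spark.streams.addListener(new StreamingQueryListener {
          override def onQueryStarted(e: QueryStartedEvent): Unit = ()
          override def onQueryTerminated(e: QueryTerminatedEvent): Unit = ()
          override def onQueryProgress(e: QueryProgressEvent): Unit =
            // Assumes the query has a single (Kafka) source.
            e.progress.sources.headOption.foreach(s => saveOffsets(s.endOffset))
        })

        // On an upgrade with a fresh checkpoint directory, seed the source
        // from the stored offsets instead of re-reading the whole topic.
        val starting = loadOffsets().getOrElse("earliest")

        val df = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
          .option("subscribe", "my-topic")                   // placeholder
          .option("startingOffsets", starting)
          .load()

        val query = df.writeStream
          .format("console")
          .option("checkpointLocation", "/tmp/new-checkpoint-dir") // fresh dir
          .start()

        query.awaitTermination()
      }
    }

One caveat I can already see: the offsets are written only after a batch
completes, so if the old application dies between the sink commit and the
offset write, resuming from the stored offsets would replay one batch,
i.e. the upgrade boundary is at-least-once rather than exactly-once.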

So a couple of questions:
1.  Is the above a valid solution?
2.  If yes, how can I automate the detection of whether the application is
being restarted because of a failure/maintenance event or because of code
changes to stateful operations and/or the output schema?  (One naive idea
I can imagine is sketched below.)
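
To make question 2 concrete, the only mechanism I can come up with is a
manual version marker stored next to the offsets: a constant in the code
that we bump whenever stateful operations or the output schema change
incompatibly, compared at startup against the value recorded by the
previously running version.  Everything below (the constant, the
file-based store, the paths) is a placeholder sketch, not an established
pattern:

    import java.nio.charset.StandardCharsets
    import java.nio.file.{Files, Paths, StandardOpenOption}

    object UpgradeDetectionSketch {

      // Bumped by hand whenever stateful operations or the output schema
      // change incompatibly (this manual step is the assumption here).
      val AppStateVersion = 2

      // Placeholder store; a real deployment would use ZooKeeper or HDFS.
      private val versionFile = Paths.get("/tmp/app-state-version")

      def recordedVersion(): Option[Int] =
        if (Files.exists(versionFile))
          Some(new String(Files.readAllBytes(versionFile),
            StandardCharsets.UTF_8).trim.toInt)
        else None

      // Write the current version once the upgraded query is running.
      def recordVersion(): Unit =
        Files.write(versionFile,
          AppStateVersion.toString.getBytes(StandardCharsets.UTF_8),
          StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)

      // True when the code restarting now declares a different
      // state/schema version than the code that wrote the checkpoint.
      def isIncompatibleUpgrade: Boolean =
        recordedVersion().exists(_ != AppStateVersion)
    }

At startup the driver would branch on isIncompatibleUpgrade: if true,
start the query with a fresh checkpoint directory and seed Kafka from the
externally stored offsets; if false, restart from the existing checkpoint
directory as usual.  What I don't see is how to detect this without the
manual version bump, which is why I'm asking.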

Any guidance, example or information source is appreciated.

Thanks,
Priyank
