I am using Structured Streaming with Spark 2.2. We are using Kafka as our source and are using checkpoints for failure recovery and e2e exactly once guarantees. I would like to get some more information on how to handle updates to the application when there is a change in stateful operations and/or output schema.
As some of the sources suggest I can start the updated application parallelly with the old application until it catches up with the old application in terms of data, and then kill the old one. But then the new application will have to re-read/re-process all the data in kafka which could take a long time. I want to AVOID this re-processing of the data in the newly deployed updated application. One way I can think of is for the application to keep writing the offsets into something in addition to the checkpoint directory, for example in zookeeper/hdfs. And then, on an update of the application, I command Kafka readstream() to start reading from the offsets stored in this new location (zookeeper/hdfs) - since the updated application can't read from the checkpoint directory which is now deemed incompatible. So a couple of questions: 1. Is the above-stated solution a valid solution? 2. If yes, How can I automate the detection of whether the application is being restarted because of a failure/maintenance or because of code changes to stateful operations and/or output schema? Any guidance, example or information source is appreciated. Thanks, Priyank