Hi everyone,

I am currently exploring the fault tolerance and recovery mechanisms of BATCH execution mode in Apache Flink.

If I terminate the TaskManager process while the job is running, the job restarts from the point of failure. However, at some point the job restarts from the very beginning instead.

The documentation mentions that checkpointing and the state backend do not work in BATCH mode. How, then, does recovery after a failure occur in BATCH mode?

According to the documentation:

“In BATCH runtime mode, Flink will attempt to return to previous processing steps for which intermediate results are still available. Potentially, only those tasks that fail (or their predecessors in the graph) will have to be restarted.”

https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/datastream/execution_mode/

I would appreciate any information regarding this matter.

Kind regards,
Vladimir
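
P.S. To make the question concrete, here is how I currently read the quoted passage, written as a toy Python model. This is not Flink code: the stage names and the `stages_to_rerun` helper are made up for illustration, and the job is assumed to be a simple linear pipeline. Under this reading, "restart from the very beginning" would be the case where none of the intermediate results are still available. Is that roughly what happens?

```python
def stages_to_rerun(stages, available, failed):
    """Toy model of the quoted sentence (hypothetical, not a Flink API).

    stages:    stage names in pipeline order (linear job assumed)
    available: set of stages whose intermediate results can still be read
    failed:    name of the stage whose task failed
    Returns the stages that have to be executed again.
    """
    failed_idx = stages.index(failed)
    start = failed_idx
    # Walk back towards the source for as long as the output of the
    # preceding stage is no longer available.
    while start > 0 and stages[start - 1] not in available:
        start -= 1
    return stages[start:failed_idx + 1]


pipeline = ["source", "map", "aggregate", "sink"]

# Upstream results intact: only the failed stage is re-run.
print(stages_to_rerun(pipeline, {"source", "map"}, "aggregate"))

# Output of "map" is gone: its producer has to be re-run as well.
print(stages_to_rerun(pipeline, {"source"}, "aggregate"))

# Nothing is available any more: the job starts over from the source.
print(stages_to_rerun(pipeline, set(), "aggregate"))
```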