Hi Flink community,

We recently investigated a scenario that led us to revisit Flink's recovery semantics around checkpoints and savepoints, and I'd like to ask some questions. We run Flink 1.20 on Kubernetes with the flink-kubernetes-operator. Autoscaling is enabled, so restarts due to rescaling or spec changes can occur around the same time as manual savepointing. Our sinks use Flink's FileSink, writing Parquet files to S3 with exactly-once semantics.
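For context, our deployment spec looks roughly like the sketch below (names, paths, and intervals are placeholders, not our real values); the relevant parts are the enabled autoscaler and the savepointTriggerNonce we bump to take manual savepoints:

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: parquet-writer            # placeholder name
spec:
  flinkVersion: v1_20
  flinkConfiguration:
    job.autoscaler.enabled: "true"
    execution.checkpointing.interval: "60 s"        # placeholder interval
    state.checkpoints.dir: s3://bucket/checkpoints  # placeholder bucket
    state.savepoints.dir: s3://bucket/savepoints    # placeholder bucket
  job:
    jarURI: local:///opt/flink/usrlib/job.jar       # placeholder
    upgradeMode: savepoint
    savepointTriggerNonce: 1      # bumped to trigger a manual savepoint
```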
In a reproduced scenario, a checkpoint completed, then a manual savepoint completed, and the job restarted before the next periodic checkpoint finished. On restart, Flink restored from the earlier checkpoint rather than the newer savepoint, leading to reprocessing and duplicate Parquet files being written. From code inspection and the docs, this appears to be expected behavior: Flink never uses savepoints for automatic recovery and always restores from the latest completed checkpoint (FLIP-193 / FLINK-25191).

So my questions are:

- Has there been any discussion about making recovery from a savepoint configurable (e.g. when it is newer than the last completed checkpoint and still present)?
- Is the expectation today that 2PC sinks must avoid irreversible side effects during savepoint snapshots, or encode enough metadata to deduplicate on recovery?

Thanks for any insight or pointers to prior discussions.

Best regards,
Lucas
