Hi community,

I’d like to initiate a discussion regarding CIP-21: Support Flink job recovery from JobManager failure for Apache Celeborn [1].
This proposal aims to enable Celeborn to support Flink’s batch job recovery feature [2]. With this enhancement, Flink batch jobs using Celeborn will be able to recover from previously completed stages after a JobManager failure, eliminating the need to restart the entire job from scratch.

Your feedback and questions are welcome. Please feel free to share any thoughts you may have.

Best regards,
Xu Huang

[1] CIP-21: Support flink jobs recovery from JobManager failure for Apache Celeborn. https://cwiki.apache.org/confluence/x/kw9JFg
[2] FLIP-383: Support Job Recovery from JobMaster Failures for Batch Jobs. https://cwiki.apache.org/confluence/x/QwqZE