Hi, I am testing the Flink Fine-Grained Recovery <https://cwiki.apache.org/confluence/display/FLINK/FLIP-1%3A+Fine+Grained+Recovery+from+Task+Failures> from Task Failures on Flink 1.17 and am facing some issues where I need some advice. Have a jobgraph below with 5 operators, and all connections between operators are pipelined and the job's parallelism.default is set to 2. Have configured RestartPipelinedRegionFailoverStrategy with Exponential Delay Restart Strategy.
A -> B -> C -> D -> E There are a total of 10 tasks running. The first pipeline (a1 to e1) runs on a TaskManager (say TM1), and the second pipeline (a2 to e2) runs on another TaskManager (say TM2). a1 -> b1 -> c1 -> d1 -> e1 a2 -> b2 -> c2 -> d2 -> e2 When TM1 failed, I expected only 5 tasks (a1 to e1) would fail and they alone would be restarted, but all 10 tasks are getting restarted. There is only one pipeline region, which consists of all 10 execution vertices, and RestartPipelinedRegionFailoverStrategy#getTasksNeedingRestart returns all 10 tasks. Is it the right behaviour, or could there be any issue? Is it possible to restart only the pipeline of the failed task (a1 to e1) without restarting other parallel pipelines. Thanks, Prabhu Joseph