Hi, I am testing the Flink Fine-Grained Recovery
<https://cwiki.apache.org/confluence/display/FLINK/FLIP-1%3A+Fine+Grained+Recovery+from+Task+Failures>
from Task Failures on Flink 1.17 and am facing some issues where I need
some advice. Have a jobgraph below with 5 operators, and all connections
between operators are pipelined and the job's parallelism.default is set to
2. Have configured RestartPipelinedRegionFailoverStrategy with Exponential
Delay Restart Strategy.

A -> B -> C -> D -> E

There are a total of 10 tasks running. The first pipeline  (a1 to e1) runs
on a TaskManager (say TM1), and the second pipeline (a2 to e2) runs on
another TaskManager (say TM2).

a1 -> b1 -> c1 -> d1 -> e1
a2 -> b2 -> c2 -> d2 -> e2

When TM1 failed, I expected only 5 tasks (a1 to e1) would fail and they
alone would be restarted, but all 10 tasks are getting restarted. There is
only one pipeline region, which consists of all 10 execution vertices, and
RestartPipelinedRegionFailoverStrategy#getTasksNeedingRestart returns all
10 tasks. Is it the right behaviour, or could there be any issue? Is it
possible to restart only the pipeline of the failed task (a1 to e1) without
restarting other parallel pipelines.

Thanks,
Prabhu Joseph

Reply via email to