[ https://issues.apache.org/jira/browse/FLINK-21707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298711#comment-17298711 ]
Till Rohrmann commented on FLINK-21707:
---------------------------------------

Thanks for reporting this issue [~zhuzh]. Why do we have to wait for a blocking intermediate result to be fully produced before scheduling independent consumers? Not having to do this would let us get rid of {{PipelinedRegionSchedulingStrategy.correlatedResultPartitions}} and avoid the problem you are describing. Is it because we want to support stage-wise scheduling?

If this is strictly required, then your proposal should work. However, I am a bit hesitant because it feels as if we are complicating things, and these special cases make everything a bit more brittle.

> Job is possible to hang when restarting a FINISHED task with POINTWISE BLOCKING consumers
> -----------------------------------------------------------------------------------------
>
> Key: FLINK-21707
> URL: https://issues.apache.org/jira/browse/FLINK-21707
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.11.3, 1.12.2, 1.13.0
> Reporter: Zhu Zhu
> Priority: Blocker
>
> A job can hang when restarting a FINISHED task with POINTWISE BLOCKING
> consumers. This is because
> {{PipelinedRegionSchedulingStrategy#onExecutionStateChange()}} tries to
> schedule all the consumer tasks/regions of the finished *ExecutionJobVertex*,
> even though some of those regions are not consumers of the finished
> *ExecutionVertex* itself. In this case, some of the regions can be in a
> non-CREATED state because they are neither connected to nor affected by the
> restarted tasks. However,
> {{PipelinedRegionSchedulingStrategy#maybeScheduleRegion()}} does not allow
> scheduling a non-CREATED region; it throws an exception, which breaks the
> scheduling of all the other regions.
> An example that reproduces this problem can be found at
> [PipelinedRegionSchedulingITCase#testRecoverFromPartitionException|https://github.com/zhuzhurk/flink/commit/1eb036b6566c5cb4958d9957ba84dc78ce62a08c].
> To fix the problem, we can add a filter in
> {{PipelinedRegionSchedulingStrategy#onExecutionStateChange()}} so that it
> only triggers the scheduling of regions in CREATED state.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
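
For illustration, the filter proposed in the issue description might look roughly like the sketch below. It is a minimal, self-contained model and not Flink's actual code: {{Region}}, {{RegionState}}, {{selectSchedulable}} and {{onProducerFinished}} are hypothetical stand-ins, and the only behaviour taken from the issue is that scheduling a non-CREATED region throws.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CreatedRegionFilterSketch {

    /** Hypothetical stand-in for the lifecycle state of a pipelined region. */
    enum RegionState { CREATED, SCHEDULED, RUNNING, FINISHED }

    /** Hypothetical stand-in for a pipelined region; not Flink's real class. */
    static final class Region {
        final String name;
        RegionState state;

        Region(String name, RegionState state) {
            this.name = name;
            this.state = state;
        }
    }

    /** Keeps only the regions that are still in CREATED state. */
    static List<Region> selectSchedulable(List<Region> consumerRegions) {
        List<Region> schedulable = new ArrayList<>();
        for (Region region : consumerRegions) {
            if (region.state == RegionState.CREATED) {
                schedulable.add(region);
            }
        }
        return schedulable;
    }

    /** Models the strict check described in the issue: non-CREATED regions are rejected. */
    static void maybeScheduleRegion(Region region) {
        if (region.state != RegionState.CREATED) {
            throw new IllegalStateException(
                    "Region " + region.name + " is in state " + region.state);
        }
        region.state = RegionState.SCHEDULED;
    }

    /**
     * Models the reaction to a finished producer: filter first, then schedule.
     * Returns the names of the regions that were scheduled.
     */
    static List<String> onProducerFinished(List<Region> consumerRegions) {
        List<String> scheduled = new ArrayList<>();
        for (Region region : selectSchedulable(consumerRegions)) {
            maybeScheduleRegion(region);
            scheduled.add(region.name);
        }
        return scheduled;
    }

    public static void main(String[] args) {
        List<Region> regions = Arrays.asList(
                new Region("restarted-consumer", RegionState.CREATED),
                new Region("unaffected-consumer", RegionState.RUNNING));
        // Only the CREATED region is scheduled; the RUNNING one is skipped
        // instead of triggering the exception.
        System.out.println(onProducerFinished(regions));
    }
}
```

Without the call to {{selectSchedulable}}, the RUNNING region in this model would reach {{maybeScheduleRegion}} and abort the loop, which is the failure mode the issue describes.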