[ 
https://issues.apache.org/jira/browse/FLINK-33565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788255#comment-17788255
 ] 

Rui Fan commented on FLINK-33565:
---------------------------------

Hi [~mapohl], thanks for providing this background. :)
{quote}There's a difference between the {{Default-}} and the
{{AdaptiveScheduler}}. The latter one doesn't support pipelined regions.
The {{DefaultScheduler}} does support them. Therefore, concurrent exceptions
should happen when using the {{AdaptiveScheduler}}.
{quote}
I don't understand why concurrent exceptions should happen when using the
{{AdaptiveScheduler}}. When a job has only all-to-all shuffles, the
AdaptiveScheduler and the DefaultScheduler should have similar
exception-related logic, right?

 
{quote}But there was an issue in the past that cannot be explained till now 
where concurrent exceptions caused an issue in a run that had the 
{{AdaptiveScheduler}} enabled (see FLINK-33121). So far, they looked into it 
but struggled to find the cause for this.
{quote}
I will take a look at FLINK-33121 as well.

 
{quote}On the other hand, {{DefaultScheduler}} comes with pipelined region 
support. The scenario that they have considered when thinking about concurrent 
exceptions was that you can have two pipelined regions being executed 
concurrently. They are both failing independently with one of the two errors 
becoming the root cause for the job's failure. The 
{{PipelinedRegionSchedulingStrategy}} is in charge of scheduling vertex 
restarts. Apparently, it would be possible to put the vertices of two different 
pipelines together to reduce the number of restarts.

I looked into the code of
{{PipelinedRegionSchedulingStrategy#restartTasks}}. I struggled to find the
merge behavior, though. Based on my finding, the 
{{PipelinedRegionSchedulingStrategy}} does indeed merge pipelined regions 
together. But only based on the vertices that are already selected for a 
restart. 
{quote}
Yeah, I agree with you. When multiple regions fail at the same time:
 * Action1: Flink should restart them together.
 * Action2: Flink should pick one exception as the root cause and report the
rest as the concurrent exceptions.

IIUC, Action1 was proposed by
[FLIP-364|https://cwiki.apache.org/confluence/x/uJqzDw]; part 1.2 of the FLIP
is related to this. FLIP-364 proposes merging multiple exceptions into one
restart.

Action2 was mentioned by [~zhuzh] in the [mailing list
thread|https://lists.apache.org/thread/l7wyc7pndpsvh2h7hj3fw2td9yphrlox] for
FLIP-364.
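
To make Action1 and Action2 concrete, here is a rough sketch of how failures
from several regions could be folded into one restart. It is only an
illustration and not Flink's actual code: the class names
({{MergedRestartSketch}}, {{Failure}}, {{RestartPlan}}) are hypothetical, and
picking the earliest failure as the root cause is an assumption made for the
example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; none of these names are part of Flink's API. */
class MergedRestartSketch {

    /** One failure reported by an execution of some region. */
    static final class Failure {
        final String region;
        final long timestampMillis;
        final String message;

        Failure(String region, long timestampMillis, String message) {
            this.region = region;
            this.timestampMillis = timestampMillis;
            this.message = message;
        }
    }

    /** Result of merging several failures into a single restart. */
    static final class RestartPlan {
        final Set<String> regionsToRestart;
        final Failure rootCause;
        final List<Failure> concurrentExceptions;

        RestartPlan(Set<String> regions, Failure rootCause, List<Failure> concurrent) {
            this.regionsToRestart = regions;
            this.rootCause = rootCause;
            this.concurrentExceptions = concurrent;
        }
    }

    static RestartPlan merge(List<Failure> failures) {
        if (failures.isEmpty()) {
            throw new IllegalArgumentException("at least one failure is required");
        }
        List<Failure> byTime = new ArrayList<>(failures);
        byTime.sort(Comparator.comparingLong((Failure f) -> f.timestampMillis));

        // Action1: restart all failed regions together.
        Set<String> regions = new LinkedHashSet<>();
        for (Failure f : byTime) {
            regions.add(f.region);
        }
        // Action2 (assumption): the earliest failure is the root cause,
        // all later ones are kept as concurrent exceptions.
        List<Failure> concurrent = new ArrayList<>(byTime.subList(1, byTime.size()));
        return new RestartPlan(regions, byTime.get(0), concurrent);
    }

    public static void main(String[] args) {
        RestartPlan plan = merge(Arrays.asList(
                new Failure("region-2", 105L, "disk full"),
                new Failure("region-1", 100L, "network error")));
        // prints: [region-1, region-2] root=network error concurrent=1
        System.out.println(plan.regionsToRestart + " root=" + plan.rootCause.message
                + " concurrent=" + plan.concurrentExceptions.size());
    }
}
```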

 

Please correct me if I misunderstood anything. Looking forward to your
feedback, thanks~

> The concurrentExceptions doesn't work
> -------------------------------------
>
>                 Key: FLINK-33565
>                 URL: https://issues.apache.org/jira/browse/FLINK-33565
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.18.0, 1.17.1
>            Reporter: Rui Fan
>            Assignee: Rui Fan
>            Priority: Major
>
> First of all, thanks to [~mapohl] for helping double-check in advance that 
> this was indeed a bug.
> Displaying the exception history in the WebUI was introduced by FLINK-6042.
> h1. What are concurrentExceptions?
> When an execution fails due to an exception, the other executions in the same 
> region are restarted as well, and the first exception is the rootException. 
> If other restarted executions also report exceptions at this time, we want to 
> collect these exceptions and display them to the user as concurrentExceptions.
> h2. What's this bug?
> The concurrentExceptions list is always empty in production, even if other 
> executions report exceptions at nearly the same time.
> h1. Why doesn't it work?
> If a job has an all-to-all shuffle, the job has only one region, and this 
> region has a lot of executions. If one execution throws an exception:
>  * The JobMaster marks this execution as FAILED.
>  * The rest of the executions of this region are marked as CANCELING.
>  ** This call stack can be found at FLIP-364 
> [part-4.2.3|https://cwiki.apache.org/confluence/display/FLINK/FLIP-364%3A+Improve+the+restart-strategy#FLIP364:Improvetherestartstrategy-4.2.3Detailedcodeforfull-failover]
>  
> When these executions throw exceptions as well, the JobMaster moves their 
> state from CANCELING to CANCELED instead of FAILED.
> A CANCELED execution does not go through the FAILED logic, so its exception 
> is ignored.
> Note: all reports are handled inside the JobMaster RPC thread, which is 
> single-threaded, so the reports are processed serially. As a result, only one 
> execution is marked as FAILED, and the rest are marked as CANCELED later.
> h1. How to fix it?
> After an offline discussion with [~mapohl], we first need to discuss with the 
> community whether we should keep the concurrentExceptions.
>  * If no, we can remove the related logic directly.
>  * If yes, we can discuss how to fix it later.
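
The state transitions described in the quoted issue above can be illustrated
with a small, simplified model. This is a sketch under assumptions and not
Flink's real code: {{CancelingSwallowsFailureSketch}}, {{Execution}}, and
{{reportFailure}} are hypothetical names, and the serial calls in {{main}}
stand in for reports handled one by one on the single JobMaster RPC thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model; these are not Flink's actual classes. */
class CancelingSwallowsFailureSketch {

    enum State { RUNNING, CANCELING, CANCELED, FAILED }

    static final class Execution {
        final String name;
        State state = State.RUNNING;

        Execution(String name) {
            this.name = name;
        }
    }

    /** All executions of the single region of an all-to-all job. */
    final List<Execution> region = new ArrayList<>();
    /** Exceptions that actually reach the failure-handling logic. */
    final List<String> recordedFailures = new ArrayList<>();

    Execution add(String name) {
        Execution execution = new Execution(name);
        region.add(execution);
        return execution;
    }

    /** Handles one failure report; callers invoke this serially. */
    void reportFailure(Execution execution, Throwable cause) {
        if (execution.state == State.CANCELING) {
            // Already being cancelled: the execution ends up CANCELED,
            // and its exception never reaches the failure logic.
            execution.state = State.CANCELED;
            return;
        }
        if (execution.state != State.RUNNING) {
            return; // already in a terminal state
        }
        execution.state = State.FAILED;
        recordedFailures.add(execution.name + ": " + cause.getMessage());
        // The first failure cancels every other execution of the region.
        for (Execution other : region) {
            if (other != execution && other.state == State.RUNNING) {
                other.state = State.CANCELING;
            }
        }
    }

    public static void main(String[] args) {
        CancelingSwallowsFailureSketch job = new CancelingSwallowsFailureSketch();
        Execution a = job.add("a");
        Execution b = job.add("b");
        job.reportFailure(a, new RuntimeException("first"));
        job.reportFailure(b, new RuntimeException("second"));
        // prints: FAILED CANCELED [a: first]
        System.out.println(a.state + " " + b.state + " " + job.recordedFailures);
    }
}
```

In this model the second exception is dropped because its execution was
already CANCELING when the report arrived, which matches the empty
concurrentExceptions described in the issue.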



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
