[ 
https://issues.apache.org/jira/browse/FLINK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Panagiotis Garefalakis updated FLINK-33121:
-------------------------------------------
    Description: 
{{JobExceptionsHandler#createRootExceptionInfo}} makes the assumption that 
*Global* Failures (with null Task name) may *only* be RootExceptions (jobs are 
considered in FAILED state when this happens and no further exceptions are 
captured) and *Local/Task* may be part of concurrent exceptions List *--* if 
this precondition is violated, an assertion is thrown as part of 
{{{}asserLocalExceptionInfo{}}}.

The issue lies within 
[convertFailures|[https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StateWithExecutionGraph.java#L422]]
 logic where we take the failureCollection pointer and convert it to a 
HistoryEntry.
In more detail, we are passing the first Failure and a pointer to the remaining 
failures collection as part of HistoryEntry creation — and then add the entry 
in the exception History.
In our specific scenario a Local Failure first comes in, we call 
convertFailures that creates a HistoryEntry and removes the LocalFailure from 
the collection while also passing a pointer to the empty failureCollection. 
Then a Global failure comes in (and before conversion), it is added to the 
failureCollection (that was empty) just before serving the requestJob that 
returns the List of History Entries.
This messes things up, as the LocalFailure now has a 
ConcurrentExceptionsCollection with a Global Failure that should never happen 
(causing the assertion).
A solution is to create a Copy of the failureCollection in the conversion 
instead of passing the pointer around (as I did in the updated PR)

This PR also fixes a smaller bug where we dont pass the 
[taskName|[https://github.com/apache/flink/pull/23440/files#diff-0c8b850bbd267631fbe04bb44d8bb3c7e87c3c6aabae904fabdb758026f7fa76R104]|https://github.com/apache/flink/pull/23440/files#diff-0c8b850bbd267631fbe04bb44d8bb3c7e87c3c6aabae904fabdb758026f7fa76R104]
 properly.

Note: DefaultScheduler does not suffer from this issue as it treats failures 
directly as HistoryEntries (no conversion step)

  was:
{{JobExceptionsHandler#createRootExceptionInfo}} makes the assumption that 
*Global* Failures (with null Task name) may *only* be RootExceptions (jobs are 
considered in FAILED state when this happens and no further exceptions are 
captured) and *Local/Task* may be part of concurrent exceptions List *--* if 
this precondition is violated, an assertion is thrown as part of 
{{{}asserLocalExceptionInfo{}}}.

The issue lies within 
[convertFailures](https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StateWithExecutionGraph.java#L422)
 logic where we take the failureCollection pointer and convert it to a 
HistoryEntry.
In more detail, we are passing the first Failure and a pointer to the remaining 
failures collection as part of HistoryEntry creation — and then add the entry 
in the exception History.
In our specific scenario a Local Failure first comes in, we call 
convertFailures that creates a HistoryEntry and removes the LocalFailure from 
the collection while also passing a pointer to the empty failureCollection. 
Then a Global failure comes in (and before conversion), it is added to the 
failureCollection (that was empty) just before serving the requestJob that 
returns the List of History Entries.
This messes things up, as the LocalFailure now has a 
ConcurrentExceptionsCollection with a Global Failure that should never happen 
(causing the assertion).
A solution is to create a Copy of the failureCollection in the conversion 
instead of passing the pointer around (as I did in the updated PR)

This PR also fixes a smaller bug where we dont pass the 
[taskName](https://github.com/apache/flink/pull/23440/files#diff-0c8b850bbd267631fbe04bb44d8bb3c7e87c3c6aabae904fabdb758026f7fa76R104)
 properly.

Note: DefaultScheduler does not suffer from this issue as it treats failures 
directly as HistoryEntries (no conversion step)


> Failed precondition in JobExceptionsHandler due to concurrent global failures
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-33121
>                 URL: https://issues.apache.org/jira/browse/FLINK-33121
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: Panagiotis Garefalakis
>            Priority: Major
>              Labels: pull-request-available
>
> {{JobExceptionsHandler#createRootExceptionInfo}} makes the assumption that 
> *Global* Failures (with null Task name) may *only* be RootExceptions (jobs 
> are considered in FAILED state when this happens and no further exceptions 
> are captured) and *Local/Task* may be part of concurrent exceptions List *--* 
> if this precondition is violated, an assertion is thrown as part of 
> {{{}asserLocalExceptionInfo{}}}.
> The issue lies within 
> [convertFailures|[https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StateWithExecutionGraph.java#L422]]
>  logic where we take the failureCollection pointer and convert it to a 
> HistoryEntry.
> In more detail, we are passing the first Failure and a pointer to the 
> remaining failures collection as part of HistoryEntry creation — and then add 
> the entry in the exception History.
> In our specific scenario a Local Failure first comes in, we call 
> convertFailures that creates a HistoryEntry and removes the LocalFailure from 
> the collection while also passing a pointer to the empty failureCollection. 
> Then a Global failure comes in (and before conversion), it is added to the 
> failureCollection (that was empty) just before serving the requestJob that 
> returns the List of History Entries.
> This messes things up, as the LocalFailure now has a 
> ConcurrentExceptionsCollection with a Global Failure that should never happen 
> (causing the assertion).
> A solution is to create a Copy of the failureCollection in the conversion 
> instead of passing the pointer around (as I did in the updated PR)
> This PR also fixes a smaller bug where we dont pass the 
> [taskName|[https://github.com/apache/flink/pull/23440/files#diff-0c8b850bbd267631fbe04bb44d8bb3c7e87c3c6aabae904fabdb758026f7fa76R104]|https://github.com/apache/flink/pull/23440/files#diff-0c8b850bbd267631fbe04bb44d8bb3c7e87c3c6aabae904fabdb758026f7fa76R104]
>  properly.
> Note: DefaultScheduler does not suffer from this issue as it treats failures 
> directly as HistoryEntries (no conversion step)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to