XComp edited a comment on pull request #14686:
URL: https://github.com/apache/flink/pull/14686#issuecomment-768170545


   After talking to @chesnay about the exception history, I had to reiterate 
over the implementation as I realized that I tried to solve it using wrong 
assumptions. I try to summarize my conclusions. Let me know if I got something 
wrong. 
   
   Processing a state change runs in the main thread, i.e. at least changing 
the different execution states to `FAILED` does not compete with each other. If 
a failure happens, all executions of the same pipeline region change their 
state to `CANCELING`. In the current implementation, no failures are logged for 
these executions due to the way how 
[Execution.processFail](https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L1093)
 is implemented. 
   
   I need the failure information for the exception history. Hence, I have to 
start collecting the failure even when the `Execution` is in state `CANCELING` 
or `CANCELED`. That should be ok as this `failureInfo` is really only used for 
the exception history (i.e. in the `JobExceptionsHandler`).
   
   For the execution history, we have to differentiate between job failures and 
restartable failures: 
   * Task failures are grouped by a key that is based on the set of 
`Executions` that need to be restarted (using their `ExecutionAttemptId` as a 
reference).
   * ~Job failures are grouped based on the `ExecutionGraph`. There is no 
unique identifier for `ExecutionGraph`, yet. I'm thinking of extending 
`AccessExecutionGraph` to add such an identifier. The alternative would be to 
use the set of `Executions` like it's done for the restartable failures. The 
problem with that approach is that a restartable failure affecting all 
`ExecutionVertices` of the `ExecutionGraph` would overlap with a job failure.~ 
the global failures will be ignored for now
   
   The exception history is then passed along with the `ArchivedExecutionGraph` 
to the `JobExceptionsHandler` in something like a `ExecutionGraphInformation` 
as proposed by @tillrohrmann . `ExecutionGraphCache` and related classes have 
to adapted accordingly. The `JobExceptionsHandler` will do the translation into 
a REST response.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to