[jira] [Comment Edited] (FLINK-6042) Display last n exceptions/causes for job restarts in Web UI

Matthias (Jira) Fri, 15 Jan 2021 00:10:12 -0800


    [ https://issues.apache.org/jira/browse/FLINK-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265799#comment-17265799 ]


Matthias edited comment on FLINK-6042 at 1/15/21, 8:09 AM:
-----------------------------------------------------------

{quote}Thanks for the proposal [~mapohl]. I have a few comments

1) When doing restarts with the new scheduler, then we will recreate the 
{{ExecutionGraph}}. Hence, exposing these error infos on the 
{{AccessExecutionGraph}} might not work.
{quote}
My plan was to expose the {{ErrorInfo}} not through {{ExecutionGraph}} but 
through {{ArchivedExecutionGraph}}, which is essentially a serializable copy of 
{{ExecutionGraph}} and also implements {{AccessExecutionGraph}}. Recreating the 
{{ExecutionGraph}} on restart should not affect the {{ErrorInfo}} collection, 
because the collection is held in the {{SchedulerNG}} implementation. It would 
be passed as an additional parameter when instantiating 
{{ArchivedExecutionGraph}} in 
[SchedulerBase.requestJob()|https://github.com/apache/flink/blob/ac968b83675e64712b4d35dbc166e09808c2156b/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/SchedulerBase.java#L800].
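
A minimal sketch of this idea, not actual Flink code: all class and method 
names below ({{SketchErrorInfo}}, {{SketchArchivedGraph}}, {{SketchScheduler}}, 
{{recordFailure}}, {{restart}}) are hypothetical stand-ins. It only illustrates 
that a history held by the scheduler survives the recreation of the graph and 
is copied into the archived snapshot when the job is requested.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for ErrorInfo: a failure message plus a timestamp. */
final class SketchErrorInfo {
    final String message;
    final long timestamp;

    SketchErrorInfo(String message, long timestamp) {
        this.message = message;
        this.timestamp = timestamp;
    }
}

/** Hypothetical immutable snapshot, playing the role of ArchivedExecutionGraph. */
final class SketchArchivedGraph {
    final int attempt;
    final List<SketchErrorInfo> failureHistory;

    SketchArchivedGraph(int attempt, List<SketchErrorInfo> failureHistory) {
        this.attempt = attempt;
        // Defensive copy so that later scheduler updates do not leak into the snapshot.
        this.failureHistory =
                Collections.unmodifiableList(new ArrayList<>(failureHistory));
    }
}

/** Hypothetical scheduler: the history lives here, not in the recreated graph. */
final class SketchScheduler {
    private final List<SketchErrorInfo> failureHistory = new ArrayList<>();
    private int graphAttempt = 0; // stands in for the live, restartable graph

    void recordFailure(String message, long timestamp) {
        failureHistory.add(new SketchErrorInfo(message, timestamp));
    }

    void restart() {
        graphAttempt++; // the graph is recreated; failureHistory is left untouched
    }

    /** The history is handed over as an extra constructor argument. */
    SketchArchivedGraph requestJob() {
        return new SketchArchivedGraph(graphAttempt, failureHistory);
    }
}
```

Calling {{recordFailure}}, then {{restart}}, then {{requestJob}} yields a 
snapshot that still contains the failure recorded before the restart.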

{quote}2) {{UpdateSchedulerNgOnInternalFailuresListener}} will only be called 
if an exception on the JM occurs. If there is a normal task failure, then we 
will call {{updateTaskExecutionState}}
{quote}
Thanks for the hint: {{SchedulerNG.updateTaskExecutionState(..)}} should work 
as well. It would collect all the {{Exceptions}} thrown in the different 
tasks. We could then use the {{JobStatusListener}} interface to identify 
the {{Exception}} that causes the job to restart and group all previously 
collected {{ErrorInfo}}s (those not already assigned to a different root 
cause) under this root cause. I'm not yet 100% sure whether triggering the 
consolidation of the {{ErrorInfo}}s is that straightforward.
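
The grouping could look roughly like the following sketch. It is not Flink 
code: {{SketchFailureGroup}}, {{SketchFailureCollector}}, {{onTaskFailure}} and 
{{onRestart}} are hypothetical names, and the assumption is that 
{{onTaskFailure}} would be driven by the task state updates while 
{{onRestart}} would be driven by the job status change that signals a restart.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical group: one restart root cause plus the task failures assigned to it. */
final class SketchFailureGroup {
    final Throwable rootCause;
    final List<Throwable> taskFailures;

    SketchFailureGroup(Throwable rootCause, List<Throwable> taskFailures) {
        this.rootCause = rootCause;
        this.taskFailures = Collections.unmodifiableList(new ArrayList<>(taskFailures));
    }
}

/** Hypothetical collector that consolidates task failures per restart cycle. */
final class SketchFailureCollector {
    // Failures seen since the last restart, not yet assigned to a root cause.
    private final List<Throwable> unassigned = new ArrayList<>();
    private final List<SketchFailureGroup> groups = new ArrayList<>();

    /** Assumed to be called once per failed task. */
    void onTaskFailure(Throwable failure) {
        unassigned.add(failure);
    }

    /** Assumed to be called when the job restarts because of rootCause. */
    void onRestart(Throwable rootCause) {
        // Everything collected so far belongs to this restart cycle.
        groups.add(new SketchFailureGroup(rootCause, unassigned));
        unassigned.clear();
    }

    List<SketchFailureGroup> groups() {
        return Collections.unmodifiableList(groups);
    }
}
```

Failures reported after a restart stay unassigned until the next restart, so 
no failure ends up under more than one root cause.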

{quote}3) It would be great to group the exceptions with respect to their 
restart cycles in the web UI. So seeing the root causes for a restart and then 
being able to 
expand the view to see the task failures for this specific restart would be 
awesome.
{quote}
The {{ErrorInfo}} groups mentioned in 2) could then be returned through the 
newly introduced access method described in 1) and forwarded by the 
{{JobExceptionsHandler}} to the web UI.


> Display last n exceptions/causes for job restarts in Web UI
> -----------------------------------------------------------
>
>                 Key: FLINK-6042
>                 URL: https://issues.apache.org/jira/browse/FLINK-6042
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination, Runtime / Web Frontend
>    Affects Versions: 1.3.0
>            Reporter: Till Rohrmann
>            Assignee: Matthias
>            Priority: Major
>
> Users requested that it would be nice to see the last {{n}} exceptions 
> causing a job restart in the Web UI. This will help to more easily debug and 
> operate a job.
> We could store the root causes for failures similar to how prior executions 
> are stored in the {{ExecutionVertex}} using the {{EvictingBoundedList}} and 
> then serve this information via the Web UI.
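
To illustrate the "last {{n}}" storage mentioned in the description, here is a 
minimal sketch of a bounded history that evicts its oldest entry once it is 
full. {{LastNHistory}} is a hypothetical class built on {{java.util.ArrayDeque}}; 
it is not Flink's {{EvictingBoundedList}} and makes no claim about how that 
class is implemented.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical bounded history that keeps only the most recent entries. */
final class LastNHistory<T> {
    private final int capacity;
    private final Deque<T> entries = new ArrayDeque<>();

    LastNHistory(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    void add(T entry) {
        if (entries.size() == capacity) {
            entries.removeFirst(); // evict the oldest entry
        }
        entries.addLast(entry);
    }

    /** Returns a copy ordered from oldest to newest. */
    List<T> snapshot() {
        return new ArrayList<>(entries);
    }
}
```

With a capacity of 2, adding three root causes keeps only the last two, 
ordered from oldest to newest.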



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
