[jira] [Comment Edited] (FLINK-6042) Display last n exceptions/causes for job restarts in Web UI

2021-04-20 Thread Matthias (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325798#comment-17325798
 ] 

Matthias edited comment on FLINK-6042 at 4/20/21, 1:33 PM:
---

I moved the UI extension (FLINK-21867) out into an independent task. The 
initial goal of FLINK-6042, exposing the exceptions through the UI, is done.


was (Author: mapohl):
I moved the UI extension FLINK-21867 out to work independently on it. The 
initial goal of exposing the exceptions through the UI is done.

> Display last n exceptions/causes for job restarts in Web UI
> ---
>
> Key: FLINK-6042
> URL: https://issues.apache.org/jira/browse/FLINK-6042
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination, Runtime / Web Frontend
>Affects Versions: 1.3.0
>Reporter: Till Rohrmann
>Assignee: Matthias
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.13.0
>
> Attachments: 截屏2021-01-28 下午4.47.46.png
>
>
> Users requested that it would be nice to see the last {{n}} exceptions 
> causing a job restart in the Web UI. This will help to more easily debug and 
> operate a job.
> We could store the root causes for failures similar to how prior executions 
> are stored in the {{ExecutionVertex}} using the {{EvictingBoundedList}} and 
> then serve this information via the Web UI.
> _-- Update: January 21, 2021 --_
> The UI can already handle multiple exceptions through the Exception History. 
> Right now, we list one or more exceptions which caused the job to fail. 
> Instead, we could adapt it in a way that the history contains not only the 
> exceptions of the most recent failure but one expandable entry per restart. 
> If there is more than one exception connected to a single restart, we would 
> list their stacktraces within one expandable entry.
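
The storage idea from the description (keep only the last {{n}} root causes) can be sketched roughly as follows. This is a minimal illustration under assumptions: {{BoundedFailureHistory}} and its methods are hypothetical names, and a plain {{ArrayDeque}} stands in for Flink's {{EvictingBoundedList}}, whose actual API is not shown in this thread.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch: keeps only the last n root causes of job restarts. */
final class BoundedFailureHistory {
    private final int capacity;
    private final Deque<Throwable> rootCauses = new ArrayDeque<>();

    BoundedFailureHistory(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Records a root cause; evicts the oldest entry first if the history is full. */
    void add(Throwable rootCause) {
        if (rootCauses.size() == capacity) {
            rootCauses.removeFirst();
        }
        rootCauses.addLast(rootCause);
    }

    /** Returns a copy of the retained root causes, oldest first. */
    List<Throwable> snapshot() {
        return new ArrayList<>(rootCauses);
    }
}
```

With a capacity of 2, adding three failures leaves only the two most recent ones in the snapshot, which is the behaviour a "last n exceptions" view would need.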



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-6042) Display last n exceptions/causes for job restarts in Web UI

2021-01-22 Thread Matthias (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17269499#comment-17269499
 ] 

Matthias edited comment on FLINK-6042 at 1/22/21, 10:47 AM:


{quote}Taking your argument, why is it better to add the exception information 
method to the {{ArchivedExecutionGraph}} and making it thereby accessible to 
all {{AbstractExecutionGraphHandler}} handlers? Wouldn't it make sense to only 
provide access to the information a handler needs? In our case, one could 
give access to the {{AccessExecutionGraph}} for those handlers which extract 
information from the {{ExecutionGraph}} and maybe something like a 
{{FailureHistory}} for the {{JobExceptionsHandler}}? In the end the 
{{ArchivedExecutionGraph}} might also implement {{FailureHistory}} but I think 
the important bit is to segregate the interfaces.
{quote}
Good point: Having a separated interface sounds like the better approach.
{quote}Thinking a step ahead, how would it work with the 
{{ArchivedExecutionGraph}} if we send multiple graphs because it changed over 
the job's lifetime. To which graph will the exception causing the lifetime end 
of a graph be assigned?
{quote}
As we have a list of {{ArchivedExecutionGraphs}} in chronological order, I 
would assume that every instance except the last one has a {{failureCause}} that 
triggered the instantiation of a new {{ExecutionGraph}}. If no failure cause is 
given, the instantiation happened due to rescaling 
(alternatively, we could think of a new state to make that more explicit?). The 
most recent {{ExecutionGraph}} then either holds the failure that caused the 
job to fail, or no failure cause if the job is in a non-failed state.

But considering that we might want to hand over a list of 
{{ArchivedExecutionGraphs}} in the future, it would be worth having a 
class that holds the {{ArchivedExecutionGraph}} (or later a list of 
{{ArchivedExecutionGraphs}}) and that implements {{FailureHistory}}.
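
The interface segregation discussed above could look roughly like the following sketch. All names except {{FailureHistory}} are hypothetical ({{GraphView}}, {{ArchivedJobInfo}}, {{ExceptionsView}}), and the method {{getFailureCauses()}} is an assumption for illustration; none of this is Flink's actual API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Narrow view for an exceptions handler; the method name is an assumption. */
interface FailureHistory {
    List<Throwable> getFailureCauses();
}

/** Hypothetical stand-in for the graph-related view used by the other handlers. */
interface GraphView {
    String getJobName();
}

/** Hypothetical holder that implements both views. */
final class ArchivedJobInfo implements GraphView, FailureHistory {
    private final String jobName;
    private final List<Throwable> failureCauses;

    ArchivedJobInfo(String jobName, List<Throwable> failureCauses) {
        this.jobName = jobName;
        // Defensive copy so that later changes by the caller do not leak in.
        this.failureCauses = Collections.unmodifiableList(new ArrayList<>(failureCauses));
    }

    @Override
    public String getJobName() {
        return jobName;
    }

    @Override
    public List<Throwable> getFailureCauses() {
        return failureCauses;
    }
}

/** Hypothetical handler that only sees the failure history, not the graph. */
final class ExceptionsView {
    static List<String> render(FailureHistory history) {
        List<String> messages = new ArrayList<>();
        for (Throwable cause : history.getFailureCauses()) {
            messages.add(cause.getMessage());
        }
        return messages;
    }
}
```

The point of the sketch is that {{ExceptionsView.render(..)}} accepts any {{FailureHistory}}, so the holder can later be backed by one graph or by a list of graphs without touching the handler.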


was (Author: mapohl):
{quote}Taking your argument, why is it better to add the exception information 
method to the {{ArchivedExecutionGraph}} and making it thereby accessible to 
all {{AbstractExecutionGraphHandler}} handlers? Wouldn't it make sense to only 
provide access to the information a handler needs? In our case, one could 
give access to the {{AccessExecutionGraph}} for those handlers which extract 
information from the {{ExecutionGraph}} and maybe something like a 
{{FailureHistory}} for the {{JobExceptionsHandler}}? In the end the 
{{ArchivedExecutionGraph}} might also implement {{FailureHistory}} but I think 
the important bit is to segregate the interfaces.
{quote}
Good point: Having a separated interface sounds like the better approach.
{quote}Thinking a step ahead, how would it work with the 
{{ArchivedExecutionGraph}} if we send multiple graphs because it changed over 
the job's lifetime. To which graph will the exception causing the lifetime end 
of a graph be assigned?
{quote}
As we have a list of {{ArchivedExecutionGraphs}} in chronological order, I 
would assume that every instance except the last one has a {{failureCause}} that 
triggered the instantiation of a new {{ExecutionGraph}}. If no failure cause is 
given, the instantiation happened due to rescaling 
(alternatively, we could think of a new state to make that more explicit?). The 
most recent {{ExecutionGraph}} then either holds the failure that caused the 
job to fail, or no failure cause if the job is in a non-failed state.


[jira] [Comment Edited] (FLINK-6042) Display last n exceptions/causes for job restarts in Web UI

2021-01-21 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17269092#comment-17269092
 ] 

Till Rohrmann edited comment on FLINK-6042 at 1/21/21, 9:17 AM:


We have two approaches (which we discussed offline) to implement this feature:
 # The {{JobExceptionsHandler}} does most of the work by iterating over the 
{{ArchivedExecutions}} of the passed {{ArchivedExecutionGraph}}. 
{{ArchivedExecutions}} provide the time (through 
{{ArchivedExecution.stateTimestamps}}) and the thrown exception 
({{ArchivedExecution.failureCause}}). The {{SchedulerNG}} implementation would 
need to collect a mapping of {{failureCause}} to {{ExecutionAttemptID}} and 
pass it over to the {{JobExceptionsHandler}} along with the 
{{ArchivedExecutionGraph}}. This would enable the handler to group exceptions 
that happened due to the same failure cause.
 ** +Pros:+
 *** This approach has the advantage of using mostly code that is already there.
 *** No extra code in the {{SchedulerBase}} implementation.
 ** Cons:
 *** It does not support restarts of the {{ExecutionGraph}}. This restart 
functionality is planned for the declarative scheduler which we're currently 
working on (see 
[FLIP-160|https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Declarative+Scheduler]).
 Only the most recent {{ExecutionGraph}} (and, therefore, its exceptions) is 
provided.
 *** There might be modifications necessary to the internally used data 
structures allowing random access based on {{ExecutionAttemptID}} instead of 
iterating over collections.
 # The collection of exceptions happens in the scheduler. The mapping of root 
cause to related exceptions is then passed over to the 
{{JobExceptionsHandler}}. The exceptions can be collected as they appear.
 ** +Pros:+
 *** It makes it easier to port this functionality into the declarative 
scheduler of FLIP-160. We don't need to think of a history of 
{{ArchivedExecutionGraphs}} for now. Restarts of the {{ExecutionGraph}} are 
hidden away from the {{JobExceptionsHandler}}.
 ** +Cons:+
 *** The {{SchedulerBase}} code base grows once more which increases complexity.

We decided to go with option 2 for now. This makes it easier for us to 
implement the functionality into the declarative scheduler of FLIP-160.
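
Option 2 can be sketched as a small scheduler-side collector that maps each root cause to its related exceptions as they appear. This is only an illustration under assumptions: {{SchedulerExceptionCollector}} and its methods are hypothetical names, not code from {{SchedulerBase}}.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical scheduler-side collector; not part of Flink's API. */
final class SchedulerExceptionCollector {
    // Insertion-ordered so that root causes are reported in the order they appeared.
    private final Map<Throwable, List<Throwable>> relatedByRootCause = new LinkedHashMap<>();

    /** Registers a root cause as soon as the scheduler sees it. */
    void registerRootCause(Throwable rootCause) {
        relatedByRootCause.putIfAbsent(rootCause, new ArrayList<>());
    }

    /** Adds a related exception, registering the root cause first if necessary. */
    void addRelated(Throwable rootCause, Throwable related) {
        relatedByRootCause.computeIfAbsent(rootCause, k -> new ArrayList<>()).add(related);
    }

    /** Returns an immutable copy that could be passed on to a handler. */
    Map<Throwable, List<Throwable>> snapshot() {
        Map<Throwable, List<Throwable>> copy = new LinkedHashMap<>();
        for (Map.Entry<Throwable, List<Throwable>> entry : relatedByRootCause.entrySet()) {
            copy.put(entry.getKey(),
                    Collections.unmodifiableList(new ArrayList<>(entry.getValue())));
        }
        return Collections.unmodifiableMap(copy);
    }
}
```

Because the collector lives outside the graph in this sketch, recreating the {{ExecutionGraph}} on restart would not discard what has been collected so far.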


was (Author: mapohl):
We have two approaches (which we discussed offline) to implement this feature:
 # The {{JobExceptionsHandler}} does most of the work by iterating over the 
{{ArchivedExecution}}s of the passed {{ArchivedExecutionGraph}}. 
{{ArchivedExecutions}} provide the time (through 
{{ArchivedExecution.stateTimestamps}}) and the thrown exception 
({{ArchivedExecution.failureCause}}). The {{SchedulerNG}} implementation would 
need to collect a mapping of {{failureCause}} to {{ExecutionAttemptID}} and 
pass it over to the {{JobExceptionsHandler}} along with the 
{{ArchivedExecutionGraph}}. This would enable the handler to group exceptions 
that happened due to the same failure cause.
 ** +Pros:+
 *** This approach has the advantage of using mostly code that is already there.
 *** No extra code in the {{SchedulerBase}} implementation.
 ** Cons:
 *** It does not support restarts of the {{ExecutionGraph}}. This restart 
functionality is planned for the declarative scheduler which we're currently 
working on (see 
[FLIP-160|https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Declarative+Scheduler]).
 Only the most recent {{ExecutionGraph}} (and, therefore, its exceptions) is 
provided.
 *** There might be modifications necessary to the internally used data 
structures allowing random access based on {{ExecutionAttemptID}} instead of 
iterating over collections.
 # The collection of exceptions happens in the scheduler. The mapping of root 
cause to related exceptions is then passed over to the 
{{JobExceptionsHandler}}. The exceptions can be collected as they appear.
 ** +Pros:+
 *** It makes it easier to port this functionality into the declarative 
scheduler of FLIP-160. We don't need to think of a history of 
{{ArchivedExecutionGraphs}} for now. Restarts of the {{ExecutionGraph}} are 
hidden away from the {{JobExceptionsHandler}}.
 ** +Cons:+
 *** The {{SchedulerBase}} code base grows once more which increases complexity.

We decided to go with option 2 for now. This makes it easier for us to 
implement the functionality into the declarative scheduler of FLIP-160.


[jira] [Comment Edited] (FLINK-6042) Display last n exceptions/causes for job restarts in Web UI

2021-01-20 Thread Matthias (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17269092#comment-17269092
 ] 

Matthias edited comment on FLINK-6042 at 1/21/21, 7:34 AM:
---

We have two approaches (which we discussed offline) to implement this feature:
 # The {{JobExceptionsHandler}} does most of the work by iterating over the 
{{ArchivedExecution}}s of the passed {{ArchivedExecutionGraph}}. 
{{ArchivedExecutions}} provide the time (through 
{{ArchivedExecution.stateTimestamps}}) and the thrown exception 
({{ArchivedExecution.failureCause}}). The {{SchedulerNG}} implementation would 
need to collect a mapping of {{failureCause}} to {{ExecutionAttemptID}} and 
pass it over to the {{JobExceptionsHandler}} along with the 
{{ArchivedExecutionGraph}}. This would enable the handler to group exceptions 
that happened due to the same failure cause.
 ** +Pros:+
 *** This approach has the advantage of using mostly code that is already there.
 *** No extra code in the {{SchedulerBase}} implementation.
 ** Cons:
 *** It does not support restarts of the {{ExecutionGraph}}. This restart 
functionality is planned for the declarative scheduler which we're currently 
working on (see 
[FLIP-160|https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Declarative+Scheduler]).
 Only the most recent {{ExecutionGraph}} (and, therefore, its exceptions) is 
provided.
 *** There might be modifications necessary to the internally used data 
structures allowing random access based on {{ExecutionAttemptID}} instead of 
iterating over collections.

 # The collection of exceptions happens in the scheduler. The mapping of root 
cause to related exceptions is then passed over to the 
{{JobExceptionsHandler}}. The exceptions can be collected as they appear.
 ** +Pros:+
 *** It makes it easier to port this functionality into the declarative 
scheduler of FLIP-160. We don't need to think of a history of 
{{ArchivedExecutionGraphs}} for now. Restarts of the {{ExecutionGraph}} are 
hidden away from the {{JobExceptionsHandler}}.
 ** +Cons:+
 *** The {{SchedulerBase}} code base grows once more which increases complexity.

We decided to go with option 2 for now. This makes it easier for us to 
implement the functionality into the declarative scheduler of FLIP-160.


was (Author: mapohl):
We have two approaches (which we discussed offline) to implement this feature:
 # The {{JobExceptionsHandler}} does most of the work by iterating over the 
{{ArchivedExecution}}s of the passed {{ArchivedExecutionGraph}}. 
{{ArchivedExecutions}} provide the time (through 
{{ArchivedExecution.stateTimestamps}}) and the thrown exception 
({{ArchivedExecution.failureCause}}). The {{SchedulerNG}} implementation would 
need to collect a mapping of {{failureCause}} to {{ExecutionAttemptID}} and 
pass it over to the {{JobExceptionsHandler}} along with the 
{{ArchivedExecutionGraph}}. This would enable the handler to group exceptions 
that happened due to the same failure cause.
+Pros:+ 
- This approach has the advantage of using mostly code that is already there.
- No extra code in the {{SchedulerBase}} implementation.
+Cons:+ 
- It does not support restarts of the {{ExecutionGraph}}. This restart 
functionality is planned for the declarative scheduler which we're currently 
working on (see 
[FLIP-160|https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Declarative+Scheduler]).
 Only the most recent {{ExecutionGraph}} (and, therefore, its exceptions) is 
provided.
- There might be modifications necessary to the internally used data structures 
allowing random access based on {{ExecutionAttemptID}} instead of iterating 
over collections.
 # The collection of exceptions happens in the scheduler. The mapping of root 
cause to related exceptions is then passed over to the 
{{JobExceptionsHandler}}. The exceptions can be collected as they appear.
+Pros:+ 
- It makes it easier to port this functionality into the declarative 
scheduler of FLIP-160. We don't need to think of a history of 
{{ArchivedExecutionGraphs}} for now. Restarts of the {{ExecutionGraph}} are 
hidden away from the {{JobExceptionsHandler}}.
+Cons:+
- The {{SchedulerBase}} code base grows once more which increases complexity.

We decided to go with option 2 for now. This makes it easier for us to 
implement the functionality into the declarative scheduler of FLIP-160.


[jira] [Comment Edited] (FLINK-6042) Display last n exceptions/causes for job restarts in Web UI

2021-01-15 Thread Matthias (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265799#comment-17265799
 ] 

Matthias edited comment on FLINK-6042 at 1/15/21, 8:09 AM:
---

{quote}Thanks for the proposal [~mapohl]. I have a few comments

1) When doing restarts with the new scheduler, then we will recreate the 
{{ExecutionGraph}}. Hence, exposing these error infos on the 
{{AccessExecutionGraph}} might not work.
{quote}
My plan was to not expose the {{ErrorInfo}} through {{ExecutionGraph}} but 
through {{ArchivedExecutionGraph}} which is a kind of serializable copy of 
{{ExecutionGraph}} and also implements {{AccessExecutionGraph}}. Restarting the 
{{ExecutionGraph}} shouldn't harm the {{ErrorInfo}} collection as it is held in 
the {{SchedulerNG}} implementation. The {{ErrorInfo}} collection would be used 
as an additional parameter when instantiating {{ArchivedExecutionGraph}} in 
[SchedulerBase.requestJob()|https://github.com/apache/flink/blob/ac968b83675e64712b4d35dbc166e09808c2156b/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/SchedulerBase.java#L800].
{quote}2) {{UpdateSchedulerNgOnInternalFailuresListener}} will only be called 
if an exception on the JM occurs. If there is a normal task failure, then we 
will call {{updateTaskExecutionState}}
{quote}
Thanks for the hint: {{SchedulerNG.updateTaskExecutionState(..)}} should work 
as well. This would then collect all the {{Exceptions}} thrown in the different 
tasks. We could then utilize the {{JobStatusListener}} interface to identify 
the {{Exception}} that causes the job to restart and group all previously 
collected {{ErrorInfos}} (that were not already assigned to a different root 
cause) under this root cause. Here, I'm not 100% sure, yet, whether triggering 
the consolidation of the {{ErrorInfos}} works that easily.
{quote}3) It would be great to group the exceptions wrt to their restart cycles 
in the web UI. So seeing the root causes for a restart and then being able to 
expand the view to see the task failures for this specific restart would be 
awesome.
{quote}
The {{ErrorInfo}} groups mentioned in 2) could then be returned through the 
newly introduced access method described in 1) and forwarded by the 
{{JobExceptionsHandler}} to the web UI.
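
The grouping described in 2) and 3) might be sketched as follows: task failures are buffered as they arrive, and when a restart is observed, every failure not yet assigned is grouped under that restart's root cause. The class and method names ({{RestartEntry}}, {{FailureGrouper}}, {{onTaskFailure}}, {{onJobRestart}}) are hypothetical; they only stand in for the {{updateTaskExecutionState}} and {{JobStatusListener}} hooks mentioned above, and plain {{Throwable}}s stand in for {{ErrorInfo}}.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical entry: one restart with its root cause and the task failures seen before it. */
final class RestartEntry {
    final Throwable rootCause;
    final List<Throwable> taskFailures;

    RestartEntry(Throwable rootCause, List<Throwable> taskFailures) {
        this.rootCause = rootCause;
        this.taskFailures = Collections.unmodifiableList(new ArrayList<>(taskFailures));
    }
}

/** Hypothetical grouper; a sketch of the consolidation step, not Flink code. */
final class FailureGrouper {
    // Failures that have not been assigned to a root cause yet.
    private final List<Throwable> pending = new ArrayList<>();
    private final List<RestartEntry> history = new ArrayList<>();

    /** Stand-in for the path taken on a task failure. */
    void onTaskFailure(Throwable failure) {
        pending.add(failure);
    }

    /** Stand-in for the notification that the job restarts because of rootCause. */
    void onJobRestart(Throwable rootCause) {
        history.add(new RestartEntry(rootCause, pending));
        pending.clear();
    }

    /** Returns one entry per restart, oldest first. */
    List<RestartEntry> getHistory() {
        return new ArrayList<>(history);
    }
}
```

Each {{RestartEntry}} would correspond to one expandable entry in the web UI, with the task failures listed underneath the root cause.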


was (Author: mapohl):
{quote}Thanks for the proposal [~mapohl]. I have a few comments

1) When doing restarts with the new scheduler, then we will recreate the 
{{ExecutionGraph}}. Hence, exposing these error infos on the 
{{AccessExecutionGraph}} might not work.
{quote}
My plan was to not expose the {{ErrorInfo}} through {{ExecutionGraph}} but 
through {{ArchivedExecutionGraph}} which is a kind of serializable copy of 
{{ExecutionGraph}} and also implements {{AccessExecutionGraph}}. Restarting the 
{{ExecutionGraph}} shouldn't harm the {{ErrorInfo}} collection as it is held in 
the {{SchedulerNG}} implementation. The {{ErrorInfo}} collection would be used 
as an additional parameter when instantiating {{ArchivedExecutionGraph}} in 
[SchedulerBase.requestJob()|https://github.com/apache/flink/blob/ac968b83675e64712b4d35dbc166e09808c2156b/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/SchedulerBase.java#L800].

{quote}2) {{UpdateSchedulerNgOnInternalFailuresListener}} will only be called 
if an exception on the JM occurs. If there is a normal task failure, then we 
will call {{updateTaskExecutionState}}
{quote}
Thanks for the hint: {{SchedulerNG.updateTaskExecutionState(..)}} should work 
as well. This would then collect all the {{Exceptions}} thrown in the different 
tasks. We could then utilize the {{JobStatusListener}} interface to identify 
the {{Exception}} that causes the job to restart and group all previously 
collected {{ErrorInfo}}s (that were not already assigned to a different root 
cause) under this root cause. Here, I'm not 100% sure, yet, whether triggering 
the consolidation of the {{ErrorInfo}}s works that easily.

{quote}3) It would be great to group the exceptions wrt to their restart cycles 
in the web UI. So seeing the root causes for a restart and then being able to 
expand the view to see the task failures for this specific restart would be 
awesome.
{quote}
The {{ErrorInfo}} groups mentioned in 2) could then be returned through the 
newly introduced access method described in 1) and forwarded by the 
{{JobExceptionsHandler}} to the web UI.


[jira] [Comment Edited] (FLINK-6042) Display last n exceptions/causes for job restarts in Web UI

2021-01-15 Thread Matthias (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265799#comment-17265799
 ] 

Matthias edited comment on FLINK-6042 at 1/15/21, 8:08 AM:
---

{quote}Thanks for the proposal [~mapohl]. I have a few comments:

1) When doing restarts with the new scheduler, we will recreate the 
{{ExecutionGraph}}. Hence, exposing these error infos on the 
{{AccessExecutionGraph}} might not work.
{quote}
My plan was not to expose the {{ErrorInfo}} through {{ExecutionGraph}} but 
through {{ArchivedExecutionGraph}}, which is a kind of serializable copy of 
{{ExecutionGraph}} and also implements {{AccessExecutionGraph}}. Restarting the 
{{ExecutionGraph}} shouldn't harm the {{ErrorInfo}} collection as it is held in 
the {{SchedulerNG}} implementation. The {{ErrorInfo}} collection would be passed 
as an additional parameter when instantiating {{ArchivedExecutionGraph}} in 
[SchedulerBase.requestJob()|https://github.com/apache/flink/blob/ac968b83675e64712b4d35dbc166e09808c2156b/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/SchedulerBase.java#L800].
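The idea in 1) can be illustrated with a minimal, self-contained sketch. It does not use any Flink classes: {{FailureRecord}}, {{LiveGraph}}, {{ArchivedSnapshot}} and {{SketchScheduler}} are hypothetical stand-ins for {{ErrorInfo}}, {{ExecutionGraph}}, {{ArchivedExecutionGraph}} and the {{SchedulerNG}} implementation. The bounded size (keeping only the last {{n}} entries) is an assumption taken from the issue title, not from the actual implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for ErrorInfo; not the real Flink class.
final class FailureRecord {
    final String message;
    final long timestamp;

    FailureRecord(String message, long timestamp) {
        this.message = message;
        this.timestamp = timestamp;
    }
}

// Hypothetical stand-in for the live graph that is recreated on every restart.
final class LiveGraph {
    final int attempt;

    LiveGraph(int attempt) {
        this.attempt = attempt;
    }
}

// Hypothetical immutable snapshot, playing the role of the archived graph.
final class ArchivedSnapshot {
    final int attempt;
    final List<FailureRecord> failures;

    ArchivedSnapshot(int attempt, Collection<FailureRecord> failures) {
        this.attempt = attempt;
        // Copy, so later failures do not leak into an already served snapshot.
        this.failures = Collections.unmodifiableList(new ArrayList<>(failures));
    }
}

// The failure collection lives in the scheduler, not in the graph,
// so recreating the graph leaves the collected failures untouched.
final class SketchScheduler {
    private final int maxEntries;
    private final Deque<FailureRecord> failures = new ArrayDeque<>();
    private LiveGraph graph = new LiveGraph(0);

    SketchScheduler(int maxEntries) {
        if (maxEntries <= 0) {
            throw new IllegalArgumentException("maxEntries must be positive");
        }
        this.maxEntries = maxEntries;
    }

    void recordFailure(String message, long timestamp) {
        if (failures.size() == maxEntries) {
            failures.removeFirst(); // evict the oldest entry
        }
        failures.addLast(new FailureRecord(message, timestamp));
    }

    void restart() {
        // A fresh graph per attempt; the failure history is kept.
        graph = new LiveGraph(graph.attempt + 1);
    }

    ArchivedSnapshot requestJob() {
        // The collection is handed over as an additional constructor argument.
        return new ArchivedSnapshot(graph.attempt, failures);
    }
}
```

In this sketch the snapshot copies the collection, so a handler serving the snapshot never observes later modifications made by the scheduler.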

{quote}2) {{UpdateSchedulerNgOnInternalFailuresListener}} will only be called 
if an exception occurs on the JM. If there is a normal task failure, we will 
call {{updateTaskExecutionState}} instead.
{quote}
Thanks for the hint: {{SchedulerNG.updateTaskExecutionState(..)}} should work 
as well. It would collect all the {{Exception}}s thrown in the different 
tasks. We could then use the {{JobStatusListener}} interface to identify the 
{{Exception}} that causes the job to restart and group all previously 
collected {{ErrorInfo}}s (that are not already assigned to a different root 
cause) under this root cause. I'm not yet 100% sure whether triggering the 
consolidation of the {{ErrorInfo}}s works that easily.
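The grouping described here could look roughly like the following sketch. It is plain Java without Flink dependencies: {{TaskFailure}}, {{RestartGroup}} and {{FailureGrouper}} are hypothetical names, and the two callbacks only mimic what {{updateTaskExecutionState}} and a {{JobStatusListener}} notification would trigger. Whether the real callbacks arrive in this order is exactly the open question mentioned above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical record of one task-level failure; stands in for ErrorInfo.
final class TaskFailure {
    final String taskName;
    final Throwable cause;

    TaskFailure(String taskName, Throwable cause) {
        this.taskName = taskName;
        this.cause = cause;
    }
}

// One entry per restart: the root cause plus the task failures grouped under it.
final class RestartGroup {
    final Throwable rootCause;
    final List<TaskFailure> taskFailures;

    RestartGroup(Throwable rootCause, List<TaskFailure> taskFailures) {
        this.rootCause = rootCause;
        // Defensive copy: the caller clears its list after handing it over.
        this.taskFailures = Collections.unmodifiableList(new ArrayList<>(taskFailures));
    }
}

final class FailureGrouper {
    // Failures seen since the last restart, not yet assigned to a root cause.
    private final List<TaskFailure> unassigned = new ArrayList<>();
    private final List<RestartGroup> history = new ArrayList<>();

    // Assumed to be driven by task state updates.
    void onTaskFailure(String taskName, Throwable cause) {
        unassigned.add(new TaskFailure(taskName, cause));
    }

    // Assumed to be driven by a job status change that signals a restart.
    void onJobRestart(Throwable rootCause) {
        history.add(new RestartGroup(rootCause, unassigned));
        unassigned.clear();
    }

    List<RestartGroup> getHistory() {
        return Collections.unmodifiableList(history);
    }
}
```

Each call to {{onJobRestart}} closes one group, so a failure is never assigned to more than one root cause.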

{quote}3) It would be great to group the exceptions with respect to their restart cycles 
in the web UI. So seeing the root causes for a restart and then being able to 
expand the view to see the task failures for this specific restart would be 
awesome.
{quote}
The {{ErrorInfo}} groups mentioned in 2) could then be returned through the 
newly introduced access method described in 1) and forwarded by the 
{{JobExceptionsHandler}} to the web UI.


was (Author: mapohl):
{quote}Thanks for the proposal [~mapohl]. I have a few comments:

1) When doing restarts with the new scheduler, we will recreate the 
{{ExecutionGraph}}. Hence, exposing these error infos on the 
{{AccessExecutionGraph}} might not work.
{quote}
My plan was not to expose the {{ErrorInfo}} through {{ExecutionGraph}} but 
through {{ArchivedExecutionGraph}}, which is a kind of serializable copy of 
{{ExecutionGraph}} and also implements {{AccessExecutionGraph}}. Restarting the 
{{ExecutionGraph}} shouldn't harm the {{ErrorInfo}} collection as it is held in 
the {{SchedulerNG}} implementation. The {{ErrorInfo}} collection would be passed 
as an additional parameter when instantiating {{ArchivedExecutionGraph}} in 
[SchedulerBase.requestJob()|https://github.com/apache/flink/blob/ac968b83675e64712b4d35dbc166e09808c2156b/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/SchedulerBase.java#L800].

{quote}2) {{UpdateSchedulerNgOnInternalFailuresListener}} will only be called 
if an exception occurs on the JM. If there is a normal task failure, we will 
call {{updateTaskExecutionState}} instead.
{quote}
Thanks for the hint: {{SchedulerNG.updateTaskExecutionState(..)}} should work 
as well. It would collect all the {{Exception}}s thrown in the different 
tasks. We could then use the {{JobStatusListener}} interface to identify the 
{{Exception}} that causes the job to restart and group all previously 
collected {{ErrorInfo}}s (that are not already assigned to a different root 
cause) under this root cause.

{quote}3) It would be great to group the exceptions with respect to their restart cycles 
in the web UI. So seeing the root causes for a restart and then being able to 
expand the view to see the task failures for this specific restart would be 
awesome.
{quote}
The {{ErrorInfo}} groups mentioned in 2) could then be returned through the 
newly introduced access method described in 1) and forwarded by the 
{{JobExceptionsHandler}} to the web UI.

> Display last n exceptions/causes for job restarts in Web UI
> ---
>
> Key: FLINK-6042
> URL: https://issues.apache.org/jira/browse/FLINK-6042
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination, Runtime / Web Frontend
>Affects Versions: 1.3.0
>Reporter: Till Rohrmann
>Assignee: Matthias
>Priority: Major
>
> Users requested that it would be nice to see the last {{n}} exceptions 
> causing a job restart in the Web