[jira] [Comment Edited] (FLINK-6042) Display last n exceptions/causes for job restarts in Web UI
[ https://issues.apache.org/jira/browse/FLINK-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325798#comment-17325798 ]

Matthias edited comment on FLINK-6042 at 4/20/21, 1:33 PM:
---

I moved the UI extension (FLINK-21867) out, making it an independent task to work on. The initial goal of FLINK-6042, exposing the exceptions through the UI, is done.

was (Author: mapohl):
I moved the UI extension FLINK-21867 out to work independently on it. The initial goal of exposing the exceptions through the UI is done.

> Display last n exceptions/causes for job restarts in Web UI
> ---
>
> Key: FLINK-6042
> URL: https://issues.apache.org/jira/browse/FLINK-6042
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination, Runtime / Web Frontend
> Affects Versions: 1.3.0
> Reporter: Till Rohrmann
> Assignee: Matthias
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.13.0
>
> Attachments: 截屏2021-01-28 下午4.47.46.png
>
>
> Users requested that it would be nice to see the last {{n}} exceptions
> causing a job restart in the Web UI. This will help to more easily debug and
> operate a job.
> We could store the root causes for failures similar to how prior executions
> are stored in the {{ExecutionVertex}} using the {{EvictingBoundedList}} and
> then serve this information via the Web UI.
> _-- Update: January 21, 2021 --_
> The UI can already handle multiple exceptions through the Exception History.
> Right now, we list one or more exceptions which caused the job to fail.
> Instead, we could adapt it in a way that the history contains not only the
> exceptions of the most recent failure but one expandable entry per restart.
> If there is more than one exception connected to a single restart, we would
> list their stacktraces within one expandable entry.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
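The issue description proposes keeping only the last {{n}} root causes in a bounded, evicting collection. A minimal sketch of that idea follows. It assumes a hypothetical {{BoundedFailureHistory}} class built on {{java.util.ArrayDeque}}; it is not Flink's {{EvictingBoundedList}}, whose API is not shown in this thread.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper (not a Flink class): keeps only the most recent maxSize failure causes. */
final class BoundedFailureHistory {
    private final int maxSize;
    private final Deque<Throwable> causes = new ArrayDeque<>();

    BoundedFailureHistory(int maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        this.maxSize = maxSize;
    }

    /** Records a root cause; evicts the oldest entry first if the bound is already reached. */
    void add(Throwable cause) {
        if (causes.size() == maxSize) {
            causes.removeFirst();
        }
        causes.addLast(cause);
    }

    /** Returns a copy of the retained causes, oldest first. */
    List<Throwable> snapshot() {
        return new ArrayList<>(causes);
    }

    public static void main(String[] args) {
        BoundedFailureHistory history = new BoundedFailureHistory(2);
        history.add(new RuntimeException("first restart"));
        history.add(new RuntimeException("second restart"));
        history.add(new RuntimeException("third restart"));
        // Only the two most recent causes remain.
        for (Throwable cause : history.snapshot()) {
            System.out.println(cause.getMessage());
        }
    }
}
```

Returning a copy from {{snapshot()}} keeps a reader such as a REST handler from observing later evictions.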
[jira] [Comment Edited] (FLINK-6042) Display last n exceptions/causes for job restarts in Web UI
[ https://issues.apache.org/jira/browse/FLINK-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17269499#comment-17269499 ]

Matthias edited comment on FLINK-6042 at 1/22/21, 10:47 AM:

{quote}Taking your argument, why is it better to add the exception information method to the {{ArchivedExecutionGraph}}, thereby making it accessible to all {{AbstractExecutionGraphHandler}} handlers? Wouldn't it make sense to only provide access to the information a handler needs? In our case, one could give access to the {{AccessExecutionGraph}} for those handlers which extract information from the {{ExecutionGraph}} and maybe something like a {{FailureHistory}} for the {{JobExceptionsHandler}}? In the end the {{ArchivedExecutionGraph}} might also implement {{FailureHistory}} but I think the important bit is to segregate the interfaces.
{quote}
Good point: Having a separate interface sounds like the better approach.
{quote}Thinking a step ahead, how would it work with the {{ArchivedExecutionGraph}} if we send multiple graphs because it changed over the job's lifetime? To which graph will the exception causing the lifetime end of a graph be assigned?
{quote}
As we have a list of {{ArchivedExecutionGraphs}} in chronological order, I would assume that any instance except for the last one has a {{failureCause}} that triggered the instantiation of a new {{ExecutionGraph}}. If no failure cause is given, it means that the instantiation happened due to some rescaling efforts (alternatively, we could think of a new state to make that more explicit?). The most recent {{ExecutionGraph}} then either holds the failure that caused the job to fail or no failure cause if the job is in a non-failed state.

But considering that we might want to hand over a list of {{ArchivedExecutionGraphs}} in the future, it would be worth it again to have a class holding the {{ArchivedExecutionGraph}} (or later a list of {{ArchivedExecutionGraphs}}) which implements {{FailureHistory}} as well.

was (Author: mapohl):
{quote}Taking your argument, why is it better to add the exception information method to the {{ArchivedExecutionGraph}}, thereby making it accessible to all {{AbstractExecutionGraphHandler}} handlers? Wouldn't it make sense to only provide access to the information a handler needs? In our case, one could give access to the {{AccessExecutionGraph}} for those handlers which extract information from the {{ExecutionGraph}} and maybe something like a {{FailureHistory}} for the {{JobExceptionsHandler}}? In the end the {{ArchivedExecutionGraph}} might also implement {{FailureHistory}} but I think the important bit is to segregate the interfaces.
{quote}
Good point: Having a separate interface sounds like the better approach.
{quote}Thinking a step ahead, how would it work with the {{ArchivedExecutionGraph}} if we send multiple graphs because it changed over the job's lifetime? To which graph will the exception causing the lifetime end of a graph be assigned?
{quote}
As we have a list of {{ArchivedExecutionGraphs}} in chronological order, I would assume that any instance except for the last one has a {{failureCause}} that triggered the instantiation of a new {{ExecutionGraph}}. If no failure cause is given, it means that the instantiation happened due to some rescaling efforts (alternatively, we could think of a new state to make that more explicit?). The most recent {{ExecutionGraph}} then either holds the failure that caused the job to fail or no failure cause if the job is in a non-failed state.

> Display last n exceptions/causes for job restarts in Web UI
> ---
>
> Key: FLINK-6042
> URL: https://issues.apache.org/jira/browse/FLINK-6042
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination, Runtime / Web Frontend
> Affects Versions: 1.3.0
> Reporter: Till Rohrmann
> Assignee: Matthias
> Priority: Major
> Labels: pull-request-available
>
> Users requested that it would be nice to see the last {{n}} exceptions
> causing a job restart in the Web UI. This will help to more easily debug and
> operate a job.
> We could store the root causes for failures similar to how prior executions
> are stored in the {{ExecutionVertex}} using the {{EvictingBoundedList}} and
> then serve this information via the Web UI.
> _-- Update: January 21, 2021 --_
> The UI can already handle multiple exceptions through the Exception History.
> Right now, we list one or more exceptions which caused the job to fail.
> Instead, we could adapt it in a way that the history contains not only the
> exceptions of the most recent failure but one expandable entry per restart.
> If there is more than one exception connected to a single restart
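The interface segregation discussed in the comment above can be sketched as follows. All type names other than {{FailureHistory}} are hypothetical stand-ins, and none of the method signatures are taken from Flink. The handler depends only on the narrow view, while one archived object may still implement both views.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical handler that depends on the narrow FailureHistory view only. */
final class ExceptionsHandlerSketch {

    /** Renders one line per recorded failure cause. */
    List<String> render(FailureHistory history) {
        List<String> lines = new ArrayList<>();
        for (Throwable cause : history.getFailureCauses()) {
            lines.add(cause.getClass().getSimpleName() + ": " + cause.getMessage());
        }
        return lines;
    }

    public static void main(String[] args) {
        ArchivedGraphSketch graph = new ArchivedGraphSketch(
                "demo-job", Arrays.<Throwable>asList(new IllegalStateException("task failed")));
        for (String line : new ExceptionsHandlerSketch().render(graph)) {
            System.out.println(line);
        }
    }
}

/** Narrow view named in the thread; the method shown here is an assumption. */
interface FailureHistory {
    List<Throwable> getFailureCauses();
}

/** Hypothetical stand-in for the graph-level read access that other handlers use. */
interface GraphAccessSketch {
    String getJobName();
}

/** One archived object implementing both views, as suggested in the quoted comment. */
final class ArchivedGraphSketch implements GraphAccessSketch, FailureHistory {
    private final String jobName;
    private final List<Throwable> failureCauses;

    ArchivedGraphSketch(String jobName, List<? extends Throwable> failureCauses) {
        this.jobName = jobName;
        this.failureCauses =
                Collections.unmodifiableList(new ArrayList<Throwable>(failureCauses));
    }

    @Override
    public String getJobName() {
        return jobName;
    }

    @Override
    public List<Throwable> getFailureCauses() {
        return failureCauses;
    }
}
```

Because {{ExceptionsHandlerSketch}} only sees {{FailureHistory}}, a later wrapper holding several archived graphs could implement the same interface without touching the handler.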
[jira] [Comment Edited] (FLINK-6042) Display last n exceptions/causes for job restarts in Web UI
[ https://issues.apache.org/jira/browse/FLINK-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17269092#comment-17269092 ]

Till Rohrmann edited comment on FLINK-6042 at 1/21/21, 9:17 AM:

We have two approaches (which we discussed offline) to implement this feature:
# The {{JobExceptionsHandler}} does most of the work by iterating over the {{ArchivedExecutions}} of the passed {{ArchivedExecutionGraph}}. {{ArchivedExecutions}} provide the time (through {{ArchivedExecution.stateTimestamps}}) and the thrown exception ({{ArchivedExecution.failureCause}}). The {{SchedulerNG}} implementation would need to collect a mapping of {{failureCause}} to {{ExecutionAttemptID}} and pass it over to the {{JobExceptionsHandler}} along with the {{ArchivedExecutionGraph}}. This would enable the handler to group exceptions that happened due to the same failure case.
** +Pros:+
*** This approach has the advantage of using mostly code that is already there.
*** No extra code in the {{SchedulerBase}} implementation.
** +Cons:+
*** It does not support restarts of the {{ExecutionGraph}}. This restart functionality is planned for the declarative scheduler which we're currently working on (see [FLIP-160|https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Declarative+Scheduler]). Only the most recent {{ExecutionGraph}} (and, therefore, its exceptions) is provided.
*** Modifications to the internally used data structures might be necessary to allow random access based on {{ExecutionAttemptID}} instead of iterating over collections.
# The collection of exceptions happens in the scheduler. The mapping of root cause to related exceptions is then passed over to the {{JobExceptionsHandler}}. The exceptions can be collected as they appear.
** +Pros:+
*** It makes it easier to port this functionality into the declarative scheduler of FLIP-160. We don't need to think of a history of {{ArchivedExecutionGraphs}} for now. Restarts of the {{ExecutionGraph}} are hidden away from the {{JobExceptionsHandler}}.
** +Cons:+
*** The {{SchedulerBase}} code base grows once more, which increases complexity.

We decided to go with option 2 for now. This makes it easier for us to implement the functionality into the declarative scheduler of FLIP-160.

was (Author: mapohl):
We have two approaches (which we discussed offline) to implement this feature:
# The {{JobExceptionsHandler}} does most of the work by iterating over the {{ArchivedExecution}}s of the passed {{ArchivedExecutionGraph}}. {{ArchivedExecutions}} provide the time (through {{ArchivedExecution.stateTimestamps}}) and the thrown exception ({{ArchivedExecution.failureCause}}). The {{SchedulerNG}} implementation would need to collect a mapping of {{failureCause}} to {{ExecutionAttemptID}} and pass it over to the {{JobExceptionsHandler}} along with the {{ArchivedExecutionGraph}}. This would enable the handler to group exceptions that happened due to the same failure case.
** +Pros:+
*** This approach has the advantage of using mostly code that is already there.
*** No extra code in the {{SchedulerBase}} implementation.
** +Cons:+
*** It does not support restarts of the {{ExecutionGraph}}. This restart functionality is planned for the declarative scheduler which we're currently working on (see [FLIP-160|https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Declarative+Scheduler]). Only the most recent {{ExecutionGraph}} (and, therefore, its exceptions) is provided.
*** Modifications to the internally used data structures might be necessary to allow random access based on {{ExecutionAttemptID}} instead of iterating over collections.
# The collection of exceptions happens in the scheduler. The mapping of root cause to related exceptions is then passed over to the {{JobExceptionsHandler}}. The exceptions can be collected as they appear.
** +Pros:+
*** It makes it easier to port this functionality into the declarative scheduler of FLIP-160. We don't need to think of a history of {{ArchivedExecutionGraphs}} for now. Restarts of the {{ExecutionGraph}} are hidden away from the {{JobExceptionsHandler}}.
** +Cons:+
*** The {{SchedulerBase}} code base grows once more, which increases complexity.

We decided to go with option 2 for now. This makes it easier for us to implement the functionality into the declarative scheduler of FLIP-160.

> Display last n exceptions/causes for job restarts in Web UI
> ---
>
> Key: FLINK-6042
> URL: https://issues.apache.org/jira/browse/FLINK-6042
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination, Runtime / Web Frontend
> Affects Versions: 1.3.0
> Reporter: Till Rohrmann
> Assignee: Matthias
> Priority: Major
>
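Option 2 above keeps the exception history in the scheduler, so that recreating the {{ExecutionGraph}} does not lose it. The sketch below illustrates only that ownership idea. {{HistoryOwningScheduler}}, its methods and the generation counter are hypothetical and do not mirror {{SchedulerBase}} or any FLIP-160 class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical scheduler stand-in that owns the exception history itself. */
final class HistoryOwningScheduler {

    /** One history entry: a root cause plus the failures attributed to it. */
    static final class Entry {
        final Throwable rootCause;
        final List<Throwable> related;

        Entry(Throwable rootCause, List<? extends Throwable> related) {
            this.rootCause = rootCause;
            this.related = Collections.unmodifiableList(new ArrayList<Throwable>(related));
        }
    }

    private final List<Entry> history = new ArrayList<>();

    // Counts how often the (imaginary) execution graph has been created.
    private int graphGeneration = 1;

    /** Records a failure as it appears, then simulates recreating the graph. */
    void handleFailureAndRestart(Throwable rootCause, List<? extends Throwable> related) {
        history.add(new Entry(rootCause, related));
        graphGeneration++; // the old graph is gone; the history is untouched
    }

    int getGraphGeneration() {
        return graphGeneration;
    }

    /** What a handler would receive: a read-only copy that is independent of any graph. */
    List<Entry> requestExceptionHistory() {
        return Collections.unmodifiableList(new ArrayList<>(history));
    }

    public static void main(String[] args) {
        HistoryOwningScheduler scheduler = new HistoryOwningScheduler();
        scheduler.handleFailureAndRestart(
                new RuntimeException("root cause of restart 1"),
                Arrays.asList(new RuntimeException("task A"), new RuntimeException("task B")));
        scheduler.handleFailureAndRestart(
                new RuntimeException("root cause of restart 2"),
                Collections.<Throwable>emptyList());
        for (Entry entry : scheduler.requestExceptionHistory()) {
            System.out.println(entry.rootCause.getMessage() + " -> " + entry.related.size());
        }
    }
}
```

The handler never sees a graph in this sketch, which is the sense in which restarts are "hidden away" from it.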
[jira] [Comment Edited] (FLINK-6042) Display last n exceptions/causes for job restarts in Web UI
[ https://issues.apache.org/jira/browse/FLINK-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17269092#comment-17269092 ]

Matthias edited comment on FLINK-6042 at 1/21/21, 7:34 AM:
---

We have two approaches (which we discussed offline) to implement this feature:
# The {{JobExceptionsHandler}} does most of the work by iterating over the {{ArchivedExecution}}s of the passed {{ArchivedExecutionGraph}}. {{ArchivedExecutions}} provide the time (through {{ArchivedExecution.stateTimestamps}}) and the thrown exception ({{ArchivedExecution.failureCause}}). The {{SchedulerNG}} implementation would need to collect a mapping of {{failureCause}} to {{ExecutionAttemptID}} and pass it over to the {{JobExceptionsHandler}} along with the {{ArchivedExecutionGraph}}. This would enable the handler to group exceptions that happened due to the same failure case.
** +Pros:+
*** This approach has the advantage of using mostly code that is already there.
*** No extra code in the {{SchedulerBase}} implementation.
** +Cons:+
*** It does not support restarts of the {{ExecutionGraph}}. This restart functionality is planned for the declarative scheduler which we're currently working on (see [FLIP-160|https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Declarative+Scheduler]). Only the most recent {{ExecutionGraph}} (and, therefore, its exceptions) is provided.
*** Modifications to the internally used data structures might be necessary to allow random access based on {{ExecutionAttemptID}} instead of iterating over collections.
# The collection of exceptions happens in the scheduler. The mapping of root cause to related exceptions is then passed over to the {{JobExceptionsHandler}}. The exceptions can be collected as they appear.
** +Pros:+
*** It makes it easier to port this functionality into the declarative scheduler of FLIP-160. We don't need to think of a history of {{ArchivedExecutionGraphs}} for now. Restarts of the {{ExecutionGraph}} are hidden away from the {{JobExceptionsHandler}}.
** +Cons:+
*** The {{SchedulerBase}} code base grows once more, which increases complexity.

We decided to go with option 2 for now. This makes it easier for us to implement the functionality into the declarative scheduler of FLIP-160.

was (Author: mapohl):
We have two approaches (which we discussed offline) to implement this feature:
# The {{JobExceptionsHandler}} does most of the work by iterating over the {{ArchivedExecution}}s of the passed {{ArchivedExecutionGraph}}. {{ArchivedExecutions}} provide the time (through {{ArchivedExecution.stateTimestamps}}) and the thrown exception ({{ArchivedExecution.failureCause}}). The {{SchedulerNG}} implementation would need to collect a mapping of {{failureCause}} to {{ExecutionAttemptID}} and pass it over to the {{JobExceptionsHandler}} along with the {{ArchivedExecutionGraph}}. This would enable the handler to group exceptions that happened due to the same failure case.
+Pros:+
- This approach has the advantage of using mostly code that is already there.
- No extra code in the {{SchedulerBase}} implementation.
+Cons:+
- It does not support restarts of the {{ExecutionGraph}}. This restart functionality is planned for the declarative scheduler which we're currently working on (see [FLIP-160|https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Declarative+Scheduler]). Only the most recent {{ExecutionGraph}} (and, therefore, its exceptions) is provided.
- Modifications to the internally used data structures might be necessary to allow random access based on {{ExecutionAttemptID}} instead of iterating over collections.
# The collection of exceptions happens in the scheduler. The mapping of root cause to related exceptions is then passed over to the {{JobExceptionsHandler}}. The exceptions can be collected as they appear.
+Pros:+
- It makes it easier to port this functionality into the declarative scheduler of FLIP-160. We don't need to think of a history of {{ArchivedExecutionGraphs}} for now. Restarts of the {{ExecutionGraph}} are hidden away from the {{JobExceptionsHandler}}.
+Cons:+
- The {{SchedulerBase}} code base grows once more, which increases complexity.

We decided to go with option 2 for now. This makes it easier for us to implement the functionality into the declarative scheduler of FLIP-160.

> Display last n exceptions/causes for job restarts in Web UI
> ---
>
> Key: FLINK-6042
> URL: https://issues.apache.org/jira/browse/FLINK-6042
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination, Runtime / Web Frontend
> Affects Versions: 1.3.0
> Reporter: Till Rohrmann
> Assignee: Matthias
> Priority: Major
> Labels: pull-request-ava
[jira] [Comment Edited] (FLINK-6042) Display last n exceptions/causes for job restarts in Web UI
[ https://issues.apache.org/jira/browse/FLINK-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17269092#comment-17269092 ]

Matthias edited comment on FLINK-6042 at 1/21/21, 7:34 AM:
---

We have two approaches (which we discussed offline) to implement this feature:
# The {{JobExceptionsHandler}} does most of the work by iterating over the {{ArchivedExecution}}s of the passed {{ArchivedExecutionGraph}}. {{ArchivedExecutions}} provide the time (through {{ArchivedExecution.stateTimestamps}}) and the thrown exception ({{ArchivedExecution.failureCause}}). The {{SchedulerNG}} implementation would need to collect a mapping of {{failureCause}} to {{ExecutionAttemptID}} and pass it over to the {{JobExceptionsHandler}} along with the {{ArchivedExecutionGraph}}. This would enable the handler to group exceptions that happened due to the same failure case.
** +Pros:+
*** This approach has the advantage of using mostly code that is already there.
*** No extra code in the {{SchedulerBase}} implementation.
** +Cons:+
*** It does not support restarts of the {{ExecutionGraph}}. This restart functionality is planned for the declarative scheduler which we're currently working on (see [FLIP-160|https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Declarative+Scheduler]). Only the most recent {{ExecutionGraph}} (and, therefore, its exceptions) is provided.
*** Modifications to the internally used data structures might be necessary to allow random access based on {{ExecutionAttemptID}} instead of iterating over collections.
# The collection of exceptions happens in the scheduler. The mapping of root cause to related exceptions is then passed over to the {{JobExceptionsHandler}}. The exceptions can be collected as they appear.
** +Pros:+
*** It makes it easier to port this functionality into the declarative scheduler of FLIP-160. We don't need to think of a history of {{ArchivedExecutionGraphs}} for now. Restarts of the {{ExecutionGraph}} are hidden away from the {{JobExceptionsHandler}}.
** +Cons:+
*** The {{SchedulerBase}} code base grows once more, which increases complexity.

We decided to go with option 2 for now. This makes it easier for us to implement the functionality into the declarative scheduler of FLIP-160.

was (Author: mapohl):
We have two approaches (which we discussed offline) to implement this feature:
# The {{JobExceptionsHandler}} does most of the work by iterating over the {{ArchivedExecution}}s of the passed {{ArchivedExecutionGraph}}. {{ArchivedExecutions}} provide the time (through {{ArchivedExecution.stateTimestamps}}) and the thrown exception ({{ArchivedExecution.failureCause}}). The {{SchedulerNG}} implementation would need to collect a mapping of {{failureCause}} to {{ExecutionAttemptID}} and pass it over to the {{JobExceptionsHandler}} along with the {{ArchivedExecutionGraph}}. This would enable the handler to group exceptions that happened due to the same failure case.
** +Pros:+
*** This approach has the advantage of using mostly code that is already there.
*** No extra code in the {{SchedulerBase}} implementation.
** +Cons:+
*** It does not support restarts of the {{ExecutionGraph}}. This restart functionality is planned for the declarative scheduler which we're currently working on (see [FLIP-160|https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Declarative+Scheduler]). Only the most recent {{ExecutionGraph}} (and, therefore, its exceptions) is provided.
*** Modifications to the internally used data structures might be necessary to allow random access based on {{ExecutionAttemptID}} instead of iterating over collections.
# The collection of exceptions happens in the scheduler. The mapping of root cause to related exceptions is then passed over to the {{JobExceptionsHandler}}. The exceptions can be collected as they appear.
** +Pros:+
*** It makes it easier to port this functionality into the declarative scheduler of FLIP-160. We don't need to think of a history of {{ArchivedExecutionGraphs}} for now. Restarts of the {{ExecutionGraph}} are hidden away from the {{JobExceptionsHandler}}.
** +Cons:+
*** The {{SchedulerBase}} code base grows once more, which increases complexity.

We decided to go with option 2 for now. This makes it easier for us to implement the functionality into the declarative scheduler of FLIP-160.

> Display last n exceptions/causes for job restarts in Web UI
> ---
>
> Key: FLINK-6042
> URL: https://issues.apache.org/jira/browse/FLINK-6042
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination, Runtime / Web Frontend
> Affects Versions: 1.3.0
> Reporter: Till Rohrmann
> Assignee: Matthias
> Priority: Major
>
[jira] [Comment Edited] (FLINK-6042) Display last n exceptions/causes for job restarts in Web UI
[ https://issues.apache.org/jira/browse/FLINK-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265799#comment-17265799 ]

Matthias edited comment on FLINK-6042 at 1/15/21, 8:09 AM:
---

{quote}Thanks for the proposal [~mapohl]. I have a few comments

1) When doing restarts with the new scheduler, we will recreate the {{ExecutionGraph}}. Hence, exposing these error infos on the {{AccessExecutionGraph}} might not work.
{quote}
My plan was to not expose the {{ErrorInfo}} through {{ExecutionGraph}} but through {{ArchivedExecutionGraph}}, which is a kind of serializable copy of {{ExecutionGraph}} and also implements {{AccessExecutionGraph}}. Restarting the {{ExecutionGraph}} shouldn't harm the {{ErrorInfo}} collection as it is held in the {{SchedulerNG}} implementation. The {{ErrorInfo}} collection would be used as an additional parameter when instantiating {{ArchivedExecutionGraph}} in [SchedulerBase.requestJob()|https://github.com/apache/flink/blob/ac968b83675e64712b4d35dbc166e09808c2156b/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/SchedulerBase.java#L800].
{quote}2) {{UpdateSchedulerNgOnInternalFailuresListener}} will only be called if an exception on the JM occurs. If there is a normal task failure, then we will call {{updateTaskExecutionState}}.
{quote}
Thanks for the hint: {{SchedulerNG.updateTaskExecutionState(..)}} should work as well. This would then collect all the {{Exceptions}} thrown in the different tasks. We could then utilize the {{JobStatusListener}} interface to identify the {{Exception}} that causes the job to restart and group all previously collected {{ErrorInfo}}s (that were not already assigned to a different root cause) under this root cause. Here, I'm not 100% sure yet whether triggering the consolidation of the {{ErrorInfo}}s works that easily.
{quote}3) It would be great to group the exceptions w.r.t. their restart cycles in the web UI. So seeing the root causes for a restart and then being able to expand the view to see the task failures for this specific restart would be awesome.
{quote}
The {{ErrorInfo}} groups mentioned in 2) could then be returned through the newly introduced access method described in 1) and forwarded by the {{JobExceptionsHandler}} to the web UI.

was (Author: mapohl):
{quote}Thanks for the proposal [~mapohl]. I have a few comments

1) When doing restarts with the new scheduler, we will recreate the {{ExecutionGraph}}. Hence, exposing these error infos on the {{AccessExecutionGraph}} might not work.
{quote}
My plan was to not expose the {{ErrorInfo}} through {{ExecutionGraph}} but through {{ArchivedExecutionGraph}}, which is a kind of serializable copy of {{ExecutionGraph}} and also implements {{AccessExecutionGraph}}. Restarting the {{ExecutionGraph}} shouldn't harm the {{ErrorInfo}} collection as it is held in the {{SchedulerNG}} implementation. The {{ErrorInfo}} collection would be used as an additional parameter when instantiating {{ArchivedExecutionGraph}} in [SchedulerBase.requestJob()|https://github.com/apache/flink/blob/ac968b83675e64712b4d35dbc166e09808c2156b/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/SchedulerBase.java#L800].
{quote}2) {{UpdateSchedulerNgOnInternalFailuresListener}} will only be called if an exception on the JM occurs. If there is a normal task failure, then we will call {{updateTaskExecutionState}}.
{quote}
Thanks for the hint: {{SchedulerNG.updateTaskExecutionState(..)}} should work as well. This would then collect all the {{Exception}}s thrown in the different tasks. We could then utilize the {{JobStatusListener}} interface to identify the {{Exception}} that causes the job to restart and group all previously collected {{ErrorInfo}}s (that were not already assigned to a different root cause) under this root cause. Here, I'm not 100% sure yet whether triggering the consolidation of the {{ErrorInfo}}s works that easily.
{quote}3) It would be great to group the exceptions w.r.t. their restart cycles in the web UI. So seeing the root causes for a restart and then being able to expand the view to see the task failures for this specific restart would be awesome.
{quote}
The {{ErrorInfo}} groups mentioned in 2) could then be returned through the newly introduced access method described in 1) and forwarded by the {{JobExceptionsHandler}} to the web UI.

> Display last n exceptions/causes for job restarts in Web UI
> ---
>
> Key: FLINK-6042
> URL: https://issues.apache.org/jira/browse/FLINK-6042
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination, Runtime / Web Frontend
> Affects Versions: 1.3.0
> Reporter: Till Rohrmann
> Assignee: Matthias
> Priority: Major
>
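The two-step grouping described in the reply to 2) and 3) can be sketched as follows: task failures are collected as they arrive, and when a restart is signalled, everything not yet assigned is grouped under that restart's root cause. The class and its methods {{onTaskFailure}} and {{onJobRestarting}} are hypothetical stand-ins for the {{SchedulerNG.updateTaskExecutionState(..)}} and {{JobStatusListener}} hooks named in the comment; no Flink signatures are assumed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical collector that groups task failures under the root cause of a restart. */
final class RestartFailureGrouping {

    /** Root cause of one restart plus the task failures collected before it. */
    static final class Group {
        final Throwable rootCause;
        final List<Throwable> taskFailures;

        Group(Throwable rootCause, List<Throwable> taskFailures) {
            this.rootCause = rootCause;
            this.taskFailures = Collections.unmodifiableList(new ArrayList<>(taskFailures));
        }
    }

    // Failures seen since the last restart that belong to no root cause yet.
    private final List<Throwable> unassigned = new ArrayList<>();
    private final List<Group> groups = new ArrayList<>();

    /** Step 1: remember every task failure as it is reported. */
    void onTaskFailure(Throwable failure) {
        unassigned.add(failure);
    }

    /** Step 2: on restart, assign all pending failures to this root cause and reset. */
    void onJobRestarting(Throwable rootCause) {
        groups.add(new Group(rootCause, unassigned));
        unassigned.clear();
    }

    /** Returns one group per restart, in the order the restarts happened. */
    List<Group> getGroups() {
        return new ArrayList<>(groups);
    }

    public static void main(String[] args) {
        RestartFailureGrouping grouping = new RestartFailureGrouping();
        grouping.onTaskFailure(new RuntimeException("task 1 failed"));
        grouping.onTaskFailure(new RuntimeException("task 2 failed"));
        grouping.onJobRestarting(new RuntimeException("first restart"));
        grouping.onTaskFailure(new RuntimeException("task 3 failed"));
        grouping.onJobRestarting(new RuntimeException("second restart"));
        for (Group group : grouping.getGroups()) {
            System.out.println(group.rootCause.getMessage() + ": " + group.taskFailures.size());
        }
    }
}
```

Each {{Group}} corresponds to one expandable entry per restart in the UI proposal. The sketch ignores concurrency and assumes both callbacks run on the same thread.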
[jira] [Comment Edited] (FLINK-6042) Display last n exceptions/causes for job restarts in Web UI
[ https://issues.apache.org/jira/browse/FLINK-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265799#comment-17265799 ]

Matthias edited comment on FLINK-6042 at 1/15/21, 8:09 AM:
---

{quote}Thanks for the proposal [~mapohl]. I have a few comments

1) When doing restarts with the new scheduler, we will recreate the {{ExecutionGraph}}. Hence, exposing these error infos on the {{AccessExecutionGraph}} might not work.
{quote}
My plan was to not expose the {{ErrorInfo}} through {{ExecutionGraph}} but through {{ArchivedExecutionGraph}}, which is a kind of serializable copy of {{ExecutionGraph}} and also implements {{AccessExecutionGraph}}. Restarting the {{ExecutionGraph}} shouldn't harm the {{ErrorInfo}} collection as it is held in the {{SchedulerNG}} implementation. The {{ErrorInfo}} collection would be used as an additional parameter when instantiating {{ArchivedExecutionGraph}} in [SchedulerBase.requestJob()|https://github.com/apache/flink/blob/ac968b83675e64712b4d35dbc166e09808c2156b/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/SchedulerBase.java#L800].
{quote}2) {{UpdateSchedulerNgOnInternalFailuresListener}} will only be called if an exception on the JM occurs. If there is a normal task failure, then we will call {{updateTaskExecutionState}}.
{quote}
Thanks for the hint: {{SchedulerNG.updateTaskExecutionState(..)}} should work as well. This would then collect all the {{Exceptions}} thrown in the different tasks. We could then utilize the {{JobStatusListener}} interface to identify the {{Exception}} that causes the job to restart and group all previously collected {{ErrorInfos}} (that were not already assigned to a different root cause) under this root cause. Here, I'm not 100% sure yet whether triggering the consolidation of the {{ErrorInfos}} works that easily.
{quote}3) It would be great to group the exceptions w.r.t. their restart cycles in the web UI. So seeing the root causes for a restart and then being able to expand the view to see the task failures for this specific restart would be awesome.
{quote}
The {{ErrorInfo}} groups mentioned in 2) could then be returned through the newly introduced access method described in 1) and forwarded by the {{JobExceptionsHandler}} to the web UI.

was (Author: mapohl):
{quote}Thanks for the proposal [~mapohl]. I have a few comments

1) When doing restarts with the new scheduler, we will recreate the {{ExecutionGraph}}. Hence, exposing these error infos on the {{AccessExecutionGraph}} might not work.
{quote}
My plan was to not expose the {{ErrorInfo}} through {{ExecutionGraph}} but through {{ArchivedExecutionGraph}}, which is a kind of serializable copy of {{ExecutionGraph}} and also implements {{AccessExecutionGraph}}. Restarting the {{ExecutionGraph}} shouldn't harm the {{ErrorInfo}} collection as it is held in the {{SchedulerNG}} implementation. The {{ErrorInfo}} collection would be used as an additional parameter when instantiating {{ArchivedExecutionGraph}} in [SchedulerBase.requestJob()|https://github.com/apache/flink/blob/ac968b83675e64712b4d35dbc166e09808c2156b/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/SchedulerBase.java#L800].
{quote}2) {{UpdateSchedulerNgOnInternalFailuresListener}} will only be called if an exception on the JM occurs. If there is a normal task failure, then we will call {{updateTaskExecutionState}}.
{quote}
Thanks for the hint: {{SchedulerNG.updateTaskExecutionState(..)}} should work as well. This would then collect all the {{Exceptions}} thrown in the different tasks. We could then utilize the {{JobStatusListener}} interface to identify the {{Exception}} that causes the job to restart and group all previously collected {{ErrorInfo}}s (that were not already assigned to a different root cause) under this root cause. Here, I'm not 100% sure yet whether triggering the consolidation of the {{ErrorInfo}}s works that easily.
{quote}3) It would be great to group the exceptions w.r.t. their restart cycles in the web UI. So seeing the root causes for a restart and then being able to expand the view to see the task failures for this specific restart would be awesome.
{quote}
The {{ErrorInfo}} groups mentioned in 2) could then be returned through the newly introduced access method described in 1) and forwarded by the {{JobExceptionsHandler}} to the web UI.

> Display last n exceptions/causes for job restarts in Web UI
> ---
>
> Key: FLINK-6042
> URL: https://issues.apache.org/jira/browse/FLINK-6042
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination, Runtime / Web Frontend
> Affects Versions: 1.3.0
> Reporter: Till Rohrmann
> Assignee: Matthias
> Priority: Major
>
[jira] [Comment Edited] (FLINK-6042) Display last n exceptions/causes for job restarts in Web UI
[ https://issues.apache.org/jira/browse/FLINK-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265799#comment-17265799 ]

Matthias edited comment on FLINK-6042 at 1/15/21, 8:08 AM:
---

{quote}Thanks for the proposal [~mapohl]. I have a few comments

1) When doing restarts with the new scheduler, we will recreate the {{ExecutionGraph}}. Hence, exposing these error infos on the {{AccessExecutionGraph}} might not work.
{quote}
My plan was to not expose the {{ErrorInfo}} through {{ExecutionGraph}} but through {{ArchivedExecutionGraph}}, which is a kind of serializable copy of {{ExecutionGraph}} and also implements {{AccessExecutionGraph}}. Restarting the {{ExecutionGraph}} shouldn't harm the {{ErrorInfo}} collection as it is held in the {{SchedulerNG}} implementation. The {{ErrorInfo}} collection would be used as an additional parameter when instantiating {{ArchivedExecutionGraph}} in [SchedulerBase.requestJob()|https://github.com/apache/flink/blob/ac968b83675e64712b4d35dbc166e09808c2156b/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/SchedulerBase.java#L800].
{quote}2) {{UpdateSchedulerNgOnInternalFailuresListener}} will only be called if an exception on the JM occurs. If there is a normal task failure, then we will call {{updateTaskExecutionState}}.
{quote}
Thanks for the hint: {{SchedulerNG.updateTaskExecutionState(..)}} should work as well. This would then collect all the {{Exception}}s thrown in the different tasks. We could then utilize the {{JobStatusListener}} interface to identify the {{Exception}} that causes the job to restart and group all previously collected {{ErrorInfo}}s (that were not already assigned to a different root cause) under this root cause. Here, I'm not 100% sure yet whether triggering the consolidation of the {{ErrorInfo}}s works that easily.
{quote}3) It would be great to group the exceptions w.r.t. their restart cycles in the web UI. So seeing the root causes for a restart and then being able to expand the view to see the task failures for this specific restart would be awesome.
{quote}
The {{ErrorInfo}} groups mentioned in 2) could then be returned through the newly introduced access method described in 1) and forwarded by the {{JobExceptionsHandler}} to the web UI.

was (Author: mapohl):
{quote}Thanks for the proposal [~mapohl]. I have a few comments

1) When doing restarts with the new scheduler, we will recreate the {{ExecutionGraph}}. Hence, exposing these error infos on the {{AccessExecutionGraph}} might not work.
{quote}
My plan was to not expose the {{ErrorInfo}} through {{ExecutionGraph}} but through {{ArchivedExecutionGraph}}, which is a kind of serializable copy of {{ExecutionGraph}} and also implements {{AccessExecutionGraph}}. Restarting the {{ExecutionGraph}} shouldn't harm the {{ErrorInfo}} collection as it is held in the {{SchedulerNG}} implementation. The {{ErrorInfo}} collection would be used as an additional parameter when instantiating {{ArchivedExecutionGraph}} in [SchedulerBase.requestJob()|https://github.com/apache/flink/blob/ac968b83675e64712b4d35dbc166e09808c2156b/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/SchedulerBase.java#L800].
{quote}2) {{UpdateSchedulerNgOnInternalFailuresListener}} will only be called if an exception on the JM occurs. If there is a normal task failure, then we will call {{updateTaskExecutionState}}.
{quote}
Thanks for the hint: {{SchedulerNG.updateTaskExecutionState(..)}} should work as well. This would then collect all the {{Exception}}s thrown in the different tasks. We could then utilize the {{JobStatusListener}} interface to identify the {{Exception}} that causes the job to restart and group all previously collected {{ErrorInfo}}s (that were not already assigned to a different root cause) under this root cause.
{quote}3) It would be great to group the exceptions w.r.t. their restart cycles in the web UI. So seeing the root causes for a restart and then being able to expand the view to see the task failures for this specific restart would be awesome.
{quote}
The {{ErrorInfo}} groups mentioned in 2) could then be returned through the newly introduced access method described in 1) and forwarded by the {{JobExceptionsHandler}} to the web UI.

> Display last n exceptions/causes for job restarts in Web UI
> ---
>
> Key: FLINK-6042
> URL: https://issues.apache.org/jira/browse/FLINK-6042
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination, Runtime / Web Frontend
> Affects Versions: 1.3.0
> Reporter: Till Rohrmann
> Assignee: Matthias
> Priority: Major
>
> Users requested that it would be nice to see the last {{n}} exceptions
> causing a job restart in the Web