[ https://issues.apache.org/jira/browse/FLINK-20833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17266945#comment-17266945 ]
Zhenqiu Huang commented on FLINK-20833:
---------------------------------------

[~rmetzger] [~trohrmann] Thanks for these suggestions.

1) I think there is already a numRestarts metric that tracks all restarts in the scheduler. In a session cluster, it counts the restarts of all jobs. To get the failures of each job, it makes sense to add a DefaultFailureListener that reports the metrics at the job level.
2) Agreed. I moved the initialization into the JobMaster.
3) It totally makes sense to encourage use of the plugin framework. I changed FailureListenerFactory to look up FailureListener implementations from both the resource folder and the plugin manager.
4) For the documentation of the feature, I am not sure where the right place is. Would you please give some suggestions after reviewing the PR?

> Expose pluggable interface for exception analysis and metrics reporting in
> Execution Graph
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-20833
>                 URL: https://issues.apache.org/jira/browse/FLINK-20833
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.12.0
>            Reporter: Zhenqiu Huang
>            Assignee: Zhenqiu Huang
>            Priority: Minor
>              Labels: pull-request-available
>
> For platform users of Apache Flink, people usually want to classify the
> failure reason (for example user code, networking, dependencies, etc.) for
> Flink jobs and emit metrics for those analyzed results, so that the platform
> can provide an accurate value for system reliability by distinguishing
> failures due to user logic from system issues.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)