[ 
https://issues.apache.org/jira/browse/FLINK-20833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17266945#comment-17266945
 ] 

Zhenqiu Huang commented on FLINK-20833:
---------------------------------------

[~rmetzger] [~trohrmann]
Thanks for these suggestions.
1) I think there is already a numRestarts metric that tracks all restarts in 
the scheduler. In a session cluster, it counts the restarts of all jobs. To 
get the failures of each job, it makes sense to add a DefaultFailureListener 
that reports the metrics at the job level.
2) Agreed. I moved the initialization into the JobMaster.
3) It totally makes sense to encourage use of the plugin framework. I changed 
FailureListenerFactory to look up FailureListener implementations from both 
the resource folder and the plugin manager.
4) For the documentation of the feature, I am not sure where the right 
place is. Would you please give some suggestions after reviewing the PR?
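
As a rough illustration of point 3 (a sketch only, not the code in the PR), 
the two-source lookup could look roughly like the following. The shape of the 
FailureListener interface, the CountingFailureListener and 
FailureListenerLookup names, and the plain Iterator standing in for whatever 
the plugin manager returns are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.ServiceLoader;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical listener interface; the actual signature is defined by the PR.
interface FailureListener {
    void onFailure(Throwable cause, boolean globalFailure);
}

// Hypothetical listener that counts the failures reported for a single job.
final class CountingFailureListener implements FailureListener {
    private final AtomicLong failures = new AtomicLong();

    @Override
    public void onFailure(Throwable cause, boolean globalFailure) {
        failures.incrementAndGet();
    }

    long failureCount() {
        return failures.get();
    }
}

// Hypothetical helper that merges listeners from two sources.
final class FailureListenerLookup {
    private FailureListenerLookup() {}

    static List<FailureListener> discover(
            ClassLoader classLoader, Iterator<FailureListener> fromPlugins) {
        List<FailureListener> listeners = new ArrayList<>();
        // Source 1: implementations registered on the classpath under
        // META-INF/services, resolved with the standard ServiceLoader.
        for (FailureListener listener :
                ServiceLoader.load(FailureListener.class, classLoader)) {
            listeners.add(listener);
        }
        // Source 2: implementations supplied by the plugin mechanism,
        // modelled here as a plain iterator (an assumption).
        fromPlugins.forEachRemaining(listeners::add);
        return listeners;
    }
}
```

Each job would then notify every discovered listener on failure, so a 
per-job counter like the one above stays separate from the scheduler-wide 
numRestarts value.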

> Expose pluggable interface for exception analysis and metrics reporting in 
> Execution Graph
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-20833
>                 URL: https://issues.apache.org/jira/browse/FLINK-20833
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.12.0
>            Reporter: Zhenqiu Huang
>            Assignee: Zhenqiu Huang
>            Priority: Minor
>              Labels: pull-request-available
>
> For platform users of Apache Flink, people usually want to classify the 
> failure reason (for example user code, networking, dependencies, etc.) for 
> Flink jobs and emit metrics for those analyzed results, so that the platform 
> can provide an accurate value for system reliability by distinguishing 
> failures due to user logic from system issues. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
