[jira] [Commented] (FLINK-20833) Expose pluggable interface for exception analysis and metrics reporting in Execution Graph

2022-04-07 Thread Zhenqiu Huang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519221#comment-17519221
 ] 

Zhenqiu Huang commented on FLINK-20833:
---

[~xtsong]
I am rebasing my diff onto master. Would you please assign the issue to me again?

> Expose pluggable interface for  exception analysis and metrics reporting in 
> Execution Graph
> ---
>
> Key: FLINK-20833
> URL: https://issues.apache.org/jira/browse/FLINK-20833
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.12.0
>Reporter: Zhenqiu Huang
>Priority: Minor
>  Labels: auto-unassigned, pull-request-available, stale-minor
>
> Platform users of Apache Flink usually want to classify the failure reason 
> (for example user code, networking, dependencies, etc.) of Flink jobs and 
> emit metrics for the analyzed results, so that the platform can provide an 
> accurate value for system reliability by distinguishing failures caused by 
> user logic from system issues. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-20833) Expose pluggable interface for exception analysis and metrics reporting in Execution Graph

2021-04-28 Thread Zhenqiu Huang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335080#comment-17335080
 ] 

Zhenqiu Huang commented on FLINK-20833:
---

[~rmetzger] [~xintongsong]
Would you please help to review this PR?

> Expose pluggable interface for  exception analysis and metrics reporting in 
> Execution Graph
> ---
>
> Key: FLINK-20833
> URL: https://issues.apache.org/jira/browse/FLINK-20833
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.12.0
>Reporter: Zhenqiu Huang
>Priority: Minor
>  Labels: auto-unassigned, pull-request-available


[jira] [Commented] (FLINK-20833) Expose pluggable interface for exception analysis and metrics reporting in Execution Graph

2021-04-27 Thread Flink Jira Bot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333641#comment-17333641
 ] 

Flink Jira Bot commented on FLINK-20833:


This issue was marked "stale-assigned" and has not received an update in 7 
days. It is now automatically unassigned. If you are still working on it, you 
can assign it to yourself again. Please also give an update about the status of 
the work.

> Expose pluggable interface for  exception analysis and metrics reporting in 
> Execution Graph
> ---
>
> Key: FLINK-20833
> URL: https://issues.apache.org/jira/browse/FLINK-20833
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.12.0
>Reporter: Zhenqiu Huang
>Assignee: Zhenqiu Huang
>Priority: Minor
>  Labels: pull-request-available, stale-assigned


[jira] [Commented] (FLINK-20833) Expose pluggable interface for exception analysis and metrics reporting in Execution Graph

2021-04-16 Thread Flink Jira Bot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322883#comment-17322883
 ] 

Flink Jira Bot commented on FLINK-20833:


This issue is assigned but has not received an update in 7 days so it has been 
labeled "stale-assigned". If you are still working on the issue, please give an 
update and remove the label. If you are no longer working on the issue, please 
unassign so someone else may work on it. In 7 days the issue will be 
automatically unassigned.

> Expose pluggable interface for  exception analysis and metrics reporting in 
> Execution Graph
> ---
>
> Key: FLINK-20833
> URL: https://issues.apache.org/jira/browse/FLINK-20833
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.12.0
>Reporter: Zhenqiu Huang
>Assignee: Zhenqiu Huang
>Priority: Minor
>  Labels: pull-request-available, stale-assigned


[jira] [Commented] (FLINK-20833) Expose pluggable interface for exception analysis and metrics reporting in Execution Graph

2021-03-12 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17300219#comment-17300219
 ] 

Robert Metzger commented on FLINK-20833:


Thanks, I'll soon review the PR!

> Expose pluggable interface for  exception analysis and metrics reporting in 
> Execution Graph
> ---
>
> Key: FLINK-20833
> URL: https://issues.apache.org/jira/browse/FLINK-20833
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.12.0
>Reporter: Zhenqiu Huang
>Assignee: Zhenqiu Huang
>Priority: Minor
>  Labels: pull-request-available


[jira] [Commented] (FLINK-20833) Expose pluggable interface for exception analysis and metrics reporting in Execution Graph

2021-02-27 Thread Zhenqiu Huang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292321#comment-17292321
 ] 

Zhenqiu Huang commented on FLINK-20833:
---

[~rmetzger][~trohrmann]
I saw that you are adding the new scheduler. I rebased onto the master branch and 
added the failure listener to the new scheduler. Would you please review the PR 
at your earliest convenience?

> Expose pluggable interface for  exception analysis and metrics reporting in 
> Execution Graph
> ---
>
> Key: FLINK-20833
> URL: https://issues.apache.org/jira/browse/FLINK-20833
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.12.0
>Reporter: Zhenqiu Huang
>Assignee: Zhenqiu Huang
>Priority: Minor
>  Labels: pull-request-available


[jira] [Commented] (FLINK-20833) Expose pluggable interface for exception analysis and metrics reporting in Execution Graph

2021-01-18 Thread Zhenqiu Huang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267586#comment-17267586
 ] 

Zhenqiu Huang commented on FLINK-20833:
---

[~rmetzger]
Thanks for the comments. 
1) I added comments on the new metrics. I think it is also reasonable to just 
use numRestarts in most cases.
2) I added a wiki page in the Advanced section.

> Expose pluggable interface for  exception analysis and metrics reporting in 
> Execution Graph
> ---
>
> Key: FLINK-20833
> URL: https://issues.apache.org/jira/browse/FLINK-20833
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.12.0
>Reporter: Zhenqiu Huang
>Assignee: Zhenqiu Huang
>Priority: Minor
>  Labels: pull-request-available


[jira] [Commented] (FLINK-20833) Expose pluggable interface for exception analysis and metrics reporting in Execution Graph

2021-01-18 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267128#comment-17267128
 ] 

Robert Metzger commented on FLINK-20833:


1) See my comment in the PR: I wasn't aware of the "numRestarts" metric. Maybe 
it adds more confusion to count the restarts and the failures in two metrics?!
4) Good question. Maybe add it into the Deployment / Advanced section? 
https://ci.apache.org/projects/flink/flink-docs-master/deployment/advanced/index.html

> Expose pluggable interface for  exception analysis and metrics reporting in 
> Execution Graph
> ---
>
> Key: FLINK-20833
> URL: https://issues.apache.org/jira/browse/FLINK-20833
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.12.0
>Reporter: Zhenqiu Huang
>Assignee: Zhenqiu Huang
>Priority: Minor
>  Labels: pull-request-available


[jira] [Commented] (FLINK-20833) Expose pluggable interface for exception analysis and metrics reporting in Execution Graph

2021-01-17 Thread Zhenqiu Huang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17266945#comment-17266945
 ] 

Zhenqiu Huang commented on FLINK-20833:
---

[~rmetzger] [~trohrmann]
Thanks for these suggestions.
1) I think there is already a numRestarts metric that tracks all of the restarts 
in the scheduler. If it is a session cluster, then it counts the restarts of all 
of the jobs. To get the failures of each job, it makes sense to add a 
DefaultFailureListener that reports the metrics at the job level.
2) Agreed. I moved the initialization into the JobMaster.
3) It totally makes sense to encourage users to use the plugin framework. I 
changed FailureListenerFactory to look up FailureListener implementations from 
both the resource folder and the plugin manager.
4) For the documentation of the feature, I am not sure where the right place 
is. Would you please give some suggestions after reviewing the PR?

> Expose pluggable interface for  exception analysis and metrics reporting in 
> Execution Graph
> ---
>
> Key: FLINK-20833
> URL: https://issues.apache.org/jira/browse/FLINK-20833
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.12.0
>Reporter: Zhenqiu Huang
>Assignee: Zhenqiu Huang
>Priority: Minor
>  Labels: pull-request-available


[jira] [Commented] (FLINK-20833) Expose pluggable interface for exception analysis and metrics reporting in Execution Graph

2021-01-15 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265814#comment-17265814
 ] 

Robert Metzger commented on FLINK-20833:


Thanks a lot for providing a PoC! This makes the discussion a lot easier!

Do you know if there's already a metric for the number of exceptions, and the 
time since the last exception?
If not, it might make sense to add this as a default listener implementation?
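
A default listener of that kind could look roughly like the sketch below. It is only an illustration: the {{ExceptionListener}} interface shape, the {{CountingExceptionListener}} class and the injected clock are assumptions made for the example, and the sketch does not register anything with Flink's metrics system.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical listener interface; the name follows this discussion, not a released Flink API. */
interface ExceptionListener {
    void onException(Throwable cause);
}

/** Sketch of a default listener: counts exceptions and remembers when the last one occurred. */
class CountingExceptionListener implements ExceptionListener {
    private final LongSupplier clock; // injected so that tests can control time
    private final AtomicLong numExceptions = new AtomicLong();
    private final AtomicLong lastExceptionMillis = new AtomicLong(-1L);

    CountingExceptionListener(LongSupplier clock) {
        this.clock = clock;
    }

    @Override
    public void onException(Throwable cause) {
        numExceptions.incrementAndGet();
        lastExceptionMillis.set(clock.getAsLong());
    }

    /** Total number of exceptions seen so far. */
    long getNumExceptions() {
        return numExceptions.get();
    }

    /** Milliseconds since the last exception, or -1 if none has been seen yet. */
    long getMillisSinceLastException() {
        long last = lastExceptionMillis.get();
        return last < 0 ? -1L : clock.getAsLong() - last;
    }
}
```

In production the clock would simply be {{System::currentTimeMillis}}, and the two getters could back a counter and a gauge.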

Secondly, we are currently working on adding another scheduler. Once that is 
implemented, not all schedulers will support the ExceptionListener. I'm 
wondering whether we should move the initialization to another location (into 
the JobMaster, and then pass the listener into the scheduler factory?)

Discovering this feature will be very difficult, because of the ServiceLoader. 
Let's make sure we add this to the documentation.

Lastly, I guess we can use Flink's 
{{PluginUtils.createPluginManagerFromRootFolder(flinkConfig)}}, to use the 
Plugin mechanism. This will create a separate classloader per 
{{ExceptionListener}}, avoiding dependency conflicts with Flink's classpath (I 
haven't used this myself, but from a quick look, this seems easy to use).

> Expose pluggable interface for  exception analysis and metrics reporting in 
> Execution Graph
> ---
>
> Key: FLINK-20833
> URL: https://issues.apache.org/jira/browse/FLINK-20833
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.12.0
>Reporter: Zhenqiu Huang
>Priority: Minor


[jira] [Commented] (FLINK-20833) Expose pluggable interface for exception analysis and metrics reporting in Execution Graph

2021-01-15 Thread Zhenqiu Huang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265795#comment-17265795
 ] 

Zhenqiu Huang commented on FLINK-20833:
---

[~rmetzger]
Thanks for these suggestions. 
1) I think the name ExceptionListener is more reasonable. 
2) Yes, the implementation can be loaded via a service provider. As long as the 
implementation is on Flink's classpath, it can be loaded.
3) I prefer to use Flink's metrics system.

I put together a PoC based on what we agreed on. Please review it.
https://github.com/HuangZhenQiu/flink/commit/903c7746217c0cb91a2eff15a72de873ad48a5e7
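
For readers who do not want to open the commit, service-provider loading in general can be sketched as follows. This is not the code from the PoC: {{FailureListener}}, {{FailureListenerLoader}} and both method names are placeholders chosen for the example, and only {{java.util.ServiceLoader}} is a real API here. Implementations would be listed in a {{META-INF/services}} file named after the fully qualified interface name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

/** Hypothetical SPI type; providers register it through a META-INF/services entry. */
interface FailureListener {
    void onFailure(Throwable cause);
}

final class FailureListenerLoader {
    private FailureListenerLoader() {}

    /** Discovers every FailureListener implementation visible to the given class loader. */
    static List<FailureListener> loadAll(ClassLoader classLoader) {
        List<FailureListener> listeners = new ArrayList<>();
        for (FailureListener listener : ServiceLoader.load(FailureListener.class, classLoader)) {
            listeners.add(listener);
        }
        return listeners;
    }

    /** Notifies all listeners; a misbehaving listener must not hide the original failure. */
    static void dispatch(List<FailureListener> listeners, Throwable cause) {
        for (FailureListener listener : listeners) {
            try {
                listener.onFailure(cause);
            } catch (RuntimeException e) {
                if (e != cause) {
                    // Keep the listener error attached to the failure instead of propagating it.
                    cause.addSuppressed(e);
                }
            }
        }
    }
}
```

If no provider is registered, {{loadAll}} returns an empty list, so the caller only invokes listeners that are actually present.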

> Expose pluggable interface for  exception analysis and metrics reporting in 
> Execution Graph
> ---
>
> Key: FLINK-20833
> URL: https://issues.apache.org/jira/browse/FLINK-20833
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.12.0
>Reporter: Zhenqiu Huang
>Priority: Minor


[jira] [Commented] (FLINK-20833) Expose pluggable interface for exception analysis and metrics reporting in Execution Graph

2021-01-14 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265123#comment-17265123
 ] 

Robert Metzger commented on FLINK-20833:


Thanks for your summary.
I haven't understood why you chose "ExecutionFailureClassifier" as the 
interface name? I wonder if "ExceptionListener" is a more suitable name.
Secondly, I don't understand why we need a default no-op operation.
Can't we just call the listener only if it's set?

How do you plan to load a custom ExceptionListener/ExecutionFailureClassifier 
implementation? Are users supposed to put a jar file with the implementation 
into Flink's classpath?

You mentioned that you want to send metrics in your own implementation: Do you 
want to use Flink's metrics system, or will your listener just establish a 
connection to a metrics system to report them there?

> Expose pluggable interface for  exception analysis and metrics reporting in 
> Execution Graph
> ---
>
> Key: FLINK-20833
> URL: https://issues.apache.org/jira/browse/FLINK-20833
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.12.0
>Reporter: Zhenqiu Huang
>Priority: Minor


[jira] [Commented] (FLINK-20833) Expose pluggable interface for exception analysis and metrics reporting in Execution Graph

2021-01-05 Thread Zhenqiu Huang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17259457#comment-17259457
 ] 

Zhenqiu Huang commented on FLINK-20833:
---

[~trohrmann]
Thanks for the suggestion. As ExecutionFailureHandler is the central place to 
handle errors, I think we can add it there. I think the change can be summarized 
as below:

1) Add an interface for the customizable failure classifier. We may name it 
ExecutionFailureClassifier. 
2) Add a DefaultExecutionFailureClassifier, which is basically a no-op 
implementation.
3) Add a JobManagerOption to allow users to set the class name. The default 
value is DefaultExecutionFailureClassifier.
4) In the DefaultScheduler, we use the new JobManagerOption to initialize an 
ExecutionFailureClassifier and pass it into ExecutionFailureHandler.

After thinking more about the implementation, I feel that using a service 
provider here is too heavy, as we would need to put 
DefaultExecutionFailureClassifier into the resources of the runtime module. If 
users want to override it, they need to be able to exclude the default one. 
What do you think?
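
To make steps 1) to 4) concrete, here is a minimal sketch under several assumptions: the method signature of the interface, the option key string, and the use of a plain Map in place of Flink's Configuration are all invented for illustration, and only the two classifier names come from the list above.

```java
import java.util.Map;

/** Step 1: hypothetical interface for the customizable failure classifier. */
interface ExecutionFailureClassifier {
    void classify(Throwable cause);
}

/** Step 2: the default implementation does nothing. */
final class DefaultExecutionFailureClassifier implements ExecutionFailureClassifier {
    @Override
    public void classify(Throwable cause) {
        // intentionally a no-op
    }
}

final class ExecutionFailureClassifiers {
    /** Step 3: assumed option key; a real JobManagerOption would define its own key. */
    static final String CLASS_OPTION_KEY = "jobmanager.execution-failure-classifier.class";

    private ExecutionFailureClassifiers() {}

    /** Step 4: instantiate the configured class, falling back to the no-op default. */
    static ExecutionFailureClassifier fromConfig(Map<String, String> config, ClassLoader classLoader) {
        String className =
                config.getOrDefault(CLASS_OPTION_KEY, DefaultExecutionFailureClassifier.class.getName());
        try {
            return Class.forName(className, true, classLoader)
                    .asSubclass(ExecutionFailureClassifier.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot instantiate failure classifier " + className, e);
        }
    }
}
```

With this approach the default never has to be excluded from the classpath; setting the option to another class name is enough to override it.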



> Expose pluggable interface for  exception analysis and metrics reporting in 
> Execution Graph
> ---
>
> Key: FLINK-20833
> URL: https://issues.apache.org/jira/browse/FLINK-20833
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.12.0
>Reporter: Zhenqiu Huang
>Priority: Minor


[jira] [Commented] (FLINK-20833) Expose pluggable interface for exception analysis and metrics reporting in Execution Graph

2021-01-05 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258774#comment-17258774
 ] 

Till Rohrmann commented on FLINK-20833:
---

Thanks for creating this ticket [~ZhenqiuHuang]. I like the idea in general. 
Before starting this effort, I think we need a more concrete proposal for how 
exactly to do it and where to place it. I would suggest not adding it directly 
to the {{ExecutionGraph}}, since this structure is already overloaded with 
responsibilities. A starting point could be the {{ExecutionFailureHandler}}, 
which is responsible for handling execution failures.

> Expose pluggable interface for  exception analysis and metrics reporting in 
> Execution Graph
> ---
>
> Key: FLINK-20833
> URL: https://issues.apache.org/jira/browse/FLINK-20833
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.12.0
>Reporter: Zhenqiu Huang
>Priority: Minor


[jira] [Commented] (FLINK-20833) Expose pluggable interface for exception analysis and metrics reporting in Execution Graph

2021-01-04 Thread Zhenqiu Huang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258517#comment-17258517
 ] 

Zhenqiu Huang commented on FLINK-20833:
---

[~xintongsong] [~trohrmann]
What do you think about this proposal? I think we can have a service provider 
interface for this purpose. Users can provide their own implementation that 
specifies the rules for matching user-level errors.
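
As an illustration of what such rules could look like, the sketch below walks the cause chain of a failure and returns the category of the first matching rule. The {{FailureCategory}} values and the {{RuleBasedFailureClassifier}} class are hypothetical names for this example only, and the unmatched case is assumed to count as a system failure.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical failure categories, loosely following the issue description. */
enum FailureCategory { USER, NETWORK, DEPENDENCY, SYSTEM }

final class RuleBasedFailureClassifier {
    private static final int MAX_CAUSE_DEPTH = 64; // guards against cyclic cause chains

    // Insertion order is preserved, so earlier rules take precedence.
    private final Map<Predicate<Throwable>, FailureCategory> rules = new LinkedHashMap<>();

    RuleBasedFailureClassifier addRule(Predicate<Throwable> matcher, FailureCategory category) {
        rules.put(matcher, category);
        return this;
    }

    /** Checks the failure and each of its causes, outermost first; unmatched failures are SYSTEM. */
    FailureCategory classify(Throwable failure) {
        int depth = 0;
        for (Throwable t = failure; t != null && depth < MAX_CAUSE_DEPTH; t = t.getCause(), depth++) {
            for (Map.Entry<Predicate<Throwable>, FailureCategory> rule : rules.entrySet()) {
                if (rule.getKey().test(t)) {
                    return rule.getValue();
                }
            }
        }
        return FailureCategory.SYSTEM;
    }
}
```

A platform could, for example, map {{java.net.SocketException}} to NETWORK and {{ClassNotFoundException}} to DEPENDENCY, then emit one metric per category.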

> Expose pluggable interface for  exception analysis and metrics reporting in 
> Execution Graph
> ---
>
> Key: FLINK-20833
> URL: https://issues.apache.org/jira/browse/FLINK-20833
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.12.0
>Reporter: Zhenqiu Huang
>Priority: Minor