[ 
https://issues.apache.org/jira/browse/FLINK-10988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784029#comment-16784029
 ] 

Scott Sue commented on FLINK-10988:
-----------------------------------

This is something that we can do in the code.  However, the sledgehammer 
approach would then be to wrap every Flink Operator to ensure that it 
doesn't unexpectedly fail and ultimately kill the job itself.  I would have 
thought this would be something that most users would want, to help trace any 
issues within their job.
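
To illustrate what the "wrap every Operator" approach amounts to, here is a 
minimal sketch in plain Java.  It uses java.util.function.Function as a 
stand-in for a user function and does not use any Flink API; the names 
ElementContext, withElementContext and ElementProcessingException are 
hypothetical and exist only for this example.  The idea is that the wrapper 
catches the failure and rethrows it with the offending element attached, so 
the log shows the input as well as the stack trace.

```java
import java.util.function.Function;

public class ElementContext {

    /** Hypothetical exception type that carries the element being processed. */
    public static final class ElementProcessingException extends RuntimeException {
        private final Object element;

        public ElementProcessingException(String operatorName, Object element, Throwable cause) {
            super("Operator '" + operatorName + "' failed on element: " + element, cause);
            this.element = element;
        }

        public Object getElement() {
            return element;
        }
    }

    /**
     * Wraps a function so that any runtime failure is rethrown together with
     * the operator name and the input element that triggered it.
     */
    public static <I, O> Function<I, O> withElementContext(String operatorName, Function<I, O> fn) {
        return element -> {
            try {
                return fn.apply(element);
            } catch (RuntimeException e) {
                // Keep the original exception as the cause so its stack trace is preserved.
                throw new ElementProcessingException(operatorName, element, e);
            }
        };
    }

    public static void main(String[] args) {
        Function<String, Integer> length = withElementContext("length", String::length);
        System.out.println(length.apply("tick"));
        try {
            length.apply(null); // triggers a NullPointerException inside the wrapped function
        } catch (ElementProcessingException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In a real job the same pattern would have to be repeated around each user 
function, which is exactly why it feels like a sledgehammer.  It also says 
nothing about the Operator's state, only about the incoming element.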

Even if the job still stopped due to an exception, it would be nice to have 
some extra information printed in the logs about what it was attempting to 
do, as opposed to just a stack trace.  In my experience with Flink, it's 
quite hard to track down exactly what the state of the Operator was, along 
with the event that it was processing at the time, in order to trace the root 
cause of the issue.  It would be nice to have some out-of-the-box tools to get 
to this information more quickly.

> Improve debugging / visibility of job state
> -------------------------------------------
>
>                 Key: FLINK-10988
>                 URL: https://issues.apache.org/jira/browse/FLINK-10988
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Operators
>            Reporter: Scott Sue
>            Priority: Major
>
> When a Flink Job is running and encounters an unexpected exception, either 
> through processing an unexpected message, or a message that may be well 
> formed but for which the state of the job causes an exception, it can be 
> difficult to diagnose the cause of the issue.  For example, I would get an 
> NPE in one of the Operators:
> 2018-11-13 10:10:26,332 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Co-Process-Broadcast-Keyed -> Map -> Map -> Sink: Unnamed (1/1) (9a8f3b970570742b7b174a01a9bb1405) switched from RUNNING to FAILED.
> java.lang.NullPointerException
>  at com.celertech.analytics.flink.topology.marketimpact.PriceUtils.findPriceForEntryType(PriceUtils.java:28)
>  at com.celertech.analytics.flink.topology.marketimpact.PriceUtils.getPriceForMarketDataEntryType(PriceUtils.java:18)
>  at com.celertech.analytics.flink.function.midrate.MidRateBroadcaster.processBroadcastElement(MidRateBroadcaster.java:77)
>  at com.celertech.analytics.flink.function.midrate.MidRateTagKeyedBroadcastProcessFunction.processBroadcastElement(MidRateTagKeyedBroadcastProcessFunction.java:36)
>  at com.celertech.analytics.flink.function.midrate.MidRateTagKeyedBroadcastProcessFunction.processBroadcastElement(MidRateTagKeyedBroadcastProcessFunction.java:12)
>  at org.apache.flink.streaming.api.operators.co.CoBroadcastWithKeyedOperator.processElement2(CoBroadcastWithKeyedOperator.java:121)
>  
> An improvement to this would be to allow the printing of the incoming message 
> so the developer can diagnose whether that message was correct.  Printing 
> the state of the job would be nice as well, in case the state of the job was 
> incorrect, leading to the exception.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
