Hi all, This discussion thread is to gauge community opinion and gather feedback on implementing a better exception hierarchy in Flink to identify exceptions that come from running “User job code” and exceptions coming from “Flink engine code”.
Problem: Flink provides a distributed processing engine (SYSTEM) to run a data streaming job (USER). There are many places in code where the engine runs “user job provided java classes”, such as serialization/deserialization, configuration objects, credential loading, running setup() method on certain Operators. Sometimes when evaluating a stack trace, it might be hard to automatically determine if an exception is arising out of a Flink engine problem, or a problem associated to a particular job. Proposed way forward: - It would be good to have an exception hierarchy maintained by Flink that separates out the exceptions arising from running “USER provided classes”. That way, we can improve our ability to automatically classify and mitigate these exceptions. - We could also include separating out the places where exception originates based on function - FlinkSerializationException, FlinkConfigurationException.. etc. (we already have a similar concept with IncompatibleKeysException) - This has synergy with FLIP-304: Pluggable Failure Enrichers (since it would simplify the logic in the USER/SYSTEM classifier there) [1]. - In addition, this has been discussed before in the context of updating the exception thrown by serialisers to be a Flink-specific serialisation exception instead of IllegalStateException [2] Any thoughts on the above? Regards, Hong [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%3A+Pluggable+Failure+Enrichers [2] https://lists.apache.org/thread/0o859h1vdx6mwv0fqvmybpn574692jtg