Hey Hong,

Keep in mind that Flink 2.0 is also under discussion and breaking changes could be introduced there. Let's just make sure there is real value in a cleaner exception hierarchy (which I believe there is).

Cheers,
Panagiotis

On Sat, Jun 10, 2023 at 4:22 AM Teoh, Hong <lian...@amazon.co.uk.invalid> wrote:

> Thanks for the engagement on the thread! Sorry for the late reply, was off
> on holidays for a bit.
>
> @Paul
> Thanks for linking the historical discussion. Yes I would agree that using
> classloading to determine if the exception type has come from a User
> classloader rather than System classloader would be helpful.
>
> In my opinion, we should enhance this further by also introducing a good
> exception hierarchy depending on where the USER code was called. However, I
> also note that this might be a breaking change for some, because they might
> rely on the current exception type for job management. We could address
> this by wrapping the existing exception rather than replacing.
>
> @Panagiotis
> I agree with all your points. This proposal is in synergy with Pluggable
> Failure Enrichers.
>
> Regards,
> Hong
>
> > On 6 Jun 2023, at 06:50, Panagiotis Garefalakis <pga...@apache.org> wrote:
> >
> > Thanks for bringing this up Hong!
> >
> > Classifying exceptions was also the main driving factor behind pluggable
> > failure enrichers <https://issues.apache.org/jira/browse/FLINK-31508>.
> > However, we could do a much better job maintaining a hierarchy of System
> > and User exceptions thus making the classification logic more
> > straightforward.
> >
> > - Defining better system/user exceptions with some kind of hierarchy is
> >   definitely a step forward (and refactoring the existing ones)
> > - Classloader filtering could definitely be used for discovering errors
> >   originating from user defined code, see doc
> >   <https://docs.google.com/document/d/1pcHg9F3GoDDeVD5GIIo2wO67Hmjgy0-hRDeuFnrMgT4/edit#heading=h.ato31xdnm7nk>
> > - Eventually we could also release a simple failure enricher using the
> >   above improvements to automatically classify errors on JMs exceptions
> >   endpoint
> >
> > Cheers,
> > Panagiotis
> >
> > On Wed, May 31, 2023 at 9:12 PM Paul Lam <paullin3...@gmail.com> wrote:
> >
> >> Hi Hong,
> >>
> >> Thanks for starting the discussion! I believe the exception classification
> >> between user exceptions and system exceptions has been long-awaited.
> >>
> >> It's worth mentioning that years ago there was a related discussion [1],
> >> FYI.
> >>
> >> I’m in favor of the heuristic approach to classify the exceptions by which
> >> classloader it comes from. In addition, we could introduce extra
> >> configurations to allow manual exception classification based on the
> >> package name of exceptions.
> >>
> >> [1] https://lists.apache.org/thread/gms4nysnb3o4v2k6421m5hsq0g7gtr81
> >>
> >> Best,
> >> Paul Lam
> >>
> >>> On May 25, 2023, at 23:07, Teoh, Hong <lian...@amazon.co.uk.INVALID> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> This discussion thread is to gauge community opinion and gather feedback
> >>> on implementing a better exception hierarchy in Flink to identify
> >>> exceptions that come from running “User job code” and exceptions coming
> >>> from “Flink engine code”.
> >>>
> >>> Problem:
> >>> Flink provides a distributed processing engine (SYSTEM) to run a data
> >>> streaming job (USER).
> >>> There are many places in code where the engine runs “user job provided
> >>> java classes”, such as serialization/deserialization, configuration
> >>> objects, credential loading, running setup() method on certain Operators.
> >>> Sometimes when evaluating a stack trace, it might be hard to
> >>> automatically determine if an exception is arising out of a Flink engine
> >>> problem, or a problem associated to a particular job.
> >>>
> >>> Proposed way forward:
> >>> - It would be good to have an exception hierarchy maintained by Flink
> >>>   that separates out the exceptions arising from running “USER provided
> >>>   classes”. That way, we can improve our ability to automatically classify
> >>>   and mitigate these exceptions.
> >>> - We could also include separating out the places where exception
> >>>   originates based on function - FlinkSerializationException,
> >>>   FlinkConfigurationException.. etc. (we already have a similar concept with
> >>>   IncompatibleKeysException)
> >>> - This has synergy with FLIP-304: Pluggable Failure Enrichers (since it
> >>>   would simplify the logic in the USER/SYSTEM classifier there) [1].
> >>> - In addition, this has been discussed before in the context of updating
> >>>   the exception thrown by serialisers to be a Flink-specific serialisation
> >>>   exception instead of IllegalStateException [2]
> >>>
> >>> Any thoughts on the above?
> >>>
> >>> Regards,
> >>> Hong
> >>>
> >>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%3A+Pluggable+Failure+Enrichers
> >>> [2] https://lists.apache.org/thread/0o859h1vdx6mwv0fqvmybpn574692jtg
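
For readers following the thread, here is a rough Java sketch of how the ideas discussed above might fit together: a marker exception hierarchy for user code, the classloader heuristic, and the optional package-name override. Every name in it (ExceptionOrigin, FlinkUserCodeException, FlinkUserSerializationException, classify, isLoadedBy) is hypothetical and made up for illustration. None of it is an existing Flink API, and it is not what the thread participants proposed in detail.

```java
import java.util.List;

// Illustrative sketch only: every type and method name below is hypothetical
// and does not exist in Flink.
final class ExceptionOrigin {

    enum Origin { USER, SYSTEM }

    /** Hypothetical base type for failures raised while running user-provided classes. */
    static class FlinkUserCodeException extends RuntimeException {
        FlinkUserCodeException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical function-specific subtype, e.g. for user-provided serializers. */
    static class FlinkUserSerializationException extends FlinkUserCodeException {
        FlinkUserSerializationException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** True if clazz was defined by target or by a classloader whose parent chain contains it. */
    static boolean isLoadedBy(Class<?> clazz, ClassLoader target) {
        // getClassLoader() returns null for bootstrap classes such as java.lang.*.
        for (ClassLoader cl = clazz.getClassLoader(); cl != null; cl = cl.getParent()) {
            if (cl == target) {
                return true;
            }
        }
        return false;
    }

    /** Walks the cause chain and reports USER as soon as any link looks user-originated. */
    static Origin classify(Throwable failure, ClassLoader userClassLoader,
                           List<String> userPackagePrefixes) {
        int depth = 0; // guards against pathological cyclic cause chains
        for (Throwable cur = failure; cur != null && depth < 64; cur = cur.getCause(), depth++) {
            // 1. Explicit hierarchy: the engine wrapped a failure from user code.
            if (cur instanceof FlinkUserCodeException) {
                return Origin.USER;
            }
            // 2. Heuristic: the exception class itself came from the user classloader.
            if (isLoadedBy(cur.getClass(), userClassLoader)) {
                return Origin.USER;
            }
            // 3. Manual override: configured package-name prefixes.
            String name = cur.getClass().getName();
            for (String prefix : userPackagePrefixes) {
                if (name.startsWith(prefix)) {
                    return Origin.USER;
                }
            }
        }
        return Origin.SYSTEM;
    }
}
```

Under these assumptions, a call site that runs a user serializer could catch the existing IllegalStateException and rethrow it as `new FlinkUserSerializationException("...", e)`. Wrapping keeps the original exception reachable through getCause(), although code that matches on the top-level exception type would still need to look at the cause.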