Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

Panagiotis Garefalakis Sun, 19 Mar 2023 23:23:51 -0700

Hey David, Shammon,

Thanks for the valuable comments!
I am glad you find this proposal useful, some thoughts:


@Shammon

1. How about adding more job information in FailureListenerContext? For
> example, job vertext, subtask, taskmanager location. And then user can do
> more statistics according to different dimensions.


Thats a valid point, passing the JobGraph containing all the above
information
is also something to consider, I was mostly trying to be conservative:
i.e., passingly only the information we need, and extend as we see fit

2. Users may want to save results in listener, and then they can get the
> historical results even jabmanager failover. Can we provide a unified
> implementation for data storage requirements?


The idea is to store only the output of the Listeners (tags) and treat them
as stateless.
Tags are be stored along with HistoryEntries, and will be available through
the HistoryServer
even after a JM dies.

@David

1) Should we also consider adding labels? The combination of tags and
> labels seems to be what most systems offer; sometimes, they offer labels
> only (key=value pairs) because tags can be implemented using those, but not
> the other way around.


Indeed changing tags to k:v labels could be more expressive, I like it!
Let's see what others think.

2) Since we can not predict how heavy user-defined models ("listeners") are
> going to be, it would be great to keep the interfaces/data structures
> immutable so we can push things over to the I/O threads. Also, it sounds
> off to call the main interface a Listener since it's supposed to enhance
> the original throwable with additional metadata.


The idea was for the name to be generic as there could be Listener
implementations
just communicating with external monitoring/alerting systems and no
metadata output
-- but lets rethink that. For immutability, see below:

3) You're proposing to support a set of listeners. Since you're passing the
> mutable context around, which includes tags set by the previous listener,
> do you expect users to make any assumptions about the order in which
> listeners are executed?


In the existing proposal we are not making any assumptions about the order
of listeners,
(system ones come first and then the ones loaded from the plugin manager)
however listeners can use previous state (tags/labels) to make decisions:
e.g., wont assign *UNKNOWN* failureType when we have already seen *USER *or
the other way around -- when we have seen *UNKNOWN* remove in favor of
*USER*


Cheers,
Panagiotis

On Sun, Mar 19, 2023 at 10:42 AM David Morávek <d...@apache.org> wrote:

> Hi Panagiotis,
>
> This is an excellent proposal and something everyone trying to provide
> "Flink as a service" needs to solve at some point. I have a couple of
> questions:
>
> If I understand the proposal correctly, this is just about adding tags to
> the Throwable by running a tuple of (Throwable, FailureContext) through a
> user-defined model.
>
> 1) Should we also consider adding labels? The combination of tags and
> labels seems to be what most systems offer; sometimes, they offer labels
> only (key=value pairs) because tags can be implemented using those, but not
> the other way around.
>
> 2) Since we can not predict how heavy user-defined models ("listeners") are
> going to be, it would be great to keep the interfaces/data structures
> immutable so we can push things over to the I/O threads. Also, it sounds
> off to call the main interface a Listener since it's supposed to enhance
> the original throwable with additional metadata.
>
> I'd propose something along the lines of (we should have better names, this
> is just to outline the idea):
>
> interface FailureEnricher {
>
>   ThrowableWithTagsAndLabels enrichFailure(Throwable cause,
> ImmutableContextualMetadataAboutTheThrowable context);
> }
>
> The names should change; this is just to outline the idea.
>
> 3) You're proposing to support a set of listeners. Since you're passing the
> mutable context around, which includes tags set by the previous listener,
> do you expect users to make any assumptions about the order in which
> listeners are executed?
>
> *@Shammon*
>
> Users may want to save results in listener, and then they can get the
> > historical results even jabmanager failover. Can we provide a unified
> > implementation for data storage requirements?
>
>
> I think we should explicitly state that all "listeners" are treated as
> stateless. I don't see any strong reason for snapshotting them.
>
> Best,
> D.
>
> On Sat, Mar 18, 2023 at 1:00 AM Shammon FY <zjur...@gmail.com> wrote:
>
> > Hi Panagiotis
> >
> > Thank you for starting this discussion. I think this FLIP is valuable and
> > can help user to analyze the causes of job failover better!
> >
> > I have two comments as follows
> >
> > 1. How about adding more job information in FailureListenerContext? For
> > example, job vertext, subtask, taskmanager location. And then user can do
> > more statistics according to different dimensions.
> >
> > 2. Users may want to save results in listener, and then they can get the
> > historical results even jabmanager failover. Can we provide a unified
> > implementation for data storage requirements?
> >
> >
> > Best,
> > shammon FY
> >
> >
> > On Saturday, March 18, 2023, Panagiotis Garefalakis <pga...@apache.org>
> > wrote:
> >
> > > Hi everyone,
> > >
> > > This FLIP [1] proposes a pluggable interface for failure handling
> > allowing
> > > users to implement custom failure logic using the plugin framework.
> > > Motivated by existing proposals [2] and tickets [3], this enables
> > use-cases
> > > like: assigning particular types to failures (e.g., User or System),
> > > emitting custom metrics per type (e.g., application or platform), even
> > > exposing errors to downstream consumers (e.g., notification systems).
> > >
> > > Thanks to Piotr and Anton for the initial reviews and discussions!
> > >
> > > For anyone interested, the starting point would be the FLIP [1] that I
> > > created,
> > > describing the motivation and the proposed changes (part of the core,
> > > runtime and web).
> > >
> > > The intuition behind this FLIP is being able to execute custom logic on
> > > failures by exposing a FailureListener interface. Implementation by
> users
> > > can be simply loaded to the system as Jar files. FailureListeners may
> > also
> > > decide to assign failure tags to errors (expressed as strings),
> > > that will then be exposed as metadata by the UI/Rest interfaces.
> > >
> > > Feedback is always appreciated! Looking forward to your thoughts!
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%
> > > 3A+Pluggable+failure+handling+for+Apache+Flink
> > > [2]
> > > https://docs.google.com/document/d/1pcHg9F3GoDDeVD5GIIo2wO67
> > > Hmjgy0-hRDeuFnrMgT4
> > > [3] https://issues.apache.org/jira/browse/FLINK-20833
> > >
> > > Cheers,
> > > Panagiotis
> > >
> >
>

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

Reply via email to