Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

Leonard Xu Wed, 12 Apr 2023 20:35:40 -0700

Thanks Panagiotis for the update, 

the updated FLIP looks good to me.



Best,
Leonard


> On Apr 13, 2023, at 7:42 AM, Panagiotis Garefalakis <[email protected]> wrote:
> 
> Hello there,
> 
> Zhu: agree with the config option, great suggestion
> Hong: global timeout is also interesting and a good addition -- only
> downside I see is just another config option
> 
> If everyone is happy, I suggest we keep the discussion open until Friday
> and start a Vote shortly after.
> 
> Cheers,
> Panagiotis
> 
> On Tue, Apr 11, 2023 at 12:58 AM Teoh, Hong <[email protected]>
> wrote:
> 
>> Hi Panagiotis,
>> 
>> Thank you for the update. Looks great! Just one suggestion below:
>> 
>> 1. We seem to be waiting for the future(s) to complete before restarting
>> the job - should we add a configurable timeout for the enrichment? Since
>> each failure enricher are run in parallel, we could probably settle for 1
>> timeout for all failure handlers.
>> 2. +1 to Zhu’s comment on adding a comma separated list of FailureHandlers
>> instead of boolean toggle!
>> 
>> Other than the above, the FLIP looks great! Thank you for your efforts.
>> 
>> Regards,
>> Hong
>> 
>> 
>> 
>>> On 11 Apr 2023, at 08:01, Zhu Zhu <[email protected]> wrote:
>>> 
>>> CAUTION: This email originated from outside of the organization. Do not
>> click links or open attachments unless you can confirm the sender and know
>> the content is safe.
>>> 
>>> 
>>> 
>>> Hi Panagiotis,
>>> 
>>> Thanks for updating the FLIP.
>>> 
>>>> Regarding the config option
>> `jobmanager.failure-enricher-plugins.enabled`
>>> I think a config option `jobmanager.failure-enrichers`, which accepts
>>> the names of enrichers to use, may be better. It allows the users to
>>> deploy and use the plugins in a more flexible way. The default value
>>> of the config can be none, which means failure enrichment will be
>>> disabled by default.
>>> A reference can be the config option `metrics.reporters` which helps
>>> to load metric reporter plugins.
>>> 
>>> Thanks,
>>> Zhu
>>> 
>>> Panagiotis Garefalakis <[email protected]> 于2023年4月10日周一 03:47写道：
>>>> 
>>>> Hello again everyone,
>>>> 
>>>> FLIP is now updated based on our discussion!
>>>> In short, FLIP-304 [1] proposes the addition of a pluggable interface
>> that
>>>> will allow users to add custom logic and enrich failures with custom
>>>> metadata labels.
>>>> While as discussed, custom restart strategies will be part of a
>> different
>>>> effort. Every pluggable FaulireEnricher:
>>>> 
>>>>  - Is triggered on every global/non-global failure
>>>>  - Receives a Throwable cause and an immutable Context
>>>>  - Performs asynchronous execution (separate IoExecutor) to avoid
>>>>  blocking the main thread for RPCs
>>>>  - Is completely independent from other Enrichers
>>>>  - Emits failure labels/tags for its unique, pre-defined keys (defined
>> at
>>>>  startup time)
>>>> 
>>>> 
>>>> Check the link for implementation details and please let me know what
>> you
>>>> think :)
>>>> 
>>>> 
>>>> [1]
>>>> 
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%3A+Pluggable+Failure+Enrichers
>>>> 
>>>> 
>>>> Panagiotis
>>>> 
>>>> 
>>>> On Tue, Mar 28, 2023 at 5:01 AM Zhu Zhu <[email protected]> wrote:
>>>> 
>>>>> Hi Panagiotis,
>>>>> 
>>>>> How about to introduce a config option to control which error handling
>>>>> plugins should be used? It is more flexible for deployments.
>> Additionally,
>>>>> it can also enable users to explicitly specify the order that the
>> plugins
>>>>> take effects.
>>>>> 
>>>>> Thanks,
>>>>> Zhu
>>>>> 
>>>>> Gen Luo <[email protected]> 于2023年3月27日周一 15:02写道：
>>>>>> 
>>>>>> Thanks for the summary!
>>>>>> 
>>>>>> Also +1 to support custom restart strategies in a different FLIP,
>>>>>> as long as we can make sure that the plugin interface won't be
>>>>>> changed when the restart strategy interface is introduced.
>>>>>> 
>>>>>> To achieve this, maybe we should think well how the handler
>>>>>> would cooperate with the restart strategy, like would it executes b
>>>>>> efore the strategy (e.g. some strategy may use the tag), or after
>>>>>> it (e.g. some metric reporting handler may use the handling result).
>>>>>> Though we can implement in one way, and extend if the other is
>>>>>> really necessary by someone.
>>>>>> 
>>>>>> Besides, instead of using either of the names, shall we just make
>>>>>> them two subclasses named FailureEnricher and FailureListener?
>>>>>> The former executes synchronously and can modify the context,
>>>>>> while the latter executes asynchronously and has a read-only view
>>>>>> of context. In this way we can make sure a handler behaves in
>>>>>> the expected way.
>>>>>> 
>>>>>> 
>>>>>> On Thu, Mar 23, 2023 at 5:19 PM Zhu Zhu <[email protected]> wrote:
>>>>>> 
>>>>>>> +1 to support custom restart strategies in a different FLIP.
>>>>>>> 
>>>>>>> It's fine to have a different plugin for custom restart strategy.
>>>>>>> If so, since we do not treat the FLIP-304 plugin as a common failure
>>>>>>> handler, but instead mainly targets to add labels to errors, I would
>>>>>>> +1 for the name `FailureEnricher`.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Zhu
>>>>>>> 
>>>>>>> David Morávek <[email protected]> 于2023年3月23日周四 15:51写道：
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> One additional remark on introducing it as an async operation: We
>>>>> would
>>>>>>>>> need a new configuration parameter to define the timeout for such a
>>>>>>>>> listener call, wouldn't we?
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> This could be left up to the implementor to handle.
>>>>>>>> 
>>>>>>>> What about adding an extra method getNamespace() to the Listener
>>>>>>> interface
>>>>>>>>> which returns an Optional<String>.
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> I'd avoid mixing an additional concept into this. We can simply have
>>>>> a
>>>>>>> new
>>>>>>>> method that returns a set of keys the listener can output. We can
>>>>>>> validate
>>>>>>>> this at the JM startup time and fail fast (since it's a
>> configuration
>>>>>>>> error) if there is an overlap. If the listener outputs the key that
>>>>> is
>>>>>>> not
>>>>>>>> allowed to, I wouldn't be afraid to call into a fatal error handler
>>>>> since
>>>>>>>> it's an invalid implementation.
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> D.
>>>>>>>> 
>>>>>>>> On Thu, Mar 23, 2023 at 8:34 AM Matthias Pohl
>>>>>>>> <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> Sounds good. Two points I want to add:
>>>>>>>>> 
>>>>>>>>>  - Listener execution should be independent — however we need a
>>>>> way
>>>>>>> to
>>>>>>>>>> enforce a Label key/key-prefix is only assigned to a single
>>>>> Listener,
>>>>>>>>>> thinking of a validation step both at Listener init and runtime
>>>>>>> stages
>>>>>>>>>> 
>>>>>>>>> What about adding an extra method getNamespace() to the Listener
>>>>>>> interface
>>>>>>>>> which returns an Optional<String>. Therefore, the
>>>>> implementation/the
>>>>>>> user
>>>>>>>>> can decide depending on the use case whether it's necessary to have
>>>>>>>>> separate namespaces for the key/value pairs or not. On the Flink
>>>>> side,
>>>>>>> we
>>>>>>>>> would just merge the different maps considering their namespaces.
>>>>>>>>> 
>>>>>>>>> A flaw of this approach is that if a user decides to use the same
>>>>>>> namespace
>>>>>>>>> for multiple listeners, how is an error in one of the listeners
>>>>>>> represented
>>>>>>>>> in the outcome? We would have to overwrite either the successful
>>>>>>> listener's
>>>>>>>>> result or the failed ones. I wanted to share it, anyway.
>>>>>>>>> 
>>>>>>>>> One additional remark on introducing it as an async operation: We
>>>>> would
>>>>>>>>> need a new configuration parameter to define the timeout for such a
>>>>>>>>> listener call, wouldn't we?
>>>>>>>>> 
>>>>>>>>> Matthias
>>>>>>>>> 
>>>>>>>>> On Wed, Mar 22, 2023 at 4:56 PM Panagiotis Garefalakis <
>>>>>>> [email protected]>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi everyone,
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Thanks for the valuable comments!
>>>>>>>>>> Excited to see this is an area of interest for the community!
>>>>>>>>>> 
>>>>>>>>>> Summarizing some of the main points raised along with my
>>>>> thoughts:
>>>>>>>>>> 
>>>>>>>>>>  - Labels (Key/Value) pairs are more expressive than Tags
>>>>>>> (Strings) so
>>>>>>>>>>  using the former is a good idea — I am also debating if we
>>>>> want to
>>>>>>>>>> return
>>>>>>>>>>  multiple KV pairs per Listener (one could argue that we could
>>>>>>> split
>>>>>>>>> the
>>>>>>>>>>  logic in multiple Listeners to support that)
>>>>>>>>>>  - An immutable context along with data returned using the
>>>>>>> interface
>>>>>>>>>>  method implementations is a better approach than a mutable
>>>>>>> Collection
>>>>>>>>>>  - Listener execution should be independent — however we need a
>>>>>>> way to
>>>>>>>>>>  enforce a Label key/key-prefix is only assigned to a single
>>>>>>> Listener,
>>>>>>>>>>  thinking of a validation step both at Listener init and
>>>>> runtime
>>>>>>> stages
>>>>>>>>>>  - We want to perform async Listener operations as sync could
>>>>>>> block the
>>>>>>>>>>  main thread — exposing an ioExecutor pool through the context
>>>>>>> could be
>>>>>>>>>> an
>>>>>>>>>>  elegant solution here
>>>>>>>>>>  - Make sure Listener errors are not failing jobs — make sure
>>>>> to
>>>>>>> log
>>>>>>>>> and
>>>>>>>>>>  keep the job alive
>>>>>>>>>>  - We need better naming / public interface
>>>>> separation/description
>>>>>>>>>> 
>>>>>>>>>>       -  Even though custom restart strategies share some
>>>>>>> properties
>>>>>>>>> with
>>>>>>>>>> Listeners, they would probably need a separate interface with a
>>>>>>> different
>>>>>>>>>> return type anyway (restart strategy not labels) and in general
>>>>> they
>>>>>>> are
>>>>>>>>>> different and complex enough to justify their own FLIP (that can
>>>>>>> also be
>>>>>>>>> a
>>>>>>>>>> follow-up).
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> What do people think? I am planning to modify the FLIP to reflect
>>>>>>> these
>>>>>>>>>> changes if they make sense to everyone.
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> Panagiotis
>>>>>>>>>> 
>>>>>>>>>> On Wed, Mar 22, 2023 at 6:28 AM Hong Teoh <[email protected]>
>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi all,
>>>>>>>>>>> 
>>>>>>>>>>> Thank you Panagiotis for proposing this. From the size of the
>>>>>>> thread,
>>>>>>>>>> this
>>>>>>>>>>> is a much needed feature in Flink!
>>>>>>>>>>> Some thoughts, to extend those already adeptly summarised by
>>>>> Piotr,
>>>>>>>>>>> Matthias and Jing.
>>>>>>>>>>> 
>>>>>>>>>>> - scope of FLIP: +1 to scoping this FLIP to observability
>>>>> around a
>>>>>>>>>>> restart. That would include adding metadata + exposing
>>>>> metadata to
>>>>>>>>>> external
>>>>>>>>>>> systems. IMO, introducing a new restart strategy solves
>>>>> different
>>>>>>>>>> problems,
>>>>>>>>>>> is much larger scope and should be covered in a separate FLIP.
>>>>>>>>>>> 
>>>>>>>>>>> - failure handling: At the moment, we propose transitioning the
>>>>>>> Flink
>>>>>>>>> job
>>>>>>>>>>> to a terminal FAILED state when JobListener fails, when the job
>>>>>>> could
>>>>>>>>>> have
>>>>>>>>>>> transitioned to RESTARTING->RUNNING. If we are keeping in line
>>>>>>> with the
>>>>>>>>>>> scope to add metadata/observability around job restarts, we
>>>>> should
>>>>>>> not
>>>>>>>>> be
>>>>>>>>>>> affecting the running of the Flink job itself. Could I propose
>>>>> we
>>>>>>>>> instead
>>>>>>>>>>> log WARN/ERROR.
>>>>>>>>>>> 
>>>>>>>>>>> - immutable context: +1 to keeping the contract clear via
>>>>> return
>>>>>>> types.
>>>>>>>>>>> - async operation: +1 to adding ioexecutor to context, however,
>>>>>>> given
>>>>>>>>> we
>>>>>>>>>>> don’t want to block the actual job restart on adding metadata /
>>>>>>> calling
>>>>>>>>>>> external services, should we consider returning and letting
>>>>> futures
>>>>>>>>>>> complete independently?
>>>>>>>>>>> 
>>>>>>>>>>> - independent vs ordered execution: Should we consider making
>>>>> the
>>>>>>> order
>>>>>>>>>> of
>>>>>>>>>>> execution deterministic (use a List instead of Set)?
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Once again, thank you for working on this.
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> Hong
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On 21 Mar 2023, at 21:07, Jing Ge <[email protected]
>>>>>> 
>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks Panagiotis for this FLIP and thanks for all valuable
>>>>>>>>>> discussions.
>>>>>>>>>>>> I'd like to share my two cents:
>>>>>>>>>>>> 
>>>>>>>>>>>> - FailureListenerContext#addTag and
>>>>>>> FailureListenerContext#getTags.
>>>>>>>>> It
>>>>>>>>>>>> seems that we have to call getTags() and then do remove
>>>>>>> activities if
>>>>>>>>>> we
>>>>>>>>>>>> want to delete any tags (according to the javadoc in the
>>>>> FLIP).
>>>>>>> It
>>>>>>>>> is
>>>>>>>>>>>> inconsistent for me too. Either offer addTag(), deleteTag(),
>>>>> and
>>>>>>> let
>>>>>>>>>>>> getTags() return immutable collection, or offer getTags()
>>>>> only to
>>>>>>>>>> return
>>>>>>>>>>>> mutable collection.
>>>>>>>>>>>> 
>>>>>>>>>>>> - label vs tag. Label is a great idea +1. AFAIC, tag could
>>>>> be a
>>>>>>>>> special
>>>>>>>>>>>> case of label, i.e. key="tag". It is convenient to offer the
>>>>>>> xxxTag()
>>>>>>>>>>>> method if the user only needs one label. I would love to have
>>>>>>> both of
>>>>>>>>>>> them.
>>>>>>>>>>>> Another thought is that tag implicitly contains the meaning
>>>>> of
>>>>>>>>>>> "immutable".
>>>>>>>>>>>> 
>>>>>>>>>>>> - +1 for a separate FLIP of customized restart strategy.
>>>>>>> Attention
>>>>>>>>>> should
>>>>>>>>>>>> be taken to make sure it works well with Flink built-in
>>>>>>>>> restartStrategy
>>>>>>>>>>> in
>>>>>>>>>>>> order to have the single source of truth.
>>>>>>>>>>>> 
>>>>>>>>>>>> - execution order. The default independent execution should
>>>>> be
>>>>>>> fine.
>>>>>>>>>>>> According to the FailureListener interface definition in the
>>>>>>> FLIP,
>>>>>>>>>> users
>>>>>>>>>>>> should be able to easily build a listener chain[1] to offer
>>>>>>>>> sequential
>>>>>>>>>>>> execution, e.g. public FailureListener(FailureListener
>>>>>>> nextListener).
>>>>>>>>>>>> Another option is to modify the interface or provide another
>>>>>>>>> interface
>>>>>>>>>>>> alongside the current one to extend the method to support
>>>>>>>>>> ListenerChain,
>>>>>>>>>>>> i.e. void onFailure(Throwable cause, FailureListenerContext
>>>>>>> context,
>>>>>>>>>>>> ListenerChain listenerChain). Users can also mix them up.
>>>>>>>>>>>> 
>>>>>>>>>>>> - naming. Afaiu, the pluggable extension is not limited to
>>>>>>> failure
>>>>>>>>>>>> enrichment. Conceptually it can do everything for the given
>>>>>>> failure,
>>>>>>>>>> e.g.
>>>>>>>>>>>> start counting metric as the FLIP described, calling an
>>>>> external
>>>>>>>>>> system,
>>>>>>>>>>>> sending notification to slack channel, etc. you name it. It
>>>>>>> sounds to
>>>>>>>>>> me
>>>>>>>>>>>> more like a FailureActionListener - it can trigger actions
>>>>> based
>>>>>>> on
>>>>>>>>>>>> failure. Failure enrichment is one type of action.
>>>>>>>>>>>> 
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Jing
>>>>>>>>>>>> 
>>>>>>>>>>>> [1]
>>>>>>> https://en.wikipedia.org/wiki/Chain-of-responsibility_pattern
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Mar 21, 2023 at 3:39 PM Matthias Pohl
>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks for the proposal, Panagiotis. A lot of good points
>>>>> have
>>>>>>> been
>>>>>>>>>>> already
>>>>>>>>>>>>> shared. I just want to add my view on some of the items:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - independent execution vs ordered execution: I prefer the
>>>>>>> listeners
>>>>>>>>>>> being
>>>>>>>>>>>>> processed independently from each other because it adds less
>>>>>>>>>> complexity
>>>>>>>>>>>>> code-wise. The use case Piotr described (where you want to
>>>>> reuse
>>>>>>>>> some
>>>>>>>>>>> other
>>>>>>>>>>>>> classifier) is the only one I can think of where we actually
>>>>>>> need
>>>>>>>>>>>>> classifiers depending on each other. Supporting such a use
>>>>> case
>>>>>>>>> right
>>>>>>>>>>> from
>>>>>>>>>>>>> the start feels a bit over-engineered and could be covered
>>>>> in a
>>>>>>>>>>> follow-up
>>>>>>>>>>>>> FLIP if we really come to that point where such a feature is
>>>>>>>>> requested
>>>>>>>>>>> by
>>>>>>>>>>>>> users.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - key/value pairs instead of plain labels: I think that's a
>>>>> good
>>>>>>>>> idea.
>>>>>>>>>>>>> key/value pairs are more expressive. +1
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - extending the FLIP to cover restart strategy: I understand
>>>>>>> Gen's
>>>>>>>>>>> concern
>>>>>>>>>>>>> about introducing too many different types of plugins. But I
>>>>>>> would
>>>>>>>>>> still
>>>>>>>>>>>>> favor not extending the FLIP in this regard. A pluggable
>>>>> restart
>>>>>>>>>>> strategy
>>>>>>>>>>>>> sounds reasonable. But an error classifier and a restart
>>>>>>> strategy
>>>>>>>>> are
>>>>>>>>>>> still
>>>>>>>>>>>>> different enough to justify separate plugins, IMHO. And
>>>>>>> therefore, I
>>>>>>>>>>> would
>>>>>>>>>>>>> think that covering the restart strategy in a separate FLIP
>>>>> is
>>>>>>> the
>>>>>>>>>>> better
>>>>>>>>>>>>> option for the sake of simplicity.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - immutable context: Passing in an immutable context and
>>>>>>> returning
>>>>>>>>>> data
>>>>>>>>>>>>> through the interface method's return value sounds like a
>>>>> better
>>>>>>>>>>> approach
>>>>>>>>>>>>> to harden the contract of the interface. +1 for that
>>>>> proposal
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - async operation: I think David is right. An async
>>>>> interface
>>>>>>> makes
>>>>>>>>>> the
>>>>>>>>>>>>> listener implementations more robust when it comes to heavy
>>>>> IO
>>>>>>>>>>> operations.
>>>>>>>>>>>>> The ioExecutor can be passed through the context object. +1
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Matthias
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Mar 21, 2023 at 2:09 PM David Morávek <
>>>>>>>>>> [email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> *@Piotr*
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I was thinking about actually defining the order of the
>>>>>>>>>>>>>>> classifiers/handlers and not allowing them to be
>>>>> asynchronous.
>>>>>>>>>>>>>>> Asynchronousity would create some problems: when to
>>>>> actually
>>>>>>>>> return
>>>>>>>>>>> the
>>>>>>>>>>>>>>> error to the user? After all async responses will get
>>>>> back?
>>>>>>>>> Before,
>>>>>>>>>>> but
>>>>>>>>>>>>>>> without classified exception? It would also add
>>>>> implementation
>>>>>>>>>>>>> complexity
>>>>>>>>>>>>>>> and I think we can always expand the API with async
>>>>> version
>>>>>>> in the
>>>>>>>>>>>>> future
>>>>>>>>>>>>>>> if needed.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> As long as the classifiers need to talk to an external
>>>>> system,
>>>>>>> we
>>>>>>>>> by
>>>>>>>>>>>>>> definition need to allow them to be asynchronous to
>>>>> unblock the
>>>>>>>>> main
>>>>>>>>>>>>> thread
>>>>>>>>>>>>>> for handling other RPCs. Exposing ioExecutor via the
>>>>> context
>>>>>>>>> proposed
>>>>>>>>>>>>> above
>>>>>>>>>>>>>> would be great.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> After all async responses will get back
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This would be the same if we trigger them synchronously
>>>>> one by
>>>>>>> one,
>>>>>>>>>>> with
>>>>>>>>>>>>> a
>>>>>>>>>>>>>> caveat that synchronous execution might take significantly
>>>>>>> longer
>>>>>>>>> and
>>>>>>>>>>>>>> introduce unnecessary downtime to a job.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> D.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Tue, Mar 21, 2023 at 1:12 PM Zhu Zhu <[email protected]
>>>>>> 
>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi Piotr,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> It's fine to me to have a separate FLIP to extend this
>>>>>>>>>>>>> `FailureListener`
>>>>>>>>>>>>>>> to support custom restart strategy.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> What I was a bit concerned is that if we just treat the
>>>>>>>>>>>>> `FailureListener`
>>>>>>>>>>>>>>> as an error classifier which is not crucial to Flink
>>>>> framework
>>>>>>>>>>> process,
>>>>>>>>>>>>>>> we may design it to run asynchronously and not trigger
>>>>> Flink
>>>>>>>>>> failures.
>>>>>>>>>>>>>>> This may be a blocker if later we want to enable it to
>>>>> support
>>>>>>>>>> custom
>>>>>>>>>>>>>>> restart strategy.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Zhu
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Dian Fu <[email protected]> 于2023年3月21日周二 19:53写道：
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Panagiotis,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks for the proposal. This is a very valuable feature
>>>>> and
>>>>>>> will
>>>>>>>>>> be
>>>>>>>>>>>>> a
>>>>>>>>>>>>>>> good
>>>>>>>>>>>>>>>> add-on for Flink.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I also think that it will be great if we can consider
>>>>> how to
>>>>>>> make
>>>>>>>>>> it
>>>>>>>>>>>>>>>> possible for users to customize the failure handling in
>>>>> this
>>>>>>>>> FLIP.
>>>>>>>>>>>>> It's
>>>>>>>>>>>>>>>> highly related to the problem we want to address in this
>>>>>>> FLIP and
>>>>>>>>>>>>> could
>>>>>>>>>>>>>>>> avoid refactoring the interfaces proposed in this FLIP
>>>>> too
>>>>>>>>> quickly.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Currently it treats all kinds of exceptions the same.
>>>>>>> However,
>>>>>>>>> some
>>>>>>>>>>>>>> kinds
>>>>>>>>>>>>>>>> of exceptions are actually not recoverable at all. It
>>>>> could
>>>>>>> let
>>>>>>>>>> users
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>> customize the failure handling logic to fail fast for
>>>>> certain
>>>>>>>>> known
>>>>>>>>>>>>>>>> unrecoverable exceptions and finally make these kinds of
>>>>>>> jobs get
>>>>>>>>>>>>>> noticed
>>>>>>>>>>>>>>>> and recoveried more quickly.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Dian
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Tue, Mar 21, 2023 at 4:36 PM Gen Luo <
>>>>> [email protected]
>>>>>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi Panagiotis,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks for the proposal.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> It's useful to enrich the information so that users can
>>>>> be
>>>>>>> more
>>>>>>>>>>>>>>>>> clear why the job is failing, especially platform
>>>>>>> developers who
>>>>>>>>>>>>>>>>> need to provide the information to their end users.
>>>>>>>>>>>>>>>>> And for the very FLIP, I'd prefer the naming
>>>>>>> `FailureEnricher`
>>>>>>>>>>>>>>>>> proposed by David, as the plugin doesn't really handle
>>>>> the
>>>>>>>>>> failure.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> However, like Zhu and Lijie said, I also joined a
>>>>> discussion
>>>>>>>>>>>>>>>>> recently about customized failure handling, e.g.
>>>>> counting
>>>>>>> the
>>>>>>>>>>>>>>>>> failure rate of pipeline regions separately, and failing
>>>>>>> the job
>>>>>>>>>>>>>>>>> when a specific error occurs, and so on.
>>>>>>>>>>>>>>>>> I suppose a custom restart strategy, or I'd call it a
>>>>> custom
>>>>>>>>>>>>>>>>> failure "handler", is indeed necessary. It can also
>>>>> enrich
>>>>>>> the
>>>>>>>>>>>>>>>>> information as the current proposed handler does.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> To avoid adding too many plugin interfaces which may
>>>>> confuse
>>>>>>>>> users
>>>>>>>>>>>>>>>>> and make the ExecutionFailureHandler more complex,
>>>>>>>>>>>>>>>>> I think it'd be better to consider the requirements at
>>>>> the
>>>>>>> same
>>>>>>>>>>>>> time.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> IMO, we can add a handler interface, then make the
>>>>> current
>>>>>>>>> restart
>>>>>>>>>>>>>>>>> strategy and the enricher both types of the handler. The
>>>>>>>>> handlers
>>>>>>>>>>>>>>>>> execute in sequence, and the failure is considered
>>>>>>> unrecoverable
>>>>>>>>>> if
>>>>>>>>>>>>>>>>> any of the handlers decides.
>>>>>>>>>>>>>>>>> In this way, users can also implement a handler using
>>>>> the
>>>>>>>>> enriched
>>>>>>>>>>>>>>>>> information provided by the previous handlers, e.g. fail
>>>>>>> the job
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> send a notification if too many failures are caused by
>>>>> the
>>>>>>> end
>>>>>>>>>>>>> users.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>> Gen
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Tue, Mar 21, 2023 at 11:38 AM Weihua Hu <
>>>>>>>>>> [email protected]
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi Panagiotis,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks for your proposal. It is valuable to analyze the
>>>>>>> reason
>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>> failure with the user plug-in.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Making the context immutable could make the contract
>>>>>>> stronger.
>>>>>>>>>>>>>>>>>> Letting the listener return an enriching result may be
>>>>> a
>>>>>>> better
>>>>>>>>>>>>>> way.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> IIUC, listeners could do two things, enrich more
>>>>>>> information
>>>>>>>>>>>>>>>>> (tags/labels)
>>>>>>>>>>>>>>>>>> to FailureHandlingResult, and push data out of Flink
>>>>>>> (metrics
>>>>>>>>> or
>>>>>>>>>>>>>>>>>> something).
>>>>>>>>>>>>>>>>>> IMO, we could split these two types into Listener and
>>>>>>> Advisor
>>>>>>>>>>>>>> (maybe
>>>>>>>>>>>>>>>>>> other names). The Listener just pushes the data out and
>>>>>>> returns
>>>>>>>>>>>>>>> nothing
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> Flink, so we can run these async and don't have to
>>>>> wait for
>>>>>>>>>>>>>>> Listener's
>>>>>>>>>>>>>>>>>> result.
>>>>>>>>>>>>>>>>>> The Advisor returns rich information to the
>>>>>>>>> FailureHadingResult,
>>>>>>>>>>>>>>> and it
>>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>> have a lighter logic.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Supporting a custom restart strategy is also valuable.
>>>>> In
>>>>>>> this
>>>>>>>>>>>>>>> design, we
>>>>>>>>>>>>>>>>>> use
>>>>>>>>>>>>>>>>>> RestartStrategy to construct a FailureHandingResult,
>>>>> and
>>>>>>> then
>>>>>>>>>>>>> pass
>>>>>>>>>>>>>>> it to
>>>>>>>>>>>>>>>>>> Listener.
>>>>>>>>>>>>>>>>>> My question is, should we change the restart strategy
>>>>>>> interface
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> support
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> custom restart strategy, or keep the current restart
>>>>>>> strategy
>>>>>>>>> and
>>>>>>>>>>>>>>> let the
>>>>>>>>>>>>>>>>>> later
>>>>>>>>>>>>>>>>>> Listener enrich the restartable information to
>>>>>>>>>>>>>> FailureHandingResult?
>>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>>>>> latter
>>>>>>>>>>>>>>>>>> may cause some confusion when we use a custom restart
>>>>>>> strategy.
>>>>>>>>>>>>>>>>>> The default flink restart strategy also runs but does
>>>>> not
>>>>>>> take
>>>>>>>>>>>>>>> effect.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>> Weihua
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Mon, Mar 20, 2023 at 11:42 PM Lijie Wang <
>>>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi Panagiotis,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks for driving this.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> +1 for supporting custom restart strategy, we did
>>>>> receive
>>>>>>> such
>>>>>>>>>>>>>>> requests
>>>>>>>>>>>>>>>>>>> from the user mailing list [1][2].
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Besides, in current design, the plugin will only do
>>>>> some
>>>>>>>>>>>>>>> statistical
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> classification work, and will not affect the
>>>>>>>>>>>>>>> *FailureHandlingResult*.
>>>>>>>>>>>>>>>>>> Just
>>>>>>>>>>>>>>>>>>> listening, no handling, it doesn't quite match the
>>>>> title.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>> 
>>>>>>> https://lists.apache.org/thread/ch3s4jhh09wnff3tscqnb6btp2zlp2r1
>>>>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>> 
>>>>>>> https://lists.apache.org/thread/lwjfdr7c1ypo77r4rwojdk7kxx2sw4sx
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>> Lijie
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Zhu Zhu <[email protected]> 于2023年3月20日周一 21:39写道：
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi Panagiotis,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks for creating this proposal! It's good to
>>>>> enable
>>>>>>> Flink
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> handle
>>>>>>>>>>>>>>>>>>>> different errors in different ways, through a
>>>>> pluggable
>>>>>>> way.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> There are requests for flexible restart strategies
>>>>> from
>>>>>>> time
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> time,
>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>> different strategies of restart backoff time, or to
>>>>>>> suppress
>>>>>>>>>>>>>>>>> restarting
>>>>>>>>>>>>>>>>>>>> on certain errors. Therefore, I think it's better
>>>>> that
>>>>>>> the
>>>>>>>>>>>>>>> proposed
>>>>>>>>>>>>>>>>>>>> failure handling plugin can also support custom
>>>>> restart
>>>>>>>>>>>>>>> strategies.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Maybe we can call it FailureHandlingAdvisor which
>>>>>>> provides
>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>> information (labels) and gives advice (restart
>>>>> backoff
>>>>>>> time,
>>>>>>>>>>>>>>> whether
>>>>>>>>>>>>>>>>>>>> to restart)? I do not have a strong opinion though,
>>>>> any
>>>>>>>>>>>>>>> explanatory
>>>>>>>>>>>>>>>>>>>> name would be good.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> To avoid unexpected mutation, how about to make the
>>>>>>> context
>>>>>>>>>>>>>>> immutable
>>>>>>>>>>>>>>>>>>>> and let the plugin return an immutable result? i.e.
>>>>>>> remove
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> setters
>>>>>>>>>>>>>>>>>>>> from the context, and let the plugin method return a
>>>>>>> result
>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>> contains `labels`, `canRestart` and
>>>>> `restartBackoffTime`.
>>>>>>>>>>>>> Flink
>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>> apply the result to the context before invoking the
>>>>> next
>>>>>>>>>>>>>> plugin,
>>>>>>>>>>>>>>> so
>>>>>>>>>>>>>>>>>>>> that the next plugin will see the updated context.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> The plugin should avoid taking too much time to
>>>>> return
>>>>>>> the
>>>>>>>>>>>>>>> result,
>>>>>>>>>>>>>>>>>>> because
>>>>>>>>>>>>>>>>>>>> it will block the RPC and result in instability.
>>>>>>> However, it
>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>> still
>>>>>>>>>>>>>>>>>>>> perform heavy actions in a different thread. The
>>>>> context
>>>>>>> can
>>>>>>>>>>>>>>> provide
>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>>> `ioExecutor` to the plugins for reuse.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Zhu
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Shammon FY <[email protected]> 于2023年3月20日周一
>>>>> 20:21写道：
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hi Panagiotis
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thank you for your answer. I agree that
>>>>>>> `FailureListener`
>>>>>>>>>>>>>>> could be
>>>>>>>>>>>>>>>>>>>>> stateless, then I have some thoughts as follows
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 1. I see that listeners and tag collections are
>>>>>>> associated.
>>>>>>>>>>>>>>> When
>>>>>>>>>>>>>>>>>>>> JobManager
>>>>>>>>>>>>>>>>>>>>> fails and restarts, how can the new listener be
>>>>>>> associated
>>>>>>>>>>>>>>> with the
>>>>>>>>>>>>>>>>>> tag
>>>>>>>>>>>>>>>>>>>>> collection before failover? Is the listener loading
>>>>>>> order?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 2. The tag collection may be too large, resulting
>>>>> in the
>>>>>>>>>>>>>>> JobManager
>>>>>>>>>>>>>>>>>>> OOM,
>>>>>>>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>>>>>>> we need to provide a management class that supports
>>>>> some
>>>>>>>>>>>>>>>>> obsolescence
>>>>>>>>>>>>>>>>>>>>> strategies instead of a direct Collection?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 3. Is it possible to provide a more complex data
>>>>>>> structure
>>>>>>>>>>>>>>> than a
>>>>>>>>>>>>>>>>>>> simple
>>>>>>>>>>>>>>>>>>>>> string collection for tags in listeners, such as
>>>>>>> key-value?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>> Shammon FY
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Mon, Mar 20, 2023 at 7:48 PM Leonard Xu <
>>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hi，Panagiotis
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Thank you for kicking off this discussion.
>>>>> Overall, the
>>>>>>>>>>>>>>> proposed
>>>>>>>>>>>>>>>>>>>> feature of
>>>>>>>>>>>>>>>>>>>>>> this FLIP makes sense to me. We have also discussed
>>>>>>>>>>>>> similar
>>>>>>>>>>>>>>>>>>>> requirements
>>>>>>>>>>>>>>>>>>>>>> with our users and developers, and I believe it
>>>>> will
>>>>>>> help
>>>>>>>>>>>>>>> many
>>>>>>>>>>>>>>>>>> users.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> In terms of FLIP content, I have some thoughts:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> (1) For the FailureListenerContextget interface,
>>>>> the
>>>>>>>>>>>>>> methods
>>>>>>>>>>>>>>>>>>>>>> FailureListenerContext#addTag and
>>>>>>>>>>>>>>> FailureListenerContextgetTags
>>>>>>>>>>>>>>>>>> looks
>>>>>>>>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>>>>>>> inconsistent because they imply specific
>>>>> implementation
>>>>>>>>>>>>>>> details,
>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>>> all FailureListeners need to handle them, we
>>>>> shouldn't
>>>>>>>>>>>>> put
>>>>>>>>>>>>>>> them
>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> interface. Minor: The comment "UDF loading" in the
>>>>>>>>>>>>>>>>>>> getUserClassLoader()
>>>>>>>>>>>>>>>>>>>>>> method looks like a typo, IIUC it should return the
>>>>>>>>>>>>>>> classloader
>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> current job.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> (2) Regarding the implementation in
>>>>>>>>>>>>>>>>>>>> ExecutionFailureHandler#handleFailure,
>>>>>>>>>>>>>>>>>>>>>> some custom listeners may have heavy IO operations,
>>>>>>> such
>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>> reporting
>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>> their monitoring system. The current logic appears
>>>>> to
>>>>>>> be
>>>>>>>>>>>>>>>>> processing
>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> JobMaster's main thread, and it is recommended not
>>>>> to
>>>>>>> do
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>> kind
>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>> processing in the main thread.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> (3) The results of FailureListener's processing
>>>>> and the
>>>>>>>>>>>>>>>>>>>>>> FailureHandlingResult returned by
>>>>>>> ExecutionFailureHandler
>>>>>>>>>>>>>>> are not
>>>>>>>>>>>>>>>>>>>> related.
>>>>>>>>>>>>>>>>>>>>>> I think these two are closely related, the
>>>>> motivation
>>>>>>> of
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>> FLIP
>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>> make current failure handling more flexible. From
>>>>> this
>>>>>>>>>>>>>>>>> perspective,
>>>>>>>>>>>>>>>>>>>>>> different listeners should have the opportunity to
>>>>>>> affect
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> job's
>>>>>>>>>>>>>>>>>>>> failure
>>>>>>>>>>>>>>>>>>>>>> handling flow. For example, a Flink job is
>>>>> configured
>>>>>>>>>>>>> with
>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>> RestartStrategy with huge numbers retry , but the
>>>>> Kafka
>>>>>>>>>>>>>>> topic of
>>>>>>>>>>>>>>>>>>>> Source has
>>>>>>>>>>>>>>>>>>>>>> been deleted, the job will failover continuously.
>>>>> In
>>>>>>> this
>>>>>>>>>>>>>>> case,
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>>>>>>> should have their listener to determine whether
>>>>> this
>>>>>>>>>>>>>> failure
>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>> recoverable
>>>>>>>>>>>>>>>>>>>>>> or unrecoverable, and then wrap the processing
>>>>> result
>>>>>>>>>>>>> into
>>>>>>>>>>>>>>>>>>>>>> FailureHandlingResult.unrecoverable(xx) and pass
>>>>> it to
>>>>>>>>>>>>>>> JobMaster,
>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>> approach will be more flexible.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> (4) All FLIPs have an important section named
>>>>> Public
>>>>>>>>>>>>>>> Interfaces.
>>>>>>>>>>>>>>>>>>>> Current
>>>>>>>>>>>>>>>>>>>>>> FLIP mixes the interface section and the
>>>>> implementation
>>>>>>>>>>>>>>> section
>>>>>>>>>>>>>>>>>>>> together.
>>>>>>>>>>>>>>>>>>>>>> It is better for us to refer to the FLIP
>>>>> template[1]
>>>>>>> and
>>>>>>>>>>>>>>> separate
>>>>>>>>>>>>>>>>>>> them,
>>>>>>>>>>>>>>>>>>>>>> this will make the entire FLIP clearer.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> In addition, regarding the FLIP process, there is a
>>>>>>> small
>>>>>>>>>>>>>>>>>> suggestion:
>>>>>>>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>>>>>>>>> community generally creates a JIRA issue after the
>>>>> FLIP
>>>>>>>>>>>>>> vote
>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> passed,
>>>>>>>>>>>>>>>>>>>>>> instead of during the FLIP preparation phase
>>>>> because
>>>>>>> the
>>>>>>>>>>>>>>> FLIP may
>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>> rejected. Although this FLIP is very reasonable,
>>>>> it's
>>>>>>>>>>>>>> better
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> follow
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> process.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Leonard
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>> 
>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP+Template
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Mon, Mar 20, 2023 at 7:04 PM David Morávek <
>>>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> however listeners can use previous state
>>>>>>>>>>>>> (tags/labels)
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>>>>>>>> decisions
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> That sounds like a very fragile contract. We
>>>>> should
>>>>>>>>>>>>>> either
>>>>>>>>>>>>>>>>> allow
>>>>>>>>>>>>>>>>>>>> passing
>>>>>>>>>>>>>>>>>>>>>>> tags between listeners and then need to define
>>>>>>> ordering
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>> them
>>>>>>>>>>>>>>>>>>>>>>> independent. I prefer the latter because it
>>>>> allows us
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> parallelize
>>>>>>>>>>>>>>>>>>>>>> things
>>>>>>>>>>>>>>>>>>>>>>> if needed (if all listeners trigger an RCP to the
>>>>>>>>>>>>>> external
>>>>>>>>>>>>>>>>>> system,
>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>> example).
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Can you expand on why we need more than one
>>>>> classifier
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>> able
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>> output
>>>>>>>>>>>>>>>>>>>>>>> the same tag?
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> system ones come first and then the ones loaded
>>>>> from
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> plugin
>>>>>>>>>>>>>>>>>>>> manager
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Since they're returned as a Set, the order is
>>>>>>>>>>>>> completely
>>>>>>>>>>>>>>>>>>>>>> non-deterministic,
>>>>>>>>>>>>>>>>>>>>>>> no matter in which order they're loaded.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> just communicating with external
>>>>> monitoring/alerting
>>>>>>>>>>>>>>> systems
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> That makes the need for pushing things out of the
>>>>> main
>>>>>>>>>>>>>>> thread
>>>>>>>>>>>>>>>>>> even
>>>>>>>>>>>>>>>>>>>>>>> stronger. This almost sounds like we need to
>>>>> return a
>>>>>>>>>>>>>>>>>>>> CompletableFuture
>>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>> the per-throwable classification because an
>>>>> external
>>>>>>>>>>>>>> system
>>>>>>>>>>>>>>>>> might
>>>>>>>>>>>>>>>>>>>> take a
>>>>>>>>>>>>>>>>>>>>>>> significant time to respond. We need to unblock
>>>>> the
>>>>>>>>>>>>> main
>>>>>>>>>>>>>>> thread
>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>> other
>>>>>>>>>>>>>>>>>>>>>>> RPCs.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Also, in the proposal, this happens in the failure
>>>>>>>>>>>>>>> handler. If
>>>>>>>>>>>>>>>>>>>> that's the
>>>>>>>>>>>>>>>>>>>>>>> case, this might block the job from being
>>>>> restarted
>>>>>>> (if
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> restart
>>>>>>>>>>>>>>>>>>>>>>> strategy allows for another restart), which would
>>>>> be
>>>>>>>>>>>>>> great
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> avoid
>>>>>>>>>>>>>>>>>>>>>> because
>>>>>>>>>>>>>>>>>>>>>>> it can introduce extra downtime.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> This raises another question: what should happen
>>>>> if
>>>>>>> the
>>>>>>>>>>>>>>>>>>>> classification
>>>>>>>>>>>>>>>>>>>>>>> fails? Crashing the job (which is what's currently
>>>>>>>>>>>>>>> proposed)
>>>>>>>>>>>>>>>>>> seems
>>>>>>>>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>>>>>>>> dangerous if this might depend on an external
>>>>> system.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thats a valid point, passing the JobGraph
>>>>> containing
>>>>>>>>>>>>> all
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> above
>>>>>>>>>>>>>>>>>>>>>>>> information is also something to consider
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> We should avoid passing JG around because it's
>>>>> mutable
>>>>>>>>>>>>>>> (which
>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>>> must fix
>>>>>>>>>>>>>>>>>>>>>>> in the long term), and letting users change it
>>>>> might
>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>>> consequences.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>> D.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Mon, Mar 20, 2023 at 7:23 AM Panagiotis
>>>>> Garefalakis
>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hey David, Shammon,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the valuable comments!
>>>>>>>>>>>>>>>>>>>>>>>> I am glad you find this proposal useful, some
>>>>>>>>>>>>> thoughts:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> @Shammon
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 1. How about adding more job information in
>>>>>>>>>>>>>>>>>>>> FailureListenerContext? For
>>>>>>>>>>>>>>>>>>>>>>>>> example, job vertext, subtask, taskmanager
>>>>>>>>>>>>> location.
>>>>>>>>>>>>>>> And
>>>>>>>>>>>>>>>>> then
>>>>>>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>>>>>>>>>>> more statistics according to different
>>>>> dimensions.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Thats a valid point, passing the JobGraph
>>>>> containing
>>>>>>>>>>>>>> all
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> above
>>>>>>>>>>>>>>>>>>>>>>>> information
>>>>>>>>>>>>>>>>>>>>>>>> is also something to consider, I was mostly
>>>>> trying to
>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>> conservative:
>>>>>>>>>>>>>>>>>>>>>>>> i.e., passingly only the information we need, and
>>>>>>>>>>>>>> extend
>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>> see
>>>>>>>>>>>>>>>>>>>> fit
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 2. Users may want to save results in listener,
>>>>> and
>>>>>>>>>>>>> then
>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>> get
>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>> historical results even jabmanager failover.
>>>>> Can we
>>>>>>>>>>>>>>>>> provide a
>>>>>>>>>>>>>>>>>>>> unified
>>>>>>>>>>>>>>>>>>>>>>>>> implementation for data storage requirements?
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> The idea is to store only the output of the
>>>>> Listeners
>>>>>>>>>>>>>>> (tags)
>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>> treat
>>>>>>>>>>>>>>>>>>>>>>> them
>>>>>>>>>>>>>>>>>>>>>>>> as stateless.
>>>>>>>>>>>>>>>>>>>>>>>> Tags are be stored along with HistoryEntries, and
>>>>>>>>>>>>> will
>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>>>>>>>>>> the HistoryServer
>>>>>>>>>>>>>>>>>>>>>>>> even after a JM dies.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> @David
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 1) Should we also consider adding labels? The
>>>>>>>>>>>>>>> combination of
>>>>>>>>>>>>>>>>>> tags
>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>> labels seems to be what most systems offer;
>>>>>>>>>>>>>> sometimes,
>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>>>>> offer
>>>>>>>>>>>>>>>>>>>>>>> labels
>>>>>>>>>>>>>>>>>>>>>>>>> only (key=value pairs) because tags can be
>>>>>>>>>>>>>> implemented
>>>>>>>>>>>>>>>>> using
>>>>>>>>>>>>>>>>>>>> those,
>>>>>>>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>>>>>> the other way around.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Indeed changing tags to k:v labels could be more
>>>>>>>>>>>>>>> expressive,
>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>> like it!
>>>>>>>>>>>>>>>>>>>>>>>> Let's see what others think.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 2) Since we can not predict how heavy
>>>>> user-defined
>>>>>>>>>>>>>> models
>>>>>>>>>>>>>>>>>>>> ("listeners")
>>>>>>>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>>>>>> going to be, it would be great to keep the
>>>>>>>>>>>>>>> interfaces/data
>>>>>>>>>>>>>>>>>>>> structures
>>>>>>>>>>>>>>>>>>>>>>>>> immutable so we can push things over to the I/O
>>>>>>>>>>>>>>> threads.
>>>>>>>>>>>>>>>>>> Also,
>>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>>>>> sounds
>>>>>>>>>>>>>>>>>>>>>>>>> off to call the main interface a Listener since
>>>>>>>>>>>>> it's
>>>>>>>>>>>>>>>>> supposed
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>> enhance
>>>>>>>>>>>>>>>>>>>>>>>>> the original throwable with additional metadata.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> The idea was for the name to be generic as there
>>>>>>>>>>>>> could
>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>> Listener
>>>>>>>>>>>>>>>>>>>>>>>> implementations
>>>>>>>>>>>>>>>>>>>>>>>> just communicating with external
>>>>> monitoring/alerting
>>>>>>>>>>>>>>> systems
>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> no
>>>>>>>>>>>>>>>>>>>>>>>> metadata output
>>>>>>>>>>>>>>>>>>>>>>>> -- but lets rethink that. For immutability, see
>>>>>>>>>>>>> below:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 3) You're proposing to support a set of
>>>>> listeners.
>>>>>>>>>>>>>> Since
>>>>>>>>>>>>>>>>> you're
>>>>>>>>>>>>>>>>>>>> passing
>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>> mutable context around, which includes tags set
>>>>> by
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> previous
>>>>>>>>>>>>>>>>>>>>>>> listener,
>>>>>>>>>>>>>>>>>>>>>>>>> do you expect users to make any assumptions
>>>>> about
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> order
>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>>>>>>> listeners are executed?
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> In the existing proposal we are not making any
>>>>>>>>>>>>>>> assumptions
>>>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> order
>>>>>>>>>>>>>>>>>>>>>>>> of listeners,
>>>>>>>>>>>>>>>>>>>>>>>> (system ones come first and then the ones loaded
>>>>> from
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> plugin
>>>>>>>>>>>>>>>>>>>>>> manager)
>>>>>>>>>>>>>>>>>>>>>>>> however listeners can use previous state
>>>>>>>>>>>>> (tags/labels)
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>>>>>>>> decisions:
>>>>>>>>>>>>>>>>>>>>>>>> e.g., wont assign *UNKNOWN* failureType when we
>>>>> have
>>>>>>>>>>>>>>> already
>>>>>>>>>>>>>>>>>> seen
>>>>>>>>>>>>>>>>>>>> *USER
>>>>>>>>>>>>>>>>>>>>>>> *or
>>>>>>>>>>>>>>>>>>>>>>>> the other way around -- when we have seen
>>>>> *UNKNOWN*
>>>>>>>>>>>>>>> remove in
>>>>>>>>>>>>>>>>>>>> favor of
>>>>>>>>>>>>>>>>>>>>>>>> *USER*
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>>> Panagiotis
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On Sun, Mar 19, 2023 at 10:42 AM David Morávek <
>>>>>>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Hi Panagiotis,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> This is an excellent proposal and something
>>>>>>>>>>>>> everyone
>>>>>>>>>>>>>>> trying
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>> provide
>>>>>>>>>>>>>>>>>>>>>>>>> "Flink as a service" needs to solve at some
>>>>> point.
>>>>>>>>>>>>> I
>>>>>>>>>>>>>>> have a
>>>>>>>>>>>>>>>>>>>> couple of
>>>>>>>>>>>>>>>>>>>>>>>>> questions:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> If I understand the proposal correctly, this is
>>>>>>>>>>>>> just
>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>> adding
>>>>>>>>>>>>>>>>>>>>>> tags
>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>> the Throwable by running a tuple of (Throwable,
>>>>>>>>>>>>>>>>>> FailureContext)
>>>>>>>>>>>>>>>>>>>>>>> through a
>>>>>>>>>>>>>>>>>>>>>>>>> user-defined model.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 1) Should we also consider adding labels? The
>>>>>>>>>>>>>>> combination
>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>> tags and
>>>>>>>>>>>>>>>>>>>>>>>>> labels seems to be what most systems offer;
>>>>>>>>>>>>>> sometimes,
>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>>>>> offer
>>>>>>>>>>>>>>>>>>>>>>> labels
>>>>>>>>>>>>>>>>>>>>>>>>> only (key=value pairs) because tags can be
>>>>>>>>>>>>>> implemented
>>>>>>>>>>>>>>>>> using
>>>>>>>>>>>>>>>>>>>> those,
>>>>>>>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>>>>>> the other way around.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 2) Since we can not predict how heavy
>>>>> user-defined
>>>>>>>>>>>>>>> models
>>>>>>>>>>>>>>>>>>>>>> ("listeners")
>>>>>>>>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>>>>>> going to be, it would be great to keep the
>>>>>>>>>>>>>>> interfaces/data
>>>>>>>>>>>>>>>>>>>> structures
>>>>>>>>>>>>>>>>>>>>>>>>> immutable so we can push things over to the I/O
>>>>>>>>>>>>>>> threads.
>>>>>>>>>>>>>>>>>> Also,
>>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>>>>> sounds
>>>>>>>>>>>>>>>>>>>>>>>>> off to call the main interface a Listener since
>>>>>>>>>>>>> it's
>>>>>>>>>>>>>>>>> supposed
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>> enhance
>>>>>>>>>>>>>>>>>>>>>>>>> the original throwable with additional metadata.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I'd propose something along the lines of (we
>>>>> should
>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>> better
>>>>>>>>>>>>>>>>>>>>>> names,
>>>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>> is just to outline the idea):
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> interface FailureEnricher {
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> ThrowableWithTagsAndLabels
>>>>>>>>>>>>> enrichFailure(Throwable
>>>>>>>>>>>>>>> cause,
>>>>>>>>>>>>>>>>>>>>>>>>> ImmutableContextualMetadataAboutTheThrowable
>>>>>>>>>>>>>> context);
>>>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> The names should change; this is just to outline
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> idea.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 3) You're proposing to support a set of
>>>>> listeners.
>>>>>>>>>>>>>>> Since
>>>>>>>>>>>>>>>>>> you're
>>>>>>>>>>>>>>>>>>>>>> passing
>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>> mutable context around, which includes tags set
>>>>> by
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> previous
>>>>>>>>>>>>>>>>>>>>>>> listener,
>>>>>>>>>>>>>>>>>>>>>>>>> do you expect users to make any assumptions
>>>>> about
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> order
>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>>>>>>> listeners are executed?
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> *@Shammon*
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Users may want to save results in listener, and
>>>>>>>>>>>>> then
>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>> get the
>>>>>>>>>>>>>>>>>>>>>>>>>> historical results even jabmanager failover.
>>>>> Can
>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>> provide a
>>>>>>>>>>>>>>>>>>>>>> unified
>>>>>>>>>>>>>>>>>>>>>>>>>> implementation for data storage requirements?
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I think we should explicitly state that all
>>>>>>>>>>>>>>> "listeners" are
>>>>>>>>>>>>>>>>>>>> treated
>>>>>>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>>>>>>>> stateless. I don't see any strong reason for
>>>>>>>>>>>>>>> snapshotting
>>>>>>>>>>>>>>>>>> them.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>> D.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Sat, Mar 18, 2023 at 1:00 AM Shammon FY <
>>>>>>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Panagiotis
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you for starting this discussion. I think
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>> FLIP
>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>> valuable
>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>> can help user to analyze the causes of job
>>>>>>>>>>>>> failover
>>>>>>>>>>>>>>>>> better!
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> I have two comments as follows
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 1. How about adding more job information in
>>>>>>>>>>>>>>>>>>>> FailureListenerContext?
>>>>>>>>>>>>>>>>>>>>>>> For
>>>>>>>>>>>>>>>>>>>>>>>>>> example, job vertext, subtask, taskmanager
>>>>>>>>>>>>>> location.
>>>>>>>>>>>>>>> And
>>>>>>>>>>>>>>>>>> then
>>>>>>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>>>>>>>>>>>> more statistics according to different
>>>>>>>>>>>>> dimensions.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 2. Users may want to save results in listener,
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> then
>>>>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>>> get
>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> historical results even jabmanager failover.
>>>>> Can
>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>> provide a
>>>>>>>>>>>>>>>>>>>>>> unified
>>>>>>>>>>>>>>>>>>>>>>>>>> implementation for data storage requirements?
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>> shammon FY
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> On Saturday, March 18, 2023, Panagiotis
>>>>>>>>>>>>>> Garefalakis <
>>>>>>>>>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> This FLIP [1] proposes a pluggable interface
>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>> failure
>>>>>>>>>>>>>>>>>>>> handling
>>>>>>>>>>>>>>>>>>>>>>>>>> allowing
>>>>>>>>>>>>>>>>>>>>>>>>>>> users to implement custom failure logic using
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> plugin
>>>>>>>>>>>>>>>>>>>>>> framework.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Motivated by existing proposals [2] and
>>>>> tickets
>>>>>>>>>>>>>>> [3],
>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>> enables
>>>>>>>>>>>>>>>>>>>>>>>>>> use-cases
>>>>>>>>>>>>>>>>>>>>>>>>>>> like: assigning particular types to failures
>>>>>>>>>>>>>> (e.g.,
>>>>>>>>>>>>>>>>> User
>>>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>>>>> System),
>>>>>>>>>>>>>>>>>>>>>>>>>>> emitting custom metrics per type (e.g.,
>>>>>>>>>>>>>>> application or
>>>>>>>>>>>>>>>>>>>> platform),
>>>>>>>>>>>>>>>>>>>>>>>> even
>>>>>>>>>>>>>>>>>>>>>>>>>>> exposing errors to downstream consumers (e.g.,
>>>>>>>>>>>>>>>>>> notification
>>>>>>>>>>>>>>>>>>>>>>> systems).
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks to Piotr and Anton for the initial
>>>>>>>>>>>>> reviews
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>> discussions!
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> For anyone interested, the starting point
>>>>> would
>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> FLIP
>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>>>>> created,
>>>>>>>>>>>>>>>>>>>>>>>>>>> describing the motivation and the proposed
>>>>>>>>>>>>>> changes
>>>>>>>>>>>>>>>>> (part
>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> core,
>>>>>>>>>>>>>>>>>>>>>>>>>>> runtime and web).
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> The intuition behind this FLIP is being able
>>>>> to
>>>>>>>>>>>>>>> execute
>>>>>>>>>>>>>>>>>>>> custom
>>>>>>>>>>>>>>>>>>>>>>> logic
>>>>>>>>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>>>>>>>>> failures by exposing a FailureListener
>>>>>>>>>>>>> interface.
>>>>>>>>>>>>>>>>>>>> Implementation
>>>>>>>>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>>>>>>>>>>>>> can be simply loaded to the system as Jar
>>>>>>>>>>>>> files.
>>>>>>>>>>>>>>>>>>>> FailureListeners
>>>>>>>>>>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>>>>>>>>>>>> also
>>>>>>>>>>>>>>>>>>>>>>>>>>> decide to assign failure tags to errors
>>>>>>>>>>>>>> (expressed
>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>>> strings),
>>>>>>>>>>>>>>>>>>>>>>>>>>> that will then be exposed as metadata by the
>>>>>>>>>>>>>>> UI/Rest
>>>>>>>>>>>>>>>>>>>> interfaces.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Feedback is always appreciated! Looking
>>>>> forward
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> your
>>>>>>>>>>>>>>>>>>>> thoughts!
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%
>>>>>>>>>>>>>>>>>>>>>>>>>>> 3A+Pluggable+failure+handling+for+Apache+Flink
>>>>>>>>>>>>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>> https://docs.google.com/document/d/1pcHg9F3GoDDeVD5GIIo2wO67
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hmjgy0-hRDeuFnrMgT4
>>>>>>>>>>>>>>>>>>>>>>>>>>> [3]
>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-20833
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Panagiotis
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>> 
>>

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

Reply via email to