Thanks Panagiotis for the update, the updated FLIP looks good to me.
Best, Leonard > On Apr 13, 2023, at 7:42 AM, Panagiotis Garefalakis <pga...@apache.org> wrote: > > Hello there, > > Zhu: agree with the config option, great suggestion > Hong: global timeout is also interesting and a good addition -- only > downside I see is just another config option > > If everyone is happy, I suggest we keep the discussion open until Friday > and start a Vote shortly after. > > Cheers, > Panagiotis > > On Tue, Apr 11, 2023 at 12:58 AM Teoh, Hong <lian...@amazon.co.uk.invalid> > wrote: > >> Hi Panagiotis, >> >> Thank you for the update. Looks great! Just one suggestion below: >> >> 1. We seem to be waiting for the future(s) to complete before restarting >> the job - should we add a configurable timeout for the enrichment? Since >> each failure enricher are run in parallel, we could probably settle for 1 >> timeout for all failure handlers. >> 2. +1 to Zhu’s comment on adding a comma separated list of FailureHandlers >> instead of boolean toggle! >> >> Other than the above, the FLIP looks great! Thank you for your efforts. >> >> Regards, >> Hong >> >> >> >>> On 11 Apr 2023, at 08:01, Zhu Zhu <reed...@gmail.com> wrote: >>> >>> CAUTION: This email originated from outside of the organization. Do not >> click links or open attachments unless you can confirm the sender and know >> the content is safe. >>> >>> >>> >>> Hi Panagiotis, >>> >>> Thanks for updating the FLIP. >>> >>>> Regarding the config option >> `jobmanager.failure-enricher-plugins.enabled` >>> I think a config option `jobmanager.failure-enrichers`, which accepts >>> the names of enrichers to use, may be better. It allows the users to >>> deploy and use the plugins in a more flexible way. The default value >>> of the config can be none, which means failure enrichment will be >>> disabled by default. >>> A reference can be the config option `metrics.reporters` which helps >>> to load metric reporter plugins. >>> >>> Thanks, >>> Zhu >>> >>> Panagiotis Garefalakis <pga...@apache.org> 于2023年4月10日周一 03:47写道: >>>> >>>> Hello again everyone, >>>> >>>> FLIP is now updated based on our discussion! >>>> In short, FLIP-304 [1] proposes the addition of a pluggable interface >> that >>>> will allow users to add custom logic and enrich failures with custom >>>> metadata labels. >>>> While as discussed, custom restart strategies will be part of a >> different >>>> effort. Every pluggable FaulireEnricher: >>>> >>>> - Is triggered on every global/non-global failure >>>> - Receives a Throwable cause and an immutable Context >>>> - Performs asynchronous execution (separate IoExecutor) to avoid >>>> blocking the main thread for RPCs >>>> - Is completely independent from other Enrichers >>>> - Emits failure labels/tags for its unique, pre-defined keys (defined >> at >>>> startup time) >>>> >>>> >>>> Check the link for implementation details and please let me know what >> you >>>> think :) >>>> >>>> >>>> [1] >>>> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%3A+Pluggable+Failure+Enrichers >>>> >>>> >>>> Panagiotis >>>> >>>> >>>> On Tue, Mar 28, 2023 at 5:01 AM Zhu Zhu <reed...@gmail.com> wrote: >>>> >>>>> Hi Panagiotis, >>>>> >>>>> How about to introduce a config option to control which error handling >>>>> plugins should be used? It is more flexible for deployments. >> Additionally, >>>>> it can also enable users to explicitly specify the order that the >> plugins >>>>> take effects. >>>>> >>>>> Thanks, >>>>> Zhu >>>>> >>>>> Gen Luo <luogen...@gmail.com> 于2023年3月27日周一 15:02写道: >>>>>> >>>>>> Thanks for the summary! >>>>>> >>>>>> Also +1 to support custom restart strategies in a different FLIP, >>>>>> as long as we can make sure that the plugin interface won't be >>>>>> changed when the restart strategy interface is introduced. >>>>>> >>>>>> To achieve this, maybe we should think well how the handler >>>>>> would cooperate with the restart strategy, like would it executes b >>>>>> efore the strategy (e.g. some strategy may use the tag), or after >>>>>> it (e.g. some metric reporting handler may use the handling result). >>>>>> Though we can implement in one way, and extend if the other is >>>>>> really necessary by someone. >>>>>> >>>>>> Besides, instead of using either of the names, shall we just make >>>>>> them two subclasses named FailureEnricher and FailureListener? >>>>>> The former executes synchronously and can modify the context, >>>>>> while the latter executes asynchronously and has a read-only view >>>>>> of context. In this way we can make sure a handler behaves in >>>>>> the expected way. >>>>>> >>>>>> >>>>>> On Thu, Mar 23, 2023 at 5:19 PM Zhu Zhu <reed...@gmail.com> wrote: >>>>>> >>>>>>> +1 to support custom restart strategies in a different FLIP. >>>>>>> >>>>>>> It's fine to have a different plugin for custom restart strategy. >>>>>>> If so, since we do not treat the FLIP-304 plugin as a common failure >>>>>>> handler, but instead mainly targets to add labels to errors, I would >>>>>>> +1 for the name `FailureEnricher`. >>>>>>> >>>>>>> Thanks, >>>>>>> Zhu >>>>>>> >>>>>>> David Morávek <d...@apache.org> 于2023年3月23日周四 15:51写道: >>>>>>>> >>>>>>>>> >>>>>>>>> One additional remark on introducing it as an async operation: We >>>>> would >>>>>>>>> need a new configuration parameter to define the timeout for such a >>>>>>>>> listener call, wouldn't we? >>>>>>>>> >>>>>>>> >>>>>>>> This could be left up to the implementor to handle. >>>>>>>> >>>>>>>> What about adding an extra method getNamespace() to the Listener >>>>>>> interface >>>>>>>>> which returns an Optional<String>. >>>>>>>>> >>>>>>>> >>>>>>>> I'd avoid mixing an additional concept into this. We can simply have >>>>> a >>>>>>> new >>>>>>>> method that returns a set of keys the listener can output. We can >>>>>>> validate >>>>>>>> this at the JM startup time and fail fast (since it's a >> configuration >>>>>>>> error) if there is an overlap. If the listener outputs the key that >>>>> is >>>>>>> not >>>>>>>> allowed to, I wouldn't be afraid to call into a fatal error handler >>>>> since >>>>>>>> it's an invalid implementation. >>>>>>>> >>>>>>>> Best, >>>>>>>> D. >>>>>>>> >>>>>>>> On Thu, Mar 23, 2023 at 8:34 AM Matthias Pohl >>>>>>>> <matthias.p...@aiven.io.invalid> wrote: >>>>>>>> >>>>>>>>> Sounds good. Two points I want to add: >>>>>>>>> >>>>>>>>> - Listener execution should be independent — however we need a >>>>> way >>>>>>> to >>>>>>>>>> enforce a Label key/key-prefix is only assigned to a single >>>>> Listener, >>>>>>>>>> thinking of a validation step both at Listener init and runtime >>>>>>> stages >>>>>>>>>> >>>>>>>>> What about adding an extra method getNamespace() to the Listener >>>>>>> interface >>>>>>>>> which returns an Optional<String>. Therefore, the >>>>> implementation/the >>>>>>> user >>>>>>>>> can decide depending on the use case whether it's necessary to have >>>>>>>>> separate namespaces for the key/value pairs or not. On the Flink >>>>> side, >>>>>>> we >>>>>>>>> would just merge the different maps considering their namespaces. >>>>>>>>> >>>>>>>>> A flaw of this approach is that if a user decides to use the same >>>>>>> namespace >>>>>>>>> for multiple listeners, how is an error in one of the listeners >>>>>>> represented >>>>>>>>> in the outcome? We would have to overwrite either the successful >>>>>>> listener's >>>>>>>>> result or the failed ones. I wanted to share it, anyway. >>>>>>>>> >>>>>>>>> One additional remark on introducing it as an async operation: We >>>>> would >>>>>>>>> need a new configuration parameter to define the timeout for such a >>>>>>>>> listener call, wouldn't we? >>>>>>>>> >>>>>>>>> Matthias >>>>>>>>> >>>>>>>>> On Wed, Mar 22, 2023 at 4:56 PM Panagiotis Garefalakis < >>>>>>> pga...@apache.org> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hi everyone, >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Thanks for the valuable comments! >>>>>>>>>> Excited to see this is an area of interest for the community! >>>>>>>>>> >>>>>>>>>> Summarizing some of the main points raised along with my >>>>> thoughts: >>>>>>>>>> >>>>>>>>>> - Labels (Key/Value) pairs are more expressive than Tags >>>>>>> (Strings) so >>>>>>>>>> using the former is a good idea — I am also debating if we >>>>> want to >>>>>>>>>> return >>>>>>>>>> multiple KV pairs per Listener (one could argue that we could >>>>>>> split >>>>>>>>> the >>>>>>>>>> logic in multiple Listeners to support that) >>>>>>>>>> - An immutable context along with data returned using the >>>>>>> interface >>>>>>>>>> method implementations is a better approach than a mutable >>>>>>> Collection >>>>>>>>>> - Listener execution should be independent — however we need a >>>>>>> way to >>>>>>>>>> enforce a Label key/key-prefix is only assigned to a single >>>>>>> Listener, >>>>>>>>>> thinking of a validation step both at Listener init and >>>>> runtime >>>>>>> stages >>>>>>>>>> - We want to perform async Listener operations as sync could >>>>>>> block the >>>>>>>>>> main thread — exposing an ioExecutor pool through the context >>>>>>> could be >>>>>>>>>> an >>>>>>>>>> elegant solution here >>>>>>>>>> - Make sure Listener errors are not failing jobs — make sure >>>>> to >>>>>>> log >>>>>>>>> and >>>>>>>>>> keep the job alive >>>>>>>>>> - We need better naming / public interface >>>>> separation/description >>>>>>>>>> >>>>>>>>>> - Even though custom restart strategies share some >>>>>>> properties >>>>>>>>> with >>>>>>>>>> Listeners, they would probably need a separate interface with a >>>>>>> different >>>>>>>>>> return type anyway (restart strategy not labels) and in general >>>>> they >>>>>>> are >>>>>>>>>> different and complex enough to justify their own FLIP (that can >>>>>>> also be >>>>>>>>> a >>>>>>>>>> follow-up). >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> What do people think? I am planning to modify the FLIP to reflect >>>>>>> these >>>>>>>>>> changes if they make sense to everyone. >>>>>>>>>> >>>>>>>>>> Cheers, >>>>>>>>>> Panagiotis >>>>>>>>>> >>>>>>>>>> On Wed, Mar 22, 2023 at 6:28 AM Hong Teoh <hlteo...@gmail.com> >>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Hi all, >>>>>>>>>>> >>>>>>>>>>> Thank you Panagiotis for proposing this. From the size of the >>>>>>> thread, >>>>>>>>>> this >>>>>>>>>>> is a much needed feature in Flink! >>>>>>>>>>> Some thoughts, to extend those already adeptly summarised by >>>>> Piotr, >>>>>>>>>>> Matthias and Jing. >>>>>>>>>>> >>>>>>>>>>> - scope of FLIP: +1 to scoping this FLIP to observability >>>>> around a >>>>>>>>>>> restart. That would include adding metadata + exposing >>>>> metadata to >>>>>>>>>> external >>>>>>>>>>> systems. IMO, introducing a new restart strategy solves >>>>> different >>>>>>>>>> problems, >>>>>>>>>>> is much larger scope and should be covered in a separate FLIP. >>>>>>>>>>> >>>>>>>>>>> - failure handling: At the moment, we propose transitioning the >>>>>>> Flink >>>>>>>>> job >>>>>>>>>>> to a terminal FAILED state when JobListener fails, when the job >>>>>>> could >>>>>>>>>> have >>>>>>>>>>> transitioned to RESTARTING->RUNNING. If we are keeping in line >>>>>>> with the >>>>>>>>>>> scope to add metadata/observability around job restarts, we >>>>> should >>>>>>> not >>>>>>>>> be >>>>>>>>>>> affecting the running of the Flink job itself. Could I propose >>>>> we >>>>>>>>> instead >>>>>>>>>>> log WARN/ERROR. >>>>>>>>>>> >>>>>>>>>>> - immutable context: +1 to keeping the contract clear via >>>>> return >>>>>>> types. >>>>>>>>>>> - async operation: +1 to adding ioexecutor to context, however, >>>>>>> given >>>>>>>>> we >>>>>>>>>>> don’t want to block the actual job restart on adding metadata / >>>>>>> calling >>>>>>>>>>> external services, should we consider returning and letting >>>>> futures >>>>>>>>>>> complete independently? >>>>>>>>>>> >>>>>>>>>>> - independent vs ordered execution: Should we consider making >>>>> the >>>>>>> order >>>>>>>>>> of >>>>>>>>>>> execution deterministic (use a List instead of Set)? >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Once again, thank you for working on this. >>>>>>>>>>> >>>>>>>>>>> Regards, >>>>>>>>>>> Hong >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> On 21 Mar 2023, at 21:07, Jing Ge <j...@ververica.com.INVALID >>>>>> >>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> Thanks Panagiotis for this FLIP and thanks for all valuable >>>>>>>>>> discussions. >>>>>>>>>>>> I'd like to share my two cents: >>>>>>>>>>>> >>>>>>>>>>>> - FailureListenerContext#addTag and >>>>>>> FailureListenerContext#getTags. >>>>>>>>> It >>>>>>>>>>>> seems that we have to call getTags() and then do remove >>>>>>> activities if >>>>>>>>>> we >>>>>>>>>>>> want to delete any tags (according to the javadoc in the >>>>> FLIP). >>>>>>> It >>>>>>>>> is >>>>>>>>>>>> inconsistent for me too. Either offer addTag(), deleteTag(), >>>>> and >>>>>>> let >>>>>>>>>>>> getTags() return immutable collection, or offer getTags() >>>>> only to >>>>>>>>>> return >>>>>>>>>>>> mutable collection. >>>>>>>>>>>> >>>>>>>>>>>> - label vs tag. Label is a great idea +1. AFAIC, tag could >>>>> be a >>>>>>>>> special >>>>>>>>>>>> case of label, i.e. key="tag". It is convenient to offer the >>>>>>> xxxTag() >>>>>>>>>>>> method if the user only needs one label. I would love to have >>>>>>> both of >>>>>>>>>>> them. >>>>>>>>>>>> Another thought is that tag implicitly contains the meaning >>>>> of >>>>>>>>>>> "immutable". >>>>>>>>>>>> >>>>>>>>>>>> - +1 for a separate FLIP of customized restart strategy. >>>>>>> Attention >>>>>>>>>> should >>>>>>>>>>>> be taken to make sure it works well with Flink built-in >>>>>>>>> restartStrategy >>>>>>>>>>> in >>>>>>>>>>>> order to have the single source of truth. >>>>>>>>>>>> >>>>>>>>>>>> - execution order. The default independent execution should >>>>> be >>>>>>> fine. >>>>>>>>>>>> According to the FailureListener interface definition in the >>>>>>> FLIP, >>>>>>>>>> users >>>>>>>>>>>> should be able to easily build a listener chain[1] to offer >>>>>>>>> sequential >>>>>>>>>>>> execution, e.g. public FailureListener(FailureListener >>>>>>> nextListener). >>>>>>>>>>>> Another option is to modify the interface or provide another >>>>>>>>> interface >>>>>>>>>>>> alongside the current one to extend the method to support >>>>>>>>>> ListenerChain, >>>>>>>>>>>> i.e. void onFailure(Throwable cause, FailureListenerContext >>>>>>> context, >>>>>>>>>>>> ListenerChain listenerChain). Users can also mix them up. >>>>>>>>>>>> >>>>>>>>>>>> - naming. Afaiu, the pluggable extension is not limited to >>>>>>> failure >>>>>>>>>>>> enrichment. Conceptually it can do everything for the given >>>>>>> failure, >>>>>>>>>> e.g. >>>>>>>>>>>> start counting metric as the FLIP described, calling an >>>>> external >>>>>>>>>> system, >>>>>>>>>>>> sending notification to slack channel, etc. you name it. It >>>>>>> sounds to >>>>>>>>>> me >>>>>>>>>>>> more like a FailureActionListener - it can trigger actions >>>>> based >>>>>>> on >>>>>>>>>>>> failure. Failure enrichment is one type of action. >>>>>>>>>>>> >>>>>>>>>>>> Best regards, >>>>>>>>>>>> Jing >>>>>>>>>>>> >>>>>>>>>>>> [1] >>>>>>> https://en.wikipedia.org/wiki/Chain-of-responsibility_pattern >>>>>>>>>>>> >>>>>>>>>>>> On Tue, Mar 21, 2023 at 3:39 PM Matthias Pohl >>>>>>>>>>>> <matthias.p...@aiven.io.invalid> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Thanks for the proposal, Panagiotis. A lot of good points >>>>> have >>>>>>> been >>>>>>>>>>> already >>>>>>>>>>>>> shared. I just want to add my view on some of the items: >>>>>>>>>>>>> >>>>>>>>>>>>> - independent execution vs ordered execution: I prefer the >>>>>>> listeners >>>>>>>>>>> being >>>>>>>>>>>>> processed independently from each other because it adds less >>>>>>>>>> complexity >>>>>>>>>>>>> code-wise. The use case Piotr described (where you want to >>>>> reuse >>>>>>>>> some >>>>>>>>>>> other >>>>>>>>>>>>> classifier) is the only one I can think of where we actually >>>>>>> need >>>>>>>>>>>>> classifiers depending on each other. Supporting such a use >>>>> case >>>>>>>>> right >>>>>>>>>>> from >>>>>>>>>>>>> the start feels a bit over-engineered and could be covered >>>>> in a >>>>>>>>>>> follow-up >>>>>>>>>>>>> FLIP if we really come to that point where such a feature is >>>>>>>>> requested >>>>>>>>>>> by >>>>>>>>>>>>> users. >>>>>>>>>>>>> >>>>>>>>>>>>> - key/value pairs instead of plain labels: I think that's a >>>>> good >>>>>>>>> idea. >>>>>>>>>>>>> key/value pairs are more expressive. +1 >>>>>>>>>>>>> >>>>>>>>>>>>> - extending the FLIP to cover restart strategy: I understand >>>>>>> Gen's >>>>>>>>>>> concern >>>>>>>>>>>>> about introducing too many different types of plugins. But I >>>>>>> would >>>>>>>>>> still >>>>>>>>>>>>> favor not extending the FLIP in this regard. A pluggable >>>>> restart >>>>>>>>>>> strategy >>>>>>>>>>>>> sounds reasonable. But an error classifier and a restart >>>>>>> strategy >>>>>>>>> are >>>>>>>>>>> still >>>>>>>>>>>>> different enough to justify separate plugins, IMHO. And >>>>>>> therefore, I >>>>>>>>>>> would >>>>>>>>>>>>> think that covering the restart strategy in a separate FLIP >>>>> is >>>>>>> the >>>>>>>>>>> better >>>>>>>>>>>>> option for the sake of simplicity. >>>>>>>>>>>>> >>>>>>>>>>>>> - immutable context: Passing in an immutable context and >>>>>>> returning >>>>>>>>>> data >>>>>>>>>>>>> through the interface method's return value sounds like a >>>>> better >>>>>>>>>>> approach >>>>>>>>>>>>> to harden the contract of the interface. +1 for that >>>>> proposal >>>>>>>>>>>>> >>>>>>>>>>>>> - async operation: I think David is right. An async >>>>> interface >>>>>>> makes >>>>>>>>>> the >>>>>>>>>>>>> listener implementations more robust when it comes to heavy >>>>> IO >>>>>>>>>>> operations. >>>>>>>>>>>>> The ioExecutor can be passed through the context object. +1 >>>>>>>>>>>>> >>>>>>>>>>>>> Matthias >>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, Mar 21, 2023 at 2:09 PM David Morávek < >>>>>>>>>> david.mora...@gmail.com> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> *@Piotr* >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> I was thinking about actually defining the order of the >>>>>>>>>>>>>>> classifiers/handlers and not allowing them to be >>>>> asynchronous. >>>>>>>>>>>>>>> Asynchronousity would create some problems: when to >>>>> actually >>>>>>>>> return >>>>>>>>>>> the >>>>>>>>>>>>>>> error to the user? After all async responses will get >>>>> back? >>>>>>>>> Before, >>>>>>>>>>> but >>>>>>>>>>>>>>> without classified exception? It would also add >>>>> implementation >>>>>>>>>>>>> complexity >>>>>>>>>>>>>>> and I think we can always expand the API with async >>>>> version >>>>>>> in the >>>>>>>>>>>>> future >>>>>>>>>>>>>>> if needed. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> As long as the classifiers need to talk to an external >>>>> system, >>>>>>> we >>>>>>>>> by >>>>>>>>>>>>>> definition need to allow them to be asynchronous to >>>>> unblock the >>>>>>>>> main >>>>>>>>>>>>> thread >>>>>>>>>>>>>> for handling other RPCs. Exposing ioExecutor via the >>>>> context >>>>>>>>> proposed >>>>>>>>>>>>> above >>>>>>>>>>>>>> would be great. >>>>>>>>>>>>>> >>>>>>>>>>>>>> After all async responses will get back >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> This would be the same if we trigger them synchronously >>>>> one by >>>>>>> one, >>>>>>>>>>> with >>>>>>>>>>>>> a >>>>>>>>>>>>>> caveat that synchronous execution might take significantly >>>>>>> longer >>>>>>>>> and >>>>>>>>>>>>>> introduce unnecessary downtime to a job. >>>>>>>>>>>>>> >>>>>>>>>>>>>> D. >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Tue, Mar 21, 2023 at 1:12 PM Zhu Zhu <reed...@gmail.com >>>>>> >>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Piotr, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> It's fine to me to have a separate FLIP to extend this >>>>>>>>>>>>> `FailureListener` >>>>>>>>>>>>>>> to support custom restart strategy. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> What I was a bit concerned is that if we just treat the >>>>>>>>>>>>> `FailureListener` >>>>>>>>>>>>>>> as an error classifier which is not crucial to Flink >>>>> framework >>>>>>>>>>> process, >>>>>>>>>>>>>>> we may design it to run asynchronously and not trigger >>>>> Flink >>>>>>>>>> failures. >>>>>>>>>>>>>>> This may be a blocker if later we want to enable it to >>>>> support >>>>>>>>>> custom >>>>>>>>>>>>>>> restart strategy. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>> Zhu >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Dian Fu <dian0511...@gmail.com> 于2023年3月21日周二 19:53写道: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Panagiotis, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks for the proposal. This is a very valuable feature >>>>> and >>>>>>> will >>>>>>>>>> be >>>>>>>>>>>>> a >>>>>>>>>>>>>>> good >>>>>>>>>>>>>>>> add-on for Flink. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I also think that it will be great if we can consider >>>>> how to >>>>>>> make >>>>>>>>>> it >>>>>>>>>>>>>>>> possible for users to customize the failure handling in >>>>> this >>>>>>>>> FLIP. >>>>>>>>>>>>> It's >>>>>>>>>>>>>>>> highly related to the problem we want to address in this >>>>>>> FLIP and >>>>>>>>>>>>> could >>>>>>>>>>>>>>>> avoid refactoring the interfaces proposed in this FLIP >>>>> too >>>>>>>>> quickly. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Currently it treats all kinds of exceptions the same. >>>>>>> However, >>>>>>>>> some >>>>>>>>>>>>>> kinds >>>>>>>>>>>>>>>> of exceptions are actually not recoverable at all. It >>>>> could >>>>>>> let >>>>>>>>>> users >>>>>>>>>>>>>> to >>>>>>>>>>>>>>>> customize the failure handling logic to fail fast for >>>>> certain >>>>>>>>> known >>>>>>>>>>>>>>>> unrecoverable exceptions and finally make these kinds of >>>>>>> jobs get >>>>>>>>>>>>>> noticed >>>>>>>>>>>>>>>> and recoveried more quickly. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>> Dian >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Tue, Mar 21, 2023 at 4:36 PM Gen Luo < >>>>> luogen...@gmail.com >>>>>>>> >>>>>>>>>> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi Panagiotis, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks for the proposal. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> It's useful to enrich the information so that users can >>>>> be >>>>>>> more >>>>>>>>>>>>>>>>> clear why the job is failing, especially platform >>>>>>> developers who >>>>>>>>>>>>>>>>> need to provide the information to their end users. >>>>>>>>>>>>>>>>> And for the very FLIP, I'd prefer the naming >>>>>>> `FailureEnricher` >>>>>>>>>>>>>>>>> proposed by David, as the plugin doesn't really handle >>>>> the >>>>>>>>>> failure. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> However, like Zhu and Lijie said, I also joined a >>>>> discussion >>>>>>>>>>>>>>>>> recently about customized failure handling, e.g. >>>>> counting >>>>>>> the >>>>>>>>>>>>>>>>> failure rate of pipeline regions separately, and failing >>>>>>> the job >>>>>>>>>>>>>>>>> when a specific error occurs, and so on. >>>>>>>>>>>>>>>>> I suppose a custom restart strategy, or I'd call it a >>>>> custom >>>>>>>>>>>>>>>>> failure "handler", is indeed necessary. It can also >>>>> enrich >>>>>>> the >>>>>>>>>>>>>>>>> information as the current proposed handler does. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> To avoid adding too many plugin interfaces which may >>>>> confuse >>>>>>>>> users >>>>>>>>>>>>>>>>> and make the ExecutionFailureHandler more complex, >>>>>>>>>>>>>>>>> I think it'd be better to consider the requirements at >>>>> the >>>>>>> same >>>>>>>>>>>>> time. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> IMO, we can add a handler interface, then make the >>>>> current >>>>>>>>> restart >>>>>>>>>>>>>>>>> strategy and the enricher both types of the handler. The >>>>>>>>> handlers >>>>>>>>>>>>>>>>> execute in sequence, and the failure is considered >>>>>>> unrecoverable >>>>>>>>>> if >>>>>>>>>>>>>>>>> any of the handlers decides. >>>>>>>>>>>>>>>>> In this way, users can also implement a handler using >>>>> the >>>>>>>>> enriched >>>>>>>>>>>>>>>>> information provided by the previous handlers, e.g. fail >>>>>>> the job >>>>>>>>>>>>> and >>>>>>>>>>>>>>>>> send a notification if too many failures are caused by >>>>> the >>>>>>> end >>>>>>>>>>>>> users. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>> Gen >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Tue, Mar 21, 2023 at 11:38 AM Weihua Hu < >>>>>>>>>> huweihua....@gmail.com >>>>>>>>>>>>>> >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hi Panagiotis, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks for your proposal. It is valuable to analyze the >>>>>>> reason >>>>>>>>>>>>> for >>>>>>>>>>>>>>>>>> failure with the user plug-in. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Making the context immutable could make the contract >>>>>>> stronger. >>>>>>>>>>>>>>>>>> Letting the listener return an enriching result may be >>>>> a >>>>>>> better >>>>>>>>>>>>>> way. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> IIUC, listeners could do two things, enrich more >>>>>>> information >>>>>>>>>>>>>>>>> (tags/labels) >>>>>>>>>>>>>>>>>> to FailureHandlingResult, and push data out of Flink >>>>>>> (metrics >>>>>>>>> or >>>>>>>>>>>>>>>>>> something). >>>>>>>>>>>>>>>>>> IMO, we could split these two types into Listener and >>>>>>> Advisor >>>>>>>>>>>>>> (maybe >>>>>>>>>>>>>>>>>> other names). The Listener just pushes the data out and >>>>>>> returns >>>>>>>>>>>>>>> nothing >>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>> Flink, so we can run these async and don't have to >>>>> wait for >>>>>>>>>>>>>>> Listener's >>>>>>>>>>>>>>>>>> result. >>>>>>>>>>>>>>>>>> The Advisor returns rich information to the >>>>>>>>> FailureHadingResult, >>>>>>>>>>>>>>> and it >>>>>>>>>>>>>>>>>> should >>>>>>>>>>>>>>>>>> have a lighter logic. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Supporting a custom restart strategy is also valuable. >>>>> In >>>>>>> this >>>>>>>>>>>>>>> design, we >>>>>>>>>>>>>>>>>> use >>>>>>>>>>>>>>>>>> RestartStrategy to construct a FailureHandingResult, >>>>> and >>>>>>> then >>>>>>>>>>>>> pass >>>>>>>>>>>>>>> it to >>>>>>>>>>>>>>>>>> Listener. >>>>>>>>>>>>>>>>>> My question is, should we change the restart strategy >>>>>>> interface >>>>>>>>>>>>> to >>>>>>>>>>>>>>>>> support >>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>> custom restart strategy, or keep the current restart >>>>>>> strategy >>>>>>>>> and >>>>>>>>>>>>>>> let the >>>>>>>>>>>>>>>>>> later >>>>>>>>>>>>>>>>>> Listener enrich the restartable information to >>>>>>>>>>>>>> FailureHandingResult? >>>>>>>>>>>>>>> The >>>>>>>>>>>>>>>>>> latter >>>>>>>>>>>>>>>>>> may cause some confusion when we use a custom restart >>>>>>> strategy. >>>>>>>>>>>>>>>>>> The default flink restart strategy also runs but does >>>>> not >>>>>>> take >>>>>>>>>>>>>>> effect. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>> Weihua >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Mon, Mar 20, 2023 at 11:42 PM Lijie Wang < >>>>>>>>>>>>>>> wangdachui9...@gmail.com> >>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Hi Panagiotis, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks for driving this. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> +1 for supporting custom restart strategy, we did >>>>> receive >>>>>>> such >>>>>>>>>>>>>>> requests >>>>>>>>>>>>>>>>>>> from the user mailing list [1][2]. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Besides, in current design, the plugin will only do >>>>> some >>>>>>>>>>>>>>> statistical >>>>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>>>>> classification work, and will not affect the >>>>>>>>>>>>>>> *FailureHandlingResult*. >>>>>>>>>>>>>>>>>> Just >>>>>>>>>>>>>>>>>>> listening, no handling, it doesn't quite match the >>>>> title. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>> >>>>>>> https://lists.apache.org/thread/ch3s4jhh09wnff3tscqnb6btp2zlp2r1 >>>>>>>>>>>>>>>>>>> [2] >>>>>>>>>>>>>>> >>>>>>> https://lists.apache.org/thread/lwjfdr7c1ypo77r4rwojdk7kxx2sw4sx >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>>> Lijie >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Zhu Zhu <reed...@gmail.com> 于2023年3月20日周一 21:39写道: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Hi Panagiotis, >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks for creating this proposal! It's good to >>>>> enable >>>>>>> Flink >>>>>>>>>>>>> to >>>>>>>>>>>>>>>>> handle >>>>>>>>>>>>>>>>>>>> different errors in different ways, through a >>>>> pluggable >>>>>>> way. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> There are requests for flexible restart strategies >>>>> from >>>>>>> time >>>>>>>>>>>>> to >>>>>>>>>>>>>>> time, >>>>>>>>>>>>>>>>>> for >>>>>>>>>>>>>>>>>>>> different strategies of restart backoff time, or to >>>>>>> suppress >>>>>>>>>>>>>>>>> restarting >>>>>>>>>>>>>>>>>>>> on certain errors. Therefore, I think it's better >>>>> that >>>>>>> the >>>>>>>>>>>>>>> proposed >>>>>>>>>>>>>>>>>>>> failure handling plugin can also support custom >>>>> restart >>>>>>>>>>>>>>> strategies. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Maybe we can call it FailureHandlingAdvisor which >>>>>>> provides >>>>>>>>>>>>> more >>>>>>>>>>>>>>>>>>>> information (labels) and gives advice (restart >>>>> backoff >>>>>>> time, >>>>>>>>>>>>>>> whether >>>>>>>>>>>>>>>>>>>> to restart)? I do not have a strong opinion though, >>>>> any >>>>>>>>>>>>>>> explanatory >>>>>>>>>>>>>>>>>>>> name would be good. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> To avoid unexpected mutation, how about to make the >>>>>>> context >>>>>>>>>>>>>>> immutable >>>>>>>>>>>>>>>>>>>> and let the plugin return an immutable result? i.e. >>>>>>> remove >>>>>>>>>>>>> the >>>>>>>>>>>>>>>>> setters >>>>>>>>>>>>>>>>>>>> from the context, and let the plugin method return a >>>>>>> result >>>>>>>>>>>>>> which >>>>>>>>>>>>>>>>>>>> contains `labels`, `canRestart` and >>>>> `restartBackoffTime`. >>>>>>>>>>>>> Flink >>>>>>>>>>>>>>>>> should >>>>>>>>>>>>>>>>>>>> apply the result to the context before invoking the >>>>> next >>>>>>>>>>>>>> plugin, >>>>>>>>>>>>>>> so >>>>>>>>>>>>>>>>>>>> that the next plugin will see the updated context. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> The plugin should avoid taking too much time to >>>>> return >>>>>>> the >>>>>>>>>>>>>>> result, >>>>>>>>>>>>>>>>>>> because >>>>>>>>>>>>>>>>>>>> it will block the RPC and result in instability. >>>>>>> However, it >>>>>>>>>>>>>> can >>>>>>>>>>>>>>>>> still >>>>>>>>>>>>>>>>>>>> perform heavy actions in a different thread. The >>>>> context >>>>>>> can >>>>>>>>>>>>>>> provide >>>>>>>>>>>>>>>>> an >>>>>>>>>>>>>>>>>>>> `ioExecutor` to the plugins for reuse. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>> Zhu >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Shammon FY <zjur...@gmail.com> 于2023年3月20日周一 >>>>> 20:21写道: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Hi Panagiotis >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Thank you for your answer. I agree that >>>>>>> `FailureListener` >>>>>>>>>>>>>>> could be >>>>>>>>>>>>>>>>>>>>> stateless, then I have some thoughts as follows >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> 1. I see that listeners and tag collections are >>>>>>> associated. >>>>>>>>>>>>>>> When >>>>>>>>>>>>>>>>>>>> JobManager >>>>>>>>>>>>>>>>>>>>> fails and restarts, how can the new listener be >>>>>>> associated >>>>>>>>>>>>>>> with the >>>>>>>>>>>>>>>>>> tag >>>>>>>>>>>>>>>>>>>>> collection before failover? Is the listener loading >>>>>>> order? >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> 2. The tag collection may be too large, resulting >>>>> in the >>>>>>>>>>>>>>> JobManager >>>>>>>>>>>>>>>>>>> OOM, >>>>>>>>>>>>>>>>>>>> do >>>>>>>>>>>>>>>>>>>>> we need to provide a management class that supports >>>>> some >>>>>>>>>>>>>>>>> obsolescence >>>>>>>>>>>>>>>>>>>>> strategies instead of a direct Collection? >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> 3. Is it possible to provide a more complex data >>>>>>> structure >>>>>>>>>>>>>>> than a >>>>>>>>>>>>>>>>>>> simple >>>>>>>>>>>>>>>>>>>>> string collection for tags in listeners, such as >>>>>>> key-value? >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>>>>> Shammon FY >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Mon, Mar 20, 2023 at 7:48 PM Leonard Xu < >>>>>>>>>>>>>> xbjt...@gmail.com> >>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Hi,Panagiotis >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Thank you for kicking off this discussion. >>>>> Overall, the >>>>>>>>>>>>>>> proposed >>>>>>>>>>>>>>>>>>>> feature of >>>>>>>>>>>>>>>>>>>>>> this FLIP makes sense to me. We have also discussed >>>>>>>>>>>>> similar >>>>>>>>>>>>>>>>>>>> requirements >>>>>>>>>>>>>>>>>>>>>> with our users and developers, and I believe it >>>>> will >>>>>>> help >>>>>>>>>>>>>>> many >>>>>>>>>>>>>>>>>> users. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> In terms of FLIP content, I have some thoughts: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> (1) For the FailureListenerContextget interface, >>>>> the >>>>>>>>>>>>>> methods >>>>>>>>>>>>>>>>>>>>>> FailureListenerContext#addTag and >>>>>>>>>>>>>>> FailureListenerContextgetTags >>>>>>>>>>>>>>>>>> looks >>>>>>>>>>>>>>>>>>>> very >>>>>>>>>>>>>>>>>>>>>> inconsistent because they imply specific >>>>> implementation >>>>>>>>>>>>>>> details, >>>>>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>>>>>> not >>>>>>>>>>>>>>>>>>>>>> all FailureListeners need to handle them, we >>>>> shouldn't >>>>>>>>>>>>> put >>>>>>>>>>>>>>> them >>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>>> interface. Minor: The comment "UDF loading" in the >>>>>>>>>>>>>>>>>>> getUserClassLoader() >>>>>>>>>>>>>>>>>>>>>> method looks like a typo, IIUC it should return the >>>>>>>>>>>>>>> classloader >>>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>>> current job. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> (2) Regarding the implementation in >>>>>>>>>>>>>>>>>>>> ExecutionFailureHandler#handleFailure, >>>>>>>>>>>>>>>>>>>>>> some custom listeners may have heavy IO operations, >>>>>>> such >>>>>>>>>>>>> as >>>>>>>>>>>>>>>>>> reporting >>>>>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>>>>>> their monitoring system. The current logic appears >>>>> to >>>>>>> be >>>>>>>>>>>>>>>>> processing >>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>>> JobMaster's main thread, and it is recommended not >>>>> to >>>>>>> do >>>>>>>>>>>>>> this >>>>>>>>>>>>>>>>> kind >>>>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>>>>>>>> processing in the main thread. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> (3) The results of FailureListener's processing >>>>> and the >>>>>>>>>>>>>>>>>>>>>> FailureHandlingResult returned by >>>>>>> ExecutionFailureHandler >>>>>>>>>>>>>>> are not >>>>>>>>>>>>>>>>>>>> related. >>>>>>>>>>>>>>>>>>>>>> I think these two are closely related, the >>>>> motivation >>>>>>> of >>>>>>>>>>>>>> this >>>>>>>>>>>>>>>>> FLIP >>>>>>>>>>>>>>>>>> is >>>>>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>>>>>> make current failure handling more flexible. From >>>>> this >>>>>>>>>>>>>>>>> perspective, >>>>>>>>>>>>>>>>>>>>>> different listeners should have the opportunity to >>>>>>> affect >>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>> job's >>>>>>>>>>>>>>>>>>>> failure >>>>>>>>>>>>>>>>>>>>>> handling flow. For example, a Flink job is >>>>> configured >>>>>>>>>>>>> with >>>>>>>>>>>>>> a >>>>>>>>>>>>>>>>>>>>>> RestartStrategy with huge numbers retry , but the >>>>> Kafka >>>>>>>>>>>>>>> topic of >>>>>>>>>>>>>>>>>>>> Source has >>>>>>>>>>>>>>>>>>>>>> been deleted, the job will failover continuously. >>>>> In >>>>>>> this >>>>>>>>>>>>>>> case, >>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>> user >>>>>>>>>>>>>>>>>>>>>> should have their listener to determine whether >>>>> this >>>>>>>>>>>>>> failure >>>>>>>>>>>>>>> is >>>>>>>>>>>>>>>>>>>> recoverable >>>>>>>>>>>>>>>>>>>>>> or unrecoverable, and then wrap the processing >>>>> result >>>>>>>>>>>>> into >>>>>>>>>>>>>>>>>>>>>> FailureHandlingResult.unrecoverable(xx) and pass >>>>> it to >>>>>>>>>>>>>>> JobMaster, >>>>>>>>>>>>>>>>>>> this >>>>>>>>>>>>>>>>>>>>>> approach will be more flexible. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> (4) All FLIPs have an important section named >>>>> Public >>>>>>>>>>>>>>> Interfaces. >>>>>>>>>>>>>>>>>>>> Current >>>>>>>>>>>>>>>>>>>>>> FLIP mixes the interface section and the >>>>> implementation >>>>>>>>>>>>>>> section >>>>>>>>>>>>>>>>>>>> together. >>>>>>>>>>>>>>>>>>>>>> It is better for us to refer to the FLIP >>>>> template[1] >>>>>>> and >>>>>>>>>>>>>>> separate >>>>>>>>>>>>>>>>>>> them, >>>>>>>>>>>>>>>>>>>>>> this will make the entire FLIP clearer. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> In addition, regarding the FLIP process, there is a >>>>>>> small >>>>>>>>>>>>>>>>>> suggestion: >>>>>>>>>>>>>>>>>>>> The >>>>>>>>>>>>>>>>>>>>>> community generally creates a JIRA issue after the >>>>> FLIP >>>>>>>>>>>>>> vote >>>>>>>>>>>>>>> is >>>>>>>>>>>>>>>>>>> passed, >>>>>>>>>>>>>>>>>>>>>> instead of during the FLIP preparation phase >>>>> because >>>>>>> the >>>>>>>>>>>>>>> FLIP may >>>>>>>>>>>>>>>>>> be >>>>>>>>>>>>>>>>>>>>>> rejected. Although this FLIP is very reasonable, >>>>> it's >>>>>>>>>>>>>> better >>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>>> follow >>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>>> process. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Leonard >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>>>>> >>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP+Template >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> On Mon, Mar 20, 2023 at 7:04 PM David Morávek < >>>>>>>>>>>>>>> d...@apache.org> >>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> however listeners can use previous state >>>>>>>>>>>>> (tags/labels) >>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>> make >>>>>>>>>>>>>>>>>>>>>> decisions >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> That sounds like a very fragile contract. We >>>>> should >>>>>>>>>>>>>> either >>>>>>>>>>>>>>>>> allow >>>>>>>>>>>>>>>>>>>> passing >>>>>>>>>>>>>>>>>>>>>>> tags between listeners and then need to define >>>>>>> ordering >>>>>>>>>>>>>> or >>>>>>>>>>>>>>> make >>>>>>>>>>>>>>>>>> all >>>>>>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>>>>>>>> them >>>>>>>>>>>>>>>>>>>>>>> independent. I prefer the latter because it >>>>> allows us >>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>>> parallelize >>>>>>>>>>>>>>>>>>>>>> things >>>>>>>>>>>>>>>>>>>>>>> if needed (if all listeners trigger an RCP to the >>>>>>>>>>>>>> external >>>>>>>>>>>>>>>>>> system, >>>>>>>>>>>>>>>>>>>> for >>>>>>>>>>>>>>>>>>>>>>> example). >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Can you expand on why we need more than one >>>>> classifier >>>>>>>>>>>>> to >>>>>>>>>>>>>>> be >>>>>>>>>>>>>>>>> able >>>>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>>>>>> output >>>>>>>>>>>>>>>>>>>>>>> the same tag? >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> system ones come first and then the ones loaded >>>>> from >>>>>>>>>>>>> the >>>>>>>>>>>>>>> plugin >>>>>>>>>>>>>>>>>>>> manager >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Since they're returned as a Set, the order is >>>>>>>>>>>>> completely >>>>>>>>>>>>>>>>>>>>>> non-deterministic, >>>>>>>>>>>>>>>>>>>>>>> no matter in which order they're loaded. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> just communicating with external >>>>> monitoring/alerting >>>>>>>>>>>>>>> systems >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> That makes the need for pushing things out of the >>>>> main >>>>>>>>>>>>>>> thread >>>>>>>>>>>>>>>>>> even >>>>>>>>>>>>>>>>>>>>>>> stronger. This almost sounds like we need to >>>>> return a >>>>>>>>>>>>>>>>>>>> CompletableFuture >>>>>>>>>>>>>>>>>>>>>> for >>>>>>>>>>>>>>>>>>>>>>> the per-throwable classification because an >>>>> external >>>>>>>>>>>>>> system >>>>>>>>>>>>>>>>> might >>>>>>>>>>>>>>>>>>>> take a >>>>>>>>>>>>>>>>>>>>>>> significant time to respond. We need to unblock >>>>> the >>>>>>>>>>>>> main >>>>>>>>>>>>>>> thread >>>>>>>>>>>>>>>>>> for >>>>>>>>>>>>>>>>>>>> other >>>>>>>>>>>>>>>>>>>>>>> RPCs. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Also, in the proposal, this happens in the failure >>>>>>>>>>>>>>> handler. If >>>>>>>>>>>>>>>>>>>> that's the >>>>>>>>>>>>>>>>>>>>>>> case, this might block the job from being >>>>> restarted >>>>>>> (if >>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>> restart >>>>>>>>>>>>>>>>>>>>>>> strategy allows for another restart), which would >>>>> be >>>>>>>>>>>>>> great >>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>> avoid >>>>>>>>>>>>>>>>>>>>>> because >>>>>>>>>>>>>>>>>>>>>>> it can introduce extra downtime. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> This raises another question: what should happen >>>>> if >>>>>>> the >>>>>>>>>>>>>>>>>>>> classification >>>>>>>>>>>>>>>>>>>>>>> fails? Crashing the job (which is what's currently >>>>>>>>>>>>>>> proposed) >>>>>>>>>>>>>>>>>> seems >>>>>>>>>>>>>>>>>>>> very >>>>>>>>>>>>>>>>>>>>>>> dangerous if this might depend on an external >>>>> system. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Thats a valid point, passing the JobGraph >>>>> containing >>>>>>>>>>>>> all >>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>> above >>>>>>>>>>>>>>>>>>>>>>>> information is also something to consider >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> We should avoid passing JG around because it's >>>>> mutable >>>>>>>>>>>>>>> (which >>>>>>>>>>>>>>>>> we >>>>>>>>>>>>>>>>>>>> must fix >>>>>>>>>>>>>>>>>>>>>>> in the long term), and letting users change it >>>>> might >>>>>>>>>>>>> have >>>>>>>>>>>>>>>>>>>> consequences. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>>>>>>> D. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> On Mon, Mar 20, 2023 at 7:23 AM Panagiotis >>>>> Garefalakis >>>>>>>>>>>>> < >>>>>>>>>>>>>>>>>>>>>> pga...@apache.org> >>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Hey David, Shammon, >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Thanks for the valuable comments! >>>>>>>>>>>>>>>>>>>>>>>> I am glad you find this proposal useful, some >>>>>>>>>>>>> thoughts: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> @Shammon >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> 1. How about adding more job information in >>>>>>>>>>>>>>>>>>>> FailureListenerContext? For >>>>>>>>>>>>>>>>>>>>>>>>> example, job vertext, subtask, taskmanager >>>>>>>>>>>>> location. >>>>>>>>>>>>>>> And >>>>>>>>>>>>>>>>> then >>>>>>>>>>>>>>>>>>>> user >>>>>>>>>>>>>>>>>>>>>> can >>>>>>>>>>>>>>>>>>>>>>> do >>>>>>>>>>>>>>>>>>>>>>>>> more statistics according to different >>>>> dimensions. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Thats a valid point, passing the JobGraph >>>>> containing >>>>>>>>>>>>>> all >>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>> above >>>>>>>>>>>>>>>>>>>>>>>> information >>>>>>>>>>>>>>>>>>>>>>>> is also something to consider, I was mostly >>>>> trying to >>>>>>>>>>>>>> be >>>>>>>>>>>>>>>>>>>> conservative: >>>>>>>>>>>>>>>>>>>>>>>> i.e., passingly only the information we need, and >>>>>>>>>>>>>> extend >>>>>>>>>>>>>>> as >>>>>>>>>>>>>>>>> we >>>>>>>>>>>>>>>>>>> see >>>>>>>>>>>>>>>>>>>> fit >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> 2. Users may want to save results in listener, >>>>> and >>>>>>>>>>>>> then >>>>>>>>>>>>>>> they >>>>>>>>>>>>>>>>>> can >>>>>>>>>>>>>>>>>>>> get >>>>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>>>>>> historical results even jabmanager failover. >>>>> Can we >>>>>>>>>>>>>>>>> provide a >>>>>>>>>>>>>>>>>>>> unified >>>>>>>>>>>>>>>>>>>>>>>>> implementation for data storage requirements? >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> The idea is to store only the output of the >>>>> Listeners >>>>>>>>>>>>>>> (tags) >>>>>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>>>>>> treat >>>>>>>>>>>>>>>>>>>>>>> them >>>>>>>>>>>>>>>>>>>>>>>> as stateless. >>>>>>>>>>>>>>>>>>>>>>>> Tags are be stored along with HistoryEntries, and >>>>>>>>>>>>> will >>>>>>>>>>>>>> be >>>>>>>>>>>>>>>>>>> available >>>>>>>>>>>>>>>>>>>>>>> through >>>>>>>>>>>>>>>>>>>>>>>> the HistoryServer >>>>>>>>>>>>>>>>>>>>>>>> even after a JM dies. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> @David >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> 1) Should we also consider adding labels? The >>>>>>>>>>>>>>> combination of >>>>>>>>>>>>>>>>>> tags >>>>>>>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>>>>>>>>>>> labels seems to be what most systems offer; >>>>>>>>>>>>>> sometimes, >>>>>>>>>>>>>>> they >>>>>>>>>>>>>>>>>>> offer >>>>>>>>>>>>>>>>>>>>>>> labels >>>>>>>>>>>>>>>>>>>>>>>>> only (key=value pairs) because tags can be >>>>>>>>>>>>>> implemented >>>>>>>>>>>>>>>>> using >>>>>>>>>>>>>>>>>>>> those, >>>>>>>>>>>>>>>>>>>>>> but >>>>>>>>>>>>>>>>>>>>>>>> not >>>>>>>>>>>>>>>>>>>>>>>>> the other way around. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Indeed changing tags to k:v labels could be more >>>>>>>>>>>>>>> expressive, >>>>>>>>>>>>>>>>> I >>>>>>>>>>>>>>>>>>>> like it! >>>>>>>>>>>>>>>>>>>>>>>> Let's see what others think. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> 2) Since we can not predict how heavy >>>>> user-defined >>>>>>>>>>>>>> models >>>>>>>>>>>>>>>>>>>> ("listeners") >>>>>>>>>>>>>>>>>>>>>>> are >>>>>>>>>>>>>>>>>>>>>>>>> going to be, it would be great to keep the >>>>>>>>>>>>>>> interfaces/data >>>>>>>>>>>>>>>>>>>> structures >>>>>>>>>>>>>>>>>>>>>>>>> immutable so we can push things over to the I/O >>>>>>>>>>>>>>> threads. >>>>>>>>>>>>>>>>>> Also, >>>>>>>>>>>>>>>>>>> it >>>>>>>>>>>>>>>>>>>>>>> sounds >>>>>>>>>>>>>>>>>>>>>>>>> off to call the main interface a Listener since >>>>>>>>>>>>> it's >>>>>>>>>>>>>>>>> supposed >>>>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>>>>>>> enhance >>>>>>>>>>>>>>>>>>>>>>>>> the original throwable with additional metadata. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> The idea was for the name to be generic as there >>>>>>>>>>>>> could >>>>>>>>>>>>>> be >>>>>>>>>>>>>>>>>>> Listener >>>>>>>>>>>>>>>>>>>>>>>> implementations >>>>>>>>>>>>>>>>>>>>>>>> just communicating with external >>>>> monitoring/alerting >>>>>>>>>>>>>>> systems >>>>>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>>>>> no >>>>>>>>>>>>>>>>>>>>>>>> metadata output >>>>>>>>>>>>>>>>>>>>>>>> -- but lets rethink that. For immutability, see >>>>>>>>>>>>> below: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> 3) You're proposing to support a set of >>>>> listeners. >>>>>>>>>>>>>> Since >>>>>>>>>>>>>>>>> you're >>>>>>>>>>>>>>>>>>>> passing >>>>>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>>>>>> mutable context around, which includes tags set >>>>> by >>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>> previous >>>>>>>>>>>>>>>>>>>>>>> listener, >>>>>>>>>>>>>>>>>>>>>>>>> do you expect users to make any assumptions >>>>> about >>>>>>>>>>>>> the >>>>>>>>>>>>>>> order >>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>> which >>>>>>>>>>>>>>>>>>>>>>>>> listeners are executed? >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> In the existing proposal we are not making any >>>>>>>>>>>>>>> assumptions >>>>>>>>>>>>>>>>>> about >>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>>>> order >>>>>>>>>>>>>>>>>>>>>>>> of listeners, >>>>>>>>>>>>>>>>>>>>>>>> (system ones come first and then the ones loaded >>>>> from >>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>> plugin >>>>>>>>>>>>>>>>>>>>>> manager) >>>>>>>>>>>>>>>>>>>>>>>> however listeners can use previous state >>>>>>>>>>>>> (tags/labels) >>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>> make >>>>>>>>>>>>>>>>>>>>>> decisions: >>>>>>>>>>>>>>>>>>>>>>>> e.g., wont assign *UNKNOWN* failureType when we >>>>> have >>>>>>>>>>>>>>> already >>>>>>>>>>>>>>>>>> seen >>>>>>>>>>>>>>>>>>>> *USER >>>>>>>>>>>>>>>>>>>>>>> *or >>>>>>>>>>>>>>>>>>>>>>>> the other way around -- when we have seen >>>>> *UNKNOWN* >>>>>>>>>>>>>>> remove in >>>>>>>>>>>>>>>>>>>> favor of >>>>>>>>>>>>>>>>>>>>>>>> *USER* >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>>>>>>>>>>>> Panagiotis >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> On Sun, Mar 19, 2023 at 10:42 AM David Morávek < >>>>>>>>>>>>>>>>>> d...@apache.org> >>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Hi Panagiotis, >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> This is an excellent proposal and something >>>>>>>>>>>>> everyone >>>>>>>>>>>>>>> trying >>>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>>>>>> provide >>>>>>>>>>>>>>>>>>>>>>>>> "Flink as a service" needs to solve at some >>>>> point. >>>>>>>>>>>>> I >>>>>>>>>>>>>>> have a >>>>>>>>>>>>>>>>>>>> couple of >>>>>>>>>>>>>>>>>>>>>>>>> questions: >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> If I understand the proposal correctly, this is >>>>>>>>>>>>> just >>>>>>>>>>>>>>> about >>>>>>>>>>>>>>>>>>> adding >>>>>>>>>>>>>>>>>>>>>> tags >>>>>>>>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>>>>>>>>> the Throwable by running a tuple of (Throwable, >>>>>>>>>>>>>>>>>> FailureContext) >>>>>>>>>>>>>>>>>>>>>>> through a >>>>>>>>>>>>>>>>>>>>>>>>> user-defined model. >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> 1) Should we also consider adding labels? The >>>>>>>>>>>>>>> combination >>>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>>>>>> tags and >>>>>>>>>>>>>>>>>>>>>>>>> labels seems to be what most systems offer; >>>>>>>>>>>>>> sometimes, >>>>>>>>>>>>>>> they >>>>>>>>>>>>>>>>>>> offer >>>>>>>>>>>>>>>>>>>>>>> labels >>>>>>>>>>>>>>>>>>>>>>>>> only (key=value pairs) because tags can be >>>>>>>>>>>>>> implemented >>>>>>>>>>>>>>>>> using >>>>>>>>>>>>>>>>>>>> those, >>>>>>>>>>>>>>>>>>>>>> but >>>>>>>>>>>>>>>>>>>>>>>> not >>>>>>>>>>>>>>>>>>>>>>>>> the other way around. >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> 2) Since we can not predict how heavy >>>>> user-defined >>>>>>>>>>>>>>> models >>>>>>>>>>>>>>>>>>>>>> ("listeners") >>>>>>>>>>>>>>>>>>>>>>>> are >>>>>>>>>>>>>>>>>>>>>>>>> going to be, it would be great to keep the >>>>>>>>>>>>>>> interfaces/data >>>>>>>>>>>>>>>>>>>> structures >>>>>>>>>>>>>>>>>>>>>>>>> immutable so we can push things over to the I/O >>>>>>>>>>>>>>> threads. >>>>>>>>>>>>>>>>>> Also, >>>>>>>>>>>>>>>>>>> it >>>>>>>>>>>>>>>>>>>>>>> sounds >>>>>>>>>>>>>>>>>>>>>>>>> off to call the main interface a Listener since >>>>>>>>>>>>> it's >>>>>>>>>>>>>>>>> supposed >>>>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>>>>>>> enhance >>>>>>>>>>>>>>>>>>>>>>>>> the original throwable with additional metadata. >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> I'd propose something along the lines of (we >>>>> should >>>>>>>>>>>>>>> have >>>>>>>>>>>>>>>>>> better >>>>>>>>>>>>>>>>>>>>>> names, >>>>>>>>>>>>>>>>>>>>>>>> this >>>>>>>>>>>>>>>>>>>>>>>>> is just to outline the idea): >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> interface FailureEnricher { >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> ThrowableWithTagsAndLabels >>>>>>>>>>>>> enrichFailure(Throwable >>>>>>>>>>>>>>> cause, >>>>>>>>>>>>>>>>>>>>>>>>> ImmutableContextualMetadataAboutTheThrowable >>>>>>>>>>>>>> context); >>>>>>>>>>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> The names should change; this is just to outline >>>>>>>>>>>>> the >>>>>>>>>>>>>>> idea. >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> 3) You're proposing to support a set of >>>>> listeners. >>>>>>>>>>>>>>> Since >>>>>>>>>>>>>>>>>> you're >>>>>>>>>>>>>>>>>>>>>> passing >>>>>>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>>>>>> mutable context around, which includes tags set >>>>> by >>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>> previous >>>>>>>>>>>>>>>>>>>>>>> listener, >>>>>>>>>>>>>>>>>>>>>>>>> do you expect users to make any assumptions >>>>> about >>>>>>>>>>>>> the >>>>>>>>>>>>>>> order >>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>> which >>>>>>>>>>>>>>>>>>>>>>>>> listeners are executed? >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> *@Shammon* >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Users may want to save results in listener, and >>>>>>>>>>>>> then >>>>>>>>>>>>>>> they >>>>>>>>>>>>>>>>> can >>>>>>>>>>>>>>>>>>>> get the >>>>>>>>>>>>>>>>>>>>>>>>>> historical results even jabmanager failover. >>>>> Can >>>>>>>>>>>>> we >>>>>>>>>>>>>>>>>> provide a >>>>>>>>>>>>>>>>>>>>>> unified >>>>>>>>>>>>>>>>>>>>>>>>>> implementation for data storage requirements? >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> I think we should explicitly state that all >>>>>>>>>>>>>>> "listeners" are >>>>>>>>>>>>>>>>>>>> treated >>>>>>>>>>>>>>>>>>>>>> as >>>>>>>>>>>>>>>>>>>>>>>>> stateless. I don't see any strong reason for >>>>>>>>>>>>>>> snapshotting >>>>>>>>>>>>>>>>>> them. >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>>>>>>>>> D. >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> On Sat, Mar 18, 2023 at 1:00 AM Shammon FY < >>>>>>>>>>>>>>>>>> zjur...@gmail.com> >>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Panagiotis >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Thank you for starting this discussion. I think >>>>>>>>>>>>>> this >>>>>>>>>>>>>>> FLIP >>>>>>>>>>>>>>>>>> is >>>>>>>>>>>>>>>>>>>>>> valuable >>>>>>>>>>>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>>>>>>>>>>>> can help user to analyze the causes of job >>>>>>>>>>>>> failover >>>>>>>>>>>>>>>>> better! >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> I have two comments as follows >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> 1. How about adding more job information in >>>>>>>>>>>>>>>>>>>> FailureListenerContext? >>>>>>>>>>>>>>>>>>>>>>> For >>>>>>>>>>>>>>>>>>>>>>>>>> example, job vertext, subtask, taskmanager >>>>>>>>>>>>>> location. >>>>>>>>>>>>>>> And >>>>>>>>>>>>>>>>>> then >>>>>>>>>>>>>>>>>>>> user >>>>>>>>>>>>>>>>>>>>>>> can >>>>>>>>>>>>>>>>>>>>>>>> do >>>>>>>>>>>>>>>>>>>>>>>>>> more statistics according to different >>>>>>>>>>>>> dimensions. >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> 2. Users may want to save results in listener, >>>>>>>>>>>>> and >>>>>>>>>>>>>>> then >>>>>>>>>>>>>>>>>> they >>>>>>>>>>>>>>>>>>>> can >>>>>>>>>>>>>>>>>>>>>> get >>>>>>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>>>>>>> historical results even jabmanager failover. >>>>> Can >>>>>>>>>>>>> we >>>>>>>>>>>>>>>>>> provide a >>>>>>>>>>>>>>>>>>>>>> unified >>>>>>>>>>>>>>>>>>>>>>>>>> implementation for data storage requirements? >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>>>>>>>>>> shammon FY >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> On Saturday, March 18, 2023, Panagiotis >>>>>>>>>>>>>> Garefalakis < >>>>>>>>>>>>>>>>>>>>>>> pga...@apache.org >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi everyone, >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> This FLIP [1] proposes a pluggable interface >>>>>>>>>>>>> for >>>>>>>>>>>>>>>>> failure >>>>>>>>>>>>>>>>>>>> handling >>>>>>>>>>>>>>>>>>>>>>>>>> allowing >>>>>>>>>>>>>>>>>>>>>>>>>>> users to implement custom failure logic using >>>>>>>>>>>>> the >>>>>>>>>>>>>>>>> plugin >>>>>>>>>>>>>>>>>>>>>> framework. >>>>>>>>>>>>>>>>>>>>>>>>>>> Motivated by existing proposals [2] and >>>>> tickets >>>>>>>>>>>>>>> [3], >>>>>>>>>>>>>>>>> this >>>>>>>>>>>>>>>>>>>> enables >>>>>>>>>>>>>>>>>>>>>>>>>> use-cases >>>>>>>>>>>>>>>>>>>>>>>>>>> like: assigning particular types to failures >>>>>>>>>>>>>> (e.g., >>>>>>>>>>>>>>>>> User >>>>>>>>>>>>>>>>>> or >>>>>>>>>>>>>>>>>>>>>>> System), >>>>>>>>>>>>>>>>>>>>>>>>>>> emitting custom metrics per type (e.g., >>>>>>>>>>>>>>> application or >>>>>>>>>>>>>>>>>>>> platform), >>>>>>>>>>>>>>>>>>>>>>>> even >>>>>>>>>>>>>>>>>>>>>>>>>>> exposing errors to downstream consumers (e.g., >>>>>>>>>>>>>>>>>> notification >>>>>>>>>>>>>>>>>>>>>>> systems). >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks to Piotr and Anton for the initial >>>>>>>>>>>>> reviews >>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>>>>>>>> discussions! >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> For anyone interested, the starting point >>>>> would >>>>>>>>>>>>>> be >>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>> FLIP >>>>>>>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>>>>>>>>>> that >>>>>>>>>>>>>>>>>>>>>>>> I >>>>>>>>>>>>>>>>>>>>>>>>>>> created, >>>>>>>>>>>>>>>>>>>>>>>>>>> describing the motivation and the proposed >>>>>>>>>>>>>> changes >>>>>>>>>>>>>>>>> (part >>>>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>>>> core, >>>>>>>>>>>>>>>>>>>>>>>>>>> runtime and web). >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> The intuition behind this FLIP is being able >>>>> to >>>>>>>>>>>>>>> execute >>>>>>>>>>>>>>>>>>>> custom >>>>>>>>>>>>>>>>>>>>>>> logic >>>>>>>>>>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>>>>>>>>>>>>> failures by exposing a FailureListener >>>>>>>>>>>>> interface. >>>>>>>>>>>>>>>>>>>> Implementation >>>>>>>>>>>>>>>>>>>>>> by >>>>>>>>>>>>>>>>>>>>>>>>> users >>>>>>>>>>>>>>>>>>>>>>>>>>> can be simply loaded to the system as Jar >>>>>>>>>>>>> files. >>>>>>>>>>>>>>>>>>>> FailureListeners >>>>>>>>>>>>>>>>>>>>>>> may >>>>>>>>>>>>>>>>>>>>>>>>>> also >>>>>>>>>>>>>>>>>>>>>>>>>>> decide to assign failure tags to errors >>>>>>>>>>>>>> (expressed >>>>>>>>>>>>>>> as >>>>>>>>>>>>>>>>>>>> strings), >>>>>>>>>>>>>>>>>>>>>>>>>>> that will then be exposed as metadata by the >>>>>>>>>>>>>>> UI/Rest >>>>>>>>>>>>>>>>>>>> interfaces. >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Feedback is always appreciated! Looking >>>>> forward >>>>>>>>>>>>>> to >>>>>>>>>>>>>>> your >>>>>>>>>>>>>>>>>>>> thoughts! >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-304% >>>>>>>>>>>>>>>>>>>>>>>>>>> 3A+Pluggable+failure+handling+for+Apache+Flink >>>>>>>>>>>>>>>>>>>>>>>>>>> [2] >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>> https://docs.google.com/document/d/1pcHg9F3GoDDeVD5GIIo2wO67 >>>>>>>>>>>>>>>>>>>>>>>>>>> Hmjgy0-hRDeuFnrMgT4 >>>>>>>>>>>>>>>>>>>>>>>>>>> [3] >>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-20833 >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>>>>>>>>>>>>>>> Panagiotis >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>> >>>>>>> >>>>> >> >>