Thanks for the proposal, Panagiotis. A lot of good points have been already shared. I just want to add my view on some of the items:
- independent execution vs ordered execution: I prefer the listeners being processed independently from each other because it adds less complexity code-wise. The use case Piotr described (where you want to reuse some other classifier) is the only one I can think of where we actually need classifiers depending on each other. Supporting such a use case right from the start feels a bit over-engineered and could be covered in a follow-up FLIP if we really come to that point where such a feature is requested by users. - key/value pairs instead of plain labels: I think that's a good idea. key/value pairs are more expressive. +1 - extending the FLIP to cover restart strategy: I understand Gen's concern about introducing too many different types of plugins. But I would still favor not extending the FLIP in this regard. A pluggable restart strategy sounds reasonable. But an error classifier and a restart strategy are still different enough to justify separate plugins, IMHO. And therefore, I would think that covering the restart strategy in a separate FLIP is the better option for the sake of simplicity. - immutable context: Passing in an immutable context and returning data through the interface method's return value sounds like a better approach to harden the contract of the interface. +1 for that proposal - async operation: I think David is right. An async interface makes the listener implementations more robust when it comes to heavy IO operations. The ioExecutor can be passed through the context object. +1 Matthias On Tue, Mar 21, 2023 at 2:09 PM David Morávek <david.mora...@gmail.com> wrote: > *@Piotr* > > > > I was thinking about actually defining the order of the > > classifiers/handlers and not allowing them to be asynchronous. > > Asynchronousity would create some problems: when to actually return the > > error to the user? After all async responses will get back? Before, but > > without classified exception? It would also add implementation complexity > > and I think we can always expand the API with async version in the future > > if needed. > > > As long as the classifiers need to talk to an external system, we by > definition need to allow them to be asynchronous to unblock the main thread > for handling other RPCs. Exposing ioExecutor via the context proposed above > would be great. > > After all async responses will get back > > > This would be the same if we trigger them synchronously one by one, with a > caveat that synchronous execution might take significantly longer and > introduce unnecessary downtime to a job. > > D. > > On Tue, Mar 21, 2023 at 1:12 PM Zhu Zhu <reed...@gmail.com> wrote: > > > Hi Piotr, > > > > It's fine to me to have a separate FLIP to extend this `FailureListener` > > to support custom restart strategy. > > > > What I was a bit concerned is that if we just treat the `FailureListener` > > as an error classifier which is not crucial to Flink framework process, > > we may design it to run asynchronously and not trigger Flink failures. > > This may be a blocker if later we want to enable it to support custom > > restart strategy. > > > > Thanks, > > Zhu > > > > Dian Fu <dian0511...@gmail.com> 于2023年3月21日周二 19:53写道: > > > > > > Hi Panagiotis, > > > > > > Thanks for the proposal. This is a very valuable feature and will be a > > good > > > add-on for Flink. > > > > > > I also think that it will be great if we can consider how to make it > > > possible for users to customize the failure handling in this FLIP. It's > > > highly related to the problem we want to address in this FLIP and could > > > avoid refactoring the interfaces proposed in this FLIP too quickly. > > > > > > Currently it treats all kinds of exceptions the same. However, some > kinds > > > of exceptions are actually not recoverable at all. It could let users > to > > > customize the failure handling logic to fail fast for certain known > > > unrecoverable exceptions and finally make these kinds of jobs get > noticed > > > and recoveried more quickly. > > > > > > Regards, > > > Dian > > > > > > > > > > > > On Tue, Mar 21, 2023 at 4:36 PM Gen Luo <luogen...@gmail.com> wrote: > > > > > > > Hi Panagiotis, > > > > > > > > Thanks for the proposal. > > > > > > > > It's useful to enrich the information so that users can be more > > > > clear why the job is failing, especially platform developers who > > > > need to provide the information to their end users. > > > > And for the very FLIP, I'd prefer the naming `FailureEnricher` > > > > proposed by David, as the plugin doesn't really handle the failure. > > > > > > > > However, like Zhu and Lijie said, I also joined a discussion > > > > recently about customized failure handling, e.g. counting the > > > > failure rate of pipeline regions separately, and failing the job > > > > when a specific error occurs, and so on. > > > > I suppose a custom restart strategy, or I'd call it a custom > > > > failure "handler", is indeed necessary. It can also enrich the > > > > information as the current proposed handler does. > > > > > > > > To avoid adding too many plugin interfaces which may confuse users > > > > and make the ExecutionFailureHandler more complex, > > > > I think it'd be better to consider the requirements at the same time. > > > > > > > > IMO, we can add a handler interface, then make the current restart > > > > strategy and the enricher both types of the handler. The handlers > > > > execute in sequence, and the failure is considered unrecoverable if > > > > any of the handlers decides. > > > > In this way, users can also implement a handler using the enriched > > > > information provided by the previous handlers, e.g. fail the job and > > > > send a notification if too many failures are caused by the end users. > > > > > > > > Best, > > > > Gen > > > > > > > > > > > > On Tue, Mar 21, 2023 at 11:38 AM Weihua Hu <huweihua....@gmail.com> > > wrote: > > > > > > > > > Hi Panagiotis, > > > > > > > > > > Thanks for your proposal. It is valuable to analyze the reason for > > > > > failure with the user plug-in. > > > > > > > > > > Making the context immutable could make the contract stronger. > > > > > Letting the listener return an enriching result may be a better > way. > > > > > > > > > > IIUC, listeners could do two things, enrich more information > > > > (tags/labels) > > > > > to FailureHandlingResult, and push data out of Flink (metrics or > > > > > something). > > > > > IMO, we could split these two types into Listener and Advisor > (maybe > > > > > other names). The Listener just pushes the data out and returns > > nothing > > > > to > > > > > Flink, so we can run these async and don't have to wait for > > Listener's > > > > > result. > > > > > The Advisor returns rich information to the FailureHadingResult, > > and it > > > > > should > > > > > have a lighter logic. > > > > > > > > > > > > > > > Supporting a custom restart strategy is also valuable. In this > > design, we > > > > > use > > > > > RestartStrategy to construct a FailureHandingResult, and then pass > > it to > > > > > Listener. > > > > > My question is, should we change the restart strategy interface to > > > > support > > > > > the > > > > > custom restart strategy, or keep the current restart strategy and > > let the > > > > > later > > > > > Listener enrich the restartable information to > FailureHandingResult? > > The > > > > > latter > > > > > may cause some confusion when we use a custom restart strategy. > > > > > The default flink restart strategy also runs but does not take > > effect. > > > > > > > > > > > > > > > Best, > > > > > Weihua > > > > > > > > > > > > > > > On Mon, Mar 20, 2023 at 11:42 PM Lijie Wang < > > wangdachui9...@gmail.com> > > > > > wrote: > > > > > > > > > > > Hi Panagiotis, > > > > > > > > > > > > Thanks for driving this. > > > > > > > > > > > > +1 for supporting custom restart strategy, we did receive such > > requests > > > > > > from the user mailing list [1][2]. > > > > > > > > > > > > Besides, in current design, the plugin will only do some > > statistical > > > > and > > > > > > classification work, and will not affect the > > *FailureHandlingResult*. > > > > > Just > > > > > > listening, no handling, it doesn't quite match the title. > > > > > > > > > > > > [1] > > https://lists.apache.org/thread/ch3s4jhh09wnff3tscqnb6btp2zlp2r1 > > > > > > [2] > > https://lists.apache.org/thread/lwjfdr7c1ypo77r4rwojdk7kxx2sw4sx > > > > > > > > > > > > Best, > > > > > > Lijie > > > > > > > > > > > > Zhu Zhu <reed...@gmail.com> 于2023年3月20日周一 21:39写道: > > > > > > > > > > > > > Hi Panagiotis, > > > > > > > > > > > > > > Thanks for creating this proposal! It's good to enable Flink to > > > > handle > > > > > > > different errors in different ways, through a pluggable way. > > > > > > > > > > > > > > There are requests for flexible restart strategies from time to > > time, > > > > > for > > > > > > > different strategies of restart backoff time, or to suppress > > > > restarting > > > > > > > on certain errors. Therefore, I think it's better that the > > proposed > > > > > > > failure handling plugin can also support custom restart > > strategies. > > > > > > > > > > > > > > Maybe we can call it FailureHandlingAdvisor which provides more > > > > > > > information (labels) and gives advice (restart backoff time, > > whether > > > > > > > to restart)? I do not have a strong opinion though, any > > explanatory > > > > > > > name would be good. > > > > > > > > > > > > > > To avoid unexpected mutation, how about to make the context > > immutable > > > > > > > and let the plugin return an immutable result? i.e. remove the > > > > setters > > > > > > > from the context, and let the plugin method return a result > which > > > > > > > contains `labels`, `canRestart` and `restartBackoffTime`. Flink > > > > should > > > > > > > apply the result to the context before invoking the next > plugin, > > so > > > > > > > that the next plugin will see the updated context. > > > > > > > > > > > > > > The plugin should avoid taking too much time to return the > > result, > > > > > > because > > > > > > > it will block the RPC and result in instability. However, it > can > > > > still > > > > > > > perform heavy actions in a different thread. The context can > > provide > > > > an > > > > > > > `ioExecutor` to the plugins for reuse. > > > > > > > > > > > > > > Thanks, > > > > > > > Zhu > > > > > > > > > > > > > > Shammon FY <zjur...@gmail.com> 于2023年3月20日周一 20:21写道: > > > > > > > > > > > > > > > > Hi Panagiotis > > > > > > > > > > > > > > > > Thank you for your answer. I agree that `FailureListener` > > could be > > > > > > > > stateless, then I have some thoughts as follows > > > > > > > > > > > > > > > > 1. I see that listeners and tag collections are associated. > > When > > > > > > > JobManager > > > > > > > > fails and restarts, how can the new listener be associated > > with the > > > > > tag > > > > > > > > collection before failover? Is the listener loading order? > > > > > > > > > > > > > > > > 2. The tag collection may be too large, resulting in the > > JobManager > > > > > > OOM, > > > > > > > do > > > > > > > > we need to provide a management class that supports some > > > > obsolescence > > > > > > > > strategies instead of a direct Collection? > > > > > > > > > > > > > > > > 3. Is it possible to provide a more complex data structure > > than a > > > > > > simple > > > > > > > > string collection for tags in listeners, such as key-value? > > > > > > > > > > > > > > > > Best, > > > > > > > > Shammon FY > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Mar 20, 2023 at 7:48 PM Leonard Xu < > xbjt...@gmail.com> > > > > > wrote: > > > > > > > > > > > > > > > > > Hi,Panagiotis > > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you for kicking off this discussion. Overall, the > > proposed > > > > > > > feature of > > > > > > > > > this FLIP makes sense to me. We have also discussed similar > > > > > > > requirements > > > > > > > > > with our users and developers, and I believe it will help > > many > > > > > users. > > > > > > > > > > > > > > > > > > > > > > > > > > > In terms of FLIP content, I have some thoughts: > > > > > > > > > > > > > > > > > > (1) For the FailureListenerContextget interface, the > methods > > > > > > > > > FailureListenerContext#addTag and > > FailureListenerContextgetTags > > > > > looks > > > > > > > very > > > > > > > > > inconsistent because they imply specific implementation > > details, > > > > > and > > > > > > > not > > > > > > > > > all FailureListeners need to handle them, we shouldn't put > > them > > > > in > > > > > > the > > > > > > > > > interface. Minor: The comment "UDF loading" in the > > > > > > getUserClassLoader() > > > > > > > > > method looks like a typo, IIUC it should return the > > classloader > > > > of > > > > > > the > > > > > > > > > current job. > > > > > > > > > > > > > > > > > > (2) Regarding the implementation in > > > > > > > ExecutionFailureHandler#handleFailure, > > > > > > > > > some custom listeners may have heavy IO operations, such as > > > > > reporting > > > > > > > to > > > > > > > > > their monitoring system. The current logic appears to be > > > > processing > > > > > > in > > > > > > > the > > > > > > > > > JobMaster's main thread, and it is recommended not to do > this > > > > kind > > > > > of > > > > > > > > > processing in the main thread. > > > > > > > > > > > > > > > > > > (3) The results of FailureListener's processing and the > > > > > > > > > FailureHandlingResult returned by ExecutionFailureHandler > > are not > > > > > > > related. > > > > > > > > > I think these two are closely related, the motivation of > this > > > > FLIP > > > > > is > > > > > > > to > > > > > > > > > make current failure handling more flexible. From this > > > > perspective, > > > > > > > > > different listeners should have the opportunity to affect > the > > > > job's > > > > > > > failure > > > > > > > > > handling flow. For example, a Flink job is configured with > a > > > > > > > > > RestartStrategy with huge numbers retry , but the Kafka > > topic of > > > > > > > Source has > > > > > > > > > been deleted, the job will failover continuously. In this > > case, > > > > the > > > > > > > user > > > > > > > > > should have their listener to determine whether this > failure > > is > > > > > > > recoverable > > > > > > > > > or unrecoverable, and then wrap the processing result into > > > > > > > > > FailureHandlingResult.unrecoverable(xx) and pass it to > > JobMaster, > > > > > > this > > > > > > > > > approach will be more flexible. > > > > > > > > > > > > > > > > > > (4) All FLIPs have an important section named Public > > Interfaces. > > > > > > > Current > > > > > > > > > FLIP mixes the interface section and the implementation > > section > > > > > > > together. > > > > > > > > > It is better for us to refer to the FLIP template[1] and > > separate > > > > > > them, > > > > > > > > > this will make the entire FLIP clearer. > > > > > > > > > > > > > > > > > > In addition, regarding the FLIP process, there is a small > > > > > suggestion: > > > > > > > The > > > > > > > > > community generally creates a JIRA issue after the FLIP > vote > > is > > > > > > passed, > > > > > > > > > instead of during the FLIP preparation phase because the > > FLIP may > > > > > be > > > > > > > > > rejected. Although this FLIP is very reasonable, it's > better > > to > > > > > > follow > > > > > > > the > > > > > > > > > process. > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > > > > > > > > > Leonard > > > > > > > > > > > > > > > > > > [1] > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP+Template > > > > > > > > > > > > > > > > > > On Mon, Mar 20, 2023 at 7:04 PM David Morávek < > > d...@apache.org> > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > however listeners can use previous state (tags/labels) > to > > > > make > > > > > > > > > decisions > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > That sounds like a very fragile contract. We should > either > > > > allow > > > > > > > passing > > > > > > > > > > tags between listeners and then need to define ordering > or > > make > > > > > all > > > > > > > of > > > > > > > > > them > > > > > > > > > > independent. I prefer the latter because it allows us to > > > > > > parallelize > > > > > > > > > things > > > > > > > > > > if needed (if all listeners trigger an RCP to the > external > > > > > system, > > > > > > > for > > > > > > > > > > example). > > > > > > > > > > > > > > > > > > > > Can you expand on why we need more than one classifier to > > be > > > > able > > > > > > to > > > > > > > > > output > > > > > > > > > > the same tag? > > > > > > > > > > > > > > > > > > > > system ones come first and then the ones loaded from the > > plugin > > > > > > > manager > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Since they're returned as a Set, the order is completely > > > > > > > > > non-deterministic, > > > > > > > > > > no matter in which order they're loaded. > > > > > > > > > > > > > > > > > > > > just communicating with external monitoring/alerting > > systems > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > That makes the need for pushing things out of the main > > thread > > > > > even > > > > > > > > > > stronger. This almost sounds like we need to return a > > > > > > > CompletableFuture > > > > > > > > > for > > > > > > > > > > the per-throwable classification because an external > system > > > > might > > > > > > > take a > > > > > > > > > > significant time to respond. We need to unblock the main > > thread > > > > > for > > > > > > > other > > > > > > > > > > RPCs. > > > > > > > > > > > > > > > > > > > > Also, in the proposal, this happens in the failure > > handler. If > > > > > > > that's the > > > > > > > > > > case, this might block the job from being restarted (if > the > > > > > restart > > > > > > > > > > strategy allows for another restart), which would be > great > > to > > > > > avoid > > > > > > > > > because > > > > > > > > > > it can introduce extra downtime. > > > > > > > > > > > > > > > > > > > > This raises another question: what should happen if the > > > > > > > classification > > > > > > > > > > fails? Crashing the job (which is what's currently > > proposed) > > > > > seems > > > > > > > very > > > > > > > > > > dangerous if this might depend on an external system. > > > > > > > > > > > > > > > > > > > > Thats a valid point, passing the JobGraph containing all > > the > > > > > above > > > > > > > > > > > information is also something to consider > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > We should avoid passing JG around because it's mutable > > (which > > > > we > > > > > > > must fix > > > > > > > > > > in the long term), and letting users change it might have > > > > > > > consequences. > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > D. > > > > > > > > > > > > > > > > > > > > On Mon, Mar 20, 2023 at 7:23 AM Panagiotis Garefalakis < > > > > > > > > > pga...@apache.org> > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > Hey David, Shammon, > > > > > > > > > > > > > > > > > > > > > > Thanks for the valuable comments! > > > > > > > > > > > I am glad you find this proposal useful, some thoughts: > > > > > > > > > > > > > > > > > > > > > > @Shammon > > > > > > > > > > > > > > > > > > > > > > 1. How about adding more job information in > > > > > > > FailureListenerContext? For > > > > > > > > > > > > example, job vertext, subtask, taskmanager location. > > And > > > > then > > > > > > > user > > > > > > > > > can > > > > > > > > > > do > > > > > > > > > > > > more statistics according to different dimensions. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thats a valid point, passing the JobGraph containing > all > > the > > > > > > above > > > > > > > > > > > information > > > > > > > > > > > is also something to consider, I was mostly trying to > be > > > > > > > conservative: > > > > > > > > > > > i.e., passingly only the information we need, and > extend > > as > > > > we > > > > > > see > > > > > > > fit > > > > > > > > > > > > > > > > > > > > > > 2. Users may want to save results in listener, and then > > they > > > > > can > > > > > > > get > > > > > > > > > the > > > > > > > > > > > > historical results even jabmanager failover. Can we > > > > provide a > > > > > > > unified > > > > > > > > > > > > implementation for data storage requirements? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The idea is to store only the output of the Listeners > > (tags) > > > > > and > > > > > > > treat > > > > > > > > > > them > > > > > > > > > > > as stateless. > > > > > > > > > > > Tags are be stored along with HistoryEntries, and will > be > > > > > > available > > > > > > > > > > through > > > > > > > > > > > the HistoryServer > > > > > > > > > > > even after a JM dies. > > > > > > > > > > > > > > > > > > > > > > @David > > > > > > > > > > > > > > > > > > > > > > 1) Should we also consider adding labels? The > > combination of > > > > > tags > > > > > > > and > > > > > > > > > > > > labels seems to be what most systems offer; > sometimes, > > they > > > > > > offer > > > > > > > > > > labels > > > > > > > > > > > > only (key=value pairs) because tags can be > implemented > > > > using > > > > > > > those, > > > > > > > > > but > > > > > > > > > > > not > > > > > > > > > > > > the other way around. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Indeed changing tags to k:v labels could be more > > expressive, > > > > I > > > > > > > like it! > > > > > > > > > > > Let's see what others think. > > > > > > > > > > > > > > > > > > > > > > 2) Since we can not predict how heavy user-defined > models > > > > > > > ("listeners") > > > > > > > > > > are > > > > > > > > > > > > going to be, it would be great to keep the > > interfaces/data > > > > > > > structures > > > > > > > > > > > > immutable so we can push things over to the I/O > > threads. > > > > > Also, > > > > > > it > > > > > > > > > > sounds > > > > > > > > > > > > off to call the main interface a Listener since it's > > > > supposed > > > > > > to > > > > > > > > > > enhance > > > > > > > > > > > > the original throwable with additional metadata. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The idea was for the name to be generic as there could > be > > > > > > Listener > > > > > > > > > > > implementations > > > > > > > > > > > just communicating with external monitoring/alerting > > systems > > > > > and > > > > > > no > > > > > > > > > > > metadata output > > > > > > > > > > > -- but lets rethink that. For immutability, see below: > > > > > > > > > > > > > > > > > > > > > > 3) You're proposing to support a set of listeners. > Since > > > > you're > > > > > > > passing > > > > > > > > > > the > > > > > > > > > > > > mutable context around, which includes tags set by > the > > > > > previous > > > > > > > > > > listener, > > > > > > > > > > > > do you expect users to make any assumptions about the > > order > > > > > in > > > > > > > which > > > > > > > > > > > > listeners are executed? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > In the existing proposal we are not making any > > assumptions > > > > > about > > > > > > > the > > > > > > > > > > order > > > > > > > > > > > of listeners, > > > > > > > > > > > (system ones come first and then the ones loaded from > the > > > > > plugin > > > > > > > > > manager) > > > > > > > > > > > however listeners can use previous state (tags/labels) > to > > > > make > > > > > > > > > decisions: > > > > > > > > > > > e.g., wont assign *UNKNOWN* failureType when we have > > already > > > > > seen > > > > > > > *USER > > > > > > > > > > *or > > > > > > > > > > > the other way around -- when we have seen *UNKNOWN* > > remove in > > > > > > > favor of > > > > > > > > > > > *USER* > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Cheers, > > > > > > > > > > > Panagiotis > > > > > > > > > > > > > > > > > > > > > > On Sun, Mar 19, 2023 at 10:42 AM David Morávek < > > > > > d...@apache.org> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > Hi Panagiotis, > > > > > > > > > > > > > > > > > > > > > > > > This is an excellent proposal and something everyone > > trying > > > > > to > > > > > > > > > provide > > > > > > > > > > > > "Flink as a service" needs to solve at some point. I > > have a > > > > > > > couple of > > > > > > > > > > > > questions: > > > > > > > > > > > > > > > > > > > > > > > > If I understand the proposal correctly, this is just > > about > > > > > > adding > > > > > > > > > tags > > > > > > > > > > to > > > > > > > > > > > > the Throwable by running a tuple of (Throwable, > > > > > FailureContext) > > > > > > > > > > through a > > > > > > > > > > > > user-defined model. > > > > > > > > > > > > > > > > > > > > > > > > 1) Should we also consider adding labels? The > > combination > > > > of > > > > > > > tags and > > > > > > > > > > > > labels seems to be what most systems offer; > sometimes, > > they > > > > > > offer > > > > > > > > > > labels > > > > > > > > > > > > only (key=value pairs) because tags can be > implemented > > > > using > > > > > > > those, > > > > > > > > > but > > > > > > > > > > > not > > > > > > > > > > > > the other way around. > > > > > > > > > > > > > > > > > > > > > > > > 2) Since we can not predict how heavy user-defined > > models > > > > > > > > > ("listeners") > > > > > > > > > > > are > > > > > > > > > > > > going to be, it would be great to keep the > > interfaces/data > > > > > > > structures > > > > > > > > > > > > immutable so we can push things over to the I/O > > threads. > > > > > Also, > > > > > > it > > > > > > > > > > sounds > > > > > > > > > > > > off to call the main interface a Listener since it's > > > > supposed > > > > > > to > > > > > > > > > > enhance > > > > > > > > > > > > the original throwable with additional metadata. > > > > > > > > > > > > > > > > > > > > > > > > I'd propose something along the lines of (we should > > have > > > > > better > > > > > > > > > names, > > > > > > > > > > > this > > > > > > > > > > > > is just to outline the idea): > > > > > > > > > > > > > > > > > > > > > > > > interface FailureEnricher { > > > > > > > > > > > > > > > > > > > > > > > > ThrowableWithTagsAndLabels enrichFailure(Throwable > > cause, > > > > > > > > > > > > ImmutableContextualMetadataAboutTheThrowable > context); > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > The names should change; this is just to outline the > > idea. > > > > > > > > > > > > > > > > > > > > > > > > 3) You're proposing to support a set of listeners. > > Since > > > > > you're > > > > > > > > > passing > > > > > > > > > > > the > > > > > > > > > > > > mutable context around, which includes tags set by > the > > > > > previous > > > > > > > > > > listener, > > > > > > > > > > > > do you expect users to make any assumptions about the > > order > > > > > in > > > > > > > which > > > > > > > > > > > > listeners are executed? > > > > > > > > > > > > > > > > > > > > > > > > *@Shammon* > > > > > > > > > > > > > > > > > > > > > > > > Users may want to save results in listener, and then > > they > > > > can > > > > > > > get the > > > > > > > > > > > > > historical results even jabmanager failover. Can we > > > > > provide a > > > > > > > > > unified > > > > > > > > > > > > > implementation for data storage requirements? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think we should explicitly state that all > > "listeners" are > > > > > > > treated > > > > > > > > > as > > > > > > > > > > > > stateless. I don't see any strong reason for > > snapshotting > > > > > them. > > > > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > > > D. > > > > > > > > > > > > > > > > > > > > > > > > On Sat, Mar 18, 2023 at 1:00 AM Shammon FY < > > > > > zjur...@gmail.com> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > Hi Panagiotis > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you for starting this discussion. I think > this > > FLIP > > > > > is > > > > > > > > > valuable > > > > > > > > > > > and > > > > > > > > > > > > > can help user to analyze the causes of job failover > > > > better! > > > > > > > > > > > > > > > > > > > > > > > > > > I have two comments as follows > > > > > > > > > > > > > > > > > > > > > > > > > > 1. How about adding more job information in > > > > > > > FailureListenerContext? > > > > > > > > > > For > > > > > > > > > > > > > example, job vertext, subtask, taskmanager > location. > > And > > > > > then > > > > > > > user > > > > > > > > > > can > > > > > > > > > > > do > > > > > > > > > > > > > more statistics according to different dimensions. > > > > > > > > > > > > > > > > > > > > > > > > > > 2. Users may want to save results in listener, and > > then > > > > > they > > > > > > > can > > > > > > > > > get > > > > > > > > > > > the > > > > > > > > > > > > > historical results even jabmanager failover. Can we > > > > > provide a > > > > > > > > > unified > > > > > > > > > > > > > implementation for data storage requirements? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > > > > shammon FY > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Saturday, March 18, 2023, Panagiotis > Garefalakis < > > > > > > > > > > pga...@apache.org > > > > > > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi everyone, > > > > > > > > > > > > > > > > > > > > > > > > > > > > This FLIP [1] proposes a pluggable interface for > > > > failure > > > > > > > handling > > > > > > > > > > > > > allowing > > > > > > > > > > > > > > users to implement custom failure logic using the > > > > plugin > > > > > > > > > framework. > > > > > > > > > > > > > > Motivated by existing proposals [2] and tickets > > [3], > > > > this > > > > > > > enables > > > > > > > > > > > > > use-cases > > > > > > > > > > > > > > like: assigning particular types to failures > (e.g., > > > > User > > > > > or > > > > > > > > > > System), > > > > > > > > > > > > > > emitting custom metrics per type (e.g., > > application or > > > > > > > platform), > > > > > > > > > > > even > > > > > > > > > > > > > > exposing errors to downstream consumers (e.g., > > > > > notification > > > > > > > > > > systems). > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks to Piotr and Anton for the initial reviews > > and > > > > > > > > > discussions! > > > > > > > > > > > > > > > > > > > > > > > > > > > > For anyone interested, the starting point would > be > > the > > > > > FLIP > > > > > > > [1] > > > > > > > > > > that > > > > > > > > > > > I > > > > > > > > > > > > > > created, > > > > > > > > > > > > > > describing the motivation and the proposed > changes > > > > (part > > > > > of > > > > > > > the > > > > > > > > > > core, > > > > > > > > > > > > > > runtime and web). > > > > > > > > > > > > > > > > > > > > > > > > > > > > The intuition behind this FLIP is being able to > > execute > > > > > > > custom > > > > > > > > > > logic > > > > > > > > > > > on > > > > > > > > > > > > > > failures by exposing a FailureListener interface. > > > > > > > Implementation > > > > > > > > > by > > > > > > > > > > > > users > > > > > > > > > > > > > > can be simply loaded to the system as Jar files. > > > > > > > FailureListeners > > > > > > > > > > may > > > > > > > > > > > > > also > > > > > > > > > > > > > > decide to assign failure tags to errors > (expressed > > as > > > > > > > strings), > > > > > > > > > > > > > > that will then be exposed as metadata by the > > UI/Rest > > > > > > > interfaces. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Feedback is always appreciated! Looking forward > to > > your > > > > > > > thoughts! > > > > > > > > > > > > > > > > > > > > > > > > > > > > [1] > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-304% > > > > > > > > > > > > > > 3A+Pluggable+failure+handling+for+Apache+Flink > > > > > > > > > > > > > > [2] > > > > > > > > > > > > > > > > > > > > https://docs.google.com/document/d/1pcHg9F3GoDDeVD5GIIo2wO67 > > > > > > > > > > > > > > Hmjgy0-hRDeuFnrMgT4 > > > > > > > > > > > > > > [3] > > https://issues.apache.org/jira/browse/FLINK-20833 > > > > > > > > > > > > > > > > > > > > > > > > > > > > Cheers, > > > > > > > > > > > > > > Panagiotis > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >