Hi all, Quick update on FLIP-545. I've realized the proposed AsynchronousDispatcher design has a head-of-line blocking problem. If a sink is slow but not failing, the single dispatch thread would block and stall all other reporters. I've updated the design to a "Poll Thread + Worker Pool" model to fix this. This gives us true isolation against slow sinks, and properly delivers the FLIP's goals.
I've updated the FLIP GDoc [1] with this change. It primarily adds one new config key along with some follow-up changes: events.dispatcher.worker.threads. Could a committer please help sync this to Confluence [2]? Thanks, Kartikey [1] https://docs.google.com/document/d/1CCu7Js0ATOAgqRMS-kWj_0v0G_jt2r9IfMB2Oty7KJo/edit?usp=sharing [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-545%3A+Hardening+the+Event+Reporter+with+an+Asynchronous+Core On Wed, Oct 1, 2025 at 12:08 PM Kartikey Pant <[email protected]> wrote: > > Hi all, > > Circling back on this thread. > > Thanks to the great feedback from the earlier discussion, the proposal has > been updated to use a more flexible, interface-based design. The final FLIP > is available on the Cwiki [1] (thanks, Piotr, for creating the page). > > My intention is to move this to a formal vote next week. > > Before I do, please raise any blocking concerns by this Friday, October 3rd. > If there are no blocking issues, I will start the [VOTE] thread on Monday. > > Thanks, > Kartikey > > [1] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-545%3A+Hardening+the+Event+Reporter+with+an+Asynchronous+Core > > > On Tue, Sep 2, 2025 at 5:00 PM Piotr Nowojski <[email protected]> wrote: >> >> Hi, >> >> Here you go: >> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-545%3A+Hardening+the+Event+Reporter+with+an+Asynchronous+Core >> >> Best, >> Piotrek >> >> pon., 1 wrz 2025 o 19:37 Kartikey Pant <[email protected]> >> napisał(a): >> >> > Hi all, >> > >> > Thanks, Aleksandr, for the great suggestion on using an >> > interface-based strategy. It's a much cleaner approach that ensures >> > backward compatibility while keeping the design extensible. >> > >> > Based on this feedback, I've updated the FLIP document. The design now >> > uses an EventDispatcher interface, controlled by a single >> > events.dispatcher.type config key, allowing users to opt-in to the new >> > asynchronous behavior. >> > >> > I believe the proposal has now stabilized. As I don't have Confluence >> > write access, could a committer please help assign an official FLIP >> > number this: >> > https://docs.google.com/document/d/1CCu7Js0ATOAgqRMS-kWj_0v0G_jt2r9IfMB2Oty7KJo/edit?tab=t.0 >> > >> > Best, >> > Kartikey Pant >> > >> > >> > On Tue, Aug 26, 2025 at 11:13 PM Aleksandr Iushmanov >> > <[email protected]> wrote: >> > > >> > > Hi Kartikey, >> > > >> > > Thank you for looking into this. >> > > >> > > I might not be very familiar with the naming conventions in Flink, >> > > so please bear with me if my suggestion doesn't make complete sense. >> > > I suggest introducing a feature flag, something like: >> > > >> > > > events.reporter.<name>.dispatcher.type >> > > >> > > which would default to *sync* to make this change backwards compatible. >> > > >> > > Also, are there any reasons why we would not want to introduce an >> > > interface with two implementations? >> > > 1. sync: for the existing behaviour. >> > > 2. memory-queue: for the proposed implementation with the queue. >> > > >> > > This way: >> > > >> > > - we don't break anything by default >> > > - we can change the default in future releases once it has been proven >> > > to be stable >> > > - we keep the door open for other implementations (e.g. file-based >> > queue >> > > or spillover to logs). >> > > >> > > >> > > I look forward to hearing your thoughts on it. >> > > >> > > Kind regards, >> > > Aleksandr Iushmanov >> > > >> > > >> > > On Fri, 22 Aug 2025 at 09:54, Kartikey Pant <[email protected]> >> > > wrote: >> > > >> > > > Hi Aleksandr, >> > > > >> > > > Thanks for the great feedback. Your points on guaranteed delivery and >> > the >> > > > *FileEventsReporter* are spot on, and I agree with your reasoning. I'll >> > > > update the FLIP to incorporate them, as it will make the proposal much >> > > > stronger. >> > > > >> > > > Regarding the delivery guarantee, I'll add a new configuration key, >> > > > *events.reporter.<name>.delivery.guarantee*, to allow a choice between >> > two >> > > > modes. The default will be best-effort for the asynchronous, >> > non-blocking >> > > > dispatch. I'll also add a guaranteed mode for a synchronous, blocking >> > > > dispatch that bypasses the queue, perfect for the critical autoscaling >> > use >> > > > case you mentioned. >> > > > >> > > > On your question about the *FileEventsReporter*, you're right that a >> > local >> > > > file append is cheap. The async core isn't really designed for the >> > > > *FileEventsReporter* specifically, but for the general case where >> > reporters >> > > > write to network sinks (e.g., *OpenTelemetry*) where latency and >> > > > backpressure are real concerns. The file reporter is just meant to be a >> > > > simple, built-in option for users. >> > > > >> > > > I'll get these changes into the design doc shortly and will follow up >> > on >> > > > this thread once it's updated. Thanks again for helping improve the >> > FLIP. >> > > > >> > > > Best, >> > > > Kartikey >> > > > >> > > > On Thu, Aug 21, 2025 at 11:19 PM Aleksandr Iushmanov < >> > [email protected]> >> > > > wrote: >> > > > >> > > > > Hi Kartikey, >> > > > > >> > > > > I like the idea and I agree with general direction, thank you for >> > > > > putting it together! >> > > > > >> > > > > I have one concern about making this modification "forced", imho >> > there >> > > > > should be a room for "guaranteed important events delivery" from the >> > > > > operations point of view. If Flink job is struggling/backpressured it >> > > > > may make sense to emit some events at priority that would be used for >> > > > > external triggers like "autoscaling" or external dynamic >> > configuration >> > > > > tuning. >> > > > > >> > > > > Imho, interfaces should either allow to choose "sync" vs "non >> > guaranteed >> > > > > async" delivery for different events (or event reporters). With >> > proposal >> > > > > "as is" it won't be possible to "ensure" that important messages have >> > > > > been delivered and can be actioned by external monitoring system. >> > Could >> > > > > we make "queue / async" behaviour opt-in? >> > > > > Second question I had was around FileEventReporter implementation, >> > at a >> > > > > glance, "append to file" is a fairly cheap operation, do you have a >> > > > > concern that amount of events is large enough to have significant >> > > > > bottleneck on disk IO and requires memory queue? >> > > > > >> > > > > Kind regards, >> > > > > >> > > > > Aleksandr Iushmanov >> > > > > >> > > > > >> > > > > On 2025/08/19 06:56:36 Kartikey Pant wrote: >> > > > > > Hi everyone, >> > > > > > >> > > > > > I'd like to propose a new FLIP that builds directly on the >> > excellent >> > > > > > foundation laid by FLIP-481 (Introduce Event Reporting). For >> > anyone >> > > > > > needing context, the original proposal is available here: >> > > > > > >> > > > > >> > > > > >> > > > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-481%3A+Introduce+Event+Reporting >> > > > > > >> > > > > > Now that the community has this powerful API, the logical next >> > step is >> > > > > > to ensure it's fully robust for large-scale production >> > environments >> > > > > > where users will be writing their own diverse, custom reporters. >> > > > > > >> > > > > > This proposal focuses on one key enhancement: introducing a >> > resilient, >> > > > > > asynchronous dispatch core. The goal is to decouple event >> > generation >> > > > > > from the reporter's execution, ensuring that a slow or >> > experimental >> > > > > > sink can never impact Flink's core stability. >> > > > > > >> > > > > > I've drafted a detailed design document that I hope can form the >> > basis >> > > > > > of this new FLIP: >> > > > > > >> > > > > >> > > > > >> > > > >> > https://docs.google.com/document/d/1CCu7Js0ATOAgqRMS-kWj_0v0G_jt2r9IfMB2Oty7KJo/edit?usp=sharing >> > > > > > >> > > > > > I'm keen to get the community's initial feedback on this direction >> > > > > > before moving forward with the formal process. >> > > > > > >> > > > > > Thanks, >> > > > > > Kartikey Pant >> > > > > > >> > > > > >> > > > >> >
