Hi all,

Apologies for the long pause. Picking FLIP-545 back up.
The Google doc [1] has the latest version, incorporating feedback from
Aleksandr, Li, and David from the earlier rounds.

I don't have Cwiki write access. Could a committer please help sync the doc
to [2]? Once that's done, I'll give until end of next week for any final
concerns before starting the [VOTE].

Thanks,
Kartikey

[1] https://docs.google.com/document/d/1CCu7Js0ATOAgqRMS-kWj_0v0G_jt2r9IfMB2Oty7KJo/edit?usp=sharing
[2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-545%3A+Hardening+the+Event+Reporter+with+an+Asynchronous+Core

On Thu, Nov 13, 2025 at 12:57 PM Kartikey Pant <[email protected]> wrote:

> Hi Li and David,
>
> Thank you both for the great questions.
>
> Li - Thanks for the support. To your question: yes, the circuit breaker is
> 100% per-reporter. A faulty sink (one that throws errors) will trip its
> own breaker. A slow sink (one that just blocks) will simply be isolated to
> a thread in the worker pool. In either case, it will not block or affect
> dispatching to any other configured reporters. This isolation is the core
> of the new design.
>
> David - Thank you for the detailed operational questions.
>
> 1. *On Benchmarks*: The primary benefit is JobManager stability. The
>    benefit is qualitative: the addEvent() call moves from a blocking I/O
>    operation (in sync mode) to a non-blocking, near-constant-time
>    queue.offer() (in async-queued mode). The integration tests are
>    specifically designed to validate that the caller thread (i.e., the
>    JobManager) remains unblocked, even when one reporter is slow. To
>    quantify the impact of async vs. sync, I will add a benchmark once the
>    implementation is done.
> 2. *On Dropped Events*: You are correct. This is the explicit
>    tradeoff: the async dispatcher prioritizes JM stability over guaranteed
>    event delivery. The sync dispatcher remains the default for users who
>    need guaranteed, blocking delivery. The new metrics
>    (events.droppedCount, events.reporter.<name>.latency,
>    events.reporter.<name>.failureCount) are the diagnostic tools for
>    operators to identify and resolve the source of backpressure.
> 3. *On the Queue (Sync vs. Async)*: In sync mode, there is no queue;
>    the JM thread is the "queue" and blocks. In async mode, we use a bounded
>    in-memory queue, configurable via events.dispatcher.queue.size (default:
>    1024). We also provide the events.dispatcher.queue.currentUsage gauge,
>    which allows operators to set up their own "90% full" alerts as you
>    suggested.
> 4. *On Using Kafka*: In my opinion, this would be overkill for the
>    dispatcher. The goal is lightweight thread isolation. If a user needs
>    Kafka's durability, they should implement a KafkaEventReporter. This FLIP
>    makes that pattern safe, as the reporter's blocking calls will be isolated
>    to the worker pool and won't affect the JM or other reporters.
>
> Thanks,
> Kartikey
>
>
> On Mon, Nov 10, 2025 at 5:44 PM David Radley <[email protected]>
> wrote:
>
>> Hi Li,
>> Some comments on the Flip:
>>
>> * Have you created a benchmark for this to showcase the benefits of the
>>   single thread and the asynchronous approach? I assume we can get more
>>   throughput; it would be good to quantify the improvement.
>> * You mention counting dropped events in back pressure situations. Does
>>   this mean that the metrics will be incomplete? It would be good to see a
>>   picture of this happening and what the user can do in this situation.
>> * You talk about the dispatch queue becoming full. Can you detail what you
>>   mean here and how it is different in the sync and async cases? How big is
>>   the queue and can it be increased? Can we put out warnings around 90% full?
>> * Would a Kafka queue be overkill / possible to solve this full queue
>>   problem?
>>   The separation between production and consumption that Kafka
>>   brings seems applicable here.
>>
>> Kind regards, David.
>>
>> From: Li Wang <[email protected]>
>> Date: Monday, 10 November 2025 at 11:45
>> To: [email protected] <[email protected]>
>> Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX: Hardening the Event Reporter
>> with an Asynchronous Core
>>
>> Hi all,
>>
>> Just want to check on this thread.
>>
>> I think FLIP-545 is important work. The new async dispatcher will help
>> Flink stability a lot when we use custom reporters - it helps unblock the
>> JobManager.
>>
>> Kartikey, I have a quick question on the circuit breaker logic. Is the
>> state managed per-reporter (so each reporter is isolated), or will one
>> faulty reporter potentially stop dispatches for all reporters? This is a
>> key detail for our setup.
>>
>> Ready to see the [VOTE] start soon. Thank you for the FLIP.
>>
>> Thanks,
>> Li
>>
>> On Wed, Oct 1, 2025 at 12:32 PM Kartikey Pant <[email protected]>
>> wrote:
>>
>> > Hi all,
>> >
>> > Circling back on this thread.
>> >
>> > Thanks to the great feedback from the earlier discussion, the proposal has
>> > been updated to use a more flexible, interface-based design. The final FLIP
>> > is available on the Cwiki [1] (thanks, Piotr, for creating the page).
>> >
>> > My intention is to move this to a formal vote next week.
>> >
>> > Before I do, please raise any blocking concerns by this Friday, October
>> > 3rd. If there are no blocking issues, I will start the [VOTE] thread on
>> > Monday.
>> >
>> > Thanks,
>> > Kartikey
>> >
>> > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-545%3A+Hardening+the+Event+Reporter+with+an+Asynchronous+Core
>> >
>> >
>> > On Tue, Sep 2, 2025 at 5:00 PM Piotr Nowojski <[email protected]>
>> > wrote:
>> >
>> > > Hi,
>> > >
>> > > Here you go:
>> > >
>> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-545%3A+Hardening+the+Event+Reporter+with+an+Asynchronous+Core
>> > >
>> > > Best,
>> > > Piotrek
>> > >
>> > > On Mon, Sep 1, 2025 at 19:37 Kartikey Pant <[email protected]>
>> > > wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > > Thanks, Aleksandr, for the great suggestion on using an
>> > > > interface-based strategy. It's a much cleaner approach that ensures
>> > > > backward compatibility while keeping the design extensible.
>> > > >
>> > > > Based on this feedback, I've updated the FLIP document. The design now
>> > > > uses an EventDispatcher interface, controlled by a single
>> > > > events.dispatcher.type config key, allowing users to opt in to the new
>> > > > asynchronous behavior.
>> > > >
>> > > > I believe the proposal has now stabilized. As I don't have Confluence
>> > > > write access, could a committer please help assign an official FLIP
>> > > > number to this:
>> > > >
>> > > > https://docs.google.com/document/d/1CCu7Js0ATOAgqRMS-kWj_0v0G_jt2r9IfMB2Oty7KJo/edit?tab=t.0
>> > > >
>> > > > Best,
>> > > > Kartikey Pant
>> > > >
>> > > >
>> > > > On Tue, Aug 26, 2025 at 11:13 PM Aleksandr Iushmanov
>> > > > <[email protected]> wrote:
>> > > > >
>> > > > > Hi Kartikey,
>> > > > >
>> > > > > Thank you for looking into this.
>> > > > >
>> > > > > I might not be very familiar with the naming conventions in Flink,
>> > > > > so please bear with me if my suggestion doesn't make complete sense.
>> > > > > I suggest introducing a feature flag, something like:
>> > > > >
>> > > > > > events.reporter.<name>.dispatcher.type
>> > > > >
>> > > > > which would default to *sync* to make this change backwards compatible.
>> > > > >
>> > > > > Also, are there any reasons why we would not want to introduce an
>> > > > > interface with two implementations?
>> > > > > 1. sync: for the existing behaviour.
>> > > > > 2. memory-queue: for the proposed implementation with the queue.
>> > > > >
>> > > > > This way:
>> > > > >
>> > > > > - we don't break anything by default
>> > > > > - we can change the default in future releases once it has been proven
>> > > > >   to be stable
>> > > > > - we keep the door open for other implementations (e.g. file-based queue
>> > > > >   or spillover to logs).
>> > > > >
>> > > > >
>> > > > > I look forward to hearing your thoughts on it.
>> > > > >
>> > > > > Kind regards,
>> > > > > Aleksandr Iushmanov
>> > > > >
>> > > > >
>> > > > > On Fri, 22 Aug 2025 at 09:54, Kartikey Pant <[email protected]>
>> > > > > wrote:
>> > > > >
>> > > > > > Hi Aleksandr,
>> > > > > >
>> > > > > > Thanks for the great feedback. Your points on guaranteed delivery and the
>> > > > > > *FileEventsReporter* are spot on, and I agree with your reasoning. I'll
>> > > > > > update the FLIP to incorporate them, as it will make the proposal much
>> > > > > > stronger.
>> > > > > >
>> > > > > > Regarding the delivery guarantee, I'll add a new configuration key,
>> > > > > > *events.reporter.<name>.delivery.guarantee*, to allow a choice between two
>> > > > > > modes. The default will be best-effort for the asynchronous, non-blocking
>> > > > > > dispatch. I'll also add a guaranteed mode for a synchronous, blocking
>> > > > > > dispatch that bypasses the queue, perfect for the critical autoscaling use
>> > > > > > case you mentioned.
>> > > > > >
>> > > > > > On your question about the *FileEventsReporter*, you're right that a local
>> > > > > > file append is cheap. The async core isn't really designed for the
>> > > > > > *FileEventsReporter* specifically, but for the general case where reporters
>> > > > > > write to network sinks (e.g., *OpenTelemetry*) where latency and
>> > > > > > backpressure are real concerns. The file reporter is just meant to be a
>> > > > > > simple, built-in option for users.
>> > > > > >
>> > > > > > I'll get these changes into the design doc shortly and will follow up on
>> > > > > > this thread once it's updated. Thanks again for helping improve the FLIP.
>> > > > > >
>> > > > > > Best,
>> > > > > > Kartikey
>> > > > > >
>> > > > > > On Thu, Aug 21, 2025 at 11:19 PM Aleksandr Iushmanov <[email protected]>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Hi Kartikey,
>> > > > > > >
>> > > > > > > I like the idea and I agree with the general direction, thank you for
>> > > > > > > putting it together!
>> > > > > > >
>> > > > > > > I have one concern about making this modification "forced". Imho there
>> > > > > > > should be room for "guaranteed important events delivery" from the
>> > > > > > > operations point of view. If a Flink job is struggling/backpressured, it
>> > > > > > > may make sense to emit some events at a priority that would be used for
>> > > > > > > external triggers like "autoscaling" or external dynamic configuration
>> > > > > > > tuning.
>> > > > > > >
>> > > > > > > Imho, interfaces should allow choosing "sync" vs "non guaranteed
>> > > > > > > async" delivery for different events (or event reporters).
>> > > > > > > With the proposal "as is", it won't be possible to "ensure" that important
>> > > > > > > messages have been delivered and can be actioned by an external monitoring
>> > > > > > > system. Could we make the "queue / async" behaviour opt-in?
>> > > > > > >
>> > > > > > > The second question I had was around the FileEventReporter implementation.
>> > > > > > > At a glance, "append to file" is a fairly cheap operation. Do you have a
>> > > > > > > concern that the volume of events is large enough to cause a significant
>> > > > > > > disk I/O bottleneck and require a memory queue?
>> > > > > > >
>> > > > > > > Kind regards,
>> > > > > > >
>> > > > > > > Aleksandr Iushmanov
>> > > > > > >
>> > > > > > >
>> > > > > > > On 2025/08/19 06:56:36 Kartikey Pant wrote:
>> > > > > > > > Hi everyone,
>> > > > > > > >
>> > > > > > > > I'd like to propose a new FLIP that builds directly on the excellent
>> > > > > > > > foundation laid by FLIP-481 (Introduce Event Reporting). For anyone
>> > > > > > > > needing context, the original proposal is available here:
>> > > > > > > >
>> > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-481%3A+Introduce+Event+Reporting
>> > > > > > > >
>> > > > > > > > Now that the community has this powerful API, the logical next step is
>> > > > > > > > to ensure it's fully robust for large-scale production environments
>> > > > > > > > where users will be writing their own diverse, custom reporters.
>> > > > > > > >
>> > > > > > > > This proposal focuses on one key enhancement: introducing a resilient,
>> > > > > > > > asynchronous dispatch core. The goal is to decouple event generation
>> > > > > > > > from the reporter's execution, ensuring that a slow or experimental
>> > > > > > > > sink can never impact Flink's core stability.
>> > > > > > > >
>> > > > > > > > I've drafted a detailed design document that I hope can form the basis
>> > > > > > > > of this new FLIP:
>> > > > > > > >
>> > > > > > > > https://docs.google.com/document/d/1CCu7Js0ATOAgqRMS-kWj_0v0G_jt2r9IfMB2Oty7KJo/edit?usp=sharing
>> > > > > > > >
>> > > > > > > > I'm keen to get the community's initial feedback on this direction
>> > > > > > > > before moving forward with the formal process.
>> > > > > > > >
>> > > > > > > > Thanks,
>> > > > > > > > Kartikey Pant
>> > > > > > > >
>>
>> Unless otherwise stated above:
>>
>> IBM United Kingdom Limited
>> Registered in England and Wales with number 741598
>> Registered office: Building C, IBM Hursley Office, Hursley Park Road,
>> Winchester, Hampshire SO21 2JN
>>
>
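
For readers skimming the thread, the async-queued behaviour described in the Nov 13 reply (a non-blocking queue.offer() on a bounded queue, a dropped-event counter, a queue-usage gauge, and a breaker that is tripped per reporter) can be sketched roughly as below. This is only an illustration under stated assumptions, not the FLIP-545 implementation: the class names AsyncQueuedDispatcherSketch and ReporterCircuitSketch, the dispatchOne() helper, and the "N consecutive failures" trip rule are all hypothetical, and the worker pool mentioned in the thread is left out so the example stays deterministic.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/** Hypothetical sketch of an async-queued dispatcher; names are not from the FLIP. */
final class AsyncQueuedDispatcherSketch {
    private final BlockingQueue<String> queue;
    private final int capacity;
    private final AtomicLong droppedCount = new AtomicLong();

    AsyncQueuedDispatcherSketch(int capacity) {
        this.capacity = capacity;
        this.queue = new ArrayBlockingQueue<>(capacity); // bounded in-memory queue
    }

    /** Caller side (e.g. the JobManager thread): never blocks. */
    boolean addEvent(String event) {
        boolean accepted = queue.offer(event); // returns false immediately when full
        if (!accepted) {
            droppedCount.incrementAndGet(); // stability is preferred over delivery
        }
        return accepted;
    }

    /** Worker side: hands one queued event to a reporter, if one is waiting. */
    boolean dispatchOne(Consumer<String> reporter) {
        String event = queue.poll(); // null when the queue is empty
        if (event == null) {
            return false;
        }
        reporter.accept(event);
        return true;
    }

    /** Analogue of a dropped-events counter metric. */
    long droppedCount() {
        return droppedCount.get();
    }

    /** Analogue of a queue-usage gauge, as a fraction between 0.0 and 1.0. */
    double currentUsage() {
        return (double) queue.size() / capacity;
    }
}

/**
 * Hypothetical per-reporter breaker: each reporter owns its own failure state,
 * so one faulty reporter cannot stop dispatch to the others.
 * Assumes it is only used from a single worker thread.
 */
final class ReporterCircuitSketch {
    private final Consumer<String> reporter;
    private final int failureThreshold; // assumed trip rule: N consecutive failures
    private int consecutiveFailures;

    ReporterCircuitSketch(Consumer<String> reporter, int failureThreshold) {
        this.reporter = reporter;
        this.failureThreshold = failureThreshold;
    }

    boolean isOpen() {
        return consecutiveFailures >= failureThreshold;
    }

    void report(String event) {
        if (isOpen()) {
            return; // skip only this reporter
        }
        try {
            reporter.accept(event);
            consecutiveFailures = 0;
        } catch (RuntimeException e) {
            consecutiveFailures++;
        }
    }
}
```

With a capacity of 2, a third addEvent() call returns false and bumps the drop counter rather than blocking the caller. An operator alert at "90% full" would simply compare currentUsage() against 0.9.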
