Hi Li,
Some comments on the FLIP:

  * Have you created a benchmark for this to showcase the benefits of the
    single thread and the asynchronous approach? I assume we can get more
    throughput; it would be good to quantify the improvement.
  * You mention counting dropped events in back-pressure situations, so this
    means that the metrics will be incomplete. It would be good to see a
    picture of this happening and what the user can do in this situation.
  * You talk about the dispatch queue becoming full. Can you detail what you
    mean here and how it is different in the sync and async cases? How big is
    the queue and can it be increased? Can we put out warnings around 90% full?
  * Would a Kafka queue be overkill / possible to solve this full queue
    problem? The separation between production and consumption that Kafka
    brings seems applicable here.
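
To make the queue questions above concrete, here is a minimal Java sketch of
the behaviour being asked about. It is an illustration under assumptions, not
the FLIP's implementation: the class name BoundedEventQueue, its methods, and
the percent-based warning threshold are all hypothetical. It shows a bounded
in-memory queue that never blocks the producer, counts events dropped when
the queue is full, and counts a warning each time an accepted event leaves
the fill level at or above the threshold (90% in the usage below).

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch only; names and behaviour are assumptions, not FLIP-545 API.
final class BoundedEventQueue<E> {
    private final BlockingQueue<E> queue;
    private final int warnThreshold; // queue size at which a warning is raised
    private final AtomicLong dropped = new AtomicLong();
    private final AtomicLong warnings = new AtomicLong();

    BoundedEventQueue(int capacity, int warnPercent) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        // Integer arithmetic avoids floating-point rounding surprises.
        this.warnThreshold = capacity * warnPercent / 100;
    }

    /** Non-blocking enqueue: returns false and counts a drop when the queue is full. */
    boolean offer(E event) {
        if (!queue.offer(event)) {
            dropped.incrementAndGet();
            return false;
        }
        if (queue.size() >= warnThreshold) {
            // A real dispatcher would emit a rate-limited log line here.
            warnings.incrementAndGet();
        }
        return true;
    }

    /** Consumer side: returns the next event, or null if the queue is empty. */
    E poll() {
        return queue.poll();
    }

    long droppedCount() {
        return dropped.get();
    }

    long warningCount() {
        return warnings.get();
    }

    int size() {
        return queue.size();
    }
}
```

With a capacity of 10 and a 90% threshold, offering 12 events without any
consumer accepts the first 10, raises a warning on the 9th and 10th, and
counts the last 2 as dropped. The dropped counter is what a user would watch
to know that the reported events are incomplete.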

Kind regards, David.

From: Li Wang <[email protected]>
Date: Monday, 10 November 2025 at 11:45
To: [email protected] <[email protected]>
Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX: Hardening the Event Reporter with 
an Asynchronous Core

Hi all,

Just want to check on this thread.

I think FLIP-545 is important work. The new async dispatcher will help
Flink stability a lot when we use custom reporters - it helps unblock the
JobManager.

Kartikey, I have a quick question on the circuit breaker logic. Is the
state managed per-reporter (so each reporter is isolated), or will one
faulty reporter potentially stop dispatches for all reporters? This is a
key detail for our setup.
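
To illustrate the isolated variant in the question, here is a minimal Java
sketch. It is an assumption about one possible design, not what the FLIP
specifies: the class PerReporterBreakers, its methods, and the
consecutive-failure threshold are hypothetical. Each reporter name gets its
own failure count, so one faulty reporter opens only its own breaker.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only; not the circuit breaker described in FLIP-545.
final class PerReporterBreakers {
    private final int failureThreshold;
    // Consecutive failure count per reporter name; absent means zero.
    private final Map<String, Integer> failures = new HashMap<>();

    PerReporterBreakers(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    /** Records one failed dispatch for the given reporter only. */
    synchronized void recordFailure(String reporter) {
        failures.merge(reporter, 1, Integer::sum);
    }

    /** A successful dispatch resets that reporter's count and closes its breaker. */
    synchronized void recordSuccess(String reporter) {
        failures.remove(reporter);
    }

    /** True when this reporter has reached the threshold; other reporters are unaffected. */
    synchronized boolean isOpen(String reporter) {
        return failures.getOrDefault(reporter, 0) >= failureThreshold;
    }
}
```

With a threshold of 3, three consecutive failures recorded for a reporter
named "otel" open its breaker, while a second reporter named "file" keeps
dispatching. A shared breaker would instead keep a single counter for all
reporters.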

Ready to see the [VOTE] start soon. Thank you for the FLIP.

Thanks,
Li

On Wed, Oct 1, 2025 at 12:32 PM Kartikey Pant <[email protected]>
wrote:

> Hi all,
>
> Circling back on this thread.
>
> Thanks to the great feedback from the earlier discussion, the proposal has
> been updated to use a more flexible, interface-based design. The final FLIP
> is available on the Cwiki [1] (thanks, Piotr, for creating the page).
>
> My intention is to move this to a formal vote next week.
>
> Before I do, please raise any blocking concerns by this Friday, October
> 3rd. If there are no blocking issues, I will start the [VOTE] thread on
> Monday.
>
> Thanks,
> Kartikey
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-545%3A+Hardening+the+Event+Reporter+with+an+Asynchronous+Core
>
>
> On Tue, Sep 2, 2025 at 5:00 PM Piotr Nowojski <[email protected]>
> wrote:
>
> > Hi,
> >
> > Here you go:
> >
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-545%3A+Hardening+the+Event+Reporter+with+an+Asynchronous+Core
> >
> > Best,
> > Piotrek
> >
> > pon., 1 wrz 2025 o 19:37 Kartikey Pant <[email protected]>
> > napisał(a):
> >
> > > Hi all,
> > >
> > > Thanks, Aleksandr, for the great suggestion on using an
> > > interface-based strategy. It's a much cleaner approach that ensures
> > > backward compatibility while keeping the design extensible.
> > >
> > > Based on this feedback, I've updated the FLIP document. The design now
> > > uses an EventDispatcher interface, controlled by a single
> > > events.dispatcher.type config key, allowing users to opt-in to the new
> > > asynchronous behavior.
> > >
> > > I believe the proposal has now stabilized. As I don't have Confluence
> > > write access, could a committer please help assign an official FLIP
> > > > number to this:
> > >
> >
> https://docs.google.com/document/d/1CCu7Js0ATOAgqRMS-kWj_0v0G_jt2r9IfMB2Oty7KJo/edit?tab=t.0
> > >
> > > Best,
> > > Kartikey Pant
> > >
> > >
> > > On Tue, Aug 26, 2025 at 11:13 PM Aleksandr Iushmanov
> > > <[email protected]> wrote:
> > > >
> > > > Hi Kartikey,
> > > >
> > > > Thank you for looking into this.
> > > >
> > > > I might not be very familiar with the naming conventions in Flink,
> > > > so please bear with me if my suggestion doesn't make complete sense.
> > > > I suggest introducing a feature flag, something like:
> > > >
> > > > > events.reporter.<name>.dispatcher.type
> > > >
> > > > which would default to *sync* to make this change backwards
> > > > compatible.
> > > >
> > > > Also, are there any reasons why we would not want to introduce an
> > > > interface with two implementations?
> > > > 1. sync: for the existing behaviour.
> > > > 2. memory-queue: for the proposed implementation with the queue.
> > > >
> > > > This way:
> > > >
> > > >    - we don't break anything by default
> > > >    - we can change the default in future releases once it has been
> > > >    proven to be stable
> > > >    - we keep the door open for other implementations (e.g. file-based
> > > >    queue or spillover to logs).
> > > >
> > > >
> > > > I look forward to hearing your thoughts on it.
> > > >
> > > > Kind regards,
> > > > Aleksandr Iushmanov
> > > >
> > > >
> > > > On Fri, 22 Aug 2025 at 09:54, Kartikey Pant <
> > [email protected]>
> > > > wrote:
> > > >
> > > > > Hi Aleksandr,
> > > > >
> > > > > > Thanks for the great feedback. Your points on guaranteed delivery
> > > > > > and the *FileEventsReporter* are spot on, and I agree with your
> > > > > > reasoning. I'll update the FLIP to incorporate them, as it will
> > > > > > make the proposal much stronger.
> > > > > >
> > > > > > Regarding the delivery guarantee, I'll add a new configuration key,
> > > > > > *events.reporter.<name>.delivery.guarantee*, to allow a choice
> > > > > > between two modes. The default will be best-effort for the
> > > > > > asynchronous, non-blocking dispatch. I'll also add a guaranteed
> > > > > > mode for a synchronous, blocking dispatch that bypasses the queue,
> > > > > > perfect for the critical autoscaling use case you mentioned.
> > > > > >
> > > > > > On your question about the *FileEventsReporter*, you're right that
> > > > > > a local file append is cheap. The async core isn't really designed
> > > > > > for the *FileEventsReporter* specifically, but for the general case
> > > > > > where reporters write to network sinks (e.g., *OpenTelemetry*)
> > > > > > where latency and backpressure are real concerns. The file reporter
> > > > > > is just meant to be a simple, built-in option for users.
> > > > > >
> > > > > > I'll get these changes into the design doc shortly and will follow
> > > > > > up on this thread once it's updated. Thanks again for helping
> > > > > > improve the FLIP.
> > > > >
> > > > > Best,
> > > > > Kartikey
> > > > >
> > > > > On Thu, Aug 21, 2025 at 11:19 PM Aleksandr Iushmanov <
> > > [email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hi Kartikey,
> > > > > >
> > > > > > I like the idea and I agree with general direction, thank you for
> > > > > > putting it together!
> > > > > >
> > > > > > > I have one concern about making this modification "forced": IMHO
> > > > > > > there should be room for "guaranteed important events delivery"
> > > > > > > from the operations point of view. If a Flink job is
> > > > > > > struggling/backpressured, it may make sense to emit some events
> > > > > > > at a priority that would be used for external triggers like
> > > > > > > "autoscaling" or external dynamic configuration tuning.
> > > > > > >
> > > > > > > IMHO, the interfaces should allow choosing "sync" vs
> > > > > > > "non-guaranteed async" delivery for different events (or event
> > > > > > > reporters). With the proposal "as is", it won't be possible to
> > > > > > > "ensure" that important messages have been delivered and can be
> > > > > > > actioned by an external monitoring system. Could we make the
> > > > > > > "queue / async" behaviour opt-in?
> > > > > > >
> > > > > > > The second question I had was around the FileEventReporter
> > > > > > > implementation. At a glance, "append to file" is a fairly cheap
> > > > > > > operation; do you have a concern that the amount of events is
> > > > > > > large enough to create a significant bottleneck on disk IO that
> > > > > > > requires a memory queue?
> > > > > >
> > > > > > Kind regards,
> > > > > >
> > > > > > Aleksandr Iushmanov
> > > > > >
> > > > > >
> > > > > > On 2025/08/19 06:56:36 Kartikey Pant wrote:
> > > > > >  > Hi everyone,
> > > > > >  >
> > > > > > >  > I'd like to propose a new FLIP that builds directly on the
> > > > > > >  > excellent foundation laid by FLIP-481 (Introduce Event
> > > > > > >  > Reporting). For anyone needing context, the original proposal
> > > > > > >  > is available here:
> > > > > > >  >
> > > > > > >  > https://cwiki.apache.org/confluence/display/FLINK/FLIP-481%3A+Introduce+Event+Reporting
> > > > > > >  >
> > > > > > >  > Now that the community has this powerful API, the logical next
> > > > > > >  > step is to ensure it's fully robust for large-scale production
> > > > > > >  > environments where users will be writing their own diverse,
> > > > > > >  > custom reporters.
> > > > > > >  >
> > > > > > >  > This proposal focuses on one key enhancement: introducing a
> > > > > > >  > resilient, asynchronous dispatch core. The goal is to decouple
> > > > > > >  > event generation from the reporter's execution, ensuring that
> > > > > > >  > a slow or experimental sink can never impact Flink's core
> > > > > > >  > stability.
> > > > > > >  >
> > > > > > >  > I've drafted a detailed design document that I hope can form
> > > > > > >  > the basis of this new FLIP:
> > > > > > >  >
> > > > > > >  > https://docs.google.com/document/d/1CCu7Js0ATOAgqRMS-kWj_0v0G_jt2r9IfMB2Oty7KJo/edit?usp=sharing
> > > > > > >  >
> > > > > > >  > I'm keen to get the community's initial feedback on this
> > > > > > >  > direction before moving forward with the formal process.
> > > > > >  >
> > > > > >  > Thanks,
> > > > > >  > Kartikey Pant
> > > > > >  >
> > > > > >
> > > > >
> > >
> >
>

