Re: [DISCUSS] FLIP-545: Hardening the Event Reporter with an Asynchronous Core

Kartikey Pant Tue, 28 Apr 2026 01:33:48 -0700

Hi all,

Apologies for the long pause. Picking FLIP-545 back up.


The Google doc [1] has the latest version, incorporating feedback from
Aleksandr, Li, and David from the earlier rounds.

I don't have Cwiki write access - could a committer please help sync the
doc to [2]? Once that's done, I'll give until end of next week for any
final concerns before starting the [VOTE].

Thanks,
Kartikey


[1]
https://docs.google.com/document/d/1CCu7Js0ATOAgqRMS-kWj_0v0G_jt2r9IfMB2Oty7KJo/edit?usp=sharing
<https://docs.google.com/document/d/1CCu7Js0ATOAgqRMS-kWj_0v0G_jt2r9IfMB2Oty7KJo/edit?usp=sharing>
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-545%3A+Hardening+the+Event+Reporter+with+an+Asynchronous+Core



On Thu, Nov 13, 2025 at 12:57 PM Kartikey Pant <[email protected]>
wrote:

> Hi Li and David,
>
> Thank you both for the great questions.
>
> Li - Thanks for the support. To your question: Yes, the circuit breaker is
> 100% per-reporter*. *A faulty sink (one that throws errors) will trip its
> own breaker. A slow sink (one that just blocks) will simply be isolated to
> a thread in the worker pool. In either case, it will not block or affect
> dispatching to any other configured reporters. This isolation is the core
> of the new design.
>
> David - Thank you for the detailed operational questions.
>
>
>    1. *On Benchmarks*: The primary benefit is JobManager stability. The
>    benchmark is qualitative: moving the addEvent() call from a blocking I/O
>    operation (in sync mode) to a non-blocking, near-constant-time
>    queue.offer() (in async-queued mode). The integration tests are
>    specifically designed to validate that the caller thread (i.e., JobManager)
>    remains unblocked, even when one reporter is slow. On the
>    quantitative impact of async vs sync, I would be adding a benchmark of the
>    same post implementation.
>    2. *On Dropped Events*: You are correct. This is the explicit
>    tradeoff: the async dispatcher prioritizes JM stability over guaranteed
>    event delivery. The sync dispatcher remains the default for users who need
>    guaranteed, blocking delivery. The new metrics (events.droppedCount,
>    events.reporter.<name>.latency, events.reporter.<name>.failureCount) are
>    the diagnostic tools for operators to identify and resolve the source of
>    backpressure.
>    3. *On the Queue (Sync vs. Async):* In sync mode, there is no queue;
>    the JM thread is the "queue" and blocks. In async mode, we use a bounded
>    in-memory queue, configurable via events.dispatcher.queue.size (default:
>    1024). We also provide the events.dispatcher.queue.currentUsage gauge,
>    which allows operators to set up their own "90% full" alerts as you
>    suggested.
>    4. *On Using Kafka*: In my opinion, this would be overkill for the
>    dispatcher. The goal is lightweight thread isolation. If a user needs
>    Kafka's durability, they should implement a KafkaEventReporter. This FLIP
>    makes that pattern safe, as the reporter's blocking calls will be isolated
>    to the worker pool and won't affect the JM or other reporters.
>
> Thanks,
> Kartikey
>
>
> On Mon, Nov 10, 2025 at 5:44 PM David Radley <[email protected]>
> wrote:
>
>> Hi Li,
>> Some comments on the Flip:
>>
>>   *
>> Have you created a benchmark for this to showcase the benefits of the
>> single thread and the asynchronous approach. I assume we can get more
>> throughout, it would be good to quantify the improvement
>>   *
>> You mention counting dropped events, in back pressure situations, so this
>> mean that the metrics will incomplete. It would be good to see a picture of
>> this happening and what the user can do in this situation.
>>   *
>> You talk about the dispatch queue becoming full. Can you detail what you
>> mean here and how it is different in the sync and async cases? How big is
>> the queue and can it be increased ? Can we put out warnings around 90% full?
>>   *
>> Would a Kafka queue be overkill / possible to solve this full queue
>> problem? The separation between production and consumption that Kafka
>> brings seems applicable here.
>>
>> Kind regards, David.
>>
>> From: Li Wang <[email protected]>
>> Date: Monday, 10 November 2025 at 11:45
>> To: [email protected] <[email protected]>
>> Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX: Hardening the Event Reporter
>> with an Asynchronous Core
>>
>> Hi all,
>>
>> Just want to check on this thread.
>>
>> I think FLIP-545 is important work. The new async dispatcher will help
>> Flink stability a lot when we use custom reporters - it helps unblock the
>> JobManager.
>>
>> Kartikey, I have a quick question on the circuit breaker logic. Is the
>> state managed per-reporter (so each reporter is isolated), or will one
>> faulty reporter potentially stop dispatches for all reporters? This is a
>> key detail for our setup.
>>
>> Ready to see the [VOTE] start soon. Thank you for the FLIP.
>>
>> Thanks,
>> Li
>>
>> On Wed, Oct 1, 2025 at 12:32 PM Kartikey Pant <[email protected]
>> >
>> wrote:
>>
>> > Hi all,
>> >
>> > Circling back on this thread.
>> >
>> > Thanks to the great feedback from the earlier discussion, the proposal
>> has
>> > been updated to use a more flexible, interface-based design. The final
>> FLIP
>> > is available on the Cwiki [1] (thanks, Piotr, for creating the page).
>> >
>> > My intention is to move this to a formal vote next week.
>> >
>> > Before I do, please raise any blocking concerns by this Friday, October
>> > 3rd. If there are no blocking issues, I will start the [VOTE] thread on
>> > Monday.
>> >
>> > Thanks,
>> > Kartikey
>> >
>> > [1]
>> >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-545%3A+Hardening+the+Event+Reporter+with+an+Asynchronous+Core
>> >
>> >
>> > On Tue, Sep 2, 2025 at 5:00 PM Piotr Nowojski <[email protected]>
>> > wrote:
>> >
>> > > Hi,
>> > >
>> > > Here you go:
>> > >
>> > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-545%3A+Hardening+the+Event+Reporter+with+an+Asynchronous+Core
>> > >
>> > > Best,
>> > > Piotrek
>> > >
>> > > pon., 1 wrz 2025 o 19:37 Kartikey Pant <[email protected]>
>> > > napisał(a):
>> > >
>> > > > Hi all,
>> > > >
>> > > > Thanks, Aleksandr, for the great suggestion on using an
>> > > > interface-based strategy. It's a much cleaner approach that ensures
>> > > > backward compatibility while keeping the design extensible.
>> > > >
>> > > > Based on this feedback, I've updated the FLIP document. The design
>> now
>> > > > uses an EventDispatcher interface, controlled by a single
>> > > > events.dispatcher.type config key, allowing users to opt-in to the
>> new
>> > > > asynchronous behavior.
>> > > >
>> > > > I believe the proposal has now stabilized. As I don't have
>> Confluence
>> > > > write access, could a committer please help assign an official FLIP
>> > > > number this:
>> > > >
>> > >
>> >
>> https://docs.google.com/document/d/1CCu7Js0ATOAgqRMS-kWj_0v0G_jt2r9IfMB2Oty7KJo/edit?tab=t.0
>> > > >
>> > > > Best,
>> > > > Kartikey Pant
>> > > >
>> > > >
>> > > > On Tue, Aug 26, 2025 at 11:13 PM Aleksandr Iushmanov
>> > > > <[email protected]> wrote:
>> > > > >
>> > > > > Hi Kartikey,
>> > > > >
>> > > > > Thank you for looking into this.
>> > > > >
>> > > > > I might not be very familiar with the naming conventions in Flink,
>> > > > > so please bear with me if my suggestion doesn't make complete
>> sense.
>> > > > > I suggest introducing a feature flag, something like:
>> > > > >
>> > > > > > events.reporter.<name>.dispatcher.type
>> > > > >
>> > > > > which would default to *sync* to make this change backwards
>> > compatible.
>> > > > >
>> > > > > Also, are there any reasons why we would not want to introduce an
>> > > > > interface with two implementations?
>> > > > > 1. sync: for the existing behaviour.
>> > > > > 2. memory-queue: for the proposed implementation with the queue.
>> > > > >
>> > > > > This way:
>> > > > >
>> > > > >    - we don't break anything by default
>> > > > >    - we can change the default in future releases once it has been
>> > > proven
>> > > > >    to be stable
>> > > > >    - we keep the door open for other implementations (e.g.
>> file-based
>> > > > queue
>> > > > >    or spillover to logs).
>> > > > >
>> > > > >
>> > > > > I look forward to hearing your thoughts on it.
>> > > > >
>> > > > > Kind regards,
>> > > > > Aleksandr Iushmanov
>> > > > >
>> > > > >
>> > > > > On Fri, 22 Aug 2025 at 09:54, Kartikey Pant <
>> > > [email protected]>
>> > > > > wrote:
>> > > > >
>> > > > > > Hi Aleksandr,
>> > > > > >
>> > > > > > Thanks for the great feedback. Your points on guaranteed
>> delivery
>> > and
>> > > > the
>> > > > > > *FileEventsReporter* are spot on, and I agree with your
>> reasoning.
>> > > I'll
>> > > > > > update the FLIP to incorporate them, as it will make the
>> proposal
>> > > much
>> > > > > > stronger.
>> > > > > >
>> > > > > > Regarding the delivery guarantee, I'll add a new configuration
>> key,
>> > > > > > *events.reporter.<name>.delivery.guarantee*, to allow a choice
>> > > between
>> > > > two
>> > > > > > modes. The default will be best-effort for the asynchronous,
>> > > > non-blocking
>> > > > > > dispatch. I'll also add a guaranteed mode for a synchronous,
>> > blocking
>> > > > > > dispatch that bypasses the queue, perfect for the critical
>> > > autoscaling
>> > > > use
>> > > > > > case you mentioned.
>> > > > > >
>> > > > > > On your question about the *FileEventsReporter*, you're right
>> that
>> > a
>> > > > local
>> > > > > > file append is cheap. The async core isn't really designed for
>> the
>> > > > > > *FileEventsReporter* specifically, but for the general case
>> where
>> > > > reporters
>> > > > > > write to network sinks (e.g., *OpenTelemetry*) where latency and
>> > > > > > backpressure are real concerns. The file reporter is just meant
>> to
>> > > be a
>> > > > > > simple, built-in option for users.
>> > > > > >
>> > > > > > I'll get these changes into the design doc shortly and will
>> follow
>> > up
>> > > > on
>> > > > > > this thread once it's updated. Thanks again for helping improve
>> the
>> > > > FLIP.
>> > > > > >
>> > > > > > Best,
>> > > > > > Kartikey
>> > > > > >
>> > > > > > On Thu, Aug 21, 2025 at 11:19 PM Aleksandr Iushmanov <
>> > > > [email protected]>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Hi Kartikey,
>> > > > > > >
>> > > > > > > I like the idea and I agree with general direction, thank you
>> for
>> > > > > > > putting it together!
>> > > > > > >
>> > > > > > > I have one concern about making this modification "forced",
>> imho
>> > > > there
>> > > > > > > should be a room for "guaranteed important events delivery"
>> from
>> > > the
>> > > > > > > operations point of view. If Flink job is
>> > struggling/backpressured
>> > > it
>> > > > > > > may make sense to emit some events at priority that would be
>> used
>> > > for
>> > > > > > > external triggers like "autoscaling" or external dynamic
>> > > > configuration
>> > > > > > > tuning.
>> > > > > > >
>> > > > > > > Imho, interfaces should either allow to choose "sync" vs "non
>> > > > guaranteed
>> > > > > > > async" delivery for different events (or event reporters).
>> With
>> > > > proposal
>> > > > > > > "as is" it won't be possible to "ensure" that important
>> messages
>> > > have
>> > > > > > > been delivered and can be actioned by external monitoring
>> system.
>> > > > Could
>> > > > > > > we make "queue / async" behaviour opt-in?
>> > > > > > > Second question I had was around FileEventReporter
>> > implementation,
>> > > > at a
>> > > > > > > glance, "append to file" is a fairly cheap operation, do you
>> > have a
>> > > > > > > concern that amount of events is large enough to have
>> significant
>> > > > > > > bottleneck on disk IO and requires memory queue?
>> > > > > > >
>> > > > > > > Kind regards,
>> > > > > > >
>> > > > > > > Aleksandr Iushmanov
>> > > > > > >
>> > > > > > >
>> > > > > > > On 2025/08/19 06:56:36 Kartikey Pant wrote:
>> > > > > > >  > Hi everyone,
>> > > > > > >  >
>> > > > > > >  > I'd like to propose a new FLIP that builds directly on the
>> > > > excellent
>> > > > > > >  > foundation laid by FLIP-481 (Introduce Event Reporting).
>> For
>> > > > anyone
>> > > > > > >  > needing context, the original proposal is available here:
>> > > > > > >  >
>> > > > > > >
>> > > > > > >
>> > > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-481%3A+Introduce+Event+Reporting
>> > > > > > >  >
>> > > > > > >  > Now that the community has this powerful API, the logical
>> next
>> > > > step is
>> > > > > > >  > to ensure it's fully robust for large-scale production
>> > > > environments
>> > > > > > >  > where users will be writing their own diverse, custom
>> > reporters.
>> > > > > > >  >
>> > > > > > >  > This proposal focuses on one key enhancement: introducing a
>> > > > resilient,
>> > > > > > >  > asynchronous dispatch core. The goal is to decouple event
>> > > > generation
>> > > > > > >  > from the reporter's execution, ensuring that a slow or
>> > > > experimental
>> > > > > > >  > sink can never impact Flink's core stability.
>> > > > > > >  >
>> > > > > > >  > I've drafted a detailed design document that I hope can
>> form
>> > the
>> > > > basis
>> > > > > > >  > of this new FLIP:
>> > > > > > >  >
>> > > > > > >
>> > > > > > >
>> > > > > >
>> > > >
>> > >
>> >
>> https://docs.google.com/document/d/1CCu7Js0ATOAgqRMS-kWj_0v0G_jt2r9IfMB2Oty7KJo/edit?usp=sharing
>> > > > > > >  >
>> > > > > > >  > I'm keen to get the community's initial feedback on this
>> > > direction
>> > > > > > >  > before moving forward with the formal process.
>> > > > > > >  >
>> > > > > > >  > Thanks,
>> > > > > > >  > Kartikey Pant
>> > > > > > >  >
>> > > > > > >
>> > > > > >
>> > > >
>> > >
>> >
>>
>> Unless otherwise stated above:
>>
>> IBM United Kingdom Limited
>> Registered in England and Wales with number 741598
>> Registered office: Building C, IBM Hursley Office, Hursley Park Road,
>> Winchester, Hampshire SO21 2JN
>>
>

Re: [DISCUSS] FLIP-545: Hardening the Event Reporter with an Asynchronous Core

Reply via email to