Hello,

Sorry for the late reply.

Reading this FLIP, it isn't clear to me why this belongs in the Flink
ecosystem. Nothing here seems unique to Flink other than its implementation
of exactly-once semantics, and Flink isn't the only exactly-once producer.

I read the following from the mailing list.

> For exactly-once delivery, Flink acts as the orchestrator on top of
> Kafka's transactions, deciding when to commit based on checkpoint
> completion. The problem arises when Flink disappears (there is no
> running Flink job anymore) and nobody is left to tell Kafka's
> coordinator to commit or abort. Kafka's coordinator in this case is
> behaving as designed - it holds the transaction open because it has no
> way to know whether the data should be committed or aborted. That
> decision belongs to Flink, which is no longer running.

The decision belonged* to Flink. However, once we start discussing a CLI
for completing transactions, the decision no longer belongs to Flink: the
user running that CLI should decide what to do with the transaction, and
that boundary no longer requires Flink.

I would push back pretty hard on adding this maintenance burden to this
community. It seems to me that Kafka has an API limitation, and instead of
addressing it there we are pushing the responsibility onto Flink.

Before we add this to Flink, can we discuss further why it belongs in this
community? I see the `Use Kafka's built-in kafka-transactions.sh tool`
alternative in your FLIP, which is fine, but the FLIP doesn't discuss what
could be added or changed there.
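
To make the question concrete: as far as I know, Kafka's Admin API (KIP-664)
already lets an operator find and abort a hanging transaction, and
kafka-transactions.sh is a thin wrapper around it. What it does not offer is
a way to commit a transaction that a dead producer left open. A rough sketch
of what exists today (method names are from my memory of the Kafka 3.x Admin
API, so please treat the exact signatures as an assumption):

    // Sketch: inspect and abort a hanging Kafka transaction via the Admin API.
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AbortTransactionSpec;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TransactionListing;
    import org.apache.kafka.common.TopicPartition;

    public class AbortHangingTxn {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
            try (Admin admin = Admin.create(props)) {
                // List the transactions the coordinators currently know about.
                for (TransactionListing txn : admin.listTransactions().all().get()) {
                    System.out.printf("%s pid=%d state=%s%n",
                        txn.transactionalId(), txn.producerId(), txn.state());
                }
                // Abort one hanging transaction on a partition. The producer id,
                // epoch and coordinator epoch come from describeProducers /
                // describeTransactions; the values here are placeholders.
                admin.abortTransaction(new AbortTransactionSpec(
                        new TopicPartition("orders", 0), 4711L, (short) 0, 0))
                    .all().get();
                // There is no equivalent Admin call to COMMIT an open transaction
                // on behalf of a producer that is gone.
            }
        }
    }

If the missing piece is an admin-level commit, that feels like a gap to raise
with the Kafka community rather than something for a Flink-owned tool.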

Ryan van Huuksloot
Staff Engineer, Infrastructure | Streaming Platform


On Fri, Apr 24, 2026 at 7:44 AM Aleksandr Savonin <[email protected]>
wrote:

> Hi Shekhar and Hongshun,
> Thank you for the questions and for participating.
> Could you please let me know if you have any further questions? If not,
> I will start a voting thread.
>
> On Fri, 10 Apr 2026 at 14:07, Aleksandr Savonin <[email protected]>
> wrote:
> >
> > Hi Shekhar Prasad Rajak,
> >
> > > Don't we have a timeout and a session timeout on the Kafka side?
> >
> > Yes, Kafka does have a timeout, and when it fires the broker aborts the
> > transaction. However, if Flink successfully completed a checkpoint but
> > didn't commit the transaction (e.g. the JVM crashed before the commit
> > reached Kafka), the data must still be committed to satisfy the promises
> > of the 2PC protocol. Relying on the timeout can therefore lead to data
> > loss in this scenario.
> > Moreover, some Flink-Kafka setups use high values for the Kafka
> > transaction timeout. In that case, even transactions that must be aborted
> > (because they were never checkpointed) stay open until the timeout fires,
> > blocking the LSO. The tool provides a way to resolve this immediately.
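> >
> > To make this concrete, a typical exactly-once sink setup looks roughly
> > like the sketch below (builder method names are from memory of the
> > current flink-connector-kafka API, so please treat them as approximate).
> > If transaction.timeout.ms is raised to, say, one hour, an uncommitted
> > transaction left behind by a dead job can block the LSO for that whole
> > hour unless something resolves it earlier:
> >
> >     // Sketch of an exactly-once KafkaSink with an extended transaction timeout.
> >     import org.apache.flink.api.common.serialization.SimpleStringSchema;
> >     import org.apache.flink.connector.base.DeliveryGuarantee;
> >     import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
> >     import org.apache.flink.connector.kafka.sink.KafkaSink;
> >
> >     KafkaSink<String> sink = KafkaSink.<String>builder()
> >         .setBootstrapServers("broker:9092")
> >         .setRecordSerializer(KafkaRecordSerializationSchema.builder()
> >             .setTopic("orders")
> >             .setValueSerializationSchema(new SimpleStringSchema())
> >             .build())
> >         .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
> >         .setTransactionalIdPrefix("my-job")
> >         // Must not exceed the broker's transaction.max.timeout.ms
> >         // (15 minutes by default); raising both is what keeps
> >         // abandoned transactions open for so long.
> >         .setProperty("transaction.timeout.ms", "3600000")
> >         .build();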
> >
> > Kind regards,
> > Aleksandr
> >
> > > On Wed, 8 Apr 2026 at 05:34, Shekhar Rajak <[email protected]>
> > > wrote:
> > >
> > > Thanks for clarifying, but
> > > > it holds the transaction open because it has no way to know whether
> > > > the data should be committed or aborted. That decision belongs to
> > > > Flink, which is no longer running.
> > >
> > > Don't we have a timeout and a session timeout on the Kafka side? After
> > > enabling 2PC we must be using those configs to make sure the Kafka
> > > broker releases the locks and durability is preserved.
> > >
> > > Regards,
> > > Shekhar Prasad Rajak
> > >
> > >
> > >     On Monday 30 March 2026 at 04:34:07 pm GMT+5:30, Aleksandr Savonin
> > > <[email protected]> wrote:
> > >
> > >  Hi Shekhar Rajak,
> > > Good questions.
> > > Let me clarify the transaction coordination model and why a
> > > Flink-specific tool is needed.
> > >
> > > > Curious to know: if we debug and analyse the root cause and try to
> > > > fix/enhance things on the Kafka transaction coordinator side, will
> > > > this issue be resolved?
> > >
> > >
> > > Kafka has a built-in Transaction Coordinator, but it only manages the
> > > Kafka protocol layer: it executes a commit or abort when it is told to.
> > > For exactly-once delivery, Flink acts as the orchestrator on top of
> > > Kafka's transactions, deciding when to commit based on checkpoint
> > > completion. The problem arises when Flink disappears (there is no
> > > running Flink job anymore) and nobody is left to tell Kafka's
> > > coordinator to commit or abort. Kafka's coordinator in this case is
> > > behaving as designed - it holds the transaction open because it has no
> > > way to know whether the data should be committed or aborted. That
> > > decision belongs to Flink, which is no longer running.
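> > >
> > > As a rough illustration of that split (Kafka executes, Flink decides),
> > > here is a sketch using plain KafkaProducer calls standing in for what
> > > the Flink Kafka sink does internally; it is not the actual sink code:
> > >
> > >     import java.util.Properties;
> > >     import org.apache.kafka.clients.producer.KafkaProducer;
> > >     import org.apache.kafka.clients.producer.ProducerConfig;
> > >     import org.apache.kafka.clients.producer.ProducerRecord;
> > >     import org.apache.kafka.common.serialization.StringSerializer;
> > >
> > >     public class TwoPhaseCommitSketch {
> > >         public static void main(String[] args) {
> > >             Properties props = new Properties();
> > >             props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
> > >             props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "flink-job-txn-0");
> > >             props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
> > >             props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
> > >             try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
> > >                 producer.initTransactions();
> > >                 producer.beginTransaction();
> > >                 producer.send(new ProducerRecord<>("orders", "key", "value"));
> > >                 // Phase 1 (on checkpoint): flush so the open transaction holds the data.
> > >                 producer.flush();
> > >                 // Phase 2 (on checkpoint completion): Flink, not Kafka, decides to commit.
> > >                 producer.commitTransaction();
> > >                 // If the job dies between the two phases, the coordinator keeps the
> > >                 // transaction open: it cannot know how the checkpoint ended.
> > >             }
> > >         }
> > >     }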
> > >
> > > > Do we really need to think about a transaction management tool
> > > > specifically for Kafka?
> > >
> > >
> > > The Kafka sink is one of the most widely used Flink connectors, and
> > > exactly-once delivery with Kafka transactions is a primary production
> > > use case in some companies.
> > >
> > > --
> > > Kind regards,
> > > Aleksandr
> > >
> >
> >
> >
> > --
> > Kind regards,
> > Aleksandr
>
>
>
> --
> Kind regards,
> Aleksandr
>
