Hey,
I've got a question regarding transaction failures in an EXACTLY_ONCE flow
with Flink 1.15.3 and Confluent Cloud Kafka.

The case is that there is a FlinkKafkaProducer in an EXACTLY_ONCE setup with
the default *transaction.timeout.ms* of 15 min.
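For context, the sink is configured roughly along these lines (a simplified
sketch, not the actual code; bootstrap servers, topic and transactional id
prefix are placeholders, and it's shown with the newer KafkaSink builder,
since that's what the log messages below refer to):

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

public class SinkConfigSketch {

    // Roughly how the sink is built; values in <...> are placeholders.
    static KafkaSink<String> buildSink() {
        return KafkaSink.<String>builder()
                .setBootstrapServers("<confluent-cloud-bootstrap-servers>")
                .setRecordSerializer(
                        KafkaRecordSerializationSchema.builder()
                                .setTopic("<output-topic>")
                                .setValueSerializationSchema(new SimpleStringSchema())
                                .build())
                // One Kafka transaction per checkpoint, committed on checkpoint completion.
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                // Required for EXACTLY_ONCE; the prefix here is a placeholder.
                .setTransactionalIdPrefix("<job-name>")
                // Matches the 900000 ms (15 min) value that shows up in the error log.
                .setProperty("transaction.timeout.ms", "900000")
                .build();
    }
}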

During processing the job ran into issues that caused a checkpoint to time
out, which in turn caused a transaction to fail with the following log:
Unable to commit transaction
(org.apache.flink.streaming.runtime.operators.sink.committables.CommitRequestImpl@5d0d5082)
because its producer is already fenced. This means that you either have a
different producer with the same 'transactional.id' (this is unlikely with
the 'KafkaSink' as all generated ids are unique and shouldn't be reused) or
recovery took longer than 'transaction.timeout.ms' (900000ms). In both
cases this most likely signals data loss, please consult the Flink
documentation for more details.
Up to this point everything is pretty clear. After that, however, the job
continued to run normally, but every single transaction failed with:
Unable to commit transaction
(org.apache.flink.streaming.runtime.operators.sink.committables.CommitRequestImpl@5a924600)
because it's in an invalid state. Most likely the transaction has been
aborted for some reason. Please check the Kafka logs for more details.
This effectively stalls all downstream processing, because no transaction is
ever committed.

I've read through the docs and understand that this is a known issue,
stemming from the fact that Kafka doesn't really support 2PC, but why
doesn't it cause the whole job to fail and restart? Currently the job keeps
processing as if nothing were wrong and hides the issue until it has grown
catastrophic.

Thanks in advance,
Cheers.
