Hey, I've got a question regarding transaction failures in an EXACTLY_ONCE flow with Flink 1.15.3 and Confluent Cloud Kafka.
The case is that there is a FlinkKafkaProducer in an EXACTLY_ONCE setup with the default transaction.timeout.ms of 15 min. During processing, the job ran into issues that caused a checkpoint to time out, which in turn caused a transaction to fail with the following log:

    Unable to commit transaction (org.apache.flink.streaming.runtime.operators.sink.committables.CommitRequestImpl@5d0d5082) because its producer is already fenced. This means that you either have a different producer with the same 'transactional.id' (this is unlikely with the 'KafkaSink' as all generated ids are unique and shouldn't be reused) or recovery took longer than 'transaction.timeout.ms' (900000ms). In both cases this most likely signals data loss, please consult the Flink documentation for more details.

Up to this point everything is pretty clear. After that, however, the job continued to run normally, but every single transaction failed with:

    Unable to commit transaction (org.apache.flink.streaming.runtime.operators.sink.committables.CommitRequestImpl@5a924600) because it's in an invalid state. Most likely the transaction has been aborted for some reason. Please check the Kafka logs for more details.

This effectively stalls all downstream processing, because no transaction is ever committed. I've read through the docs and understand that this is a known limitation, since Kafka doesn't really support 2PC, but why doesn't it cause a failure and restart of the whole job? As it stands, the job appears to process everything normally and hides the issue until it has grown catastrophically.

Thanks in advance,
Cheers.
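For reference, a minimal sketch of the kind of sink configuration described above, using the Flink 1.15 KafkaSink builder API. The broker address, topic name, and transactional id prefix are placeholders, and the explicit transaction.timeout.ms shown here is the same 15-minute (900000 ms) default the job was running with:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

public class SinkConfigSketch {
    public static KafkaSink<String> buildSink() {
        return KafkaSink.<String>builder()
                // Placeholder bootstrap servers (Confluent Cloud endpoint in our case)
                .setBootstrapServers("broker:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("output-topic")                       // placeholder topic
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                // EXACTLY_ONCE makes the sink write inside Kafka transactions,
                // committed on checkpoint completion
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                // Required for EXACTLY_ONCE; prefix is a placeholder
                .setTransactionalIdPrefix("my-job")
                // Must be >= the maximum checkpoint duration, otherwise the broker
                // aborts the transaction and the producer gets fenced (900000 ms = 15 min)
                .setProperty("transaction.timeout.ms", "900000")
                .build();
    }
}
```

The failure mode described above kicks in when checkpointing (and hence the commit) takes longer than this timeout: the broker aborts the open transaction on its side, and subsequent commit attempts from the sink can no longer succeed.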