[ https://issues.apache.org/jira/browse/KAFKA-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bruno Cadonna reopened KAFKA-18067:
-----------------------------------
We had to revert the fix for this bug
(https://github.com/apache/kafka/pull/19078) because it introduced a blocking
bug for AK 4.0.
The issue is that the fix prevented Kafka Streams from re-initializing its
transactional producer under exactly-once semantics. That led to an infinite
loop of {{ProducerFencedException}}s with corresponding rebalances.
For example:
# 1 A network partition happens that causes a transaction to time out.
# 2 The transactional producer is fenced due to an invalid producer epoch.
# 3 Kafka Streams closes the tasks dirty and re-joins the group, i.e., a
rebalance is triggered.
# 4 The transactional producer is NOT re-initialized and does NOT get a new
producer epoch.
# 5 Processing starts but the transactional producer is immediately fenced
during the attempt to start a new transaction because of the invalid producer
epoch.
# 6 Step 3 is repeated, and the cycle continues indefinitely (see the
re-initialization sketch below).
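For readers less familiar with the EOS producer lifecycle, here is a minimal plain-producer sketch of the re-initialization that the reverted fix skipped. This is not the actual Kafka Streams internals; the broker address, topic, and {{transactional.id}} are placeholders. After a {{ProducerFencedException}}, the fenced instance can only be closed; a fresh producer plus {{initTransactions()}} is needed to obtain a new producer epoch.
{code:java}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.ProducerFencedException;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class FencedProducerRecovery {

    // Illustrative config; broker address and transactional.id are placeholders.
    private static Properties eosConfig() {
        final Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "demo-txn-id");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
        return props;
    }

    public static void main(final String[] args) {
        Producer<byte[], byte[]> producer = new KafkaProducer<>(eosConfig());
        producer.initTransactions();

        try {
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("demo-topic", new byte[0]));
            producer.commitTransaction();
        } catch (final ProducerFencedException fenced) {
            // The current epoch is permanently invalid; reusing this instance only
            // throws ProducerFencedException again (step 5 above). Close it and
            // create a fresh producer, whose initTransactions() call obtains a new
            // producer epoch. This is the re-initialization the reverted fix skipped.
            producer.close();
            producer = new KafkaProducer<>(eosConfig());
            producer.initTransactions();
        } finally {
            producer.close();
        }
    }
}
{code}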
> Kafka Streams can leak Producer client under EOS
> ------------------------------------------------
>
> Key: KAFKA-18067
> URL: https://issues.apache.org/jira/browse/KAFKA-18067
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Reporter: A. Sophie Blee-Goldman
> Assignee: TengYao Chi
> Priority: Major
> Labels: newbie, newbie++
> Fix For: 4.0.0
>
>
> Under certain conditions Kafka Streams can end up closing a producer client
> twice and creating a new one that then is never closed.
> During a StreamThread's shutdown, the TaskManager is closed first, through
> which the thread's producer client is also closed. Later on we call
> #unsubscribe on the main consumer, which can result in the #onPartitionsLost
> callback being invoked and ultimately trying to reset/reinitialize the
> StreamsProducer if EOS is enabled. This in turn includes closing the current
> producer and creating a new one. And since the current producer was already
> closed, we end up closing that client twice and never closing the newly
> created producer.
> Ideally we would just skip the reset/reinitialize process entirely when
> invoked during shutdown. This solves the two problems here (leaked client and
> double close), while also removing the unnecessary overhead of creating an
> entirely new client just to throw it away.
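The quoted shutdown sequence can be modeled with a purely illustrative holder class (hypothetical names, not the real {{StreamsProducer}}/{{TaskManager}} API) that shows the double close, the leaked replacement, and the guard the reporter suggests. Per the reopening comment at the top, any such guard must be scoped to shutdown only, so that re-initialization after fencing still happens.
{code:java}
import java.util.function.Supplier;

import org.apache.kafka.clients.producer.Producer;

class EosProducerHolder {

    private final Supplier<Producer<byte[], byte[]>> producerFactory;
    private Producer<byte[], byte[]> producer;
    private volatile boolean shuttingDown = false;

    EosProducerHolder(final Supplier<Producer<byte[], byte[]>> producerFactory) {
        this.producerFactory = producerFactory;
        this.producer = producerFactory.get();
        this.producer.initTransactions();
    }

    // Shutdown path: the TaskManager closes the thread's producer first.
    void close() {
        shuttingDown = true;
        producer.close();
    }

    // Reset path: reached via Consumer#unsubscribe -> #onPartitionsLost under EOS.
    void resetProducer() {
        if (shuttingDown) {
            // Skipping the reset here avoids closing the already-closed producer a
            // second time and avoids creating a replacement that nobody ever closes.
            return;
        }
        producer.close();
        producer = producerFactory.get();
        producer.initTransactions();
    }
}
{code}
The {{volatile}} flag stands in for whatever shutdown signal the real code uses; the important property is that the reset path is skipped only while the thread is shutting down, not after a fencing error, otherwise the fencing loop described in the reopening comment reappears.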