Hi all,

After looking for an answer / some discussion on this matter on the community 
Slack and on 
StackOverflow<https://stackoverflow.com/questions/70335773/kafka-streams-apps-threads-fail-transaction-and-are-fenced-and-restarted-after-k>, 
this mailing list is my last hope :-)

We are noticing that our streams app threads sometimes fail their transaction 
and get fenced after a broker restart. After the broker has started up again, 
the streams apps log either an InvalidProducerEpochException ("Producer 
attempted to produce with an old epoch") or a ProducerFencedException ("There 
is a newer producer with the same transactionalId which fences the current 
one"). After these exceptions the thread dies and gets restarted, which causes 
rebalancing and a delay in processing for the partitions assigned to that 
thread.
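
For reference, the threads get replaced via the uncaught exception handler 
that was added in Streams 2.8, set to REPLACE_THREAD; roughly this wiring 
(the logging is illustrative, and streams is our KafkaStreams instance):

import org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler;

streams.setUncaughtExceptionHandler(exception -> {
    // Log the fatal exception and let Streams spin up a replacement thread;
    // the replacement is what triggers the rebalance and processing delay.
    System.err.println("Stream thread died: " + exception);
    return StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse.REPLACE_THREAD;
});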

Some more details on our setup:

  1.  We use Kafka 2.8 (Confluent Platform 6.2) for the brokers and 2.8.1 for 
the streams apps.
  2.  To ensure smooth broker restarts we use controlled shutdown for our 
brokers, and we restart them one by one, waiting for all partitions to be back 
in sync before restarting the next broker.
  3.  We use three brokers, with min in-sync replicas set to 2. As far as I 
know this should allow broker restarts that don't affect clients, given the 
restart procedure from point 2.
  4.  The streams apps are configured with a group instance id and a session 
timeout that allow for smooth restarts of the apps themselves (see the 
configuration sketch right after this list).
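
For reference, the relevant part of the streams configuration looks roughly 
like this (application id, bootstrap servers and the exact timeout value are 
illustrative; exactly-once processing is what makes the producers 
transactional in the first place):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "our-streams-app");  // illustrative
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");  // illustrative
// Exactly-once processing is what makes the producers transactional
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
// Static membership plus a longer session timeout so restarts of the apps
// themselves don't immediately trigger a rebalance (point 4); values illustrative
props.put(StreamsConfig.mainConsumerPrefix(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG), "our-app-instance-1");
props.put(StreamsConfig.mainConsumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), "60000");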

In the logs we notice that during broker shutdown the clients log 
NOT_LEADER_OR_FOLLOWER errors (to be expected while partition leadership is 
being migrated). Then we see heartbeats failing (also expected, because the 
broker is shutting down and group coordination is being migrated). Then we see 
the discovery of a new group coordinator (expected, although it bounces back 
and forth a few times between the old and new coordinator, which I didn't 
expect). Finally the app stabilizes with a new group coordinator.

Then, after the broker starts up again, we see the clients log 
FETCH_SESSION_ID_NOT_FOUND errors for the starting broker. The starting broker 
is rediscovered as a transaction coordinator. Shortly after that, the 
InvalidProducerEpochExceptions and ProducerFencedExceptions occur for some 
streams app threads, causing those threads to be fenced and restarted.

What could be the reason for this happening? My first guess would be that the 
starting broker takes over as transaction coordinator before it has fully 
synced its transaction state with the in-sync brokers. Such a difference in 
transaction state could explain why the starting broker disagrees on the 
current producer epochs and/or transactional IDs.

Does anyone with more knowledge on this topic have an idea what could be 
causing the exceptions? Or how we could get more information on what's going 
on here?
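
One way we could perhaps dig further is to dump the leader and ISR of the 
__transaction_state topic around a restart, to see which broker is 
coordinating transactions at that point and whether the ISR is still shrunken. 
A rough sketch with the admin client (bootstrap address illustrative):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class TxnStateIsrCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");  // illustrative
        try (Admin admin = Admin.create(props)) {
            TopicDescription description = admin
                    .describeTopics(Collections.singleton("__transaction_state"))
                    .all().get().get("__transaction_state");
            // Print leader and ISR per partition of the transaction state topic
            // to see where transaction coordination sits around the restart
            description.partitions().forEach(p ->
                    System.out.printf("partition %d leader=%s isr=%s%n",
                            p.partition(), p.leader(), p.isr()));
        }
    }
}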

Best regards and thank you in advance!

Pieter Hameete
