[ https://issues.apache.org/jira/browse/KAFKA-9803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17383576#comment-17383576 ]

Konstantine Karantasis commented on KAFKA-9803:
-----------------------------------------------

Postponing to the subsequent release given that this issue is not a blocker and did not make it on time for 3.0 code freeze.

> Allow producers to recover gracefully from transaction timeouts
> ---------------------------------------------------------------
>
>                 Key: KAFKA-9803
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9803
>             Project: Kafka
>          Issue Type: Improvement
>          Components: producer, streams
>            Reporter: Jason Gustafson
>            Assignee: Boyang Chen
>            Priority: Major
>              Labels: needs-kip
>             Fix For: 3.0.0
>
>
> Transaction timeouts are detected by the transaction coordinator. When the
> coordinator detects a timeout, it bumps the producer epoch and aborts the
> transaction. The epoch bump is necessary in order to prevent the current
> producer from being able to begin writing to a new transaction which was not
> started through the coordinator.
>
> Transactions may also be aborted if a new producer with the same
> `transactional.id` starts up. Similarly this results in an epoch bump.
>
> Currently the coordinator does not distinguish these two cases. Both will end
> up as a `ProducerFencedException`, which means the producer needs to shut
> itself down.
>
> We can improve this with the new APIs from KIP-360. When the coordinator
> times out a transaction, it can remember that fact and allow the existing
> producer to claim the bumped epoch and continue. Roughly the logic would work
> like this:
>
> 1. When a transaction times out, set lastProducerEpoch to the current epoch
> and do the normal bump.
> 2. Any transactional requests from the old epoch result in a new
> TRANSACTION_TIMED_OUT error code, which is propagated to the application.
> 3. The producer recovers by sending InitProducerId with the current epoch.
> The coordinator returns the bumped epoch.
>
> One issue that needs to be addressed is how to handle INVALID_PRODUCER_EPOCH
> from Produce requests. Partition leaders will not generally know if a bumped
> epoch was the result of a timed out transaction or a fenced producer.
> Possibly the producer can treat these errors as abortable when they come from
> Produce responses. In that case, the user would try to abort the transaction
> and then we can see if it was due to a timeout or otherwise.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
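
The three-step logic in the quoted description can be sketched as a small toy model. This is only an illustration under stated assumptions, not Kafka's coordinator code: the class `TimeoutRecoverySketch`, the `Coordinator` type, the `TxnError` enum, the `timedOut` flag, and all method names are hypothetical. Only `lastProducerEpoch`, `TRANSACTION_TIMED_OUT`, and the InitProducerId step come from the issue text.

```java
// Toy model of the proposed recovery flow. All types and method names here are
// hypothetical; this is not the Kafka transaction coordinator implementation.
final class TimeoutRecoverySketch {

    // TRANSACTION_TIMED_OUT is the new error code proposed in the issue.
    enum TxnError { NONE, TRANSACTION_TIMED_OUT, PRODUCER_FENCED }

    // Stand-in for the coordinator's state for a single transactional.id.
    static final class Coordinator {
        static final short NO_EPOCH = -1;

        private short producerEpoch;
        private short lastProducerEpoch = NO_EPOCH;
        private boolean timedOut = false; // assumed way to "remember that fact"

        Coordinator(short initialEpoch) {
            this.producerEpoch = initialEpoch;
        }

        // Step 1: remember the old epoch, then do the normal bump.
        void onTransactionTimeout() {
            lastProducerEpoch = producerEpoch;
            producerEpoch++;
            timedOut = true;
        }

        // A new producer with the same transactional.id also bumps the epoch,
        // but the old producer must not be allowed to recover in this case.
        void onNewProducerInstance() {
            producerEpoch++;
            timedOut = false;
        }

        // Step 2: requests from the timed-out epoch get the new error code
        // instead of being treated as fenced.
        TxnError handleTransactionalRequest(short requestEpoch) {
            if (requestEpoch == producerEpoch) {
                return TxnError.NONE;
            }
            if (timedOut && requestEpoch == lastProducerEpoch) {
                return TxnError.TRANSACTION_TIMED_OUT;
            }
            return TxnError.PRODUCER_FENCED;
        }

        // Step 3: the producer sends InitProducerId with the epoch it holds and
        // receives the bumped epoch; NO_EPOCH here stands for "fenced".
        short initProducerId(short requestEpoch) {
            boolean current = requestEpoch == producerEpoch;
            boolean recoverable = timedOut && requestEpoch == lastProducerEpoch;
            return (current || recoverable) ? producerEpoch : NO_EPOCH;
        }
    }

    public static void main(String[] args) {
        Coordinator coordinator = new Coordinator((short) 5);
        coordinator.onTransactionTimeout();

        // The old producer learns about the timeout, then claims the new epoch.
        System.out.println(coordinator.handleTransactionalRequest((short) 5));
        short recovered = coordinator.initProducerId((short) 5);
        System.out.println(coordinator.handleTransactionalRequest(recovered));

        // A genuinely new producer instance still fences the old one.
        coordinator.onNewProducerInstance();
        System.out.println(coordinator.handleTransactionalRequest(recovered));
    }
}
```

The sketch keeps the timeout and new-producer paths separate so that only the former yields a recoverable error. It does not model the open question about INVALID_PRODUCER_EPOCH from Produce responses, since partition leaders have no access to this coordinator state.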