[ https://issues.apache.org/jira/browse/KAFKA-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17777507#comment-17777507 ]
Justine Olshan commented on KAFKA-14402:
----------------------------------------

Hi [~twmb]. Sorry if this was unclear. The plan was that once the produce version is bumped to support it, we will no longer need addPartitionsToTxn calls from the client. Clients should continue to send v3 calls until this client/produce change is made. When the produce version is bumped, the broker can send v4 calls to add partitions without the client needing to.

In other words, whether the client must send addPartitionsToTxn calls depends on the produce request version (that is, on the completion of part 2), not on the addPartitionsToTxn version.

> Transactions Server Side Defense
> --------------------------------
>
>                 Key: KAFKA-14402
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14402
>             Project: Kafka
>          Issue Type: Improvement
>    Affects Versions: 3.5.0
>            Reporter: Justine Olshan
>            Assignee: Justine Olshan
>            Priority: Major
>
> We have seen hanging transactions in Kafka where the last stable offset (LSO)
> does not update, we can’t clean the log (if the topic is compacted), and
> read_committed consumers get stuck.
> This can happen when a message gets stuck or delayed due to networking issues
> or a network partition, the transaction aborts, and then the delayed message
> finally comes in. The delayed message case can also violate EOS if the
> delayed message comes in after the next addPartitionsToTxn request comes in.
> Effectively we may see a message from a previous (aborted) transaction become
> part of the next transaction.
> Another way hanging transactions can occur is that a client is buggy and may
> somehow try to write to a partition before it adds the partition to the
> transaction. In both of these cases, we want the server to have some control
> to prevent these incorrect records from being written and either causing
> hanging transactions or violating exactly-once semantics (EOS) by including
> records in the wrong transaction.
> The best way to avoid this issue is to:
> # *Uniquely identify transactions by bumping the producer epoch after every
> commit/abort marker. That way, each transaction can be identified by
> (producer id, epoch).*
> # *Remove the addPartitionsToTxn call and implicitly just add partitions
> to the transaction on the first produce request during a transaction.*
> We avoid the late arrival case because the transaction is uniquely identified
> and fenced AND we avoid the buggy client case because we remove the need for
> the client to explicitly add partitions to begin the transaction.
> Of course, 1 and 2 require client-side changes, so for older clients, those
> approaches won’t apply.
> 3. *To cover older clients, we will ensure a transaction is ongoing before we
> write to a transaction. We can do this by querying the transaction
> coordinator and caching the result.*
>
> See KIP-890 for more information:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
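
As a rough illustration of approaches 1 and 3 from the issue description, here is a toy, in-memory Python sketch. It is not Kafka code: `ToyCoordinator`, `ToyPartitionLeader`, `demo`, and both exception names are hypothetical, it tracks a single producer, and it greatly simplifies how a partition leader learns the current epoch. It only shows a late write being rejected after an abort, either by epoch fencing (newer clients) or by a cached "is a transaction ongoing?" check against the coordinator (older clients).

```python
class FencedError(Exception):
    """The write carried a producer epoch older than the latest marker."""


class InvalidTxnStateError(Exception):
    """The coordinator reports no ongoing transaction for this partition."""


class ToyCoordinator:
    """Hypothetical stand-in for the transaction coordinator."""

    def __init__(self, bump_epoch):
        self.bump_epoch = bump_epoch  # True models approach 1 (newer clients)
        self.epoch = 0
        self.ongoing = {}             # partition name -> leader in the open txn
        self.lookups = 0              # how often a leader had to ask us

    def add_partition(self, leader):
        self.ongoing[leader.partition] = leader

    def is_ongoing(self, partition):
        self.lookups += 1
        return partition in self.ongoing

    def end_txn(self):
        # Approach 1: bump the epoch on every commit/abort marker.
        if self.bump_epoch:
            self.epoch += 1
        for leader in self.ongoing.values():
            leader.write_marker(self.epoch)
        self.ongoing.clear()


class ToyPartitionLeader:
    """Hypothetical stand-in for the broker that leads one partition."""

    def __init__(self, coordinator, partition):
        self.coordinator = coordinator
        self.partition = partition
        self.latest_epoch = 0
        self.verified = False  # cached result of the coordinator check
        self.log = []

    def append(self, epoch, value):
        if epoch < self.latest_epoch:
            raise FencedError(self.partition)
        # Approach 3: verify once per transaction, then reuse the cached result.
        if not self.verified:
            if not self.coordinator.is_ongoing(self.partition):
                raise InvalidTxnStateError(self.partition)
            self.verified = True
        self.log.append((epoch, value))

    def write_marker(self, next_epoch):
        # The marker ends the transaction, so the cached check no longer holds.
        self.latest_epoch = next_epoch
        self.verified = False


def demo(bump_epoch):
    coord = ToyCoordinator(bump_epoch)
    leader = ToyPartitionLeader(coord, "topic-0")
    coord.add_partition(leader)
    leader.append(0, "a")
    leader.append(0, "b")  # served from the cached verification
    coord.end_txn()        # the transaction aborts
    try:
        leader.append(0, "late")  # delayed write from the aborted transaction
    except (FencedError, InvalidTxnStateError) as err:
        return type(err).__name__, leader.log, coord.lookups
    return "accepted", leader.log, coord.lookups
```

In this toy model, `demo(True)` rejects the late write through the epoch check without a second coordinator lookup, while `demo(False)` rejects it only because the coordinator reports no ongoing transaction. If the partition had already been re-added for the next transaction, the `bump_epoch=False` path would accept the late write, which matches the issue's point that approach 3 is a fallback for older clients and that unique (producer id, epoch) identification is what fences late arrivals.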