Justine Olshan created KAFKA-14402:
--------------------------------------
Summary: Transactions Server Side Defense
Key: KAFKA-14402
URL: https://issues.apache.org/jira/browse/KAFKA-14402
Project: Kafka
Issue Type: Task
Reporter: Justine Olshan
Assignee: Justine Olshan
We have seen hanging transactions in Kafka where the last stable offset (LSO)
does not update, we can’t clean the log (if the topic is compacted), and
read_committed consumers get stuck.
This can happen when a message gets stuck or delayed due to networking issues
or a network partition, the transaction aborts, and then the delayed message
finally comes in. The delayed message case can also violate EOS if the delayed
message comes in after the next addPartitionsToTxn request comes in.
Effectively we may see a message from a previous (aborted) transaction become
part of the next transaction.
Another way hanging transactions can occur is that a client is buggy and may
somehow try to write to a partition before it adds the partition to the
transaction. In both of these cases, we want the server to have some control to
prevent these incorrect records from being written and either causing hanging
transactions or violating Exactly once semantics (EOS) by including records in
the wrong transaction.
The best way to avoid this issue is to:
# *Uniquely identify transactions by bumping the producer epoch after every
commit/abort marker. That way, each transaction can be identified by (producer
id, epoch).*
# {*}Remove the addPartitionsToTxn call and implicitly just add partitions to
the transaction on the first produce request during a transaction{*}.
We avoid the late arrival case because the transaction is uniquely identified
and fenced AND we avoid the buggy client case because we remove the need for
the client to explicitly add partitions to begin the transaction.
Of course, 1 and 2 require client-side changes, so for older clients, those
approaches won’t apply.
3. *To cover older clients, we will ensure a transaction is ongoing before we
write to a transaction. We can do this by querying the transaction coordinator
and caching the result.*
See KIP-890 for more information: **
https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense
--
This message was sent by Atlassian Jira
(v8.20.10#820010)