Justine Olshan created KAFKA-14402:
--------------------------------------

             Summary: Transactions Server Side Defense
                 Key: KAFKA-14402
                 URL: https://issues.apache.org/jira/browse/KAFKA-14402
             Project: Kafka
          Issue Type: Task
            Reporter: Justine Olshan
            Assignee: Justine Olshan


We have seen hanging transactions in Kafka, where the last stable offset (LSO) 
does not advance, the log cannot be cleaned (if the topic is compacted), and 
read_committed consumers get stuck.

This can happen when a message is stuck or delayed by networking issues or a 
network partition, the transaction aborts, and the delayed message then 
finally arrives. The delayed message can also violate exactly-once semantics 
(EOS) if it arrives after the next addPartitionsToTxn request. In effect, a 
message from a previous (aborted) transaction may become part of the next 
transaction.

Hanging transactions can also occur when a buggy client writes to a partition 
before it adds that partition to the transaction. In both of these cases, we 
want the server to have enough control to prevent these incorrect records from 
being written and either causing hanging transactions or violating 
exactly-once semantics (EOS) by including records in the wrong transaction.

The best way to avoid this issue is to:
1. *Uniquely identify transactions by bumping the producer epoch after every 
commit/abort marker. That way, each transaction can be identified by (producer 
id, epoch).*

2. *Remove the addPartitionsToTxn call and implicitly add partitions to the 
transaction on the first produce request during a transaction.*

We avoid the late-arrival case because each transaction is uniquely identified 
and fenced, and we avoid the buggy-client case because the client no longer 
needs to add partitions explicitly to begin the transaction.
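
The two ideas above can be illustrated with a small toy model. This is only a 
sketch under stated assumptions, not Kafka's implementation: ToyTxnState, 
FencedError, and their methods are hypothetical names invented for this 
example, and the state is kept for a single producer id.

```python
from dataclasses import dataclass, field


class FencedError(Exception):
    """Raised when a record carries an epoch older than the current one."""


@dataclass
class ToyTxnState:
    # Hypothetical broker-side state for one producer id (not a Kafka class).
    current_epoch: int = 0
    partitions: set = field(default_factory=set)  # partitions in the open transaction
    log: list = field(default_factory=list)       # (epoch, partition, value) tuples

    def produce(self, epoch, partition, value):
        # Idea 1: a record from an earlier, already-ended transaction is fenced.
        if epoch < self.current_epoch:
            raise FencedError(f"epoch {epoch} < current epoch {self.current_epoch}")
        # Idea 2: the first produce to a partition adds it implicitly,
        # so there is no separate "add partition" step to forget.
        self.partitions.add(partition)
        self.log.append((epoch, partition, value))

    def end_transaction(self):
        # Writing the commit/abort marker bumps the epoch, so the pair
        # (producer id, epoch) names exactly one transaction.
        self.partitions.clear()
        self.current_epoch += 1
        return self.current_epoch
```

In this model, a record sent with epoch 0 that arrives after end_transaction() 
has been called raises FencedError instead of joining the next transaction.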

Of course, 1 and 2 require client-side changes, so for older clients, those 
approaches won’t apply.

3. *To cover older clients, we will ensure a transaction is ongoing before we 
write transactional records to a partition. We can do this by querying the 
transaction coordinator and caching the result.*
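
A minimal sketch of this check, assuming a partition leader that asks the 
coordinator once per transaction and remembers the answer until the 
transaction's marker is written. ToyCoordinator, ToyPartitionLeader, 
InvalidTxnStateError, and all method names are hypothetical and chosen only 
for illustration; see the KIP for the actual design.

```python
class InvalidTxnStateError(Exception):
    """Raised when no ongoing transaction covers the partition."""


class ToyCoordinator:
    """Hypothetical stand-in for the transaction coordinator."""

    def __init__(self):
        self.ongoing = {}   # (producer_id, epoch) -> set of partitions
        self.lookups = 0    # number of verification queries received

    def add_partition(self, producer_id, epoch, partition):
        self.ongoing.setdefault((producer_id, epoch), set()).add(partition)

    def end_transaction(self, producer_id, epoch):
        self.ongoing.pop((producer_id, epoch), None)

    def is_ongoing(self, producer_id, epoch, partition):
        self.lookups += 1
        return partition in self.ongoing.get((producer_id, epoch), set())


class ToyPartitionLeader:
    """Hypothetical partition leader that verifies before appending."""

    def __init__(self, partition, coordinator):
        self.partition = partition
        self.coordinator = coordinator
        self.verified = set()   # cached (producer_id, epoch) pairs
        self.log = []

    def append(self, producer_id, epoch, value):
        key = (producer_id, epoch)
        if key not in self.verified:
            # Cache miss: ask the coordinator whether a transaction is open.
            if not self.coordinator.is_ongoing(producer_id, epoch, self.partition):
                raise InvalidTxnStateError(f"no ongoing transaction for {key}")
            self.verified.add(key)
        self.log.append(value)

    def on_txn_marker(self, producer_id, epoch):
        # Older clients reuse the epoch across transactions, so the cached
        # answer is only valid until the commit/abort marker is written.
        self.verified.discard((producer_id, epoch))
```

The cache keeps the coordinator off the hot path for every produce request 
after the first one in a transaction. Dropping the entry on the marker is the 
part that stops a delayed record from being accepted after an abort.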

 

See KIP-890 for more information:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
