[jira] [Commented] (KAFKA-14402) Transactions Server Side Defense

2023-10-30 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781148#comment-17781148
 ] 

Justine Olshan commented on KAFKA-14402:


Sorry the wording is unclear. I will fix this. For older clients, the server 
will verify the partitions are in a given transaction. Older clients require no 
changes. If we changed them, they would be considered new (version) clients.

The idea of KIP-890 part 1 would be that there were no client changes required. 
The flow is the same from the client side. Just on the server side we verify 
the transaction is ongoing. Hopefully this makes sense.

> Transactions Server Side Defense
> 
>
> Key: KAFKA-14402
> URL: https://issues.apache.org/jira/browse/KAFKA-14402
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 3.5.0
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Major
>
> We have seen hanging transactions in Kafka where the last stable offset (LSO) 
> does not update, we can’t clean the log (if the topic is compacted), and 
> read_committed consumers get stuck.
> This can happen when a message gets stuck or delayed due to networking issues 
> or a network partition, the transaction aborts, and then the delayed message 
> finally comes in. The delayed message case can also violate EOS if the 
> delayed message comes in after the next addPartitionsToTxn request comes in. 
> Effectively we may see a message from a previous (aborted) transaction become 
> part of the next transaction.
> Another way hanging transactions can occur is that a client is buggy and may 
> somehow try to write to a partition before it adds the partition to the 
> transaction. In both of these cases, we want the server to have some control 
> to prevent these incorrect records from being written and either causing 
> hanging transactions or violating Exactly once semantics (EOS) by including 
> records in the wrong transaction.
> The best way to avoid this issue is to:
>  # *Uniquely identify transactions by bumping the producer epoch after every 
> commit/abort marker. That way, each transaction can be identified by 
> (producer id, epoch).* 
>  # {*}Remove the addPartitionsToTxn call and implicitly just add partitions 
> to the transaction on the first produce request during a transaction{*}.
> We avoid the late arrival case because the transaction is uniquely identified 
> and fenced AND we avoid the buggy client case because we remove the need for 
> the client to explicitly add partitions to begin the transaction.
> Of course, 1 and 2 require client-side changes, so for older clients, those 
> approaches won’t apply.
> 3. *To cover older clients, we will ensure a transaction is ongoing before we 
> write to a transaction. We can do this by querying the transaction 
> coordinator and caching the result.*
>  
> See KIP-890 for more information: ** 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14402) Transactions Server Side Defense

2023-10-30 Thread Travis Bischel (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781144#comment-17781144
 ] 

Travis Bischel commented on KAFKA-14402:


Sorry, I may be a bit confused then: when it says "for old clients to verify 
the partitions are in a given transaction" -- this can only happen in v4. How 
can a client / an admin verify that a partition is a part of the transaction if 
v4 is meant to be broker<=>broker only?

> Transactions Server Side Defense
> 
>
> Key: KAFKA-14402
> URL: https://issues.apache.org/jira/browse/KAFKA-14402
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 3.5.0
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Major
>
> We have seen hanging transactions in Kafka where the last stable offset (LSO) 
> does not update, we can’t clean the log (if the topic is compacted), and 
> read_committed consumers get stuck.
> This can happen when a message gets stuck or delayed due to networking issues 
> or a network partition, the transaction aborts, and then the delayed message 
> finally comes in. The delayed message case can also violate EOS if the 
> delayed message comes in after the next addPartitionsToTxn request comes in. 
> Effectively we may see a message from a previous (aborted) transaction become 
> part of the next transaction.
> Another way hanging transactions can occur is that a client is buggy and may 
> somehow try to write to a partition before it adds the partition to the 
> transaction. In both of these cases, we want the server to have some control 
> to prevent these incorrect records from being written and either causing 
> hanging transactions or violating Exactly once semantics (EOS) by including 
> records in the wrong transaction.
> The best way to avoid this issue is to:
>  # *Uniquely identify transactions by bumping the producer epoch after every 
> commit/abort marker. That way, each transaction can be identified by 
> (producer id, epoch).* 
>  # {*}Remove the addPartitionsToTxn call and implicitly just add partitions 
> to the transaction on the first produce request during a transaction{*}.
> We avoid the late arrival case because the transaction is uniquely identified 
> and fenced AND we avoid the buggy client case because we remove the need for 
> the client to explicitly add partitions to begin the transaction.
> Of course, 1 and 2 require client-side changes, so for older clients, those 
> approaches won’t apply.
> 3. *To cover older clients, we will ensure a transaction is ongoing before we 
> write to a transaction. We can do this by querying the transaction 
> coordinator and caching the result.*
>  
> See KIP-890 for more information: ** 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14402) Transactions Server Side Defense

2023-10-30 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781132#comment-17781132
 ] 

Justine Olshan commented on KAFKA-14402:


Hi [~twmb] 
The KIP current states:

> We will bump the versioning for AddPartitionsToTxn to add support for 
> "verifyOnly" mode used for old clients to verify the partitions are in a 
> given transaction. We will also add support to batch multiple waiting 
> requests by transactional ID. This newer version of the request will require 
> CLUSTER authorization so clients will not be able to use this version.

I'm not sure I understand your statement. Admins are not sending this request. 
The broker is sending the request. The goal is to remove the API from the 
client-side altogether.

> Transactions Server Side Defense
> 
>
> Key: KAFKA-14402
> URL: https://issues.apache.org/jira/browse/KAFKA-14402
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 3.5.0
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Major
>
> We have seen hanging transactions in Kafka where the last stable offset (LSO) 
> does not update, we can’t clean the log (if the topic is compacted), and 
> read_committed consumers get stuck.
> This can happen when a message gets stuck or delayed due to networking issues 
> or a network partition, the transaction aborts, and then the delayed message 
> finally comes in. The delayed message case can also violate EOS if the 
> delayed message comes in after the next addPartitionsToTxn request comes in. 
> Effectively we may see a message from a previous (aborted) transaction become 
> part of the next transaction.
> Another way hanging transactions can occur is that a client is buggy and may 
> somehow try to write to a partition before it adds the partition to the 
> transaction. In both of these cases, we want the server to have some control 
> to prevent these incorrect records from being written and either causing 
> hanging transactions or violating Exactly once semantics (EOS) by including 
> records in the wrong transaction.
> The best way to avoid this issue is to:
>  # *Uniquely identify transactions by bumping the producer epoch after every 
> commit/abort marker. That way, each transaction can be identified by 
> (producer id, epoch).* 
>  # {*}Remove the addPartitionsToTxn call and implicitly just add partitions 
> to the transaction on the first produce request during a transaction{*}.
> We avoid the late arrival case because the transaction is uniquely identified 
> and fenced AND we avoid the buggy client case because we remove the need for 
> the client to explicitly add partitions to begin the transaction.
> Of course, 1 and 2 require client-side changes, so for older clients, those 
> approaches won’t apply.
> 3. *To cover older clients, we will ensure a transaction is ongoing before we 
> write to a transaction. We can do this by querying the transaction 
> coordinator and caching the result.*
>  
> See KIP-890 for more information: ** 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14402) Transactions Server Side Defense

2023-10-30 Thread Travis Bischel (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781122#comment-17781122
 ] 

Travis Bischel commented on KAFKA-14402:


Another addendum, after your update: v4 of the AddPartitionsToTxn API requires 
CLUSTER_ACTION to semi-enforce broker-to-broker requests. Can the KIP be 
updated to document this? Alternatively, can requiring CLUSTER_ACTION be 
dropped? The KIP itself indicates that admins will send AddPartitionsToTxn to 
see whether a partition is in the transaction, and admins do not need 
CLUSTER_ACTION. As well, as a client library -- if I send the request with one 
transaction (i.e. use v4 the same as I use v3), there's no reason I should be 
limited to v3.

> Transactions Server Side Defense
> 
>
> Key: KAFKA-14402
> URL: https://issues.apache.org/jira/browse/KAFKA-14402
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 3.5.0
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Major
>
> We have seen hanging transactions in Kafka where the last stable offset (LSO) 
> does not update, we can’t clean the log (if the topic is compacted), and 
> read_committed consumers get stuck.
> This can happen when a message gets stuck or delayed due to networking issues 
> or a network partition, the transaction aborts, and then the delayed message 
> finally comes in. The delayed message case can also violate EOS if the 
> delayed message comes in after the next addPartitionsToTxn request comes in. 
> Effectively we may see a message from a previous (aborted) transaction become 
> part of the next transaction.
> Another way hanging transactions can occur is that a client is buggy and may 
> somehow try to write to a partition before it adds the partition to the 
> transaction. In both of these cases, we want the server to have some control 
> to prevent these incorrect records from being written and either causing 
> hanging transactions or violating Exactly once semantics (EOS) by including 
> records in the wrong transaction.
> The best way to avoid this issue is to:
>  # *Uniquely identify transactions by bumping the producer epoch after every 
> commit/abort marker. That way, each transaction can be identified by 
> (producer id, epoch).* 
>  # {*}Remove the addPartitionsToTxn call and implicitly just add partitions 
> to the transaction on the first produce request during a transaction{*}.
> We avoid the late arrival case because the transaction is uniquely identified 
> and fenced AND we avoid the buggy client case because we remove the need for 
> the client to explicitly add partitions to begin the transaction.
> Of course, 1 and 2 require client-side changes, so for older clients, those 
> approaches won’t apply.
> 3. *To cover older clients, we will ensure a transaction is ongoing before we 
> write to a transaction. We can do this by querying the transaction 
> coordinator and caching the result.*
>  
> See KIP-890 for more information: ** 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14402) Transactions Server Side Defense

2023-10-19 Thread Travis Bischel (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1512#comment-1512
 ] 

Travis Bischel commented on KAFKA-14402:


Thanks, that is clearer. I note that clients also work with v4 – I'm testing 
that locally – but also I notice locally that occasionally I get 
INVALID_TXN_STATE when I should not (note this test failure: 
[https://github.com/twmb/franz-go/actions/runs/6581193303/job/17880673632?pr=599]).
 In the linked test, the client itself is still sending v3; there is no change 
in the client from the tests that pass on 3.5 and this test that is failing on 
3.6.

 

However, trying to reproduce this locally, I'm running into INVALID_RECORD 
problems as well only on 3.6 which I'll create a separate issue for :|.

> Transactions Server Side Defense
> 
>
> Key: KAFKA-14402
> URL: https://issues.apache.org/jira/browse/KAFKA-14402
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 3.5.0
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Major
>
> We have seen hanging transactions in Kafka where the last stable offset (LSO) 
> does not update, we can’t clean the log (if the topic is compacted), and 
> read_committed consumers get stuck.
> This can happen when a message gets stuck or delayed due to networking issues 
> or a network partition, the transaction aborts, and then the delayed message 
> finally comes in. The delayed message case can also violate EOS if the 
> delayed message comes in after the next addPartitionsToTxn request comes in. 
> Effectively we may see a message from a previous (aborted) transaction become 
> part of the next transaction.
> Another way hanging transactions can occur is that a client is buggy and may 
> somehow try to write to a partition before it adds the partition to the 
> transaction. In both of these cases, we want the server to have some control 
> to prevent these incorrect records from being written and either causing 
> hanging transactions or violating Exactly once semantics (EOS) by including 
> records in the wrong transaction.
> The best way to avoid this issue is to:
>  # *Uniquely identify transactions by bumping the producer epoch after every 
> commit/abort marker. That way, each transaction can be identified by 
> (producer id, epoch).* 
>  # {*}Remove the addPartitionsToTxn call and implicitly just add partitions 
> to the transaction on the first produce request during a transaction{*}.
> We avoid the late arrival case because the transaction is uniquely identified 
> and fenced AND we avoid the buggy client case because we remove the need for 
> the client to explicitly add partitions to begin the transaction.
> Of course, 1 and 2 require client-side changes, so for older clients, those 
> approaches won’t apply.
> 3. *To cover older clients, we will ensure a transaction is ongoing before we 
> write to a transaction. We can do this by querying the transaction 
> coordinator and caching the result.*
>  
> See KIP-890 for more information: ** 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14402) Transactions Server Side Defense

2023-10-19 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1511#comment-1511
 ] 

Justine Olshan commented on KAFKA-14402:


I've added a note to the KIP to make this clearer as well.

> Transactions Server Side Defense
> 
>
> Key: KAFKA-14402
> URL: https://issues.apache.org/jira/browse/KAFKA-14402
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 3.5.0
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Major
>
> We have seen hanging transactions in Kafka where the last stable offset (LSO) 
> does not update, we can’t clean the log (if the topic is compacted), and 
> read_committed consumers get stuck.
> This can happen when a message gets stuck or delayed due to networking issues 
> or a network partition, the transaction aborts, and then the delayed message 
> finally comes in. The delayed message case can also violate EOS if the 
> delayed message comes in after the next addPartitionsToTxn request comes in. 
> Effectively we may see a message from a previous (aborted) transaction become 
> part of the next transaction.
> Another way hanging transactions can occur is that a client is buggy and may 
> somehow try to write to a partition before it adds the partition to the 
> transaction. In both of these cases, we want the server to have some control 
> to prevent these incorrect records from being written and either causing 
> hanging transactions or violating Exactly once semantics (EOS) by including 
> records in the wrong transaction.
> The best way to avoid this issue is to:
>  # *Uniquely identify transactions by bumping the producer epoch after every 
> commit/abort marker. That way, each transaction can be identified by 
> (producer id, epoch).* 
>  # {*}Remove the addPartitionsToTxn call and implicitly just add partitions 
> to the transaction on the first produce request during a transaction{*}.
> We avoid the late arrival case because the transaction is uniquely identified 
> and fenced AND we avoid the buggy client case because we remove the need for 
> the client to explicitly add partitions to begin the transaction.
> Of course, 1 and 2 require client-side changes, so for older clients, those 
> approaches won’t apply.
> 3. *To cover older clients, we will ensure a transaction is ongoing before we 
> write to a transaction. We can do this by querying the transaction 
> coordinator and caching the result.*
>  
> See KIP-890 for more information: ** 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14402) Transactions Server Side Defense

2023-10-19 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1507#comment-1507
 ] 

Justine Olshan commented on KAFKA-14402:


HI [~twmb] . Sorry if this was unclear. The plan was that when produce version 
is bumped to support it, we will no longer need addPartitionsToTxn calls from 
the client. 
Clients should continue to send v3 calls until this client/produce change is 
made. When the produce version is bumped, the broker can send v4 calls to add 
partitions without the client needing to.

In other words, the necessity of sending addPartitons calls is on the produce 
request version/the completion of part 2 and not the addPartitionsToTxn 
version. 

> Transactions Server Side Defense
> 
>
> Key: KAFKA-14402
> URL: https://issues.apache.org/jira/browse/KAFKA-14402
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 3.5.0
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Major
>
> We have seen hanging transactions in Kafka where the last stable offset (LSO) 
> does not update, we can’t clean the log (if the topic is compacted), and 
> read_committed consumers get stuck.
> This can happen when a message gets stuck or delayed due to networking issues 
> or a network partition, the transaction aborts, and then the delayed message 
> finally comes in. The delayed message case can also violate EOS if the 
> delayed message comes in after the next addPartitionsToTxn request comes in. 
> Effectively we may see a message from a previous (aborted) transaction become 
> part of the next transaction.
> Another way hanging transactions can occur is that a client is buggy and may 
> somehow try to write to a partition before it adds the partition to the 
> transaction. In both of these cases, we want the server to have some control 
> to prevent these incorrect records from being written and either causing 
> hanging transactions or violating Exactly once semantics (EOS) by including 
> records in the wrong transaction.
> The best way to avoid this issue is to:
>  # *Uniquely identify transactions by bumping the producer epoch after every 
> commit/abort marker. That way, each transaction can be identified by 
> (producer id, epoch).* 
>  # {*}Remove the addPartitionsToTxn call and implicitly just add partitions 
> to the transaction on the first produce request during a transaction{*}.
> We avoid the late arrival case because the transaction is uniquely identified 
> and fenced AND we avoid the buggy client case because we remove the need for 
> the client to explicitly add partitions to begin the transaction.
> Of course, 1 and 2 require client-side changes, so for older clients, those 
> approaches won’t apply.
> 3. *To cover older clients, we will ensure a transaction is ongoing before we 
> write to a transaction. We can do this by querying the transaction 
> coordinator and caching the result.*
>  
> See KIP-890 for more information: ** 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14402) Transactions Server Side Defense

2023-10-19 Thread Travis Bischel (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1489#comment-1489
 ] 

Travis Bischel commented on KAFKA-14402:


Can the KIP be updated to refer that v5 of AddPartitionsToTxn will be the 
version where clients do _not_ send the request? v4 still requires sending the 
request to the broker. I consistently receive INVALID_TXN_STATE if I do not 
send AddPartitionsToTxn.

> Transactions Server Side Defense
> 
>
> Key: KAFKA-14402
> URL: https://issues.apache.org/jira/browse/KAFKA-14402
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 3.5.0
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Major
>
> We have seen hanging transactions in Kafka where the last stable offset (LSO) 
> does not update, we can’t clean the log (if the topic is compacted), and 
> read_committed consumers get stuck.
> This can happen when a message gets stuck or delayed due to networking issues 
> or a network partition, the transaction aborts, and then the delayed message 
> finally comes in. The delayed message case can also violate EOS if the 
> delayed message comes in after the next addPartitionsToTxn request comes in. 
> Effectively we may see a message from a previous (aborted) transaction become 
> part of the next transaction.
> Another way hanging transactions can occur is that a client is buggy and may 
> somehow try to write to a partition before it adds the partition to the 
> transaction. In both of these cases, we want the server to have some control 
> to prevent these incorrect records from being written and either causing 
> hanging transactions or violating Exactly once semantics (EOS) by including 
> records in the wrong transaction.
> The best way to avoid this issue is to:
>  # *Uniquely identify transactions by bumping the producer epoch after every 
> commit/abort marker. That way, each transaction can be identified by 
> (producer id, epoch).* 
>  # {*}Remove the addPartitionsToTxn call and implicitly just add partitions 
> to the transaction on the first produce request during a transaction{*}.
> We avoid the late arrival case because the transaction is uniquely identified 
> and fenced AND we avoid the buggy client case because we remove the need for 
> the client to explicitly add partitions to begin the transaction.
> Of course, 1 and 2 require client-side changes, so for older clients, those 
> approaches won’t apply.
> 3. *To cover older clients, we will ensure a transaction is ongoing before we 
> write to a transaction. We can do this by querying the transaction 
> coordinator and caching the result.*
>  
> See KIP-890 for more information: ** 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense



--
This message was sent by Atlassian Jira
(v8.20.10#820010)