[jira] [Commented] (KAFKA-14053) Transactional producer should bump the epoch when a batch encounters delivery timeout

2022-07-14 Thread Daniel Urban (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566808#comment-17566808
 ] 

Daniel Urban commented on KAFKA-14053:
--

I understand that increasing the epoch on the client side probably violates 
the protocol contract.

I refactored my change so that client-side timeouts (both delivery and request 
timeouts) become fatal errors in transactional producers, resulting in a final, 
best-effort epoch bump.
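
The rule described above can be sketched roughly as follows. This is only an
illustration, not code from the PR: the class, the enums, and the classify
method are hypothetical names, and "EXISTING_HANDLING" simply stands for
whatever the producer already does in the other cases.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch; none of these names come from the Kafka client.
class TimeoutHandlingSketch {
    enum ClientError { DELIVERY_TIMEOUT, REQUEST_TIMEOUT, OTHER }
    enum Handling { FATAL_AFTER_BEST_EFFORT_BUMP, EXISTING_HANDLING }

    // The two client-side timeouts named in the comment.
    private static final Set<ClientError> CLIENT_SIDE_TIMEOUTS =
        EnumSet.of(ClientError.DELIVERY_TIMEOUT, ClientError.REQUEST_TIMEOUT);

    /** Client-side timeouts are fatal only for transactional producers. */
    static Handling classify(boolean transactional, ClientError error) {
        if (transactional && CLIENT_SIDE_TIMEOUTS.contains(error)) {
            return Handling.FATAL_AFTER_BEST_EFFORT_BUMP;
        }
        return Handling.EXISTING_HANDLING;
    }

    public static void main(String[] args) {
        for (ClientError error : ClientError.values()) {
            System.out.println(error + " (transactional) -> " + classify(true, error));
            System.out.println(error + " (plain) -> " + classify(false, error));
        }
    }
}
```

The point of the sketch is only that the decision depends on both the error
type and whether the producer is transactional; how the best-effort bump is
actually sent is not modeled here.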

> Transactional producer should bump the epoch when a batch encounters delivery 
> timeout
> -
>
>                 Key: KAFKA-14053
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14053
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Daniel Urban
>            Assignee: Daniel Urban
>            Priority: Major
>
> When a batch fails due to a delivery timeout, the batch may still be 
> in flight. Due to underlying infrastructure issues, an EndTxnRequest and a 
> WriteTxnMarkerRequest can be processed before the in-flight batch is 
> processed on the leader. As a result, transactional batches can be appended 
> to the log after the corresponding abort marker.
> This can block the LSO in the partition indefinitely, or even violate 
> processing guarantees, because the out-of-order batch can become part of 
> the next transaction.
> Because of this, the producer should skip aborting the partition and bump 
> the epoch to fence the in-flight requests.
>  
> More detail can be found here: 
> [https://lists.apache.org/thread/8d2oblsjtdv7740glc37v79f0r7p99dp]
>  
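
The race in the description can be replayed with a toy model. This is a
minimal sketch under strong assumptions: a single producer, a log that is just
a list, and an LSO defined as the first offset of the open transaction (or the
log end offset if none is open). ToyPartition and replay are hypothetical
names and do not reflect Kafka's real log or broker code.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one partition and one transactional producer.
class LsoSketch {
    static final class ToyPartition {
        private final List<String> log = new ArrayList<>();
        private short latestEpoch = 0;         // highest producer epoch seen so far
        private long firstOpenTxnOffset = -1;  // -1 means no open transaction

        /** Appends a transactional batch; returns false if the epoch is fenced. */
        boolean appendBatch(short epoch) {
            if (epoch < latestEpoch) {
                return false;                  // stale epoch: the batch is rejected
            }
            latestEpoch = epoch;
            if (firstOpenTxnOffset < 0) {
                firstOpenTxnOffset = log.size();
            }
            log.add("batch@epoch" + epoch);
            return true;
        }

        /** Appends an abort marker, which closes the open transaction. */
        void appendAbortMarker(short epoch) {
            latestEpoch = (short) Math.max(latestEpoch, epoch);
            log.add("abort@epoch" + epoch);
            firstOpenTxnOffset = -1;
        }

        long logEndOffset() {
            return log.size();
        }

        long lastStableOffset() {
            return firstOpenTxnOffset >= 0 ? firstOpenTxnOffset : log.size();
        }
    }

    /** Replays the race; returns {lastStableOffset, logEndOffset}. */
    static long[] replay(boolean bumpEpochOnAbort) {
        ToyPartition partition = new ToyPartition();
        short epoch = 0;
        partition.appendBatch(epoch);          // offset 0: batch that was delivered
        short markerEpoch = bumpEpochOnAbort ? (short) (epoch + 1) : epoch;
        partition.appendAbortMarker(markerEpoch); // offset 1: abort marker
        partition.appendBatch(epoch);          // late in-flight batch with the old epoch
        return new long[] {partition.lastStableOffset(), partition.logEndOffset()};
    }

    public static void main(String[] args) {
        long[] noBump = replay(false);
        long[] withBump = replay(true);
        System.out.println("no bump:   LSO=" + noBump[0] + " LEO=" + noBump[1]);
        System.out.println("with bump: LSO=" + withBump[0] + " LEO=" + withBump[1]);
    }
}
```

Without the bump, the late batch lands after the abort marker and opens a
transaction that nothing will close, so the LSO stays behind the log end
offset. With the bump, the late batch carries a stale epoch and is rejected in
this model, so the LSO equals the log end offset.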



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14053) Transactional producer should bump the epoch when a batch encounters delivery timeout

2022-07-08 Thread Daniel Urban (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564222#comment-17564222
 ] 

Daniel Urban commented on KAFKA-14053:
--

I am about to submit a PR, but wanted to bring up a few more points:
 * When we bump the epoch in the middle of an ongoing transaction, the 
coordinator handles that as a fencing operation and does not return a valid 
epoch. The coordinator assumes that a new producer was started with the same 
ID, and forces the InitProducerIDRequest to be retried. The retry results in a 
second bump, and that one returns the epoch.
 * If we go with the simplest solution and bump once, the producer must 
transition into a fatal state. The producer cannot safely get a new epoch from 
the coordinator: sending the original epoch multiple times first results in 
CONCURRENT_TRANSACTIONS (while the ongoing transaction is being aborted), then 
in PRODUCER_FENCED (once the coordinator realizes that the epoch was already 
bumped). Sending an empty InitProducerIDRequest is not safe either, as it 
might fence a newer producer instance.
 * In my PR, I try to bump twice. First, I use the original epoch (which will 
eventually get a PRODUCER_FENCED). Then I increase the epoch by 1 on the 
client side and try again. Conceptually, this might sound wrong (the producer 
is using an epoch which wasn't returned by the coordinator), but I think that 
it is theoretically safe. The first bump during an ongoing transaction results 
in an epoch which is never returned to any producer, so retrying with the 
increased epoch cannot collide with any other producer. If the bump with the 
increased epoch succeeds, then there was no concurrent bump from another 
instance, and the producer can continue with the new epoch. If the bump with 
the increased epoch receives a PRODUCER_FENCED, then the producer was actually 
fenced by another instance.
 * Due to the previous point, the producer needs to make sure that it can 
safely increase the epoch by 1 at any point. This means that if the producer 
got Short.MAX as the epoch, it won't be able to do the +1 increase locally. To 
avoid this, I changed the producer to force an epoch reset from the 
coordinator when the epoch is Short.MAX.
 * I understand that the local epoch increase and the intentional epoch reset 
might not be ideal, but without them, the producer cannot be used safely after 
a delivery timeout. So in my current understanding, a delivery timeout either 
has to be a fatal error, or we need to accept the local epoch increase and the 
epoch reset on Short.MAX.
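
The two-step bump from the points above can be sketched against a toy
coordinator. This is not the PR code, and the coordinator here is a heavily
simplified assumption: it bumps the epoch whenever the request carries its
current epoch, answers CONCURRENT_TRANSACTIONS once if a transaction was
ongoing, and answers PRODUCER_FENCED for any other epoch. ToyCoordinator,
Response, Result and bumpAfterTimeout are hypothetical names.

```java
// Toy model of the two-step bump; not Kafka's coordinator or producer code.
class TwoStepBumpSketch {
    enum Result { NONE, CONCURRENT_TRANSACTIONS, PRODUCER_FENCED }

    static final class Response {
        final Result result;
        final short epoch; // only meaningful when result == NONE

        Response(Result result, short epoch) {
            this.result = result;
            this.epoch = epoch;
        }
    }

    static final class ToyCoordinator {
        private short currentEpoch;
        private boolean txnOngoing;

        ToyCoordinator(short currentEpoch, boolean txnOngoing) {
            this.currentEpoch = currentEpoch;
            this.txnOngoing = txnOngoing;
        }

        Response initProducerId(short expectedEpoch) {
            if (expectedEpoch != currentEpoch) {
                return new Response(Result.PRODUCER_FENCED, (short) -1);
            }
            currentEpoch++; // every accepted request bumps the epoch
            if (txnOngoing) {
                // Fencing bump: the transaction is aborted, no epoch is returned.
                txnOngoing = false;
                return new Response(Result.CONCURRENT_TRANSACTIONS, (short) -1);
            }
            return new Response(Result.NONE, currentEpoch);
        }
    }

    /** Tries the original epoch first, then the locally increased epoch. */
    static Response bumpAfterTimeout(ToyCoordinator coordinator, short localEpoch) {
        if (localEpoch == Short.MAX_VALUE) {
            // A local +1 would overflow; an epoch reset is needed beforehand.
            throw new IllegalStateException("epoch reset required before local bump");
        }
        Response first = coordinator.initProducerId(localEpoch);
        while (first.result == Result.CONCURRENT_TRANSACTIONS) {
            first = coordinator.initProducerId(localEpoch);
        }
        if (first.result == Result.NONE) {
            return first; // no transaction was ongoing, one bump was enough
        }
        // PRODUCER_FENCED: caused either by our own first bump or by another instance.
        return coordinator.initProducerId((short) (localEpoch + 1));
    }

    public static void main(String[] args) {
        Response ok = bumpAfterTimeout(new ToyCoordinator((short) 5, true), (short) 5);
        System.out.println(ok.result + " epoch=" + ok.epoch);
        Response fenced = bumpAfterTimeout(new ToyCoordinator((short) 7, false), (short) 5);
        System.out.println(fenced.result);
    }
}
```

In this model, a producer at epoch 5 with an ongoing transaction ends up at
epoch 7 (two bumps), while a producer whose epoch is already two behind the
coordinator gets PRODUCER_FENCED on both attempts, which corresponds to being
fenced by another instance. The Short.MAX_VALUE check mirrors the reason for
the forced epoch reset: the local +1 is not representable at that value.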



[jira] [Commented] (KAFKA-14053) Transactional producer should bump the epoch when a batch encounters delivery timeout

2022-07-08 Thread Luke Chen (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564186#comment-17564186
 ] 

Luke Chen commented on KAFKA-14053:
---

[~hachikuji], I'd like to hear your opinion on this. Thanks.
