[jira] [Commented] (KAFKA-9199) Improve handling of out of sequence errors lower than last acked sequence

2023-08-26 Thread Justine Olshan (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-9199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759283#comment-17759283 ]

Justine Olshan commented on KAFKA-9199:
---

Hey [~_gargantua_], thanks again for following up. Feel free to work on it if 
you have time. The major thing we need to keep in mind is client 
compatibility: if we add a new error, we need a KIP for that. There is some 
flexibility in reusing existing errors, but we'd have to look at how the error 
would be used and how the client reacts. Feel free to tag me in any reviews.

> Improve handling of out of sequence errors lower than last acked sequence
> --------------------------------------------------------------------------
>
> Key: KAFKA-9199
> URL: https://issues.apache.org/jira/browse/KAFKA-9199
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Reporter: Jason Gustafson
>Priority: Major
>
> The broker attempts to cache the state of the last 5 batches in order to 
> enable duplicate detection. This caching is not guaranteed across restarts: 
> we only write the state of the last batch to the snapshot file. It is 
> possible in some cases for this to result in a sequence such as the following:
>  # Send sequence=n
>  # Sequence=n successfully written, but response is not received
>  # Leader changes after broker restart
>  # Send sequence=n+1
>  # Receive successful response for n+1
>  # Sequence=n times out and is retried, results in out of order sequence
> There are a couple of problems here. First, it would probably be better for 
> the broker to return DUPLICATE_SEQUENCE_NUMBER when it receives a sequence 
> number lower than any of the cached batches. Second, the producer currently 
> handles this situation by retrying until the delivery timeout expires; 
> instead, it should fail the batch immediately. 
> This issue popped up in the reassignment system test. It ultimately caused 
> the test to fail because the producer was stuck retrying the duplicate batch 
> repeatedly until ultimately giving up.
>  
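The quoted scenario and the proposed fixes can be sketched as a small toy model. This is a hedged illustration only, not Kafka's actual broker code: `ProducerState`, `classify`, and the five-entry cache here are simplified stand-ins, and sequence wraparound is ignored for clarity. The point is the classification the description argues for: a sequence inside or below the cached window comes back as DUPLICATE_SEQUENCE_NUMBER, which the producer could then fail fast instead of retrying until the delivery timeout.

```python
from collections import deque

CACHED_BATCHES = 5  # the broker keeps metadata for the last 5 batches per producer


class ProducerState:
    """Toy per-producer state; batch_seqs holds the sequences of cached batches."""

    def __init__(self):
        self.batch_seqs = deque(maxlen=CACHED_BATCHES)

    def classify(self, seq: int) -> str:
        """Classify an incoming batch sequence (wraparound ignored for clarity)."""
        expected = self.batch_seqs[-1] + 1 if self.batch_seqs else 0
        if seq in self.batch_seqs:
            return "DUPLICATE_SEQUENCE_NUMBER"  # exact duplicate of a cached batch
        if seq == expected:
            self.batch_seqs.append(seq)
            return "OK"
        if self.batch_seqs and seq < self.batch_seqs[0]:
            # Below anything cached: per the proposal, also a duplicate, which
            # the producer should treat as fatal for the batch, not retriable.
            return "DUPLICATE_SEQUENCE_NUMBER"
        return "OUT_OF_ORDER_SEQUENCE_NUMBER"


state = ProducerState()
for s in range(6):
    state.classify(s)  # cache now holds sequences 1..5
assert state.classify(3) == "DUPLICATE_SEQUENCE_NUMBER"
assert state.classify(0) == "DUPLICATE_SEQUENCE_NUMBER"  # below the cache window
assert state.classify(9) == "OUT_OF_ORDER_SEQUENCE_NUMBER"
```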



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-9199) Improve handling of out of sequence errors lower than last acked sequence

2023-08-26 Thread Fei Xie (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-9199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759264#comment-17759264 ]

Fei Xie commented on KAFKA-9199:


Okay. I looked a bit at the PR linked to 
https://issues.apache.org/jira/browse/KAFKA-14920 . It looks like that PR 
resolves similar issues in transaction verification, so this ticket is still 
valid. But as you said, would you prefer that I wait for a generalized version 
of your PR, or should I start on this ticket now?



[jira] [Commented] (KAFKA-9199) Improve handling of out of sequence errors lower than last acked sequence

2023-08-25 Thread Justine Olshan (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-9199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759180#comment-17759180 ]

Justine Olshan commented on KAFKA-9199:
---

Hey [~_gargantua_] – I think for idempotent producers the sequence number 
doesn't wrap; rather, we bump the epoch. I think the solution of returning a 
different fatal and/or abortable error would work when the sequence is lower. 

We were also considering a change that would prevent the scenario in the first 
place by storing "seen" sequences and blocking appends for any sequence higher 
than the lowest seen sequence. We did something similar for 
https://issues.apache.org/jira/browse/KAFKA-14920 and are considering how it 
can apply generally.
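One possible reading of the "seen sequences" idea, sketched as a toy gate. The names here (`SequenceGate`, `try_append`, `resolve`) are hypothetical, and the actual KAFKA-14920 mechanism and its generalization differ in detail; this only illustrates the principle of refusing to append past the lowest still-unresolved sequence, so a late retry of that sequence can never arrive behind a newer batch.

```python
class SequenceGate:
    """Toy gate: track in-flight ('seen') sequences and reject any append
    that would jump past the lowest still-unresolved one."""

    def __init__(self):
        self.in_flight = set()

    def try_append(self, seq: int) -> bool:
        # Block sequences beyond the lowest unresolved one, so a timed-out
        # retry of that sequence cannot later show up "out of order".
        if self.in_flight and seq > min(self.in_flight):
            return False
        self.in_flight.add(seq)
        return True

    def resolve(self, seq: int) -> None:
        # Called once the outcome of `seq` is known (acked or failed).
        self.in_flight.discard(seq)


gate = SequenceGate()
assert gate.try_append(7)      # sequence n accepted; response may be lost
assert not gate.try_append(8)  # n+1 blocked while n is unresolved
gate.resolve(7)
assert gate.try_append(8)      # once n resolves, n+1 proceeds
```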



[jira] [Commented] (KAFKA-9199) Improve handling of out of sequence errors lower than last acked sequence

2023-08-12 Thread Fei Xie (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-9199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17753593#comment-17753593 ]

Fei Xie commented on KAFKA-9199:


Hi [~hachikuji], I took a look at the relevant code, but it seems the proposed 
solution won't work: the sequence number wraps around from INT_MAX to 0, so we 
cannot tell whether an incoming sequence number is lower or higher than the 
last acked one; we can only tell that it is OUT_OF_ORDER.
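The wraparound concern can be made concrete. Assuming, as this comment describes, that sequence numbers are 32-bit values that wrap from Integer.MAX_VALUE back to 0 (the helper name `increment_sequence` is illustrative, not Kafka's code), a sketch:

```python
SEQ_LIMIT = 2**31  # sequences occupy [0, 2**31 - 1]; wrap at Integer.MAX_VALUE


def increment_sequence(sequence: int, increment: int) -> int:
    """Advance a producer sequence, wrapping from Integer.MAX_VALUE back to 0."""
    return (sequence + increment) % SEQ_LIMIT


last_acked = 2**31 - 1                     # last acknowledged sequence, at INT_MAX
nxt = increment_sequence(last_acked, 1)    # wraps to 0

# The next batch carries sequence 0, which is numerically *lower* than
# last_acked even though it is logically *newer* -- so the broker cannot
# tell "stale duplicate" from "fresh batch" by magnitude alone.
assert nxt == 0
assert nxt < last_acked
```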



[jira] [Commented] (KAFKA-9199) Improve handling of out of sequence errors lower than last acked sequence

2023-08-09 Thread Fei Xie (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-9199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752596#comment-17752596 ]

Fei Xie commented on KAFKA-9199:


Hi [~hachikuji], is this issue still open? You mentioned that it popped up in 
the reassignment system test. Are there any instructions for rerunning that 
failed system test? Thank you for answering my questions.
