urbandan commented on PR #13796: URL: https://github.com/apache/kafka/pull/13796#issuecomment-1582188829
> I guess I just need to clarify what retried batches are here -- is the idea that we wait for inflight batches to return a response or time out? What if the response triggers another retry? Would we prevent that from sending out?

The core idea is that we let each of the in-flight batches complete, even if they need multiple retries. This would allow the producer to:

1. Avoid inconsistency - by letting in-flight batches finish, we do not run the risk of overwriting their sequence number while we are still not sure if they were appended or not.
2. Operate with best-effort - when using an idempotent producer, and encountering an error, it is costly to verify if a message was appended to the log or not (I think the "official" suggestion is to consume the topic to verify). By letting the in-flight batches finish, the idempotent producer will report fewer false positive errors.

> I'm also wondering the benefit of preserving the previous batches if there is an error. How does the system recover differently if we allow those batches to "complete". I think we could run into cases where the error causes the inflight batches to be unable to be written. Do we prefer to fail them (what we may do with this change) and start clean or try to write them with new sequences? I can see both scenarios causing issues.

I believe that produce errors should be handled separately, and should not cascade to other batches. I think most errors do not really predict the result of other produce requests.

> I guess it boils down to availability of writes (rewriting the sequences allows us to continue writing) or idempotency correctness (trying to wait for them to complete with their old sequences). The sticking point I'm running into is why getting those extra inflight requests (potentially) written is better if we've hit a non-retriable error.

My understanding is that here correctness beats availability. Are you suggesting that we should just cancel in-flight batches when encountering an error?

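
As a rough illustration of the inconsistency risk in point 1, here is a toy model of a partition's idempotence check. It is only a sketch under simplifying assumptions, not the actual broker or producer code: `ToyPartition` and both helper functions are hypothetical names, and the model tracks a single producer with one last-seen sequence number. It compares retrying an in-flight batch with its original sequence against re-sending it under a bumped epoch with a rewritten sequence, in the case where the first attempt was appended but its response was lost.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPartition:
    """Toy stand-in for a partition's idempotence check (hypothetical, not Kafka code)."""
    log: list = field(default_factory=list)
    epoch: int = 0
    last_seq: int = -1

    def append(self, epoch: int, seq: int, payload: str) -> str:
        if epoch < self.epoch:
            return "FENCED"
        if epoch > self.epoch:
            # Assumption: a bumped epoch starts a fresh sequence space at 0.
            if seq != 0:
                return "OUT_OF_ORDER"
            self.epoch, self.last_seq = epoch, -1
        if seq <= self.last_seq:
            # Already appended: acknowledged without writing it again.
            return "DUPLICATE"
        if seq != self.last_seq + 1:
            return "OUT_OF_ORDER"
        self.log.append(payload)
        self.last_seq = seq
        return "APPENDED"


def retry_with_original_sequence() -> list:
    p = ToyPartition()
    p.append(0, 0, "A")           # appended, but the response is lost
    result = p.append(0, 0, "A")  # retry of the in-flight batch, same sequence
    assert result == "DUPLICATE"
    return p.log


def resend_with_rewritten_sequence() -> list:
    p = ToyPartition()
    p.append(0, 0, "A")           # appended, but the response is lost
    result = p.append(1, 0, "A")  # same batch, re-sent under a bumped epoch
    assert result == "APPENDED"
    return p.log


print(retry_with_original_sequence())    # ['A']
print(resend_with_rewritten_sequence())  # ['A', 'A']
```

In this model, the retry with the original sequence is recognized as a duplicate and the log keeps a single copy, while the rewritten batch is accepted as new and ends up in the log twice.
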
> Maybe I just need an example :)

I will try to write up some examples, and also write more unit tests to demonstrate those scenarios.