jolshan commented on PR #17022:
URL: https://github.com/apache/kafka/pull/17022#issuecomment-2981743810

   Ok -- I'm back. I think what wasn't sitting right for me, and what I 
realized from our discussion about the producer ID overflow, is the question of 
whether this is the right place to make the change and the right mental model. 
   
   Specifically, we don't have a great way to distinguish benign stale requests 
from those that could indicate a divergence of state or some other real 
problem. 
   
   I'm wondering if we can get to the heart of this:
   > 4. A timeout threshold is hit
   > 5. _T1_ starts the abort process
   >    1. _TM_ state is set to `ABORTING_TRANSACTION`
   >    2. The batches involved with _T1_ are marked as expired
   >    3. _TM_ is reinitialized, bumping the epoch from `0` to `1` and
   >       setting its state to `READY`
   > 6. A moment later, in the `Sender` thread, one of the failed batches
   >    calls `handleFailedBatch()`
   
   Do we know of any way to mark the exact requests that were in flight at the 
time of the timeout as stale, rather than just inferring staleness from the 
epoch in the batch? I think for some errors, we close the connection -- we may 
not want to do that here, but I'm thinking about addressing the problem at the 
root rather than adding small changes whose side effects we may not yet 
understand. (As we saw from the test failures after making the change, the 
current part of the code may cause other issues.)
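
   To make the distinction concrete, here is a rough sketch of the two 
approaches applied to the timeout sequence quoted above. Everything in it is 
hypothetical: `StaleBatchSketch`, `Batch`, `Manager`, `abortAndReinitialize()`, 
`staleByEpoch()` and `staleByMarking()` are illustrative stand-ins, not the 
producer's actual classes or methods, and it ignores threading entirely.

```java
import java.util.HashSet;
import java.util.Set;

public class StaleBatchSketch {

    /** Hypothetical stand-in for a producer batch (not Kafka's real class). */
    static final class Batch {
        final long id;
        final short epoch;

        Batch(long id, short epoch) {
            this.id = id;
            this.epoch = epoch;
        }
    }

    /** Hypothetical stand-in for the transaction manager (_TM_). */
    static final class Manager {
        private short epoch = 0;
        private final Set<Long> inFlight = new HashSet<>();
        private final Set<Long> staleIds = new HashSet<>();

        /** Registers a batch as in flight under the current epoch. */
        Batch send(long id) {
            inFlight.add(id);
            return new Batch(id, epoch);
        }

        /** Steps 5.1-5.3: remember what was in flight, then bump the epoch. */
        void abortAndReinitialize() {
            staleIds.addAll(inFlight);
            inFlight.clear();
            epoch++;
        }

        /** Approach A: infer staleness from an epoch mismatch alone. */
        boolean staleByEpoch(Batch b) {
            return b.epoch != epoch;
        }

        /** Approach B: stale only if it was recorded as in flight at the abort. */
        boolean staleByMarking(Batch b) {
            return staleIds.contains(b.id);
        }
    }

    public static void main(String[] args) {
        Manager tm = new Manager();
        Batch timedOut = tm.send(1L);
        tm.abortAndReinitialize();

        // A batch the manager never tracked, carrying an unexpected epoch.
        Batch unknown = new Batch(99L, (short) 7);

        System.out.println("timedOut: byEpoch=" + tm.staleByEpoch(timedOut)
                + " byMarking=" + tm.staleByMarking(timedOut));
        System.out.println("unknown:  byEpoch=" + tm.staleByEpoch(unknown)
                + " byMarking=" + tm.staleByMarking(unknown));
    }
}
```

   In this sketch, both checks agree on the batch that timed out, but only the 
explicit marking separates it from a batch with an unexpected epoch that was 
never tracked, which is the kind of case that might indicate a real divergence.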


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

