hachikuji commented on pull request #9632:
URL: https://github.com/apache/kafka/pull/9632#issuecomment-735990319


   For a little more background about the 
`LogTest.testAppendToTransactionIndexFailure` failure, it is due to an 
inconsistency in how we update state in `ProducerStateManager`. The current 
append flow is the following:
   
   1. Build producer state in `ProducerAppendInfo` instances and collect 
completed transactions
   2. Append the entry to the log
   3. Update log end offset
   4. Apply individual producer state to `ProducerStateManager`
   5. Update the transaction index
   6. Update completed transactions and advance LSO
   
   The idea is that the LSO is stuck if an append to the transaction index 
fails. However, because we have already updated producer state before the index 
write, we are left with an inconsistency. The LSO will reflect an ongoing 
transaction which is not reflected in any of the producer states. 
   
   The test case that is failing is validating the behavior when the index 
write fails. It works like this:
   
   1. First append some transactional data to the log
   2. Append an ABORT marker, but let the write to the transaction index fail
   3. Retry the append of the ABORT and verify that append still fails and the 
LSO is stuck
   
   The test fails because the second append no longer attempts to write to the 
transaction index. I can change the test of course, but I was disturbed about 
the underlying assumption that the write of the transaction marker can be 
retried on the `Log` after a failure. In fact, the path to fencing the `Log` 
after a write failure is asynchronous today. We use `LogDirFailureChannel` to 
propagate log failures to a separate thread which is responsible for marking 
the log dir offline or shutting down the broker. So there is indeed a (small) 
window during which a `WriteTxnMarkers` request could be retried. My feeling is 
that EOS demands a stronger guarantee and we need to fence off the `Log` 
instance synchronously while still holding the lock.
   
   So I think we need a separate jira to fix this issue. The question then is 
whether it should block this patch or not. I am thinking not at the moment. The 
test fails because there is no second append to the transaction index, but this 
is not required for correctness, and the LSO remains stuck as expected in any 
case. Basically I'd say we're no worse than before. I will add a commit which 
alters the test case so that it can pass and we can discuss tightening up the 
failure logic in a separate jira.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to