Jason Gustafson created KAFKA-10778:
---------------------------------------

             Summary: Stronger log fencing after write failure
                 Key: KAFKA-10778
                 URL: https://issues.apache.org/jira/browse/KAFKA-10778
             Project: Kafka
          Issue Type: Bug
            Reporter: Jason Gustafson


If a log operation fails with an IO error, the broker attempts to fail the log 
dir that it resides in. Currently this is done asynchronously, which means 
there is no guarantee that additional appends won't be attempted before the log 
is fenced. This can be a problem for EOS because of the need to maintain 
consistent producer state. Roughly, an append performs the following steps:

1. Iterate through batches to build producer state and collect completed 
transactions
2. Append the batches to the log 
3. Update the offset/timestamp indexes
4. Update log end offset
5. Apply individual producer state to `ProducerStateManager`
6. Update the transaction index
7. Update completed transactions and advance LSO
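The hazard in this ordering is that the physical append (step 2) happens before `ProducerStateManager` is updated (step 5). A minimal sketch of that window, using hypothetical names rather than the actual broker code:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the append pipeline described above.
// Tracks the last sequence per producer both "on disk" and in the
// stand-in for ProducerStateManager, so the two can diverge.
class AppendPipeline {
    final Map<Long, Integer> logSequences = new HashMap<>();      // producer id -> last sequence in the log
    final Map<Long, Integer> managerSequences = new HashMap<>();  // producer id -> last sequence in the manager
    boolean failIndexUpdate = false;                              // simulate an IO error in step 3

    void append(long producerId, int sequence) throws IOException {
        // Step 1: build producer state from the batches (validation elided)
        // Step 2: append the batches to the log
        logSequences.put(producerId, sequence);
        // Step 3: update the offset/timestamp indexes
        if (failIndexUpdate) {
            throw new IOException("index update failed");
        }
        // Steps 4-7: advance LEO, apply producer state, update txn index, advance LSO
        managerSequences.put(producerId, sequence);
    }
}
```

If step 3 throws after step 2 has succeeded, `logSequences` and `managerSequences` disagree: the batch is on disk, but a retried append with the same sequence would not be detected as a duplicate by the manager.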

One example of how this process can go wrong is if the index updates in step 3 
fail. In that case, the log contains updated producer state which has not yet 
been reflected in `ProducerStateManager`. If the append is retried before the 
log is fenced, we can end up with duplicates. There are likely other failure 
scenarios as well.

I'm sure we can come up with some way to fix this specific case, but the 
general fencing approach is slippery enough that we'll have a hard time 
convincing ourselves that it handles all potential cases. It would be simpler 
to add synchronous fencing logic for the case when an append fails due to an IO 
error. For example, we can set a flag indicating that the log is closed to 
further read/write operations.
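A minimal sketch of that synchronous fencing idea (the class and method names here are hypothetical, not a proposed patch): the flag is set in the same call path that observed the IO error, so no later operation can slip in before the asynchronous log-dir failure handling runs.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: fence the log synchronously on the first IO error.
class FencedLog {
    private final AtomicBoolean fenced = new AtomicBoolean(false);
    boolean failWrites = false;  // simulate an IO error in the write path

    void append(byte[] batch) {
        checkNotFenced();
        try {
            writeToDisk(batch);
        } catch (RuntimeException e) {  // stand-in for IOException from the real write path
            fenced.set(true);           // fence before propagating, so retries fail fast
            throw e;
        }
    }

    void read(long offset) {
        checkNotFenced();
        // ... read path elided ...
    }

    boolean isFenced() {
        return fenced.get();
    }

    private void checkNotFenced() {
        if (fenced.get()) {
            throw new IllegalStateException("log is fenced after a previous IO error");
        }
    }

    private void writeToDisk(byte[] batch) {
        if (failWrites) throw new RuntimeException("simulated IO error");
    }
}
```

With this in place, the asynchronous log-dir failure handling can still run as it does today; the flag simply guarantees that no append or read is served in the window before it completes.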



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
