[ https://issues.apache.org/jira/browse/KAFKA-10778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Gustafson resolved KAFKA-10778.
-------------------------------------
Fix Version/s: 2.8.0
Assignee: Tom Bentley
Resolution: Fixed
> Stronger log fencing after write failure
> ----------------------------------------
>
> Key: KAFKA-10778
> URL: https://issues.apache.org/jira/browse/KAFKA-10778
> Project: Kafka
> Issue Type: Bug
> Reporter: Jason Gustafson
> Assignee: Tom Bentley
> Priority: Major
> Fix For: 2.8.0
>
>
> If a log append operation fails with an IO error, the broker attempts to fail
> the log dir that it resides in. Currently this is done asynchronously, which
> means there is no guarantee that additional appends won't be attempted before
> the log is fenced. This can be a problem for exactly-once semantics (EOS)
> because of the need to maintain consistent producer state. The append path
> currently performs roughly the following steps:
> 1. Iterate through batches to build producer state and collect completed
> transactions
> 2. Append the batches to the log
> 3. Update the offset/timestamp indexes
> 4. Update log end offset
> 5. Apply individual producer state to `ProducerStateManager`
> 6. Update the transaction index
> 7. Update completed transactions and advance LSO
> One example of how this process can go wrong is if the index updates in step
> 3 fail. In this case, the batches have already been written in step 2, so the
> log contains updated producer state that has not been reflected in
> `ProducerStateManager` (step 5 never ran). If the append is retried before
> the log is fenced, then we can have duplicates. Other failure modes are
> probably possible as well.
> I'm sure we can come up with some way to fix this specific case, but the
> general fencing approach is slippery enough that we'll have a hard time
> convincing ourselves that it handles all potential cases. It would be simpler
> to add synchronous fencing logic for the case when an append fails due to an
> IO error. For example, we can set a flag to indicate that the log is closed
> for additional read/write operations.
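
The synchronous fencing idea described above can be illustrated with a small
toy model. This is only a sketch under stated assumptions, not the change that
was merged for this ticket: `FencedLogSketch`, `ToyLog`, `LogFencedException`,
and `injectIndexFailure` are hypothetical names invented for the example, and
the "log" is just an in-memory list. The point it shows is that the flag is set
while the append lock is still held, so a retry cannot slip in between the IO
failure and the fence.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Toy model only: these classes are hypothetical and are not Kafka's log implementation.
final class FencedLogSketch {

    /** Raised when an operation reaches a log that has already been fenced. */
    static final class LogFencedException extends RuntimeException {
        LogFencedException(String message) {
            super(message);
        }
    }

    static final class ToyLog {
        private final Object lock = new Object();
        private final List<String> batches = new ArrayList<>();
        private boolean fenced = false;               // guarded by lock
        private boolean failNextIndexUpdate = false;  // failure-injection hook, guarded by lock

        /** Failure-injection hook: make the next index update throw an IOException. */
        void injectIndexFailure() {
            synchronized (lock) {
                failNextIndexUpdate = true;
            }
        }

        void append(String batch) {
            synchronized (lock) {
                checkNotFenced();
                try {
                    batches.add(batch);  // step 2: append the batch to the log
                    updateIndexes();     // step 3: may fail with an IO error
                    // Steps 4-7 (log end offset, producer state, ...) would follow here.
                } catch (IOException e) {
                    // Fence before the lock is released so that no retry can get through.
                    fenced = true;
                    throw new UncheckedIOException(e);
                }
            }
        }

        List<String> read() {
            synchronized (lock) {
                checkNotFenced();
                return new ArrayList<>(batches);
            }
        }

        boolean isFenced() {
            synchronized (lock) {
                return fenced;
            }
        }

        /** Inspection helper for the example; deliberately ignores the fence. */
        int batchCount() {
            synchronized (lock) {
                return batches.size();
            }
        }

        private void checkNotFenced() {
            if (fenced) {
                throw new LogFencedException("log is fenced after an IO error");
            }
        }

        private void updateIndexes() throws IOException {
            if (failNextIndexUpdate) {
                failNextIndexUpdate = false;
                throw new IOException("simulated index write failure");
            }
        }
    }

    public static void main(String[] args) {
        ToyLog log = new ToyLog();
        log.append("batch-0");
        log.injectIndexFailure();
        try {
            log.append("batch-1");
        } catch (UncheckedIOException e) {
            System.out.println("append failed: " + e.getCause().getMessage());
        }
        try {
            log.append("batch-1");  // the retry that could otherwise duplicate batch-1
        } catch (LogFencedException e) {
            System.out.println("retry rejected: " + e.getMessage());
        }
    }
}
```

In this sketch the failed batch is still present in the toy log, which mirrors
the inconsistent state described in the ticket; the fence does not repair that
state, it only prevents further reads and writes from building on it until the
log dir failure is handled.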