[ https://issues.apache.org/jira/browse/KAFKA-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16826431#comment-16826431 ]
Andrew Olson commented on KAFKA-7866:
-------------------------------------

[~hachikuji] Do you have an estimated release date for 2.2.1?

> Duplicate offsets after transaction index append failure
> --------------------------------------------------------
>
>                 Key: KAFKA-7866
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7866
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jason Gustafson
>            Assignee: Jason Gustafson
>            Priority: Major
>             Fix For: 2.0.2, 2.1.2, 2.2.1
>
> We have encountered a situation in which an ABORT marker was written successfully to the log, but failed to be written to the transaction index. This prevented the log end offset from being incremented, which resulted in duplicate offsets when the next append was attempted. The broker was using JBOD, and we would normally expect IOExceptions to cause the log directory to be failed. That did not seem to happen here, and the duplicates continued for several hours.
>
> Unfortunately, we are not sure what the cause of the failure was. Significantly, the first duplicate was also the first ABORT marker in the log. Unlike the offset and timestamp indexes, the transaction index is created on demand after the first aborted transaction. It is likely that the attempt to create and open the transaction index failed. There is some suggestion that the process may have bumped into the open file limit. Whatever the problem was, it also prevented log collection, so we cannot confirm our guesses.
>
> Without knowing the underlying cause, we can still consider some potential improvements:
> 1. We probably should be catching non-IO exceptions in the append process. If the append to one of the indexes fails, we could truncate the log or re-throw the exception as an IOException to ensure that the log directory is no longer used.
> 2. Even without the unexpected exception, there is a small window during which even an IOException could lead to duplicate offsets. Marking a log directory offline is an asynchronous operation, and there is no guarantee that another append cannot happen first. Given this, we probably need to detect and truncate duplicates during the log recovery process.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
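A minimal sketch of what improvement (1) could look like: wrapping any unexpected non-IO failure from an index append in an IOException, so that the existing IOException-based log-directory failure handling also fires for errors like a failed index creation. The names here (`IndexAppender`, `TxnIndex`, `appendToTxnIndex`) are hypothetical and not Kafka's actual API.

```java
import java.io.IOException;

// Hypothetical sketch, not Kafka's real append path: any non-IO exception
// thrown while appending to an index is re-thrown as an IOException so that
// the broker's log-directory failure logic (which reacts to IOExceptions)
// would take the directory offline instead of leaving the log in use.
public class IndexAppender {

    interface TxnIndex {
        void append(long offset) throws IOException;
    }

    static void appendToTxnIndex(TxnIndex index, long offset) throws IOException {
        try {
            index.append(offset);
        } catch (IOException e) {
            // Already covered by the existing log-dir failure handling.
            throw e;
        } catch (Exception e) {
            // Improvement (1): unexpected failures (e.g. index creation
            // hitting the open-file limit) are surfaced as IOExceptions.
            throw new IOException("Index append failed unexpectedly", e);
        }
    }
}
```

The point of the wrapper is simply that a single failure-handling path sees every append error, closing the gap where a RuntimeException left the log end offset unadvanced without failing the directory.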
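Improvement (2) — detecting and truncating duplicates during log recovery — could be sketched roughly as follows: walk the batches in offset order and stop at the first batch whose base offset does not advance past the previous batch's last offset, since everything from that point was appended after the log end offset failed to move. The `Batch` record and `recoverValidBatches` helper are illustrative stand-ins, not Kafka's recovery code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical recovery-time duplicate detection: offsets in a healthy log
// are strictly increasing across batches, so a batch whose base offset
// overlaps its predecessor marks the start of the duplicated region.
public class RecoveryTruncation {

    record Batch(long baseOffset, long lastOffset) {}

    // Returns the prefix of batches with strictly increasing offsets; the
    // caller would truncate the segment at the first duplicate batch.
    static List<Batch> recoverValidBatches(List<Batch> batches) {
        List<Batch> valid = new ArrayList<>();
        long lastOffset = -1L;
        for (Batch b : batches) {
            if (b.baseOffset() <= lastOffset) {
                // Duplicate offsets: the log end offset never advanced past
                // this point, so this batch and everything after it is stale.
                break;
            }
            valid.add(b);
            lastOffset = b.lastOffset();
        }
        return valid;
    }
}
```

For example, a segment containing batches [0..4], [5..9], [5..9], [10..14] would be truncated after the second batch, discarding the duplicated [5..9] and everything following it.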