Stanislav Kozlovski created KAFKA-7984: ------------------------------------------
Summary: Do not rebuild leader epochs on segments that do not support it Key: KAFKA-7984 URL: https://issues.apache.org/jira/browse/KAFKA-7984 Project: Kafka Issue Type: Bug Reporter: Stanislav Kozlovski Assignee: Stanislav Kozlovski h3. Preface https://issues.apache.org/jira/browse/KAFKA-7897 (logs would store some leader epochs even if they did not support them - this is essentially a regression from https://issues.apache.org/jira/browse/KAFKA-7415) https://issues.apache.org/jira/browse/KAFKA-7959 If users are running Kafka with https://issues.apache.org/jira/browse/KAFKA-7415 merged in, chances are they have sparsely-populated leader epoch cache files. KAFKA-7897's implementation unintentionally handled the case of deletes those leader epoch cache files for versions 2.1+. For versions below, KAFKA-7959 fixes that. In any case, as it currently stands, a broker started up with a message format of `0.10.0` will have those leader epoch cache files deleted. h3. Problem We have logic [that rebuilds these leader epoch cache files|https://github.com/apache/kafka/blob/217f45ed554b34d5221e1dd3db76e4be892661cf/core/src/main/scala/kafka/log/Log.scala#L614] when recovering segments that do not have a clean shutdown file. It goes over the record batches and rebuilds the leader epoch. KAFKA-7959's implementation guards against this by checking that the log.message.format supports it, *but* that issue is only merged for versions *below 2.1*. Moreover, the case where `log.message.format >= 0.11` *is not handled*. If a broker has the following log segment file: {code:java} offset 0, format v2, epoch 1 offset 1, format v2, epoch 1 offset 2, format v1, no epoch offset 3, format v1, no epoch {code} and gets upgraded to a new log message format that supports it, the rebuild of any logs that had an unclean shutdown will populate the leader epoch cache again, potentially resulting in the issue described in KAFKA-7897 One potential simple way to solve this is to clear the accumulated leader epoch cache when encountering a batch with no epoch upon segment rebuilding. -- This message was sent by Atlassian JIRA (v7.6.3#76005)