[ https://issues.apache.org/jira/browse/KAFKA-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Rodrigues updated KAFKA-9826:
-----------------------------------
    Summary: Log cleaning repeatedly picks same segment with no effect when first dirty offset is past start of active segment  (was: Log cleaning goes in infinite loop when first dirty offset is past start of active segment)

> Log cleaning repeatedly picks same segment with no effect when first dirty
> offset is past start of active segment
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-9826
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9826
>             Project: Kafka
>          Issue Type: Bug
>          Components: log cleaner
>    Affects Versions: 2.4.1
>            Reporter: Steve Rodrigues
>            Assignee: Steve Rodrigues
>            Priority: Major
>
> Seen on a system where a given partition had a single segment, and where, for
> whatever reason (deleteRecords?), the logStartOffset was greater than the
> base offset of that segment, there was a continuous series of
> ```
> [2020-03-03 16:56:31,374] WARN Resetting first dirty offset of FOO-3 to log
> start offset 55649 since the checkpointed offset 0 is invalid.
> (kafka.log.LogCleanerManager$)
> ```
> messages (partition name changed; it wasn't really FOO). This was expected to
> have been resolved by KAFKA-6266, but clearly wasn't.
> Further investigation revealed that a few partitions were continuously being
> cleaned, generating messages in `log-cleaner.log` of the form:
> ```
> [2020-03-31 13:34:50,924] INFO Cleaner 1: Beginning cleaning of log FOO-3
> (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,924] INFO Cleaner 1: Building offset map for FOO-3...
> (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,927] INFO Cleaner 1: Building offset map for log FOO-3
> for 0 segments in offset range [55287, 54237). (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,927] INFO Cleaner 1: Offset map for log FOO-3 complete.
> (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,927] INFO Cleaner 1: Cleaning log FOO-3 (cleaning prior
> to Wed Dec 31 19:00:00 EST 1969, discarding tombstones prior to Tue Dec 10
> 13:39:08 EST 2019)... (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,927] INFO [kafka-log-cleaner-thread-1]: Log cleaner
> thread 1 cleaned log FOO-3 (dirty section = [55287, 55287])
> 0.0 MB of log processed in 0.0 seconds (0.0 MB/sec).
> Indexed 0.0 MB in 0.0 seconds (0.0 Mb/sec, 100.0% of total time)
> Buffer utilization: 0.0%
> Cleaned 0.0 MB in 0.0 seconds (NaN Mb/sec, 0.0% of total time)
> Start size: 0.0 MB (0 messages)
> End size: 0.0 MB (0 messages) NaN% size reduction (NaN% fewer messages)
> (kafka.log.LogCleaner)
> ```
> What seems to have happened here (data determined for a different partition)
> is the following:
> There are a number of partitions here that get relatively low traffic,
> including our friend FOO-5. For whatever reason, the LogStartOffset of this
> partition has moved beyond the baseOffset of the active segment. (Notes in
> other issues indicate that this is a reasonable scenario.) So there is one
> segment, starting at 166266, and a log start of 166301.
> grabFilthiestCompactedLog runs and reads the checkpoint file. It sees that
> this topic-partition needs to be cleaned, and calls cleanableOffsets on it,
> which returns an OffsetsToClean with firstDirtyOffset == logStartOffset ==
> 166301, firstUncleanableOffset = max(logStart, activeSegment.baseOffset) =
> 166301, and forceCheckpoint = true.
> The checkpoint file is updated in grabFilthiestCompactedLog (this is the fix
> for KAFKA-6266). We then create a LogToClean object based on the
> firstDirtyOffset and firstUncleanableOffset of 166301 (past the active
> segment's base offset).
> The LogToClean object has cleanBytes = logSegments(-1,
> firstDirtyOffset).map(_.size).sum → the size of this segment. Its
> firstUncleanableOffset and cleanableBytes are determined by
> calculateCleanableBytes.
> calculateCleanableBytes returns:
> ```
> val firstUncleanableSegment =
>   log.nonActiveLogSegmentsFrom(uncleanableOffset).headOption.getOrElse(log.activeSegment)
> val firstUncleanableOffset = firstUncleanableSegment.baseOffset
> val cleanableBytes = log.logSegments(firstDirtyOffset,
>   math.max(firstDirtyOffset, firstUncleanableOffset)).map(_.size.toLong).sum
> (firstUncleanableOffset, cleanableBytes)
> ```
> firstUncleanableSegment is the activeSegment. firstUncleanableOffset is its
> base offset, 166266. cleanableBytes looks at logSegments(166301, max(166301,
> 166266)) → which _is the active segment_.
> So there are "cleanableBytes" > 0.
> We then filter out logs with totalBytes (clean + cleanable) == 0. This
> segment has totalBytes > 0, and it has cleanableBytes, so great! It's the
> filthiest.
> The cleaner picks it and calls cleanLog on it, which then does cleaner.clean,
> which returns the nextDirtyOffset and cleaner stats. cleaner.clean calls
> doClean, which builds an offset map. The offset map only looks at non-active
> segments when building, but there aren't any. So the endOffset of the
> offset map is lastOffset (default -1) + 1 → 0. We record the stats (including
> logging to log-cleaner.log). After this we call cleanerManager.doneCleaning,
> which writes the checkpoint file with the latest value… of 0.
> And then the process starts all over.
> There appear to be at least two bugs here: first, `log.logSegments(from,
> to)` returns an empty list if from == to and both are segment-aligned, but
> _not_ if they fall in the middle of a segment; second, LogToClean uses
> start = firstDirtyOffset, end = max(firstDirtyOffset,
> firstUncleanableOffset), which can move the firstUncleanableOffset forward
> even when the firstDirtyOffset is already past firstUncleanable.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
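Addendum: the interaction described above can be reproduced with a small model. The sketch below is a hypothetical, simplified Python re-implementation of the quoted Scala logic — the `Segment`/`Log` classes, the 1024-byte segment size, and the range semantics of `log_segments` are assumptions made for illustration, not Kafka's actual code. It shows how `calculateCleanableBytes` can report cleanable bytes for the active segment itself when the first dirty offset is past that segment's base offset, and the segment-aligned vs. mid-segment asymmetry of `logSegments(from, to)`:

```python
class Segment:
    """A log segment: base offset plus size in bytes (illustrative only)."""
    def __init__(self, base_offset, size):
        self.base_offset = base_offset
        self.size = size

class Log:
    def __init__(self, segments):
        # Segments sorted by base_offset; the last one is the active segment.
        self.segments = segments

    @property
    def active_segment(self):
        return self.segments[-1]

    def non_active_segments_from(self, offset):
        # Non-active segments whose base offset is >= the given offset.
        return [s for s in self.segments[:-1] if s.base_offset >= offset]

    def log_segments(self, from_offset, to_offset):
        # Assumed semantics mirroring Log.logSegments(from, to): include a
        # segment if the *next* segment starts after `from_offset` (so the
        # segment could contain `from_offset`) and its own base offset is
        # strictly below `to_offset`.
        result = []
        for i, seg in enumerate(self.segments):
            next_base = (self.segments[i + 1].base_offset
                         if i + 1 < len(self.segments) else float("inf"))
            if next_base > from_offset and seg.base_offset < to_offset:
                result.append(seg)
        return result

def calculate_cleanable_bytes(log, first_dirty_offset, uncleanable_offset):
    # Model of the quoted calculateCleanableBytes snippet.
    candidates = log.non_active_segments_from(uncleanable_offset)
    first_uncleanable = candidates[0] if candidates else log.active_segment
    first_uncleanable_offset = first_uncleanable.base_offset
    cleanable_bytes = sum(s.size for s in log.log_segments(
        first_dirty_offset,
        max(first_dirty_offset, first_uncleanable_offset)))
    return first_uncleanable_offset, cleanable_bytes

# The scenario from the report: one segment at 166266, log start 166301.
log = Log([Segment(base_offset=166266, size=1024)])
print(calculate_cleanable_bytes(log, 166301, 166301))
# → (166266, 1024): the active segment itself is counted as "cleanable",
# so the log keeps being selected as filthiest even though nothing can move.

# Segment-aligned from == to, by contrast, yields an empty range:
aligned = Log([Segment(base_offset=0, size=10), Segment(base_offset=100, size=20)])
print(aligned.log_segments(100, 100))
# → []
```

Under these assumed semantics, `from == to` returns nothing when both sit on a segment boundary but returns the containing segment when they fall mid-segment — which matches the asymmetry called out in the last paragraph of the report.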