Steve Rodrigues created KAFKA-9826:
--------------------------------------

             Summary: Log cleaning goes in infinite loop when first dirty 
offset is past start of active segment
                 Key: KAFKA-9826
                 URL: https://issues.apache.org/jira/browse/KAFKA-9826
             Project: Kafka
          Issue Type: Bug
          Components: log cleaner
    Affects Versions: 2.4.1
            Reporter: Steve Rodrigues
            Assignee: Steve Rodrigues


Seen on a system where a given partition had a single segment and, for whatever 
reason (deleteRecords?), the logStartOffset was greater than the base offset of 
that segment. There was a continuous series of 

```
[2020-03-03 16:56:31,374] WARN Resetting first dirty offset of  FOO-3 to log start offset 55649 since the checkpointed offset 0 is invalid. (kafka.log.LogCleanerManager$)
```

messages (the partition name has been changed; it wasn't really FOO). This was 
expected to be resolved by KAFKA-6266 but clearly wasn't. 

Further investigation revealed that a few partitions were being cleaned 
continuously, generating messages in the `log-cleaner.log` of the form:

```
[2020-03-31 13:34:50,924] INFO Cleaner 1: Beginning cleaning of log FOO-3 (kafka.log.LogCleaner)
[2020-03-31 13:34:50,924] INFO Cleaner 1: Building offset map for FOO-3... (kafka.log.LogCleaner)
[2020-03-31 13:34:50,927] INFO Cleaner 1: Building offset map for log FOO-3 for 0 segments in offset range [55287, 54237). (kafka.log.LogCleaner)
[2020-03-31 13:34:50,927] INFO Cleaner 1: Offset map for log FOO-3 complete. (kafka.log.LogCleaner)
[2020-03-31 13:34:50,927] INFO Cleaner 1: Cleaning log FOO-3 (cleaning prior to Wed Dec 31 19:00:00 EST 1969, discarding tombstones prior to Tue Dec 10 13:39:08 EST 2019)... (kafka.log.LogCleaner)
[2020-03-31 13:34:50,927] INFO [kafka-log-cleaner-thread-1]: Log cleaner thread 1 cleaned log FOO-3 (dirty section = [55287, 55287])
0.0 MB of log processed in 0.0 seconds (0.0 MB/sec).
Indexed 0.0 MB in 0.0 seconds (0.0 Mb/sec, 100.0% of total time)
Buffer utilization: 0.0%
Cleaned 0.0 MB in 0.0 seconds (NaN Mb/sec, 0.0% of total time)
Start size: 0.0 MB (0 messages)
End size: 0.0 MB (0 messages) NaN% size reduction (NaN% fewer messages) (kafka.log.LogCleaner)
```

What seems to have happened here (the data below was pulled from a different 
partition) is:

There exist a number of partitions here which get relatively low traffic, 
including our friend FOO-5. For whatever reason, the logStartOffset of this 
partition has moved beyond the baseOffset of the active segment. (Notes in 
other issues indicate that this is a reasonable scenario.) So there's one 
segment, starting at 166266, and a log start offset of 166301.

grabFilthiestCompactedLog runs and reads the checkpoint file. We see that this 
topic-partition needs to be cleaned and call cleanableOffsets on it, which 
returns an OffsetsToClean with firstDirtyOffset == logStartOffset == 166301, 
firstUncleanableOffset = max(logStart, activeSegment.baseOffset) = 166301, and 
forceCheckpoint = true.
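
To make those numbers concrete, here's a minimal, self-contained Scala sketch of 
the offset arithmetic described above (the value names are just illustrative, 
not Kafka's actual cleanableOffsets internals):

```
// Standalone sketch of the offsets in play for FOO-5 (values from above; the
// names checkpointedOffset/logStartOffset/activeSegmentBaseOffset are just
// illustrative stand-ins, not Kafka's internal fields).
object CleanableOffsetsSketch extends App {
  val checkpointedOffset      = 0L       // stale/invalid cleaner checkpoint
  val logStartOffset          = 166301L  // moved forward (deleteRecords?, per above)
  val activeSegmentBaseOffset = 166266L  // the one and only segment

  // The checkpoint is below logStartOffset, so the first dirty offset is reset to it.
  val firstDirtyOffset =
    if (checkpointedOffset < logStartOffset) logStartOffset else checkpointedOffset

  // As described above, the first uncleanable offset comes out as
  // max(logStartOffset, activeSegment.baseOffset).
  val firstUncleanableOffset = math.max(logStartOffset, activeSegmentBaseOffset)

  println(s"firstDirtyOffset=$firstDirtyOffset firstUncleanableOffset=$firstUncleanableOffset")
  // firstDirtyOffset=166301 firstUncleanableOffset=166301
  // i.e. the "dirty" range starts past the active segment's base offset.
}
```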

The checkpoint file is updated in grabFilthiestCompactedLog (this is the fix 
for KAFKA-6266). We then create a LogToClean object based on the 
firstDirtyOffset and firstUncleanableOffset of 166301 (past the active 
segment’s base offset).

The LogToClean object has cleanBytes = logSegments(-1, 
firstDirtyOffset).map(_.size).sum → the size of this segment. It has 
firstUncleanableOffset and cleanableBytes determined by 
calculateCleanableBytes. calculateCleanableBytes returns:
```
val firstUncleanableSegment = log.nonActiveLogSegmentsFrom(uncleanableOffset).headOption.getOrElse(log.activeSegment)
val firstUncleanableOffset = firstUncleanableSegment.baseOffset
val cleanableBytes = log.logSegments(firstDirtyOffset, math.max(firstDirtyOffset, firstUncleanableOffset)).map(_.size.toLong).sum

(firstUncleanableOffset, cleanableBytes)
```
firstUncleanableSegment is the activeSegment. firstUncleanableOffset is its base 
offset, 166266. cleanableBytes looks at logSegments(166301, max(166301, 166266)) 
= logSegments(166301, 166301) → which _is the active segment_.

So we end up with cleanableBytes > 0.
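
A toy model of why that sum is non-zero, assuming a floor-based segment lookup 
like the behaviour described in this ticket (Segment, the sizes, and the map 
layout here are made up for illustration; this is not the real Log/LogSegment 
code):

```
import java.util.TreeMap
import scala.jdk.CollectionConverters._

// Toy model of a one-segment log. The floor-based subMap lookup mimics the
// logSegments(from, to) behaviour described in this ticket.
object CleanableBytesSketch extends App {
  case class Segment(baseOffset: Long, size: Long)

  val segments = new TreeMap[java.lang.Long, Segment]()
  segments.put(166266L, Segment(baseOffset = 166266L, size = 1024L)) // the active segment

  def logSegments(from: Long, to: Long): Iterable[Segment] = {
    // Start from the segment *containing* `from` (floor lookup), end before `to`.
    val view = Option(segments.floorKey(from)) match {
      case Some(floor) => segments.subMap(floor, to)
      case None        => segments.headMap(to)
    }
    view.values.asScala
  }

  val firstDirtyOffset       = 166301L
  val firstUncleanableOffset = 166266L // the active segment's base offset

  // calculateCleanableBytes as quoted above:
  val cleanableBytes =
    logSegments(firstDirtyOffset, math.max(firstDirtyOffset, firstUncleanableOffset))
      .map(_.size).sum

  println(s"cleanableBytes = $cleanableBytes")
  // cleanableBytes = 1024 -> the active segment is counted, so the log looks cleanable
}
```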

We then filter for logs with totalBytes (clean + cleanable) > 0. This log has 
totalBytes > 0, and it has cleanableBytes, so great! It's the filthiest.

The cleaner picks it and calls cleanLog on it, which then does cleaner.clean, 
which returns nextDirtyOffset and cleaner stats. cleaner.clean calls doClean, 
which builds an offsetMap. The offsetMap only looks at non-active segments when 
building, but there aren't any. So the endOffset of the offsetMap is lastOffset 
(default -1) + 1 → 0. We record the stats (including logging to 
log-cleaner.log). After this we call cleanerManager.doneCleaning, which writes 
the checkpoint file with the latest value… of 0.
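
In other words (a tiny illustrative sketch, not the actual offset-map code), the 
end offset degenerates like this:

```
// Sketch of the degenerate offset-map end offset when there is nothing to scan.
object OffsetMapEndOffsetSketch extends App {
  // Offsets of records added to the offset map; empty because there are no
  // non-active segments in the dirty range.
  val offsetsAddedToMap: Seq[Long] = Seq.empty

  val lastOffset = offsetsAddedToMap.lastOption.getOrElse(-1L) // default -1
  val endOffset  = lastOffset + 1                              // -1 + 1 = 0

  println(s"checkpointing endOffset = $endOffset") // 0 overwrites the checkpoint we just fixed
}
```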

And then the process starts all over.

It appears that there's at least one bug here: `log.logSegments(from, to)` will 
return an empty list if from == to and both are segment-aligned, but _not_ if 
they fall in the middle of a segment. There is possibly a second one: LogToClean 
uses start = firstDirtyOffset and end = max(firstDirtyOffset, 
firstUncleanableOffset), so it can move the firstUncleanableOffset forward even 
when the firstDirtyOffset is already past the real firstUncleanableOffset.
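
A small sketch of that suspected logSegments(from, to) asymmetry, again 
modelling the lookup with a floor-based range query rather than the real Log 
implementation:

```
import java.util.TreeMap
import scala.jdk.CollectionConverters._

// Demonstrates the asymmetry described above: an empty (from == to) range
// returns no segments when the offsets are segment-aligned, but returns the
// containing segment when they fall in the middle of one.
object LogSegmentsEdgeCaseSketch extends App {
  case class Segment(baseOffset: Long, size: Long)

  val segments = new TreeMap[java.lang.Long, Segment]()
  segments.put(166266L, Segment(166266L, 1024L))

  def logSegments(from: Long, to: Long): Iterable[Segment] = {
    val view = Option(segments.floorKey(from)) match {
      case Some(floor) => segments.subMap(floor, to) // floor inclusive, to exclusive
      case None        => segments.headMap(to)
    }
    view.values.asScala
  }

  // Segment-aligned empty range: subMap(166266, 166266) -> no segments.
  println(logSegments(166266L, 166266L).size) // 0

  // Mid-segment "empty" range: floorKey(166301) = 166266, and subMap(166266, 166301)
  // still contains the segment -> 1 segment, hence cleanableBytes > 0.
  println(logSegments(166301L, 166301L).size) // 1
}
```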

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
