[ https://issues.apache.org/jira/browse/KAFKA-6264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16399698#comment-16399698 ]

James Cheng commented on KAFKA-6264:
------------------------------------

We recently ran into this on one of our clusters.
{code:java}
[2018-02-21 14:24:17,638] INFO Cleaner 0: Beginning cleaning of log 
__consumer_offsets-45. (kafka.log.LogCleaner) 
[2018-02-21 14:24:17,638] INFO Cleaner 0: Building offset map for 
__consumer_offsets-45... (kafka.log.LogCleaner) 
[2018-02-21 14:24:17,796] INFO Cleaner 0: Building offset map for log 
__consumer_offsets-45 for 20 segments in offset range [0, 2469849110). 
(kafka.log.LogCleaner) 
[2018-02-21 14:24:19,143] INFO Cleaner 0: Offset map for log 
__consumer_offsets-45 complete. (kafka.log.LogCleaner) 
[2018-02-21 14:24:19,143] INFO Cleaner 0: Cleaning log __consumer_offsets-45 
(cleaning prior to Wed Feb 14 07:56:02 UTC 2018, discarding tombstones prior to 
Thu Jan 01 00:00:00 UTC 1970)... (kafka.log.LogCleaner) 
[2018-02-21 14:24:19,144] INFO Cleaner 0: Cleaning segment 0 in log 
__consumer_offsets-45 (largest timestamp Tue Feb 28 02:56:04 UTC 2017) into 0, 
retaining deletes. (kafka.log.LogCleaner) 
[2018-02-21 14:24:19,155] ERROR [kafka-log-cleaner-thread-0]: Error due to 
(kafka.log.LogCleaner) 
java.lang.IllegalArgumentException: requirement failed: largest offset in 
message set can not be safely converted to relative offset. 
at scala.Predef$.require(Predef.scala:224) 
at kafka.log.LogSegment.append(LogSegment.scala:121) 
at kafka.log.Cleaner.cleanInto(LogCleaner.scala:547) 
at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:443) 
at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:385) 
at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:384) 
at scala.collection.immutable.List.foreach(List.scala:392) 
at kafka.log.Cleaner.doClean(LogCleaner.scala:384) 
at kafka.log.Cleaner.clean(LogCleaner.scala:361) 
at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:256) 
at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:236) 
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64) 
[2018-02-21 14:24:19,169] INFO [kafka-log-cleaner-thread-0]: Stopped 
(kafka.log.LogCleaner){code}
We used DumpLogSegments to look at the difference between the first and last 
offset in each of the .log files. I didn't find any file where the difference 
was greater than MAX_INT (2^31-1).
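For reference, this is roughly how the offsets can be inspected; the 
kafka-run-class.sh path and the data directory below are only example 
assumptions, and flags may vary slightly by Kafka version:
{code:bash}
# Dump batch metadata for one segment; the first and last "offset:" values
# printed give that file's offset range. Paths here are examples only.
bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
  --deep-iteration \
  --files /var/kafka-logs/__consumer_offsets-45/00000000000000000000.log
{code}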

However, KAFKA-6264 (https://issues.apache.org/jira/browse/KAFKA-6264) mentions 
that the segment's base offset, which is taken from the filename, is part of 
this calculation. So I compared each filename against the last offset in that 
file.
{code:java}
sh-4.2# ls -l *.log 
-rw-r--r--. 1 root root 12195 Apr 27 2017 00000000000000000000.log 
-rw-r--r--. 1 root root 236030 Oct 10 23:07 00000000002469031807.log 
-rw-r--r--. 1 root root 1925969 Oct 18 18:46 00000000002469285574.log 
-rw-r--r--. 1 root root 78693 Oct 25 22:26 00000000002469290346.log 
-rw-r--r--. 1 root root 94544 Nov 1 17:29 00000000002469290862.log 
-rw-r--r--. 1 root root 270094 Nov 8 22:51 00000000002469291421.log 
-rw-r--r--. 1 root root 497500 Nov 16 06:24 00000000002469292884.log 
-rw-r--r--. 1 root root 5476783 Nov 23 21:19 00000000002469295854.log 
-rw-r--r--. 1 root root 18245427 Dec 1 08:45 00000000002469321787.log 
-rw-r--r--. 1 root root 7527104 Dec 8 07:38 00000000002469414611.log 
-rw-r--r--. 1 root root 6751417 Dec 15 09:36 00000000002469453997.log 
-rw-r--r--. 1 root root 5904919 Dec 22 10:01 00000000002469487104.log 
-rw-r--r--. 1 root root 293325 Dec 29 10:02 00000000002469528006.log 
-rw-r--r--. 1 root root 798008 Jan 5 23:02 00000000002469529772.log 
-rw-r--r--. 1 root root 9366028 Jan 13 10:02 00000000002469535454.log 
-rw-r--r--. 1 root root 7357635 Jan 20 10:01 00000000002469577027.log 
-rw-r--r--. 1 root root 1073890 Jan 27 10:05 00000000002469615817.log 
-rw-r--r--. 1 root root 6390581 Feb 3 13:15 00000000002469622670.log 
-rw-r--r--. 1 root root 44379090 Feb 11 04:01 00000000002469652193.log 
-rw-r--r--. 1 root root 970717 Feb 17 16:27 00000000002469845650.log 
-rw-r--r--. 1 root root 468160 Feb 24 10:01 00000000002469850446.log 
-rw-r--r--. 1 root root 104857353 Feb 28 16:55 00000000002469854492.log 
-rw-r--r--. 1 root root 104857445 Feb 28 17:22 00000000002470671339.log 
-rw-r--r--. 1 root root 104857445 Feb 28 18:23 00000000002471488410.log 
-rw-r--r--. 1 root root 104857445 Feb 28 19:49 00000000002472305481.log 
-rw-r--r--. 1 root root 104857445 Feb 28 21:05 00000000002473122552.log 
-rw-r--r--. 1 root root 104857445 Feb 28 23:13 00000000002473939623.log 
-rw-r--r--. 1 root root 104857445 Mar 1 02:05 00000000002474756694.log 
-rw-r--r--. 1 root root 104857445 Mar 1 02:47 00000000002475573765.log 
-rw-r--r--. 1 root root 104857445 Mar 1 03:19 00000000002476390836.log 
-rw-r--r--. 1 root root 104857438 Mar 1 05:33 00000000002477207907.log 
-rw-r--r--. 1 root root 104857572 Mar 1 06:24 00000000002478024876.log 
-rw-r--r--. 1 root root 104857445 Mar 1 07:26 00000000002478841838.log 
-rw-r--r--. 1 root root 41995747 Mar 1 13:19 00000000002479658909.log
{code}
As you can see, the first log file is named 00000000000000000000.log. The last 
offset in that file is 2469031806, which is greater than 2^31-1.
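Put differently, the relative offset the cleaner has to write is the last 
offset minus the segment's base offset (taken from the filename), and that 
value must fit in a signed 32-bit int. A quick sanity check with the numbers 
above (values copied from the listing; this is just illustrative arithmetic):
{code:bash}
base=0            # base offset from the filename 00000000000000000000.log
last=2469031806   # last offset found in that file
max=$(( (1 << 31) - 1 ))   # Int.MaxValue = 2147483647
echo "relative offset: $(( last - base )), limit: $max"
# relative offset: 2469031806, limit: 2147483647 -> overflow, cleaner asserts
{code}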

In our case, the actual range of offsets in 00000000000000000000.log (from the 
first message in the file to the last message in the file) was less than 
2^31-1. So Confluent provided the following steps to help us recover from this 
issue:

1) Identify the broker that logged the above stack trace for __consumer_offsets-45.
2) Shut down that broker, making sure that some other follower has taken over 
leadership of the partition.
3) rm -rf all the files in that partition's data directory.
4) Restart the broker and wait for re-replication of that partition (a rough 
sketch of these steps follows below).
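The sketch below is only an approximation of those steps; the scripts, paths, 
and config file locations are assumptions that will differ per environment:
{code:bash}
# 0) Confirm another in-sync replica is now the leader for the partition
bin/kafka-topics.sh --describe --topic __consumer_offsets --zookeeper zk:2181

# 1) Stop the affected broker
bin/kafka-server-stop.sh

# 2) Remove that partition's data directory so it re-replicates from the leader
#    (log.dirs location is an example path)
rm -rf /var/kafka-logs/__consumer_offsets-45

# 3) Restart the broker and wait for the partition to return to the ISR
bin/kafka-server-start.sh -daemon config/server.properties
{code}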

Repeat as necessary on any other brokers and any other partitions that have 
this exact problem.

After this, the log cleaner was able to properly run and clean those partitions.

I believe the reason this worked is that when we re-replicated the data, the 
new log segments were based on the *actual* range of offsets, and *not* the 
original filenames. And since the actual offset range within each segment was 
less than 2^31-1, we were under the MAX_INT threshold, so the log cleaner was 
able to run properly.

If, instead, we had re-replicated and the actual offset range within a segment 
had still been greater than 2^31-1, then I believe the newly re-replicated 
partition would have run into this problem again and the workaround would not 
have worked.

> Log cleaner thread may die on legacy segment containing messages whose 
> offsets are too large
> --------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-6264
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6264
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.2.1, 1.0.0, 0.11.0.2
>            Reporter: Jiangjie Qin
>            Assignee: Dhruvil Shah
>            Priority: Critical
>             Fix For: 1.2.0
>
>
> We encountered a problem that some of the legacy log segments contain 
> messages whose offsets are larger than {{SegmentBaseOffset + Int.MaxValue}}.
> Prior to 0.10.2.0, we did not assert the offsets of messages when appending 
> them to the log segments. Due to KAFKA-5413, the log cleaner may append 
> messages whose offset is greater than {{base_offset + Int.MaxValue}} into the 
> segment during log compaction.
> After the brokers are upgraded, those log segments cannot be compacted 
> anymore because the compaction will fail immediately due to the offset range 
> assertion we added to the LogSegment.
> We have seen this issue in the __consumer_offsets topic so it could be a 
> general problem. There is no easy solution for the users to recover from this 
> case. 
> One solution is to split such log segments in the log cleaner once it sees a 
> message with problematic offset and append those messages to a separate log 
> segment with a larger base_offset.
> Due to the impact of the issue, we may want to consider backporting the fix 
> to previously affected versions.


