[
https://issues.apache.org/jira/browse/KAFKA-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049156#comment-16049156
]
Carsten Rietz edited comment on KAFKA-5431 at 6/14/17 1:11 PM:
---------------------------------------------------------------
I still have no idea what exactly triggers this behaviour. Maybe it's
some strange behaviour of our producers. I have been able to find one instance
on our dev system with a low roll interval, a one-broker cluster and some
restarts.
Here is the corrupt log file:
{code}
[user@host __consumer_offsets-12]# ll -hs | head
total 2.7M
0 -rw-r--r-- 1 jboss jboss 0 Jun 14 13:31 00000000000000000000.index
4.0K -rw-r--r-- 1 jboss jboss 100M Jun 14 12:30 00000000000000000000.log
4.0K -rw-r--r-- 1 jboss jboss 12 Jun 14 13:31 00000000000000000000.timeindex
0 -rw-r--r-- 1 jboss jboss 0 Jun 14 13:31 00000000000000000001.index
4.0K -rw-r--r-- 1 jboss jboss 122 Jun 14 12:30 00000000000000000001.log
4.0K -rw-r--r-- 1 jboss jboss 12 Jun 14 13:31 00000000000000000001.timeindex
0 -rw-r--r-- 1 jboss jboss 0 Jun 14 13:31 00000000000000000002.index
4.0K -rw-r--r-- 1 jboss jboss 122 Jun 14 12:30 00000000000000000002.log
4.0K -rw-r--r-- 1 jboss jboss 12 Jun 14 13:31 00000000000000000002.timeindex
[user@host __consumer_offsets-10]# stat 00000000000000000000.log
File: `00000000000000000000.log'
Size: 104857600 Blocks: 8 IO Block: 4096 regular file
Device: fd00h/64768d Inode: 4718956 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 494/ jboss) Gid: ( 488/ jboss)
Access: 2017-06-14 12:31:03.585079439 +0100
Modify: 2017-06-14 12:30:42.000000000 +0100
Change: 2017-06-14 12:30:48.520886205 +0100
{code}
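A quick way to confirm the sparseness that stat already hints at (8 blocks for a 100M apparent size), sketched in Python:
{code}
import os

st = os.stat("00000000000000000000.log")
apparent = st.st_size            # 104857600 (100M), as reported by stat above
allocated = st.st_blocks * 512   # stat shows 8 blocks, i.e. only 4K on disk
print("apparent=%d allocated=%d sparse=%s" % (apparent, allocated, allocated < apparent))
{code}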
{code}
[user@host __consumer_offsets-10]# kafka-run-class.sh
kafka.tools.DumpLogSegments --files 00000000000000000000.log
Dumping 00000000000000000000.log
Starting offset: 0
offset: 0 position: 0 CreateTime: 1497439842157 isvalid: true payloadsize: 257
magic: 1 compresscodec: NONE crc: 1102674092 keysize: 26
Exception in thread "main"
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller
than minimum record overhead (14).
{code}
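Some context on that exception: in the legacy (magic 0/1) on-disk format each log entry is framed as an 8-byte offset plus a 4-byte size, followed by the record, and the minimum record overhead is 14 bytes (crc 4 + magic 1 + attributes 1 + key size 4 + value size 4). Reading into a zero-filled preallocated tail yields size 0, which is exactly what the check rejects. A rough Python sketch of that framing check (an illustration, not Kafka's actual code):
{code}
import struct

LOG_OVERHEAD = 12         # offset (8 bytes) + size (4 bytes)
MIN_RECORD_OVERHEAD = 14  # crc(4) + magic(1) + attributes(1) + keysize(4) + valuesize(4)

def next_entry(buf, pos):
    # Each legacy log entry: offset (8 bytes) | size (4 bytes) | record (size bytes)
    offset, size = struct.unpack_from(">qi", buf, pos)
    if size < MIN_RECORD_OVERHEAD:
        # The zero-filled tail of a preallocated segment decodes as size == 0 here
        raise ValueError("Record size is smaller than minimum record overhead (14)")
    return offset, pos + LOG_OVERHEAD + size
{code}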
Hexdumping the file also shows that it's still a sparse file with trailing zero
content:
{code}
[user@host __consumer_offsets-10]# hexdump -C 00000000000000000000.log
00000000 00 00 00 00 00 00 00 00 00 00 01 31 41 b9 78 ac |...........1A.x.|
00000010 01 00 00 00 01 5c a6 5e af 6d 00 00 00 1a 00 02 |.....\.^.m......|
00000020 00 16 63 6f 6e 73 6f 6c 65 2d 63 6f 6e 73 75 6d |..console-consum|
00000030 65 72 2d 35 39 37 37 33 00 00 01 01 00 01 00 08 |er-59773........|
00000040 63 6f 6e 73 75 6d 65 72 00 00 00 01 00 05 72 61 |consumer......ra|
00000050 6e 67 65 00 2f 63 6f 6e 73 75 6d 65 72 2d 31 2d |nge./consumer-1-|
00000060 39 39 36 33 62 31 36 36 2d 39 32 39 39 2d 34 34 |9963b166-9299-44|
00000070 30 66 2d 38 64 39 37 2d 35 62 34 38 31 33 64 35 |0f-8d97-5b4813d5|
00000080 63 39 38 34 00 00 00 01 00 2f 63 6f 6e 73 75 6d |c984...../consum|
00000090 65 72 2d 31 2d 39 39 36 33 62 31 36 36 2d 39 32 |er-1-9963b166-92|
000000a0 39 39 2d 34 34 30 66 2d 38 64 39 37 2d 35 62 34 |99-440f-8d97-5b4|
000000b0 38 31 33 64 35 63 39 38 34 00 0a 63 6f 6e 73 75 |813d5c984..consu|
000000c0 6d 65 72 2d 31 00 0e 2f 31 30 2e 32 34 38 2e 34 |mer-1../10.248.4|
000000d0 33 2e 32 33 33 00 04 93 e0 00 00 27 10 00 00 00 |3.233......'....|
000000e0 28 00 00 00 00 00 01 00 1c 74 63 2e 69 6e 2e 64 |(........tc.in.d|
000000f0 65 76 2e 73 74 61 74 69 73 74 69 63 73 2e 67 72 |ev.statistics.gr|
00000100 61 70 68 69 74 00 00 00 00 00 00 00 30 00 00 00 |aphit.......0...|
00000110 00 00 01 00 1c 74 63 2e 69 6e 2e 64 65 76 2e 73 |.....tc.in.dev.s|
00000120 74 61 74 69 73 74 69 63 73 2e 67 72 61 70 68 69 |tatistics.graphi|
00000130 74 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 |t...............|
00000140 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
06400000
{code}
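Decoding the first 30 header bytes of the hexdump by hand agrees with the DumpLogSegments output above, so the first record itself looks intact and only the zero tail is bad. A small Python check with the bytes copied from the dump:
{code}
import struct

# offset(8) | size(4) | crc(4) | magic(1) | attrs(1) | timestamp(8) | keysize(4)
hdr = bytes.fromhex(
    "0000000000000000" "00000131" "41b978ac" "01" "00" "0000015ca65eaf6d" "0000001a"
)
offset, size, crc = struct.unpack_from(">qiI", hdr, 0)
magic, attrs = hdr[16], hdr[17]
timestamp, keysize = struct.unpack_from(">qi", hdr, 18)
print(offset, size, crc, magic, timestamp, keysize)
# -> 0 305 1102674092 1 1497439842157 26
# 305 = 14 bytes overhead + 8-byte timestamp + 26-byte key
#       + 4-byte value length + 257-byte payload
{code}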
We are running kafka_2.12-0.10.2.1 on jdk1.8.0_121 on Red Hat Enterprise Linux
Server release 6.9.
Relevant parameters:
{code}
log.segment.bytes=10485760
log.retention.check.interval.ms=1000
log.roll.ms=10000
log.preallocate=true
log.retention.bytes=10485760
{code}
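For reference, log.preallocate=true makes the broker create each new segment file at full apparent length up front (via RandomAccessFile.setLength on the Java side, as far as I can tell), which on ext4 gives exactly the sparse, zero-tailed file shown in the hexdump. The filesystem-level effect, sketched in Python:
{code}
# Rough filesystem-level equivalent of segment preallocation (illustration only,
# hypothetical file name; Kafka does this internally when log.preallocate=true):
with open("preallocated.log", "wb") as f:
    f.truncate(104857600)  # apparent size 100M, no blocks actually written
# Reads of the unwritten tail return zeros - the trailing zero content above.
{code}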
And finally the broker logs. I had removed all __consumer_offsets data and
restarted the Kafka broker beforehand because of a full disk.
{code}
[2017-06-14 12:28:46,198] INFO Started Kafka Graphite metrics reporter with
polling period 1 seconds (com.criteo.kafka.KafkaGraphiteMetricsReporter)
...
[2017-06-14 12:28:51,170] INFO [Group Metadata Manager on Broker 1]: Loading
offsets and group metadata from __consumer_offsets-12
(kafka.coordinator.GroupMetadataManager)
[2017-06-14 12:28:51,189] INFO [Group Metadata Manager on Broker 1]: Finished
loading offsets from __consumer_offsets-12 in 19 milliseconds.
(kafka.coordinator.GroupMetadataManager)
[2017-06-14 12:30:47,164] INFO Rolled new log segment for
'__consumer_offsets-12' in 1 ms. (kafka.log.Log)
[2017-06-14 12:30:48,431] INFO Cleaner 0: Beginning cleaning of log
__consumer_offsets-12. (kafka.log.LogCleaner)
[2017-06-14 12:30:48,432] INFO Cleaner 0: Building offset map for
__consumer_offsets-12... (kafka.log.LogCleaner)
[2017-06-14 12:30:48,470] INFO Cleaner 0: Building offset map for log
__consumer_offsets-12 for 1 segments in offset range [0, 1).
(kafka.log.LogCleaner)
[2017-06-14 12:30:48,499] INFO Cleaner 0: Offset map for log
__consumer_offsets-12 complete. (kafka.log.LogCleaner)
[2017-06-14 12:30:48,503] INFO Cleaner 0: Cleaning log __consumer_offsets-12
(cleaning prior to Wed Jun 14 12:30:42 BST 2017, discarding tombstones prior to
Thu Jan 01 01:00:00 GMT 1970)... (kafka.log.LogCleaner)
[2017-06-14 12:30:48,507] INFO Cleaner 0: Cleaning segment 0 in log
__consumer_offsets-12 (largest timestamp Wed Jun 14 12:30:42 BST 2017) into 0,
retaining deletes. (kafka.log.LogCleaner)
[2017-06-14 12:30:48,521] INFO Cleaner 0: Swapping in cleaned segment 0 for
segment(s) 0 in log __consumer_offsets-12. (kafka.log.LogCleaner)
Log cleaner thread 0 cleaned log __consumer_offsets-12 (dirty section =
[0, 0])
[2017-06-14 12:30:52,161] INFO Rolled new log segment for
'__consumer_offsets-12' in 1 ms. (kafka.log.Log)
{code}
Here the log file seems to have been rolled but not truncated, which is what
trips up the log cleaner a bit later:
{code}
[2017-06-14 12:31:03,564] INFO Cleaner 0: Beginning cleaning of log
__consumer_offsets-12. (kafka.log.LogCleaner)
[2017-06-14 12:31:03,564] INFO Cleaner 0: Building offset map for
__consumer_offsets-12... (kafka.log.LogCleaner)
[2017-06-14 12:31:03,583] INFO Cleaner 0: Building offset map for log
__consumer_offsets-12 for 3 segments in offset range [1, 4).
(kafka.log.LogCleaner)
[2017-06-14 12:31:03,584] INFO Cleaner 0: Offset map for log
__consumer_offsets-12 complete. (kafka.log.LogCleaner)
[2017-06-14 12:31:03,584] INFO Cleaner 0: Cleaning log __consumer_offsets-12
(cleaning prior to Wed Jun 14 12:31:02 BST 2017, discarding tombstones prior to
Tue Jun 13 12:30:42 BST 2017)... (kafka.log.LogCleaner)
[2017-06-14 12:31:03,585] INFO Cleaner 0: Cleaning segment 0 in log
__consumer_offsets-12 (largest timestamp Wed Jun 14 12:30:42 BST 2017) into 0,
retaining deletes. (kafka.log.LogCleaner)
[2017-06-14 12:31:03,589] ERROR [kafka-log-cleaner-thread-0], Error due to
(kafka.log.LogCleaner)
org.apache.kafka.common.errors.CorruptRecordException: Record size is less than
the minimum record overhead (14)
{code}
edit: It's likely that I cannot have another look before Monday. I hope someone
has a bright idea. Deactivating preallocation works fine as a workaround.
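(In case it helps anyone in the same state: one possible cleanup, assuming all valid records sit at the front of the segment as in the dump above, is to find the end of the last parseable legacy entry and truncate the file there, with the broker stopped and a backup taken. A hypothetical, unofficial Python sketch; verify the result with DumpLogSegments afterwards:)
{code}
import struct

LOG_OVERHEAD = 12         # offset (8 bytes) + size (4 bytes)
MIN_RECORD_OVERHEAD = 14

def valid_data_end(path):
    """Byte position just past the last parseable legacy (magic 0/1) entry."""
    data = open(path, "rb").read()
    pos = 0
    while pos + LOG_OVERHEAD <= len(data):
        size = struct.unpack_from(">i", data, pos + 8)[0]
        if size < MIN_RECORD_OVERHEAD or pos + LOG_OVERHEAD + size > len(data):
            break  # zero-filled preallocated tail or torn write
        pos += LOG_OVERHEAD + size
    return pos

# e.g. (broker stopped, backup taken!):
# end = valid_data_end("00000000000000000000.log")
# open("00000000000000000000.log", "r+b").truncate(end)
{code}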
> LogCleaner stopped due to
> org.apache.kafka.common.errors.CorruptRecordException
> -------------------------------------------------------------------------------
>
> Key: KAFKA-5431
> URL: https://issues.apache.org/jira/browse/KAFKA-5431
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 0.10.2.1
> Reporter: Carsten Rietz
> Labels: reliability
>
> Hey all,
> I have a strange problem with our UAT cluster of 3 Kafka brokers.
> The __consumer_offsets topic was replicated to two instances and our disks
> ran full due to a wrong configuration of the log cleaner. We fixed the
> configuration and updated from 0.10.1.1 to 0.10.2.1.
> Today I increased the replication of the __consumer_offsets topic to 3 and
> triggered replication to the third broker via kafka-reassign-partitions.sh.
> That went well, but I get many errors like
> {code}
> [2017-06-12 09:59:50,342] ERROR Found invalid messages during fetch for
> partition [__consumer_offsets,18] offset 0 error Record size is less than the
> minimum record overhead (14) (kafka.server.ReplicaFetcherThread)
> [2017-06-12 09:59:50,342] ERROR Found invalid messages during fetch for
> partition [__consumer_offsets,24] offset 0 error Record size is less than the
> minimum record overhead (14) (kafka.server.ReplicaFetcherThread)
> {code}
> I think these are due to the full-disk event.
> The log cleaner threads died on these corrupt messages:
> {code}
> [2017-06-12 09:59:50,722] ERROR [kafka-log-cleaner-thread-0], Error due to
> (kafka.log.LogCleaner)
> org.apache.kafka.common.errors.CorruptRecordException: Record size is less
> than the minimum record overhead (14)
> [2017-06-12 09:59:50,722] INFO [kafka-log-cleaner-thread-0], Stopped
> (kafka.log.LogCleaner)
> {code}
> Looking at the files I see that some are truncated and some are just empty:
> $ ls -lsh 00000000000000594653.log
> 0 -rw-r--r-- 1 user user 100M Jun 12 11:00 00000000000000594653.log
> Sadly I no longer have the logs from the disk-full event itself.
> I have three questions:
> * What is the best way to clean this up? Deleting the old log files and
> restarting the brokers?
> * Why did Kafka not handle the disk-full event well? Does this only affect
> the cleanup, or may we also lose data?
> * Is this maybe caused by the combination of the upgrade and the full disk?
> And last but not least: keep up the good work. Kafka is performing really
> well while being easy to administer, and it has good documentation!