[
https://issues.apache.org/jira/browse/KAFKA-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16047871#comment-16047871
]
Carsten Rietz commented on KAFKA-5431:
--------------------------------------
Thanks for the fast response. We did some more digging today and it seems
related to log.preallocate=true.
The log files that trip up the LogCleaner are not compacted even though they
were rolled. Here is an example, with 1706631 being the faulty segment.
{code}
[user@host ~]$ ls -lsh data/__consumer_offsets-26/*.log
328K -rw-r--r-- 1 jboss jboss 328K Jun 13 09:29 00000000000001701717.log
332K -rw-r--r-- 1 jboss jboss 330K Jun 13 09:29 00000000000001704168.log
32K -rw-r--r-- 1 jboss jboss 100M Jun 13 09:29 00000000000001706631.log
{code}
In the Kafka log we see the normal roll message:
{code}
[2017-06-13 09:29:09,345] INFO Rolled new log segment for
'__consumer_offsets-26' in 1 ms. (kafka.log.Log)
{code}
As I understand the code, this should not be possible :)
For now we worked around it by setting log.preallocate=false, deleting all
__consumer_offsets data on one broker and restarting it. Now everything seems
to run stably.
I will try to find another occurrence on our test environment and check for
corrupted records.
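One way to look for further occurrences is to compare each segment's apparent
size with the space actually allocated on disk, which is what the first column
of the `ls -lsh` listing above shows (32K allocated against a 100M file). The
following is a minimal sketch only, assuming a POSIX filesystem that reports
`st_blocks`; the function names and the factor-of-10 threshold are illustrative
assumptions and not part of Kafka.

```python
import os
from pathlib import Path


def is_untrimmed(apparent_bytes, allocated_bytes, min_ratio=10.0):
    """Heuristic: True if the file size far exceeds the space allocated on disk.

    min_ratio is an arbitrary, illustrative threshold.
    """
    return apparent_bytes > 0 and apparent_bytes > min_ratio * max(allocated_bytes, 1)


def find_untrimmed_segments(partition_dir, min_ratio=10.0):
    """List .log segments in one partition directory that look preallocated.

    Returns (file name, apparent size, allocated size) tuples. With
    log.preallocate=true the newest (active) segment is expected to show up
    here; older, already rolled segments are the suspicious ones.
    """
    suspects = []
    for segment in sorted(Path(partition_dir).glob("*.log")):
        st = os.stat(segment)
        # On POSIX systems st_blocks counts 512-byte units.
        allocated = st.st_blocks * 512
        if is_untrimmed(st.st_size, allocated, min_ratio):
            suspects.append((segment.name, st.st_size, allocated))
    return suspects


if __name__ == "__main__":
    # Path taken from the listing above; adjust to the broker's log directory.
    for name, size, allocated in find_untrimmed_segments("data/__consumer_offsets-26"):
        print(f"{name}: size={size} bytes, allocated={allocated} bytes")
```

This only flags candidates. Some filesystems report allocation lazily, so a
flagged segment still needs to be checked by hand.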
> LogCleaner stopped due to
> org.apache.kafka.common.errors.CorruptRecordException
> -------------------------------------------------------------------------------
>
> Key: KAFKA-5431
> URL: https://issues.apache.org/jira/browse/KAFKA-5431
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 0.10.2.1
> Reporter: Carsten Rietz
> Labels: reliability
>
> Hey all,
> I have a strange problem with our UAT cluster of 3 Kafka brokers.
> The __consumer_offsets topic was replicated to two instances and our disks
> ran full due to a wrong configuration of the log cleaner. We fixed the
> configuration and updated from 0.10.1.1 to 0.10.2.1.
> Today I increased the replication of the __consumer_offsets topic to 3 and
> triggered replication to the third cluster via kafka-reassign-partitions.sh.
> That went well, but I get many errors like
> {code}
> [2017-06-12 09:59:50,342] ERROR Found invalid messages during fetch for
> partition [__consumer_offsets,18] offset 0 error Record size is less than the
> minimum record overhead (14) (kafka.server.ReplicaFetcherThread)
> [2017-06-12 09:59:50,342] ERROR Found invalid messages during fetch for
> partition [__consumer_offsets,24] offset 0 error Record size is less than the
> minimum record overhead (14) (kafka.server.ReplicaFetcherThread)
> {code}
> I think these are due to the disk-full event.
> The log cleaner threads died on these invalid messages:
> {code}
> [2017-06-12 09:59:50,722] ERROR [kafka-log-cleaner-thread-0], Error due to
> (kafka.log.LogCleaner)
> org.apache.kafka.common.errors.CorruptRecordException: Record size is less
> than the minimum record overhead (14)
> [2017-06-12 09:59:50,722] INFO [kafka-log-cleaner-thread-0], Stopped
> (kafka.log.LogCleaner)
> {code}
> Looking at the files, I see that some are truncated and some are just empty:
> $ ls -lsh 00000000000000594653.log
> 0 -rw-r--r-- 1 user user 100M Jun 12 11:00 00000000000000594653.log
> Sadly I no longer have the logs from the disk-full event itself.
> I have three questions:
> * What is the best way to clean this up? Deleting the old log files and
> restarting the brokers?
> * Why did Kafka not handle the disk-full event well? Does this only affect
> the cleanup, or may we also lose data?
> * Is this maybe caused by the combination of upgrade and disk full?
> And last but not least: Keep up the good work. Kafka is really performing
> well while being easy to administer and has good documentation!