[ https://issues.apache.org/jira/browse/KAFKA-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16427504#comment-16427504 ]
Ari Uka commented on KAFKA-6679:
--------------------------------

Similar issue: https://issues.apache.org/jira/browse/KAFKA-3240

> Random corruption (CRC validation issues)
> ------------------------------------------
>
>                 Key: KAFKA-6679
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6679
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer, replication
>    Affects Versions: 0.10.2.0, 1.0.1
>         Environment: FreeBSD 11.0-RELEASE-p8
>            Reporter: Ari Uka
>            Priority: Major
>
> I'm running into a really strange issue in production. I have 3 brokers, and consumers will randomly start to fail with an error message saying the CRC does not match. The brokers are all on 1.0.1 now; the issue started on 0.10.2, and I upgraded in the hope that it would fix things.
>
> On the Kafka side, I see errors related to this across all 3 brokers:
>
> ```
> [2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Error for partition topic-a-0 to broker 1:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
> [2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition topic-b-0, offset 23848795 (kafka.server.ReplicaManager)
> org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14).
> [2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition topic-b-0, offset 23848795 (kafka.server.ReplicaManager)
> org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14)
> [2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, fetcherId=0] Error for partition topic-c-2 to broker 2:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
> ```
>
> To fix this, I have to use the kafka-consumer-groups.sh command-line tool and do a binary search until I find a non-corrupt message, then push the offsets forward (see the sketches after this message). It's annoying because I can't actually push to a specific date: kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage ("CRC does not match").
>
> The error popped up again the next day after fixing it, though, so I'm trying to find the root cause.
>
> I'm using the Go consumer [https://github.com/Shopify/sarama] and [https://github.com/bsm/sarama-cluster].
>
> At first I thought it could be the consumer libraries, but the error happens with kafka-console-consumer.sh as well once a specific message is corrupted in Kafka. It shouldn't be possible for producers to push corrupt messages to Kafka and break every consumer, right? I'd assume Kafka rejects corrupt messages, so I'm not sure what's going on here.
>
> Should I just re-create the cluster? I don't think it's hardware failure across all 3 machines, though.
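One way to check whether the corruption in those broker logs is actually on disk (rather than introduced by a client) is to dump the affected log segment with the DumpLogSegments tool that ships with the broker. A minimal sketch; the log directory and segment file name below are illustrative examples, so substitute the segment covering the failing offset (23848795 on topic-b-0 in the logs above):

```
# Segment file names are the base offset, zero-padded: pick the
# largest segment whose base offset is <= the failing offset.
ls /var/kafka-logs/topic-b-0/

# Deep-iterate the segment so each record's checksum is re-read;
# run this on each of the 3 replicas and compare the output.
bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
  --files /var/kafka-logs/topic-b-0/00000000000023000000.log \
  --deep-iteration
```

If only one replica's copy of the segment fails validation, that points at a local disk or filesystem problem on that broker rather than a bad producer batch, which would be replicated everywhere.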
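As for the workaround itself: since the brokers are on 1.0.1, the manual binary search plus offset bump can be done with kafka-console-consumer.sh and kafka-consumer-groups.sh --reset-offsets (available since 0.11). A sketch, assuming a group named my-group, a broker at broker1:9092, and that offset 23848800 turned out to be the first readable offset past the corrupt batch; all three are made-up values for illustration:

```
# Probe a candidate offset: prints one record if it is readable,
# errors out on a corrupt one (binary-search by hand or in a script).
bin/kafka-console-consumer.sh --bootstrap-server broker1:9092 \
  --topic topic-b --partition 0 --offset 23848800 --max-messages 1

# Preview the offset reset without committing anything.
bin/kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
  --group my-group --topic topic-b:0 \
  --reset-offsets --to-offset 23848800 --dry-run

# Commit the new offset; the consumer group must be inactive.
bin/kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
  --group my-group --topic topic-b:0 \
  --reset-offsets --to-offset 23848800 --execute
```

This only skips the consumers past the bad batch; the corrupt records stay on disk, so replica fetchers can keep hitting them until the segment is deleted by retention.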