[ https://issues.apache.org/jira/browse/KAFKA-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16427504#comment-16427504 ]
Ari Uka commented on KAFKA-6679:
--------------------------------

Similar issue: https://issues.apache.org/jira/browse/KAFKA-3240

> Random corruption (CRC validation issues)
> ------------------------------------------
>
>                 Key: KAFKA-6679
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6679
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer, replication
>    Affects Versions: 0.10.2.0, 1.0.1
>         Environment: FreeBSD 11.0-RELEASE-p8
>            Reporter: Ari Uka
>            Priority: Major
>
> I'm running into a really strange issue in production. I have 3 brokers, and consumers will randomly start to fail with an error message saying the CRC does not match. The brokers are all on 1.0.1 now; the issue started on 0.10.2, and I upgraded in the hope that it would fix things.
>
> On the Kafka side, I see errors related to this across all 3 brokers:
>
> ```
> [2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Error for partition topic-a-0 to broker 1:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
> [2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition topic-b-0, offset 23848795 (kafka.server.ReplicaManager)
> org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14).
> [2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition topic-b-0, offset 23848795 (kafka.server.ReplicaManager)
> org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14)
> [2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, fetcherId=0] Error for partition topic-c-2 to broker 2:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
> ```
>
> To fix this, I have to use the kafka-consumer-groups.sh command-line tool and do a binary search until I find a non-corrupt message, then push the offsets forward (see the sketches after this message). It's annoying because I can't actually push to a specific date: kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage ("CRC does not match").
>
> The error popped up again the next day after fixing it, though, so I'm trying to find the root cause.
>
> I'm using the Go consumer [https://github.com/Shopify/sarama] and [https://github.com/bsm/sarama-cluster].
>
> At first I thought it could be the consumer libraries, but the error happens with kafka-console-consumer.sh as well once a specific message is corrupted in Kafka. It shouldn't be possible for producers to push corrupt messages to Kafka and break every consumer, right? I'd assume Kafka rejects corrupt messages, so I'm not sure what's going on here.
>
> Should I just re-create the cluster? I don't think it's hardware failure across all 3 machines, though.
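One way to check whether the corruption in those broker logs is actually on disk (rather than introduced by a client) is to dump the affected log segment with the DumpLogSegments tool that ships with the broker. A minimal sketch; the log directory and segment file name below are illustrative examples, so substitute the segment covering the failing offset (23848795 on topic-b-0 in the logs above):

```
# Segment file names are the base offset, zero-padded: pick the
# largest segment whose base offset is <= the failing offset.
ls /var/kafka-logs/topic-b-0/

# Deep-iterate the segment so each record's checksum is re-read;
# run this on each of the 3 replicas and compare the output.
bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
  --files /var/kafka-logs/topic-b-0/00000000000023000000.log \
  --deep-iteration
```

If only one replica's copy of the segment fails validation, that points at a local disk or filesystem problem on that broker rather than a bad producer batch, which would be replicated everywhere.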
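As for the workaround itself: since the brokers are on 1.0.1, the manual binary search plus offset bump can be done with kafka-console-consumer.sh and kafka-consumer-groups.sh --reset-offsets (available since 0.11). A sketch, assuming a group named my-group, a broker at broker1:9092, and that offset 23848800 turned out to be the first readable offset past the corrupt batch; all three are made-up values for illustration:

```
# Probe a candidate offset: prints one record if it is readable,
# errors out on a corrupt one (binary-search by hand or in a script).
bin/kafka-console-consumer.sh --bootstrap-server broker1:9092 \
  --topic topic-b --partition 0 --offset 23848800 --max-messages 1

# Preview the offset reset without committing anything.
bin/kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
  --group my-group --topic topic-b:0 \
  --reset-offsets --to-offset 23848800 --dry-run

# Commit the new offset; the consumer group must be inactive.
bin/kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
  --group my-group --topic topic-b:0 \
  --reset-offsets --to-offset 23848800 --execute
```

This only skips the consumers past the bad batch; the corrupt records stay on disk, so replica fetchers can keep hitting them until the segment is deleted by retention.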