[jira] [Updated] (KAFKA-6679) Random corruption (CRC validation issues)
[ https://issues.apache.org/jira/browse/KAFKA-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ari Uka updated KAFKA-6679:
---
Description:

I'm running into a really strange issue on production. I have 3 brokers, and consumers will randomly start to fail with an error message saying the CRC does not match. The brokers are all on 1.0.1 now; the issue started on 0.10.2, and I upgraded in the hope that it would fix the issue.
On the Kafka side, I see errors related to this across all 3 brokers:
```
[2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Error for partition topic-a-0 to broker 1:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition topic-b-0, offset 23848795 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14).
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition topic-b-0, offset 23848795 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14)
[2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, fetcherId=0] Error for partition topic-c-2 to broker 2:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
```

To fix this, I have to use the kafka-consumer-groups.sh command line tool and do a binary search until I can find a non-corrupt message and push the offsets forward.
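The manual binary search described here could be sketched roughly as follows. This is only an illustration under assumptions that are not in the report: the corrupt records are assumed to form one contiguous range starting at the stuck offset, the function name `first_readable_offset` and the `is_readable` probe are hypothetical, and the probe is simulated with a made-up corrupt range rather than a real fetch against a broker. Only the starting offset 23848795 comes from the log above.

```python
from typing import Callable


def first_readable_offset(bad: int, good: int,
                          is_readable: Callable[[int], bool]) -> int:
    """Return the smallest readable offset in (bad, good].

    Assumes `bad` is known to be unreadable, `good` is known to be readable,
    and every offset after the first readable one is readable too
    (i.e. the corruption is a single contiguous range).
    """
    if bad >= good:
        raise ValueError("bad offset must be below the known-good offset")
    # Invariant: `bad` is unreadable, `good` is readable.
    while good - bad > 1:
        mid = (bad + good) // 2
        if is_readable(mid):
            good = mid   # boundary is at or before mid
        else:
            bad = mid    # boundary is after mid
    return good


# Simulated probe (hypothetical): pretend offsets 23848795..23848899 are corrupt.
# A real probe would try to fetch one record at the offset and report
# whether the fetch failed with a CRC error.
CORRUPT = range(23848795, 23848900)


def simulated_probe(offset: int) -> bool:
    return offset not in CORRUPT


target = first_readable_offset(23848795, 23850000, simulated_probe)
print(target)
```

The returned offset is the one the consumer group would then be pushed forward to. If the corruption is scattered rather than contiguous, the monotonicity assumption breaks and a linear scan of the range would be needed instead.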
It's annoying because I can't actually push to a specific date: kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage, CRC does not match. The error popped up again the next day after fixing it, though, so I'm trying to find the root cause.

I'm using the Go consumer [https://github.com/Shopify/sarama] and [https://github.com/bsm/sarama-cluster]. At first, I thought it could be the consumer libraries, but the error happens with kafka-console-consumer.sh as well when a specific message is corrupted in Kafka. I don't think it's possible for Kafka producers to actually push corrupt messages to Kafka and then cause all consumers to break, right? I assume Kafka would reject corrupt messages, so I'm not sure what's going on here. Should I just re-create the cluster? I don't think it's a hardware failure across the 3 machines, though.

was:

I'm running into a really strange issue on production. I have 3 brokers, and consumers will randomly start to fail with an error message saying the CRC does not match. The brokers are all on 1.0.1 now; the issue started on 0.10.2, and I upgraded in the hope that it would fix the issue.
On the Kafka side, I see errors related to this across all 3 brokers:
```
[2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Error for partition topic-a-0 to broker 1:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition topic-b-0, offset 23848795 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14).
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition topic-b-0, offset 23848795 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14)
[2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, fetcherId=0] Error for partition topic-c-2 to broker 2:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
```

To fix this, I have to use the kafka-consumer-groups.sh command line tool and do a binary search until I can find a non-corrupt message and push the offsets forward. It's annoying because I can't actually push to a specific date: kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage, CRC does not match.

I'm using the Go consumer [https://github.com/Shopify/sarama] and [https://github.com/bsm/sarama-cluster]. At first, I thought it could be the consumer libraries, but the error happens with kafka-console-consumer.sh as well when a specific message is corrupted in Kafka. I don't think it's p
[jira] [Updated] (KAFKA-6679) Random corruption (CRC validation issues)
[ https://issues.apache.org/jira/browse/KAFKA-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ari Uka updated KAFKA-6679:
---
Description:

I'm running into a really strange issue on production. I have 3 brokers, and consumers will randomly start to fail with an error message saying the CRC does not match. The brokers are all on 1.0.1 now; the issue started on 0.10.2, and I upgraded in the hope that it would fix the issue.
On the Kafka side, I see errors related to this across all 3 brokers:
```
[2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Error for partition topic-a-0 to broker 1:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition topic-b-0, offset 23848795 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14).
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition topic-b-0, offset 23848795 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14)
[2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, fetcherId=0] Error for partition topic-c-2 to broker 2:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
```

To fix this, I have to use the kafka-consumer-groups.sh command line tool and do a binary search until I can find a non-corrupt message and push the offsets forward. It's annoying because I can't actually push to a specific date: kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage, CRC does not match.

I'm using the Go consumer [https://github.com/Shopify/sarama] and [https://github.com/bsm/sarama-cluster]. At first, I thought it could be the consumer libraries, but the error happens with kafka-console-consumer.sh as well when a specific message is corrupted in Kafka. I don't think it's possible for Kafka producers to actually push corrupt messages to Kafka and then cause all consumers to break, right? I assume Kafka would reject corrupt messages, so I'm not sure what's going on here.

was:

I'm running into a really strange issue on production. I have 3 brokers, and consumers will randomly start to fail with an error message saying the CRC does not match. The brokers are all on 1.0.1 now; the issue started on 0.10.2, and I upgraded in the hope that it would fix the issue.
On the Kafka side, I see errors related to this across all 3 brokers:
```
[2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Error for partition topic-a-0 to broker 1:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition topic-b-0, offset 23848795 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14).
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition topic-b-0, offset 23848795 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14)
[2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, fetcherId=0] Error for partition topic-c-2 to broker 2:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
```

To fix this, I have to use the kafka-consumer-groups.sh command line tool and do a binary search until I can find a non-corrupt message and push the offsets forward. It's annoying because I can't actually push to a specific date: kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage, CRC does not match.

I'm using the Go consumer [https://github.com/Shopify/sarama] and [https://github.com/bsm/sarama-cluster].

> Random corruption (CRC validation issues)
> --
>
> Key: KAFKA-6679
> URL: https://issues.apache.org/jira/browse/KAFKA-6679
> Project: Kafka
> Issue Type: Bug
> Components: consumer, replication
> Affects Versions: 0.10.2.0, 1.0.1
> Environment: FreeBSD 11.0-RELEASE-p8
>
[jira] [Updated] (KAFKA-6679) Random corruption (CRC validation issues)
[ https://issues.apache.org/jira/browse/KAFKA-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ari Uka updated KAFKA-6679:
---
Description:

I'm running into a really strange issue on production. I have 3 brokers, and consumers will randomly start to fail with an error message saying the CRC does not match. The brokers are all on 1.0.1 now; the issue started on 0.10.2, and I upgraded in the hope that it would fix the issue.
On the Kafka side, I see errors related to this across all 3 brokers:
```
[2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Error for partition topic-a-0 to broker 1:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition topic-b-0, offset 23848795 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14).
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition topic-b-0, offset 23848795 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14)
[2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, fetcherId=0] Error for partition topic-c-2 to broker 2:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
```

To fix this, I have to use the kafka-consumer-groups.sh command line tool and do a binary search until I can find a non-corrupt message and push the offsets forward. It's annoying because I can't actually push to a specific date: kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage, CRC does not match.

I'm using the Go consumer [https://github.com/Shopify/sarama] and [https://github.com/bsm/sarama-cluster].

was:

I'm running into a really strange issue on production. I have 3 brokers, and consumers will randomly start to fail with an error message saying the CRC does not match. The brokers are all on 1.0.1 now; the issue started on 0.10.2, and I upgraded in the hope that it would fix the issue.
On the Kafka side, I see errors related to this across all 3 brokers:
```
[2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Error for partition topic-a-0 to broker 1:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition topic-b-0, offset 23848795 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14).
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition telemetry-b-0, offset 23848795 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14)
[2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, fetcherId=0] Error for partition topic-c-2 to broker 2:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
```

To fix this, I have to use the kafka-consumer-groups.sh command line tool and do a binary search until I can find a non-corrupt message and push the offsets forward. It's annoying because I can't actually push to a specific date: kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage, CRC does not match.

I'm using the Go consumer [https://github.com/Shopify/sarama] and [https://github.com/bsm/sarama-cluster].

> Random corruption (CRC validation issues)
> --
>
> Key: KAFKA-6679
> URL: https://issues.apache.org/jira/browse/KAFKA-6679
> Project: Kafka
> Issue Type: Bug
> Components: consumer, replication
> Affects Versions: 0.10.2.0, 1.0.1
> Environment: FreeBSD 11.0-RELEASE-p8
> Reporter: Ari Uka
> Priority: Major
>
> I'm running into a really strange issue on production. I have 3 brokers, and
> consumers will randomly start to fail with an error message saying the CRC
> does not match. The brokers are all on 1.0.1 now; the issue started on 0.10.2,
> and I upgraded in the hope that it would fix the issue.
> On the Kafka side, I see errors
[jira] [Updated] (KAFKA-6679) Random corruption (CRC validation issues)
[ https://issues.apache.org/jira/browse/KAFKA-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ari Uka updated KAFKA-6679:
---
Description:

I'm running into a really strange issue on production. I have 3 brokers, and consumers will randomly start to fail with an error message saying the CRC does not match. The brokers are all on 1.0.1 now; the issue started on 0.10.2, and I upgraded in the hope that it would fix the issue.
On the Kafka side, I see errors related to this across all 3 brokers:
```
[2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Error for partition topic-a-0 to broker 1:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition topic-b-0, offset 23848795 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14).
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition telemetry-b-0, offset 23848795 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14)
[2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, fetcherId=0] Error for partition topic-c-2 to broker 2:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
```

To fix this, I have to use the kafka-consumer-groups.sh command line tool and do a binary search until I can find a non-corrupt message and push the offsets forward. It's annoying because I can't actually push to a specific date: kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage, CRC does not match.

I'm using the Go consumer [https://github.com/Shopify/sarama] and [https://github.com/bsm/sarama-cluster].

was:

I'm running into a really strange issue on production. I have 3 brokers, and consumers will randomly start to fail with an error message saying the CRC does not match. The brokers are all on 1.0.1 now; the issue started on 0.10.2, and I upgraded in the hope that it would fix the issue.
On the Kafka side, I see errors related to this across all 3 brokers:
{noformat}
[2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Error for partition topic-a-0 to broker 1:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition topic-b-0, offset 23848795 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14).
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition telemetry-b-0, offset 23848795 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14)
[2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, fetcherId=0] Error for partition topic-c-2 to broker 2:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
{noformat}

To fix this, I have to use the kafka-consumer-groups.sh command line tool and do a binary search until I can find a non-corrupt message and push the offsets forward. It's annoying because I can't actually push to a specific date: kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage, CRC does not match.

After pushing the offsets forward again, the issue came up again a few days later. I'm unsure of what to do here. There doesn't appear to be a tool to go through the logs, scan for corruption, and fix it. Has anyone ever run into this before?

I'm using the Go consumer [https://github.com/Shopify/sarama] and [https://github.com/bsm/sarama-cluster]. Is it even possible for Kafka producers to push corrupt messages to topics? I thought perhaps the consumer logic was broken in my libraries, but the CRC issue also happens with kafka-console-consumer.sh and other command line tools.

> Random corruption (CRC validation issues)
> --
>
> Key: KAFKA-6679
> URL: https://issues.apache.org/jira/browse/KAFKA-6679
> Project: Kafka
> Issue Ty
[jira] [Updated] (KAFKA-6679) Random corruption (CRC validation issues)
[ https://issues.apache.org/jira/browse/KAFKA-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ari Uka updated KAFKA-6679:
---
Description:

I'm running into a really strange issue on production. I have 3 brokers, and consumers will randomly start to fail with an error message saying the CRC does not match. The brokers are all on 1.0.1 now; the issue started on 0.10.2, and I upgraded in the hope that it would fix the issue.
On the Kafka side, I see errors related to this across all 3 brokers:
```
[2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Error for partition topic-a-0 to broker 1:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition topic-b-0, offset 23848795 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14).
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition telemetry-b-0, offset 23848795 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14)
[2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, fetcherId=0] Error for partition topic-c-2 to broker 2:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
```

To fix this, I have to use the kafka-consumer-groups.sh command line tool and do a binary search until I can find a non-corrupt message and push the offsets forward. It's annoying because I can't actually push to a specific date: kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage, CRC does not match.

After pushing the offsets forward again, the issue came up again a few days later. I'm unsure of what to do here. There doesn't appear to be a tool to go through the logs, scan for corruption, and fix it. Has anyone ever run into this before?

I'm using the Go consumer [https://github.com/Shopify/sarama] and [https://github.com/bsm/sarama-cluster]. Is it even possible for Kafka producers to push corrupt messages to topics? I thought perhaps the consumer logic was broken in my libraries, but the CRC issue also happens with kafka-console-consumer.sh and other command line tools.

> Random corruption (CRC validation issues)
> --
>
> Key: KAFKA-6679
> URL: https://issues.apache.org/jira/browse/KAFKA-6679
> Project: Kafka
> Issue Type: Bug
> Components: consumer, replication
> Affects Versions: 0.10.2.0, 1.0.1
> Environment: FreeBSD 11.0-RELEASE-p8
> Reporter: Ari Uka
> Priority: Major
>
> I'm running into a really strange issue on production. I have 3 brokers, and
> consumers will randomly start to fail with an error message saying the CRC
> does not match. The brokers are all on 1.0.1 now; the issue started on 0.10.2,
> and I upgraded in the hope that it would fix the issue.
> On the Kafka side, I see errors related to this across all 3 brokers:
> ```
> [2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1,
> fetcherId=0] Error for partition topic-a-0 to broker
> 1:org.apache.kafka.common.errors.CorruptRecordException: This message has
> failed its CRC checksum, exceeds the valid size, or is otherwise corrupt.
> (kafka.server.ReplicaFetcherThread)
> [2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing
> fetch operation on partition topic-b-0, offset 23848795
> (kafka.server.ReplicaManager)
> org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller
> than minimum record overhead (14).
> [2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing
> fetch operation on partition telemetry-b-0, offset 23848795
> (kafka.server.ReplicaManager)
> org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller
> than minimum record overhead (14)
> [2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2,
> fetcherId=0] Error for partition topic-c-2 to broker
> 2:org.apache.kafka.common.errors.CorruptRecordException: This message has
> failed its CRC checksum, exceeds the valid size, or is otherwise corrupt.
> (kafka.server.ReplicaFetcherThread)
> ```
>
> To fix this, I have to use the kafka-consumer-groups.sh command line tool and
> do a binary search until I can find a non-corrupt message and push the
> offsets forward. It's annoying because I can't actually
[jira] [Updated] (KAFKA-6679) Random corruption (CRC validation issues)
[ https://issues.apache.org/jira/browse/KAFKA-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ari Uka updated KAFKA-6679:
---
Description:

I'm running into a really strange issue on production. I have 3 brokers, and consumers will randomly start to fail with an error message saying the CRC does not match. The brokers are all on 1.0.1 now; the issue started on 0.10.2, and I upgraded in the hope that it would fix the issue.
On the Kafka side, I see errors related to this across all 3 brokers:
{noformat}
[2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Error for partition topic-a-0 to broker 1:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition topic-b-0, offset 23848795 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14).
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition telemetry-b-0, offset 23848795 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14)
[2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, fetcherId=0] Error for partition topic-c-2 to broker 2:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
{noformat}

To fix this, I have to use the kafka-consumer-groups.sh command line tool and do a binary search until I can find a non-corrupt message and push the offsets forward. It's annoying because I can't actually push to a specific date: kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage, CRC does not match.

After pushing the offsets forward again, the issue came up again a few days later. I'm unsure of what to do here. There doesn't appear to be a tool to go through the logs, scan for corruption, and fix it. Has anyone ever run into this before?

I'm using the Go consumer [https://github.com/Shopify/sarama] and [https://github.com/bsm/sarama-cluster]. Is it even possible for Kafka producers to push corrupt messages to topics? I thought perhaps the consumer logic was broken in my libraries, but the CRC issue also happens with kafka-console-consumer.sh and other command line tools.

was:

I'm running into a really strange issue on production. I have 3 brokers, and consumers will randomly start to fail with an error message saying the CRC does not match. The brokers are all on 1.0.1 now; the issue started on 0.10.2, and I upgraded in the hope that it would fix the issue.
On the Kafka side, I see errors related to this across all 3 brokers:
```
[2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Error for partition topic-a-0 to broker 1:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition topic-b-0, offset 23848795 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14).
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing fetch operation on partition telemetry-b-0, offset 23848795 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14)
[2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, fetcherId=0] Error for partition topic-c-2 to broker 2:org.apache.kafka.common.errors.CorruptRecordException: This message has failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. (kafka.server.ReplicaFetcherThread)
```

To fix this, I have to use the kafka-consumer-groups.sh command line tool and do a binary search until I can find a non-corrupt message and push the offsets forward. It's annoying because I can't actually push to a specific date: kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage, CRC does not match.

After pushing the offsets forward again, the issue came up again a few days later. I'm unsure of what to do here. There doesn't appear to be a tool to go through the logs, scan for corruption, and fix it. Has anyone ever run into this before?

I'm using the Go consumer [https://github.com/Shopify/sarama] and [https://github.com/bsm/sarama-cluster].