[jira] [Commented] (KAFKA-6679) Random corruption (CRC validation issues)

2018-04-05 Thread Ari Uka (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16427504#comment-16427504
 ] 

Ari Uka commented on KAFKA-6679:


Similar issue: https://issues.apache.org/jira/browse/KAFKA-3240

> Random corruption (CRC validation issues) 
> --
>
> Key: KAFKA-6679
> URL: https://issues.apache.org/jira/browse/KAFKA-6679
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer, replication
>Affects Versions: 0.10.2.0, 1.0.1
> Environment: FreeBSD 11.0-RELEASE-p8
>Reporter: Ari Uka
>Priority: Major
>
> I'm running into a really strange issue in production. I have 3 brokers, and 
> consumers will randomly start to fail with an error message saying the CRC 
> does not match. The brokers are all on 1.0.1, but the issue started on 0.10.2; 
> we upgraded in the hope that it would fix the issue.
> On the kafka side, I see errors related to this across all 3 brokers:
> ```
> [2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, 
> fetcherId=0] Error for partition topic-a-0 to broker 
> 1:org.apache.kafka.common.errors.CorruptRecordException: This message has 
> failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
> (kafka.server.ReplicaFetcherThread)
> [2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
> fetch operation on partition topic-b-0, offset 23848795 
> (kafka.server.ReplicaManager)
> org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
> than minimum record overhead (14).
> [2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
> fetch operation on partition topic-b-0, offset 23848795 
> (kafka.server.ReplicaManager)
> org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
> than minimum record overhead (14)
> [2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, 
> fetcherId=0] Error for partition topic-c-2 to broker 
> 2:org.apache.kafka.common.errors.CorruptRecordException: This message has 
> failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
> (kafka.server.ReplicaFetcherThread)
> ```
>  
> To fix this, I have to use the kafka-consumer-groups.sh command-line tool and 
> do a binary search until I can find a non-corrupt message and push the 
> offsets forward. It's annoying because I can't actually seek to a specific 
> date, because kafka-consumer-groups.sh starts to emit the same error, 
> ErrInvalidMessage, CRC does not match.
> The error popped up again the next day after fixing it, though, so I'm trying 
> to find the root cause. 
> I'm using the Go consumer [https://github.com/Shopify/sarama] and 
> [https://github.com/bsm/sarama-cluster]. 
> At first, I thought it could be the consumer libraries, but the error happens 
> with kafka-console-consumer.sh as well when a specific message is corrupted 
> in Kafka. I don't think it's possible for Kafka producers to actually push 
> corrupt messages to Kafka and then cause all consumers to break right? I 
> assume Kafka would reject corrupt messages, so I'm not sure what's going on 
> here.
> Should I just re-create the cluster? I don't think it's hardware failure 
> across all 3 machines, though.
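
For reference, the offset push described above does not have to be a manual
binary search: kafka-consumer-groups.sh (0.11 and later) can reset a group's
offset for a single partition directly. A minimal sketch, assuming a consumer
group named my-group and using the topic-b partition 0 offset from the broker
log above; every name and offset here is a placeholder:

{noformat}
# preview the reset first; without --execute the tool only prints the plan
kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
  --group my-group --topic topic-b:0 \
  --reset-offsets --to-offset 23848796 --dry-run

# apply it once the target offset clears the corrupt batch
# (the consumer group must be inactive while resetting)
kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
  --group my-group --topic topic-b:0 \
  --reset-offsets --to-offset 23848796 --execute
{noformat}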





[jira] [Commented] (KAFKA-3240) Replication issues

2018-04-05 Thread Ari Uka (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16427437#comment-16427437
 ] 

Ari Uka commented on KAFKA-3240:


[~ser...@akhmatov.ru], or anyone else in this thread, were you running ZFS? And 
if so, do you recall whether you were using ZFS compression on your zvol? (We 
seem to be using LZ4 on ours.)
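
In case it helps compare setups, the compression and checksum settings on the
dataset (or zvol) backing the Kafka log directory can be read straight from
ZFS. A rough sketch, assuming the logs live under a dataset named zroot/kafka
(substitute the real pool/dataset):

{noformat}
# which dataset backs log.dirs (on FreeBSD the Filesystem column is the dataset)
df /var/db/kafka

# compression algorithm, checksum algorithm, and achieved ratio for the dataset
zfs get compression,checksum,compressratio zroot/kafka
{noformat}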

> Replication issues
> --
>
> Key: KAFKA-3240
> URL: https://issues.apache.org/jira/browse/KAFKA-3240
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1
> Environment: FreeBSD 10.2-RELEASE-p9
>Reporter: Jan Omar
>Priority: Major
>  Labels: reliability
>
> Hi,
> We are trying to replace our 3-broker cluster running on 0.6 with a new 
> cluster on 0.9.0.1 (but tried 0.8.2.2 and 0.9.0.0 as well).
> - 3 kafka nodes with one zookeeper instance on each machine
> - FreeBSD 10.2 p9
> - Nagle off (sysctl net.inet.tcp.delayed_ack=0)
> - all kafka machines write a ZFS ZIL to a dedicated SSD
> - 3 producers on 3 machines, writing to 1 topic with 3 partitions, replication 
> factor 3
> - acks all
> - 10 Gigabit Ethernet, all machines on one switch, ping 0.05 ms worst case.
> While using the ProducerPerformance or rdkafka_performance we are seeing very 
> strange Replication errors. Any hint on what's going on would be highly 
> appreciated. Any suggestion on how to debug this properly would help as well.
> This is what our broker config looks like:
> {code}
> broker.id=5
> auto.create.topics.enable=false
> delete.topic.enable=true
> listeners=PLAINTEXT://:9092
> port=9092
> host.name=kafka-five.acc
> advertised.host.name=10.5.3.18
> zookeeper.connect=zookeeper-four.acc:2181,zookeeper-five.acc:2181,zookeeper-six.acc:2181
> zookeeper.connection.timeout.ms=6000
> num.replica.fetchers=1
> replica.fetch.max.bytes=1
> replica.fetch.wait.max.ms=500
> replica.high.watermark.checkpoint.interval.ms=5000
> replica.socket.timeout.ms=30
> replica.socket.receive.buffer.bytes=65536
> replica.lag.time.max.ms=1000
> min.insync.replicas=2
> controller.socket.timeout.ms=3
> controller.message.queue.size=100
> log.dirs=/var/db/kafka
> num.partitions=8
> message.max.bytes=1
> auto.create.topics.enable=false
> log.index.interval.bytes=4096
> log.index.size.max.bytes=10485760
> log.retention.hours=168
> log.flush.interval.ms=1
> log.flush.interval.messages=2
> log.flush.scheduler.interval.ms=2000
> log.roll.hours=168
> log.retention.check.interval.ms=30
> log.segment.bytes=536870912
> zookeeper.connection.timeout.ms=100
> zookeeper.sync.time.ms=5000
> num.io.threads=8
> num.network.threads=4
> socket.request.max.bytes=104857600
> socket.receive.buffer.bytes=1048576
> socket.send.buffer.bytes=1048576
> queued.max.requests=10
> fetch.purgatory.purge.interval.requests=100
> producer.purgatory.purge.interval.requests=100
> replica.lag.max.messages=1000
> {code}
> These are the errors we're seeing:
> {code:borderStyle=solid}
> ERROR [Replica Manager on Broker 5]: Error processing fetch operation on 
> partition [test,0] offset 50727 (kafka.server.ReplicaManager)
> java.lang.IllegalStateException: Invalid message size: 0
>   at kafka.log.FileMessageSet.searchFor(FileMessageSet.scala:141)
>   at kafka.log.LogSegment.translateOffset(LogSegment.scala:105)
>   at kafka.log.LogSegment.read(LogSegment.scala:126)
>   at kafka.log.Log.read(Log.scala:506)
>   at 
> kafka.server.ReplicaManager$$anonfun$readFromLocalLog$1.apply(ReplicaManager.scala:536)
>   at 
> kafka.server.ReplicaManager$$anonfun$readFromLocalLog$1.apply(ReplicaManager.scala:507)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> kafka.server.ReplicaManager.readFromLocalLog(ReplicaManager.scala:507)
>   at kafka.server.ReplicaManager.fetchMessages(ReplicaManager.scala:462)
>   at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:431)
>   at kafka.server.KafkaApis.handle(KafkaApis.scala:69)
>   at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> and 
> {code}
> ERROR Found invalid messages during fetch for partition [test,0] offset 2732 
> error Message found with corrupt size (0) in shallow iterator 
> (kafka.server.ReplicaFetcherThread)
> {code}




[jira] [Commented] (KAFKA-3240) Replication issues

2018-04-04 Thread Ari Uka (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425829#comment-16425829
 ] 

Ari Uka commented on KAFKA-3240:


[~ser...@akhmatov.ru] did you happen to use OpenJDK on your Linux distro, or 
did you go with Oracle?

 

 



[jira] [Commented] (KAFKA-3240) Replication issues

2018-04-03 Thread Ari Uka (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16424420#comment-16424420
 ] 

Ari Uka commented on KAFKA-3240:


I'm running into the same issue. 
https://issues.apache.org/jira/browse/KAFKA-6679

Restarting all machines seemed to fix the problem for us temporarily, but the 
issue came up again a week later.

We are running:
- [https://github.com/Shopify/sarama] for Producer and Consumer
- FreeBSD 11.0-RELEASE-p8
- Kafka 1.0.1 (we are going to attempt to upgrade to Kafka 1.1.0 and restart again)
- OpenJDK 1.8.0_121
- ZFS

This is also in Azure, so the instance is a VM. At this point, we're going to 
attempt to move to a Linux cluster instead of running a FreeBSD setup. We're 
going to introduce 3 new machines into the cluster that are running Linux and 
see if the problem appears on those instances.

May I ask what consumer/producer you are all using? Is anyone using the 
standard Java consumer/producer?



 

 


[jira] [Comment Edited] (KAFKA-6679) Random corruption (CRC validation issues)

2018-03-22 Thread Ari Uka (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16410079#comment-16410079
 ] 

Ari Uka edited comment on KAFKA-6679 at 3/22/18 6:36 PM:
-

So I attempted to re-generate one of the topics we had, and I've been able to 
find a corrupt message in the `.log` file.

There are 3 brokers (kafka-01, kafka-02, kafka-03) and the topic 
influxdb-telemetry, which contains 161,760,969 (161.7M) messages.

The topic description:
{noformat}
Topic:influxdb-telemetry PartitionCount:6 ReplicationFactor:3 Configs:
Topic: influxdb-telemetry Partition: 0 Leader: 2 Replicas: 2,1,3 Isr: 2
Topic: influxdb-telemetry Partition: 1 Leader: 3 Replicas: 3,2,1 Isr: 3
Topic: influxdb-telemetry Partition: 2 Leader: 1 Replicas: 1,3,2 Isr: 1
Topic: influxdb-telemetry Partition: 3 Leader: 2 Replicas: 2,3,1 Isr: 2
Topic: influxdb-telemetry Partition: 4 Leader: 3 Replicas: 3,1,2 Isr: 3
Topic: influxdb-telemetry Partition: 5 Leader: 1 Replicas: 1,2,3 Isr: 1
{noformat}
After inserting only 71,235 messages, influxdb-telemetry-0 becomes corrupt on 
kafka-02 and kafka-01 starts to complain:
{noformat}
[2018-03-22 17:00:00,690] ERROR [ReplicaManager broker=2] Error processing 
fetch operation on partition influxdb-telemetry-0, offset 71236 
(kafka.server.ReplicaManager) 
{noformat}
 

The last good written RecordSet looks like this:
{noformat}
baseOffset: 71233 lastOffset: 71235 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false position: 10285560 CreateTime: -1 isvalid: true size: 471 magic: 2 compresscodec: NONE crc: 491186814
{noformat}
The header of this RecordSet looks like so:

$ hd -s 10285560 -n 471 /var/db/kafka/influxdb-telemetry-0/.log
{noformat}
009ce30d  00 00 00 00 00 01 16 27  00 00 01 4b 00 00 00 00  |...'...K|
009ce31d  02 ee 03 fd 04 00 00 00  00 00 01 ff ff ff ff ff  ||
009ce32d  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  ||
009ce33d  ff ff ff ff ff ff ff ff  ff 00 00 00 02 a8 02 00  ||
009ce34d  00 00 01 9a 02 00 8b 09  89 54 4f d9 dd 82 b2 14  |.TO.|
{noformat}
If we're looking at the log files on `kafka-01` (which is trying to replicate), 
it hasn't replicated past message 71235 and it's just sitting there complaining.

If we go on the leader of this partition, `kafka-02`, and dump the next message 
after the good message, it's indeed corrupt; this is what it looks like:

$ hd -s 10286031 -n 471 /var/db/kafka/influxdb-telemetry-0/.log

 
{noformat}
009cf3cf  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ||
*
009cf59f
{noformat}
The asterisk (*) here means that every line is the same, so basically message 
71,236 contains 471 zeroes. I picked 471 arbitrarily, since I don't actually 
know how big the RecordSet for the next Record is.
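
As an aside on sizing: in the v2 log format the batch length is a big-endian
int32 at bytes 8-11 of the RecordBatch, right after the 8-byte base offset, and
the size reported by DumpLogSegments is that value plus 12. So when the next
header is intact, its size can be read rather than guessed; a rough sketch (the
segment file name is elided above, so the path is a placeholder):

{noformat}
# read the 4-byte batchLength field of the batch that starts at byte 10286031
hd -s $((10286031 + 8)) -n 4 /var/db/kafka/influxdb-telemetry-0/<segment>.log
{noformat}

Here that field falls inside the zeroed region anyway, which lines up with the
"Record size is smaller than minimum record overhead (14)" error.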

If we try to dump `influxdb-telemetry-0` on the leader (kafka-02), it barfs and 
ends early because of this exception:

 
{noformat}
$ ./kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration --files /var/db/kafka/influxdb-telemetry-0/.log | tail -n 5
Exception in thread "main" org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14).
{noformat}
So it only gets this far, when in reality the log is 1 GB and message 71,236 is 
only about 10 MB into the file. The last five records it does print are:

 
{noformat}
offset: 71231 position: 10283251 CreateTime: -1 isvalid: true keysize: -1 
valuesize: 130 magic: 2 compresscodec: NONE producerId: -1 producerEpoch: -1 
sequence: -1 isTransactional: false headerKeys: []
offset: 71232 position: 10283251 CreateTime: -1 isvalid: true keysize: -1 
valuesize: 125 magic: 2 compresscodec: NONE producerId: -1 producerEpoch: -1 
sequence: -1 isTransactional: false headerKeys: []
offset: 71233 position: 10285560 CreateTime: -1 isvalid: true keysize: -1 
valuesize: 130 magic: 2 compresscodec: NONE producerId: -1 producerEpoch: -1 
sequence: -1 isTransactional: false headerKeys: []
offset: 71234 position: 10285560 CreateTime: -1 isvalid: true keysize: -1 
valuesize: 126 magic: 2 compresscodec: NONE producerId: -1 producerEpoch: -1 
sequence: -1 isTransactional: false headerKeys: []
offset: 71235 position: 10285560 CreateTime: -1 isvalid: true keysize: -1 
valuesize: 127 magic: 2 compresscodec: NONE producerId: -1 producerEpoch: -1 
sequence: -1 isTransactional: false headerKeys: []
{noformat}
What could cause Kafka to write 471 zeroes where a RecordSet should be?
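
One cheap way to look for more regions like this, without knowing batch
boundaries, is to lean on hd's run-length collapsing: it prints a lone '*'
wherever consecutive 16-byte lines repeat, which in a log segment usually means
a run of identical filler such as zeroes. A rough sketch (segment name again a
placeholder):

{noformat}
# print each collapsed run together with the last distinct line before it and
# the address line after it (i.e. where the run ends)
hd /var/db/kafka/influxdb-telemetry-0/<segment>.log | grep -B1 -A1 '^\*$'
{noformat}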

 


[jira] [Commented] (KAFKA-6679) Random corruption (CRC validation issues)

2018-03-22 Thread Ari Uka (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409993#comment-16409993
 ] 

Ari Uka commented on KAFKA-6679:


One thing I'd also like to mention is that we are using FreeBSD with ZFS; I'm 
unsure if that's relevant here. The cluster has been pretty healthy for about a 
year straight, so I'm unsure whether that's the issue.

I did find someone else complaining about ZFS + FreeBSD 
[https://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%3cjira.12939477.1455623036000.60174.1455625758...@atlassian.jira%3E]
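
If the storage layer is a suspect, ZFS itself can rule out (or confirm) any
corruption it is able to see: a scrub re-reads every block and verifies it
against its checksum. A minimal sketch, assuming the pool is named zroot
(substitute the real pool name):

{noformat}
# start a full scrub of the pool backing the Kafka log dirs
zpool scrub zroot

# watch progress and look for CKSUM counts or "permanent errors" entries
zpool status -v zroot
{noformat}

A clean scrub would suggest the zeroed regions were handed to ZFS that way
(application, memory, or hypervisor path) rather than damaged at rest.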



[jira] [Commented] (KAFKA-6679) Random corruption (CRC validation issues)

2018-03-19 Thread Ari Uka (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405539#comment-16405539
 ] 

Ari Uka commented on KAFKA-6679:


We have been using [https://github.com/Shopify/sarama], specifically the 
SyncProducer and the AsyncProducer.

Can the producer really corrupt the Kafka broker, though? Does the broker just 
take whatever data it receives and write it into its log files?



[jira] [Comment Edited] (KAFKA-6679) Random corruption (CRC validation issues)

2018-03-19 Thread Ari Uka (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405478#comment-16405478
 ] 

Ari Uka edited comment on KAFKA-6679 at 3/19/18 10:12 PM:
--

So the records seem to be v2; in this case, there were 5 records. This is what 
the header looked like:

{noformat}
baseOffset: 17360912 lastOffset: 17360916 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 27 isTransactional: false position: 267886748 CreateTime: -1 isvalid: true size: 762 magic: 2 compresscodec: NONE crc: 3599616919
{noformat}

So I dumped this via `hd`; the hex dump of the header looks like this:

{noformat}
$ hd -s 267886748 -n 762 -C 16325357.log
0ff7a09c  00 00 00 00 01 08 e8 10  00 00 02 ee 00 00 00 1b  ||
0ff7a0ac  02 d6 8d cb 97 00 00 00  00 00 04 ff ff ff ff ff  ||
0ff7a0bc  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  ||
0ff7a0cc  ff ff ff ff ff ff ff ff  ff 00 00 00 05 8e 02 00  ||
0ff7a0dc  00 00 01 80 02 00 7e 09  c0 eb 7f 91 17 f7 ad 14  |..~.|
{noformat}
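
For anyone decoding the dump by hand, the leading bytes line up with the v2
RecordBatch header and agree with the DumpLogSegments line above (a quick
sanity check, not an authoritative reference):

{noformat}
00 00 00 00 01 08 e8 10   baseOffset           = 0x0108e810 = 17360912
00 00 02 ee               batchLength          = 0x2ee = 750 (750 + 12 = size 762)
00 00 00 1b               partitionLeaderEpoch = 27
02                        magic                = 2
d6 8d cb 97               crc                  = 0xd68dcb97 = 3599616919
{noformat}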
 

So the header looks okay? The error message was this:

{noformat}
[2018-03-19 19:21:32,445] ERROR Found invalid messages during fetch for partition topic-a-1 offset 17360912 error Record size is less than the minimum record overhead (14) (kafka.server.ReplicaFetcherThread)
{noformat}

 


was (Author: ari6123):
So the records seem to be v2, in this case, there were 5 records. This is what 
the header looked like:

`
 baseOffset: 17360912 lastOffset: 17360916 baseSequence: -1 lastSequence: -1 
producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 27 isTransactional: 
false position: 267886748 CreateTime: -1 isvalid: true size: 762 magic: 2 
compresscodec: NONE crc: 3599616919`

so I dumped this via `hd`, the hex dump of the header looks like this:

 `hd -s 267886748 -n 762 -C 16325357.log`

 
{noformat}
0ff7a09c 00 00 00 00 01 08 e8 10 00 00 02 ee 00 00 00 1b ||
0ff7a09c 00 00 00 00 01 08 e8 10 00 00 02 ee 00 00 00 1b ||
0ff7a0ac 02 d6 8d cb 97 00 00 00 00 00 04 ff ff ff ff ff ||
0ff7a0ac 02 d6 8d cb 97 00 00 00 00 00 04 ff ff ff ff ff ||
0ff7a0bc ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ||
0ff7a0bc ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ||
0ff7a0cc ff ff ff ff ff ff ff ff ff 00 00 00 05 8e 02 00 ||
0ff7a0cc ff ff ff ff ff ff ff ff ff 00 00 00 05 8e 02 00 ||
{noformat}
 

 

Is it normal for the CRC and magic portion to be duplicated like that? 

 

 





[jira] [Commented] (KAFKA-6679) Random corruption (CRC validation issues)

2018-03-19 Thread Ari Uka (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404999#comment-16404999
 ] 

Ari Uka commented on KAFKA-6679:


When I run `/usr/local/share/kafka_2.12-1.0.1/bin/kafka-run-class.sh 
kafka.tools.DumpLogSegments --files` on the leader of the partition, I get an 
exception and the dump seems to stop early. 

I wanted to dump some of the messages and check if they were corrupt, but the 
segments won't even dump properly.

`Exception in thread "main" org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14).`

What is this from?
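In case it helps anyone hitting the same wall: that message appears to come from the batch-framing check. Each entry in a segment file is framed as an 8-byte offset followed by a 4-byte size, and when the size read at some position is smaller than the 14-byte minimum record overhead the reader gives up with exactly this exception, which would explain why the dump stops early. A rough sketch (my own helper, not part of Kafka; the path is a placeholder) that walks a segment and reports the first implausible size field:

```
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

// Minimum record overhead that the size field is checked against.
const minRecordOverhead = 14

func main() {
	// Usage: scanlog <segment .log file>
	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		panic(err)
	}
	fileSize := info.Size()

	header := make([]byte, 12) // offset (int64) + size (int32) framing
	var pos int64
	for pos+12 <= fileSize {
		if _, err := f.ReadAt(header, pos); err != nil {
			panic(err)
		}
		offset := int64(binary.BigEndian.Uint64(header[0:8]))
		size := int32(binary.BigEndian.Uint32(header[8:12]))

		if size < minRecordOverhead || pos+12+int64(size) > fileSize {
			fmt.Printf("suspicious framing at position %d: offset=%d size=%d\n", pos, offset, size)
			return
		}
		pos += 12 + int64(size)
	}
	fmt.Println("no obviously malformed batch framing found")
}
```

The reported position should at least show which part of the segment to look at with `hd`.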



[jira] [Updated] (KAFKA-6679) Random corruption (CRC validation issues)

2018-03-18 Thread Ari Uka (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ari Uka updated KAFKA-6679:
---
Description: 
I'm running into a really strange issue on production. I have 3 brokers and 
randomly consumers will start to fail with an error message saying the CRC does 
not match. The brokers are all on 1.0.1, but the issue started on 0.10.2 with 
the hope that upgrading would help fix the issue.

On the kafka side, I see errors related to this across all 3 brokers:

```

[2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, 
fetcherId=0] Error for partition topic-a-0 to broker 
1:org.apache.kafka.common.errors.CorruptRecordException: This message has 
failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
(kafka.server.ReplicaFetcherThread)

[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
fetch operation on partition topic-b-0, offset 23848795 
(kafka.server.ReplicaManager)

org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
than minimum record overhead (14).

[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
fetch operation on partition topic-b-0, offset 23848795 
(kafka.server.ReplicaManager)

org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
than minimum record overhead (14)

[2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, 
fetcherId=0] Error for partition topic-c-2 to broker 
2:org.apache.kafka.common.errors.CorruptRecordException: This message has 
failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
(kafka.server.ReplicaFetcherThread)

```

 

To fix this, I have to use the kafka-consumer-groups.sh command line tool and 
do a binary search until I can find a non corrupt message and push the offsets 
forward. It's annoying because I can't actually push to a specific date because 
kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage, CRC 
does not match.

The error popped up again the next day after fixing it, though, so I'm trying to 
find the root cause. 

I'm using the Go consumer [https://github.com/Shopify/sarama] and 
[https://github.com/bsm/sarama-cluster]. 

At first, I thought it could be the consumer libraries, but the error happens 
with kafka-console-consumer.sh as well when a specific message is corrupted in 
Kafka. I don't think it's possible for Kafka producers to actually push corrupt 
messages to Kafka and then cause all consumers to break, right? I assume Kafka 
would reject corrupt messages, so I'm not sure what's going on here.

Should I just re-create the cluster? I don't think it's hardware failure across 
the 3 machines, though.

  was:
I'm running into a really strange issue on production. I have 3 brokers and 
randomly consumers will start to fail with an error message saying the CRC does 
not match. The brokers are all on 1.0.1, but the issue started on 0.10.2 with 
the hope that upgrading would help fix the issue.

On the kafka side, I see errors related to this across all 3 brokers:

```

[2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, 
fetcherId=0] Error for partition topic-a-0 to broker 
1:org.apache.kafka.common.errors.CorruptRecordException: This message has 
failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
(kafka.server.ReplicaFetcherThread)

[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
fetch operation on partition topic-b-0, offset 23848795 
(kafka.server.ReplicaManager)

org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
than minimum record overhead (14).

[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
fetch operation on partition topic-b-0, offset 23848795 
(kafka.server.ReplicaManager)

org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
than minimum record overhead (14)

[2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, 
fetcherId=0] Error for partition topic-c-2 to broker 
2:org.apache.kafka.common.errors.CorruptRecordException: This message has 
failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
(kafka.server.ReplicaFetcherThread)

```

 

To fix this, I have to use the kafka-consumer-groups.sh command line tool and 
do a binary search until I can find a non corrupt message and push the offsets 
forward. It's annoying because I can't actually push to a specific date because 
kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage, CRC 
does not match.

I'm using the Go consumer [https://github.com/Shopify/sarama] and 
[https://github.com/bsm/sarama-cluster]. 

At first, I thought it could be the consumer libraries, but the error happens 
with kafka-console-consumer.sh as well when a specific message is corrupted in 
Kafka. I don't think it's 

[jira] [Updated] (KAFKA-6679) Random corruption (CRC validation issues)

2018-03-18 Thread Ari Uka (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ari Uka updated KAFKA-6679:
---
Description: 
I'm running into a really strange issue on production. I have 3 brokers and 
randomly consumers will start to fail with an error message saying the CRC does 
not match. The brokers are all on 1.0.1, but the issue started on 0.10.2 with 
the hope that upgrading would help fix the issue.

On the kafka side, I see errors related to this across all 3 brokers:

```

[2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, 
fetcherId=0] Error for partition topic-a-0 to broker 
1:org.apache.kafka.common.errors.CorruptRecordException: This message has 
failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
(kafka.server.ReplicaFetcherThread)

[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
fetch operation on partition topic-b-0, offset 23848795 
(kafka.server.ReplicaManager)

org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
than minimum record overhead (14).

[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
fetch operation on partition topic-b-0, offset 23848795 
(kafka.server.ReplicaManager)

org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
than minimum record overhead (14)

[2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, 
fetcherId=0] Error for partition topic-c-2 to broker 
2:org.apache.kafka.common.errors.CorruptRecordException: This message has 
failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
(kafka.server.ReplicaFetcherThread)

```

 

To fix this, I have to use the kafka-consumer-groups.sh command line tool and 
do a binary search until I can find a non corrupt message and push the offsets 
forward. It's annoying because I can't actually push to a specific date because 
kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage, CRC 
does not match.

I'm using the Go consumer [https://github.com/Shopify/sarama] and 
[https://github.com/bsm/sarama-cluster]. 

At first, I thought it could be the consumer libraries, but the error happens 
with kafka-console-consumer.sh as well when a specific message is corrupted in 
Kafka. I don't think it's possible for Kafka producers to actually push corrupt 
messages to Kafka and then cause all consumers to break right? I assume Kafka 
would reject corrupt messages, so I'm not sure what's going on here.

  was:
I'm running into a really strange issue on production. I have 3 brokers and 
randomly consumers will start to fail with an error message saying the CRC does 
not match. The brokers are all on 1.0.1, but the issue started on 0.10.2 with 
the hope that upgrading would help fix the issue.

On the kafka side, I see errors related to this across all 3 brokers:

```

[2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, 
fetcherId=0] Error for partition topic-a-0 to broker 
1:org.apache.kafka.common.errors.CorruptRecordException: This message has 
failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
(kafka.server.ReplicaFetcherThread)

[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
fetch operation on partition topic-b-0, offset 23848795 
(kafka.server.ReplicaManager)

org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
than minimum record overhead (14).

[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
fetch operation on partition topic-b-0, offset 23848795 
(kafka.server.ReplicaManager)

org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
than minimum record overhead (14)

[2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, 
fetcherId=0] Error for partition topic-c-2 to broker 
2:org.apache.kafka.common.errors.CorruptRecordException: This message has 
failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
(kafka.server.ReplicaFetcherThread)

```

 

To fix this, I have to use the kafka-consumer-groups.sh command line tool and 
do a binary search until I can find a non corrupt message and push the offsets 
forward. It's annoying because I can't actually push to a specific date because 
kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage, CRC 
does not match.

I'm using the Go consumer [https://github.com/Shopify/sarama] and 
[https://github.com/bsm/sarama-cluster]



[jira] [Updated] (KAFKA-6679) Random corruption (CRC validation issues)

2018-03-18 Thread Ari Uka (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ari Uka updated KAFKA-6679:
---
Description: 
I'm running into a really strange issue on production. I have 3 brokers and 
randomly consumers will start to fail with an error message saying the CRC does 
not match. The brokers are all on 1.0.1, but the issue started on 0.10.2 with 
the hope that upgrading would help fix the issue.

On the kafka side, I see errors related to this across all 3 brokers:

```

[2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, 
fetcherId=0] Error for partition topic-a-0 to broker 
1:org.apache.kafka.common.errors.CorruptRecordException: This message has 
failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
(kafka.server.ReplicaFetcherThread)

[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
fetch operation on partition topic-b-0, offset 23848795 
(kafka.server.ReplicaManager)

org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
than minimum record overhead (14).

[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
fetch operation on partition topic-b-0, offset 23848795 
(kafka.server.ReplicaManager)

org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
than minimum record overhead (14)

[2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, 
fetcherId=0] Error for partition topic-c-2 to broker 
2:org.apache.kafka.common.errors.CorruptRecordException: This message has 
failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
(kafka.server.ReplicaFetcherThread)

```

 

To fix this, I have to use the kafka-consumer-groups.sh command line tool and 
do a binary search until I can find a non corrupt message and push the offsets 
forward. It's annoying because I can't actually push to a specific date because 
kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage, CRC 
does not match.

I'm using the Go consumer [https://github.com/Shopify/sarama] and 
[https://github.com/bsm/sarama-cluster]

  was:
I'm running into a really strange issue on production. I have 3 brokers and 
randomly consumers will start to fail with an error message saying the CRC does 
not match. The brokers are all on 1.0.1, but the issue started on 0.10.2 with 
the hope that upgrading would help fix the issue.

On the kafka side, I see errors related to this across all 3 brokers:

```

[2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, 
fetcherId=0] Error for partition topic-a-0 to broker 
1:org.apache.kafka.common.errors.CorruptRecordException: This message has 
failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
(kafka.server.ReplicaFetcherThread)

[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
fetch operation on partition topic-b-0, offset 23848795 
(kafka.server.ReplicaManager)

org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
than minimum record overhead (14).

[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
fetch operation on partition telemetry-b-0, offset 23848795 
(kafka.server.ReplicaManager)

org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
than minimum record overhead (14)

[2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, 
fetcherId=0] Error for partition topic-c-2 to broker 
2:org.apache.kafka.common.errors.CorruptRecordException: This message has 
failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
(kafka.server.ReplicaFetcherThread)

```

 

To fix this, I have to use the kafka-consumer-groups.sh command line tool and 
do a binary search until I can find a non corrupt message and push the offsets 
forward. It's annoying because I can't actually push to a specific date because 
kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage, CRC 
does not match.


I'm using the Go consumer [https://github.com/Shopify/sarama] and 
[https://github.com/bsm/sarama-cluster]



[jira] [Updated] (KAFKA-6679) Random corruption (CRC validation issues)

2018-03-18 Thread Ari Uka (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ari Uka updated KAFKA-6679:
---
Description: 
I'm running into a really strange issue on production. I have 3 brokers and 
randomly consumers will start to fail with an error message saying the CRC does 
not match. The brokers are all on 1.0.1, but the issue started on 0.10.2 with 
the hope that upgrading would help fix the issue.

On the kafka side, I see errors related to this across all 3 brokers:

```

[2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, 
fetcherId=0] Error for partition topic-a-0 to broker 
1:org.apache.kafka.common.errors.CorruptRecordException: This message has 
failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
(kafka.server.ReplicaFetcherThread)

[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
fetch operation on partition topic-b-0, offset 23848795 
(kafka.server.ReplicaManager)

org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
than minimum record overhead (14).

[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
fetch operation on partition telemetry-b-0, offset 23848795 
(kafka.server.ReplicaManager)

org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
than minimum record overhead (14)

[2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, 
fetcherId=0] Error for partition topic-c-2 to broker 
2:org.apache.kafka.common.errors.CorruptRecordException: This message has 
failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
(kafka.server.ReplicaFetcherThread)

```

 

To fix this, I have to use the kafka-consumer-groups.sh command line tool and 
do a binary search until I can find a non corrupt message and push the offsets 
forward. It's annoying because I can't actually push to a specific date because 
kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage, CRC 
does not match.


I'm using the Go consumer [https://github.com/Shopify/sarama] and 
[https://github.com/bsm/sarama-cluster]

  was:
I'm running into a really strange issue on production. I have 3 brokers and 
randomly consumers will start to fail with an error message saying the CRC does 
not match. The brokers are all on 1.0.1, but the issue started on 0.10.2 with 
the hope that upgrading would help fix the issue.

On the kafka side, I see errors related to this across all 3 brokers:
{noformat}
[2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, 
fetcherId=0] Error for partition topic-a-0 to broker 
1:org.apache.kafka.common.errors.CorruptRecordException: This message has 
failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
(kafka.server.ReplicaFetcherThread)
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
fetch operation on partition topic-b-0, offset 23848795 
(kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
than minimum record overhead (14).
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
fetch operation on partition telemetry-b-0, offset 23848795 
(kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
than minimum record overhead (14)
[2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, 
fetcherId=0] Error for partition topic-c-2 to broker 
2:org.apache.kafka.common.errors.CorruptRecordException: This message has 
failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
(kafka.server.ReplicaFetcherThread) 
{noformat}
 

To fix this, I have to use the kafka-consumer-groups.sh command line tool and 
do a binary search until I can find a non corrupt message and push the offsets 
forward. It's annoying because I can't actually push to a specific date because 
kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage, CRC 
does not match.

After pushing the offsets forward again, the issue came up again a few days 
later. I'm unsure of what to do here, there doesn't appear to be a tool to go 
through the logs and scan for corruption and fix it, has anyone ever run into 
this before?

I'm using the Go consumer [https://github.com/Shopify/sarama] and 
[https://github.com/bsm/sarama-cluster]. Is it even possible for Kafka 
producers to push corrupt messages to topics? I thought perhaps the consumer 
logic was broken in my libraries, but the CRC issue also shows up with 
kafka-console-consumer.sh and the other command line tools.



[jira] [Updated] (KAFKA-6679) Random corruption (CRC validation issues)

2018-03-18 Thread Ari Uka (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ari Uka updated KAFKA-6679:
---
Description: 
I'm running into a really strange issue on production. I have 3 brokers and 
randomly consumers will start to fail with an error message saying the CRC does 
not match. The brokers are all on 1.0.1, but the issue started on 0.10.2 with 
the hope that upgrading would help fix the issue.

On the kafka side, I see errors related to this across all 3 brokers:

```

[2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, 
fetcherId=0] Error for partition topic-a-0 to broker 
1:org.apache.kafka.common.errors.CorruptRecordException: This message has 
failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
(kafka.server.ReplicaFetcherThread)

[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
fetch operation on partition topic-b-0, offset 23848795 
(kafka.server.ReplicaManager)

org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
than minimum record overhead (14).

[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
fetch operation on partition telemetry-b-0, offset 23848795 
(kafka.server.ReplicaManager)

org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
than minimum record overhead (14)

[2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, 
fetcherId=0] Error for partition topic-c-2 to broker 
2:org.apache.kafka.common.errors.CorruptRecordException: This message has 
failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
(kafka.server.ReplicaFetcherThread)

```

 

To fix this, I have to use the kafka-consumer-groups.sh command line tool and 
do a binary search until I can find a non corrupt message and push the offsets 
forward. It's annoying because I can't actually push to a specific date because 
kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage, CRC 
does not match.

After pushing the offsets forward again, the issue came up again a few days 
later. I'm unsure of what to do here, there doesn't appear to be a tool to go 
through the logs and scan for corruption and fix it, has anyone ever run into 
this before?


I'm using the Go consumer [https://github.com/Shopify/sarama] and 
[https://github.com/bsm/sarama-cluster]. Is it even possible for Kafka 
producers to push corrupt messages to topics? I thought perhaps the consumer 
logic was broken in my libraries, but the CRC issue also shows up with 
kafka-console-consumer.sh and the other command line tools.


[jira] [Updated] (KAFKA-6679) Random corruption (CRC validation issues)

2018-03-18 Thread Ari Uka (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ari Uka updated KAFKA-6679:
---
Description: 
I'm running into a really strange issue on production. I have 3 brokers and 
randomly consumers will start to fail with an error message saying the CRC does 
not match. The brokers are all on 1.0.1, but the issue started on 0.10.2 with 
the hope that upgrading would help fix the issue.

On the kafka side, I see errors related to this across all 3 brokers:
{noformat}
[2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, 
fetcherId=0] Error for partition topic-a-0 to broker 
1:org.apache.kafka.common.errors.CorruptRecordException: This message has 
failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
(kafka.server.ReplicaFetcherThread)
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
fetch operation on partition topic-b-0, offset 23848795 
(kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
than minimum record overhead (14).
[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
fetch operation on partition telemetry-b-0, offset 23848795 
(kafka.server.ReplicaManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
than minimum record overhead (14)
[2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, 
fetcherId=0] Error for partition topic-c-2 to broker 
2:org.apache.kafka.common.errors.CorruptRecordException: This message has 
failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
(kafka.server.ReplicaFetcherThread) 
{noformat}
 

To fix this, I have to use the kafka-consumer-groups.sh command line tool and 
do a binary search until I can find a non corrupt message and push the offsets 
forward. It's annoying because I can't actually push to a specific date because 
kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage, CRC 
does not match.

After pushing the offsets forward again, the issue came up again a few days 
later. I'm unsure of what to do here, there doesn't appear to be a tool to go 
through the logs and scan for corruption and fix it, has anyone ever run into 
this before?

I'm using the Go consumer [https://github.com/Shopify/sarama] and 
[https://github.com/bsm/sarama-cluster]. Is it even possible for Kafka 
producers to push corrupt messages to topics? I thought perhaps the consumer 
logic was broken in my libraries, but the CRC issue also shows up with 
kafka-console-consumer.sh and the other command line tools.

  was:
I'm running into a really strange issue on production. I have 3 brokers and 
randomly consumers will start to fail with an error message saying the CRC does 
not match. The brokers are all on 1.0.1, but the issue started on 0.10.2 with 
the hope that upgrading would help fix the issue.

On the kafka side, I see errors related to this across all 3 brokers:

```

[2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, 
fetcherId=0] Error for partition topic-a-0 to broker 
1:org.apache.kafka.common.errors.CorruptRecordException: This message has 
failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
(kafka.server.ReplicaFetcherThread)

[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
fetch operation on partition topic-b-0, offset 23848795 
(kafka.server.ReplicaManager)

org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
than minimum record overhead (14).

[2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
fetch operation on partition telemetry-b-0, offset 23848795 
(kafka.server.ReplicaManager)

org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
than minimum record overhead (14)

[2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, 
fetcherId=0] Error for partition topic-c-2 to broker 
2:org.apache.kafka.common.errors.CorruptRecordException: This message has 
failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
(kafka.server.ReplicaFetcherThread)

```

 

To fix this, I have to use the kafka-consumer-groups.sh command line tool and 
do a binary search until I can find a non corrupt message and push the offsets 
forward. It's annoying because I can't actually push to a specific date because 
kafka-consumer-groups.sh starts to emit the same error, ErrInvalidMessage, CRC 
does not match.

After pushing the offsets forward again, the issue came up again a few days 
later. I'm unsure of what to do here, there doesn't appear to be a tool to go 
through the logs and scan for corruption and fix it, has anyone ever run into 
this before?


I'm using the Go consumer [https://github.com/Shopify/sarama] and 

[jira] [Created] (KAFKA-6679) Random corruption (CRC validation issues)

2018-03-18 Thread Ari Uka (JIRA)
Ari Uka created KAFKA-6679:
--

 Summary: Random corruption (CRC validation issues) 
 Key: KAFKA-6679
 URL: https://issues.apache.org/jira/browse/KAFKA-6679
 Project: Kafka
  Issue Type: Bug
  Components: consumer, replication
Affects Versions: 1.0.1, 0.10.2.0
 Environment: FreeBSD 11.0-RELEASE-p8
Reporter: Ari Uka






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)