[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.
    [ https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15734770#comment-15734770 ]

Jan Omar commented on KAFKA-4477:
---------------------------------

Nothing interesting there in our case. We looked at the 30 minutes before and after the incident. Within that timeframe the worst numbers reported (3-node cluster, lowest reported value) are:

- NetworkProcessorAvgIdlePercent: 0.96
- RequestHandlerAvgIdlePercent: 0.74

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take
> leadership, cluster remains sick until node is restarted.
> -----------------------------------------------------------------------------
>
>                 Key: KAFKA-4477
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4477
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.1.0
>         Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>            Reporter: Michael Andre Pearce (IG)
>            Assignee: Apurva Mehta
>            Priority: Critical
>              Labels: reliability
>         Attachments: kafka.jstack
>
> We have encountered a critical issue that has re-occurred in different physical environments. We haven't worked out what is going on. We do, though, have a nasty workaround to keep the service alive.
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it owns down to itself; moments later we see other nodes having disconnects, followed finally by application issues, where producing to these partitions is blocked.
> It seems only restarting the Kafka Java process resolves the issue.
> We have had this occur multiple times, and from all network and machine monitoring the machine never left the network or had any other glitches.
> Below are logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was read
> All clients:
> java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing number of CLOSE_WAIT connections and open file descriptors.
> As a workaround to keep the service up, we are currently putting in an automated process that tails the logs and matches the regex below; whenever new_partitions is just the node itself, we restart that node.
> "\[(?P<timestamp>.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for partition \[.*\] from (?P<old_partitions>.+) to (?P<new_partitions>.+) \(kafka.cluster.Partition\)"

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
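For reference, the tail-and-restart watchdog described above could look roughly like the following. This is a minimal sketch, not the reporters' actual tooling: the log path, local broker id, and restart command are all assumptions, and a production version would follow the log file as it grows and rotates (tail -F style) rather than read it once. Gating on the new ISR being exactly the local broker mirrors the "new_partitions hit just itself" condition in the description.

{code}
import scala.io.Source
import scala.sys.process._

object IsrShrinkWatchdog {
  // Same log pattern as in the ticket, written as a Scala regex
  // with plain capture groups: timestamp, broker id, old ISR, new ISR.
  val ShrinkPattern =
    ("""\[(.+)\] INFO Partition \[.*\] on broker (\d+): Shrinking ISR """ +
     """for partition \[.*\] from (.+) to (.+) \(kafka.cluster.Partition\)""").r

  def main(args: Array[String]): Unit = {
    val localBrokerId = "7"  // assumption: id of the broker this runs on
    // Assumption: log location; a real watchdog would follow the file.
    for (line <- Source.fromFile("/var/log/kafka/server.log").getLines())
      line match {
        case ShrinkPattern(_, broker, _, newIsr)
            if broker == localBrokerId && newIsr.trim == localBrokerId =>
          // ISR shrank to just this node: bounce the broker process.
          "service kafka restart".!  // assumption: init-script name
        case _ => ()
      }
  }
}
{code}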
[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.
    [ https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15731461#comment-15731461 ]

Jan Omar commented on KAFKA-4477:
---------------------------------

We're seeing the same issue on FreeBSD 10.3, also on Kafka 0.10.1.0. Exact same stack traces as described by Michael. We've already seen this happen 2 or 3 times.

Regards

Jan

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (KAFKA-3240) Replication issues
     [ https://issues.apache.org/jira/browse/KAFKA-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Omar updated KAFKA-3240:
----------------------------
    Summary: Replication issues  (was: Replication issues on FreeBSD)

> Replication issues
> ------------------
>
>                 Key: KAFKA-3240
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3240
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.9.0.0, 0.8.2.2, 0.9.0.1
>         Environment: FreeBSD 10.2-RELEASE-p9
>            Reporter: Jan Omar
>
> Hi,
> We are trying to replace our 3-broker cluster running on 0.6 with a new cluster on 0.9.0.1 (we tried 0.8.2.2 and 0.9.0.0 as well).
> - 3 Kafka nodes with one ZooKeeper instance on each machine
> - FreeBSD 10.2 p9
> - Nagle off (sysctl net.inet.tcp.delayed_ack=0)
> - all Kafka machines write a ZFS ZIL to a dedicated SSD
> - 3 producers on 3 machines, writing to 1 topic with 3 partitions, replication factor 3
> - acks=all
> - 10 Gigabit Ethernet, all machines on one switch, ping 0.05 ms worst case
> While using ProducerPerformance or rdkafka_performance we are seeing very strange replication errors. Any hint on what's going on would be highly appreciated. Any suggestion on how to debug this properly would help as well.
> This is what our broker config looks like:
> {code}
> broker.id=5
> auto.create.topics.enable=false
> delete.topic.enable=true
> listeners=PLAINTEXT://:9092
> port=9092
> host.name=kafka-five.acc
> advertised.host.name=10.5.3.18
> zookeeper.connect=zookeeper-four.acc:2181,zookeeper-five.acc:2181,zookeeper-six.acc:2181
> zookeeper.connection.timeout.ms=6000
> num.replica.fetchers=1
> replica.fetch.max.bytes=1
> replica.fetch.wait.max.ms=500
> replica.high.watermark.checkpoint.interval.ms=5000
> replica.socket.timeout.ms=30
> replica.socket.receive.buffer.bytes=65536
> replica.lag.time.max.ms=1000
> min.insync.replicas=2
> controller.socket.timeout.ms=3
> controller.message.queue.size=100
> log.dirs=/var/db/kafka
> num.partitions=8
> message.max.bytes=1
> auto.create.topics.enable=false
> log.index.interval.bytes=4096
> log.index.size.max.bytes=10485760
> log.retention.hours=168
> log.flush.interval.ms=1
> log.flush.interval.messages=2
> log.flush.scheduler.interval.ms=2000
> log.roll.hours=168
> log.retention.check.interval.ms=30
> log.segment.bytes=536870912
> zookeeper.connection.timeout.ms=100
> zookeeper.sync.time.ms=5000
> num.io.threads=8
> num.network.threads=4
> socket.request.max.bytes=104857600
> socket.receive.buffer.bytes=1048576
> socket.send.buffer.bytes=1048576
> queued.max.requests=10
> fetch.purgatory.purge.interval.requests=100
> producer.purgatory.purge.interval.requests=100
> replica.lag.max.messages=1000
> {code}
> These are the errors we're seeing:
> {code:borderStyle=solid}
> ERROR [Replica Manager on Broker 5]: Error processing fetch operation on partition [test,0] offset 50727 (kafka.server.ReplicaManager)
> java.lang.IllegalStateException: Invalid message size: 0
>         at kafka.log.FileMessageSet.searchFor(FileMessageSet.scala:141)
>         at kafka.log.LogSegment.translateOffset(LogSegment.scala:105)
>         at kafka.log.LogSegment.read(LogSegment.scala:126)
>         at kafka.log.Log.read(Log.scala:506)
>         at kafka.server.ReplicaManager$$anonfun$readFromLocalLog$1.apply(ReplicaManager.scala:536)
>         at kafka.server.ReplicaManager$$anonfun$readFromLocalLog$1.apply(ReplicaManager.scala:507)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>         at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
>         at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
>         at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>         at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>         at kafka.server.ReplicaManager.readFromLocalLog(ReplicaManager.scala:507)
>         at kafka.server.ReplicaManager.fetchMessages(ReplicaManager.scala:462)
>         at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:431)
>         at kafka.server.KafkaApis.handle(KafkaApis.scala:69)
>         at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> and
> {code}
> ERROR Found invalid messages during fetch for partition [test,0] offset 2732 error Message found with corrupt size (0) in shallow iterator (kafka.server.ReplicaFetcherThread)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
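Given the "Invalid message size: 0" and "corrupt size (0)" errors above, one way to tell a corrupt on-disk segment apart from corruption introduced on the wire is to dump the segment the fetch trips over with Kafka's bundled DumpLogSegments tool. A sketch, assuming the reporter's log.dirs=/var/db/kafka and the [test,0] partition; the segment file name is a placeholder:

{code}
# Deep-iterate the segment and print each message; corrupt entries
# (e.g. size 0) surface as an error at a specific offset.
bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
  --deep-iteration --print-data-log \
  --files /var/db/kafka/test-0/00000000000000000000.log
{code}

If the dump is clean on the leader but the follower still reports corrupt messages, the problem is more likely in the replication path than in the filesystem.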
[jira] [Commented] (KAFKA-3240) Replication issues on FreeBSD
    [ https://issues.apache.org/jira/browse/KAFKA-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175423#comment-15175423 ]

Jan Omar commented on KAFKA-3240:
---------------------------------

To sum it up: the issue does NOT come up when we use 2 brokers with only 2 partitions for a given topic, without compression. Increasing the partition count (in our case to 30) results in the reported issue. The same goes for enabling lz4 compression on the producer side.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
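Since the comment above isolates two triggers (partition count and lz4 on the producer), the quickest A/B reproduction is to run the ProducerPerformance tool mentioned in the report once without and once with compression. A sketch, assuming a 0.9.x build where the tool lives at org.apache.kafka.tools.ProducerPerformance; topic, record counts, and bootstrap server are placeholders:

{code}
# Baseline: no compression
bin/kafka-run-class.sh org.apache.kafka.tools.ProducerPerformance \
  --topic test --num-records 1000000 --record-size 100 --throughput -1 \
  --producer-props bootstrap.servers=kafka-five.acc:9092 acks=all

# Trigger: identical run with lz4 enabled on the producer
bin/kafka-run-class.sh org.apache.kafka.tools.ProducerPerformance \
  --topic test --num-records 1000000 --record-size 100 --throughput -1 \
  --producer-props bootstrap.servers=kafka-five.acc:9092 acks=all compression.type=lz4
{code}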
[jira] [Commented] (KAFKA-3240) Replication issues on FreeBSD
    [ https://issues.apache.org/jira/browse/KAFKA-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15148529#comment-15148529 ]

Jan Omar commented on KAFKA-3240:
---------------------------------

Using UFS instead of ZFS works. But we'd really like to use ZFS instead. Any idea what might be causing this? The underlying filesystem shouldn't really matter, I think.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
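If the failure really is filesystem-dependent, the ZFS properties that change the write path relative to UFS (sync semantics, record size, compression) are worth capturing for the dataset backing log.dirs. A hedged sketch; the dataset name is a placeholder:

{code}
# Show the properties most likely to affect Kafka's write/mmap path
zfs get sync,recordsize,compression,atime tank/kafka
{code}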
[jira] [Created] (KAFKA-3240) Replication issues on FreeBSD
Jan Omar created KAFKA-3240:
-------------------------------

             Summary: Replication issues on FreeBSD
                 Key: KAFKA-3240
                 URL: https://issues.apache.org/jira/browse/KAFKA-3240
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 0.9.0.0, 0.8.2.2, 0.9.0.1
         Environment: FreeBSD 10.2-RELEASE-p9
            Reporter: Jan Omar

(Issue description, broker config, and stack traces as quoted in full above.)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)