Turns out it was Burrow sending an invalid request. After adding extra debugging information to Kafka, the source of the negative max bytes was apparent. So maybe it was just Burrow not being tolerant of the brokers being upgraded behind its back. It's running fine again after a restart. It's an older version of Burrow, and I see a nice shiny big release is out, so we'll upgrade to that too.
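For what it's worth, the bad max_bytes value in the request dump is exactly the minimum of a signed 32-bit integer, which smells like an uninitialized or overflowed int32 field on Burrow's side rather than a random corruption. A quick sanity check (my own sketch, not Burrow or Kafka code):

```python
# The bad max_bytes value Kafka parsed out of Burrow's fetch request:
bad_max_bytes = -2147483648

# Minimum value representable by a signed 32-bit integer (two's complement):
INT32_MIN = -(2 ** 31)

# They match exactly, which points at an unset/overflowed int32 field
# rather than arbitrary garbage on the wire.
print(bad_max_bytes == INT32_MIN)
```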
Here's a dump of the request it sent, after Kafka parsed it:

{ replica_id=-1 max_wait_time=250 min_bytes=1 topics=[
  { topic=__consumer_offsets partitions=[
    {partition=15 fetch_offset=2512181 max_bytes=32768}
    {partition=40 fetch_offset=31502577 max_bytes=32768}
    {partition=25 fetch_offset=618438090 max_bytes=32768}
    {partition=45 fetch_offset=4234585 max_bytes=32768}
    {partition=35 fetch_offset=81494695 max_bytes=32768}
    {partition=20 fetch_offset=14651661 max_bytes=32768}
    {partition=10 fetch_offset=273654177 max_bytes=32768}
    {partition=0 fetch_offset=32357901 max_bytes=-2147483648}
    {partition=30 fetch_offset=155292581 max_bytes=32768}
    {partition=5 fetch_offset=140017282 max_bytes=32768}
  ] }
] }

Anyway, easily fixed with restarts of Burrow, once it was clear which consumer was having the issue.

On Fri, Dec 15, 2017 at 5:04 PM, Brett Rann <br...@zendesk.com> wrote:

> Another interesting datapoint:
>
> Taking a deeper look at partition 21:
>
> brann@kafka1:/data/kafka/logs/__consumer_offsets-21$ ls -la
> total 20176
> drwxr-xr-x    2 kafka kafka     4096 Dec 15 08:11 .
> drwxr-xr-x 1605 kafka kafka   167936 Dec 15 08:31 ..
> -rw-r--r--    1 kafka kafka        0 Dec 15 08:03 00000000000000000000.index
> -rw-r--r--    1 kafka kafka    20924 Dec 15 04:08 00000000000000000000.log
> -rw-r--r--    1 kafka kafka       12 Dec 15 08:03 00000000000000000000.timeindex
> -rw-r--r--    1 kafka kafka        0 Dec 15 08:03 00000000000200248390.index
> -rw-r--r--    1 kafka kafka     3662 Dec 15 07:53 00000000000200248390.log
> -rw-r--r--    1 kafka kafka       10 Dec 15 04:08 00000000000200248390.snapshot
> -rw-r--r--    1 kafka kafka       12 Dec 15 08:03 00000000000200248390.timeindex
> -rw-r--r--    1 kafka kafka       10 Dec 15 04:36 00000000000200349263.snapshot
> -rw-r--r--    1 kafka kafka       10 Dec 15 06:01 00000000000200650903.snapshot
> -rw-r--r--    1 kafka kafka 10485760 Dec 15 08:31 00000000000201167309.index
> -rw-r--r--    1 kafka kafka 15874568 Dec 15 08:31 00000000000201167309.log
> -rw-r--r--    1 kafka kafka       10 Dec 15 07:53 00000000000201167309.snapshot
> -rw-r--r--    1 kafka kafka 10485756 Dec 15 08:31 00000000000201167309.timeindex
> -rw-r--r--    1 kafka kafka      164 Dec 15 08:11 leader-epoch-checkpoint
>
> *Why are there extra .snapshot files in there?*
>
> These two are before and after the offset error that is mentioned:
>
> -rw-r--r--    1 kafka kafka       10 Dec 15 04:36 00000000000200349263.snapshot
> -rw-r--r--    1 kafka kafka       10 Dec 15 06:01 00000000000200650903.snapshot
>
> Taking a deeper look into the log file:
>
> __consumer_offsets-21$ /data/kafka/kafka/bin/kafka-run-class.sh \
>     kafka.tools.DumpLogSegments --deep-iteration \
>     --files 00000000000200248390.log | less
>
> Dumping 00000000000200248390.log
> Starting offset: 200248390
> offset: 201167253 position: 0 CreateTime: 1513324424113 isvalid: true
> keysize: 75 valuesize: 28 magic: 2 compresscodec: NONE producerId: -1
> producerEpoch: -1 sequence: 1 isTransactional: false headerKeys: []
> offset: 201167258 position: 0 CreateTime: 1513324424113 isvalid: true
> keysize: 75 valuesize: 28 magic: 2 compresscodec: NONE producerId: -1
> producerEpoch: -1 sequence: 6 isTransactional: false headerKeys: []
> ...
>
> Interestingly, for partition 21 the error is about offset 200349244. I'd
> expect that to be in 00000000000200248390.log, but when dumping that log
> file manually it starts with offset 201167266.
>
> Still stumped on:
> why is it erroring about 201167266
> why can't I find 201167266 in the log file for that partition (I've
> grepped the output of DumpLogSegments for all 3 log files; it's not there)
> why is it trying 201167266? Is it from the snapshot files? Is there some
> surgery we can do to make it stop. Safely? :)
>
> On Fri, Dec 15, 2017 at 4:33 PM, Brett Rann <br...@zendesk.com> wrote:
>
>> On `kafka_2.11-1.0.1-d04daf570` we are upgrading the log format from
>> 0.9.0.1 to 0.11.0.1, and after the upgrade have set:
>>
>> inter.broker.protocol.version=1.0
>> log.message.format.version=0.11.0.1
>>
>> We have applied this upgrade to 5 clusters by upgrading broker 1, leaving
>> it for a day, then coming back when happy to upgrade the remaining brokers.
>>
>> 4 of those upgrades went without issue.
>>
>> However in one, when we upgraded the remaining brokers, we now start
>> seeing these errors on broker 1:
>>
>> Error processing fetch operation on partition __consumer_offsets-21 offset
>> 200349244
>>
>> For 4 consumer offset partitions, all of which happen to be led by broker 1.
>>
>> kafka-request-handler-3 72 ERROR kafka.server.ReplicaManager
>> 2017-12-15T07:39:40.380+0000 [ReplicaManager broker=1] Error processing
>> fetch operation on partition __consumer_offsets-21 offset 200349244
>> kafka-request-handler-3 72 ERROR kafka.server.ReplicaManager
>> 2017-12-15T07:39:40.381+0000 [ReplicaManager broker=1] Error processing
>> fetch operation on partition __consumer_offsets-11 offset 188709568
>> kafka-request-handler-3 72 ERROR kafka.server.ReplicaManager
>> 2017-12-15T07:39:40.381+0000 [ReplicaManager broker=1] Error processing
>> fetch operation on partition __consumer_offsets-1 offset 2045483676
>> kafka-request-handler-5 74 ERROR kafka.server.ReplicaManager
>> 2017-12-15T07:39:41.672+0000 [ReplicaManager broker=1] Error processing
>> fetch operation on partition __consumer_offsets-31 offset 235294887
>>
>> Every second or so.
>>
>> If we stop that broker, those errors simply shift to the next leader for
>> those 4 partitions. And moving the partitions to completely new brokers
>> just moves the errors.
>>
>> We only see this on kafka1, not the other 9 brokers, which had the log
>> message format upgraded a day or two later.
>>
>> Any suggestion on how to proceed? I'm not even sure yet if this is
>> isolated to the cluster, or if it's related to a consumer misbehaving.
>> Since our multiple clusters /should/ have the same set of
>> producers/consumers working on them, I'm doubtful that it's a misbehaving
>> client.
>>

--
Brett Rann
Senior DevOps Engineer

Zendesk International Ltd
395 Collins Street, Melbourne VIC 3000 Australia
Mobile: +61 (0) 418 826 017
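PS, for anyone following along at home: Kafka names each segment file after the base offset of the first message it holds, so "which segment should contain offset X" is just a floor lookup over the sorted base offsets. A rough illustration of that reasoning (my own sketch, using the base offsets from the partition 21 listing above, not Kafka code):

```python
import bisect

# Base offsets taken from the __consumer_offsets-21 directory listing:
segment_bases = [0, 200248390, 201167309]

def segment_for(offset):
    """Return the base offset of the segment that should contain `offset`,
    i.e. the largest base offset that is <= offset."""
    i = bisect.bisect_right(segment_bases, offset) - 1
    return segment_bases[i]

# The erroring offset falls in the range covered by 00000000000200248390.log,
# which is why it's so odd that DumpLogSegments can't find it there.
print(segment_for(200349244))
```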