Turns out it was Burrow sending an invalid request.

After adding extra debugging information to Kafka, the source of the
negative max bytes was apparent.  So it was probably just Burrow not being
tolerant of the brokers being upgraded behind its back. It's running fine
again after a restart.  It's an older version of Burrow, and I see a nice
shiny big release is out, so we'll upgrade to that too.

Here's a dump of the request it sent, as Kafka parsed it:

{
  replica_id=-1
  max_wait_time=250
  min_bytes=1
  topics=
  [
    {
      topic=__consumer_offsets
      partitions=
      [
        {partition=15 fetch_offset=2512181 max_bytes=32768}
        {partition=40 fetch_offset=31502577 max_bytes=32768}
        {partition=25 fetch_offset=618438090 max_bytes=32768}
        {partition=45 fetch_offset=4234585 max_bytes=32768}
        {partition=35 fetch_offset=81494695 max_bytes=32768}
        {partition=20 fetch_offset=14651661 max_bytes=32768}
        {partition=10 fetch_offset=273654177 max_bytes=32768}
        {partition=0 fetch_offset=32357901 max_bytes=-2147483648}
        {partition=30 fetch_offset=155292581 max_bytes=32768}
        {partition=5 fetch_offset=140017282 max_bytes=32768}
      ]
    }
  ]
}
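
That max_bytes of -2147483648 on partition 0 is the tell: it's Integer.MIN_VALUE
for a 32-bit signed int, the classic fingerprint of a value that overflowed or
got negated somewhere rather than anything set deliberately. A quick
illustration in plain Java (nothing Kafka-specific, just showing how that exact
number falls out of int arithmetic):

// -2147483648 is Integer.MIN_VALUE: the one int whose negation overflows back to itself.
public class MinValueDemo {
    public static void main(String[] args) {
        System.out.println(Integer.MIN_VALUE);            // -2147483648, the max_bytes seen above
        System.out.println(Integer.MAX_VALUE + 1);        // -2147483648: overflow wraps around
        System.out.println(Math.abs(Integer.MIN_VALUE));  // still -2147483648: abs() overflows too
    }
}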

Anyway, easily fixed with restarts of Burrow, once it was clear which
consumer was having the issue.

On Fri, Dec 15, 2017 at 5:04 PM, Brett Rann <br...@zendesk.com> wrote:

> Another interesting datapoint:
>
> Taking a deeper look at partition 21:
>
> brann@kafka1:/data/kafka/logs/__consumer_offsets-21$ ls -la
> total 20176
> drwxr-xr-x    2 kafka kafka     4096 Dec 15 08:11 .
> drwxr-xr-x 1605 kafka kafka   167936 Dec 15 08:31 ..
> -rw-r--r--    1 kafka kafka        0 Dec 15 08:03 00000000000000000000.index
> -rw-r--r--    1 kafka kafka    20924 Dec 15 04:08 00000000000000000000.log
> -rw-r--r--    1 kafka kafka       12 Dec 15 08:03 00000000000000000000.timeindex
> -rw-r--r--    1 kafka kafka        0 Dec 15 08:03 00000000000200248390.index
> -rw-r--r--    1 kafka kafka     3662 Dec 15 07:53 00000000000200248390.log
> -rw-r--r--    1 kafka kafka       10 Dec 15 04:08 00000000000200248390.snapshot
> -rw-r--r--    1 kafka kafka       12 Dec 15 08:03 00000000000200248390.timeindex
> -rw-r--r--    1 kafka kafka       10 Dec 15 04:36 00000000000200349263.snapshot
> -rw-r--r--    1 kafka kafka       10 Dec 15 06:01 00000000000200650903.snapshot
> -rw-r--r--    1 kafka kafka 10485760 Dec 15 08:31 00000000000201167309.index
> -rw-r--r--    1 kafka kafka 15874568 Dec 15 08:31 00000000000201167309.log
> -rw-r--r--    1 kafka kafka       10 Dec 15 07:53 00000000000201167309.snapshot
> -rw-r--r--    1 kafka kafka 10485756 Dec 15 08:31 00000000000201167309.timeindex
> -rw-r--r--    1 kafka kafka      164 Dec 15 08:11 leader-epoch-checkpoint
>
> *Why are there extra .snapshot files in there?*
>
> These two are before and after the offset error that is mentioned:
>
> -rw-r--r--    1 kafka kafka       10 Dec 15 04:36 00000000000200349263.snapshot
> -rw-r--r--    1 kafka kafka       10 Dec 15 06:01 00000000000200650903.snapshot
>
>
> Taking a deeper look into the log file:
>
> __consumer_offsets-21$ /data/kafka/kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration --files 00000000000200248390.log | less
>
> Dumping 00000000000200248390.log
> Starting offset: 200248390
> offset: 201167253 position: 0 CreateTime: 1513324424113 isvalid: true keysize: 75 valuesize: 28 magic: 2 compresscodec: NONE producerId: -1 producerEpoch: -1 sequence: 1 isTransactional: false headerKeys: []
> offset: 201167258 position: 0 CreateTime: 1513324424113 isvalid: true keysize: 75 valuesize: 28 magic: 2 compresscodec: NONE producerId: -1 producerEpoch: -1 sequence: 6 isTransactional: false headerKeys: []
> ...
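>
> For what it's worth, the batch offset ranges can also be walked
> programmatically with the FileRecords class from kafka-clients. It's an
> internal class rather than a public API, so treat this as a rough sketch
> that may need tweaking per version:
>
> import java.io.File;
> import org.apache.kafka.common.record.FileLogInputStream.FileChannelRecordBatch;
> import org.apache.kafka.common.record.FileRecords;
>
> // Rough sketch: print each batch's offset range in a segment file, to check
> // whether a particular offset (e.g. 200349244) actually lands in that file.
> public class SegmentBatches {
>     public static void main(String[] args) throws Exception {
>         FileRecords records = FileRecords.open(new File("00000000000200248390.log"));
>         for (FileChannelRecordBatch batch : records.batches()) {
>             System.out.println("baseOffset=" + batch.baseOffset()
>                     + " lastOffset=" + batch.lastOffset());
>         }
>         records.close();
>     }
> }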
>
> Interestingly, for partition 21 the error is about offset 200349244. I'd
> expect that to be in 00000000000200248390.log, but when dumping that log
> file manually it starts with offset 201167266.
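>
> The expectation comes from how the segment files are named: each file name is
> the segment's base offset, and the segment that should hold a given offset is
> the one with the largest base offset at or below it. A minimal sketch of that
> lookup in plain Java (not Kafka's actual LogSegment code, just an illustration
> using the base offsets from the listing above):
>
> import java.util.TreeMap;
>
> // Sketch only: pick the segment whose base offset is the floor of the target offset.
> public class SegmentLookup {
>     public static void main(String[] args) {
>         TreeMap<Long, String> segments = new TreeMap<>();
>         segments.put(0L,         "00000000000000000000.log");
>         segments.put(200248390L, "00000000000200248390.log");
>         segments.put(201167309L, "00000000000201167309.log");
>
>         long offset = 200349244L;  // the offset from the fetch error
>         // floorEntry(200349244) -> 200248390, i.e. 00000000000200248390.log
>         System.out.println(segments.floorEntry(offset).getValue());
>     }
> }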
>
>
> Still stumped on:
> Why is it erroring about 201167266?
> Why can't I find 201167266 in the log file for that partition? (I've
> grepped the output of DumpLogSegments for all 3 log files; it's not there.)
> Why is it trying 201167266? Is it from the snapshot files? Is there some
> surgery we can do to make it stop? Safely? :)
>
> On Fri, Dec 15, 2017 at 4:33 PM, Brett Rann <br...@zendesk.com> wrote:
>
>> On `kafka_2.11-1.0.1-d04daf570` we are upgrading the log format from
>> 0.9.0.1 to 0.11.0.1, and after the upgrade have set
>>
>> inter.broker.protocol.version=1.0
>> log.message.format.version=0.11.0.1
>>
>> We have applied this upgrade to 5 clusters by upgrading broker 1, leaving
>> it for a day, then coming back when happy to upgrade the remaining brokers.
>>
>>
>> 4 of those upgrades went without issue.
>>
>>
>> However, in one cluster, after we upgraded the remaining brokers, we
>> started seeing these errors on broker 1:
>>
>> Error processing fetch operation on partition __consumer_offsets-21  offset 200349244
>>
>> For 4 consumer offset partitions, all of which happen to be led by broker 1.
>>
>> kafka-request-handler-3 72 ERROR kafka.server.ReplicaManager 2017-12-15T07:39:40.380+0000 [ReplicaManager broker=1] Error processing fetch operation on partition __consumer_offsets-21  offset 200349244
>> kafka-request-handler-3 72 ERROR kafka.server.ReplicaManager 2017-12-15T07:39:40.381+0000 [ReplicaManager broker=1] Error processing fetch operation on partition __consumer_offsets-11  offset 188709568
>> kafka-request-handler-3 72 ERROR kafka.server.ReplicaManager 2017-12-15T07:39:40.381+0000 [ReplicaManager broker=1] Error processing fetch operation on partition __consumer_offsets-1  offset 2045483676
>> kafka-request-handler-5 74 ERROR kafka.server.ReplicaManager 2017-12-15T07:39:41.672+0000 [ReplicaManager broker=1] Error processing fetch operation on partition __consumer_offsets-31  offset 235294887
>>
>> Every second or so.
>>
>> If we stop that broker, those errors simply shift to the next leader for
>> those 4 partitions. And moving the partitions to completely new brokers
>> just moves the errors.
>>
>> We only see this on kafka1, not the other 9 brokers, which had the log
>> message format upgraded a day or two later.
>>
>> Any suggestion on how to proceed? I'm not even sure yet if this is
>> isolated to the cluster, or if it's related to a consumer misbehaving.
>> Since our multiple clusters /should/ have the same set of
>> producers/consumers working on them, I'm doubtful that it's a misbehaving
>> client.
>>
>
>
>


-- 

Brett Rann

Senior DevOps Engineer


Zendesk International Ltd

395 Collins Street, Melbourne VIC 3000 Australia

Mobile: +61 (0) 418 826 017
