Hey all, just wanted to confirm: this was totally our issue. Thanks so much, Todd and Matt, our cluster is much more stable now.
Apache Kafka folks: I know 0.8.3 is slated to come out soon, but this is a pretty serious bug. I would think it would merit a minor release just to get it out there, so that others don't run into this problem. 0.8.2.1 basically does not work at scale with snappy compression. I will add a comment to https://issues.apache.org/jira/browse/KAFKA-2189 noting this too.

Thanks so much!
-Andrew

On Tue, Aug 11, 2015 at 3:43 PM, Matthew Bruce <mbr...@blackberry.com> wrote:

> Hi Andrew,
>
> I work with Todd and did our 0.8.2.1 testing with him. I believe that the
> Kafka 0.8.x brokers recompress the messages once they receive them, in
> order to assign the offsets to the messages (see the ‘Compression in Kafka’
> section of:
> http://nehanarkhede.com/2013/03/28/compression-in-kafka-gzip-or-snappy/).
> I expect that you will see an improvement with Snappy 1.1.1.7 (FWIW, our
> load generator’s version of Snappy didn’t change between our 0.8.1.1 and
> 0.8.2.1 testing, and we still saw the IO hit on the broker side, which
> seems to confirm this).
>
> Thanks,
> Matt Bruce
>
> *From:* Andrew Otto [mailto:ao...@wikimedia.org]
> *Sent:* Tuesday, August 11, 2015 3:15 PM
> *To:* users@kafka.apache.org
> *Cc:* Dan Andreescu <dandree...@wikimedia.org>; Joseph Allemandou <
> jalleman...@wikimedia.org>
> *Subject:* Re: 0.8.2.1 upgrade causes much more IO
>
> Hi Todd,
>
> We are using snappy! And we are using version 1.1.1.6 as of our upgrade
> to 0.8.2.1 yesterday. However, as far as I can tell, that is only relevant
> for Java producers, right? Our main producers use librdkafka (the Kafka C
> lib) to produce, and in doing so use a built-in C version of snappy[1].
>
> Even so, your issue sounds very similar to mine, and I don’t have a full
> understanding of how brokers deal with compression, so I have updated the
> snappy java version to 1.1.1.7 on one of our brokers.
> We’ll have to wait a while to see if the log sizes are actually smaller
> for data written to this broker.
>
> Thanks!
>
> [1] https://github.com/edenhill/librdkafka/blob/0.8.5/src/snappy.c
>
> On Aug 11, 2015, at 12:58, Todd Snyder <tsny...@blackberry.com> wrote:
>
> Hi Andrew,
>
> Are you using Snappy Compression by chance? When we tested the 0.8.2.1
> upgrade initially we saw similar results and tracked it down to a problem
> with Snappy version 1.1.1.6 (
> https://issues.apache.org/jira/browse/KAFKA-2189). We’re running with
> Snappy 1.1.1.7 now and the performance is back to where it used to be.
>
> Sent from my BlackBerry 10 smartphone on the TELUS network.
>
> *From:* Andrew Otto
> *Sent:* Tuesday, August 11, 2015 12:26 PM
> *To:* users@kafka.apache.org
> *Reply To:* users@kafka.apache.org
> *Cc:* Dan Andreescu; Joseph Allemandou
> *Subject:* 0.8.2.1 upgrade causes much more IO
>
> Hi all!
>
> Yesterday I did a production upgrade of our 4 broker Kafka cluster from
> 0.8.1.1 to 0.8.2.1.
>
> When we did so, we were running our (varnishkafka) producers with
> request.required.acks = -1. After switching to 0.8.2.1, producers saw
> produce response RTTs of >60 seconds. I then switched to
> request.required.acks = 1, and producers settled down. However, we then
> started seeing flapping ISRs about every 10 minutes. We run Camus every 10
> minutes. If we disable Camus, then ISRs don’t flap.
>
> All of these issues seem to be a side effect of a larger problem. The
> total amount of network and disk IO that Kafka brokers are doing after the
> upgrade to 0.8.2.1 has tripled. We were previously seeing about 20 MB/s
> incoming on broker interfaces; 0.8.2.1 knocks this up to around 60 MB/s.
> Disk writes have tripled accordingly.
> Disk reads have also increased by a huge amount, although I suspect this
> is a consequence of more data flying around somehow dirtying the disk
> cache.
>
> You can see these changes in this dashboard:
> http://grafana.wikimedia.org/#/dashboard/db/kafka-0821-upgrade
>
> The upgrade started at around 2015-08-10 14:30, and was completed on all 4
> brokers within a couple of hours.
>
> Probably the most relevant is network rx_bytes on brokers.
>
> We looked at Kafka .log file sizes and noticed that file sizes are indeed
> much larger than they were before this upgrade:
>
> # 0.8.1.1
> 2015-08-10T04 38119109383
> 2015-08-10T05 46172089174
> 2015-08-10T06 46172182745
> 2015-08-10T07 53151490032
> 2015-08-10T08 53151892928
> 2015-08-10T09 55836248198
> 2015-08-10T10 57984054557
> 2015-08-10T11 63353197416
> 2015-08-10T12 68184938548
> 2015-08-10T13 69259218741
> 2015-08-10T14 79567698089
> # Upgrade to 0.8.2.1 starts here
> 2015-08-10T15 133643184876
> 2015-08-10T16 168515916825
> 2015-08-10T17 181394338213
> 2015-08-10T18 177097927553
> 2015-08-10T19 183530782549
> 2015-08-10T20 178706680082
> 2015-08-10T21 178712665924
> 2015-08-10T22 171741495606
> 2015-08-10T23 169049665348
> 2015-08-11T00 163682183241
> 2015-08-11T01 165292426510
>
> Aside from the request.required.acks change I mentioned above, we haven’t
> made any config changes on brokers, producers, or consumers. Our
> server.properties file is here:
> https://gist.github.com/ottomata/cdd270102287661c176a
>
> Has anyone seen this before? What could be the cause of more data here?
> Perhaps there is some compression config change that we missed that is
> causing this data to be sent or saved uncompressed? (Sent uncompressed is
> unlikely, as we would probably notice a larger network change on the
> producers than we do. (Unless I’m looking at that wrong right now…:)) Is
> there a quick way to tell if the data is compressed?
>
> Thanks!
>
> -Andrew Otto
>
> ---------------------------------------------------------------------
> This transmission (including any attachments) may contain confidential
> information, privileged material (including material protected by the
> solicitor-client or other applicable privileges), or constitute non-public
> information. Any use of this information by anyone other than the intended
> recipient is prohibited. If you have received this transmission in error,
> please immediately reply to the sender and delete this information from
> your system. Use, dissemination, distribution, or reproduction of this
> transmission by unintended recipients is not authorized and may be unlawful.
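
For anyone reading this thread later: the "tripled" observation can be sanity-checked against the hourly .log sizes quoted in the original message. The snippet below is only a rough sketch. The numbers are copied from the thread, but the units (assumed to be bytes), the names `SIZES_0811`, `SIZES_0821`, `mean` and `growth_ratio`, and the choice to compare simple means are all assumptions made for illustration. The two sets of hours also cover different times of day, and the first post-upgrade hour falls in the middle of the rollout, so treat the result as approximate.

```python
# Hourly Kafka .log sizes copied from the thread.
# Units are assumed to be bytes; the thread does not say.

# 2015-08-10T04 .. 2015-08-10T14, brokers on 0.8.1.1
SIZES_0811 = [
    38119109383, 46172089174, 46172182745, 53151490032,
    53151892928, 55836248198, 57984054557, 63353197416,
    68184938548, 69259218741, 79567698089,
]

# 2015-08-10T15 .. 2015-08-11T01, after the upgrade to 0.8.2.1 began
SIZES_0821 = [
    133643184876, 168515916825, 181394338213, 177097927553,
    183530782549, 178706680082, 178712665924, 171741495606,
    169049665348, 163682183241, 165292426510,
]


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def growth_ratio(before, after):
    """Ratio of the mean 'after' size to the mean 'before' size."""
    return mean(after) / mean(before)


if __name__ == "__main__":
    ratio = growth_ratio(SIZES_0811, SIZES_0821)
    print(f"mean before upgrade: {mean(SIZES_0811) / 1e9:.1f} GB")
    print(f"mean after upgrade:  {mean(SIZES_0821) / 1e9:.1f} GB")
    print(f"growth ratio:        {ratio:.2f}x")
```

Under these assumptions the ratio comes out close to three, which lines up with the jump from about 20 MB/s to about 60 MB/s of incoming broker traffic described in the thread.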