I tested out the new patch and am seeing CPU usage comparable to the
previous patch. As far as I can tell, heap usage is also comparable between
the two patches, though both look significantly better than 0.8.1.1
(~250MB vs. ~1GB).

I'll report back if any new issues come up as I start adding more
producer/consumer load.
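
In case anyone wants to repeat the heap comparison, one rough way to watch
broker heap usage over time (just a sketch; it assumes the broker PID is
known and that a full JDK with jstat is on the box) is:

  # find the broker JVM and sample GC/heap utilisation every 5 seconds
  BROKER_PID=$(pgrep -f kafka.Kafka)
  jstat -gcutil $BROKER_PID 5000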

On Sun, Feb 15, 2015 at 6:38 PM, Jun Rao <j...@confluent.io> wrote:

> Solon, Mathias,
>
> Thanks for testing this out. I just uploaded a slightly modified patch in
> https://issues.apache.org/jira/browse/KAFKA-1952. The new patch may not
> improve the latency and CPU usage further, but will potentially improve
> memory consumption. It would be great if you guys can test the new patch
> out.
>
> Thanks,
>
> Jun
>
> On Sat, Feb 14, 2015 at 9:08 AM, Mathias Söderberg <
> mathias.soederb...@gmail.com> wrote:
>
> > Jun,
> >
> > I updated our brokers earlier today with the mentioned patch. A week ago
> > our brokers used ~380% CPU (out of 400%) quite consistently, and now
> > they're varying between 250-325% (probably running a bit high right now as
> > we have some consumers catching up on quite a bit of lag), so there's
> > definitely an improvement. The producer latency is still a bit higher than
> > with 0.8.1.1, but I've been playing a bit with broker and producer
> > configuration lately, so that probably plays a part.
> >
> > I'll keep an eye on our metrics, and am going to mess around a bit with
> > configuration. Right now our traffic load is quite low, so it'd be
> > interesting to see how this works over the next few days. With that said,
> > we're at the same levels of CPU usage as with 0.8.1.1 (though with an
> > additional broker), so everything looks pretty great.
> >
> > We're using acks = "all" (-1) by the way.
> >
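> > In config terms that's the following (an illustrative snippet only; the
> > property name depends on whether the old Scala producer or the new Java
> > producer is in use):
> >
> >   # old Scala producer
> >   request.required.acks=-1
> >   # new Java producer (0.8.2+)
> >   acks=all
> >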
> > Best regards,
> > Mathias
> >
> > On Sat Feb 14 2015 at 4:40:31 AM Solon Gordon <so...@knewton.com> wrote:
> >
> > > Thanks for the fast response. I did a quick test and initial results
> > > look promising. When I swapped in the patched version, CPU usage dropped
> > > from ~150% to ~65%. Still a bit higher than what I see with 0.8.1.1 but
> > > much more reasonable.
> > >
> > > I'll do more testing on Monday but wanted to get you some quick
> > > feedback. Hopefully Mathias will have good results as well.
> > >
> > > On Fri, Feb 13, 2015 at 9:14 PM, Jun Rao <j...@confluent.io> wrote:
> > >
> > > > Mathias, Solon,
> > > >
> > > > We did identify a CPU issue and patched it in
> > > > https://issues.apache.org/jira/browse/KAFKA-1952. Could you apply the
> > > > patch in the 0.8.2 branch and see if that addresses the issue?
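> > > >
> > > > Roughly, applying it would look something like the following (a sketch
> > > > only; the patch file name is a placeholder for whatever is attached to
> > > > the JIRA):
> > > >
> > > >   git checkout 0.8.2
> > > >   git apply KAFKA-1952.patch    # or: patch -p1 < KAFKA-1952.patch
> > > >   ./gradlew jar                 # rebuild the broker jars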
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Fri, Feb 13, 2015 at 3:26 PM, Jay Kreps <jay.kr...@gmail.com> wrote:
> > > >
> > > > > We can reproduce this issue, have a theory as to the cause, and are
> > > > > working on a fix. Here is the ticket to track it:
> > > > > https://issues.apache.org/jira/browse/KAFKA-1952
> > > > >
> > > > > I would recommend people hold off on 0.8.2 upgrades until we have a
> > > > > handle on this.
> > > > >
> > > > > -Jay
> > > > >
> > > > > On Fri, Feb 13, 2015 at 1:47 PM, Solon Gordon <so...@knewton.com> wrote:
> > > > >
> > > > > > The partitions nearly all have replication factor 2 (a few stray
> > > > > > ones have 1), and our producers use request.required.acks=-1.
> > > > > > However, I should note there were hardly any messages being
> > > > > > produced when I did the upgrade and observed the high CPU load.
> > > > > >
> > > > > > I should have time to do some profiling on Monday and will get back
> > > > > > to you with the results.
> > > > > >
> > > > > > On Fri, Feb 13, 2015 at 1:00 PM, Jun Rao <j...@confluent.io> wrote:
> > > > > >
> > > > > > > Solon,
> > > > > > >
> > > > > > > What's the replication factor you used for those partitions?
> > > > > > > What's the producer ack that you used? Also, could you do a bit
> > > > > > > of profiling on the broker to see which methods used the most CPU?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Thu, Feb 12, 2015 at 3:19 PM, Solon Gordon <so...@knewton.com> wrote:
> > > > > > >
> > > > > > > > I saw a very similar jump in CPU usage when I tried upgrading
> > > > > > > > from 0.8.1.1 to 0.8.2.0 today in a test environment. The Kafka
> > > > > > > > cluster there consists of two m1.larges handling 2,000
> > > > > > > > partitions across 32 topics. CPU usage rose from 40% into the
> > > > > > > > 150%–190% range, and load average from under 1 to over 4.
> > > > > > > > Downgrading to 0.8.1.1 brought the CPU and load back to the
> > > > > > > > previous values.
> > > > > > > >
> > > > > > > > If there's more info that would be helpful, please let me know.
> > > > > > > >
> > > > > > > > On Thu, Feb 12, 2015 at 4:17 PM, Mathias Söderberg <
> > > > > > > > mathias.soederb...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Jun,
> > > > > > > > >
> > > > > > > > > Pardon the radio silence. I booted up a new broker, created a
> > > > > > > > > topic with three (3) partitions and replication factor one (1)
> > > > > > > > > and used the *kafka-producer-perf-test.sh* script to generate
> > > > > > > > > load (using messages of roughly the same size as ours). There
> > > > > > > > > was a slight increase in CPU usage (~5-10%) on 0.8.2.0-rc2
> > > > > > > > > compared to 0.8.1.1, but that was about it.
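> > > > > > > > >
> > > > > > > > > For reference, that kind of test setup would look roughly like
> > > > > > > > > the following (a sketch only; topic name, host names and
> > > > > > > > > message counts are placeholders, and the perf-test options
> > > > > > > > > vary a bit between versions):
> > > > > > > > >
> > > > > > > > >   bin/kafka-topics.sh --create --zookeeper zk:2181 \
> > > > > > > > >     --topic perf-test --partitions 3 --replication-factor 1
> > > > > > > > >   bin/kafka-producer-perf-test.sh --broker-list broker:9092 \
> > > > > > > > >     --topics perf-test --messages 1000000 --message-size 1000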
> > > > > > > > >
> > > > > > > > > I upgraded our staging cluster to 0.8.2.0 earlier this week or
> > > > > > > > > so, and had to add an additional broker due to increased load
> > > > > > > > > after the upgrade (note that the incoming load on the cluster
> > > > > > > > > has been pretty much consistent). Since the upgrade we've been
> > > > > > > > > seeing a 2-3x increase in latency as well. I'm considering
> > > > > > > > > downgrading to 0.8.1.1 again to see if it resolves our issues.
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Mathias
> > > > > > > > >
> > > > > > > > > On Tue Feb 03 2015 at 6:44:36 PM Jun Rao <j...@confluent.io> wrote:
> > > > > > > > >
> > > > > > > > > > Mathias,
> > > > > > > > > >
> > > > > > > > > > The new hprof output doesn't reveal anything new to me. We
> > > > > > > > > > did fix the logic for using Purgatory in 0.8.2, which could
> > > > > > > > > > potentially drive up the CPU usage a bit. To verify that,
> > > > > > > > > > could you run your test on a single broker (with replication
> > > > > > > > > > factor 1) on both 0.8.1 and 0.8.2 and see if there is any
> > > > > > > > > > significant difference in CPU usage?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > > Jun
> > > > > > > > > >
> > > > > > > > > > On Tue, Feb 3, 2015 at 5:09 AM, Mathias Söderberg <
> > > > > > > > > > mathias.soederb...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Jun,
> > > > > > > > > > >
> > > > > > > > > > > I re-ran the hprof test (for about 30 minutes again) for
> > > > > > > > > > > 0.8.2.0-rc2 with the same version of snappy that 0.8.1.1
> > > > > > > > > > > used. I've attached the logs. Unfortunately there wasn't
> > > > > > > > > > > any improvement, as the node running 0.8.2.0-rc2 still had
> > > > > > > > > > > higher load and CPU usage.
> > > > > > > > > > >
> > > > > > > > > > > Best regards,
> > > > > > > > > > > Mathias
> > > > > > > > > > >
> > > > > > > > > > > On Tue Feb 03 2015 at 4:40:31 AM Jaikiran Pai <
> > > > > > > > > > > jai.forums2...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > >> On Monday 02 February 2015 11:03 PM, Jun Rao wrote:
> > > > > > > > > > >> > Jaikiran,
> > > > > > > > > > >> >
> > > > > > > > > > >> > The fix you provided is probably unnecessary. The
> > > > > > > > > > >> > channel that we use in SimpleConsumer (BlockingChannel)
> > > > > > > > > > >> > is configured to be blocking. So even though the read
> > > > > > > > > > >> > from the socket is in a loop, each read blocks if no
> > > > > > > > > > >> > bytes are received from the broker. So, that shouldn't
> > > > > > > > > > >> > cause extra CPU consumption.
> > > > > > > > > > >> Hi Jun,
> > > > > > > > > > >>
> > > > > > > > > > >> Of course, you are right! I forgot that while reading the
> > > > > > > > > > >> thread dump in hprof output, one has to be aware that the
> > > > > > > > > > >> thread state isn't shown and the thread need not
> > > > > > > > > > >> necessarily be doing any CPU activity.
> > > > > > > > > > >>
> > > > > > > > > > >> -Jaikiran
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >> >
> > > > > > > > > > >> > Thanks,
> > > > > > > > > > >> >
> > > > > > > > > > >> > Jun
> > > > > > > > > > >> >
> > > > > > > > > > >> > On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
> > > > > > > > > > >> > mathias.soederb...@gmail.com> wrote:
> > > > > > > > > > >> >
> > > > > > > > > > >> >> Hi Neha,
> > > > > > > > > > >> >>
> > > > > > > > > > >> >> I sent an e-mail earlier today, but noticed now that
> > > > > > > > > > >> >> it didn't actually go through.
> > > > > > > > > > >> >>
> > > > > > > > > > >> >> Anyhow, I've attached two files, one with output from
> > > > > > > > > > >> >> a 10 minute run and one with output from a 30 minute
> > > > > > > > > > >> >> run. I realize I probably should've done one or two
> > > > > > > > > > >> >> runs with 0.8.1.1 as well.
> > > > > > > > > > >> >>
> > > > > > > > > > >> >> I upgraded our staging cluster to 0.8.2.0-rc2, and I'm
> > > > > > > > > > >> >> seeing the same CPU usage as with the beta version
> > > > > > > > > > >> >> (basically pegging all cores). If I manage to find the
> > > > > > > > > > >> >> time I'll do another run with hprof on the rc2 version
> > > > > > > > > > >> >> later today.
> > > > > > > > > > >> >>
> > > > > > > > > > >> >> Best regards,
> > > > > > > > > > >> >> Mathias
> > > > > > > > > > >> >>
> > > > > > > > > > >> >> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <
> > > > > > > > > > >> >> n...@confluent.io> wrote:
> > > > > > > > > > >> >>
> > > > > > > > > > >> >>> The following should be sufficient:
> > > > > > > > > > >> >>>
> > > > > > > > > > >> >>> java -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=kafka.hprof
> > > > > > > > > > >> >>> <classname>
> > > > > > > > > > >> >>>
> > > > > > > > > > >> >>> You would need to start the Kafka server with the
> > > > > > > > > > >> >>> settings above for some time until you observe the
> > > > > > > > > > >> >>> problem.
> > > > > > > > > > >> >>>
> > > > > > > > > > >> >>> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
> > > > > > > > > > >> >>> mathias.soederb...@gmail.com> wrote:
> > > > > > > > > > >> >>>
> > > > > > > > > > >> >>>> Hi Neha,
> > > > > > > > > > >> >>>>
> > > > > > > > > > >> >>>> Yeah sure. I'm not familiar with hprof, so are there
> > > > > > > > > > >> >>>> any particular options I should include, or should I
> > > > > > > > > > >> >>>> just run with the defaults?
> > > > > > > > > > >> >>>>
> > > > > > > > > > >> >>>> Best regards,
> > > > > > > > > > >> >>>> Mathias
> > > > > > > > > > >> >>>>
> > > > > > > > > > >> >>>> On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <
> > > > > > > > > > >> >>>> n...@confluent.io> wrote:
> > > > > > > > > > >> >>>>> Thanks for reporting the issue. Would you mind
> > > > > > > > > > >> >>>>> running hprof and sending the output?
> > > > > > > > > > >> >>>>>
> > > > > > > > > > >> >>>>> On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
> > > > > > > > > > >> >>>>> mathias.soederb...@gmail.com> wrote:
> > > > > > > > > > >> >>>>>
> > > > > > > > > > >> >>>>>> Good day,
> > > > > > > > > > >> >>>>>>
> > > > > > > > > > >> >>>>>> I upgraded a Kafka cluster from v0.8.1.1 to
> > > > > > > > > > >> >>>>>> v0.8.2-beta and noticed that the CPU usage on the
> > > > > > > > > > >> >>>>>> broker machines went up by roughly 40%, from ~60%
> > > > > > > > > > >> >>>>>> to ~100%, and am wondering if anyone else has
> > > > > > > > > > >> >>>>>> experienced something similar? The load average
> > > > > > > > > > >> >>>>>> also went up by 2x-3x.
> > > > > > > > > > >> >>>>>>
> > > > > > > > > > >> >>>>>> We're running on EC2 and the cluster currently
> > > > > > > > > > >> >>>>>> consists of four m1.xlarge instances, with roughly
> > > > > > > > > > >> >>>>>> 1100 topics / 4000 partitions. We're using Java 7
> > > > > > > > > > >> >>>>>> (1.7.0_65 to be exact) and Scala 2.9.2.
> > > > > > > > > > >> >>>>>> Configurations can be found over here:
> > > > > > > > > > >> >>>>>> https://gist.github.com/mthssdrbrg/7df34a795e07eef10262
> > > > > > > > > > >> >>>>>>
> > > > > > > > > > >> >>>>>> I'm assuming that this is not expected behaviour
> > > > > > > > > > >> >>>>>> for 0.8.2-beta?
> > > > > > > > > > >> >>>>>>
> > > > > > > > > > >> >>>>>> Best regards,
> > > > > > > > > > >> >>>>>> Mathias
> > > > > > > > > > >> >>>>>>
> > > > > > > > > > >> >>>>>
> > > > > > > > > > >> >>>>>
> > > > > > > > > > >> >>>>> --
> > > > > > > > > > >> >>>>> Thanks,
> > > > > > > > > > >> >>>>> Neha
> > > > > > > > > > >> >>>>>
> > > > > > > > > > >> >>>
> > > > > > > > > > >> >>>
> > > > > > > > > > >> >>> --
> > > > > > > > > > >> >>> Thanks,
> > > > > > > > > > >> >>> Neha
> > > > > > > > > > >> >>>
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
