So, I did my own latency test on a cluster of 3 nodes, and there is a
significant difference around the 99%’ile and higher across partitions when
measuring the ack time with the producer configured for a single ack.  The
graph that I wish I could attach or post clearly shows that around 1/3 of the
partitions diverge significantly from the other two thirds.  So, at least in
my case, one of my brokers is farther away (latency-wise) than the others.
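
For reference, a minimal sketch of this kind of single-ack timing test might
look like the following (illustrative only, not the exact code I ran; broker
addresses and the topic name are placeholders):

import java.util.Properties;

import org.HdrHistogram.Histogram;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AckLatencyTest {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
        props.put("acks", "1");  // wait for the partition leader only
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        // Track latencies in microseconds up to one hour, 3 significant digits.
        final Histogram ackLatency = new Histogram(3_600_000_000L, 3);
        byte[] payload = new byte[200];

        // Callbacks run on the producer I/O thread; close() waits for them.
        try (Producer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 500_000; i++) {
                final long start = System.nanoTime();
                producer.send(new ProducerRecord<>("latency-test", payload),
                        (metadata, exception) -> {
                            if (exception == null) {
                                ackLatency.recordValue((System.nanoTime() - start) / 1000);
                            }
                        });
            }
        }

        System.out.printf("50%%=%dus 99%%=%dus 99.9%%=%dus max=%dus%n",
                ackLatency.getValueAtPercentile(50.0),
                ackLatency.getValueAtPercentile(99.0),
                ackLatency.getValueAtPercentile(99.9),
                ackLatency.getMaxValue());
    }
}
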
-Erik

On 9/4/15, 1:06 PM, "Yuheng Du" <yuheng.du.h...@gmail.com> wrote:

>No problem. Thanks for your advice. I think it would be fun to explore. I
>only know how to program in java though. Hope it will work.
>
>On Fri, Sep 4, 2015 at 2:03 PM, Helleren, Erik
><erik.helle...@cmegroup.com>
>wrote:
>
>> I think the suggestion is to have partitions/brokers >=1, so 32 should
>>be
>> enough.
>>
>> As for latency tests, there isn’t a lot of code to do a latency test.
>>If
>> you just want to measure ack time it’s around 100 lines.  I will try to
>> push out some good latency testing code to github, but my company is
>> scared of open sourcing code… so it might be a while…
>> -Erik
>>
>>
>> On 9/4/15, 12:55 PM, "Yuheng Du" <yuheng.du.h...@gmail.com> wrote:
>>
>> >Thanks for your reply Erik. I am running some more tests according to
>>your
>> >suggestions now and I will share my results here. Is it necessary
>>to
>> >use a fixed number of partitions (32 partitions maybe) for my test?
>> >
>> >I am testing 2, 4, 8, 16 and 32 broker scenarios, all of which are
>>running
>> >on individual physical nodes. So I think using at least 32 partitions
>>will
>> >make more sense? I have seen latencies increase as the number of
>> >partitions
>> >goes up in my experiments.
>> >
>> >To get the latency of each event recorded, are you suggesting that I
>> >write my own test program (in Java perhaps) or that I can just modify the
>> >standard test program provided by kafka (
>> >https://gist.github.com/jkreps/c7ddb4041ef62a900e6c )? I guess I need
>>to
>> >rebuild the source if I modify the standard Java test program
>> >ProducerPerformance provided in kafka, right? Now this standard program
>> >only has average latencies and percentile latencies but no per-event
>> >latencies.
>> >
>> >Thanks.
>> >
>> >On Fri, Sep 4, 2015 at 1:42 PM, Helleren, Erik
>> ><erik.helle...@cmegroup.com>
>> >wrote:
>> >
>> >> That is an excellent question!  There are a bunch of ways to monitor
>> >> jitter and see when that is happening.  Here are a few:
>> >>
>> >> - You could slice the histogram every few seconds, save it out with a
>> >> timestamp, and then look at how they compare.  This would be mostly
>> >> manual, or you can graph line charts of the percentiles over time in
>> >>excel
>> >> where each percentile would be a series.  If you are using HDR
>> >>Histogram,
>> >> you should look at how to use the Recorder class to do this coupled with a
>> >> ScheduledExecutorService (see the rough sketch after this list).
>> >>
>> >> - You can just save the starting timestamp of the event and the
>>latency
>> >>of
>> >> each event.  If you put it into a CSV, you can just load it up into
>> >>excel
>> >> and graph it as an XY chart.  That way you can see every point during the
>> >> running of your program and you can see trends.  You want to be careful
>> >> about this one, especially about writing to a file in the callback that
>> >> kafka provides.
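>> >>
>> >> A rough sketch of that Recorder idea, just to show the shape of it (the
>> >> class name and output format here are made up, and it isn’t tested):
>> >>
>> >> import java.util.concurrent.Executors;
>> >> import java.util.concurrent.ScheduledExecutorService;
>> >> import java.util.concurrent.TimeUnit;
>> >>
>> >> import org.HdrHistogram.Histogram;
>> >> import org.HdrHistogram.Recorder;
>> >>
>> >> public class IntervalLatencyLogger {
>> >>     private final Recorder recorder = new Recorder(3);  // 3 significant digits
>> >>     private final ScheduledExecutorService scheduler =
>> >>             Executors.newSingleThreadScheduledExecutor();
>> >>
>> >>     // Call this from the producer callback with each measured ack latency (us).
>> >>     public void record(long latencyMicros) {
>> >>         recorder.recordValue(latencyMicros);
>> >>     }
>> >>
>> >>     public void start() {
>> >>         scheduler.scheduleAtFixedRate(() -> {
>> >>             // Swaps out the interval's histogram while recording continues.
>> >>             Histogram interval = recorder.getIntervalHistogram();
>> >>             System.out.printf("%d,%d,%d,%d%n",  // timestamp,50th,99th,99.9th (us)
>> >>                     System.currentTimeMillis(),
>> >>                     interval.getValueAtPercentile(50.0),
>> >>                     interval.getValueAtPercentile(99.0),
>> >>                     interval.getValueAtPercentile(99.9));
>> >>         }, 5, 5, TimeUnit.SECONDS);
>> >>     }
>> >>
>> >>     public void stop() {
>> >>         scheduler.shutdown();
>> >>     }
>> >> }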
>> >>
>> >> Also, I have noticed that most of the very slow observations are at
>> >> startup.  But don’t trust me, trust the data and share your findings.
>> >> Also, the 99.9th percentile provides a pretty good standard for typical
>> >> poor-case performance.  Average is borderline useless; the 50%’ile is a
>> >> better measure of the typical case because that’s the number that says
>> >> “half of events will be this slow or faster”, while a high percentile
>> >> like the 99.9%’ile says “0.1% of all events will be slower than this”.
>> >> -Erik
>> >>
>> >> On 9/4/15, 12:05 PM, "Yuheng Du" <yuheng.du.h...@gmail.com> wrote:
>> >>
>> >> >Thank you Erik! That's helpful!
>> >> >
>> >> >But I also see jitter in the maximum latencies when running the
>> >> >experiment.
>> >> >
>> >> >The average end-to-acknowledgement latency from producer to broker
>>is
>> >> >around 5ms when using 92 producers and 4 brokers, and the 99.9
>> >>percentile
>> >> >latency is 58ms, but the maximum latency goes up to 1359 ms. How can I
>> >>locate
>> >> >the source of this jitter?
>> >> >
>> >> >Thanks.
>> >> >
>> >> >On Fri, Sep 4, 2015 at 10:54 AM, Helleren, Erik
>> >> ><erik.helle...@cmegroup.com>
>> >> >wrote:
>> >> >
>> >> >> Well… not to be contrarian, but latency depends much more on the
>> >>latency
>> >> >> between the producer and the broker that is the leader for the
>> >>partition
>> >> >> you are publishing to.  At least when your brokers are not
>>saturated
>> >> >>with
>> >> >> messages, and acks are set to 1.  If acks are set to ALL, latency on a
>> >> >> non-saturated kafka cluster will be: round-trip latency from the producer
>> >> >> to the partition leader + max(round-trip latency from the leader to each
>> >> >> replica of that partition).  If a cluster is saturated with messages, we
>>have to
>> >> >> assume that all partitions receive an equal distribution of
>>messages
>> >>to
>> >> >> avoid linear algebra and queueing theory models.  I don’t like linear
>> >> >> algebra :P
>> >> >>
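>> >> >> For example (made-up numbers): if the round trip from the producer to the
>> >> >> partition leader is 2 ms and the slowest leader-to-replica round trip is
>> >> >> 3 ms, acks=1 should ack in roughly 2 ms and acks=ALL in roughly 2 + 3 = 5 ms
>> >> >> on an unsaturated cluster.  In producer config terms that is just:
>> >> >>
>> >> >> Properties props = new Properties();   // java.util.Properties
>> >> >> props.put("acks", "1");     // wait for the partition leader only
>> >> >> // props.put("acks", "all"); // wait for all in-sync replicas of the partition
>> >> >>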
>> >> >> Since you are probably putting all your latencies into a single
>> >> >>histogram
>> >> >> per producer, or worse, just an average, this pattern would have
>>been
>> >> >> obscured.  Obligatory lecture about measuring latency by Gil Tene
>> >> >> (https://www.youtube.com/watch?v=9MKY4KypBzg).  To verify this
>> >> >>hypothesis,
>> >> >> you should re-write the benchmark to record the latencies for each write
>> >> >> to a partition for each producer into a histogram (HDR Histogram is pretty
>> >> >> good for that).  This would give you producers*partitions
>>histograms,
>> >> >> which might be unwieldy for that many producers. But wait, there
>>is
>> >> >>hope!
>> >> >>
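>> >> >> Roughly, the per-partition recording could look something like this (an
>> >> >> untested sketch; class and method names are made up, error handling omitted):
>> >> >>
>> >> >> import java.util.concurrent.ConcurrentHashMap;
>> >> >> import java.util.concurrent.ConcurrentMap;
>> >> >>
>> >> >> import org.HdrHistogram.Histogram;
>> >> >> import org.apache.kafka.clients.producer.Producer;
>> >> >> import org.apache.kafka.clients.producer.ProducerRecord;
>> >> >>
>> >> >> public class PerPartitionLatency {
>> >> >>     // One histogram per partition, keyed by the partition id from the ack.
>> >> >>     private final ConcurrentMap<Integer, Histogram> byPartition =
>> >> >>             new ConcurrentHashMap<>();
>> >> >>
>> >> >>     public void timedSend(Producer<byte[], byte[]> producer,
>> >> >>                           String topic, byte[] payload) {
>> >> >>         final long start = System.nanoTime();
>> >> >>         producer.send(new ProducerRecord<>(topic, payload), (metadata, exception) -> {
>> >> >>             if (exception == null) {
>> >> >>                 long micros = (System.nanoTime() - start) / 1000;
>> >> >>                 byPartition.computeIfAbsent(metadata.partition(),
>> >> >>                         p -> new Histogram(3_600_000_000L, 3)).recordValue(micros);
>> >> >>             }
>> >> >>         });
>> >> >>     }
>> >> >>
>> >> >>     public ConcurrentMap<Integer, Histogram> histograms() {
>> >> >>         return byPartition;
>> >> >>     }
>> >> >> }
>> >> >>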
>> >> >> To verify that this hypothesis holds, you just have to see that
>>there
>> >> >>is a
>> >> >> significant difference between different partitions on a SINGLE
>> >> >>producing
>> >> >> client. So, pick one producing client at random and use the data
>>from
>> >> >> that. The easy way to do that is just plot all the partition
>>latency
>> >> >> histograms on top of each other in the same plot; that way you
>>have a
>> >> >> pretty plot to show people.  If you don’t want to set up plotting, you can
>> >> >> just compare the medians (50th percentile) of the partitions’ histograms.
>> >> >>  If there is a lot of variance, your latency anomaly is explained
>>by
>> >> >> brokers 4-7 being slower than nodes 0-3!  If there isn’t a lot of
>> >> >>variance
>> >> >> at 50%, look at higher percentiles.  And if higher percentiles for
>> >>all
>> >> >>the
>> >> >> partitions look the same, this hypothesis is disproved.
>> >> >>
>> >> >> If you want to make a general statement about latency of writing
>>to
>> >> >>kafka,
>> >> >> you can merge all the histograms into a single histogram and plot
>> >>that.
>> >> >>
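>> >> >> Continuing that sketch, comparing the partitions’ percentiles and merging
>> >> >> them into one overall histogram could look like this (again just
>> >> >> illustrative, not tested):
>> >> >>
>> >> >> import java.util.Map;
>> >> >>
>> >> >> import org.HdrHistogram.Histogram;
>> >> >>
>> >> >> public class PartitionLatencyReport {
>> >> >>     public static void report(Map<Integer, Histogram> byPartition) {
>> >> >>         Histogram merged = new Histogram(3_600_000_000L, 3);
>> >> >>         for (Map.Entry<Integer, Histogram> e : byPartition.entrySet()) {
>> >> >>             Histogram h = e.getValue();
>> >> >>             System.out.printf("partition %d: 50%%=%dus 99%%=%dus 99.9%%=%dus%n",
>> >> >>                     e.getKey(), h.getValueAtPercentile(50.0),
>> >> >>                     h.getValueAtPercentile(99.0), h.getValueAtPercentile(99.9));
>> >> >>             merged.add(h);  // one overall "writing to kafka" histogram
>> >> >>         }
>> >> >>         System.out.printf("all partitions: 50%%=%dus 99.9%%=%dus%n",
>> >> >>                 merged.getValueAtPercentile(50.0),
>> >> >>                 merged.getValueAtPercentile(99.9));
>> >> >>     }
>> >> >> }
>> >> >>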
>> >> >> To Yuheng’s credit, more brokers always result in more throughput.  But
>> >> >> throughput and latency are two different creatures.  It’s worth
>>noting
>> >> >>that
>> >> >> kafka is designed to be high throughput first and low latency
>>second.
>> >> >>And
>> >> >> it does a really good job at both.
>> >> >>
>> >> >> Disclaimer: I might not like linear algebra, but I do like
>> >>statistics.
>> >> >> Let me know if there are topics that need more explanation above that
>> >> >> aren’t covered by Gil’s lecture.
>> >> >> -Erik
>> >> >>
>> >> >> On 9/4/15, 9:03 AM, "Yuheng Du" <yuheng.du.h...@gmail.com> wrote:
>> >> >>
>> >> >> >When I use 32 partitions, the 4-broker latency becomes larger than the
>> >> >> >8-broker latency.
>> >> >> >
>> >> >> >So is it always true that using more brokers gives lower latency when the
>> >> >> >number of partitions is at least the number of brokers?
>> >> >> >
>> >> >> >Thanks.
>> >> >> >
>> >> >> >On Thu, Sep 3, 2015 at 10:45 PM, Yuheng Du
>> >><yuheng.du.h...@gmail.com>
>> >> >> >wrote:
>> >> >> >
>> >> >> >> I am running a producer latency test. When using 92 producers on 92
>> >> >> >> physical nodes publishing to 4 brokers, the latency is slightly lower
>> >> >> >> than when using 8 brokers. I am using 8 partitions for the topic.
>> >> >> >>
>> >> >> >> I have rerun the test and it gives me the same result: the 4-broker
>> >> >> >> scenario still has lower latency than the 8-broker scenario.
>> >> >> >>
>> >> >> >> It is weird because I tested 1 broker, 2 brokers, 4 brokers, 8 brokers,
>> >> >> >> 16 brokers and 32 brokers. For all the other cases the latency decreases
>> >> >> >> as the number of brokers increases.
>> >> >> >>
>> >> >> >> 4 brokers/8 brokers is the only pair that doesn't satisfy this
>> >>rule.
>> >> >> >>What
>> >> >> >> could be the cause?
>> >> >> >>
>> >> >> >> I am using 200-byte messages, and the test has each producer publish
>> >> >> >> 500k messages to a given topic. For every test run where I change the
>> >> >> >> number of brokers, I use a new topic.
>> >> >> >>
>> >> >> >> Thanks for any advice.
>> >> >> >>
>> >> >>
>> >> >>
>> >>
>> >>
>>
>>
