That is an excellent question! There are a number of ways to monitor jitter and see when it happens. Here are a few:

- You could slice the histogram every few seconds, save each slice with a timestamp, and then compare the slices. That is mostly manual, or you can chart the percentiles over time in Excel as a line chart, with each percentile as its own series. If you are using HdrHistogram, look at how to use the Recorder class for this, coupled with a ScheduledExecutorService.
- You can save just the starting timestamp and the latency of each event. If you put them into a CSV, you can load it into Excel and graph it as an XY chart. That way you can see every point over the run of your program and spot trends. Be careful with this one, especially about writing to a file in the callback that Kafka provides: the callback generally runs on the producer's I/O thread, so slow work there can distort the latencies you are measuring.

Also, I have noticed that most of the very slow observations occur at startup. But don’t trust me; trust the data, and share your findings.

Also, the 99.9th percentile is a pretty good standard for typical poor-case performance. The average is borderline useless. The 50th percentile is a better measure of the typical case, because it is the number that says “half of all events will be this slow or faster”. Likewise, a high percentile such as the 99.9th says “0.1% of all events will be slower than this”.
-Erik

On 9/4/15, 12:05 PM, "Yuheng Du" <yuheng.du.h...@gmail.com> wrote:

>Thank you Erik! That is helpful!
>
>But also I see jitters of the maximum latencies when running the
>experiment.
>
>The average end to acknowledgement latency from producer to broker is
>around 5ms when using 92 producers and 4 brokers, and the 99.9 percentile
>latency is 58ms, but the maximum latency goes up to 1359 ms. How to locate
>the source of this jitter?
>
>Thanks.
>
>On Fri, Sep 4, 2015 at 10:54 AM, Helleren, Erik
><erik.helle...@cmegroup.com>
>wrote:
>
>> Well... not to be contrarian, but latency depends much more on the latency
>> between the producer and the broker that is the leader for the partition
>> you are publishing to.
>> At least when your brokers are not saturated with
>> messages, and acks are set to 1. If acks are set to ALL, latency on a
>> non-saturated Kafka cluster will be: Round Trip Latency from producer to
>> leader for partition + Max(Round Trip Latency to each replica of
>> that partition). If a cluster is saturated with messages, we have to
>> assume that all partitions receive an equal distribution of messages to
>> avoid linear algebra and queueing theory models. I don’t like linear
>> algebra :P
>>
>> Since you are probably putting all your latencies into a single histogram
>> per producer, or worse, just an average, this pattern would have been
>> obscured. Obligatory lecture about measuring latency by Gil Tene
>> (https://www.youtube.com/watch?v=9MKY4KypBzg). To verify this hypothesis,
>> you should re-write the benchmark to plot the latencies for each write to
>> a partition for each producer into a histogram. (HdrHistogram is pretty
>> good for that). This would give you producers*partitions histograms,
>> which might be unwieldy for that many producers. But wait, there is hope!
>>
>> To verify that this hypothesis holds, you just have to see that there is a
>> significant difference between different partitions on a SINGLE producing
>> client. So, pick one producing client at random and use the data from
>> that. The easy way to do that is just plot all the partition latency
>> histograms on top of each other in the same plot, that way you have a
>> pretty plot to show people. If you don’t want to set up plotting, you can
>> just compare the medians (50th percentile) of the partitions’ histograms.
>> If there is a lot of variance, your latency anomaly is explained by
>> brokers 4-7 being slower than nodes 0-3! If there isn’t a lot of variance
>> at 50%, look at higher percentiles. And if higher percentiles for all the
>> partitions look the same, this hypothesis is disproved.
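
A minimal sketch of the per-partition median comparison described in the quoted paragraph above, using only the Java standard library. The class and method names (`PartitionLatencies`, `record`, `percentilePerPartition`) are hypothetical and not part of Kafka or HdrHistogram, and the nearest-rank percentile over raw samples is a stand-in for what a real histogram library would do for you. It assumes you already have the partition number and measured latency for each acknowledged write on one producing client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: groups latency samples by partition on ONE producer
// and reports a chosen percentile per partition so they can be compared.
class PartitionLatencies {
    // partition id -> latencies (microseconds) observed for writes to it
    private final Map<Integer, List<Long>> byPartition = new TreeMap<>();

    void record(int partition, long latencyMicros) {
        byPartition.computeIfAbsent(partition, p -> new ArrayList<>()).add(latencyMicros);
    }

    // Nearest-rank percentile: the smallest sample such that at least
    // pct percent of all samples are less than or equal to it.
    static long percentile(List<Long> samples, double pct) {
        if (samples.isEmpty()) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.stream().mapToLong(Long::longValue).sorted().toArray();
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    // One value per partition, ordered by partition id.
    Map<Integer, Long> percentilePerPartition(double pct) {
        Map<Integer, Long> out = new TreeMap<>();
        byPartition.forEach((p, samples) -> out.put(p, percentile(samples, pct)));
        return out;
    }
}
```

Calling `percentilePerPartition(50.0)` gives the medians to compare side by side. If they look alike, repeat with a higher percentile before ruling the hypothesis out.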
>>
>> If you want to make a general statement about latency of writing to Kafka,
>> you can merge all the histograms into a single histogram and plot that.
>>
>> To Yuheng’s credit, more brokers always results in more throughput. But
>> throughput and latency are two different creatures. It’s worth noting that
>> Kafka is designed to be high throughput first and low latency second. And
>> it does a really good job at both.
>>
>> Disclaimer: I might not like linear algebra, but I do like statistics.
>> Let me know if there are topics that need more explanation above that
>> aren’t covered by Gil’s lecture.
>> -Erik
>>
>> On 9/4/15, 9:03 AM, "Yuheng Du" <yuheng.du.h...@gmail.com> wrote:
>>
>> >When I using 32 partitions, the 4 brokers latency becomes larger than the
>> >8 brokers latency.
>> >
>> >So is it always true that using more brokers can give less latency when
>> >the number of partitions is at least the size of the brokers?
>> >
>> >Thanks.
>> >
>> >On Thu, Sep 3, 2015 at 10:45 PM, Yuheng Du <yuheng.du.h...@gmail.com>
>> >wrote:
>> >
>> >> I am running a producer latency test. When using 92 producers in 92
>> >> physical node publishing to 4 brokers, the latency is slightly lower than
>> >> using 8 brokers, I am using 8 partitions for the topic.
>> >>
>> >> I have rerun the test and it gives me the same result, the 4 brokers
>> >> scenario still has lower latency than the 8 brokers scenarios.
>> >>
>> >> It is weird because I tested 1 broker, 2 brokers, 4 brokers, 8 brokers, 16
>> >> brokers and 32 brokers. For the rest of the case the latency decreases as
>> >> the number of brokers increase.
>> >>
>> >> 4 brokers/8 brokers is the only pair that doesn't satisfy this rule. What
>> >> could be the cause?
>> >>
>> >> I am using a 200 bytes message, the test let each producer publishes 500k
>> >> messages to a given topic. Every test run when I change the number of
>> >> brokers, I use a new topic.
>> >>
>> >> Thanks for any advices.
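
For the suggestion at the top of this thread about saving each event's start timestamp and latency, here is a minimal sketch of one way to keep file I/O out of the send callback: buffer the pairs in memory and render the CSV only after the run. `LatencyLog`, `record`, `toCsv`, and the column names are hypothetical and not part of Kafka. The sketch assumes the whole run fits in memory, which may not hold for very long tests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects (start timestamp, latency) pairs in memory
// so that the producer callback never touches the file system.
class LatencyLog {
    private final List<long[]> rows = new ArrayList<>();

    // Intended to be called from the send callback. It only appends to an
    // in-memory list; synchronized because callbacks and the final dump
    // may run on different threads.
    synchronized void record(long startMillis, long latencyMicros) {
        rows.add(new long[] {startMillis, latencyMicros});
    }

    // Call once after the run is finished and write the result to a file.
    synchronized String toCsv() {
        StringBuilder sb = new StringBuilder("start_ms,latency_us\n");
        for (long[] row : rows) {
            sb.append(row[0]).append(',').append(row[1]).append('\n');
        }
        return sb.toString();
    }
}
```

After the last send has been acknowledged, the string from `toCsv()` can be written out in one go and loaded into Excel as an XY chart, with `start_ms` on the X axis and `latency_us` on the Y axis.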