Hi,

Thank you all for the feedback!

After some debugging it turned out to be a bug in wrk - most of the
requests' latencies were reported as 0 in the raw reports.

I've looked for a better-maintained HTTP load testing tool and I liked
https://github.com/tsenart/vegeta. It produces statistics that look
correct, can measure latencies while firing at a constant rate, and,
last but not least, can produce plot charts!
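For example, a minimal run could look something like this (the URL,
rate, and duration are placeholders - adjust them for your setup):

    # fire GET requests at a constant 1000 req/s for 30 seconds
    echo "GET http://localhost:8080/" | \
        vegeta attack -rate=1000 -duration=30s > results.bin

    # summary statistics: latency percentiles, status codes, error rates
    vegeta report results.bin

    # self-contained HTML chart of latency over time
    vegeta plot results.bin > plot.html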
I will update my article and let you know once I'm done!

Regards,
Martin

On Fri, Jul 31, 2020 at 4:43 PM Pål Hermunn Johansen
<herm...@varnish-software.com> wrote:
> I am sorry for being so late to the game, but here it goes:
>
> On Wed, Jul 29, 2020 at 14:12, Poul-Henning Kamp <p...@phk.freebsd.dk> wrote:
> > Your measurement says that there is a 2/3 chance that the latency
> > is between
> >
> >     655.40µs - 798.70µs = -143.30µs
> >
> > and
> >
> >     655.40µs + 798.70µs = 1454.10µs
>
> No, it does not. There is no claim anywhere that the numbers follow a
> normal distribution or an approximation of it. Of course, the
> calculations you do demonstrate that the data is far from normally
> distributed (as expected).
>
> > You cannot conclude _anything_ from those numbers.
>
> There are two numbers, the average and the standard deviation, and
> they are calculated from the data, but the truth is hidden deeper in
> the data. Looking at these particular numbers, I agree completely
> that it is wrong to conclude that one is better than the other. I am
> not saying that the statements in the article are false, just that
> you do not have the data to draw the conclusions.
>
> Furthermore, I have to say that Geoff got things right (see below).
> As a mathematician, I have to say that statistics is hard, and
> trusting the output of wrk to draw conclusions is outright the wrong
> thing to do.
>
> In this case we have a luxury which you typically do not have: data
> is essentially free. You can run many tests, and you can run short or
> long tests with different parameters. A 30 second test is simply not
> enough for anything.
>
> As Geoff indicated, for each transaction you can extract many
> relevant values from varnishlog, with the status, hit/miss, time to
> first byte and time to last byte being the most obvious ones. They
> can be extracted and saved to a CSV file by using varnishncsa with a
> custom format string, and you can use R (I used it as a tool in my
> previous job - not a fan) to do statistical analysis on the data. The
> Student's t-test suggestion from Geoff is a good idea, but just
> looking at one set of numbers without considering other factors is
> mathematically problematic.
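> For illustration, something along these lines should do it (the
> format string is only a sketch - check the specifiers and the
> Timestamp field index against varnishncsa(1) for your Varnish
> version):
>
>     # status, hit/miss, time to first byte, time to last byte, as CSV
>     varnishncsa -d -F \
>         '%s,%{Varnish:hitmiss}x,%{Varnish:time_firstbyte}x,%{VSL:Timestamp:Resp[2]}x' \
>         > timings.csv
>
> Each line of timings.csv is then one transaction, ready for analysis.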
> Anyway, some obvious questions then arise. For example:
>
> - How do the numbers from wrk and varnishlog/varnishncsa compare?
>   Did wrk report a different total number of transactions than
>   Varnish? If there is a discrepancy, then the errors might be due to
>   some resource constraint (number of sockets, or dropped SYN
>   packets?).
> - How do the average and maximum compare between Varnish and wrk?
> - What is the CPU usage of the kernel, the benchmarking tool and the
>   Varnish processes during the tests?
> - What is the difference between the time to first byte and the time
>   to last byte in Varnish for different object sizes?
>
> When Varnish writes to a socket, it hands bytes over to the kernel,
> and when the write call returns, we do not know how far the bytes
> have come or how long it will take before they reach their final
> destination. The bytes may be in a kernel buffer, they might be on
> the network card, they might already have been received by the
> client's kernel, and they might have made it all the way into wrk
> (which may or may not have timestamped the response yet). Typically,
> depending on many things, Varnish will report faster times than wrk
> does, but since returning from the write call means that the calling
> thread must be rescheduled, it is even possible that wrk will see
> some requests as faster than what Varnish reports. Running wrk2 at
> different speeds in a series of tests seems natural to me, so that
> you can observe when (and how) the system starts running into
> bottlenecks. Note that the bottleneck can just as well be in wrk2
> itself, or in the combined CPU usage of kernel + Varnish + wrk2.
>
> To complicate things even further: in your ARM vs. x64 tests, my
> guess is that both the kernel parameters and the network parameters
> are different, and the distributions probably have good reasons to
> choose different values. It is very likely that these differences
> affect the performance of the systems in many ways, and that
> different tests will have different "optimal" tunings of kernel and
> network parameters.
>
> Sorry for rambling, but getting the statistics wrong is so easy. The
> question is very interesting, but if you want to draw conclusions,
> you should do the analysis, and (ideally) give access to the raw data
> in case anyone wants to have a look.
>
> Best,
> Pål
>
> On Fri, Jul 31, 2020 at 08:45, Geoff Simmons <ge...@uplex.de> wrote:
> >
> > On 7/28/20 13:52, Martin Grigorov wrote:
> > >
> > > I've just posted an article [1] about comparing the performance
> > > of Varnish Cache on two similar machines - the main difference is
> > > the CPU architecture - x86_64 vs aarch64. It uses a specific use
> > > case - the backend service just returns static content. The idea
> > > is to compare Varnish on the different architectures, but also to
> > > compare Varnish against the backend HTTP server. What is
> > > interesting is that Varnish gives the same throughput as the
> > > backend server on x86_64, but on aarch64 it is around 30% slower
> > > than the backend.
> >
> > Does your test have an account of whether there were any errors in
> > backend fetches? I don't know if that explains anything, but with a
> > connect timeout of 10s and a first byte timeout of 5m, any error
> > would have a considerable effect on the results of a 30 second
> > test.
> >
> > The test tool output doesn't say anything I can see about error
> > rates - whether all responses had status 200, and if not, how many
> > had which other status. Ideally it should be all 200, otherwise the
> > results may not be valid.
> >
> > I agree with phk that a statistical analysis is needed for a robust
> > statement about differences between the two platforms. For that,
> > you'd need more than the summary stats shown in your blog post -
> > you need to collect all of the response times. What I usually do is
> > query the Varnish client request logs for Timestamp:Resp and save
> > the number in the last column.
> >
> > t.test() in R runs Student's t-test (me R fanboi).
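> > A sketch of that last step (the file names are hypothetical, and
> > the cut column assumes a CSV with the response time in the fourth
> > field):
> >
> >     # pull the response-time column out of each platform's CSV
> >     cut -d, -f4 timings-x86.csv > x86.dat
> >     cut -d, -f4 timings-arm.csv > arm.dat
> >
> >     # Welch two-sample t-test on the two sets of response times
> >     Rscript -e 'x86 <- scan("x86.dat"); arm <- scan("arm.dat"); print(t.test(x86, arm))'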