One other thing I can say that may be helpful is that when you take the "max
time of N samples" you raise the cumulative distribution (not the density) to
the _power_ N. See
[https://en.wikipedia.org/wiki/Extreme_value_theory](https://en.wikipedia.org/wiki/Extreme_value_theory)
for details. Basically, if the latency has distribution F(t), the worst of 5
has distribution F(t)**5. The median is where the distribution equals 0.5, so
the median of the worst of 5 sits where F(t)**5 = 0.5, i.e. at F(t) =
0.5**(1/5), roughly the 87th percentile of the base time distribution.
Similarly, the median of the min time of 5 runs would be roughly the 13th
percentile. So with best/worst of 5 you expect (in a median sense) to sample
the lower/upper tail of the base distribution. A lot of the time in
benchmarking one takes the min of N to filter out the noise from other things
going on in the system. Unlike many things in statistics, these results do not
rely on anything about the shape of F(t). It's just basic probability theory.
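
To make that concrete, here is a minimal sketch (mine, nothing to do with the
benchmark under discussion) that checks the claim numerically using only the
Python standard library: whatever shape the base distribution has, the median
of the worst of 5 lands near its 0.5**(1/5) ~ 87th percentile and the best of
5 near its ~13th percentile.

```python
import random
import statistics

def quantile_of_median_extreme(sample, n=5, trials=50_000, worst=True):
    """Which base-distribution quantile does the median of the worst
    (or best) of n draws land on?"""
    pick = max if worst else min
    extremes = [pick(random.choices(sample, k=n)) for _ in range(trials)]
    med = statistics.median(extremes)
    # Fraction of the base sample at or below that median = its quantile.
    return sum(x <= med for x in sample) / len(sample)

random.seed(1)
# Two very differently shaped "latency" distributions.
lognormal = [random.lognormvariate(0, 1) for _ in range(100_000)]
uniform = [random.uniform(0, 1) for _ in range(100_000)]

for name, base in (("lognormal", lognormal), ("uniform", uniform)):
    hi = quantile_of_median_extreme(base, worst=True)
    lo = quantile_of_median_extreme(base, worst=False)
    print(f"{name:9s}  worst-of-5 median at ~{hi:.3f}  best-of-5 median at ~{lo:.3f}")
# Both land near 0.871 and 0.129, since 0.5**(1/5) ~ 0.8706 regardless of shape.
```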

What this means in this benchmark example is that it is going to be very
sensitive (by construction if not intent) to **a lot** of competing activity on
the system: network interrupts, competing processes, OS kernel goings-on,
L2/L3 cache evictions, etc. So, it is "noisy by design". The bigger N is, the
noisier the max will be. For example, if you have a Winchester hard drive on
the system and you make N big enough, you will eventually sample a suspension
of the process, a CPU migration, a total re-warm-up of the L2 cache, and
however long the disk service takes. None of that relates _directly_ to
"memory manager performance". It is more a measurement of how loaded your
system is.

Anyway, to me this explains the high run-to-run variability, the difficulty of
interpretation, and the difficulty of reproduction here. The ~87th percentile
is pretty high, and that's just the median; the tail of F(t)**5 is even worse.
And those tail-of-the-tail events are even less relevant to comparing memory
managers. To the extent they are of interest at all, they are so very noisy
that assessing them is going to require a lot of sampling.

The truly worst-worst-worst cases, where everything is cold cache, are
actually pretty hard to measure for a variety of reasons. A lot of the time,
"hot loop benchmarks" can be **_very_** misleading about worst-case latency,
and even pros mess this up _all the time_. I could provide a couple of fun
examples sometime.

Memory is mostly about DIMM/L3/L2/L1 latencies and bandwidths, though. You can
measure those and then reason more by looking at code and "calculating" from
those measured constants than by timing things. Often cold-cache DIMM hits are
the most important factor if you aren't moving a lot of memory around; the
number of copies matters most if you are. { This is why I like linear-probing
hash tables (with or without local-only Robin Hood reorganization, which can
help insert-heavy workloads). For really big tables, the most painful case is
basically just one 64-128 byte cache-line fill, and you can't do better than
1. }
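
As a sketch of what I mean by "calculating" from measured constants rather
than timing: plug in memory-hierarchy latencies measured on your own box (the
numbers below are placeholders I invented, not measurements) and compare a
cold lookup in a huge linear-probing table, which is about one cache-line
fill, against a node-based/chained table, which pays roughly one dependent
miss per pointer hop.

```python
# Hypothetical "measured constants" in nanoseconds -- substitute values you
# actually measured on your own machine; these are illustrative placeholders.
DIMM_NS = 90.0   # cold DRAM (cache-miss) access
L2_NS = 4.0      # touching data in an already-loaded cache line

def linear_probe_cold_lookup_ns(avg_probes=1.2):
    """Huge, cold linear-probing table: the probe sequence is contiguous, so
    one 64-128 byte cache-line fill dominates; occasional extra probes mostly
    land in the line that fill already brought in."""
    return DIMM_NS + (avg_probes - 1.0) * L2_NS

def chained_cold_lookup_ns(avg_hops=1.5):
    """Node-based / separately chained table: bucket-array access plus one
    dependent cache miss per pointer hop."""
    return DIMM_NS + avg_hops * DIMM_NS

print(f"linear probing, cold lookup: ~{linear_probe_cold_lookup_ns():.0f} ns")
print(f"chained,        cold lookup: ~{chained_cold_lookup_ns():.0f} ns")
```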
