One other thing I can say that may be helpful is that when you do the "max time of N samples" you raise the cumulative probability distribution (not density) to the _power_ N. See [https://en.wikipedia.org/wiki/Extreme_value_theory](https://en.wikipedia.org/wiki/Extreme_value_theory) for details. Basically, if the latency has distribution F(t), the worst of 5 has distribution F(t)**5. The median is where the distribution hits 0.5. So, the median of the worst of 5 corresponds to F(t)**5 = 0.5, i.e. F(t) = 0.5**(1/5) ≈ 0.871, roughly the 87th percentile of the base time distribution. Similarly, the median of the min time of 5 runs would be about the 13th percentile (1 - 0.5**(1/5)) of the base time distribution. So with the best/worst of 5 you expect (in a median sense) to sample the lower/upper tail of the base distribution. A lot of the time in benchmarking one takes the min of N to filter out all the noise from other things going on in the system. Unlike many things in statistics, these results do not rely on anything about the shape of F(t). It's just basic probability theory.
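If it helps to see that concretely, here is a throwaway simulation sketch (exponential base distribution chosen arbitrarily, since the claim is shape-independent) showing where the median of the worst/best of 5 lands on the base distribution:

```python
import bisect
import random
import statistics

N = 5             # samples per "worst of N" / "best of N"
TRIALS = 200_000  # simulated repetitions

# Base latency distribution F(t). The shape does not matter for the claim,
# so an exponential just stands in for "some latency distribution".
base = sorted(random.expovariate(1.0) for _ in range(TRIALS))

def base_percentile(x: float) -> float:
    """Empirical F(x): fraction of the base sample below x."""
    return bisect.bisect_left(base, x) / len(base)

maxes = [max(random.expovariate(1.0) for _ in range(N)) for _ in range(TRIALS)]
mins = [min(random.expovariate(1.0) for _ in range(N)) for _ in range(TRIALS)]

# Expect ~87.1 and ~12.9, i.e. 100*0.5**(1/5) and 100*(1 - 0.5**(1/5)).
print("median of worst-of-5 lands at base percentile",
      round(100 * base_percentile(statistics.median(maxes)), 1))
print("median of best-of-5 lands at base percentile",
      round(100 * base_percentile(statistics.median(mins)), 1))
```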
What this means in this benchmark example is that it is going to be very sensitive (by construction if not intent) to **a lot** of competing activity on the system - network interrupts, competing processes, OS kernel goings-on, L2/L3 cache evictions, etc. So, it is "noisy by design". The bigger N is, the noisier the max will be. For example, if you have a Winchester hard drive on the system and you make N big enough, you will sample some suspension of the process, CPU migration, a total re-warm-up of the L2 cache, and however long the disk service takes. None of that relates _directly_ to "memory manager performance". It's more a measure of how loaded your system is.

Anyway, to me this explains the high run-to-run variability, the difficulty of interpretation, and the difficulty of reproduction here. The 87th percentile is pretty high, and that's just the median. The tail of F(t)**5 is even worse than the median. And those tail-tail events are ever less relevant to comparing memory managers. To the extent they are of interest, because they are so very noisy, it's going to require a lot of sampling to assess them. The truly worst-worst-worst cases, where everything is cold cache, are actually pretty hard to measure for a variety of reasons. A lot of the time "hot loop benchmarks" can be **_very_** misleading about worst-case latency, and even pros mess this up _all the time_. I could provide a couple of fun examples sometime.

Memory is mostly about DIMM/L3/L2/L1 latencies and bandwidths, though. You can measure those and reason from there, more by looking at the code and "calculating" from those measured constants than by timing it. Often cold-cache DIMM hits are the most important thing if you aren't moving a lot of memory around; the number of copies matters most if you are. { This is why I like linear probing hash tables (with or without local-only Robin Hood reorganization, which can help insert-heavy workloads). You just know a lookup is basically one 64-128 byte cache-line fill for really big tables in the most painful case, and you can't do better than 1. }
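As a toy example of what I mean by "calculating" from measured constants rather than timing (the latency constant below is a hypothetical placeholder; measure your own machine, e.g. with a dependent pointer-chasing loop, and substitute):

```python
# Back-of-the-envelope estimate from a measured constant, not a benchmark.
DRAM_FILL_NS = 90.0  # hypothetical cold-cache 64-byte line fill from a DIMM

def est_lookup_ns(dependent_line_fills: float) -> float:
    """Cold, very large table: cost ~= number of dependent cache-line fills."""
    return dependent_line_fills * DRAM_FILL_NS

# Linear probing with short probe runs: ~1 line fill per lookup (the floor).
print("linear probing ~", est_lookup_ns(1), "ns")
# A lookup needing 4 dependent hops (e.g. chaining/tree): ~4 fills.
print("4 dependent hops ~", est_lookup_ns(4), "ns")
```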