Also, I just noticed that the `worst` in Araq's benchmark is effectively the max
over `msgCount` == 1 million trials. @miran did not mention whether he was
careful to re-initialize `worst` to zero in his "slightly modified" benchmark
either, so it might be more like 5 million. As should be clear from the above,
the max of millions of samples is just begging to capture system noise rather
than the effect in question, and I would not expect very consistent results
across runs, across systems, or even over time on the same system.
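To make that concrete, here is a toy simulation (mine, not Araq's benchmark; the spike rate and magnitudes are invented) where each "timing" is a constant 1.0 plus occasional noise spikes. The min of a million samples recovers the constant; the max only reports the biggest spike the noise happened to produce, which varies run to run:

```nim
import random

randomize(42)                       # fixed seed just for a repeatable demo
var worst, best = 1.0
for i in 1 .. 1_000_000:
  var t = 1.0                       # the "true" per-operation time
  if rand(1.0) < 1e-4:              # ~100 interrupts/page faults per million
    t += rand(50.0)                 # spike of random size
  worst = max(worst, t)
  best  = min(best,  t)
echo "min: ", best, "  max: ", worst  # min is stable; max tracks the noise tail
```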
Here is a perhaps interesting example of latency measurement. The basic task is
measuring memory latency. One might **_think_** that just accessing a random
element of a large array that fits in available RAM is a way to do this, but one
would be **_wrong_**, at least in a "hot loop" setting. Anyway, not that it's
rocket science, but the program below also exhibits a way to measure a low
quantile like the min of N (itself a statistical estimate, and so in this case
averaged and standard-deviation'd):
```nim
import random, times, strutils, stats, cligen

proc prepRanElt(x: var seq[int], n: int) =
  for i in 0 ..< n: x[i] = rand(9)      # 9 keeps sums short

# Optimized for speed not randomness to maximize delta
var r = initRand(123)

proc runRanElt(x: seq[int], nAcc: int): int =
  let mask = x.len - 1                  #XXX (1 shl floor(lg(x.len))) - 1
  for i in 1..nAcc: result += x[r.next.int and mask]

proc prepShuffle(x: var seq[int], n: int) =
  for i in 0 ..< n: x[i] = i            # Pop seq with identity
  x.shuffle                             # ..perm & then shuffle

proc runShuffle(x: seq[int], nAcc: int): int =
  for i in 1..nAcc: result = x[result]

proc fmt(x = 0.0, n = 3): auto = formatFloat(x, ffDecimal, n)

proc time(prep, run: auto; n, nAcc, avgN, minN: int) =
  var dtMins: RunningStat
  var s = 0                             # Block optimizing away all work by use
  var x = newSeq[int](n)
  x.prep n
  for avgIt in 1..avgN:
    var dtMin = float.high
    for minIt in 1..minN:
      let t0 = epochTime()
      s += x.run(nAcc)
      let dt = (epochTime() - t0) * 1e9 / nAcc.float
      dtMin = min(dtMin, dt)
    dtMins.push dtMin
  echo "KiB: ", n shr 7, " ns/Access: ", fmt(dtMins.mean),
       " +- ", fmt(dtMins.standardDeviationS), " s:", s

type Algo = enum ranElt, shuff

proc lat*(kind=shuff, sizeKiB=1048576, nAcc=1_000_000,
          avgN=4, minN=4, seed=0) =
  ## Time latency two ways.  One does NOT measure it!
  if seed > 0: r = initRand(seed)
  else: randomize(); r = initRand(rand(100000))
  let n = (sizeKiB shl 10) div int.sizeof
  if kind == shuff:
    time(prepShuffle, runShuffle, n, nAcc, avgN, minN)
  else:
    time(prepRanElt, runRanElt, n, nAcc, avgN, minN)

dispatch(lat, help={"kind": "shuff: chase ran perm\n" &
                            "ranElt: access ran elt",
                    "seed": "0=>random, else set"})
```
I get output like this for (1 shl 20) KiB == 1 GiB working set:

```
shell$ memlat -kr; memlat -ks
KiB: 1048576 ns/Access: 8.197 +- 0.002 s:71968628
KiB: 1048576 ns/Access: 67.330 +- 0.078 s:1295362736
```
I am pretty sure that what is going on is that the work to generate the next
pseudorandom number and do the subsequent cache line load is predictable. So
speculative execution lets the CPU "work ahead" of the memory system by many
loop iterations, and its coordinated prefetching then masks the memory system
latency quite well, yielding a kind of "amortized/fake latency". Meanwhile, the
CPU cannot predict the loads that bounce randomly around the array, because the
next hop isn't known until the last hop is done. So it actually has to wait on
the DIMMs, and you get to see the real round-trip latency.
This may all seem "off track/topic", but many benchmarks have the pattern of
the former rather than the latter, and then fail to represent real-world
performance, creating confusion. For example, hash table benchmarks often look
_just_ like the random element case above. As one example of pros getting this
wrong, Table 1 in Kraska 2017
([https://arxiv.org/abs/1712.01208](https://arxiv.org/abs/1712.01208)) has
crazy good numbers like 31 nanoseconds for hash lookup latency. I don't think
anyone makes DIMMs with anything **_near_** that kind of latency, and if such
parts were being used it would warrant special mention (not that the evaluation
sections of those authors ever stray too close to reproducibility or
analyzability...). Anyway, that 31 ns is almost surely amortized/fake latency.
I think even all the blogger follow-up to Kraska 2017 failed to observe this
particular issue. Hardware manufacturers worsen this whole situation by lying
about it, quoting amortized instead of true round-trip latency (which is what
"latency" has meant my whole life, anyway).
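For what it's worth, the "shuffle" trick above generalizes directly to hash tables: make each lookup's key depend on the previous lookup's result, so the CPU cannot overlap lookups. A hypothetical sketch (my own, not from any of the benchmarks above), using a table whose values form one big cycle over its keys:

```nim
import tables

# Build a table whose 1000 values form a single cycle over its 1000 keys, so
# that `result = t[result]` visits every entry, one dependent lookup at a time.
var t = initTable[int, int]()
for i in 0 ..< 1000: t[i] = (i + 1) mod 1000

proc chase(t: Table[int, int], nAcc: int): int =
  for i in 1 .. nAcc: result = t[result]  # next key unknown until this lookup ends
```

Timing `chase` instead of a sum of independent lookups would report something much closer to true, un-overlapped lookup latency.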
Another comment about that benchmark code: even putting `rdtsc`/`epochTime`/etc.
queries around every snippet does not really block all the
out-of-order/speculative execution machinery. You need some serializing
instruction like `cpuid` in there. That is more of a side comment, though; the
main problem is the max-of-many structure.
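A minimal sketch of what I mean, assuming x86-64 and a gcc/clang backend (the `__rdtscp` and `_mm_lfence` names come from `<x86intrin.h>`, not from Nim's stdlib):

```nim
# __rdtscp waits until all prior instructions have retired before reading the
# time-stamp counter; the trailing lfence keeps later instructions from being
# hoisted above the read.  x86-64 only.
proc rdtscp(aux: ptr cuint): culonglong
  {.importc: "__rdtscp", header: "<x86intrin.h>".}
proc lfence() {.importc: "_mm_lfence", header: "<x86intrin.h>".}

proc serializedTicks(): uint64 =
  var aux: cuint                # receives IA32_TSC_AUX (e.g. core id); unused here
  result = rdtscp(addr aux).uint64
  lfence()
```

Bracketing a snippet with `serializedTicks()` calls at least stops the measured work from bleeding across the timestamp reads, though it still says nothing about the max-of-many statistics problem.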
I would focus on measuring the average case with some real, non-micro-benchmark
program that spends 90% of its time doing GC, under the various competing GCs.
For worst-case times, if just pondering what the CPU must do is not enough (as
in the linear probing hash example), then you ideally want something like the
"shuffle" approach above, where the CPU _must_ finish the last worst/bad case
before moving on to the next one. I don't know enough about the impl to
construct such a thing. It might be tricky, require a special mode, or even be
impossible. (The "truly worst" case is probably swapping to a Winchester disk
however many times, possibly while competing with who knows what kind of
non-isolated processes... but I'm sure that is not the metric of interest. This
parenthetical is just to exhibit that background-activity assumptions should
probably be stated explicitly when discussing "worst cases".)