Also, I just noticed that the `worst` in Araq's benchmark is effectively the max
over `msgCount` == 1 million trials. @miran did not mention whether he was
careful to re-initialize `worst` to zero in his "slightly modified" benchmark
either, so it might be more like 5 million. As should be clear from the above,
the max of millions of samples is just begging to capture system noise rather
than the effect in question, and I would not expect very consistent results
across runs, across systems, or even over time on the same system.
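To make that concrete, here is a toy simulation (mine, not Araq's benchmark; the spike rate and magnitudes are invented) where each "timing" is a constant 1.0 plus occasional noise spikes. The min of a million samples recovers the constant; the max only reports the biggest spike the noise happened to produce, which varies run to run:

```nim
import random

randomize(42)                       # fixed seed just for a repeatable demo
var worst, best = 1.0
for i in 1 .. 1_000_000:
  var t = 1.0                       # the "true" per-operation time
  if rand(1.0) < 1e-4:              # ~100 interrupts/page faults per million
    t += rand(50.0)                 # spike of random size
  worst = max(worst, t)
  best  = min(best,  t)
echo "min: ", best, "  max: ", worst  # min is stable; max tracks the noise tail
```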
Here is a perhaps interesting example of latency measurement. The basic task is
measuring memory latency. One might **_think_** that just accessing a random
element of a large array that fits in available RAM is a way to do this, but one
would be **_wrong_**, at least in a "hot loop" setting. Anyway, not that it's
rocket science, but the program below also exhibits a way to measure a low
quantile like the min of N (itself a statistical estimate, and so in this case
averaged and standard-deviation'd):
```nim
import random, times, strutils, stats, cligen

proc prepRanElt(x: var seq[int], n: int) =
  for i in 0 ..< n: x[i] = rand(9)      # 9 keeps sums short

# Optimized for speed not randomness to maximize delta
var r = initRand(123)

proc runRanElt(x: seq[int], nAcc: int): int =
  let mask = x.len - 1                  #XXX (1 shl floor(lg(x.len))) - 1
  for i in 1..nAcc: result += x[r.next.int and mask]

proc prepShuffle(x: var seq[int], n: int) =
  for i in 0 ..< n: x[i] = i            # Pop seq with identity
  x.shuffle                             # ..perm & then shuffle

proc runShuffle(x: seq[int], nAcc: int): int =
  for i in 1..nAcc: result = x[result]

proc fmt(x = 0.0, n = 3): auto = formatFloat(x, ffDecimal, n)

proc time(prep, run: auto; n, nAcc, avgN, minN: int) =
  var dtMins: RunningStat
  var s = 0                             # Block optimizing away all work by use
  var x = newSeq[int](n)
  x.prep n
  for avgIt in 1..avgN:
    var dtMin = float.high
    for minIt in 1..minN:
      let t0 = epochTime()
      s += x.run(nAcc)
      let dt = (epochTime() - t0) * 1e9 / nAcc.float
      dtMin = min(dtMin, dt)
    dtMins.push dtMin
  echo "KiB: ", n shr 7, " ns/Access: ", fmt(dtMins.mean),
       " +- ", fmt(dtMins.standardDeviationS), " s:", s

type Algo = enum ranElt, shuff

proc lat*(kind=shuff, sizeKiB=1048576, nAcc=1_000_000,
          avgN=4, minN=4, seed=0) =
  ## Time latency two ways.  One does NOT measure it!
  if seed > 0: r = initRand(seed)
  else: randomize(); r = initRand(rand(100000))
  let n = (sizeKiB shl 10) div int.sizeof
  if kind == shuff:
    time(prepShuffle, runShuffle, n, nAcc, avgN, minN)
  else:
    time(prepRanElt, runRanElt, n, nAcc, avgN, minN)

dispatch(lat, help={"kind": "shuff: chase ran perm\n" &
                            "ranElt: access ran elt",
                    "seed": "0=>random, else set"})
```
I get output like this for (1 shl 20) KiB == 1 GiB working set:

```
shell$ memlat -kr; memlat -ks
KiB: 1048576 ns/Access: 8.197 +- 0.002 s:71968628
KiB: 1048576 ns/Access: 67.330 +- 0.078 s:1295362736
```
I am pretty sure that what is going on is that the work to generate the next
pseudorandom number and do the subsequent cache line load is predictable. So
speculative execution lets the CPU "work ahead" of the memory system by many
loop iterations, and its coordinated prefetching then masks the memory system
latency quite well, yielding a kind of "amortized/fake latency". Meanwhile, the
CPU cannot predict the loads that bounce randomly around the array, because the
next hop isn't known until the last hop is done. So it actually has to wait on
the DIMMs, and you get to see the real round-trip latency.
This may all seem "off track/topic", but many benchmarks have the pattern of
the former rather than the latter, and then fail to represent real-world
performance, creating confusion. For example, hash table benchmarks often look
_just_ like the random element case above. As one example of pros getting this
wrong, Table 1 in Kraska 2017
([https://arxiv.org/abs/1712.01208](https://arxiv.org/abs/1712.01208)) has
crazy good numbers like 31 nanoseconds for hash lookup latency. I don't think
anyone makes DIMMs with anything **_near_** that kind of latency, and if such
parts were being used it would warrant special mention (not that the evaluation
sections of those authors ever stray too close to reproducibility or
analyzability...). Anyway, that 31 ns is almost surely amortized/fake latency.
I think even all the blogger follow-up to Kraska 2017 failed to observe this
particular issue. Hardware manufacturers worsen this whole situation by lying
about it, quoting amortized instead of true round-trip latency (which is what
"latency" has meant my whole life, anyway).
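For what it's worth, the "shuffle" trick above generalizes directly to hash tables: make each lookup's key depend on the previous lookup's result, so the CPU cannot overlap lookups. A hypothetical sketch (my own, not from any of the benchmarks above), using a table whose values form one big cycle over its keys:

```nim
import tables

# Build a table whose 1000 values form a single cycle over its 1000 keys, so
# that `result = t[result]` visits every entry, one dependent lookup at a time.
var t = initTable[int, int]()
for i in 0 ..< 1000: t[i] = (i + 1) mod 1000

proc chase(t: Table[int, int], nAcc: int): int =
  for i in 1 .. nAcc: result = t[result]  # next key unknown until this lookup ends
```

Timing `chase` instead of a sum of independent lookups would report something much closer to true, un-overlapped lookup latency.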
Another comment about that benchmark code: even putting `rdtsc`/`epochTime`/etc.
queries around every snippet does not really block all the
out-of-order/speculative execution machinery. You need some serializing
instruction like `cpuid` in there. That is more of a side comment, though; the
main problem is the max-of-many structure.
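A minimal sketch of what I mean, assuming x86-64 and a gcc/clang backend (the `__rdtscp` and `_mm_lfence` names come from `<x86intrin.h>`, not from Nim's stdlib):

```nim
# __rdtscp waits until all prior instructions have retired before reading the
# time-stamp counter; the trailing lfence keeps later instructions from being
# hoisted above the read.  x86-64 only.
proc rdtscp(aux: ptr cuint): culonglong
  {.importc: "__rdtscp", header: "<x86intrin.h>".}
proc lfence() {.importc: "_mm_lfence", header: "<x86intrin.h>".}

proc serializedTicks(): uint64 =
  var aux: cuint                # receives IA32_TSC_AUX (e.g. core id); unused here
  result = rdtscp(addr aux).uint64
  lfence()
```

Bracketing a snippet with `serializedTicks()` calls at least stops the measured work from bleeding across the timestamp reads, though it still says nothing about the max-of-many statistics problem.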
I would focus on measuring the average case with some real, non-micro-benchmark
program that spends 90% of its time doing GC, under the various competing GCs.
For worst-case times, if just pondering what the CPU must do is not enough (as
in the linear probing hash example), then you ideally want something like the
"shuffle" approach above, where the CPU _must_ finish the last worst/bad case
before moving on to the next one. I don't know enough about the impl to
construct such a thing. It might be tricky, require a special mode, or even be
impossible. (The "truly worst" case is probably swapping to a Winchester disk
however many times, possibly while competing with who knows what kind of
non-isolated processes... but I'm sure that is not the metric of interest. This
parenthetical is just to exhibit that background-activity assumptions should
probably be stated explicitly when discussing "worst cases".)