Oh, and two more stats about my input data that matter for reasoning about my timings: 43 MB and 4.8e6 words total (coincidentally close to 1e-3 times my 4.8 GHz clock frequency).

So, the average string length is around 43e6/4.8e6 =~ 9 bytes (my input is also a web server log), and the whole run works out to about 150e-3 s / 4.8e6 words =~ 31 nsec =~ 150 CPU clock cycles per word.
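Spelled out as arithmetic (every number here is just my measured file size, word count, wall time, and clock rate, so swap in your own):

    # Back-of-the-envelope only; these constants are my measurements, not general.
    let
      bytesIn = 43e6      # input size in bytes
      words   = 4.8e6     # total words in the input
      wallSec = 150e-3    # measured total run time, seconds
      ghz     = 4.8       # my CPU clock, GHz
    echo "avg word length (B): ", bytesIn / words               # ~9
    echo "ns per word:         ", wallSec / words * 1e9         # ~31
    echo "cycles per word:     ", wallSec / words * ghz * 1e9   # ~150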

A little more fiddling (commenting out various sections) shows that the 150 
cycles per word breaks down to:
    
    
      22% parsing (no hash table or output, just the MemSlice stuff)
      55% hash table counting
      23% output (sorting + binary->ascii on the numbers).
    
    
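If you would rather not literally comment sections out, the same breakdown can be had from a crude per-phase timer. A minimal sketch (the three phase procs are hypothetical stand-ins for the real parse/count/output code):

    import std/[monotimes, times]

    template timePhase(label: string; body: untyped) =
      ## Crude wall-clock timer; plenty accurate for phases taking >= ~10 ms.
      let t0 = getMonoTime()
      body
      echo label, ": ", (getMonoTime() - t0).inMilliseconds, " ms"

    # Hypothetical stand-ins for the real stages.
    proc parsePhase()  = discard
    proc countPhase()  = discard
    proc outputPhase() = discard

    timePhase("parse"):
      parsePhase()
    timePhase("count"):
      countPhase()
    timePhase("output"):
      outputPhase()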

So, 150 cycles per input word might sound like a lot, but I personally doubt it could be improved much. 43 MB does not fit in my 8 MB L3 cache, but a hash table of 2**ceil(lg(291e3)) = 512k entries just barely does, assuming about 16B per entry (512kEntry * 16B = 8 MB). Really, 41B per hash slot is more realistic (8B for the hash code, maybe 16B more for the MemSlice, 8B for the count, and about 9B for the string data). The string data is indirect, though, and most lookups are failed lookups that never touch it, so 32B per slot is probably the appropriate number.
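Or as a quick calculation (291e3 distinct words is from my run; the 16B and 32B per-slot figures are the guesses above):

    import std/math

    let
      uniqWords = 291_000                    # distinct words in my input
      slots     = nextPowerOfTwo(uniqWords)  # 2**ceil(lg(291e3)) = 524_288
    echo "slots:     ", slots
    echo "@16B/slot: ", slots * 16 div (1 shl 20), " MiB"  # 8 MiB: just barely L3-sized
    echo "@32B/slot: ", slots * 32 div (1 shl 20), " MiB"  # 16 MiB: ~50% L3 resident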

So, that 55% of 150 ms is =~ 82 ms, which over 4.8e6 words is =~ 17 ns =~ 80 cycles per lookup. The hash table is probably only about 50% L3 resident (16B/32B) - not bad, but not great. Quoted latency on Skylake L3 is supposedly about 40 cycles, while my DIMMs are probably about 100 cycles (20 ns). So the (AFAIK unavoidable) 4.8e6 lookup latencies are some kind of average of 40 and 100 cycles, and 80 cycles is not implausible: .5*100 + .5*40 = 70 cycles, after all (and 100 may be optimistic for my DIMMs). The table is only about 57% full (291k/512k), so probe sequences should be short and fit in one or two L3 cache lines, the 2nd of which is often preloaded by the hardware prefetcher during a linear probe. Of course, the table accesses are in dynamic competition with the streaming read from the parser, because CPU designers sadly don't let user code control cache policies.
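The blend I am hand-waving at, as a throwaway proc (the 40- and 100-cycle figures are rough quoted/guessed latencies, not measurements on my box):

    proc expectedLookupCycles(l3Resident: float): float =
      ## Naive residency-weighted blend of L3 and DRAM load latency.
      const l3Lat   = 40.0    # rough Skylake-era L3 latency, cycles
      const dimmLat = 100.0   # rough DRAM latency at 4.8 GHz (~20 ns), cycles
      l3Resident * l3Lat + (1.0 - l3Resident) * dimmLat

    echo expectedLookupCycles(0.5)   # 70.0 - in the ballpark of my observed ~80
    echo expectedLookupCycles(0.1)   # 94.0 - the mostly-DIMM-bound case below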

So, _maybe_ some hand-rolled assembly could shave off 20 cycles in the hash phase and maybe 15..20 cycles in each of the other two phases, for maybe 60 cycles better than the Nim version's 150. 150/90 is another 1.6x of potential improvement, but it would probably be a ton of work to get, and I haven't done it, so that 60 cycles is just a nominal upper-bound guess. 150 cycles might also be just about as good as can be done.

If the table cannot be kept 50% resident in L3, but more like 10%, then most of that assumed 60-cycle improvement would be lost just to the DIMM vs L3 latency gap. Meanwhile, just packing those hash entries into 16B instead of 32B (using, e.g., a 32-bit hash code, a 32-bit counter, and inline string data) could potentially make almost as much of a difference in pure Nim by keeping the table 100% L3 resident and maybe getting down to 40 cycles from 80. There is, of course, a hazard in focusing too much on the dimensions of this one input data set, which is unlikely to be very informative in general.
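To make that 16B-entry idea concrete, here is a rough sketch of the slot layout I have in mind (the field names are made up, and real code would need some spill scheme for words longer than 8 bytes):

    type
      PackedEnt = object       # 4 + 4 + 8 = 16 bytes per slot, 4 slots per 64B line
        hcode: uint32          # truncated hash code (0 could mean "empty")
        count: uint32          # a 32-bit count is plenty for 4.8e6 words
        short: array[8, char]  # words <= 8B inline; longer ones need a side table

    doAssert sizeof(PackedEnt) == 16   # no padding with this field order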

The "usual" informative way to benchmark hash tables is to do (at least) a "fits in L1" case, to check code-generation effects without much memory-system interference, and a "doesn't fit in any cache level" case, to check latency/probe-sequence effects. A more extreme variant of the latter would be a cold-cache run off a spinning-rust Winchester disk with data that doesn't fit in RAM, but such runs are usually painfully slow. Everything in between those two extremes gets a funky average of inter-cache-level behaviors that is harder to reason about. I figured I'd parse a web log like @enthus1ast. Synthetically generated data would be more reproducible, although it would also **not** exercise the hash function well. Sorry for the verbosity, but it's tricky to benchmark this stuff in a way that actually generalizes, and I've seen a lot of not-so-informative benchmarks get a lot of readership.
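For concreteness, a minimal sketch of those two extreme cases using the stdlib CountTable (the key counts are rough picks for typical L1/L3 sizes, the $ conversion adds the same constant overhead to both cases, and, as said, synthetic integer-ish keys do not stress the hash function much):

    import std/[tables, times, monotimes]

    proc bench(nKeys, nLookups: int) =
      var t = initCountTable[string]()
      for i in 0 ..< nKeys:
        t.inc($i)                     # build the table from synthetic keys
      let t0 = getMonoTime()
      for i in 0 ..< nLookups:
        t.inc($(i mod nKeys))         # re-count keys that are already present
      let ns = (getMonoTime() - t0).inNanoseconds
      echo nKeys, " keys: ", ns div nLookups.int64, " ns per inc"

    bench(1_000, 5_000_000)       # table roughly L1/L2 sized: mostly code-gen effects
    bench(2_000_000, 5_000_000)   # tens of MB, well past an 8 MB L3: mostly latency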
