Re: beat c in fefes "competition"

2019-03-25 Thread Araq
Reusing the string for the word inside `for word in line.split():` is rather 
easy to do and could be a stdlib patch. It's not hard, and Nim's strings are 
awesome here: allocations are expensive, and mutability lets you reuse memory 
aggressively.
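
Something like this minimal sketch (illustrative only, not an actual stdlib patch): an inline iterator that reuses one buffer, so the hot loop allocates only when the table stores a brand-new key.

import tables

# Sketch: the yielded `buf` is only valid until the next iteration;
# Nim's inline iterators alias the loop variable to it without copying.
iterator splitReuse(s: string, sep = ' '): string =
  var buf = newStringOfCap(16)
  var i = 0
  while i < s.len:
    var j = i
    while j < s.len and s[j] != sep: inc j
    if j > i:                 # skip empty tokens from repeated seps
      buf.setLen(j - i)
      copyMem(addr buf[0], unsafeAddr s[i], j - i)
      yield buf
    i = j + 1

var counts = initCountTable[string]()
for line in stdin.lines:
  for word in line.splitReuse:
    counts.inc(word)          # the key is copied only on first insertion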


Re: beat c in fefes "competition"

2019-03-25 Thread cblake
I wrote: "most lookups are failed lookups", but Oops! 291e3 unique/4.8e6 total 
= 6% ==> 94% of lookups are successful. So, the `memcmp` that happens only 
after hash codes match does happen almost all the time, and so my particular 
data set is more like 16B/(32B+9B) = 39% cached, not 50%. This doesn't really 
alter the rest of my analysis, though since 39% and 50% are pretty close.

A corollary of this is that (for this style of data set) Robin Hood hashing 
with linear probing would not help. RH can help for failed lookups because 
probe sequences for missing items become about 50% as long (once the search 
hash code passes the hash code stored in a slot, you know the key is missing 
and where to put it). It costs a little extra work (about 1.15x) to keep 
entries along probe sequences sorted by hash code. So, RH only pays off if you 
care about worst-case times or have a workload dominated by failed lookups 
over long probe sequences. At 56% table load (291/512) the sequences aren't 
long unless the hash function is weak relative to the data (not the case for 
me, unless all 4 functions were equally weak).

Anyway, RH would be about the only algorithmic speed-up I think could help in 
the poorly cached large table case, but only for very low hit rate data. It 
would not be crazy to add a `defined(useRobinHood)` in the Nim hash table impl 
(or even have a per-instance run-time flag) since whether RH helps or hurts is 
so dependent upon workload.
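
To make the early-termination idea concrete, here is a minimal sketch (my illustration, not Nim's table implementation), using the common probe-distance formulation, which is equivalent to keeping each probe sequence sorted:

import hashes

type
  Slot = object    # illustrative layout; table length a power of two
    hcode: Hash    # 0 means empty
    key: string
    count: int

proc probeDist(t: seq[Slot], i: int): int {.inline.} =
  # how far the resident entry at slot i sits from its home slot
  let mask = t.len - 1
  (i - (int(t[i].hcode) and mask) + t.len) and mask

proc rhFind(t: seq[Slot], key: string, hc: Hash): int =
  # Returns the slot index or -1. The second early exit is what makes
  # failed RH lookups ~50% as long as plain linear probing.
  let mask = t.len - 1
  var i = int(hc) and mask
  var dist = 0
  while true:
    if t[i].hcode == 0: return -1         # hit an empty slot: missing
    if probeDist(t, i) < dist: return -1  # resident is closer to home;
                                          # our key cannot appear later
    if t[i].hcode == hc and t[i].key == key: return i
    i = (i + 1) and mask
    inc dist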


Re: beat c in fefes "competition"

2019-03-25 Thread cblake
Oh, and two more stats about my input data that matter for reasoning about my 
timings: 43 MB and 4.8e6 total words (coincidentally, about 1e-3 times my 
4.8 GHz clock frequency).

So, the average string length is around 43e6/4.8e6 =~ 9 bytes (it is also a web 
server log), and the run costs about 150e-3/4.8e6 =~ 31 nsec =~ 150 CPU clock 
cycles per word.

A little more fiddling (commenting out various sections) shows that the 150 
cycles per word break down into:


  22% parsing (no hash table or output, just the MemSlice stuff)
  55% hash table counting
  23% output (sorting + binary->ascii on the numbers).



So, 150 cycles per input word might sound like a lot, but I personally doubt it 
could be improved much. 43 MB does not fit in my 8 MB L3 cache, but 
2**ceil(lg(291e3)) = 512 kEntry * 16 B =~ 8 MB just barely does, assuming about 
16 B per entry. Really, 41 B per hash slot is more realistic (8 B for the hash 
code, maybe 16 B more for the MemSlice, 8 B for the count, and about 9 B for 
the string data). The string data is indirect, and most lookups are failed 
lookups, so 32 B is probably the appropriate number.

So, that 55% of 150 ms is 82 ms, and 82 ms/4.8e6 =~ 17 ns =~ 80 cycles per 
word. The hash table is probably only about 50% L3-resident (16B/32B) - not 
bad, but not great. Quoted L3 latency on Skylake is supposedly about 40 cycles, 
while my DIMMs are probably about 100 cycles (20 ns). So, the (AFAIK 
unavoidable) 4.8e6 lookup latencies cost some kind of average of 40 and 100 
cycles each, and 80 cycles is not implausible: .5*100 + .5*40 = 70 cycles, 
after all (and 100 may be optimistic for my DIMMs). The table is only 56% full 
(291/512), so probe sequences should be short and fit in one or two L3 cache 
lines, the second of which is often preloaded by the hardware during a linear 
search. Of course, the table accesses are in dynamic competition with the 
streaming read from the parser, because CPU designers sadly don't let user 
code control cache policies.

So, _maybe_ some hand-rolled assembly could shave off 20 cycles in the hash 
phase and maybe 15..20 cycles in each of the other two phases, for maybe 60 
cycles better than Nim's 150. 150/90 is another 1.6x of potential improvement. 
It's probably a ton of work to get that, and I haven't actually done it, so 60 
cycles is just a nominal upper-bound guess. 150 cycles might also be just 
about as good as can be done.

If the table cannot be kept 50% resident in L3 but is more like 10% 
L3-resident, then most of that assumed 60-cycle improvement would be lost just 
to the DIMM-vs-L3 latency gap. Meanwhile, just packing those hash entries into 
16 B instead of 32 B (using, e.g., a 32-bit hash code, a 32-bit counter, and 
inline string data) could potentially make almost as much of a difference in 
pure Nim by keeping the table 100% L3-resident, maybe getting down to 40 
cycles from 80. There is, of course, a hazard of focusing too much on the 
dimensions of this one input data set, which is unlikely to be very 
informative.
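
A minimal sketch of such a packed slot (my illustration; the 8 B of inline string data and the overflow question are assumptions, not a worked-out design):

type
  PackedSlot {.packed.} = object
    hcode: uint32          # 32-bit hash code (0 = empty)
    count: uint32          # 32-bit counter
    short: array[8, char]  # inline string data; keys longer than 8 B
                           # would need some overflow/spill scheme

doAssert sizeof(PackedSlot) == 16  # 4 + 4 + 8 bytes
# 512 kEntry * 16 B = 8 MB: the whole table can now stay L3-resident.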

The "usual" informative way to benchmark hash tables is to do (at least) a 
"fits in L1" case to check code generation effects without a lot of memory 
system interference and then also a "doesn't fit in Lany" case to check 
latency/probe sequence effects. A more extreme variant of the latter would be a 
cold-cache spinning rust Winchester disk that doesn't fit in RAM, but such are 
usually painfully slow. Everything in between those two extremes gets a funky 
average of inter-cache-level behaviors that is harder to reason about. I 
figured I'd parse a web log like @enthus1ast. Generated synthetic data would be 
more reproducible, although also **not** exercise the hash function well. Sorry 
for the verbosity, but it's tricky to benchmark this stuff in a way that 
actually generalizes, and I've seen a lot of not so informative benchmarks get 
a lot of readership.
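
For what it's worth, a minimal sketch of those two regimes (the key and lookup counts are illustrative assumptions, not numbers from this thread):

import tables, times, random

proc bench(nKeys, nLookups: int) =
  # build a table of nKeys random keys, then time repeated lookups
  var t = initTable[int, int]()
  var keys = newSeq[int](nKeys)
  for i in 0 ..< nKeys:
    keys[i] = rand(1_000_000_000)
    t[keys[i]] = i
  var sink = 0
  let t0 = epochTime()
  for i in 0 ..< nLookups:
    sink += t.getOrDefault(keys[i mod nKeys])
  echo nKeys, " keys: ",
       (epochTime() - t0) * 1e9 / nLookups.float, " ns/lookup"
  doAssert sink >= 0   # keep the loop from being optimized away

randomize()
bench(1 shl 9, 10_000_000)    # ~512 keys: table fits in L1
bench(1 shl 23, 10_000_000)   # ~8M keys: spills far past L3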


Re: beat c in fefes "competition"

2019-03-25 Thread cblake
Just a quick follow-up: with gcc-8.3 on a Linux x86_64 Skylake CPU, 
profile-guided optimization did get that 176 ms time down to 150 ms (with 
-d:useNimHash it was 152 ms), a 1.17x boost, making it 1.43x faster than the C 
in @enthus1ast 's `wp.c`.
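
For reference, the PGO recipe is roughly as follows (a sketch: `wp.nim` and its arguments are placeholders, and `-f` forces a rebuild so the second compile picks up the profile):

# build instrumented, run once on representative input, then rebuild
nim c -f -d:release --passC:-fprofile-generate --passL:-fprofile-generate wp.nim
./wp input.log 300000
nim c -f -d:release --passC:-fprofile-use --passL:-fprofile-use wp.nim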

Of course, with 291,000 unique keys, the average external chain length in the 
C version is going to be 291e3/65.5e3 =~ 4.4, which is, um, not great. So, 
maybe this is only faster than a strawman C impl. YMMV. If you really want 
that particular C impl to be slow, just run it with 1e6 unique words. ;-) (But 
then you really had better not use Nim's current `CountTable.sort`!)


Re: beat c in fefes "competition"

2019-03-25 Thread cblake
As @Stefan mentioned, the string stuff can be slow. My `MemSlice` techniques 
might be more broadly interesting. The tokenizer below only works for "strict" 
one-character delimiting, not, e.g., runs of repeated whitespace. Also, I 
agree completely with @arnetheduck that algorithm choices matter more than 
languages for Nim-vs-C comparisons.

That said, the Nim below is faster than the C version on my machine (215 ms 
for C, 176 ms for Nim). @enthus1ast can try it on his data and report back if 
he likes. There are several little `defined` switches to play with.

FWIW, the C version referenced is a rather lame one (a fixed 64 KiEntry hash 
table with external-chaining collision resolution). I didn't try 
profile-guided optimization, which could improve the Nim results some. Since 
the C version "cheats" a bit by pre-sizing its hash table, my version below 
takes an estimate of the number of unique words that will be found as input 
and pre-sizes the Nim table to fit that estimate.

Also, **beware** `CountTable.sort()`. It uses Shell's sort, not a fast 
algorithm. The counts are always integers and typically small, so a radix sort 
would be the best choice, but regular old merge sort from `algorithm` is 
already a lot better. On my test input with 291,000 unique words, the version 
using `count.sort` took 120 seconds instead of 176 milliseconds, almost 3 full 
orders of magnitude slower. Times seemed insensitive to the hash function 
(order-1% variations), though both the sorting cost and the hash sensitivity 
will be data-set dependent, obviously. For very high entropy distributions you 
are going to want to avoid the current `CountTable.sort()`. Also, `CountTable` 
itself was a little slower than a regular `Table` using the same sorting 
procedure. So, you might want to just avoid `CountTable` in general in 
performance-sensitive cases.


import memfiles, hashes, tables, os, algorithm, strutils

proc hash*(ms: MemSlice): Hash {.inline.} =
  when defined(useNimHash):
    result = hashData(ms.data, ms.size)
  elif defined(useCB2):
    proc c_memhash(h0: int, a: pointer, size: int): int {.nodecl,
      importc: "memhash_cb2", header: getHomeDir() & "s/cb/hash_cb2.c".}
    result = c_memhash(0, ms.data, ms.size)
  elif defined(useCB4):
    proc c_memhash(h0: int, a: pointer, size: int): int {.nodecl,
      importc: "memhash_cb4", header: getHomeDir() & "s/cb/hash_cb4.c".}
    result = c_memhash(0, ms.data, ms.size)
  else:
    result = 0
    for i in 0 ..< ms.size:
      result = (result + (result shl 5)) xor int(cast[cstring](ms.data)[i])

proc split*(ms: MemSlice, fields: var seq[MemSlice], nA: int,
            sep = ' ', size = -1): int =
  proc c_memchr(s: cstring, c: char, n: csize): cstring {.
    importc: "memchr", header: "<string.h>".}
  proc `+!`(p: cstring, i: int): cstring {.inline.} =
    cast[cstring](cast[int](p) +% i)
  proc `-!`(p, q: cstring): int {.inline.} = cast[int](p) -% cast[int](q)
  var b   = cast[cstring](ms.data)
  var eob = cast[cstring](ms.data) +! (if size != -1: size else: ms.size)
  var e   = c_memchr(b, sep, eob -! b)
  while e != nil:
    fields[result].data = cast[pointer](b)
    fields[result].size = e -! b
    result += 1
    if result == nA - 2:  # max ix nA-1; need one more slot for final field
      break
    b = e +! 1
    e = c_memchr(b, sep, eob -! b)
  fields[result].data = cast[pointer](b)
  fields[result].size = eob -! b
  result += 1

proc main(input: string, sizeGuess: int) =
  const mxCols = 1024  # file format-dependent max
  when defined(useCountTab):
    var count = initCountTable[MemSlice](rightSize(sizeGuess))
  else:
    var count = initTable[MemSlice, int](rightSize(sizeGuess))
  var fields = newSeq[MemSlice](mxCols)
  for line in memSlices(memfiles.open(input)):
    fields.setLen(split(line, fields, mxCols, ' '))
    for word in fields:
      when defined(useCountTab):
        count.inc(word)
      else:
        inc(count.mgetOrPut(word, 0))
  stderr.write count.len, " unique words\n"
  when defined(useCountTab) and defined(useCountSort):
    count.sort()  # uses Shell's sort internally
    for key, cnt in count:
      stdout.write cnt, " "
      discard stdout.writeBuffer(cast[cstring](key.data), key.size)
      stdout.write "\n"
  else:
    var answer = newSeq[tuple[cnt: int, key: MemSlice]](count.len)
    var i = 0
    for key, cnt in count.pairs():
      answer[i].cnt = cnt
      answer[i].key = key
      inc(i)
    proc myCmp(x, y: tuple[cnt: int, key: MemSlice]): int = -cmp(x.cnt, y.cnt)
    answer.sort(myCmp)
    for ans in answer:
      stdout.write ans.cnt, " "
      discard stdout.writeBuffer(cast[cstring](ans.key.data), ans.key.size)
      stdout.write "\n"

# Must call with: <input file> <unique-word count estimate>
main(paramStr(1), parseInt(paramStr(2)))

Re: beat c in fefes "competition"

2019-03-25 Thread mratsim
Optimising strings is a pain :P. I don't have enough experience with that yet, 
but I plan to acquire it, since I will need it for text analysis/natural 
language processing anyway.


Re: beat c in fefes "competition"

2019-03-24 Thread arnetheduck
Given that Nim compiles to C, and a C compiler then compiles the resulting 
code, I always find these faster-than-C benchmark claims slightly amusing ;)

I'd venture ahead and say that it's mostly the skill of the developer and not 
so much the language that determines performance, by and large - that includes 
being able to find good idiomatic solutions to the problem in the language by 
using its native features appropriately. 


Re: beat c in fefes "competition"

2019-03-24 Thread Stefan_Salewski
It seems obvious that the problem with your approach is that the split 
iterator allocates a new string for each call. There may be other, faster but 
uglier solutions, basically reusing the same string. Was that available in 
parseutils? I cannot remember, but I guess faster examples are available 
somewhere: Rosetta Code, or that blog post about faster command line tools? 
Dom's book -- he talked about allocations, but his conclusion was not fully 
correct in the first draft. I wonder if iterators are possible which do not 
allocate the result?
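
For what it's worth, a minimal sketch of that idea with `parseutils` (my illustration: `parseUntil` writes into a `var string` token, reusing its capacity, so the hot loop itself does not allocate per word):

import parseutils, tables

var counts = initCountTable[string]()
var line = ""
var word = newStringOfCap(64)   # one buffer, reused for every token
while stdin.readLine(line):
  var i = 0
  while i < line.len:
    i += line.skipWhile({' '}, i)            # skip runs of separators
    let n = line.parseUntil(word, {' '}, i)  # fills `word` in place
    if n == 0: break
    counts.inc(word)   # the table copies the key only when it is new
    i += n
echo counts.len, " unique words"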


Re: beat c in fefes "competition"

2019-03-24 Thread miran
> So, can one beat the C version?

A long time ago I concluded that when it comes to optimization, I can optimize 
all day and night, but then at some point @mratsim will come along and improve 
it by 50+% :D

So if you want to beat C, basically wait for @mratsim to show up and do his 
magic :)


beat c in fefes "competition"

2019-03-24 Thread enthus1ast
Hi,

fefe (a German blogger, [http://blog.fefe.de/](http://blog.fefe.de/)) has 
called for a little benchmark. The challenge is to count the words in a 
webserver logfile (from stdin), then output them ordered like so:


150 foobaa
80  faa
12  www



Blog entries:


  * [https://blog.fefe.de/?ts=a2689de5](https://blog.fefe.de/?ts=a2689de5)
  * [https://blog.fefe.de/?ts=b7c295e7](https://blog.fefe.de/?ts=b7c295e7)


Benchmarks:


  * [http://ptrace.fefe.de/wp/timings2019.txt](http://ptrace.fefe.de/wp/timings2019.txt)
  * [https://ptrace.fefe.de/wp/timings.txt](https://ptrace.fefe.de/wp/timings.txt)



Competing scripts:


  * [https://ptrace.fefe.de/wp/](https://ptrace.fefe.de/wp/)



My idiomatic version: 


## compile with: nim c -d:release --opt:speed --passL:-s wp_idiomatic.nim
import tables, strutils

proc main() =
  var countTable = newCountTable[string]()
  var line: string = ""
  while true:
    if not readLine(stdin, line): break
    for word in line.split():
      countTable.inc(word)
  countTable.sort()
  for key, val in countTable.pairs():
    echo val, " ", key
main()

For my testfile I appended a few nginx logs together:


292M    /tmp/bigger.log



Timings of the C version:


time cat /tmp/bigger.log | ./a.out
real    0m2,373s
user    0m2,276s
sys     0m0,276s

Timings of the idiomatic Nim version:


time cat /tmp/bigger.log | ./wp_idiomatic
real    0m2,869s
user    0m2,766s
sys     0m0,280s

which I think is already a good speed (for this naive approach)?! So, can one 
beat the C version? :D