Ok. I realize I also never told @Serge (who never replied) about needing to do `wordFrequencies.sort`. So, as penance, I did this series of **_six_** successively faster versions: first @gemath's, then the no-deps variant, then one using the heapqueue I mentioned, then `Table[string, int]` instead of `CountTable[string]`, then (for a big 2x time drop) `cligen`'s `mfile`/`mslice`, then pre-sizing the table with a good guess for the average bytes per unique word. First comes the code, then the timings on an i7-6700k @ 4.7GHz with hopefully easy-to-reproduce input data. To keep things simple, I only include the `iterator` definition once.

```nim
# whist00
import strutils, tables, zero_functional

proc main() =
  var histo = initCountTable[string]()
  for line in lines "/dev/stdin":
    for word in line.split(" "):
      if word.len > 0: histo.inc(word)
  histo.sort
  histo.pairs() --> take(10).foreach(echo it[0] & " " & $it[1])

main()

# whist0i: same as above, but the post-build part is iterative, not functional
  histo.sort                # maybe the 'real' answer to @Serge's question
  var i = 10
  for word, cnt in histo:
    echo cnt, " ", word
    i.dec
    if i == 0: break

# whist1: now introduce `iterator topN`
import strutils, tables, heapqueue

iterator topN[T](h: CountTable[T]|Table[T, int], n = 10): tuple[cnt: int; key: T] =
  var q = initHeapQueue[tuple[cnt: int; key: T]]()
  for key, cnt in h:
    if q.len < n:
      q.push((cnt, key))
    elif cnt > q[0].cnt:        # retain 1st seen on tied cnt
      discard q.replace((cnt, key))
  while q.len > 0:              # q now has top n entries
    yield q.pop                 # print in ascending order

proc main() =
  var histo = initCountTable[string]()
  for line in lines "/dev/stdin":
    for word in line.split(" "):
      if word.len > 0: histo.inc(word)
  for tup in histo.topN:
    echo tup[0], " ", tup[1]

# whist2: move from `CountTable[string]` to `Table[string, int]`
proc main() =
  var histo = initTable[string, int]()
  for line in lines "/dev/stdin":
    for word in line.split(" "):
      if word.len > 0: histo.mgetOrPut(word, 0).inc
  for tup in histo.topN:
    echo tup[0], " ", tup[1]

# whist3: move to memory-mapped IO and from `string` to slices
import cligen/[mfile, mslice]

proc main() =
  let mf = mopen("/dev/stdin")
  var histo = initTable[MSlice, int]()
  for line in mf.mSlices:
    for word in line.mSlices(' '):
      if word.len > 0: histo.mgetOrPut(word, 0).inc
  for tup in histo.topN:
    echo tup[0], " ", tup[1]

# whist4: presize the table; everything the same except this line
  var histo = initTable[MSlice, int](rightSize(mf.len div 45))
```
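In case anyone wants to poke at the heap logic without piping in a file, here is a tiny self-contained sanity check of the same top-n idea (the `topN` body is repeated so the snippet compiles on its own; the toy word list is just for illustration):

```nim
import tables, heapqueue

iterator topN[T](h: CountTable[T]|Table[T, int], n = 10): tuple[cnt: int; key: T] =
  var q = initHeapQueue[tuple[cnt: int; key: T]]()
  for key, cnt in h:
    if q.len < n:
      q.push((cnt, key))
    elif cnt > q[0].cnt:        # retain 1st seen on tied cnt
      discard q.replace((cnt, key))
  while q.len > 0:              # q now has the top n entries
    yield q.pop                 # smallest retained count first => ascending

var h = initCountTable[string]()
for w in ["a", "b", "b", "c", "c", "c"]:
  h.inc w
for tup in h.topN(2):
  echo tup.cnt, " ", tup.key    # prints "2 b" then "3 c"
```

The heap holds only `n` entries, so this is O(uniqueWords * log n) rather than a full O(uniqueWords * log uniqueWords) sort.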
Then we run these six versions against a text file holding the last 50,000 lines of the alphabetically-recursively globbed source in the Nim git repository. There are Zsh-isms here (`setopt extendedglob numericglobsort`), and I use a little `utime` program I have that times the execution of its argument like `/usr/bin/time`, but to nanoseconds, plus a little program to emit Parzen-interpolated quantiles, with quantile 0 being the sample minimum.

```
nim-repo..908bbb250f$ cat **.nim | tail -n50000 > /tmp/j
$ # a bunch of PGO compiles with my nim-pgo script; panics:on -d:useMalloc --gc:arc, Linux
$ for p in whist[0-4]*(x.); echo $p $((repeat 100 utime ./$p < j >/dev/null)|& quantiles 0 .05 .1)
whist00 0.02391167 0.02400840 0.02403021
whist0i 0.02335627 0.02340394 0.02345088
whist1  0.02189838 0.02192395 0.02201866
whist2  0.01853665 0.01885521 0.01904099
whist3  0.00922274 0.00943208 0.00952293
whist4  0.00760648 0.00772719 0.00777402
```

My `/tmp` is a /dev/shm filesystem, so there is no device IO here. To see ratios relative to the first row as we optimize, we can pipe that through an `awk`:

```
$ awk '{print $1,$2/0.02391167,$3/0.02400840,$4/0.02403021}'
whist00 1 1 1
whist0i 0.976773 0.974823 0.975892
whist1  0.915803 0.913178 0.916291
whist2  0.775214 0.785359 0.792377
whist3  0.3857   0.392866 0.39629
whist4  0.318107 0.321854 0.32351
```

Not bad. We eventually do over 3x better than the initial version, going from ~24 ms down to ~7.6 ms. As @gemath alludes to, the hit for `zero_functional` is only about 2.5%. The `topN` iterator surprised me with how little it saved (at that still-slow stage), only another ~6%. Then regular `Table` is about 1.2x faster than `CountTable`, `MSlice` with memory-mapped IO is about 2.0x faster than using `string`, and presizing gets you another ~1.2x speed-up.
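`utime` and `quantiles` are just my own little tools, so for reproduction purposes here is a rough stand-in sketch for the timing half (the name `timeCmd` and the whole proc are mine, not a standard API): wall-clock a child command via `std/monotimes`, which has nanosecond resolution.

```nim
# Rough stand-in for my `utime`: time one shell command to ns resolution.
import std/[monotimes, times, osproc, os, strutils]

proc timeCmd(cmd: string): float =
  ## Run `cmd` through the shell; return elapsed wall time in seconds.
  let t0 = getMonoTime()
  discard execCmd(cmd)
  result = (getMonoTime() - t0).inNanoseconds.float / 1e9

when isMainModule:
  # e.g.  ./timecmd './whist4 < /tmp/j > /dev/null'
  echo timeCmd(commandLineParams().join(" "))
```

Repeating the measurement (like the `repeat 100` above) and looking at low quantiles rather than the mean is what filters out scheduler noise.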
As a cross check on my input data, this is the output of just `whist4`:

```
 1201 of
 1296 echo
 1317 """
 1389 let
 1934 doAssert
 2180 proc
 2543 #
 2574 var
 3441 ==
11863 =
```

The outputs are the same across all the programs, except that whist0* prints those 10 lines in reverse order. It is obviously easy and fast to use `algorithm.reverse` to flip those if desired.
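That flip could look something like this minimal sketch: collect the pairs `topN` yields (ascending) into a seq, then reverse in place. The `rows` literal here just stands in for what `topN` would yield.

```nim
import algorithm

var rows = @[(2, "b"), (3, "c")]     # pairs as topN yields them: ascending
rows.reverse                         # algorithm.reverse: in-place flip
for r in rows:
  echo r[0], " ", r[1]               # now descending: "3 c" then "2 b"
```

(`algorithm.reversed` returns a flipped copy instead, if mutation is unwanted.)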