So, one other thing that is _probably_ obvious but bears mentioning just in 
case it didn't occur to @alfrednewman - the number of addressable/seekable 
bytes in a file is usually maintained by any modern filesystem on any modern OS 
as cheaply accessed metadata. So, if what you really need is not an exact count
but a "reasonable bound" that could be violated once in a while, then you may
be able to _really_ accelerate your processing.

As one example text corpus that we all have access to, the current Nim git
repo's `lib/nim/**.nim` files have an average line length of 33 bytes and a
standard deviation of about 22 bytes, as assessed by this little program:
    
    
import os, memfiles
when isMainModule:
  var inp = memfiles.open(paramStr(1))  # memory-map the file named on the command line
  var counter = 0
  var sumSq = 0
  for slc in memSlices(inp):            # one slice per line, delimiter bytes excluded
    inc(counter)
    sumSq += slc.size * slc.size
  echo "num lines: ", counter
  let mean = float(inp.size) / float(counter)
  echo "mean length: ", mean
  let meanSq = float(sumSq) / float(counter)
  echo "var(length): ", meanSq - mean*mean
  inp.close()
    

You could probably reasonably bound the number of bytes per line as, say,
(average + 4*stdev), which in this case is about 33 + 4*22 =~ 121
bytes...maybe round up to 128. Then you could do something like:
    
    
import os
var myPath = paramStr(1)  # whichever file you want a quick line-count bound for
var reasonableUpperBoundOnLineCount = int(float(getFileInfo(myPath).size) / 128.0)
    

If you use that bound to allocate something, then you are unlikely to
over-allocate memory by more than about 4X, which isn't usually considered
"that bad" in this kind of problem space. Depending on what you are doing, you
can tune that parameter, and you might need to be prepared in your code to
"spill" past a very, very rare 4-standard-deviation tail event. This
optimization will beat the pants off even an AVX512 routine that iterates over
all the file bytes, at least for this aspect of the calculation. It basically
eliminates a whole pass over the input data, in a case that is made common by
construction.
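
For instance, if the per-line data you eventually collect lives in a seq, one
way to use the estimate (just a sketch; `lineOffsets` stands in for whatever
you actually store per line) is as a capacity hint, letting the seq grow on its
own if the estimate turns out low (the "spill" case):

import os
let myPath = paramStr(1)  # whichever input file you care about
let estLines = int(float(getFileInfo(myPath).size) / 128.0)
var lineOffsets = newSeqOfCap[int](estLines)  # capacity hint only; len stays 0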

Since you have seemed pretty dead set on an exact calculation in other posts, a
small elaboration upon the "embedded assumptions" in this optimization may be
warranted. All that is really relied upon is that some initial sample of files
can predict the distribution of line lengths "well enough" to estimate some
threshold (that "128" divisor) with "tunably rare" spillovers, rare enough not
to cause much slowdown in whatever ultimate calculation you are actually doing
(which you have not been clear about).

Another idea along these lines, if, say, the files are processed over and over
again, is to avoid re-computing all those `memchr()` calls by writing a little
proc/program to maintain an off-to-the-side file of byte indexes to the
beginnings of lines. The idea here is that you have two sets of files: your
actual input files and, for each input "foo", a paired file "foo.idx"
containing just a bunch of binary ints in the native format of the CPU (either
byte offsets or line lengths), effectively caching the answers of the `memchr`
calls.
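
A minimal sketch of such an indexer (the name `buildIndex` and the ".idx"
suffix are just for illustration), assuming the input file is given on the
command line:

import os, memfiles

proc buildIndex(path: string) =
  # Write the byte offset of each line start in `path` to `path & ".idx"` as
  # native-format ints, one per line, i.e. cache the `memchr` answers.
  var inp = memfiles.open(path)
  defer: inp.close()
  var idx = open(path & ".idx", fmWrite)
  defer: idx.close()
  for slc in memSlices(inp):
    var off = cast[int](slc.data) - cast[int](inp.mem)  # offset of this line's first byte
    discard idx.writeBuffer(addr off, sizeof(off))

when isMainModule:
  buildIndex(paramStr(1))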

If you had such index files, then when you want to know how many lines a file
has, you can just `getFileInfo` the ".idx" file and know immediately: its size
divided by the size of one int is the line count. You could be careful and
check modification times on the .idx and the original data file and that sort
of thing, too. Why, you could even add a "smarter" `memSlices` that checks for
such a file and skips almost all of its work, and an API call `numSlices` that
skips all of the work if the ".idx" file is up-to-date according to time
stamps.
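
A rough sketch of that `numSlices` idea (the name and the -1 fallback are just
illustrative; a fancier version would rebuild the index instead):

import os, times

proc numSlices(path: string): int =
  # If an up-to-date companion ".idx" file exists, the line count is just its
  # size divided by the size of one int; otherwise signal a fallback.
  let idxPath = path & ".idx"
  if fileExists(idxPath) and
     getLastModificationTime(idxPath) >= getLastModificationTime(path):
    result = int(getFileInfo(idxPath).size) div sizeof(int)
  else:
    result = -1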

Basically, except for actually "just computing file statistics", it seems 
highly unlikely to me that you should really be optimizing the heck out of 
newline counting in and of itself beyond what `memSlices/memchr()` already do.
