On Wed, May 31, 2017 at 12:13:04PM -0700, H. S. Teoh via Digitalmars-d-learn wrote:
> On Tue, May 30, 2017 at 05:13:46PM -0700, Ali Çehreli via Digitalmars-d-learn
> wrote:
> [...]
> > I could not make the D program come close to wc's performance when the
> > data was piped from stdin.
> [...]
>
> Hmm. This is a particularly interesting case, because I adapted some
> of my algorithms to handle reading from stdin (i.e., std.mmfile is not
> an option), and I could not get it to reach wc's performance! I even
> installed ldc just to see if that made a difference... it was somewhat
> faster than gdc, but still, the timings were about twice as slow as
> wc.
[...]
Here's a little update on my experiments w.r.t. reading from stdin
without being able to use std.mmfile: I found that I was able to achieve
decent performance by modifying the loop so that it loads data from the
array in size_t chunks rather than by individual bytes, and then loops
over the bytes within each size_t to check for matches. Here's the code:

    size_t lineCount7(ref File input)
    {
        import std.algorithm.searching : count;

        ubyte[] buf;
        size_t c = 0;

        buf.length = 8192;
        foreach (d; input.byChunk(buf))
        {
            if (d.length == buf.length)
            {
                // Full chunk: scan it one 64-bit word at a time,
                // checking each byte position within the word.
                auto ichunk = cast(ulong[]) d;
                size_t subtotal = 0;
                foreach (i; ichunk)
                {
                    enum eol = cast(ulong) '\n';
                    if ((i & 0x00000000000000FF) ==  eol       ) subtotal++;
                    if ((i & 0x000000000000FF00) == (eol <<  8)) subtotal++;
                    if ((i & 0x0000000000FF0000) == (eol << 16)) subtotal++;
                    if ((i & 0x00000000FF000000) == (eol << 24)) subtotal++;
                    if ((i & 0x000000FF00000000) == (eol << 32)) subtotal++;
                    if ((i & 0x0000FF0000000000) == (eol << 40)) subtotal++;
                    if ((i & 0x00FF000000000000) == (eol << 48)) subtotal++;
                    if ((i & 0xFF00000000000000) == (eol << 56)) subtotal++;
                }
                c += subtotal;
            }
            else
            {
                // Incomplete final chunk: fall back to a plain byte scan.
                c += d.count('\n');
            }
        }
        return c;
    }

When the last chunk of the file (possibly the entire file, if it's
< 8 KB) is incomplete, we revert to the naïve loop-over-bytes search.
While this may superficially seem like an unnecessary complication, it
actually makes a significant performance difference, because:

(1) Reading the array in size_t chunks means fewer roundtrips to RAM or
    to the L1/L2/L3 caches.

(2) Since a size_t fits within a single CPU register, the inner loop can
    be done entirely inside the CPU without even going to L1 cache,
    which, while pretty fast, is still a memory roundtrip. The register
    file is the fastest memory of all, so we maximize this advantage
    here.

(3) Since size_t has a fixed size, the inner loop can be completely
    unrolled (ldc does this), which eliminates branch hazards from the
    inner loop.

I originally tried to copy the glibc memchr implementation's xor trick
for checking whether a size_t word contains any matching bytes, but I
got mixed results, and in any case it loses out to my system's wc
implementation. I suppose given enough effort I could track down what's
causing the D code to slow down, but then I realized something else
about wc: since it uses memchr to find EOL bytes, and now that I know
how memchr is implemented in glibc, a lot of overhead is introduced when
the data being scanned contains a lot of matches.
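For the record, the xor trick in question is roughly this: xor each
64-bit word with a mask of '\n' repeated in every byte, so that matching
bytes become zero, then apply the classic zero-byte test as a cheap
filter before doing a per-byte count. The following is only a sketch of
the idea (countNewlinesXor is an illustrative name, not the exact code I
benchmarked):

    // Sketch of the glibc-style xor trick: use a whole-word test as a
    // quick filter, and only fall back to a per-byte count when the
    // word actually contains a '\n'.
    size_t countNewlinesXor(const(ulong)[] words)
    {
        enum ulong lows   = 0x0101010101010101;
        enum ulong highs  = 0x8080808080808080;
        enum ulong nlMask = cast(ulong) '\n' * lows; // '\n' in every byte

        size_t total = 0;
        foreach (w; words)
        {
            auto x = w ^ nlMask;    // bytes equal to '\n' are now zero
            if (((x - lows) & ~x & highs) == 0)
                continue;           // no newline anywhere in this word

            // At least one newline in this word: count them per byte.
            foreach (j; 0 .. 8)
                if (((w >>> (8*j)) & 0xFF) == '\n')
                    total++;
        }
        return total;
    }

The whole point of the filter is to skip match-free words quickly, which
is exactly why this kind of code shines on long lines and loses its
advantage when nearly every word contains a match.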
So I did a little test: I created two text files, both 200 million bytes
long, one consisting entirely of newlines (i.e., 200,000,000 blank
lines), and one with 100-character lines (2,000,000 lines in total).
Then, as an intermediate point between these two extremes, I
concatenated all of the .d files in Phobos 10 times to make a file with
2.8 million lines of varying lengths. Then I tested both my system's wc
and my linecount written in D to see how they performed on these files
(piped through stdin, so no mmap-ing is involved).

Here are the results (note: these are raw measurements; I did not
account for system background noise):

            +------------------+-------------------+------------------+
            | 200M blank lines | 2M 100-byte lines | 10x Phobos code  |
+-----------+------------------+-------------------+------------------+
| wc -l     | real   0m0.475s  | real   0m0.080s   | real   0m0.083s  |
|           | user   0m0.417s  | user   0m0.034s   | user   0m0.062s  |
|           | sys    0m0.058s  | sys    0m0.046s   | sys    0m0.020s  |
+-----------+------------------+-------------------+------------------+
| linecount | real   0m0.181s  | real   0m0.190s   | real   0m0.099s  |
|           | user   0m0.138s  | user   0m0.129s   | user   0m0.059s  |
|           | sys    0m0.042s  | sys    0m0.060s   | sys    0m0.040s  |
+-----------+------------------+-------------------+------------------+

As expected, wc -l loses when dealing with blank lines (and, presumably,
very short lines); the D version beat it by more than a factor of 2. On
the file with 100-byte lines, though, the performance of wc improved
tremendously, because glibc's memchr is optimized for scanning large
amounts of data before finding a match, whereas the D version performed
more or less on par with its blank-line case, losing to wc by about a
factor of 2.

The results for the 10x Phobos code are not directly comparable with the
first two test files, because the total file size is different. Here,
the D code still loses slightly to wc, presumably because memchr is
ultimately still more efficient given the average line length of Phobos
code.

These results show that the performance of these algorithms depends on
the kind of data you feed them, and there's probably no "optimal"
line-counting algorithm unless you can predict the average line length
in advance. In general, though, if you're dealing with text, I'd wager
that the average line length is much closer to the 100-byte end of the
spectrum than to a file full of blank lines, so wc probably wins on your
typical text file. I guess that means that using memchr or its
equivalent in D would be the best strategy to obtain results on par with
wc, as far as reading from stdin is concerned. When you can use
std.mmfile, though, the D version still beats wc by a large margin. :-D


T

-- 
Help a man when he is in trouble and he will remember you when he is in
trouble again.