On Wed, May 31, 2017 at 12:13:04PM -0700, H. S. Teoh via Digitalmars-d-learn wrote:
> On Tue, May 30, 2017 at 05:13:46PM -0700, Ali Çehreli via Digitalmars-d-learn
> wrote:
> [...]
> > I could not make the D program come close to wc's performance when the
> > data was piped from stdin.
> [...]
>
> Hmm. This is a particularly interesting case, because I adapted some
> of my algorithms to handle reading from stdin (i.e., std.mmfile is not
> an option), and I could not get it to reach wc's performance! I even
> installed ldc just to see if that made a difference... it was somewhat
> faster than gdc, but still, the timings were about twice as slow as
> wc.
[...]
Here's a little update on my experiments w.r.t. reading from stdin
without being able to use std.mmfile: I found that I was able to achieve
decent performance by modifying the loop so that it loads data from the
array in size_t chunks rather than by individual bytes, and then loops
over the bytes within each size_t to check for matches. Here's the code:

    size_t lineCount7(ref File input)
    {
        import std.algorithm.searching : count;

        ubyte[] buf;
        size_t c = 0;

        buf.length = 8192;
        foreach (d; input.byChunk(buf))
        {
            if (d.length == buf.length)
            {
                // Full chunk: scan it one 64-bit word at a time,
                // checking each byte position within the word.
                auto ichunk = cast(ulong[]) d;
                size_t subtotal = 0;
                foreach (i; ichunk)
                {
                    enum eol = cast(ulong) '\n';
                    if ((i & 0x00000000000000FF) ==  eol       ) subtotal++;
                    if ((i & 0x000000000000FF00) == (eol <<  8)) subtotal++;
                    if ((i & 0x0000000000FF0000) == (eol << 16)) subtotal++;
                    if ((i & 0x00000000FF000000) == (eol << 24)) subtotal++;
                    if ((i & 0x000000FF00000000) == (eol << 32)) subtotal++;
                    if ((i & 0x0000FF0000000000) == (eol << 40)) subtotal++;
                    if ((i & 0x00FF000000000000) == (eol << 48)) subtotal++;
                    if ((i & 0xFF00000000000000) == (eol << 56)) subtotal++;
                }
                c += subtotal;
            }
            else
            {
                // Incomplete final chunk: fall back to a plain byte scan.
                c += d.count('\n');
            }
        }
        return c;
    }

When the last chunk of the file (possibly the entire file, if it's
< 8 KB) is incomplete, we revert to the naïve loop-over-bytes search.
While this may superficially seem like an unnecessary complication, it
actually makes a significant performance difference, because:

(1) Reading the array in size_t chunks means fewer roundtrips to RAM or
    to the L1/L2/L3 caches.

(2) Since a size_t fits within a single CPU register, the inner loop can
    be done entirely inside the CPU without even going to L1 cache,
    which, while pretty fast, is still a memory roundtrip. The register
    file is the fastest memory of all, so we maximize this advantage
    here.

(3) Since size_t has a fixed size, the inner loop can be completely
    unrolled (ldc does this), which eliminates branch hazards from the
    inner loop.

I originally tried to copy the glibc memchr implementation's xor trick
for checking whether a size_t word contains any matching bytes, but I
got mixed results, and in any case it loses out to my system's wc
implementation. I suppose given enough effort I could track down what's
causing the D code to slow down, but then I realized something else
about wc: since it uses memchr to find EOL bytes, and now that I know
how memchr is implemented in glibc, a lot of overhead is introduced when
the data being scanned contains a lot of matches.
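For the record, the xor trick in question is roughly this: xor each
64-bit word with a mask of '\n' repeated in every byte, so that matching
bytes become zero, then apply the classic zero-byte test as a cheap
filter before doing a per-byte count. The following is only a sketch of
the idea (countNewlinesXor is an illustrative name, not the exact code I
benchmarked):

    // Sketch of the glibc-style xor trick: use a whole-word test as a
    // quick filter, and only fall back to a per-byte count when the
    // word actually contains a '\n'.
    size_t countNewlinesXor(const(ulong)[] words)
    {
        enum ulong lows   = 0x0101010101010101;
        enum ulong highs  = 0x8080808080808080;
        enum ulong nlMask = cast(ulong) '\n' * lows; // '\n' in every byte

        size_t total = 0;
        foreach (w; words)
        {
            auto x = w ^ nlMask;    // bytes equal to '\n' are now zero
            if (((x - lows) & ~x & highs) == 0)
                continue;           // no newline anywhere in this word

            // At least one newline in this word: count them per byte.
            foreach (j; 0 .. 8)
                if (((w >>> (8*j)) & 0xFF) == '\n')
                    total++;
        }
        return total;
    }

The whole point of the filter is to skip match-free words quickly, which
is exactly why this kind of code shines on long lines and loses its
advantage when nearly every word contains a match.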
So I did a little test: I created two text files, both 200 million bytes
long, one consisting entirely of newlines (i.e., 200,000,000 blank
lines), and one with 100-character lines (2,000,000 lines in total).
Then, as an intermediate point between these two extremes, I
concatenated all of the .d files in Phobos 10 times to make a file with
2.8 million lines of varying lengths. Then I tested both my system's wc
and my linecount written in D to see how they performed on these files
(piped through stdin, so no mmap-ing is involved).

Here are the results (note: these are raw measurements; I did not
account for system background noise):

            +------------------+-------------------+------------------+
            | 200M blank lines | 2M 100-byte lines | 10x Phobos code  |
+-----------+------------------+-------------------+------------------+
| wc -l     | real   0m0.475s  | real   0m0.080s   | real   0m0.083s  |
|           | user   0m0.417s  | user   0m0.034s   | user   0m0.062s  |
|           | sys    0m0.058s  | sys    0m0.046s   | sys    0m0.020s  |
+-----------+------------------+-------------------+------------------+
| linecount | real   0m0.181s  | real   0m0.190s   | real   0m0.099s  |
|           | user   0m0.138s  | user   0m0.129s   | user   0m0.059s  |
|           | sys    0m0.042s  | sys    0m0.060s   | sys    0m0.040s  |
+-----------+------------------+-------------------+------------------+

As expected, wc -l loses when dealing with blank lines (and, presumably,
very short lines); the D version beat it by more than a factor of 2. On
the file with 100-byte lines, though, the performance of wc improved
tremendously, because glibc's memchr is optimized for scanning large
amounts of data before finding a match, whereas the D version performed
more or less on par with its blank-line case, losing to wc by about a
factor of 2.

The results for the 10x Phobos code are not directly comparable with the
first two test files, because the total file size is different. Here,
the D code still loses slightly to wc, presumably because memchr is
ultimately still more efficient given the average line length of Phobos
code.

These results show that the performance of these algorithms depends on
the kind of data you feed them, and there's probably no "optimal"
line-counting algorithm unless you can predict the average line length
in advance. In general, though, if you're dealing with text, I'd wager
that the average line length is much closer to the 100-byte end of the
spectrum than to a file full of blank lines, so wc probably wins on your
typical text file. I guess that means that using memchr or its
equivalent in D would be the best strategy to obtain results on par with
wc, as far as reading from stdin is concerned. When you can use
std.mmfile, though, the D version still beats wc by a large margin. :-D


T

-- 
Help a man when he is in trouble and he will remember you when he is in
trouble again.