On Tue, May 30, 2017 at 08:02:38PM +0000, Nitram via Digitalmars-d-learn wrote:
> After reading
> https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ , i
> was wondering how fast one can do a simple "wc -l" in D.
>
> So i made a couple short implementations and found myself confronted
> with slow results compared to "/usr/bin/wc -l".
>
> How would a implementation look like in D, which is fast?
This little challenge piqued my interest, so I decided to take a shot at seeing if I could beat my system's /usr/bin/wc -l.

First order of business: when it comes to performance, always choose the right compiler for the job. While dmd is the reference compiler and has the latest and greatest features, it doesn't do too well in the performance department. So the first thing I did was to use gdc instead. Specifically, gdc -O3, to get the most aggressive optimizations the gcc backend can give me.

Second order of business: the naïve algorithm of reading 1 character at a time from the file is obviously suboptimal, so I didn't even consider it.

Third order of business: I constructed a large text file containing 10634420 lines by concatenating 5 copies of a large CSV file I obtained online. The sample had to be large enough so that overhead like program startup time and background system noise doesn't bias the test results too much.

With this setup, I measured the performance of my system's wc -l as the baseline for comparing the performance of the D solutions. Here's a typical output of my shell's `time wc -l` command:

	real	0m0.344s
	user	0m0.153s
	sys	0m0.191s

Since we're talking about beating wc, and others have already posted the obvious solutions of loading into a buffer and scanning the buffer, the natural next step up is to use std.mmfile to memory-map the file so that we don't waste effort managing file buffers ourselves: let the OS do it for us.

So here are the algorithms I tried:

1) lineCount1(): use std.mmfile to memory-map the file so that we can scan it as a single, contiguous array without any fussy buffer-management details. Result: pretty fast, but still loses to wc. Typical measurements:

	real	0m0.422s
	user	0m0.366s
	sys	0m0.054s

2) lineCount2(): just as a comparison, I did a read-into-buffer algorithm so that we have a reference for how it performs. As expected, it's worse than the std.mmfile solution. Typical measurements:

	real	0m0.519s
	user	0m0.320s
	sys	0m0.198s

3) lineCount3(): in lineCount1() I used std.algorithm.searching.count to count the newlines, so I thought: what if the range abstraction is introducing too much overhead? I decided to write a simple foreach loop instead, in the hope that the simpler code would let the gcc optimizer do a better job. Sadly, the result is pretty much the same as lineCount1():

	real	0m0.421s
	user	0m0.379s
	sys	0m0.042s

(The good news, though, is that this shows that using std.algorithm introduces no excessive overhead compared to a hand-written loop: the high-level range abstraction does not necessarily equate to slower performance.)

4) lineCount4(): having failed to beat wc thus far, it's time to bring out the big guns. Since we're essentially counting newlines in the input, who says we have to process the data sequentially from start to end? Counting from front to back or back to front gives the same answer, as does counting from middle to end, then front to middle. In particular, counting from the middle to the end *in parallel* with counting from the front to the middle, then summing the subtotals, gives the same answer. I.e., time to bust out std.parallelism. :-)

So, in my 4th algorithm, I split the input into 16KB chunks and scan them for newlines in parallel, creating file_size/16384 subtotals. Then I sum the subtotals at the end. Here's the result, tested on my 6-core AMD Phenom CPU:

	real	0m0.242s
	user	0m1.151s
	sys	0m0.057s

Woohoo!!! Finally, we beat the system's wc -l!! And by a pretty fair margin, too.
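In case you want to reproduce these measurements, the build and run steps would look something like this (the executable name and the test file here are just placeholders; substitute whatever you use):

	gdc -O3 -o linecount linecount.d
	time ./linecount bigdata.csv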
Eat your heart out, wc!!!

(The large user time is because we're using all 6 cores at once. But the actual elapsed time is shorter.)

Here's the code for all of the above:

---------snip---------
import std.stdio;

// real	0m0.422s
// user	0m0.366s
// sys	0m0.054s
size_t lineCount1(string filename)
{
    import std.algorithm.searching : count;
    import std.mmfile;

    auto f = new MmFile(filename);
    auto data = cast(ubyte[]) f[];
    return data.count('\n');
}

// real	0m0.519s
// user	0m0.320s
// sys	0m0.198s
size_t lineCount2(string filename)
{
    import std.algorithm.searching : count;

    auto f = File(filename);
    ubyte[] buf;
    size_t c = 0;

    buf.length = 32768;
    do
    {
        auto d = f.rawRead(buf);
        c += d.count('\n');
    } while (!f.eof());
    return c;
}

// real	0m0.421s
// user	0m0.379s
// sys	0m0.042s
size_t lineCount3(string filename)
{
    import std.mmfile;

    auto f = new MmFile(filename);
    auto data = cast(ubyte[]) f[];
    size_t c;
    foreach (i; 0 .. data.length)
    {
        if (data[i] == '\n')
            c++;
    }
    return c;
}

// real	0m0.242s
// user	0m1.151s
// sys	0m0.057s
size_t lineCount4(string filename)
{
    import std.algorithm.comparison : min;
    import std.algorithm.iteration : sum;
    import std.mmfile;
    import std.parallelism;

    auto f = new MmFile(filename);
    auto data = cast(ubyte[]) f[];
    size_t[] subtotals;

    enum blockSize = 16384;
    if (data.length == 0) return 0;

    // One subtotal per 16KB block, rounding up for a partial last block.
    subtotals.length = 1 + (data.length - 1) / blockSize;
    foreach (j, ref subtotal; parallel(subtotals))
    {
        size_t end = min(blockSize*(j+1), data.length);
        foreach (i; blockSize*j .. end)
        {
            if (data[i] == '\n')
                subtotal++;
        }
    }
    return subtotals.sum;
}

int main(string[] args)
{
    if (args.length < 2)
    {
        stderr.writeln("Please specify target file.");
        return 1;
    }

    auto filename = args[1];
    //auto count = lineCount1(filename);
    //auto count = lineCount2(filename);
    //auto count = lineCount3(filename);
    auto count = lineCount4(filename);
    writeln(count);
    return 0;
}

// vim:set sw=4 ts=4 et ai:
---------snip---------

T

-- 
BREAKFAST.COM halted...Cereal Port Not Responding. -- YHL
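P.S. Just for illustration, the same divide-and-conquer idea behind lineCount4 could also be written in range style, with std.range.chunks feeding std.parallelism's taskPool.reduce. A rough sketch (I haven't benchmarked this variant, and lineCount5 is just a made-up name):

---------snip---------
size_t lineCount5(string filename)
{
    import std.algorithm.iteration : map;
    import std.algorithm.searching : count;
    import std.mmfile;
    import std.parallelism : taskPool;
    import std.range : chunks;

    auto f = new MmFile(filename);
    auto data = cast(ubyte[]) f[];

    // Lazily split the input into 16KB chunks, count the newlines in
    // each chunk, and let the task pool sum the per-chunk counts in
    // parallel.
    return taskPool.reduce!"a + b"(cast(size_t) 0,
            data.chunks(16384).map!(c => c.count('\n')));
}
---------snip---------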