On Wednesday, 13 May 2015 at 13:40:33 UTC, John Colvin wrote:
On Wednesday, 13 May 2015 at 11:33:55 UTC, John Colvin wrote:
On Tuesday, 12 May 2015 at 18:14:56 UTC, Gerald Jansen wrote:
On Tuesday, 12 May 2015 at 16:35:23 UTC, Rikki Cattermole wrote:
On 13/05/2015 4:20 a.m., Gerald Jansen wrote:
At the risk of great embarrassment ... here's my program:
http://dekoppel.eu/tmp/pedupg.d

Would it be possible to give us some example data?
I might have a go at rewriting it tomorrow.

http://dekoppel.eu/tmp/pedupgLarge.tar.gz (89 MB)

It contains two largish datasets in the directory structure expected by the program.

I only see 2 traits in that example, so it's hard for anyone to explore your scaling problem, since there can be at most 2 parallel tasks.

Either way, a few small changes were enough to cut the runtime by a factor of ~6 in the single-threaded case and improve the scaling a bit, although the printing to output files still looks like a bit of a bottleneck.


http://dpaste.dzfl.pl/80cd36fd6796

The key change was reducing the number of allocations: more std.algorithm.splitter copying into static arrays, less std.array.split, and avoiding File.byLine. Other people in this thread have mentioned alternatives to byLine that may be faster or use less memory; I just read the whole files into memory and then lazily split them with std.algorithm.splitter. I ended up with some blank lines coming through, so I added `if(line.empty) continue;` in a few places. You might want to look more carefully at that, as it could be my mistake.
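To illustrate the idea (this is only a sketch, not my actual diff, which is at the dpaste link above): read the whole file with std.file.readText, split it lazily, and reuse a fixed-size static array of slices instead of allocating a new array per line with std.array.split. The file name, field count and separator below are invented for the sketch:

import std.algorithm : splitter;
import std.file : readText;
import std.range : empty;

void main()
{
    // One read of the whole file instead of File.byLine's per-line work.
    auto text = readText("data.txt"); // hypothetical input file

    // Reused buffer of slices into `text`: no allocation per line,
    // unlike std.array.split, which allocates a new array every call.
    const(char)[][8] fields; // 8 columns is an assumption for the sketch

    foreach (line; text.splitter('\n'))
    {
        if (line.empty) continue; // trailing newline yields an empty element

        size_t n = 0;
        foreach (field; line.splitter(' '))
        {
            if (n == fields.length) break;
            fields[n++] = field;
        }
        // ... process fields[0 .. n] here ...
    }
}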

The use of std.array.appender for `info` is just good practice, but it doesn't make much difference here.
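For anyone unfamiliar: appender keeps a single growing buffer, so repeated appends avoid the repeated reallocation that naive `~=` on a plain array can cause. A minimal sketch (the contents are invented, not taken from pedupg.d):

import std.array : appender;

void main()
{
    auto info = appender!string();
    info.put("id value\n");   // amortized growth, no reallocation per append
    info.put("id2 value2\n");
    auto text = info.data;    // the accumulated string
}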

Wow, I'm impressed with the effort you guys (John, Rikki, others) are making to teach me some efficiency tricks. I guess this is one of the strengths of D: its community. I'm studying your various contributions closely!

The empty lines come from the very ends of the files, which terminate with a newline (as per "normal" practice?), so splitting on '\n' produces a final empty element.
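A quick check of that behaviour with std.algorithm.splitter:

import std.algorithm : splitter;
import std.array : array;

void main()
{
    // The trailing '\n' in "a\nb\n" produces a final empty element.
    assert("a\nb\n".splitter('\n').array == ["a", "b", ""]);
}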
