On Wednesday, 13 May 2015 at 13:40:33 UTC, John Colvin wrote:
On Wednesday, 13 May 2015 at 11:33:55 UTC, John Colvin wrote:
On Tuesday, 12 May 2015 at 18:14:56 UTC, Gerald Jansen wrote:
On Tuesday, 12 May 2015 at 16:35:23 UTC, Rikki Cattermole
wrote:
On 13/05/2015 4:20 a.m., Gerald Jansen wrote:
At the risk of great embarrassment ... here's my program:
http://dekoppel.eu/tmp/pedupg.d
Would it be possible to give us some example data?
I might give it a go to try rewriting it tomorrow.
http://dekoppel.eu/tmp/pedupgLarge.tar.gz (89 MB)
Contains two largish datasets in a directory structure
expected by the program.
I only see 2 traits in that example, so it's hard for anyone
to explore your scaling problem, since there can be at most
2 tasks.
Either way, a few small changes were enough to cut the runtime
by a factor of ~6 in the single-threaded case and improve the
scaling a bit, although the printing to output files still
looks like a bit of a bottleneck.
http://dpaste.dzfl.pl/80cd36fd6796
The key thing was reducing the number of allocations (more
std.algorithm.splitter copying to static arrays, less
std.array.split) and avoiding File.byLine. Other people in this
thread have mentioned alternatives to it that may be faster or
have lower memory usage; I just read the whole files into
memory and then lazily split them with
std.algorithm.splitter. I ended up with some blank lines coming
through, so I added if(line.empty) continue; in a few places.
You might want to look more carefully at that; it could be my
mistake.
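The pattern described above could be sketched roughly like this
(a minimal, hypothetical example, not the actual pedupg.d code;
the file name and field count are made up):

```d
import std.algorithm : splitter;
import std.file : readText;
import std.range : empty;

void main()
{
    // One allocation for the whole file instead of per-line buffers.
    string text = readText("data.txt");

    foreach (line; text.splitter('\n'))
    {
        // A file ending in a newline yields a trailing empty line.
        if (line.empty) continue;

        // Static array: no per-line GC allocation, unlike std.array.split,
        // which allocates a fresh array of slices for every line.
        string[8] fields;
        size_t n;
        foreach (field; line.splitter(' '))
        {
            if (n == fields.length) break;
            fields[n++] = field; // slices into `text`, no copying of characters
        }
        // ... process fields[0 .. n] ...
    }
}
```

The point is that splitter is lazy and its results are slices of the
original buffer, so the only real allocation is the initial readText.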
The use of std.array.appender for `info` is just good practice,
but it doesn't make much difference here.
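For reference, the appender idiom mentioned above looks like this
(a generic sketch; `info` here is just an illustrative string
accumulator, not the program's actual type):

```d
import std.array : appender;

void main()
{
    auto info = appender!(string[]);
    info.reserve(1024);       // optional: pre-size to avoid regrowth
    foreach (i; 0 .. 10)
        info.put("line");     // amortized appends; fewer reallocations than ~=
    auto result = info.data;  // the accumulated string[]
}
```

Appender keeps its own capacity, so repeated puts don't reallocate the
way naive `arr ~= x` in a loop can.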
Wow, I'm impressed with the effort you guys (John, Rikki, others)
are making to teach me some efficiency tricks. I guess this is
one of the strengths of D: its community. I'm studying your
various contributions closely!
The empty line comes from the very last line of the files, which
also end with a newline (as per "normal" practice?).
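That matches how splitter behaves: when the input ends with the
separator, an empty element is yielded at the end. A tiny
illustration:

```d
import std.algorithm : splitter;
import std.array : array;

void main()
{
    string text = "a,b\nc,d\n";          // file content ending in a newline
    auto lines = text.splitter('\n').array;
    assert(lines == ["a,b", "c,d", ""]); // the final "" is the blank line
}
```

Hence the `if(line.empty) continue;` guard rather than anything wrong
in the data itself.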