Ralf Wildenhues wrote:
>
> I've looked at this in a bit more detail; no big conclusion but maybe
> a few more hints that could help.
>
> I am now pretty confident that your patch implements the threading
> correctly. When inserting some
>
>   expensive_computation ();
>
> in the sortlines function right after `swap' is computed, then all
> timings look good (in relation to each other). For the expensive
> computation, I used a function in a separate translation unit doing
> some floating point, and compiled that without optimization.
>
> 'valgrind --tool=cachegrind' shows a very low cache miss rate of 0.2%
> max with both 1 and 2 threads, indicating there isn't anything funny
> going on here.
>
> So I think the only remaining issues that could be relevant for the
> original timings are, IMHO:
>
>   - false data sharing. I don't see where that could happen,
>   - memory bandwidth limitations of the specific system, or
>   - suboptimal NUMA memory distribution.
>
> That said, I did manage to get 'valgrind --tool=helgrind sort
> --threads=2' to show a potential error, with a large data set. I'm
> quite sure it is a false positive, but I post the output below for
> reference. The 'valgrind --tool=drd' tool could not reproduce this,
> though.
Thanks for looking at this Ralf! Lots of valuable info there...

> Anyway. I tried another experiment. I replicated some large data set
> 8 times. I then timed a manual merge:
>
>   sort -m <( sort data ) <( sort data ) ... (8 times) >/dev/null
>
> and compared that with:
>
>   sort --threads=P 8-fold-data
>
>   manual merge:  16.75 s
>   --threads=8:   44.11 s
>   --threads=1:   81.32 s
>
> This comparison isn't entirely fair, as the splicing was done as a
> precomputation. However, the difference is so pronounced that even
> taking the splicing into account, the non-thread version would be
> quite a bit faster. So to me, it would seem that some approach going
> in that direction could be beneficial.

Absolutely. That's one of the things Glen was going to work on this
summer, i.e. enhance split and perhaps have a helper script so as to
achieve the above more easily (there's a rough sketch at the end of
this mail).

I would suggest that any data that can be split, processed and
recombined doesn't need a threaded solution. A threaded solution, while
more complicated, may involve less data copying and thus be enough of a
performance win to be worth it; it may also be more transparent. But I
think the simple option should be explored first and tested with the
various scenarios I mentioned in a previous mail.

It would be useful to know where you stored the 8 sets of split data.
Were they files on different spindles, cached in RAM, on SSD, etc.?

> Other than that, have you thought of implementing something like
> described in <http://www.it.uom.gr/teaching/dbpp/text/node127.html>?

That link is timing out for me?

thanks again!
Pádraig.
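
P.S. Here's a rough, untested sketch of the kind of split/sort/merge
helper script I mean. The piece count, temp dir handling and sort
options are just placeholders:

  #!/bin/sh
  # Sketch: split the input into pieces on line boundaries, sort each
  # piece in its own process, then merge the sorted pieces with sort -m.
  input=$1
  pieces=8

  tmp=$(mktemp -d) || exit 1
  trap 'rm -rf "$tmp"' EXIT

  # Split into roughly equal pieces without breaking lines.
  lines=$(wc -l < "$input")
  split -l $(( (lines + pieces - 1) / pieces )) "$input" "$tmp/chunk."

  # Sort each piece in its own process.
  for f in "$tmp"/chunk.*; do
    sort -o "$f.sorted" "$f" &
  done
  wait

  # Merge the already-sorted pieces to stdout.
  sort -m "$tmp"/chunk.*.sorted

A real helper would also need to forward sort's options (keys, field
separators, etc.) to both the per-piece sorts and the final merge.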
