Ralf Wildenhues wrote:
>
> I've looked at this in a bit more detail; no big conclusion but maybe
> a few more hints that could help.
>
> I am now pretty confident that your patch implements the threading
> correctly. When inserting some
>
>   expensive_computation ();
>
> in the sortlines function right after `swap' is computed, then all
> timings look good (in relation to each other). For the expensive
> computation, I used a function in a separate translation unit doing
> some floating point, and compiled that without optimization.
>
> 'valgrind --tool=cachegrind' shows a very low cache miss rate of 0.2%
> max with both 1 and 2 threads, indicating there isn't anything funny
> going on here.
>
> So I think the only remaining issues that could be relevant for the
> original timings are, IMHO:
>
>   - false data sharing. I don't see where that could happen,
>   - memory bandwidth limitations of the specific system, or
>   - suboptimal NUMA memory distribution.
>
> That said, I did manage to get 'valgrind --tool=helgrind sort
> --threads=2' to show a potential error, with a large data set. I'm
> quite sure it is a false positive, but I post the output below for
> reference. The 'valgrind --tool=drd' tool could not reproduce this,
> though.
Thanks for looking at this Ralf! Lots of valuable info there...

> Anyway. I tried another experiment. I replicated some large data set
> 8 times. I then timed a manual merge:
>
>   sort -m <( sort data ) <( sort data ) ... (8 times) >/dev/null
>
> and compared that with:
>
>   sort --threads=P 8-fold-data
>
>   manual merge:  16.75 s
>   --threads=8:   44.11 s
>   --threads=1:   81.32 s
>
> This comparison isn't entirely fair, as the splicing was done as a
> precomputation. However, the difference is so pronounced that even
> taking the splicing into account, the non-thread version would be
> quite a bit faster. So to me, it would seem that some approach going
> in that direction could be beneficial.

Absolutely. That's one of the things Glen was going to work on this
summer, i.e. enhance split and perhaps have a helper script so as to
achieve the above more easily (there's a rough sketch at the end of
this mail).

I would suggest that any data that can be split, processed and
recombined doesn't need a threaded solution. A threaded solution, while
more complicated, may involve less data copying and thus be enough of a
performance win to be worth it; it may also be more transparent. But I
think the simple option should be explored first and tested with the
various scenarios I mentioned in a previous mail.

It would be useful to know where you stored the 8 sets of split data.
Were they files on different spindles, cached in RAM, on SSD, etc.?

> Other than that, have you thought of implementing something like
> described in <http://www.it.uom.gr/teaching/dbpp/text/node127.html>?

That link is timing out for me?

thanks again!
Pádraig.
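
P.S. Here's a rough, untested sketch of the kind of split/sort/merge
helper script I mean. The piece count, temp dir handling and sort
options are just placeholders:

  #!/bin/sh
  # Sketch: split the input into pieces on line boundaries, sort each
  # piece in its own process, then merge the sorted pieces with sort -m.
  input=$1
  pieces=8

  tmp=$(mktemp -d) || exit 1
  trap 'rm -rf "$tmp"' EXIT

  # Split into roughly equal pieces without breaking lines.
  lines=$(wc -l < "$input")
  split -l $(( (lines + pieces - 1) / pieces )) "$input" "$tmp/chunk."

  # Sort each piece in its own process.
  for f in "$tmp"/chunk.*; do
    sort -o "$f.sorted" "$f" &
  done
  wait

  # Merge the already-sorted pieces to stdout.
  sort -m "$tmp"/chunk.*.sorted

A real helper would also need to forward sort's options (keys, field
separators, etc.) to both the per-piece sorts and the final merge.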
