Hello Paul, Glen, Jim, [ I already wrote this in part to Glen off-list, sorry for the duplication ]
* Paul Eggert wrote on Fri, Apr 03, 2009 at 09:57:54PM CEST: > Of course we cannot reasonably expect this one performance improvement > to make 'sort' run 16x faster on a 16-CPU machine. No. Let's discuss the 1 vs. 2 threads issue for now: the part running concurrently takes O(N log N) work. The parts not running concurrently are: startup code, file reading, list splitting, list merging, and output. All of the latter are O(N) or less, with N being the input size. The concurrent parts do not require any communication/synchronization between the 2 threads. This means that a parallelization which doesn't hit the memory bus bandwidth limit in the concurrent stage, has to have (slowly) improving 1:2 thread behavior when N grows. But the experiment does not show that. This indicates a possible issue. To exclude bandwidth issues, one can try to run this code on a system which has as big as possible memory busses, and e.g., run the threads on separate processors (as opposed to only separate cores or HT threads). I'm not trying to veto this patch, it is certainly an improvement over not having it, and improving things in steps is good; but I'm sure it could be better even without parallelizing the merge part of the algorithm. I already noted off-list to Glen that you too can get an account on the GCC compile farm, which has some decent multi-way systems. But for the above issue a bigger system isn't necessary. Cheers, Ralf _______________________________________________ Bug-coreutils mailing list [email protected] http://lists.gnu.org/mailman/listinfo/bug-coreutils
