Hi coreutils hackers,
I see this topic has come up on the list before (searching for NMERGE in the archives), but in coreutils 5.3.0 sort.c, temporary files are still merged in a linear-search fashion. Admittedly I am new to this list and haven't read every note, but I didn't find a reason for this... If a patch like this has been rejected in the past, I am interested to know the reason. I have done quite a lot of testing with sort.c and large files, and the linear-search merge code can become a cpu bottleneck if NMERGE is turned up to avoid extra merge passes. However, for the case where there are a lot of duplicate keys or the file is already sorted, the next key is very often found in the same temp file, so a linear search is fastest, and implementing a binary search for every case would slow down sort on mostly-already-sorted input (it happens all the time). I've coded a patch to sort.c from coreutils 5.3.0 for your careful consideration (and my use ;) to perform a binary search in the merge pass, only if the key was not found in the first merge temp file. This removes the CPU bottleneck for large NMERGE values (even 1024 is OK!) and doesn't have much of a performance impact on NMERGE=16, which is the default... but even at 16, it's quite better at merging temp-files of random keys and not more than a percent worse on already-sorted input files, due to the extra check to see if this is the first file or not. The case where the input is already sorted, or there are lots of dupe keys, performs about the same because the binary search is not performed if the key is found in the first merge file. The upshot is you can now sort that 100GB file quite a lot faster with only one merge pass, if you recompile with large enough NMERGE. Rah! Attached is my patch, which I think you will find performs quite well on random keys and about the same on an already sorted files, and doesn't affect in-memory sorts at all. If this is of interest, FSF may have copyright to this patch in return for a much-coveted line in the THANKS file. Thanks very much, James Lemley ********************************************************************** The information contained in this communication is confidential, is intended only for the use of the recipient named above, and may be legally privileged. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is strictly prohibited. If you have received this communication in error, please re-send this communication to the sender and delete the original message or any copy of it from your computer system. Thank You.
sort-5.3.0_merge.patch
Description: Binary data
_______________________________________________ Bug-coreutils mailing list Bug-coreutils@gnu.org http://lists.gnu.org/mailman/listinfo/bug-coreutils