Hi all,
I've been investigating how sort behaves when sorting big files
(larger than RAM). I'm looking at the system calls and the actual
read- and write-related instructions the application issues.
To handle files that don't fit in memory, my guess is that sort
implements some kind of external merge sort. And indeed, I can see
sort load chunks of the big file, sort them, and write them back to
disk once sorted.
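To make sure we're talking about the same thing, here is roughly the
run-generation phase I have in mind - a minimal C sketch of the
textbook algorithm, NOT sort's actual code; CHUNK_LINES, LINE_MAX_LEN,
and the run-file naming are all made up, and it assumes
newline-terminated records on stdin:

    /* Run generation for an external merge sort: read a chunk of
       lines, sort it in RAM, write it out as a sorted temp file.
       All constants and names here are hypothetical, not coreutils. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define CHUNK_LINES 4096
    #define LINE_MAX_LEN 1024

    static int cmp_lines(const void *a, const void *b)
    {
        return strcmp(*(const char *const *)a, *(const char *const *)b);
    }

    int main(void)
    {
        char *lines[CHUNK_LINES];
        char buf[LINE_MAX_LEN];
        int run = 0;

        for (;;) {
            size_t n = 0;
            while (n < CHUNK_LINES && fgets(buf, sizeof buf, stdin))
                lines[n++] = strdup(buf);
            if (n == 0)
                break;                    /* no more input */

            qsort(lines, n, sizeof *lines, cmp_lines);  /* sort chunk */

            char name[64];
            snprintf(name, sizeof name, "run-%d.tmp", run++);
            FILE *out = fopen(name, "w");
            if (!out)
                return 1;
            for (size_t i = 0; i < n; i++) {
                fputs(lines[i], out);
                free(lines[i]);
            }
            fclose(out);
        }
        return 0;
    }

That mental model is what my two questions below are measured against.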
But looking a bit further, there are two things I don't understand:
1) There are plenty of reads into buffers of varying sizes (49152,
36864, 28672 bytes, for example) - why? I mean, why doesn't it just
read the whole chunk at once and then merge-sort it?
2) After splitting and sorting each chunk, sort needs to merge them
into one sorted output. When I examine the instructions, it seems like
sort makes A LOT of accesses to those buffers in order to merge them,
at almost the same intensity as when it sorts each chunk - why?
Doesn't it just need to look at the head of each run, pick the
smallest value, put it in an output buffer, and then write that to the
final output file (see the sketch right after this list)?
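For reference, this is the naive merge I was picturing - again a
sketch under my own assumptions (NRUNS runs named run-%d.tmp as in the
previous snippet, newline-terminated records, and a plain linear scan
over the run heads), not whatever sort actually does:

    /* Naive k-way merge: keep one buffered line per sorted run,
       repeatedly emit the smallest head and refill it from its file.
       NRUNS and the file names are hypothetical. */
    #include <stdio.h>
    #include <string.h>

    #define NRUNS 4
    #define LINE_MAX_LEN 1024

    int main(void)
    {
        FILE *in[NRUNS];
        char head[NRUNS][LINE_MAX_LEN];
        int live[NRUNS];

        /* Open each sorted run and buffer its first line. */
        for (int i = 0; i < NRUNS; i++) {
            char name[64];
            snprintf(name, sizeof name, "run-%d.tmp", i);
            in[i] = fopen(name, "r");
            live[i] = in[i] && fgets(head[i], LINE_MAX_LEN, in[i]);
        }

        for (;;) {
            /* Linear scan for the smallest head line. */
            int min = -1;
            for (int i = 0; i < NRUNS; i++)
                if (live[i] && (min < 0 || strcmp(head[i], head[min]) < 0))
                    min = i;
            if (min < 0)
                break;                    /* all runs exhausted */

            fputs(head[min], stdout);     /* emit smallest, then refill */
            live[min] = fgets(head[min], LINE_MAX_LEN, in[min]) != NULL;
        }

        for (int i = 0; i < NRUNS; i++)
            if (in[i])
                fclose(in[i]);
        return 0;
    }

With only one buffered line per run, I'd expect the merge phase to
touch memory far less than the chunk-sorting phase, which is exactly
why the access intensity I observed puzzles me.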
I've been digging through the code, but unfortunately haven't found
the answers there yet.
I would really appreciate your help in understanding how sort is
actually implemented.
Thanks!
Gil Shomron.