>> Given the issues about using mmap, can anybody suggest how I should
>> proceed with the implementation, or if I should at all?
> There are two potential ways where mmap(2) could help improve the speed
> of sort:

> - If you know the input file name, use a read-only mmap() of that file
>   and avoid all buffering.  Downside: you can not store \0 at the
>   end of a line anymore and need to deal with char*/size_t pairs for
>   strings.

Actually, if you mmap it PROT_WRITE and MAP_PRIVATE, you could go right
ahead.  But that'll cost RAM or swap space when the COW fault happens.

It also works only when the input file fits into VM; to rephrase part of
what I wrote yesterday on tech-kern, sorting a file bigger than 4G on a
32-bit port shouldn't break.

> - You use "swap space" instead of a temporary file by doing
>   mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_ANNON, -1, 0);

(Well, MAP_ANON.)  Yes, but that has issues.  The size of an mmap()ped
area is fixed, set at map time, whereas file sizes grow dynamically.  I
suspect that trying to use mmap instead of temp files would amount to
implementing a rudimentary ramfs.

Furthermore, if the dataset fits in RAM, I'd say you shouldn't be using
the temporary-space paradigm at all; just slurp it in and sort it in
core.  And if it fits in VM but not RAM, given the way swap is tuned for
general-purpose use instead of the kind of access patterns sort
exhibits, I suspect temp files might end up being more performant.  And
if the dataset doesn't fit in VM, you'll need temp files regardless.

If this does go in, I really think it needs an option to suppress it.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!             7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B