>> Given the issues about using mmap, can anybody suggest how I should
>> proceed with the implementation, or if I should at all?
> There are two potential ways where mmap(2) could help improve the speed
> of sort:

> - If you know the input file name, use a read-only mmap() of that file
>   and avoid all buffering.  Downside: you can not store \0 at the
>   end of a line anymore and need to deal with char*/size_t pairs for
>   strings.

Actually, if you mmap it PROT_WRITE and MAP_PRIVATE, you could go right
ahead.  But that'll cost RAM or swap space when the COW fault happens.

It also works only when the input file fits into VM; to rephrase part of
what I wrote yesterday on tech-kern, sorting a file bigger than 4G on a
32-bit port shouldn't break.

> - You use "swap space" instead of a temporary file by doing
>   mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_ANNON, -1, 0);

(Well, MAP_ANON.)  Yes, but that has issues.  The size of an mmap()ped
area is fixed, set at map time, whereas file sizes grow dynamically.  I
suspect that trying to use mmap instead of temp files would amount to
implementing a rudimentary ramfs.

Furthermore, if the dataset fits in RAM, I'd say you shouldn't be using
the temporary-space paradigm at all; just slurp it in and sort it in
core.  And if it fits in VM but not RAM, given the way swap is tuned for
general-purpose use instead of the kind of access patterns sort
exhibits, I suspect temp files might end up being more performant.  And
if the dataset doesn't fit in VM, you'll need temp files regardless.

If this does go in, I really think it needs an option to suppress it.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!             7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B