Micah Cowan <[EMAIL PROTECTED]> writes:

> Yes, but when mmap()ping with MAP_PRIVATE, once you actually start
> _using_ the mapped space, is there much of a difference?
As long as you don't write to the mapped region, there should be no
difference between shared and private mapped space -- that's what copy
on write (explicitly documented for MAP_PRIVATE in both the Linux and
Solaris mmap man pages) is supposed to accomplish.  I could have used
MAP_SHARED, but at the time I believe there was still code that relied
on being able to write to the buffer.  That code was subsequently
removed, but MAP_PRIVATE stayed because I saw no point in removing it.
Given the semantics of copy on write, I figured there would be no
difference between MAP_SHARED and unwritten-to MAP_PRIVATE.

As for the memory footprint getting large: sure, Wget reads through it
all, but that is no different from what, say, grep --mmap does.  As
long as we don't jump backwards in the file, the OS can swap out the
parts we are done with.  Another difference between mmap and malloc is
that mmap'ed space can be reliably returned to the system.  Using mmap
pretty much guarantees that Wget's footprint won't grow to 1GB unless
you're actually reading a 1GB file, and even then much less will be
resident.

> mmap() isn't failing; but wget's memory space gets huge through the
> simple use of memchr() (on '<', for instance) on the mapped address
> space.

Wget's virtual memory footprint does get huge, but the resident memory
needn't.  memchr only accesses memory sequentially, so the swap-out
scenario above applies.  More importantly, in this case the report
documents "failing to allocate -2147483648 bytes", which is a malloc or
realloc error, completely unrelated to mapped files.

> Still, perhaps a better way to approach this would be to use some
> sort of heuristic to determine whether the file looks to be HTML.
> Doing this reliably without breaking real HTML files will be
> something of a challenge, but perhaps requiring that we find
> something that looks like a familiar HTML tag within the first 1k or
> so would be appropriate.  We can't expect well-formed HTML, of
> course, so even requiring an <HTML> tag is not reasonable: but
> finding any tag whatsoever would be something to start with.

I agree in principle, but I'd still like to know exactly what went
wrong in the reported case.  I suspect it's not just a case of
mmapping a huge file, but a case of misparsing it, for example by
attempting to extract a "URL" hundreds of megabytes long.
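For what it's worth, -2147483648 is exactly INT_MIN, which is what you
get when a byte count kept in a 32-bit signed int grows past 2GB --
for instance a 1GB buffer being doubled.  A contrived illustration of
just the arithmetic (not Wget's actual code, and the exact wording of
the error message is only approximated):

    /* Contrived illustration: a size that is tracked in a 32-bit int
       and grows past 2GB wraps around to INT_MIN, which is exactly
       the number in the reported message.  */
    #include <limits.h>
    #include <stdio.h>

    int
    main (void)
    {
      long long wanted = 2LL * 1024 * 1024 * 1024;  /* e.g. a 1GB buffer doubled */
      int size = (int) wanted;  /* out of range for int; on the usual
                                   two's-complement systems this
                                   becomes -2147483648 */

      printf ("failing to allocate %d bytes\n", size);
      printf ("(INT_MIN is %d)\n", INT_MIN);
      return 0;
    }

That would be consistent with the misparsing theory: something
computed a length in the 2GB range and pushed it through an int on its
way to malloc or realloc.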
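To make the mmap point above concrete, here is roughly the access
pattern I'm describing, as a minimal standalone sketch rather than the
actual Wget code: map the file MAP_PRIVATE with PROT_READ, never write
to it, and scan it sequentially with memchr.  Since nothing is ever
written, copy on write never creates private pages, and since the scan
is strictly sequential the OS is free to evict the pages already read.

    /* Minimal sketch, not Wget's code: map a file read-only and
       private, then scan it sequentially with memchr.  */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int
    main (int argc, char **argv)
    {
      struct stat st;
      int fd;

      if (argc != 2 || (fd = open (argv[1], O_RDONLY)) < 0)
        return 1;
      if (fstat (fd, &st) < 0 || st.st_size == 0)
        return 1;

      char *map = mmap (NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
      if (map == MAP_FAILED)
        return 1;

      /* Count '<' characters, touching every page once, in order.  */
      long count = 0;
      const char *p = map, *end = map + st.st_size;
      while ((p = memchr (p, '<', end - p)) != NULL)
        {
          ++count;
          ++p;
        }
      printf ("%ld '<' characters\n", count);

      munmap (map, st.st_size);
      close (fd);
      return 0;
    }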
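As for the tag-sniffing heuristic, I imagine something along these
lines would be a starting point -- a purely hypothetical sketch, not
code that exists in Wget: look at the first 1k of the buffer and
accept the file if it contains anything that looks like the start of a
tag, i.e. '<' followed by a letter, '!' or '/'.

    /* Hypothetical sketch of the heuristic: does the first 1k of BUF
       contain anything that looks like the start of an HTML tag?  */
    #include <ctype.h>
    #include <string.h>

    static int
    looks_like_html (const char *buf, size_t len)
    {
      const char *p = buf;
      const char *end = buf + (len < 1024 ? len : 1024);

      while ((p = memchr (p, '<', end - p)) != NULL && p + 1 < end)
        {
          unsigned char c = (unsigned char) p[1];
          if (isalpha (c) || c == '!' || c == '/')
            return 1;           /* <html>, <!DOCTYPE ...>, </p>, ... */
          ++p;
        }
      return 0;
    }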