I've come across a situation whereby file transfers consistently fail from an httpd server. On the one hand it's a bit of an edge case, but on the other it definitely seems to be incorrect behaviour. I'm sure this must have been discussed before, but I couldn't find anything much on the list with an admittedly fairly brief search.
Essentially, server/mpm/event/event.c waits in the lingering-close state for MAX_SECS_TO_LINGER (which is defined as 30) before forcibly closing the connection. If there's still unacknowledged write data in the kernel socket buffer at that point, the connection fails.

Luckily, the conditions under which this can happen are fairly limited - essentially it amounts to the receiver not being able to accept data quickly enough. On Linux at least, the default write buffer for a socket seems to be 212992 bytes (well, that's /proc/sys/net/core/wmem_(default|max); the actual value used will be less - the manpage suggests half, though my experiments don't bear that out). For 212992 bytes to drain in 30 seconds, the transfer speed needs to be at least 57 kbit/s. Whilst that's pretty slow, remember that there could be many simultaneous connections sharing the link, so the size of the pipe at which issues start to appear could be considerably larger. Of course, the file(s) being transferred also need to be big enough to fill that buffer - with smaller files, less data is queued, so the problem only appears at even lower transfer rates.

As a simple test case for all this, I set up a web server, a client machine, and two routers in between to act as a WAN emulator. On each of the (Linux) routers I did:

    tc qdisc add dev eth2 root netem limit 100000 rate 1000kbit

(eth2 is obviously the "WAN" interface.) Issuing 20 simultaneous "wget" commands from the client machine to fetch a 1M file with no retries resulted in 14 of them failing. It actually seems to start struggling at 8 simultaneous connections and above - this is with a fairly default compilation of httpd from source.

On Linux at least, you can see how much unsent data remains by querying the SIOCOUTQ ioctl, so a mitigation would be to check whether ANY data is draining at all, and if so (and there's some left) extend the lingering-close time and repeat.
However, this wouldn't be a cross-platform solution - I'm not sure whether other systems have an equivalent of SIOCOUTQ - but it would at least be the "correct" thing to do in terms of network behaviour.

Adam