Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
On Mar 25, 2010, at 15:03, Bardur Arantsson wrote:
> On 2010-02-24 20:50, Brandon S. Allbery KF8NH wrote:
>> tcpdump 'host ps3 and tcp[tcpflags] & 0x27 != 0'
>
> The only striking thing I can see about the dump is that there are 22 (conspicuously close to 16) sequences like:
>
> 19:45:30.135291 IP 192.168.0.115.64931 > gwendolyn.9000: Flags [R], seq 2112225068, win 0, length 0
> 19:45:30.135295 IP 192.168.0.115.64931 > gwendolyn.9000: Flags [R], seq 2112225068, win 0, length 0
> 19:45:30.135299 IP 192.168.0.115.64931 > gwendolyn.9000: Flags [R], seq 2112225068, win 0, length 0
> 19:45:30.135302 IP 192.168.0.115.64931 > gwendolyn.9000: Flags [R], seq 2112225068, win 0, length 0

The above is a single socket: the source and destination ports are the same for all 4 traces.

More useful, from the dump, is:

19:44:41.774161 IP 192.168.0.115.65265 > gwendolyn.9000: Flags [F.], seq 231, ack 1073301, win 41124, options [nop,nop,TS val 0 ecr 95041042], length 0

which is where the PS/3 sent a FIN telling gwendolyn to close the socket. It then follows that with a bunch of RST packets, the first of which is in sequence with the above FIN (suggesting the PS/3 responded to the continued attempt to send by dropping the socket on the floor instead of by resending the FIN) and the rest are "this port is closed" RSTs, presumably due to 22 attempts to continue sending data.

This is somewhat poor on the part of the PS/3, but understandable given that it's essentially an embedded device.

It would be interesting to see what the data around there was, but that's not easy to do without recording all of it.

-- 
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university    KF8NH

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
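To make the capture filter in the message above concrete, here is a minimal sketch of what masking the TCP flags byte with 0x27 selects. It assumes the filter has the usual `tcp[tcpflags] & 0x27 != 0` form; the names `filterMask` and `captured` are made up for illustration and are not part of tcpdump or any library.

```haskell
import Data.Bits ((.&.), (.|.))
import Data.Word (Word8)

-- TCP flag bits as laid out in the flags byte of the TCP header.
finBit, synBit, rstBit, ackBit, urgBit :: Word8
finBit = 0x01
synBit = 0x02
rstBit = 0x04
ackBit = 0x10
urgBit = 0x20

-- The mask from the filter: FIN, SYN, RST and URG together give 0x27.
filterMask :: Word8
filterMask = finBit .|. synBit .|. rstBit .|. urgBit

-- True when a segment carrying these flags would pass the filter,
-- i.e. when at least one of the masked bits is set.
captured :: Word8 -> Bool
captured flags = flags .&. filterMask /= 0
```

Under that reading, the `[R]` and `[F.]` segments shown in the dump pass the filter, while ordinary data segments that carry only ACK (or ACK plus PSH) do not, which is why the bulk transfer itself is missing from the capture.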
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
On Feb 21, 2010, at 20:17, Jeremy Shaw wrote:
> The PS3 does do something though. If we were doing a write *and* read select on the socket, the read select would wake up. So, it is trying to notify us that something has happened, but we are not seeing it because we are only looking at the write select().

Earlier the OP claimed this would happen within a few minutes if he seeked in a movie. If it's that reproducible, it should be easy to capture a tcpdump and attach it to an email (or pastebin it), allowing us to determine what really happens.

Also, Donn, you are incorrect about invalidating premises; we know the connection is going away, and we can infer it's not going away normally. That's why there have been comments about it sending a FIN and dropping the connection entirely (bypassing the shutdown handshake), or sending an RST, etc.

(I'd also be interested in finding out if OpenSolaris or FreeBSD has the same problem, but that may be too difficult to test easily. I still find it highly unlikely that loss of a connection only wakes the read end in general, and would absolutely not be surprised if this were some odd corner case in the Linux TCP stack. Sadly, I don't have a PS3 (yet, if ever) and I don't know of any streaming software for non-hacked Wiis.)

-- 
brandon s. allbery
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
Quoth Brandon S. Allbery KF8NH allb...@ece.cmu.edu,
> On Feb 21, 2010, at 20:17, Jeremy Shaw wrote:
>> The PS3 does do something though. If we were doing a write *and* read select on the socket, the read select would wake up. So, it is trying to notify us that something has happened, but we are not seeing it because we are only looking at the write select().
>
> Earlier the OP claimed this would happen within a few minutes if he seeked in a movie. If it's that reproducible, it should be easy to capture a tcpdump and attach it to an email (or pastebin it), allowing us to determine what really happens.
>
> Also, Donn, you are incorrect about invalidating premises; we know the connection is going away, we can infer it's not going away normally, that's why there have been comments about it sending a FIN and dropping the connection entirely (bypassing the shutdown handshake), or sending an RST, etc.

That's what I'm saying - it clearly is not a full close, i.e., going away normally per protocol. With luck maybe the packets will show that something does happen at a wire protocol level, and there will be a way to recognize the event at the `user land' level and plug that into the event loop. My prediction is that on the contrary, the transition between functional and defunct will not be announced in any way by the peer, but that's just guessing. It would be a lot less interesting.

	Donn
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
On Feb 23, 2010, at 23:47, Donn Cave wrote:
> My prediction is that on the contrary, the transition between functional and defunct will not be announced in any way by the peer, but that's just guessing. It would be a lot less interesting.

But that's not the issue. The *kernel* is clearly detecting it; the problem is it's only being reported for the *read* end of the socket, whereas sendfile() (correctly) only cares about, and therefore only registers interest in, the *write* end.

-- 
brandon s. allbery
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
Quoth Bardur Arantsson s...@scientician.net,
> Taru Karttunen wrote:
>> Excerpts from Bardur Arantsson's message of Wed Feb 17 21:27:07 +0200 2010:
>>> For sendfile, a timeout of 1 second would probably be fine. The *ONLY* purpose of threadWaitWrite in the sendfile code is to avoid busy-waiting on EAGAIN from the native sendfile.
>>
>> Of course this will kill connections for all clients that may have a two second network hiccup.
>
> I'm not talking about killing the connection. I'm talking about retrying sendfile() if threadWaitWrite has been waiting for more than 1 second. If the connection *has already been closed* (as detected by the OS), then sendfile() will fail with EBADF, and we're good.
...
> I don't see how that would lead to anything like what you describe.

If I understand correctly, we're talking about what it means for the OS to detect a closed connection. The proposal I think was to change the socket options to add keepalive, and to set a short timeout. This will indeed allow the OS to discover connections that didn't properly close, but are effectively closed in the sense that they are no use any more - a disconnected cable, or, as it sounds like, the PS3 routinely doing this out of negligence.

The problem is that this definition of `closed' is, precisely, `failed to respond within 2 seconds.' If there is no observable difference between a connection that has been abandoned by the PS3, and a connection that just suffered a momentary lapse, then there's no way to catch the former without making connections more fragile.

	Donn Cave d...@avvanta.com
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
On Feb 21, 2010, at 11:50 AM, Donn Cave wrote:
> The problem is that this definition of `closed' is, precisely, `failed to respond within 2 seconds.' If there is no observable difference between a connection that has been abandoned by the PS3, and a connection that just suffered a momentary lapse, then there's no way to catch the former without making connections more fragile.

No. (I think.) What happens is the PS3 has closed the connection, and if you attempt to send any more packets the PS3 will tell you it has closed the connection and the write() / sendfile() call will raise SIGPIPE. The problem is we never try to send those packets, because we are sitting at threadWaitWrite waiting to write -- and there is nothing that is going to happen that will cause that call to select() (by threadWaitWrite) to actually wake up.

I believe the proposal is to add a 2 second timeout to the threadWaitWrite call. If it wakes up and can't write (because the remote side has lost its connection, etc.) then it will just go back to sleep. But if it wakes up, tries to write, and then gets SIGPIPE, then it knows the connection is actually dead and will clean up after itself.

The problem is that we have not successfully figured out what is causing this issue in the first place. I wrote a Haskell server and a C client to try to emulate the situation which causes threadWaitWrite to never wake up, but I could not actually get that to happen. So far the PS3 client is the only thing that causes it.

I think that applying a fix without really understanding the problem is asking for trouble. Among other things, since the problem is with threadWaitWrite (not sendfile), the same issue ought to exist when we are calling hPutStr, etc., since they ultimately call threadWaitWrite as well. If hPut never has this problem, then we should understand why and use the same solution for sendfile. If hPut does have this problem, then fixing just sendfile isn't much of a solution.

So far there is:

 - no way for anyone besides Bardur to reproduce the problem
 - no sound explanation for why the PS3 client causes the error, but nothing else does
 - no proof that this error does or does not affect all the normal I/O functions in Haskell (hPut, etc.)

- jeremy
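The proposal described in the message above (bound the wait, then retry the write so that a dead peer shows up as an error) can be sketched roughly as follows. This is an illustration only, not the sendfile library's code: `retryWithBoundedWait` and `retryOnFd` are hypothetical names, and the write attempt is passed in as an action that returns `Nothing` when it would block (EAGAIN) and throws on a real error such as a broken pipe.

```haskell
import Control.Concurrent (threadWaitWrite)
import System.Posix.Types (Fd)
import System.Timeout (timeout)

-- Hypothetical helper: keep attempting a write, but never sit in the
-- writability wait for longer than maxWaitMicros at a time.
retryWithBoundedWait
  :: Int           -- longest single wait, in microseconds
  -> IO ()         -- wait for writability, e.g. threadWaitWrite fd
  -> IO (Maybe a)  -- one write attempt; Nothing means "would block"
  -> IO a
retryWithBoundedWait maxWaitMicros waitWritable attempt = loop
  where
    loop = do
      result <- attempt
      case result of
        Just x  -> return x
        Nothing -> do
          -- Whether the wait finished or timed out, try the write again.
          -- A dead connection then surfaces as an exception from attempt.
          _ <- timeout maxWaitMicros waitWritable
          loop

-- Example wiring for a raw file descriptor with the 2 second bound
-- mentioned in the thread.
retryOnFd :: Fd -> IO (Maybe a) -> IO a
retryOnFd fd = retryWithBoundedWait (2 * 1000000) (threadWaitWrite fd)
```

Note that this does not drop a connection that is merely slow: a timed-out wait only causes another write attempt, which returns "would block" again if the peer is still alive but not reading.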
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
On Sun, Feb 21, 2010 at 6:39 PM, Donn Cave d...@avvanta.com wrote:
> Quoth Jeremy Shaw jer...@n-heptane.com,
> ...
>> What happens is the PS3 has closed the connection, and if you attempt to send any more packets the PS3 will tell you it has closed the connection and the write() / sendfile() call will raise SIGPIPE.
> ...
>> So far there is:
>>  - no way for anyone besides Bardur to reproduce the problem
>>  - no sound explanation for why the PS3 client causes the error, but nothing else does
>
> I think in fact this invalidates your premise. If the PS3 really closed its connection in the standard fashion, then it would be trivial to reproduce this problem with any other peer. Evidently it doesn't, at least in this particular case, and that's why people are talking about TCP keep-alives, which address the defunct peer problem (within two hours, normally.)

The PS3 does do something though. If we were doing a write *and* read select on the socket, the read select would wake up. So, it is trying to notify us that something has happened, but we are not seeing it because we are only looking at the write select().

But I cannot explain what the PS3 client is doing differently than the other clients such that it does not cause the threadWaitWrite to wake up.

Additionally, it is not clear that setting SO_KEEPALIVE will actually fix anything. The documentation that I have read indicates that it may only cause the read select() to wake up, not the write select(). Well, that is no good, because that is supposedly what is happening with the PS3 client already.

Anyway, part of the annoyance here is that in this particular case we shouldn't need any timeouts to 'guess' that the client is 'probably dead'. The client seems to be telling us that it has disconnected, but we are not looking in the right place. And if we did try to write we would get a SIGPIPE error. It is not the case that the client is unresponsive -- it is quite responsive. The problem is that we are not looking in the right place for that response.

But 'looking in the right place' is tricky. How do you tell hPut that it should wake up from threadWaitWrite if the Handle happens to be backed by a socket, and threadWaitRead has data available? That does not even always indicate an error condition; it can be a perfectly valid situation.

Well, before I think about that, I want to know what the PS3 client is doing differently such that it is the only client that seems to exhibit this behavior at the moment. If we do not understand the real difference between what the PS3 and the C client are doing, then I don't think we can expect to arrive at an appropriate fix.

- jeremy
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
Excerpts from Bardur Arantsson's message of Wed Feb 17 21:27:07 +0200 2010:
> For sendfile, a timeout of 1 second would probably be fine. The *ONLY* purpose of threadWaitWrite in the sendfile code is to avoid busy-waiting on EAGAIN from the native sendfile.

Of course this will kill connections for all clients that may have a two second network hiccup.

> How so? As a user I expect sendfile to work and not semi-randomly block threads indefinitely.

If you want sending something to terminate you will add a timeout to it. A nasty client may e.g. take one byte each minute and sending your file may take a few years.

- Taru Karttunen
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
Excerpts from Bardur Arantsson's message of Tue Feb 16 23:48:14 +0200 2010:
>> This cannot be fixed in the sendfile library, it is a feature of TCP that connections may linger for a long time unless explicit timeouts are used.
>
> The problem is that the sendfile library *doesn't* wake up when the connection is terminated (because of threadWaitWrite) -- it doesn't matter what the timeout is.

Even server code without sendfile has the same issue since all writing to sockets ends up using threadWaitWrite.

System.Timeout.timeout terminates a threadWaitWrite using asynchronous exceptions.

If you want to detect dead sockets somewhat reliably without a timeout then there is SO_KEEPALIVE combined with polling SO_ERROR every few minutes.

- Taru Karttunen
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
On Wed, Feb 17, 2010 at 2:36 AM, Taru Karttunen tar...@taruti.net wrote:
> Excerpts from Bardur Arantsson's message of Tue Feb 16 23:48:14 +0200 2010:
>>> This cannot be fixed in the sendfile library, it is a feature of TCP that connections may linger for a long time unless explicit timeouts are used.
>>
>> The problem is that the sendfile library *doesn't* wake up when the connection is terminated (because of threadWaitWrite) -- it doesn't matter what the timeout is.
>
> Even server code without sendfile has the same issue since all writing to sockets ends up using threadWaitWrite.

Right, this is my concern -- I want to make sure that all of happstack is fixed, not just sendfile.

> System.Timeout.timeout terminates a threadWaitWrite using asynchronous exceptions.

So for sendfile, instead of threadWaitWrite we could do:

    r <- timeout (60 * 10^6) threadWaitWrite
    case r of
      Nothing   -> ... -- timed out
      (Just ()) -> ... -- keep going

It seems tricky to use timeout at a higher level in the code, because some requests may take a very long time to finish. For example, when serving a long video, or streaming music, it could be hours or days before the IO request finishes.

> If you want to detect dead sockets somewhat reliably without a timeout then there is SO_KEEPALIVE combined with polling SO_ERROR every few minutes.

This approach sounds promising because it seems like it could be incorporated into the guts of happstack-server. The timeout period could be a Config option with a reasonable default. I would be surprised if *any* happstack programs today are handling this correctly, so updating the core to do something reasonable would be a big improvement... And if someone has a special need where it is not ok, they can just change the config to use an infinite timeout...

Does that sound like the right fix to you?

(Obviously, if people are using sendfile with something other than happstack, it does not help them, but it sounds like trying to fix things in sendfile is misguided anyway.)

- jeremy
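As a rough illustration of the "timeout as a Config option, with an infinite timeout available" idea in the message above, the following sketch wraps an arbitrary wait action. Everything here is an assumption made for the example: `WriteConfig`, `writeTimeoutMicros`, `defaultWriteConfig`, `WaitResult` and `waitWith` are invented names and do not reflect happstack's actual configuration type. The only library call relied on is `System.Timeout.timeout`.

```haskell
import Control.Concurrent (threadWaitWrite)
import System.Posix.Types (Fd)
import System.Timeout (timeout)

-- Hypothetical configuration: Nothing means "wait forever".
data WriteConfig = WriteConfig
  { writeTimeoutMicros :: Maybe Int
  }

-- A 60 second default, matching the value in the snippet above.
defaultWriteConfig :: WriteConfig
defaultWriteConfig =
  WriteConfig { writeTimeoutMicros = Just (60 * 10 ^ (6 :: Int)) }

data WaitResult = Ready | TimedOut
  deriving (Eq, Show)

-- Run a wait action under the configured timeout and report which
-- of the two outcomes happened.
waitWith :: WriteConfig -> IO () -> IO WaitResult
waitWith cfg wait =
  case writeTimeoutMicros cfg of
    Nothing     -> wait >> return Ready
    Just micros -> maybe TimedOut (const Ready) `fmap` timeout micros wait

-- Example wiring for a raw file descriptor.
waitWritable :: WriteConfig -> Fd -> IO WaitResult
waitWritable cfg fd = waitWith cfg (threadWaitWrite fd)
```

What the caller does on `TimedOut` (retry the write, or give up on the connection) is a separate policy decision that the thread leaves open.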
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
On Wed, Feb 17, 2010 at 1:27 PM, Bardur Arantsson s...@scientician.net wrote:
>> (Obviously, if people are using sendfile with something other than happstack, it does not help them, but it sounds like trying to fix things in sendfile is misguided anyway.)
>
> How so? As a user I expect sendfile to work and not semi-randomly block threads indefinitely.

Because it only addresses *one* case when this type of blocking can happen.

Shouldn't hPut and friends also block indefinitely since they also use threadWaitWrite? If so, what good is just fixing sendfile, when all other network I/O will still block indefinitely?

If things are 'fixed' at a higher level, by using SO_KEEPALIVE, then does sendfile really need a hack to deal with it?

With your proposed fix, if the user unplugs the network cable, then won't you get a polling loop that never terminates? That doesn't sound any better than the current situation.

You said that you have not seen this issue when using the code that uses hPut, only the code that uses sendfile(). But my research indicates that we *should* see the error. So, I am not very comfortable fixing just sendfile and ignoring the fact that all network I/O might be borked.

I am also not 100% pleased by the SO_KEEPALIVE solution. There are really two errors which can occur:

 1. the remote end drops the connection in such a manner that we immediately get notified of it by seeing that a read select() on the socket is successful but there are 0 bytes available to read. This happens because the remote end sent a notification to us that they have terminated the connection.

 2. the remote end drops off the network (for example, the network cable is disconnected). In this case, we will not get any notification via read select(), because the remote server is not there to send the notification. The only solution is to eventually time out.

By using a timeout to handle #2, we implicitly handle #1, but in a very untimely manner.

Ideally, we would like to handle both these cases separately. In case #1, we know immediately that the connection is dead, and can therefore clean things up. With case #2, the remote client might actually come back online (someone plugs the cable back in), and the transfer resumes. Perhaps in some applications we want infinite timeouts for case #2. That does not mean we do not want case #1 handled.

However, I do not really see a good way of handling #1 right now that works for all network code, not just sendfile.

The issue seems to be that select() was designed as a way to *avoid* using threads. There seems to be the assumption in the network code that you are going to do a select on the read and write aspects of the socket. When the select returns you will then look at what happened, and take the correct action.

But, in Haskell, we are using multiple threads. So the code that is looking to read data and the code that is looking to write data don't really know about each other. So even if the read thread detects the closed socket, it has no idea that some other thread needs to be killed.

So, what to do? Perhaps it is wrong to use a socket in more than one thread? Obviously, having multiple threads trying to read the same socket, or write to the same socket, would be a mess. So why do we expect it is ok to have one thread reading and a different thread writing?

But, even if we do restrict ourselves to only accessing a socket from one thread at a time, we still have the issue that every place which uses threadWaitWrite needs to handle the disconnect case. We could, of course, write a wrapper function that does the check, and call that instead.

But we still have not really solved the problem. The code in the I/O libraries that eventually implements hPut calls threadWaitWrite. But it has no idea that the file descriptor it is waiting on is a socket which has special requirements. That code is also used for writing to plain old files, etc., so it probably wouldn't make sense for it to behave that way by default.

- jeremy
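The "wrapper function that does the check" mentioned in the message above could look roughly like the sketch below: wait for read readiness and write readiness at the same time, and report which came first. This is only an illustration under stated assumptions; `Wake`, `firstOf` and `waitSocket` are invented names. It relies only on `forkIO`, `killThread`, `MVar`, `threadWaitRead` and `threadWaitWrite` from base.

```haskell
import Control.Concurrent
import Control.Exception (finally)
import System.Posix.Types (Fd)

-- Which of the two waits finished first.
data Wake = PeerActivity | CanWrite
  deriving (Eq, Show)

-- Run both wait actions in their own threads and return as soon as
-- either completes. Both helper threads are killed afterwards, which
-- also releases a late finisher blocked on the already full MVar.
firstOf :: IO () -> IO () -> IO Wake
firstOf waitRead waitWrite = do
  box <- newEmptyMVar
  r <- forkIO (waitRead >> putMVar box PeerActivity)
  w <- forkIO (waitWrite >> putMVar box CanWrite)
  takeMVar box `finally` (killThread r >> killThread w)

-- Example wiring for a socket's file descriptor.
waitSocket :: Fd -> IO Wake
waitSocket fd = firstOf (threadWaitRead fd) (threadWaitWrite fd)
```

As the thread points out, `PeerActivity` is not proof of a disconnect: the peer may simply have sent data, and because readiness is level-triggered the read wait will keep firing until someone reads it. A caller would still have to inspect the socket (for example, read and check for 0 bytes) before treating the connection as dead.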
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
On Wed, Feb 17, 2010 at 3:54 PM, Jeremy Shaw jer...@n-heptane.com wrote:
> On Wed, Feb 17, 2010 at 1:27 PM, Bardur Arantsson s...@scientician.net wrote:
>>> (Obviously, if people are using sendfile with something other than happstack, it does not help them, but it sounds like trying to fix things in sendfile is misguided anyway.)
>>
>> How so? As a user I expect sendfile to work and not semi-randomly block threads indefinitely.
>
> Because it only addresses *one* case when this type of blocking can happen.
>
> Shouldn't hPut and friends also block indefinitely since they also use threadWaitWrite? If so, what good is just fixing sendfile, when all other network I/O will still block indefinitely?
>
> If things are 'fixed' at a higher level, by using SO_KEEPALIVE, then does sendfile really need a hack to deal with it?

I think I understand the SO_KEEPALIVE + SO_ERROR solution, and that does not really fix things either.

Setting SO_KEEPALIVE by itself does not cause the write select() to behave any differently. What it does do is cause the TCP stack to eventually send an empty packet to the remote host and hopefully get a response back. The response might be an error, or it might just be an ACK. But either way, I believe it is intended to cause the read select() to wake up. But, in the case that started this discussion, we are already getting this information. So this won't help with that at all.

The second part of the solution is to poll SO_ERROR to determine if something went wrong. This is an alternative to doing a read() on the socket and seeing if it returns 0 bytes. It is a nice alternative *because* it does not require a read(). However, it is still problematic. When you poll SO_ERROR, it will clear the error value, so there is a potential race condition if multiple threads are doing it.

In happstack, we fork a new thread to handle each incoming connection. So at first it seems like we could just fork a second thread that polls the SO_ERROR option on the socket and kills the first thread if an error happens. Unfortunately, it is not that simple. The first thread might fork another thread that is actually doing the threadWaitWrite. Killing the parent thread will not kill that child thread.

So, at present, I don't see a solution that is going to fix the problem in the rest of the IO code. There are multiple ways to hack only sendfile, but that is only one place this error can happen.

If this error truly never happens with hPut, then we should figure out why. If there is a solution that works for write() it should work for sendfile(), because the real issue is with the select() call anyway.

- jeremy
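The "second thread that polls SO_ERROR and kills the first thread" idea from the message above can be sketched as below. To keep the example free of socket-library details, the error poll is passed in as a plain `IO Int` action; in a real server it would presumably be something along the lines of reading the socket's SO_ERROR option, which is an assumption here rather than something the thread spells out. `watchdog` is a hypothetical name.

```haskell
import Control.Concurrent

-- Poll readError every intervalMicros microseconds. As soon as it
-- reports a nonzero error code, kill the worker thread and return
-- that code. Loops forever while the poll keeps returning 0.
watchdog
  :: Int       -- polling interval, in microseconds
  -> IO Int    -- error poll; 0 means "no pending error"
  -> ThreadId  -- the thread blocked on the connection
  -> IO Int
watchdog intervalMicros readError worker = loop
  where
    loop = do
      threadDelay intervalMicros
      err <- readError
      if err /= 0
        then killThread worker >> return err
        else loop
```

The caveats raised in the message still apply to this sketch: reading SO_ERROR clears it, so only one thread should poll a given socket, and `killThread` only reaches the thread it is given, not any children that thread has forked to do the actual write.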
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
On Sun, Feb 14, 2010 at 2:04 PM, Bardur Arantsson s...@scientician.net wrote:
> I've tested this extensively during this weekend and not a single leaked FD so far.
>
> I think we can safely say that polling an FD for read readiness is sufficient to properly detect a disconnected client regardless of why/how the client disconnected.
>
> The only issue I can see with just dropping the above code directly into the sendfile library is that it may lead to busy-waiting on EAGAIN *if* the client is actually trying to send data to the server while it's receiving the file via sendfile(). If the client sends even a single byte and the server isn't reading it from the socket, then threadWaitRead will keep returning immediately since it's level-triggered rather than edge-triggered.

Yeah. That could be trouble.

> Not sure what the best solution for this would be, API-wise... Maybe actually have sendfile read the data and supply it to a user-defined function which could react to the data in some way? (Could supply two standard functions: disconnect immediately and accumulate all received data into a bytestring.)

I think this goes beyond just a sendfile issue -- anyone trying to write non-blocking network code should run into this issue, right?

For now, maybe we should patch sendfile with what we have. But I think we really need to summarize our findings, see if we can generate a test case, and then see what Simon Marlow and company have to say...

- jeremy
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
On Tue, Feb 16, 2010 at 12:37 PM, Jeremy Shaw jer...@n-heptane.com wrote:
> I think this goes beyond just a sendfile issue -- anyone trying to write non-blocking network code should run into this issue, right?

What's a fairly concise description of the issue at hand? I haven't been paying much attention to this thread, and the descriptions I have seen have been somewhat confused.

One admittedly unhelpful observation is that when something goes wrong in this area, it's usually due to pilot error (either on the part of whoever wrote the Haskell library, or its user), and not so often caused by a bug in the underlying platform.
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
Excerpts from Bardur Arantsson's message of Tue Feb 16 22:57:23 +0200 2010:
> As far as I can tell, all nonblocking networking code is vulnerable to this issue (unless it actually does use threadWaitRead, obviously :)).

There are a few easy fixes:

 1) socket timeouts with Network.Socket.setSocketOption
 2) just make your server code have timeouts in Haskell

This cannot be fixed in the sendfile library, it is a feature of TCP that connections may linger for a long time unless explicit timeouts are used. So just document it and in your code using sendfile wrap it in an application specific timeout.

- Taru Karttunen
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
On Tue, Feb 16, 2010 at 3:48 PM, Bardur Arantsson s...@scientician.net wrote:
> The problem is that the sendfile library *doesn't* wake up when the connection is terminated (because of threadWaitWrite) -- it doesn't matter what the timeout is.

Have we actually confirmed this? We know that with the default socket configuration things are good. But have we actually tested setting the timeout to something short and seeing what happens? It would be good to know for sure.

- jeremy
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
On Wed, Feb 10, 2010 at 1:15 PM, Bardur Arantsson s...@scientician.net wrote:
> I've also been contemplating some solutions, but I cannot see any solutions to this problem which could reasonably be implemented outside of GHC itself. GHC lacks a threadWaitError, so there's no way to detect the problem except by timeout or polling. Solutions involving timeouts and polling are bad in this case because they arbitrarily restrict the client connection rate.
>
> Cheers,

I believe solutions involving polling and timeouts may be the *only* solution due to the way TCP works. There are two cases to consider here:

 1. what happens when the remote client does a proper disconnect by sending a FIN packet, etc.
 2. what happens when the remote client just drops the connection

Case #1 - Proper Disconnect

I believe that in this case we are ok. select() may not wake up due to the socket being closed -- but something will eventually cause select() to wake up, and then next time through the loop, the call to select will fail with EBADF. This will cause everyone to wake up. We can test this case by writing a client that purposely (and correctly) terminates the connection while threadWaitWrite is blocking and see if that causes it to wake up. To ensure that the IOManager is eventually waking up, the server can have an IO thread that just does:

    forever $ threadDelay (1*10^6)

Look here for more details: http://darcs.haskell.org/packages/base/GHC/Conc.lhs

Case #2 - Sudden Death

In this case, there is no way to tell if the client is still there without trying to send / recv data. A TCP connection is not a 'tangible' link. It is just an agreement to send packets to/from certain ports with certain sequence numbers. It's much closer to snail mail than a telephone call.

If you set the keepalive socket option, then the TCP layer will automatically ping the connection to make sure it is still alive. However, I believe the default time between keepalive packets is 2 hours, and can only be changed on a system-wide basis?

http://www.unixguide.net/network/socketfaq/2.8.shtml

The other option is to try to send some data. There are at least two cases that can happen here.

 1. the network cable is unplugged -- this is not an 'error'. The write buffer will fill up and it will wait until it can send the data. If the write buffer is full, it will either block or return EAGAIN depending on the mode. Eventually, after 2 hours, it might give up.

 2. the remote client has terminated the connection as far as it is concerned but not notified the server -- when you try to send data it will reject it, and send/write/sendfile/etc. will raise SIGPIPE.

Looking at your debug output, we are seeing the SIGPIPE / Broken Pipe error most of the time. But then there is the case where we get stuck on the threadWaitWrite.

threadWaitWrite is ultimately implemented by passing the file descriptor to the list of write descriptors in a call to select(). It seems, however, that select() is not waking up just because calling write() on a file descriptor *would* cause SIGPIPE.

The easiest way to confirm this case is probably to write a small, pure C program and see what really happens.

If this is the case, then it means the only way to tell if the client has abruptly dropped the connection is to actually try sending the data and see if the sending function raises SIGPIPE. And that means doing some sort of polling/timeout?

What do you think?

I do not have a good explanation as to why the portable version does not fail. Except maybe it is just so slow that it does not ever fill up the buffer, and hence does not get stuck in threadWaitWrite?

Anyway, the fundamental question is:

When your write buffer is full, and you call select() on that file descriptor, will select() return in the case where calling write() again would raise SIGPIPE?

- jeremy
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
Bardur Arantsson s...@scientician.net wrote:
> ...
>          then do errno <- getErrno
>                  if errno == eAGAIN
>                    then do threadDelay 100
>                            sendfile out_fd in_fd poff bytes
>                    else throwErrno "Network.Socket.SendFile.Linux"
>          else return (fromIntegral sbytes)
>
> That is, I removed the threadWaitWrite in favor of just adding a threadDelay 100 when eAGAIN is encountered.
>
> With this code, I cannot provoke the leak.
>
> Unfortunately this isn't really a solution -- the CPU is pegged at ~50% when I do this and it's not exactly elegant to have a hardcoded 100 ms delay in there. :)

I don't think it matters wrt the desired final solution, but this is NOT a 100 ms delay. It is a 0.1 ms delay, which is less than a GHC time slice and as such is basically a tight loop. If you use a reasonable value for the delay you will probably see the CPU being almost completely idle.

Thomas
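The unit mix-up pointed out in the message above is easy to avoid with a tiny wrapper, since `threadDelay` takes its argument in microseconds. A minimal sketch follows; `millisToMicros` and `threadDelayMillis` are names invented for this example.

```haskell
import Control.Concurrent (threadDelay)

-- threadDelay counts microseconds, so 1 ms is 1000 units.
millisToMicros :: Int -> Int
millisToMicros ms = ms * 1000

-- Sleep for at least the given number of milliseconds.
threadDelayMillis :: Int -> IO ()
threadDelayMillis = threadDelay . millisToMicros
```

With this helper, the intended 100 ms pause would be written `threadDelayMillis 100`, which is the same as `threadDelay 100000`, whereas `threadDelay 100` pauses for only 0.1 ms.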
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
On Feb 11, 2010, at 1:57 PM, Bardur Arantsson wrote:

>> 2. the remote client has terminated the connection as far as it is concerned but not notified the server -- when you try to send data it will reject it, and send/write/sendfile/etc will raise sigPIPE.
>>
>> Looking at your debug output, we are seeing the sigPIPE / Broken Pipe error most of the time. But then there is the case where we get stuck on the threadWaitWrite.
>>
>> threadWaitWrite is ultimately implemented by passing the file descriptor to the list of write descriptors in a call to select(). It seems, however, that select() is not waking up just because calling write() on a file descriptor *would* cause sigPIPE.
>
> That's what I expect select() with an errfd FDSET would do.

Nope. The exceptfds are only triggered in esoteric conditions. For TCP sockets, I think it only occurs if there is out-of-band data available to be read via recv() with the MSG_OOB flag.

http://uw714doc.sco.com/en/SDK_netapi/sockC.OoBdata.html

>> The easiest way to confirm this case is probably to write a small, pure C program and see what really happens.
>>
>> If this is the case, then it means the only way to tell if the client has abruptly dropped the connection is to actually try sending the data and see if the sending function calls sigPIPE. And that means doing some sort of polling/timeout?
>
> Correct, but the trouble is deciding how often to poll and/or how long the timeout should be.
>
> I don't see any easy answer to that. That's why my suggested solution is to simply punt it to the OS (by using portable mode) and suck up the extra overhead of the portable solution. Hopefully the new GHC I/O manager will make it possible to have a proper solution.

The whole point of the sendfile library is to use sendfile(), so not using sendfile() seems like the wrong solution. I am also not convinced that the new GHC I/O manager will do anything new to make it possible to have a proper solution.

I believe we would be seeing the same error even in pure C, so we need to know the workaround that works in pure C as well. I am not convinced we are punting to the OS by using portable mode either (more below).

>> I do not have a good explanation as to why the portable version does not fail. Except maybe it is just so slow that it does not ever fill up the buffer, and hence does not get stuck in threadWaitWrite?
>
> The portable version doesn't call threadWaitWrite. It simply turns the Socket into a handle (which causes it to become blocking) and so the kernel is tasked with handling all the gritty details.

The portable version does not directly call threadWaitWrite, but it still calls it.

Data.ByteString.Char8.hPutStr calls
Data.ByteString.hPut which calls
Data.ByteString.hPutBuf which calls
System.IO.hPutBuf which calls
GHC.IO.Handle.Text.hPutBuf which calls
GHC.IO.Handle.Text.bufWrite which calls
GHC.IO.Device.write which calls
GHC.IO.FD.fdWrite which calls
GHC.IO.FD.writeRawBufferPtr

which is defined as:

writeRawBufferPtr :: String -> FD -> Ptr Word8 -> Int -> CSize -> IO CInt
writeRawBufferPtr loc !fd buf off len
  | isNonBlocking fd = unsafe_write -- unsafe is ok, it can't block
  | otherwise   = do r <- unsafe_fdReady (fdFD fd) 1 0 0
                     if r /= 0
                        then write
                        else do threadWaitWrite (fromIntegral (fdFD fd)); write
  where
    do_write call = fromIntegral `fmap`
                      throwErrnoIfMinus1RetryMayBlock loc call
                        (threadWaitWrite (fromIntegral (fdFD fd)))
    write         = if threaded then safe_write else unsafe_write
    unsafe_write  = do_write (c_write (fdFD fd) (buf `plusPtr` off) len)
    safe_write    = do_write (c_safe_write (fdFD fd) (buf `plusPtr` off) len)

According to the following test program, I expect that 'isNonBlocking fd' will be 'True'. So it seems like the portable solution should be vulnerable to the same condition.

Perhaps the portable version is just so slow that the OS buffers never fill up so EAGAIN is never raised?
---
{-# LANGUAGE RecordWildCards #-}
module Main where

import Control.Concurrent (forkIO)
import Control.Monad (forever)
import Network (PortID(PortNumber), Socket, listenOn)
import Network.Socket (accept, socketToHandle)
import System.IO
import qualified GHC.IO.FD as FD
import GHC.IO.Handle.Internals (withHandle, flushWriteBuffer)
import GHC.IO.Handle.Types (Handle__(..), HandleType(..))
import qualified GHC.IO.FD as FD
import System.Posix.Types (Fd(..))
import System.IO.Error
import GHC.IO.Exception
import Data.Typeable (cast)
import GHC.IO.Handle.Internals (wantWritableHandle)

main =
    listen (PortNumber (toEnum 2525)) $ \s ->
        do h <- socketToHandle s ReadWriteMode
           wantWritableHandle "main" h $ \h_ -> showBlocking h_

showBlocking :: Handle__ -> IO ()
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
On Feb 9, 2010, at 6:47 PM, Thomas Hartman wrote:

> Matt, have you seen this thread?
>
> Jeremy, are you saying this is a bug in the sendfile library on hackage, or something underlying?

I'm saying that the behavior of the sendfile library is buggy. But it could be due to something underlying.

Either threadWaitWrite is buggy and should be fixed, or threadWaitWrite is doing the right thing and sendfile needs to be modified somehow to account for the behavior.

But I don't know which is the case or how to implement a solution to either option.

- jeremy
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
On Sun, Feb 7, 2010 at 9:22 AM, Bardur Arantsson s...@scientician.net wrote:

> True, it is perhaps technically not a bug, but it is certainly a misfeature since there is no easy way (at least AFAICT) to discover that something bad has happened for the file descriptor and act accordingly. AFAICT any solution would have to be based on a separate thread which either 1) checks the FD periodically somehow, or 2) simply lets the thread doing the threadWaitWrite time out after a set period of inactivity. Neither is very optimal.
>
> Either way, I'd certainly expect the sendfile library to work around this somehow such that this situation doesn't occur. I'm just having a hard time thinking up a good solution :).

Well, it is certainly a bug in sendfile that needs to be fixed. I'm not sure how to fix it either. If we can simplify the test case, we can ask Simon Marlow.

- jeremy
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
Matt, have you seen this thread?

Jeremy, are you saying this is a bug in the sendfile library on hackage, or something underlying?

thomas.

2010/2/9 Jeremy Shaw jer...@n-heptane.com:
> On Sun, Feb 7, 2010 at 9:22 AM, Bardur Arantsson s...@scientician.net wrote:
>> True, it is perhaps technically not a bug, but it is certainly a misfeature since there is no easy way (at least AFAICT) to discover that something bad has happened for the file descriptor and act accordingly. AFAICT any solution would have to be based on a separate thread which either 1) checks the FD periodically somehow, or 2) simply lets the thread doing the threadWaitWrite time out after a set period of inactivity. Neither is very optimal.
>>
>> Either way, I'd certainly expect the sendfile library to work around this somehow such that this situation doesn't occur. I'm just having a hard time thinking up a good solution :).
>
> Well, it is certainly a bug in sendfile that needs to be fixed. I'm not sure how to fix it either. If we can simplify the test case, we can ask Simon Marlow.
>
> - jeremy
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
It's not clear to me that this is actually a bug in threadWaitWrite.

I believe that under Linux, select() does not wake up just because the file descriptor was closed. (Under Windows, and possibly solaris/BSD/etc it does). So this behavior might be consistent with normal Linux behavior. However, it is clearly annoying that (a) the expected behavior is not documented (b) the behavior might be different under Linux than other OSes.

In some sense it is correct -- if the file descriptor is closed, then we certainly can't write more to it -- so threadWaitWrite need not wake up. But that leaves us with the issue of needing some way to be notified that the file descriptor was closed so that we can clean up after ourselves.

- jeremy

On Sun, Feb 7, 2010 at 2:13 AM, Bardur Arantsson s...@scientician.net wrote:

> Bardur Arantsson wrote:
> Bardur Arantsson wrote:
> (sorry about replying-to-self)
>
> During yet another bout of debugging, I've added even more "I am here" instrumentation code to the SendFile code, and the culprit seems to be threadWaitWrite.
>
> As Jeremy Shaw pointed out off-list, the symptoms are also consistent with a thread that simply gets stuck in threadWaitWrite.
>
> I've tried a couple of different solutions to this based on starting a separate thread to enforce a timeout on threadWaitWrite (using throwTo). It seems to work to prevent the file descriptor leak, but causes GHC to segfault after a while. Probably some sort of other resource exhaustion since my code is just an evil hack:
>
>     killer :: MVar () -> ThreadId -> IO ()
>     killer dontKill otherThread = do
>        threadDelay 5000
>        x <- tryTakeMVar dontKill
>        case x of
>          Just _ -> putStrLn "Killer thread expired"
>          Nothing -> throwTo otherThread (Overflow)
>
> where the relevant bit of sendfile reads:
>
>     mtid <- myThreadId
>     dontKill <- newEmptyMVar
>     forkIO $ killer dontKill mtid
>     threadWaitWrite out_fd
>     putMVar dontKill ()
>
> So I'm basically creating a thread for every single threadWaitWrite operation (which is a lot in this case).
>
> Anyone got any ideas on a simpler way to handle this? Maybe I should just report a bug for threadWaitWrite? IMO threadWaitWrite really should throw some sort of IOException if the FD goes dead while it's waiting.
>
> I suppose an alternative way to try to work around this would be by forcing the output socket into blocking (as opposed to non-blocking) mode, but I can't figure out how to do this on GHC 6.10.x -- I only see setNonBlockingFD which doesn't take a parameter unlike its 6.12.x counterpart.
>
> Cheers,
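One candidate for the "simpler way" asked about above is System.Timeout.timeout from base, which bounds how long an IO action may run without a hand-written killer thread. A minimal sketch, not what the sendfile package does; waitOrGiveUp and waitWriteTimeout are hypothetical names, and the choice of timeout value is left to the caller:

```haskell
import Control.Concurrent (threadWaitWrite)
import Data.Maybe (isJust)
import System.Posix.Types (Fd)
import System.Timeout (timeout)

-- Run a blocking wait, but give up after the given number of
-- microseconds. Returns True if the wait finished in time.
waitOrGiveUp :: Int -> IO () -> IO Bool
waitOrGiveUp micros wait = isJust <$> timeout micros wait

-- threadWaitWrite with an upper bound on how long to wait.
waitWriteTimeout :: Int -> Fd -> IO Bool
waitWriteTimeout micros fd = waitOrGiveUp micros (threadWaitWrite fd)
```

A caller would treat a False result as "the peer is probably gone" and close both descriptors. This tidies up the code but does not settle how long the timeout should be, and whether it is any cheaper than the killer thread depends on how timeout is implemented in the GHC version in use.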
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
On Sat, Feb 06, 2010 at 09:16:35AM +0100, Bardur Arantsson wrote:

Brandon S. Allbery KF8NH wrote:

On Feb 5, 2010, at 02:56 , Bardur Arantsson wrote:

[--snip--]

Broken pipe is normally handled as a signal, and is only mapped to an error if SIGPIPE is set to SIG_IGN. I can well imagine that the SIGPIPE signal handler isn't closing resources properly; a workaround would be to use the System.Posix.Signals API to ignore SIGPIPE, but I don't know if that would work as a general solution (it would depend on what other uses of pipes/sockets exist).

It was a good idea, but it doesn't seem to help to add

    installHandler sigPIPE Ignore (Just fullSignalSet)

to the main function. (Given the package name I assume System.Posix.Signals works similarly to regular old signals, i.e. globally per-process.)

This is really starting to drive me round the bend...

Have you seen GHC ticket #1619?

http://hackage.haskell.org/trac/ghc/ticket/1619

One further thing I've noticed: When compiling on my 64-bit machine, ghc issues the following warnings:

Linux.hsc:41: warning: format ‘%d’ expects type ‘int’, but argument 3 has type ‘long unsigned int’
Linux.hsc:45: warning: format ‘%d’ expects type ‘int’, but argument 3 has type ‘long unsigned int’
Linux.hsc:45: warning: format ‘%d’ expects type ‘int’, but argument 3 has type ‘long unsigned int’
Linux.hsc:45: warning: format ‘%d’ expects type ‘int’, but argument 3 has type ‘long unsigned int’

Those lines are:

39: -- max num of bytes in one send
40: maxBytes :: Int64
41: maxBytes = fromIntegral (maxBound :: (#type ssize_t))

and

44: foreign import ccall unsafe "sendfile64" c_sendfile
45:   :: Fd -> Fd -> Ptr (#type off_t) -> (#type size_t) -> IO (#type ssize_t)

This looks like a typical 32/64-bit problem, but normally I would expect any real run-time problems caused by a problematic conversion in the FFI to crash the whole process. Maybe I'm wrong about this...

To convert those '#' constants, the hsc2hs preprocessor constructs a C file with things like 'printf("%d", sizeof(ssize_t))', so that it can use the system's C compiler and avoid having to encode the ABI of every platform (to be able to know the memory layout of the structures). So that message comes from that C file, not from your Haskell one. At runtime it really doesn't matter.

-- Felipe.
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
me too.

2010/2/5 MightyByte mightyb...@gmail.com:
> I've been seeing a steady stream of similar "resource vanished" messages for as long as I've been running my happstack app. The message I get is this:
>
> socket: 58: hClose: resource vanished (Broken pipe)
>
> I run my app from a shell script inside a "while true" loop, so it automatically gets restarted if it crashes. This incurs no more than a few seconds of down time. Since that is acceptable for my application, I've never put much effort into investigating the issue. But I don't think the "resource vanished" error results in program termination. When I have looked into it, I've had similar trouble reproducing it. Clients such as wget and firefox don't seem to cause the problem. If I remember correctly it only happens with IE.
>
> On Fri, Feb 5, 2010 at 2:56 AM, Bardur Arantsson s...@scientician.net wrote:
>> Jeremy Shaw wrote:
>>> Actually, We should start by testing if native sendfile leaks file descriptors even when the whole file is sent. We have a test suite, but I am not sure if it tests for file handle leaking...
>>
>> I should have posted this earlier, but the exact message I'm seeing in the case where the Bad Client disconnects is this:
>>
>>   hums: Network.Socket.SendFile.Linux: resource vanished (Broken pipe)
>>
>> Oddly, I haven't been able to reproduce this using a wget client with a ^C during transfer. When I disconnect wget with ^C or "pkill wget" or even "pkill -9 wget", I get this message:
>>
>>   hums: Network.Socket.SendFile.Linux: resource vanished (Connection reset by peer)
>>
>> (and no leak, as observed by "lsof | grep hums").
>>
>> So there appears to be some vital difference between the handling of the two cases.
>>
>> Another observation which may be useful:
>>
>> Before the sendfile' API change (Handle -> FilePath) in sendfile-0.6.x, my code used "withFile" to open the file and to ensure that it was closed. So it seems that withBinaryFile *should* also be fine. Unless the Broken pipe error somehow escapes the scope without causing a close.
>>
>> I don't have time to dig more right now, but I'll try to see if I can find out more later.
>>
>> Cheers,
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
I've been seeing a steady stream of similar "resource vanished" messages for as long as I've been running my happstack app. The message I get is this:

socket: 58: hClose: resource vanished (Broken pipe)

I run my app from a shell script inside a "while true" loop, so it automatically gets restarted if it crashes. This incurs no more than a few seconds of down time. Since that is acceptable for my application, I've never put much effort into investigating the issue. But I don't think the "resource vanished" error results in program termination. When I have looked into it, I've had similar trouble reproducing it. Clients such as wget and firefox don't seem to cause the problem. If I remember correctly it only happens with IE.

On Fri, Feb 5, 2010 at 2:56 AM, Bardur Arantsson s...@scientician.net wrote:
> Jeremy Shaw wrote:
>> Actually, We should start by testing if native sendfile leaks file descriptors even when the whole file is sent. We have a test suite, but I am not sure if it tests for file handle leaking...
>
> I should have posted this earlier, but the exact message I'm seeing in the case where the Bad Client disconnects is this:
>
>   hums: Network.Socket.SendFile.Linux: resource vanished (Broken pipe)
>
> Oddly, I haven't been able to reproduce this using a wget client with a ^C during transfer. When I disconnect wget with ^C or "pkill wget" or even "pkill -9 wget", I get this message:
>
>   hums: Network.Socket.SendFile.Linux: resource vanished (Connection reset by peer)
>
> (and no leak, as observed by "lsof | grep hums").
>
> So there appears to be some vital difference between the handling of the two cases.
>
> Another observation which may be useful:
>
> Before the sendfile' API change (Handle -> FilePath) in sendfile-0.6.x, my code used "withFile" to open the file and to ensure that it was closed. So it seems that withBinaryFile *should* also be fine. Unless the Broken pipe error somehow escapes the scope without causing a close.
>
> I don't have time to dig more right now, but I'll try to see if I can find out more later.
>
> Cheers,
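On the question of whether a Broken pipe error can escape the scope without causing a close: withFile and withBinaryFile are built on bracket, and bracket runs its release action when the body throws. A minimal sketch that simulates this, assuming a dummy resource and a hand-made "resource vanished" error; brokenPipe and releaseRunsOnBrokenPipe are hypothetical names for illustration only:

```haskell
import Control.Exception (IOException, bracket, try)
import Data.IORef (newIORef, readIORef, writeIORef)
import GHC.IO.Exception (IOErrorType (ResourceVanished))
import System.IO.Error (mkIOError)

-- A stand-in for the "resource vanished (Broken pipe)" failure.
brokenPipe :: IOException
brokenPipe = mkIOError ResourceVanished "sendfile (simulated)" Nothing Nothing

-- Acquire a dummy resource, fail while using it, and report
-- (whether release ran, whether the error still reached the caller).
releaseRunsOnBrokenPipe :: IO (Bool, Bool)
releaseRunsOnBrokenPipe = do
  released <- newIORef False
  r <- try (bracket (return ())
                    (\_ -> writeIORef released True)
                    (\_ -> ioError brokenPipe)) :: IO (Either IOException ())
  wasReleased <- readIORef released
  return (wasReleased, either (const True) (const False) r)
```

If the release action runs whenever the body throws, then a leaked descriptor would suggest that no exception was raised at all, which fits the reports elsewhere in the thread of a thread stuck in threadWaitWrite.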
Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?
On Feb 5, 2010, at 02:56 , Bardur Arantsson wrote:

> I should have posted this earlier, but the exact message I'm seeing in the case where the Bad Client disconnects is this:
>
>   hums: Network.Socket.SendFile.Linux: resource vanished (Broken pipe)
>
> Oddly, I haven't been able to reproduce this using a wget client with a ^C during transfer. When I disconnect wget with ^C or "pkill wget" or even "pkill -9 wget", I get this message:
>
>   hums: Network.Socket.SendFile.Linux: resource vanished (Connection reset by peer)
>
> (and no leak, as observed by "lsof | grep hums").

Broken pipe is normally handled as a signal, and is only mapped to an error if SIGPIPE is set to SIG_IGN. I can well imagine that the SIGPIPE signal handler isn't closing resources properly; a workaround would be to use the System.Posix.Signals API to ignore SIGPIPE, but I don't know if that would work as a general solution (it would depend on what other uses of pipes/sockets exist).

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university    KF8NH
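The workaround suggested here can be sketched with installHandler from System.Posix.Signals in the unix package. ignoreSigPipe is a hypothetical helper name, and the setting is process-wide, so it affects every pipe and socket the program uses:

```haskell
import System.Posix.Signals (Handler (Ignore), installHandler, sigPIPE)

-- Ask the process to ignore SIGPIPE so that a write to a closed
-- socket fails with an error (EPIPE) rather than delivering a signal.
ignoreSigPipe :: IO ()
ignoreSigPipe = do
  _ <- installHandler sigPIPE Ignore Nothing  -- discard the previous handler
  return ()
```

It would be called once near the start of main. Elsewhere in the thread Bardur reports that adding such a handler did not stop the leak, so this is at most one part of a fix.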