
[Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-03-25 Thread Bardur Arantsson


On 2010-02-24 20:50, Brandon S. Allbery KF8NH wrote:

tcpdump 'host ps3 and tcp[tcpflags] & 0x27 != 0'


(Indulging in some serious thread necromancy here, but...)

Alright, I've _finally_ got round to doing a dump with leaking file 
descriptors (or threads as the case may be).


The bits of lsof output of the leaked file descriptors looks like this 
(sorry about the wrapping):


hums   2084 bardur   36u  sock    0,6   0t0 
45022400 can't identify protocol
hums   2084 bardur   37r  REG   0,23 733054976 
 267762 THE_MOVIE.avi


So pairs of FDs are being held up by sendfile when the PS3 disconnects. 
The number of such pairs in this test run was 16.


I've attached the gzipped output from tcpdump.

The only striking thing I can see about the dump is that there are 22 
(conspicuously close to 16) sequences like:


19:45:30.135291 IP 192.168.0.115.64931 > gwendolyn.9000: Flags [R], seq 
2112225068, win 0, length 0
19:45:30.135295 IP 192.168.0.115.64931 > gwendolyn.9000: Flags [R], seq 
2112225068, win 0, length 0
19:45:30.135299 IP 192.168.0.115.64931 > gwendolyn.9000: Flags [R], seq 
2112225068, win 0, length 0
19:45:30.135302 IP 192.168.0.115.64931 > gwendolyn.9000: Flags [R], seq 
2112225068, win 0, length 0


My tcpdump-fu is rather limited, so I'm not really sure what this 
actually means... any input much appreciated.


Cheers,


dump_with_leaking_fds.log.gz
Description: application/gzip
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-03-25 Thread Brandon S. Allbery KF8NH


On Mar 25, 2010, at 15:03 , Bardur Arantsson wrote:

On 2010-02-24 20:50, Brandon S. Allbery KF8NH wrote:

tcpdump 'host ps3 and tcp[tcpflags] & 0x27 != 0'


The only striking thing I can see about the dump is that there are  
22 (conspicuously close to 16) sequences like:


19:45:30.135291 IP 192.168.0.115.64931 > gwendolyn.9000: Flags [R],  
seq 2112225068, win 0, length 0
19:45:30.135295 IP 192.168.0.115.64931 > gwendolyn.9000: Flags [R],  
seq 2112225068, win 0, length 0
19:45:30.135299 IP 192.168.0.115.64931 > gwendolyn.9000: Flags [R],  
seq 2112225068, win 0, length 0
19:45:30.135302 IP 192.168.0.115.64931 > gwendolyn.9000: Flags [R],  
seq 2112225068, win 0, length 0


The above is a single socket:  the source and destination ports are  
the same for all 4 traces.


More useful, from the dump, is:

19:44:41.774161 IP 192.168.0.115.65265 > gwendolyn.9000: Flags [F.],  
seq 231, ack 1073301, win 41124, options [nop,nop,TS val 0 ecr  
95041042], length 0



which is where the PS/3 sent a FIN telling gwendolyn to close the  
socket.  It then follows that with a bunch of RST packets, the first  
of which is in sequence with the above FIN (suggesting the PS/3  
responded to the continued attempt to send by dropping the socket on  
the floor instead of by resending the FIN) and the rest are "this port  
is closed" RSTs, presumably due to 22 attempts to continue sending  
data.  This is somewhat poor on the part of the PS/3, but  
understandable given that it's essentially an embedded device.


It would be interesting to see what the data around there was, but  
that's not easy to do without recording all of it.


--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university    KF8NH





[Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-24 Thread Bardur Arantsson


On 2010-02-24 05:10, Brandon S. Allbery KF8NH wrote:

On Feb 21, 2010, at 20:17 , Jeremy Shaw wrote:

The PS3 does do something though. If we were doing a write *and* read
select on the socket, the read select would wakeup. So, it is trying
to notify us that something has happened, but we are not seeing it
because we are only looking at the write select().


Earlier the OP claimed this would happen within a few minutes if he
seeked in a movie. If it's that reproducible, it should be easy to
capture a tcpdump and attach it to an email (or pastebin it), allowing
us to determine what really happens.


It's a huge amount of data since it's streaming ~900Kb/s (or 
thereabouts). I don't think it's really practical to look through all 
that to try to figure out exactly when the problem occurs.


Anyone know of any programs which can highlight 'anomalous' tcp traffic 
in tcpdumps?


Still, I'd be happy to try a capture and upload it somewhere if anyone 
cares to have a look at it. It'll have to wait for the weekend, though.


Cheers,


Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-23 Thread Brandon S. Allbery KF8NH


On Feb 21, 2010, at 20:17 , Jeremy Shaw wrote:
The PS3 does do something though. If we were doing a write *and*  
read select on the socket, the read select would wakeup. So, it is  
trying to notify us that something has happened, but we are not  
seeing it because we are only looking at the write select().


Earlier the OP claimed this would happen within a few minutes if he  
seeked in a movie.  If it's that reproducible, it should be easy to  
capture a tcpdump and attach it to an email (or pastebin it), allowing  
us to determine what really happens.


Also, Donn, you are incorrect about invalidating premises; we know the  
connection is going away, we can infer it's not going away normally,  
that's why there have been comments about it sending a FIN and  
dropping the connection entirely (bypassing the shutdown handshake),  
or sending an RST, etc.


(I'd also be interested in finding out if OpenSolaris or FreeBSD has  
the same problem, but that may be too difficult to test easily.  I  
still find it highly unlikely that loss of a connection only wakes the  
read end in general, and would absolutely not be surprised if this  
were some odd corner case in the Linux TCP stack.  Sadly, I don't have  
a PS3 (yet, if ever) and I don't know of any streaming software for  
non-hacked Wiis.)


--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university    KF8NH





Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-23 Thread Donn Cave

Quoth Brandon S. Allbery KF8NH allb...@ece.cmu.edu,
 On Feb 21, 2010, at 20:17 , Jeremy Shaw wrote:
 The PS3 does do something though. If we were doing a write *and*  
 read select on the socket, the read select would wakeup. So, it is  
 trying to notify us that something has happened, but we are not  
 seeing it because we are only looking at the write select().

 Earlier the OP claimed this would happen within a few minutes if he  
 seeked in a movie.  If it's that reproducible, it should be easy to  
 capture a tcpdump and attach it to an email (or pastebin it), allowing  
 us to determine what really happens.

 Also, Donn, you are incorrect about invalidating premises; we know the  
 connection is going away, we can infer it's not going away normally,  
 that's why there have been comments about it sending a FIN and  
 dropping the connection entirely (bypassing the shutdown handshake),  
 or sending an RST, etc.

That's what I'm saying - it clearly is not a full close, i.e., going
away normally per protocol.

With luck maybe the packets will show that something does happen at
a wire protocol level, and there will be a way to recognize the event
at the `user land' level and plug that into the event loop.

My prediction is that on the contrary, the transition between functional
and defunct will not be announced in any way by the peer, but that's
just guessing.  It would be a lot less interesting.

Donn


Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-23 Thread Brandon S. Allbery KF8NH


On Feb 23, 2010, at 23:47 , Donn Cave wrote:
My prediction is that on the contrary, the transition between functional
and defunct will not be announced in any way by the peer, but that's
just guessing.  It would be a lot less interesting.



But that's not the issue.  The *kernel* is clearly detecting it; the  
problem is it's only being reported for the *read* end of the socket,  
whereas sendfile() (correctly) only cares about, and therefore only  
registers interest in, the *write* end.
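
A minimal sketch of what registering interest in both ends could look
like in GHC Haskell, assuming only base and the Fd type from the unix
package. The names Ready and waitEither are hypothetical and are not part
of the sendfile library. Read readiness can also mean ordinary incoming
data, so on its own it does not prove that the peer has gone away.

```haskell
import Control.Concurrent (forkIO, killThread, threadWaitRead, threadWaitWrite)
import Control.Concurrent.MVar (newEmptyMVar, takeMVar, tryPutMVar)
import Control.Exception (finally)
import Control.Monad (void)
import System.Posix.Types (Fd)

-- | Which kind of readiness was reported first (hypothetical type).
data Ready = Readable | Writable
  deriving (Eq, Show)

-- | Block until the descriptor is reported readable or writable,
-- whichever comes first, then stop both helper threads.
waitEither :: Fd -> IO Ready
waitEither fd = do
  result <- newEmptyMVar
  -- One helper thread per direction; the first to wake fills the MVar.
  r <- forkIO (threadWaitRead fd  >> void (tryPutMVar result Readable))
  w <- forkIO (threadWaitWrite fd >> void (tryPutMVar result Writable))
  takeMVar result `finally` (killThread r >> killThread w)
```

A caller that gets Readable back would still have to inspect the socket
(for example by attempting the write) before treating the connection as
dead.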


--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university    KF8NH





[Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-21 Thread Bardur Arantsson


Jeremy Shaw wrote:

Hello,

I think to make progress on this bug we really need a failing test case that
other people can reproduce.

I have hacked up small server that should reproduce the error (using fdWrite
instead of sendfile). And a small C client which is intended to reproduce
the error -- but doesn't.

I have attached both.

The server tries to write a whole lot of 'a' characters to the client. The
client does not consume any of them. This causes the server to block on the
threadWaitWrite.

No matter how I kill the client, threadWaitWrite always wakes up.


Are you running the client and server on different physical machines? If 
so, have you tried simply yanking the connection?


Your client isn't dropping the connection hard -- if you kill the client 
(even with a -9) your OS cleans up any open sockets it has. On 
well-behaved OS'es that cleanup usually involves properly shutting down 
the connection somehow. Different OS'es have different ideas about what 
constitutes properly shutting down the connection -- some simply don't.


My hypothesis is that the PS3 doesn't properly shut down the connection, 
but simply sends a RST (or maybe a FIN) and drops any further packets. 
I'll do a Wireshark dump after posting this to see if I can see what 
it's doing at the TCP level -- I'm not optimistic about seeing the exact 
moment when the leak occurs, but maybe the general pattern can yield 
some useful ideas.


I have no idea how to test this without using an actual PS3.

So, we need to figure out exactly what the PS3 is doing differently that
causes threadWaitWrite to not wakeup.


Does it matter? I can reproduce this reliably within a few minutes of 
testing.


Note that this doesn't happen *every* time the PS3 disconnects and 
reconnects, it just happens some of the time. It's enough to eat up 
MAX_FDs file descriptors in a few hours of playing media normally. If I 
do a lot of seeking (forces a disconnect+reconnect) through the movie, 
at least one file descriptor usually leaks within a few minutes.



If we don't know why it is failing, then I
don't think we can properly fix it.


I'm more pragmatic: If, after applying a fix, I cannot reproduce this 
problem within a few hours (or so) of running my media server, I'd say 
it's fixed. As long as the modifications to the sendfile library don't 
change its behavior in other ways, I don't see the problem.


P.S. Does anyone else out there have a PS3 to test with?


[Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-21 Thread Bardur Arantsson


Taru Karttunen wrote:

Excerpts from Bardur Arantsson's message of Wed Feb 17 21:27:07 +0200 2010:
For sendfile, a timeout of 1 second would probably be fine. The *ONLY* 
purpose of threadWaitWrite in the sendfile code is to avoid busy-waiting 
on EAGAIN from the native sendfile.


Of course this will kill connections for all clients that may have a
two second network hiccup.



I'm not talking about killing the connection. I'm talking about retrying 
sendfile() if threadWaitWrite has been waiting for more than 1 second.


If the connection *has already been closed* (as detected by the OS), 
then sendfile() will fail with EBADF, and we're good.


If the connection *hasn't been closed*, there are two cases:

  a) Sendfile can send more data, in which case it does and we go back 
to sleep on a threadWaitWrite.
  b) Sendfile cannot send more data... in which case the sendfile 
library gets an EAGAIN and goes back to sleep on a threadWaitWrite.


I don't see how that would lead to anything like what you describe.

How so? As a user I expect sendfile to work and not semi-randomly block 
threads indefinitely.


If you want sending something to terminate you will add a timeout to
it. A nasty client may e.g. take one byte each minute and sending your
file may take a few years.



This can always happen. The timeout here is at the application level 
(e.g. has this connection been open too long) and sendfile shouldn't 
concern itself with such things.


The point with my proposed fix is that sendfile will be reacting to the 
OS's detection of a dropped connection ASAP (plus 1 second) rather than 
just hanging.
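
A minimal sketch of the retry loop described above, assuming GHC's base
and the Fd type from the unix package. The Attempt type, sendAll, and the
trySend argument are hypothetical: trySend stands in for one non-blocking
call to the native sendfile and is assumed to throw an IOError when the OS
reports that the connection is gone (which errno that is depends on the
platform and is not assumed here).

```haskell
import Control.Concurrent (threadWaitWrite)
import System.Posix.Types (Fd)
import System.Timeout (timeout)

-- | Outcome of one non-blocking send attempt (hypothetical type).
data Attempt
  = WouldBlock  -- ^ the native call reported EAGAIN
  | Sent Int    -- ^ this many bytes were sent

-- | Send the given number of bytes, retrying at least once per second.
sendAll :: Fd -> (Int -> IO Attempt) -> Int -> IO ()
sendAll fd trySend = go
  where
    go :: Int -> IO ()
    go remaining
      | remaining <= 0 = pure ()
      | otherwise = do
          attempt <- trySend remaining
          case attempt of
            Sent n -> go (remaining - n)
            WouldBlock -> do
              -- Sleep until writable, but never longer than 1 second,
              -- so a dead connection is noticed on the next attempt.
              _ <- timeout 1000000 (threadWaitWrite fd)
              go remaining
```

The timeout result is deliberately ignored: whether the wait ended because
the socket became writable or because the second elapsed, the loop simply
tries the send again.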


Cheers,


Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-21 Thread Donn Cave

Quoth Bardur Arantsson s...@scientician.net,
 Taru Karttunen wrote:

 Excerpts from Bardur Arantsson's message of Wed Feb 17 21:27:07 +0200 2010:
 For sendfile, a timeout of 1 second would probably be fine. The *ONLY* 
 purpose of threadWaitWrite in the sendfile code is to avoid busy-waiting 
 on EAGAIN from the native sendfile.
 
 Of course this will kill connections for all clients that may have a
 two second network hiccup.
 

 I'm not talking about killing the connection. I'm talking about retrying 
 sendfile() if threadWaitWrite has been waiting for more than 1 second.

 If the connection *has already been closed* (as detected by the OS), 
 then sendfile() will fail with EBADF, and we're good.
...
 I don't see how that would lead to anything like what you describe.

If I understand correctly, we're talking about what it means for the
OS to detect a closed connection.

The proposal I think was to change the socket options to add keepalive,
and to set a short timeout.  This will indeed allow the OS to discover
connections that didn't properly close, but are effectively closed in
the sense that they are no use any more - disconnected cable, or it
sounds like the PS3 may routinely do this out of negligence.

The problem is that this definition of `closed' is, precisely,
`failed to respond within 2 seconds.'  If there is no observable
difference between a connection that has been abandoned by the PS3,
and a connection that just suffered a momentary lapse, then there's
no way to catch the former without making connections more fragile.

Donn Cave
d...@avvanta.com


Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-21 Thread Jeremy Shaw



On Feb 21, 2010, at 11:50 AM, Donn Cave wrote:


The problem is that this definition of `closed' is, precisely,
`failed to respond within 2 seconds.'  If there is no observable
difference between a connection that has been abandoned by the PS3,
and a connection that just suffered a momentary lapse, then there's
no way to catch the former without making connections more fragile.


No. (i think)

What happens is the PS3 has closed the connection, and if you attempt  
to send any more packets the PS3 will tell you it has closed the  
connection and the write() / sendfile() call will raise SIGPIPE.


The problem is we never try to send those packets, because we are  
sitting at threadWaitWrite waiting to write -- and there is nothing  
that is going to happen that will cause that call to select () (by  
threadWaitWrite) to actually wakeup.


I believe the proposal is to add a 2 second time out to the  
threadWaitWrite call. If it wakes up and can't write (because the  
remote side has lost connections, etc) then it will just go back to  
sleep. But if it wakes up, tries to write, and then gets sigPIPE, then  
it knows the connection is actually dead and will clean up after itself.


The problem is that we have not successfully figured out what is  
causing this issue in the first place.


I wrote a haskell server and a C client to try to emulate the  
situation which causes threadWaitWrite to never wake-up.. but I could  
not actually get that to happen. So far the PS3 client is the only  
thing that causes it.


I think that applying a fix without really understanding the problem  
is asking for trouble.


Among other things, since the problem is with threadWaitWrite (not  
sendfile), then the same issue ought to exist when we are calling  
hPutStr, etc, since they ultimately call threadWaitWrite as well. If  
hPut never has this problem, then we should understand why and use the  
same solution for sendfile. If hPut does have this problem, then  
fixing just sendfile isn't much of a solution.


So far there is:

 - no way for anyone besides Bardur to reproduce the problem
 - no sound explanation for why the PS3 client causes the error, but  
nothing else does
 - no proof that this error does or does not affect all the normal  
I/O functions in Haskell (hPut, etc).


- jeremy

Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-21 Thread Jeremy Shaw

On Sun, Feb 21, 2010 at 6:39 PM, Donn Cave d...@avvanta.com wrote:

 Quoth Jeremy Shaw jer...@n-heptane.com,
 ...
  What happens is the PS3 has closed the connection, and if you attempt
  to send any more packets the PS3 will tell you it has closed the
  connection and the write() / sendfile() call will raise SIGPIPE.
 ...
  So far there is:
 
- no way for anyone besides Bardur to reproduce the problem
- no sound explanation for why the PS3 client causes the error,
  but nothing else does

 I think in fact this invalidates your premise.  If the PS3 really
 closed its connection in the standard fashion, then it would be trivial
 to reproduce this problem with any other peer.  Evidently it doesn't,
 at least in this particular case, and that's why people are talking
 about TCP keep-alives, which address the defunct peer problem (within
 two hours, normally.)


The PS3 does do something though. If we were doing a write *and* read select
on the socket, the read select would wakeup. So, it is trying to notify us
that something has happened, but we are not seeing it because we are only
looking at the write select().

But I can not explain what the PS3 client is doing differently than the
other clients such that it does not cause the threadWaitWrite to wakeup.

Additionally, it is not clear that setting SO_KEEPALIVE will actually fix
anything. The documentation that I have read indicates that that may only
cause the read select() to wakeup not the write select(). Well, that is no
good, because that is supposedly what is happening with the PS3 client
already.

Anyway, part of the annoyance here is that in this particular case we
shouldn't need any timeouts to 'guess' that the client is 'probably dead'.
The client seems to be telling us that it has disconnected, but we are not
looking in the right place. And if we did try to write we would get a
sigPIPE error.

It is not the case that the client is unresponsive -- it is quite responsive.
The problem is that we are not looking in the right place for that response.

But, 'looking in the right place' is tricky. How do you tell hPut that it
should wakeup from threadWaitWrite if the Handle happens to be backed by a
socket, and threadWaitRead has data available? That does not even always
indicate an error condition, it can be a perfectly valid situation.

Well, before I think about that, I want to know what the PS3 client is doing
differently such that it is the only client that seems to exhibit this
behavior at the moment. If we do not understand the real difference between
what the PS3 and the C client are doing, then I don't think we can expect to
arrive at an appropriate fix.

- jeremy

Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-18 Thread Taru Karttunen

Excerpts from Bardur Arantsson's message of Wed Feb 17 21:27:07 +0200 2010:
 For sendfile, a timeout of 1 second would probably be fine. The *ONLY* 
 purpose of threadWaitWrite in the sendfile code is to avoid busy-waiting 
 on EAGAIN from the native sendfile.

Of course this will kill connections for all clients that may have a
two second network hiccup.

 How so? As a user I expect sendfile to work and not semi-randomly block 
 threads indefinitely.

If you want sending something to terminate you will add a timeout to
it. A nasty client may e.g. take one byte each minute and sending your
file may take a few years.

- Taru Karttunen

Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-17 Thread Taru Karttunen

Excerpts from Bardur Arantsson's message of Tue Feb 16 23:48:14 +0200 2010:
  This cannot be fixed in the sendfile library, it is a 
  feature of TCP that connections may linger for a long
  time unless explicit timeouts are used.
 
 The problem is that the sendfile library *doesn't* wake
 up when the connection is terminated (because of threadWaitWrite)
 -- it doesn't matter what the timeout is.

Even server code without sendfile has the same issue since
all writing to sockets ends up using threadWaitWrite.

System.Timeout.timeout terminates a threadWaitWrite using
asynchronous exceptions.
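
A minimal sketch of that behaviour, assuming GHC's base and the unix
package. It uses threadWaitRead on the read end of an empty pipe as a
stand-in for a threadWaitWrite that never wakes up; the function name
waitTimesOut is hypothetical.

```haskell
import Control.Concurrent (threadWaitRead)
import System.Posix.IO (closeFd, createPipe)
import System.Timeout (timeout)

-- | Wait on a descriptor that never becomes ready and let 'timeout'
-- interrupt the wait. A result of Nothing means the wait was cut short.
waitTimesOut :: IO (Maybe ())
waitTimesOut = do
  (readEnd, writeEnd) <- createPipe
  -- Nothing is ever written to the pipe, so this would block forever
  -- without the 0.2 second limit.
  r <- timeout 200000 (threadWaitRead readEnd)
  closeFd readEnd
  closeFd writeEnd
  pure r
```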

If you want to detect dead sockets somewhat reliably 
without a timeout then there is SO_KEEPALIVE combined
with polling SO_ERROR every few minutes.

- Taru Karttunen

Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-17 Thread Jeremy Shaw

On Wed, Feb 17, 2010 at 2:36 AM, Taru Karttunen tar...@taruti.net wrote:

 Excerpts from Bardur Arantsson's message of Tue Feb 16 23:48:14 +0200 2010:
   This cannot be fixed in the sendfile library, it is a
   feature of TCP that connections may linger for a long
   time unless explicit timeouts are used.
 
  The problem is that the sendfile library *doesn't* wake
  up when the connection is terminated (because of threadWaitWrite)
  -- it doesn't matter what the timeout is.

 Even server code without sendfile has the same issue since
 all writing to sockets ends up using threadWaitWrite.


Right, this is my concern -- I want to make sure that all of happstack is
fixed, not just sendfile.


 System.Timeout.timeout terminates a threadWaitWrite using
 asynchronous exceptions.


So for sendfile, instead of threadWaitWrite we could do:

 r <- timeout (60 * 10^6) (threadWaitWrite fd) -- fd: the socket's descriptor
 case r of
   Nothing -> ... -- timed out
   (Just ()) -> ... -- keep going

It seems tricky to use timeout at a higher level in the code, because some
requests may take a very long time to finish. For example, when serving a
long video, or streaming music it could be hours or days before the IO
request finishes.


If you want to detect dead sockets somewhat reliably
 without a timeout then there is SO_KEEPALIVE combined
 with polling SO_ERROR every few minutes.


 This approach sounds promising because it seems like it could be
incorporated into the guts of happstack-server. The timeout period could be
a Config option with a reasonable default. I would be surprised if *any*
happstack programs today are handling this correctly, so updating the core
to do something reasonable would be a big improvement... And if someone has
a special need where it is not ok, they can just change the config to use an
infinite timeout...

Does that sound like the right fix to you? (Obviously, if people are using
sendfile with something other than happstack, it does not help them, but it
sounds like trying to fix things in sendfile is misguided anyway.)

- jeremy

[Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-17 Thread Bardur Arantsson


Jeremy Shaw wrote:

On Wed, Feb 17, 2010 at 2:36 AM, Taru Karttunen tar...@taruti.net wrote:


So for sendfile, instead of threadWaitWrite we could do:

 r <- timeout (60 * 10^6) (threadWaitWrite fd) -- fd: the socket's descriptor
 case r of
   Nothing -> ... -- timed out
   (Just ()) -> ... -- keep going



For sendfile, a timeout of 1 second would probably be fine. The *ONLY* 
purpose of threadWaitWrite in the sendfile code is to avoid busy-waiting 
on EAGAIN from the native sendfile.


What would work is, instead of using threadWaitWrite with a long timeout 
(as in the code you supplied), to simply have a 1 second timeout which 
causes the loop to call the native sendfile again. Native sendfile *will* 
fail with an error code if the socket has been disconnected.


With that in place dead threads waiting on threadWaitWrite will only 
linger at most 1 second before discovering the disconnect.


Not ideal, but a lot better than the current situation.


Does that sound like the right fix to you?


[--snip--]


(Obviously, if people are using sendfile with something other than happstack,
it does not help them, but it sounds like trying to fix things in sendfile
is misguided anyway.)




How so? As a user I expect sendfile to work and not semi-randomly block 
threads indefinitely.


Cheers,


Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-17 Thread Jeremy Shaw

On Wed, Feb 17, 2010 at 1:27 PM, Bardur Arantsson s...@scientician.net wrote:


 (Obviously, if people are using sendfile with something other than happstack,
 it does not help them, but it sounds like trying to fix things in sendfile
 is misguided anyway.)



 How so? As a user I expect sendfile to work and not semi-randomly block
 threads indefinitely.


Because it only addresses *one* case when this type of blocking can happen.

Shouldn't hPut and friends also block indefinitely since they also use
threadWaitWrite? If so, what good is just fixing sendfile, when all other
network I/O will still block indefinitely?

If things are 'fixed' at a higher-level, by using SO_KEEPALIVE, then does
sendfile really need a hack to deal with it?

With your proposed fix, if the user unplugs the network cable, then won't
you get a polling loop that never terminates? That doesn't sound any better
than the current situation..

You said that you have not seen this issue when using the code that uses
hPut, only the code that uses sendfile(). But my research indicates that we
*should* see the error. So, I am not very comfortable fixing just sendfile
and ignoring the fact that all network I/O might be borked..

I am also not 100% pleased by the SO_KEEPALIVE solution. There are really
two errors which can occur:

  1. the remote end drops the connection in such a manner that we
immediately get notified of it by seeing that a read select() on the socket
is successful but there are 0 bytes available to read. This happens because
the remote end sent a notification to us that they have terminated the
connection.

  2. the remote end drops off the network (for example, the network cable is
disconnected). In this case, we will not get any notification via read
select(), because the remote server is not there to send the notification.
The only solution is to eventually timeout.

By using a timeout to handle #2, we implicitly handle #1, but in a very
untimely manner.

Ideally, we would like to handle both these cases separately. In case #1, we
know immediately, that the connection is dead, and can therefore clean
things up. With case #2, the remote client might actually come back online,
(someone plugs the cable back in), and the transfer resumes. Perhaps in some
applications we want infinite timeouts for case #2. That does not mean we do
not want case #1 handled.
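
A minimal sketch of detecting case #1, assuming GHC's base and the unix
package; the function name peerClosed is hypothetical. It waits for read
readiness and then checks whether a read returns 0 bytes. Note that it
consumes one byte if data is actually available, so for a real socket a
peeking read (recv with MSG_PEEK) would be the safer choice.

```haskell
import Control.Concurrent (threadWaitRead)
import Foreign.Marshal.Alloc (allocaBytes)
import System.Posix.IO (fdReadBuf)
import System.Posix.Types (Fd)

-- | Block until the descriptor is readable, then report whether the
-- other end has closed, i.e. whether the read returned 0 bytes.
peerClosed :: Fd -> IO Bool
peerClosed fd = do
  threadWaitRead fd
  -- Read at most one byte into a scratch buffer; 0 means end of stream.
  n <- allocaBytes 1 (\buf -> fdReadBuf fd buf 1)
  pure (n == 0)
```

This only covers case #1. In case #2 nothing ever arrives, so the wait
never returns and some form of timeout is still needed.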

However, I do not really see a good way of handling #1 right now that works
for all network code, not just sendfile.

The issue seems to be that select() was designed as a way to *avoid* using
threads. There seems to be the assumption in the network code that you are
going to do a select on the read and write aspects of the socket. When the
select returns you will then look at what happened, and take the correct
action.

But, in Haskell, we are using multiple threads. So the code that is looking
to read data and the code that is looking to write data don't really know
about each other. So even if the read thread detects the closed socket, it
has no idea that some other thread needs to be killed.

So, what to do? Perhaps it is wrong to use a socket in more than one thread?
Obviously, having multiple threads trying to read the same socket, or write to
the same socket would be a mess. So why do we expect it is ok to have one
thread reading and a different thread writing? But, even if we do restrict
ourselves to only accessing a socket from one thread at a time, we still
have the issue that every place which uses threadWaitWrite needs to handle
the disconnect case. We could, of course, write a wrapper function that does
the check, and call that instead. But we still have not really solved the
problem. The code in the I/O libraries that eventually implements hPut calls
threadWaitWrite. But it has no idea that the file descriptor it is waiting
on is a socket which has special requirements. That code is also used for
writing to plain old files, etc, so it probably wouldn't make sense for it
to behave that way by default..

- jeremy

Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-17 Thread Jeremy Shaw

On Wed, Feb 17, 2010 at 3:54 PM, Jeremy Shaw jer...@n-heptane.com wrote:

 On Wed, Feb 17, 2010 at 1:27 PM, Bardur Arantsson s...@scientician.net wrote:


 (Obviously, if people are using sendfile with something other than happstack,
 it does not help them, but it sounds like trying to fix things in sendfile
 is misguided anyway.)



 How so? As a user I expect sendfile to work and not semi-randomly block
 threads indefinitely.


 Because it only addresses *one* case when this type of blocking can happen.

 Shouldn't hPut and friends also block indefinitely since they also use
 threadWaitWrite? If so, what good is just fixing sendfile, when all other
 network I/O will still block indefinitely?

 If things are 'fixed' at a higher-level, by using SO_KEEPALIVE, then does
 sendfile really need a hack to deal with it?


I think I understand the SO_KEEPALIVE + SO_ERROR solution, and that does not
really fix things either.

Setting SO_KEEPALIVE by itself does not cause the write select() to behave
any differently. What it does do is cause the TCP stack to eventually send
an empty packet to the remote host and hopefully get a response back. The
response might be an error, or it might just be an ACK. But either way, I
believe it is intended to cause the read select() to wakeup. But, in the
case that started this discussion, we are already getting this information.
So this won't help with that at all.

The second part of the solution is to poll SO_ERROR to determine if
something went wrong. This is an alternative to doing a read() on the socket
and seeing whether it returns 0 bytes. It is a nice alternative *because* it does
not require a read(). However, it is still problematic. When you poll
SO_ERROR, it will clear the error value, so there is a potential race
condition if multiple threads are doing it.

In happstack, we fork a new thread to handle each incoming connection. So at
first it seems like we could just fork a second thread that polls the
SO_ERROR option on the socket and kills the first thread if an error
happens. Unfortunately, it is not that simple. The first thread might fork
another thread that is actually doing the threadWaitWrite. Killing the
parent thread will not kill that child thread.
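
The last point can be checked with a small sketch. This is only an
illustration, not happstack code: childOutlivesParent is a hypothetical name
and the delays are arbitrary. It forks a parent thread that forks a child,
kills the parent, and then waits for the child to report back.

```haskell
import Control.Concurrent (forkIO, killThread, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- | Hypothetical demo: returns True if the child thread keeps running
-- after its parent has been killed.
childOutlivesParent :: IO Bool
childOutlivesParent = do
  started <- newEmptyMVar
  done    <- newEmptyMVar
  parent  <- forkIO $ do
    _ <- forkIO $ do
      threadDelay 50000      -- child keeps working for 50 ms
      putMVar done True      -- and then reports that it finished
    putMVar started ()       -- tell the caller the child now exists
    threadDelay 10000000     -- parent would otherwise idle for 10 s
  takeMVar started           -- wait until the child has been forked
  killThread parent          -- kill only the parent
  takeMVar done              -- the child still fills this MVar
```

GHC threads have no parent/child relationship as far as killThread is
concerned, so a watchdog that kills the connection handler would also need
the ThreadId of every thread the handler forked.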

So, at present, I don't see a solution that is going to fix the problem in
the rest of the IO code. There are multiple ways to hack only sendfile.. but
that is only one place this error can happen.

If this error truly never happens with hPut, then we should figure out why.
If there is a solution that works for write() it should work for sendfile(),
because the real issue is with the select() call anyway..

- jeremy

Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-16 Thread Jeremy Shaw

On Sun, Feb 14, 2010 at 2:04 PM, Bardur Arantsson s...@scientician.net wrote:


 I've tested this extensively during this weekend and not a single leaked
 FD so far.

 I think we can safely say that polling an FD for read readiness is
 sufficient to properly detect a disconnected client regardless of why/how
 the client disconnected.

 The only issue I can see with just dropping the above code directly into
 the sendfile library is that it may lead to busy-waiting on EAGAIN *if* the
 client is actually trying to send data to the server while it's receiving
 the file via sendfile(). If the client sends even a single byte and the
 server isn't reading it from the socket, then threadWaitRead will keep
 returning immediately since it's level-triggered rather than edge triggered.


Yeah. That could be trouble.


 Not sure what the best solution for this would be, API-wise... Maybe
 actually have sendfile read the data and supply it to a user-defined
 function which could react to the data in some way? (Could supply two
 standard functions: disconnect immediately and accumulate all received
 data into a bytestring.)


I think this goes beyond just a sendfile issue -- anyone trying to write
non-blocking network code should run into this issue, right ? For now, maybe
we should patch sendfile with what we have. But I think we really need to
summarize our findings, see if we can generate a test case, and then see
what Simon Marlow and company have to say...

- jeremy

Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-16 Thread Bryan O'Sullivan

On Tue, Feb 16, 2010 at 12:37 PM, Jeremy Shaw jer...@n-heptane.com wrote:


 I think this goes beyond just a sendfile issue -- anyone trying to write
 non-blocking network code should run into this issue, right ?


What's a fairly concise description of the issue at hand? I haven't been
paying much attention to this thread, and the descriptions I have seen have
been somewhat confused.

One admittedly unhelpful observation is that when something goes wrong in
this area, it's usually due to pilot error (either on the part of whoever
wrote the Haskell library, or its user), and not so often caused by a bug in
the underlying platform.

[Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-16 Thread Bardur Arantsson


Jeremy Shaw wrote:

On Sun, Feb 14, 2010 at 2:04 PM, Bardur Arantsson s...@scientician.net wrote:



Not sure what the best solution for this would be, API-wise... Maybe
actually have sendfile read the data and supply it to a user-defined
function which could react to the data in some way? (Could supply two
standard functions: disconnect immediately and accumulate all received
data into a bytestring.)



I think this goes beyond just a sendfile issue -- anyone trying to write
non-blocking network code should run into this issue, right ? For now, maybe
we should patch sendfile with what we have. But I think we really need to
summarize our findings, see if we can generate a test case, and then see
what Simon Marlow and company have to say...


As far as I can tell, all nonblocking networking code is vulnerable to 
this issue (unless it actually does use threadWaitRead, obviously :)).


In particular, I would imagine most of the Haskell HTTP servers are 
vulnerable to this since they do use the same pattern of:


  1) read all the input from the client connection,
  2) send all the output to the client connection

where there is no reading from the socket in step 2.

I'm just not sure whether the GHC built-in I/O code *somehow*
avoids this problem. I think my tests indicate that it does, so it would 
seem that it's only when you go C that you need to worry.


Re: a test case, you'll probably need to run the test case code on a 
client whose OS allows (from userspace) the sudden dropping of 
connections without sending a proper connection shutdown sequence. I'm 
not sure what that OS would be.


Cheers,


[Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-16 Thread Bardur Arantsson


Bardur Arantsson wrote:

Jeremy Shaw wrote:

[--snip--]
Re: a test case, you'll probably need to run the test case code on a 
client whose OS allows (from userspace) the sudden dropping of 
connections without sending a proper connection shutdown sequence. I'm 
not sure what that OS would be.


Actually, scratch that. Maybe it's just a question of having a high enough 
connection rate to hit the case where threadWaitWrite hangs. Although 
I did try a few times using wget, I didn't really try hammering the 
server properly. It probably needs the right timing to trigger the 
problem (i.e. the disconnect needs to happen exactly when sendfile is 
done with its block and we're going around to threadWaitWrite again.)


I'll see if I get the time to try a test client which can really hammer my 
server -- that ought to be able to trigger the problem. If that works, 
I'll try to produce a minimal server program which still exhibits the issue.


Cheers,


Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-16 Thread Taru Karttunen

Excerpts from Bardur Arantsson's message of Tue Feb 16 22:57:23 +0200 2010:
 As far as I can tell, all nonblocking networking code is vulnerable to 
 this issue (unless it actually does use threadWaitRead, obviously :)).

There are a few easy fixes:

1) socket timeouts with Network.Socket.setSocketOption
2) just make your server code have timeouts in Haskell

This cannot be fixed in the sendfile library, it is a 
feature of TCP that connections may linger for a long
time unless explicit timeouts are used.

So just document it and in your code using sendfile
wrap it in an application specific timeout.
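
A minimal sketch of such a wrapper, assuming only System.Timeout from base.
The name sendWithTimeout and the choice of whole seconds are illustrative,
not part of the sendfile library:

```haskell
import System.Timeout (timeout)

-- | Hypothetical helper: run a send action, giving up after the given
-- number of seconds. 'Nothing' means the deadline passed before the
-- action finished; 'Just' carries the action's result.
sendWithTimeout :: Int -> IO a -> IO (Maybe a)
sendWithTimeout seconds send = timeout (seconds * 1000000) send
```

timeout works by throwing an asynchronous exception at the thread, so it can
interrupt a thread parked in threadWaitWrite, but it cannot interrupt a
foreign call that is blocked inside the kernel. The caller is still
responsible for closing the socket and file when Nothing comes back.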

- Taru Karttunen

[Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-16 Thread Bardur Arantsson


Taru Karttunen wrote:

Excerpts from Bardur Arantsson's message of Tue Feb 16 22:57:23 +0200 2010:
As far as I can tell, all nonblocking networking code is vulnerable to 
this issue (unless it actually does use threadWaitRead, obviously :)).


There are a few easy fixes:

1) socket timeouts with Network.Socket.setSocketOption


The whole point of this thread is that this isn't sufficient.


2) just make your server code have timeouts in Haskell

This cannot be fixed in the sendfile library, it is a 
feature of TCP that connections may linger for a long
time unless explicit timeouts are used.


The problem is that the sendfile library *doesn't* wake
up when the connection is terminated (because of threadWaitWrite)
-- it doesn't matter what the timeout is.

Client code of the sendfile library shouldn't have to try
to work around this -- it's absurd to expect it to.

Please read the entire thread.

Cheers,


Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-16 Thread Jeremy Shaw

On Tue, Feb 16, 2010 at 3:48 PM, Bardur Arantsson s...@scientician.net wrote:

 The problem is that the sendfile library *doesn't* wake
 up when the connection is terminated (because of threadWaitWrite)
 -- it doesn't matter what the timeout is.


Have we actually confirmed this? We know that with the default socket
configuration things are good. But have we actually tried setting the
timeout to something short and seeing what happens? It would be good to know
for sure..

- jeremy

[Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-14 Thread Bardur Arantsson


Jeremy Shaw wrote:


import Control.Concurrent
import Control.Concurrent.MVar
import System.Posix.Types

data RW = Read | Write

threadWaitReadWrite :: Fd -> IO RW
threadWaitReadWrite fd =
  do m <- newEmptyMVar
     rid <- forkIO $ threadWaitRead fd  >> putMVar m Read
     wid <- forkIO $ threadWaitWrite fd >> putMVar m Write
     r <- takeMVar m
     killThread rid
     killThread wid
     return r


[--snip--]

I've tested this extensively during this weekend and not a single 
leaked FD so far.


I think we can safely say that polling an FD for read readiness is 
sufficient to properly detect a disconnected client regardless of 
why/how the client disconnected.


The only issue I can see with just dropping the above code directly into 
the sendfile library is that it may lead to busy-waiting on EAGAIN *if* 
the client is actually trying to send data to the server while it's 
receiving the file via sendfile(). If the client sends even a single 
byte and the server isn't reading it from the socket, then 
threadWaitRead will keep returning immediately since it's 
level-triggered rather than edge triggered.


In the worst case this could be exploited by evil clients as a trivial 
way to DoS a server -- simply send data while the server is sending you 
a file. Bam, instant 100% CPU utilization on the server.


Not sure what the best solution for this would be, API-wise... Maybe 
actually have sendfile read the data and supply it to a user-defined 
function which could react to the data in some way? (Could supply two 
standard functions: disconnect immediately and accumulate all 
received data into a bytestring.)


Cheers,


[Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-12 Thread Bardur Arantsson


Jeremy Shaw wrote:


import Control.Concurrent
import Control.Concurrent.MVar
import System.Posix.Types

data RW = Read | Write

threadWaitReadWrite :: Fd -> IO RW
threadWaitReadWrite fd =
  do m <- newEmptyMVar
     rid <- forkIO $ threadWaitRead fd  >> putMVar m Read
     wid <- forkIO $ threadWaitWrite fd >> putMVar m Write
     r <- takeMVar m
     killThread rid
     killThread wid
     return r



Initial testing seems promising. I haven't been able to provoke the 
leak during 15-20 minutes of testing.


I'll test more thoroughly during the weekend.

Cheers,


Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-11 Thread Jeremy Shaw

On Wed, Feb 10, 2010 at 1:15 PM, Bardur Arantsson s...@scientician.net wrote:

I've also been contemplating some solutions, but I cannot see any solutions
 to this problem which could reasonably be implemented outside of GHC itself.
 GHC lacks a threadWaitError, so there's no way to detect the problem
 except by timeout or polling. Solutions involving timeouts and polling are
 bad in this case because they arbitrarily restrict the client connection
 rate.

 Cheers,


I believe solutions involving polling and timeouts may be the *only*
solution due to the way TCP works. There are two cases to consider here:

 1. what happens when the remote client does a proper disconnect by sending
a FIN packet, etc
 2. what happens when the remote client just drops the connection

Case #1 - Proper Disconnect

I believe that in this case we are ok. select() may not wake up due to the socket
being closed -- but something will eventually cause select() to wake up, and
then next time through the loop, the call to select will fail with EBADF.
This will cause everyone to wake up. We can test this case by writing a
client that purposely (and correctly) terminates the connection while
threadWaitWrite is blocking and see if that causes it to wake up. To ensure
that the IOManager is eventually waking up, the server can have an IO thread
that just does, forever $ threadDelay (1*10^6)

Look here for more details:
http://darcs.haskell.org/packages/base/GHC/Conc.lhs

Case #2 - Sudden Death

In this case, there is no way to tell if the client is still there without
trying to send / recv data. A TCP connection is not a 'tangible' link. It is
just an agreement to send packets to/from certain ports with certain
sequence numbers. It's much closer to snail mail than a telephone call.

If you set the keepalive socket option, then the TCP layer will
automatically ping the connection to make sure it is still alive. However, I
believe the default time between keepalive packets is 2 hours, and can only
be changed on a system wide basis?

http://www.unixguide.net/network/socketfaq/2.8.shtml

The other option is to try to send some data. There are at least two cases
that can happen here.

 1. the network cable is unplugged -- this is not an 'error'. The write
buffer will fill up and it will wait until it can send the data. If the
write buffer is full, it will either block or return EAGAIN depending on the
mode. Eventually, after 2 hours, it might give up.

 2. the remote client has terminated the connection as far as it is
concerned but not notified the server -- when you try to send data it will
reject it, and send/write/sendfile/etc will raise sigPIPE.

Looking at your debug output, we are seeing the sigPIPE / Broken Pipe error
most of the time. But then there is the case where we get stuck on the
threadWaitWrite.

threadWaitWrite is ultimately implemented by passing the file descriptor to
the list of write descriptors in a call to select(). It seems, however, that
select() is not waking up just because calling write() on a file descriptor
*would* cause sigPIPE.

The easiest way to confirm this case is probably to write a small, pure C
program and see what really happens.

If this is the case, then it means the only way to tell if the client has
abruptly dropped the connection is to actually try sending the data and see
if the sending function calls sigPIPE. And that means doing some sort of
polling/timeout?

What do you think?

I do not have a good explanation as to why the portable version does not
fail. Except maybe it is just so slow that it does not ever fill up the
buffer, and hence does not get stuck in threadWaitWrite?

Anyway, the fundamental question is:

 When your write buffer is full, and you call select() on that file
descriptor, will select() return in the case where calling write() again
would raise sigPIPE?

- jeremy

[Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-11 Thread Bardur Arantsson


Jeremy Shaw wrote:

On Wed, Feb 10, 2010 at 1:15 PM, Bardur Arantsson s...@scientician.net wrote:

I've also been contemplating some solutions, but I cannot see any solutions
to this problem which could reasonably be implemented outside of GHC itself.
GHC lacks a threadWaitError, so there's no way to detect the problem
except by timeout or polling. Solutions involving timeouts and polling are
bad in this case because they arbitrarily restrict the client connection
rate.

Cheers,



I believe solutions involving polling and timeouts may be the *only*
solution due to the way TCP works. There are two cases to consider here:



True, but my point was rather that a solution in the sendfile library 
would incur an _extra_ timeout on top of the timeout which is handled by 
the OS. It's very hard to come up with a proper timeout here because 
apps will have different requirements depending on the expected 
connection rate, etc. This is what I see as unacceptable since it would 
have to be a completely arbitrary timeout -- there's no way for the 
application to specify a timeout to the sendfile library since the API 
doesn't permit it.


[--snip--]

Case #1 - Proper Disconnect

I believe that in this case we are ok. select() may not wake up due to the socket
being closed -- but something will eventually cause select() to wake up, and
then next time through the loop, the call to select will fail with EBADF.
This will cause everyone to wake up. We can test this case by writing a
client that purposely (and correctly) terminates the connection while
threadWaitWrite is blocking and see if that causes it to wake up. To ensure
that the IOManager is eventually waking up, the server can have an IO thread
that just does, forever $ threadDelay (1*10^6)

Look here for more details:
http://darcs.haskell.org/packages/base/GHC/Conc.lhs



I don't have time to write a C test program right now. I'm actually not 
100% convinced that this case is *not* problematic, but my limited 
testing with well-behaved clients (wget, curl) hasn't turned up any 
problems so far.



Case #2 - Sudden Death

In this case, there is no way to tell if the client is still there without
trying to send / recv data. A TCP connection is not a 'tangible' link. It is
just an agreement to send packets to/from certain ports with certain
sequence numbers. It's much closer to snail mail than a telephone call.

If you set the keepalive socket option, then the TCP layer will
automatically ping the connection to make sure it is still alive. However, I
believe the default time between keepalive packets is 2 hours, and can only
be changed on a system wide basis?

http://www.unixguide.net/network/socketfaq/2.8.shtml


There are some options you can set via setsockopt(), see man 7 tcp:

   tcp_keepalive_intvl(default: 75s)
   tcp_fin_timeout(default: 60s)

(The latter is the amount of time to wait for the final FIN before 
forcing the socket to close.)


These can be set per-socket.



The other option is to try to send some data. There are at least two cases
that can happen here.


This is what I tried. The trouble here is that you have to force the 
thread doing threadWaitWrite to wake up periodically... and how do you 
decide how often? Too often and you're burning CPU doing nothing, too 
seldom and you're letting threads (and by implication 
used-but-really-disconnected-as-far-as-the-OS-is-concerned file 
descriptors) pile up. The overhead of memcpy (avoidance of which is 
sendfile's raison-d'être) is probably much less than the overhead of 
doing all this administration in userspace instead of just letting the 
kernel do its thing.


Even waking up very seldom (~1/s IIRC) incurred a lot of CPU overhead in 
my test case... but I suppose I could give it another try to see if I'd 
made some mistake in my code which caused it to use more CPU than necessary.




 1. the network cable is unplugged -- this is not an 'error'. The write
buffer will fill up and it will wait until it can send the data. If the
write buffer is full, it will either block or return EAGAIN depending on the
mode. Eventually, after 2 hours, it might give up.


I believe the socket is actually in non-blocking mode in my application. 
 I'm not putting it into non-blocking mode, so I'm guessing that the 
accept call is doing that -- or maybe it's just the default behavior 
of accept() on Linux. Converting a socket to a Handle (which is what the 
portable sendfile does) automatically puts it into blocking mode.


Actually, I think this whole issue could be avoided if the socket could 
just be forced into blocking mode. In that case, there would be no need 
to call threadWaitWrite: The native sendfile() call could never return 
EAGAIN (it would block instead), and so there'd be no need to call 
threadWaitWrite to avoid busy-waiting.



 2. the remote client has terminated the connection as far as it is
concerned but not notified the server -- when you try to send data it will
reject it, and

Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-11 Thread Thomas DuBuisson

Bardur Arantsson s...@scientician.net wrote:
 ...
       then do errno <- getErrno
               if errno == eAGAIN
                 then do
                    threadDelay 100
                    sendfile out_fd in_fd poff bytes
                 else throwErrno "Network.Socket.SendFile.Linux"
       else return (fromIntegral sbytes)

 That is, I removed the threadWaitWrite in favor of just adding a
 threadDelay 100 when eAGAIN is encountered.

 With this code, I cannot provoke the leak.

 Unfortunately this isn't really a solution -- the CPU is pegged at
 ~50% when I do this and it's not exactly elegant to have a hardcoded
 100 ms delay in there. :)

I don't think it matters wrt the desired final solution, but this is
NOT a 100 ms delay.  It is a 0.1 ms delay, which is less than a GHC
time slice and as such is basically a tight loop.  If you use a
reasonable value for the delay you will probably see the CPU being
almost completely idle.
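
To make the units explicit: threadDelay takes microseconds. A small sketch,
where milliseconds and delayMs are hypothetical helper names rather than
anything from base:

```haskell
import Control.Concurrent (threadDelay)

-- | Convert milliseconds to the microseconds that 'threadDelay' expects.
milliseconds :: Int -> Int
milliseconds ms = ms * 1000

-- | Sleep for the given number of milliseconds.
delayMs :: Int -> IO ()
delayMs = threadDelay . milliseconds
```

With these, the intended 100 ms pause would be written delayMs 100, which is
threadDelay 100000, whereas threadDelay 100 sleeps for only 0.1 ms.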

Thomas

Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-11 Thread Jeremy Shaw



On Feb 11, 2010, at 1:57 PM, Bardur Arantsson wrote:



2. the remote client has terminated the connection as far as it is
concerned but not notified the server -- when you try to send data it will
reject it, and send/write/sendfile/etc will raise sigPIPE.

Looking at your debug output, we are seeing the sigPIPE / Broken Pipe error
most of the time. But then there is the case where we get stuck on the
threadWaitWrite.

threadWaitWrite is ultimately implemented by passing the file descriptor to
the list of write descriptors in a call to select(). It seems, however, that
select() is not waking up just because calling write() on a file descriptor
*would* cause sigPIPE.


That's what I expect select() with an errfd FDSET would do.


Nope. The exceptfds are only triggered in esoteric conditions. For TCP
sockets, I think it only occurs if there is out-of-band data available
to be read via recv() with the MSG_OOB flag.


http://uw714doc.sco.com/en/SDK_netapi/sockC.OoBdata.html

The easiest way to confirm this case is probably to write a small, pure C
program and see what really happens.

If this is the case, then it means the only way to tell if the client has
abruptly dropped the connection is to actually try sending the data and see
if the sending function calls sigPIPE. And that means doing some sort of
polling/timeout?


Correct, but the trouble is deciding how often to poll and/or how  
long the timeout should be.


I don't see any easy answer to that. That's why my suggested  
solution is to simply punt it to the OS (by using portable mode)  
and suck up the extra overhead of the portable solution. Hopefully  
the new GHC I/O manager will make it possible to have a proper  
solution.


The whole point of the sendfile library is to use sendfile(), so not  
using sendfile() seems like the wrong solution. I am also not  
convinced that the new GHC I/O manager will do anything new to make it  
possible to have a proper solution. I believe we would be seeing the  
same error even in pure C, so we need to know the work around that  
works in pure C as well. I am not convinced we are punting to the OS  
by using portable mode either (more below).


I do not have a good explanation as to why the portable version does not
fail. Except maybe it is just so slow that it does not ever fill up the
buffer, and hence does not get stuck in threadWaitWrite?


The portable version doesn't call threadWaitWrite. It simply turns  
the Socket into a handle (which causes it to become blocking)  and  
so the kernel is tasked with handling all the gritty details.


The portable version does not directly call threadWaitWrite, but it  
still calls it.


Data.ByteString.Char8.hPutStr calls
Data.ByteString.hPut which calls
Data.ByteString.hPutBuf which calls
System.IO.hPutBuf which calls
GHC.IO.Handle.Text.hPutBuf which calls
GHC.IO.Handle.bufWrite.Text which calls
GHC.IO.Device.write which calls
GHC.IO.FD.fdWrite which calls
GHC.IO.FD.writeRawBufferPtr

which is defined as:

writeRawBufferPtr :: String -> FD -> Ptr Word8 -> Int -> CSize -> IO CInt
writeRawBufferPtr loc !fd buf off len
  | isNonBlocking fd = unsafe_write -- unsafe is ok, it can't block
  | otherwise   = do r <- unsafe_fdReady (fdFD fd) 1 0 0
                     if r /= 0
                        then write
                        else do threadWaitWrite (fromIntegral (fdFD fd)); write
  where
    do_write call = fromIntegral `fmap`
                      throwErrnoIfMinus1RetryMayBlock loc call
                        (threadWaitWrite (fromIntegral (fdFD fd)))
    write         = if threaded then safe_write else unsafe_write
    unsafe_write  = do_write (c_write (fdFD fd) (buf `plusPtr` off) len)
    safe_write    = do_write (c_safe_write (fdFD fd) (buf `plusPtr` off) len)


According to the following test program, I expect that 'isNonBlocking  
fd' will be 'True'. So it seems like the portable solution should be  
vulnerable to the same condition. Perhaps the portable version is just  
so slow that the OS buffers never fill up so EAGAIN is never raised?


---

{-# LANGUAGE RecordWildCards #-}
module Main where

import Control.Concurrent (forkIO)
import Control.Monad (forever)
import Network (PortID(PortNumber), Socket, listenOn)
import Network.Socket (accept, socketToHandle)
import System.IO
import qualified GHC.IO.FD as FD
import GHC.IO.Handle.Internals (withHandle, flushWriteBuffer)
import GHC.IO.Handle.Types (Handle__(..), HandleType(..))
import qualified GHC.IO.FD as FD
import System.Posix.Types (Fd(..))
import System.IO.Error
import GHC.IO.Exception
import Data.Typeable (cast)
import GHC.IO.Handle.Internals (wantWritableHandle)

main =
  listen (PortNumber (toEnum 2525)) $ \s ->
     do h <- socketToHandle s ReadWriteMode
        wantWritableHandle "main" h $ \h_ -> showBlocking h_


showBlocking :: Handle__ -> IO ()

[Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-11 Thread Bardur Arantsson


Thomas DuBuisson wrote:

Bardur Arantsson s...@scientician.net wrote:

...
  then do errno <- getErrno
          if errno == eAGAIN
            then do
               threadDelay 100
               sendfile out_fd in_fd poff bytes
            else throwErrno "Network.Socket.SendFile.Linux"
  else return (fromIntegral sbytes)

That is, I removed the threadWaitWrite in favor of just adding a
threadDelay 100 when eAGAIN is encountered.

With this code, I cannot provoke the leak.

Unfortunately this isn't really a solution -- the CPU is pegged at
~50% when I do this and it's not exactly elegant to have a hardcoded
100 ms delay in there. :)


I don't think it matters wrt the desired final solution, but this is
NOT a 100 ms delay.  It is a 0.1 ms delay, which is less than a GHC
time slice and as such is basically a tight loop.  If you use a
reasonable value for the delay you will probably see the CPU being
almost completely idle.



Excellent, thanks. I was probably too tired or annoyed when I wrote that 
code. I sorta-kinda-knew I must have been doing *something* wrong :).


I'll retry with a more reasonable delay tomorrow.

Cheers,


[Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-11 Thread Bardur Arantsson


Jeremy Shaw wrote:


On Feb 11, 2010, at 1:57 PM, Bardur Arantsson wrote:



[--snip lots of technical info--]

Thanks for digging so much into this.

Just a couple of comments:



The whole point of the sendfile library is to use sendfile(), so not 
using sendfile() seems like the wrong solution.


Heh, well, presumably it could still use sendfile() only on platforms where 
it can actually guarantee correctness :).




There is some evidence that when you are doing select() on a readfds, 
and the connection is closed, select() will indicate that the fds is 
ready to be read, but when you read it, you get 0-bytes. That indicates 
that a disconnect has happened. However, if you are only doing 
read()/recv(), I expect that only happens in the event of a proper 
disconnect, because if you are just listening for packets, there is no 
way to tell the difference between the sender just not saying anything, 
and the sender dying:


True, but the point here is that the OS has a built-in timeout mechanism 
(via keepalives) and *can* tell the program when that timeout has elapsed.


That's the timeout we're trying to get at instead of having to 
implement a new one.


Good point about the readfds triggering when the client disconnects. 
I think that's what I've been seeing in all my other network-related 
code and I just misremembered the details. All my code is extremely 
likely to have been both reading and writing from (roughly) the same set 
of FDs at the same time.


If this method of detection is correct, then what we need is a 
threadWaitReadWrite, that will notify us if the socket can be read or 
written. The IO manager does not currently provide a function like 
that.. but we could fake it like this: (untested):


import Control.Concurrent
import Control.Concurrent.MVar
import System.Posix.Types

data RW = Read | Write

threadWaitReadWrite :: Fd -> IO RW
threadWaitReadWrite fd =
  do m <- newEmptyMVar
     rid <- forkIO $ threadWaitRead fd  >> putMVar m Read
     wid <- forkIO $ threadWaitWrite fd >> putMVar m Write
     r <- takeMVar m
     killThread rid
     killThread wid
     return r



I'll try to get the sendfile code to use this instead. AFAICT it 
shouldn't actually be necessary to peek on the read end of the socket 
to detect that something has gone wrong. We're guaranteed that 
sendfile() to a connection that's died (according to the OS, either due 
to proper disconnect or a timeout) will fail.
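
One way the helper might be hardened before it goes into the library,
sketched under assumptions: threadWaitReadWriteSafe is a hypothetical name,
and this variant has not been tested against the leak. It uses tryPutMVar so
that the slower helper thread never blocks on an already full MVar, and
finally so that both helpers are killed even if the waiting thread itself is
killed.

```haskell
import Control.Concurrent (forkIO, killThread, threadWaitRead, threadWaitWrite)
import Control.Concurrent.MVar (newEmptyMVar, takeMVar, tryPutMVar)
import Control.Exception (finally)
import System.Posix.Types (Fd)

data RW = Read | Write
  deriving (Eq, Show)

-- | Hypothetical variant of threadWaitReadWrite: block until the
-- descriptor is readable or writable and report which happened first.
threadWaitReadWriteSafe :: Fd -> IO RW
threadWaitReadWriteSafe fd = do
  m   <- newEmptyMVar
  -- Each helper waits for one kind of readiness and then tries to report it.
  rid <- forkIO (threadWaitRead fd  >> tryPutMVar m Read  >> return ())
  wid <- forkIO (threadWaitWrite fd >> tryPutMVar m Write >> return ())
  -- Clean up both helpers whether takeMVar returns or is interrupted.
  takeMVar m `finally` (killThread rid >> killThread wid)
```

A caller in the sendfile loop would treat Write as "retry sendfile()" and
Read as "the peer sent data or went away". There is still a small window
between the two forkIO calls in which an asynchronous exception could leave
the first helper running.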


It might get a bit tricky to use this if the client is actually expecting
to send proper data while the sendfile() is in progress -- if there's
actual data to be read from the socket then the naive replacement of
threadWaitWrite by threadWaitReadWrite will end up busy-waiting on EAGAIN
since the socket will be readable every time threadWaitReadWrite gets called.

HOWEVER, that's not an issue in my particular scenario, so a simple 
replacement of threadWaitWrite by threadWaitReadWrite should do fine for 
testing purposes.


Of course, in the case where the client disconnects because someone 
turns off the power or pulls the ethernet cable, we have no way of 
knowing what is going on -- so there is still the possibility that dead 
connections will be left open for a long time.


True, but then it's (properly) left to the OS to decide and timeouts can 
be controlled via setsockopt -- as they should IMO.


I'll test tomorrow.

What I expect is that I'll still see a few dead threads lingering 
around for ~60 seconds (the OS-based timeout), but that I'll not see any
threads lingering indefinitely -- something which usually happens after 
a few hours of persistent use of my media server.


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-10 Thread Jeremy Shaw


On Feb 9, 2010, at 6:47 PM, Thomas Hartman wrote:


Matt, have you seen this thread?

Jeremy, are you saying this is a bug in the sendfile library on hackage,
or something underlying?


I'm saying that the behavior of the sendfile library is buggy. But it  
could be due to something underlying..


Either threadWaitWrite is buggy and should be fixed. Or  
threadWaitWrite is doing the right thing, and sendfile needs to be  
modified somehow to account for the behavior. But I don't know which  
is the case or how to implement a solution to either option.


- jeremy

[Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-10 Thread Bardur Arantsson


Jeremy Shaw wrote:

On Feb 9, 2010, at 6:47 PM, Thomas Hartman wrote:


Matt, have you seen this thread?

Jeremy, are you saying this is a bug in the sendfile library on hackage,
or something underlying?


I'm saying that the behavior of the sendfile library is buggy. But it 
could be due to something underlying..


Either threadWaitWrite is buggy and should be fixed. Or threadWaitWrite 
is doing the right thing, and sendfile needs to be modified somehow to 
account for the behavior. But I don't know which is the case or how to 
implement a solution to either option.


IMO, in the interests of correctness over speed, an interim release of 
sendfile which simply uses the portable code on Linux should be put 
out. The CPU overhead of the portable method doesn't matter that much 
for servers which aren't extremely busy.


I've also been contemplating some solutions, but I cannot see any 
solutions to this problem which could reasonably be implemented outside 
of GHC itself. GHC lacks a threadWaitError, so there's no way to 
detect the problem except by timeout or polling. Solutions involving 
timeouts and polling are bad in this case because they arbitrarily 
restrict the client connection rate.


Cheers,


Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-09 Thread Jeremy Shaw

On Sun, Feb 7, 2010 at 9:22 AM, Bardur Arantsson s...@scientician.net wrote:

True, it is perhaps technically not a bug, but it is certainly a misfeature
 since there is no easy way (at least AFAICT) to discover that something bad
 has happened for the file descriptor and act accordingly. AFAICT any
 solution would have to be based on a separate thread which either 1)
 checks the FD periodically somehow, or 2) simply lets the thread doing the
 threadWaitWrite time out after a set period of inactivity. Neither is very
 optimal.

 Either way, I'd certainly expect the sendfile library to work around this
 somehow such that this situation doesn't occur. I'm just having a hard time
 thinking up a good solution :).


Well, it is certainly a bug in sendfile that needs to be fixed. I'm not sure
how to fix it either. If we can simplify the test case, we can ask Simon
Marlow..

- jeremy

Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-09 Thread Thomas Hartman

Matt, have you seen this thread?

Jeremy, are you saying this is a bug in the sendfile library on hackage,
or something underlying?

thomas.

2010/2/9 Jeremy Shaw jer...@n-heptane.com:
 On Sun, Feb 7, 2010 at 9:22 AM, Bardur Arantsson s...@scientician.net
 wrote:

 True, it is perhaps technically not a bug, but it is certainly a
 misfeature since there is no easy way (at least AFAICT) to discover that
 something bad has happened for the file descriptor and act accordingly.
 AFAICT any solution would have to be based on a separate thread which either
 1) checks the FD periodically somehow, or 2) simply lets the thread doing
 the threadWaitWrite time out after a set period of inactivity. Neither is
 very optimal.

 Either way, I'd certainly expect the sendfile library to work around this
 somehow such that this situation doesn't occur. I'm just having a hard time
 thinking up a good solution :).

 Well, it is certainly a bug in sendfile that needs to be fixed. I'm not sure
 how to fix it either. If we can simplify the test case, we can ask Simon
 Marlow..
 - jeremy



[Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-07 Thread Bardur Arantsson


Bardur Arantsson wrote:

Bardur Arantsson wrote:

(sorry about replying-to-self)

During yet another bout of debugging, I've added even more "I am here" 
instrumentation code to the SendFile code, and the culprit seems to be

  threadWaitWrite.



As Jeremy Shaw pointed out off-list, the symptoms are also consistent
with a thread that simply gets stuck in threadWaitWrite.

I've tried a couple of different solutions to this based on starting a
separate thread to enforce a timeout on threadWaitWrite (using throwTo).

It seems to work to prevent the file descriptor leak, but causes GHC
to segfault after a while. Probably some sort of other resource exhaustion
since my code is just an evil hack:

 killer :: MVar () -> ThreadId -> IO ()
 killer dontKill otherThread = do
    threadDelay 5000
    x <- tryTakeMVar dontKill
    case x of
       Just _  -> putStrLn "Killer thread expired"
       Nothing -> throwTo otherThread (Overflow)

where the relevant bit of sendfile reads:

    mtid <- myThreadId
    dontKill <- newEmptyMVar
    forkIO $ killer dontKill mtid
    threadWaitWrite out_fd
    putMVar dontKill ()

So I'm basically creating a thread for every single threadWaitWrite operation
(which is a lot in this case).

Anyone got any ideas on a simpler way to handle this? Maybe I should just
report a bug for threadWaitWrite? IMO threadWaitWrite really should
throw some sort of IOException if the FD goes dead while it's waiting.
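
One possibly simpler way to bound the wait is the timeout function from System.Timeout in base, rather than a hand-rolled killer thread. This is a sketch under assumptions, not what the sendfile library does: waitWithin and waitWritableWithin are hypothetical names, the choice of timeout value is left to the caller, and timeout may carry per-call overhead of its own, so it is not necessarily cheaper than the hack above.

```haskell
import Control.Concurrent (threadDelay, threadWaitWrite)
import System.Posix.Types (Fd)
import System.Timeout (timeout)

-- | Run a blocking wait, giving up after the given number of microseconds.
-- Returns True if the wait completed and False if it timed out.
-- (Hypothetical helper name, for illustration only.)
waitWithin :: Int -> IO () -> IO Bool
waitWithin micros wait = do
  r <- timeout micros wait
  return (maybe False (const True) r)

-- | Wait for an FD to become writable, but only for a bounded time.
waitWritableWithin :: Int -> Fd -> IO Bool
waitWritableWithin micros fd = waitWithin micros (threadWaitWrite fd)

-- | Demonstration without sockets: a two-second sleep does not finish
-- within a 10 ms budget, while an immediate action does.
demoTimedOut, demoCompleted :: IO Bool
demoTimedOut  = waitWithin 10000 (threadDelay 2000000)
demoCompleted = waitWithin 1000000 (return ())
```

A caller could treat a False result as "connection presumed dead" and close both descriptors, at the cost of also dropping clients that are merely slow.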

I suppose an alternative way to try to work around this would be by forcing the
output socket into blocking (as opposed to non-blocking) mode, but I can't figure
out how to do this on GHC 6.10.x -- I only see setNonBlockingFD which doesn't take
a parameter unlike its 6.12.x counterpart.

Cheers,


Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-07 Thread Jeremy Shaw

It's not clear to me that this is actually a bug in threadWaitWrite.

I believe that under Linux, select() does not wake up just because the file
descriptor was closed. (Under Windows, and possibly solaris/BSD/etc it
does). So this behavior might be consistent with normal Linux behavior.
However, it is clearly annoying that (a) the expected behavior is not
documented (b) the behavior might be different under Linux than other OSes.

In some sense it is correct -- if the file descriptor is closed, then we
certainly can't write more to it -- so threadWaitWrite need not wake up..
But that leaves us with the issue of needing some way to be notified that
the file descriptor was closed so that we can clean up after ourselves..

- jeremy

On Sun, Feb 7, 2010 at 2:13 AM, Bardur Arantsson s...@scientician.net wrote:

 Bardur Arantsson wrote:

 Bardur Arantsson wrote:

 (sorry about replying-to-self)

  During yet another bout of debugging, I've added even more "I am here"
 instrumentation code to the SendFile code, and the culprit seems to be

   threadWaitWrite.


 As Jeremy Shaw pointed out off-list, the symptoms are also consistent
 with a thread that simply gets stuck in threadWaitWrite.

 I've tried a couple of different solutions to this based on starting a
 separate thread to enforce a timeout on threadWaitWrite (using throwTo).

 It seems to work to prevent the file descriptor leak, but causes GHC
 to segfault after a while. Probably some sort of other resource exhaustion
 since my code is just an evil hack:

  killer :: MVar () -> ThreadId -> IO ()
  killer dontKill otherThread = do
     threadDelay 5000
     x <- tryTakeMVar dontKill
     case x of
        Just _  -> putStrLn "Killer thread expired"
        Nothing -> throwTo otherThread (Overflow)

 where the relevant bit of sendfile reads:

     mtid <- myThreadId
     dontKill <- newEmptyMVar
     forkIO $ killer dontKill mtid
     threadWaitWrite out_fd
     putMVar dontKill ()

 So I'm basically creating a thread for every single threadWaitWrite
 operation
 (which is a lot in this case).

 Anyone got any ideas on a simpler way to handle this? Maybe I should just
 report a bug for threadWaitWrite? IMO threadWaitWrite really should
 throw some sort of IOException if the FD goes dead while it's waiting.

 I suppose an alternative way to try to work around this would be by forcing
 the output socket into blocking (as opposed to non-blocking) mode, but I
 can't figure out how to do this on GHC 6.10.x -- I only see setNonBlockingFD
 which doesn't take a parameter unlike its 6.12.x counterpart.


 Cheers,



[Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-07 Thread Bardur Arantsson


Jeremy Shaw wrote:

It's not clear to me that this is actually a bug in threadWaitWrite.

I believe that under Linux, select() does not wake up just because the file
descriptor was closed.


select() has the option of specifying an exceptfds FD_SET where I'd 
certainly _expect_ select() to flag an FD if it's closed. Annoyingly, 
the man page is not very specific about what an exception is, so it's 
hard to be sure.



(Under Windows, and possibly solaris/BSD/etc it
does). So this behavior might be consistent with normal Linux behavior.
However, it is clearly annoying that (a) the expected behavior is not
documented (b) the behavior might be different under Linux than other OSes.

In some sense it is correct -- if the file descriptor is closed, then we
certainly can't write more to it -- so threadWaitWrite need not wake up..
But that leaves us with the issue of needing some way to be notified that
the file descriptor was closed so that we can clean up after ourselves..



True, it is perhaps technically not a bug, but it is certainly a 
misfeature since there is no easy way (at least AFAICT) to discover that 
something bad has happened for the file descriptor and act accordingly. 
AFAICT any solution would have to be based on a separate thread which 
either 1) checks the FD periodically somehow, or 2) simply lets the 
thread doing the threadWaitWrite time out after a set period of 
inactivity. Neither is very optimal.


Either way, I'd certainly expect the sendfile library to work around 
this somehow such that this situation doesn't occur. I'm just having a 
hard time thinking up a good solution :).


Cheers,


[Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-06 Thread Bardur Arantsson


Brandon S. Allbery KF8NH wrote:

On Feb 5, 2010, at 02:56 , Bardur Arantsson wrote:

[--snip--]


Broken pipe is normally handled as a signal, and is only mapped to an 
error if SIGPIPE is set to SIG_IGN.  I can well imagine that the SIGPIPE 
signal handler isn't closing resources properly; a workaround would be 
to use the System.Posix.Signals API to ignore SIGPIPE, but I don't know 
if that would work as a general solution (it would depend on what other 
uses of pipes/sockets exist).


It was a good idea, but it doesn't seem to help to add

installHandler sigPIPE Ignore (Just fullSignalSet)

to the main function. (Given the package name I assume 
System.Posix.Signals works similarly to regular old signals, i.e. 
globally per-process.)


This is really starting to drive me round the bend...

One further thing I've noticed: When compiling on my 64-bit machine,
ghc issues the following warnings:

Linux.hsc:41: warning: format ‘%d’ expects type ‘int’, but argument 3 
has type ‘long unsigned int’
Linux.hsc:45: warning: format ‘%d’ expects type ‘int’, but argument 3 
has type ‘long unsigned int’
Linux.hsc:45: warning: format ‘%d’ expects type ‘int’, but argument 3 
has type ‘long unsigned int’
Linux.hsc:45: warning: format ‘%d’ expects type ‘int’, but argument 3 
has type ‘long unsigned int’


Those lines are:

39: -- max num of bytes in one send
40: maxBytes :: Int64
41: maxBytes = fromIntegral (maxBound :: (#type ssize_t))

and

44: foreign import ccall unsafe "sendfile64" c_sendfile
45:   :: Fd -> Fd -> Ptr (#type off_t) -> (#type size_t) -> IO (#type ssize_t)



This looks like a typical 32/64-bit problem, but normally I would expect 
any real run-time problems caused by a problematic conversion in the FFI 
to crash the whole process. Maybe I'm wrong about this...


Cheers,


Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-06 Thread Felipe Lessa

On Sat, Feb 06, 2010 at 09:16:35AM +0100, Bardur Arantsson wrote:
 Brandon S. Allbery KF8NH wrote:
 On Feb 5, 2010, at 02:56 , Bardur Arantsson wrote:
 [--snip--]
 
 Broken pipe is normally handled as a signal, and is only mapped
 to an error if SIGPIPE is set to SIG_IGN.  I can well imagine that
 the SIGPIPE signal handler isn't closing resources properly; a
 workaround would be to use the System.Posix.Signals API to ignore
 SIGPIPE, but I don't know if that would work as a general solution
 (it would depend on what other uses of pipes/sockets exist).

 It was a good idea, but it doesn't seem to help to add

   installHandler sigPIPE Ignore (Just fullSignalSet)

 to the main function. (Given the package name I assume
 System.Posix.Signals works similarly to regular old signals, i.e.
 globally per-process.)

 This is really starting to drive me round the bend...

Have you seen GHC ticket #1619?

http://hackage.haskell.org/trac/ghc/ticket/1619


 One further thing I've noticed: When compiling on my 64-bit machine,
 ghc issues the following warnings:

 Linux.hsc:41: warning: format ‘%d’ expects type ‘int’, but argument
 3 has type ‘long unsigned int’
 Linux.hsc:45: warning: format ‘%d’ expects type ‘int’, but argument
 3 has type ‘long unsigned int’
 Linux.hsc:45: warning: format ‘%d’ expects type ‘int’, but argument
 3 has type ‘long unsigned int’
 Linux.hsc:45: warning: format ‘%d’ expects type ‘int’, but argument
 3 has type ‘long unsigned int’

 Those lines are:

 39: -- max num of bytes in one send
 40: maxBytes :: Int64
 41: maxBytes = fromIntegral (maxBound :: (#type ssize_t))

 and

 44: foreign import ccall unsafe "sendfile64" c_sendfile
 45:   :: Fd -> Fd -> Ptr (#type off_t) -> (#type size_t) -> IO (#type ssize_t)

 This looks like a typical 32/64-bit problem, but normally I would
 expect any real run-time problems caused by a problematic conversion
 in the FFI to crash the whole process. Maybe I'm wrong about this...

To convert those '#' constants, the hsc2hs preprocessor constructs a
C file with things like 'printf("%d", sizeof(ssize_t))' to use the
system's C compiler and avoid having to encode the ABI of every
platform (to be able to know the memory layout of the
structures).

So that message comes from that C file, not from your Haskell
one.  At runtime it really doesn't matter.
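
To illustrate why the widths come out right at runtime, here is a small sketch that avoids hsc2hs altogether. It assumes that the CSsize and CSize newtypes from base are acceptable stand-ins for what (#type ssize_t) and (#type size_t) expand to; that substitution is an assumption made for illustration and is not the code in Linux.hsc.

```haskell
import Data.Int (Int64)
import Foreign.C.Types (CSize)
import Foreign.Storable (sizeOf)
import System.Posix.Types (CSsize)

-- | Same idea as maxBytes in Linux.hsc, but written with base's CSsize
-- newtype instead of the hsc2hs (#type ssize_t) splice.
maxBytes :: Int64
maxBytes = fromIntegral (maxBound :: CSsize)

-- | Sizes in bytes of ssize_t and size_t as seen from Haskell on the
-- platform the program was compiled for.
sizesInBytes :: (Int, Int)
sizesInBytes = (sizeOf (0 :: CSsize), sizeOf (0 :: CSize))

-- | Print both values; the numbers depend on the platform.
report :: IO ()
report = do
  putStrLn ("maxBytes = " ++ show maxBytes)
  putStrLn ("sizeof (ssize_t, size_t) = " ++ show sizesInBytes)
```

The Haskell-side types are fixed at compile time for the target platform, which is consistent with the warning being about the generated C helper only.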

--
Felipe.

[Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-06 Thread Bardur Arantsson


Felipe Lessa wrote:

On Sat, Feb 06, 2010 at 09:16:35AM +0100, Bardur Arantsson wrote:

Brandon S. Allbery KF8NH wrote:

On Feb 5, 2010, at 02:56 , Bardur Arantsson wrote:

[--snip--]

Broken pipe is normally handled as a signal, and is only mapped
to an error if SIGPIPE is set to SIG_IGN.  I can well imagine that
the SIGPIPE signal handler isn't closing resources properly; a
workaround would be to use the System.Posix.Signals API to ignore
SIGPIPE, but I don't know if that would work as a general solution
(it would depend on what other uses of pipes/sockets exist).

It was a good idea, but it doesn't seem to help to add

installHandler sigPIPE Ignore (Just fullSignalSet)

to the main function. (Given the package name I assume
System.Posix.Signals works similarly to regular old signals, i.e.
globally per-process.)

This is really starting to drive me round the bend...


Have you seen GHC ticket #1619?

http://hackage.haskell.org/trac/ghc/ticket/1619




I hadn't. I guess the conclusion is that SIGPIPE is ignored by default anyway.
So much for that.

During yet another bout of debugging, I've added even more "I am here"
instrumentation code to the SendFile code, and the culprit seems to be
threadWaitWrite. Here's the bit of code I've modified:

 sendfile :: Fd -> Fd -> Ptr Int64 -> Int64 -> IO Int64
 sendfile out_fd in_fd poff bytes = do
     putStrLn "PRE-threadWaitWrite"
     threadWaitWrite out_fd
     putStrLn "AFTER threadWaitWrite"
     sbytes <- c_sendfile out_fd in_fd poff (fromIntegral bytes)
     putStrLn $ "AFTER c_sendfile; result was: " ++ (show sbytes)
     if sbytes <= -1
       then do errno <- getErrno
               if errno == eAGAIN
                 then sendfile out_fd in_fd poff bytes
                 else throwErrno "Network.Socket.SendFile.Linux"
       else return (fromIntegral sbytes)

This is the output when a file descriptor is lost:

---
AFTER sendfile: sbytes=27512
DIFFERENCE: 627264520
remaining=627264520, bytes=627264520
PRE-threadWaitWrite
Got request for CONTENT for objectId=1700,f2150400
Serving file 'X'...
Sending 625838080 bytes...
in_fd=13
---

So I have to conclude that threadWaitWrite is doing *something* which causes
the thread to die when the PS3 kills the connection.



[Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-06 Thread Bardur Arantsson


Bardur Arantsson wrote:

(sorry about replying-to-self)

During yet another bout of debugging, I've added even more "I am here" 
instrumentation code to the SendFile code, and the culprit seems to be

 threadWaitWrite.

I think I've pretty much confirmed this.

I've changed the code again. This time to:

 sendfile :: Fd -> Fd -> Ptr Int64 -> Int64 -> IO Int64
 sendfile out_fd in_fd poff bytes = do
     putStrLn "PRE-threadWaitWrite"
     -- threadWaitWrite out_fd
     -- putStrLn "AFTER threadWaitWrite"
     sbytes <- c_sendfile out_fd in_fd poff (fromIntegral bytes)
     putStrLn $ "AFTER c_sendfile; result was: " ++ (show sbytes)
     if sbytes <= -1
       then do errno <- getErrno
               if errno == eAGAIN
                 then do
                    threadDelay 100
                    sendfile out_fd in_fd poff bytes
                 else throwErrno "Network.Socket.SendFile.Linux"
       else return (fromIntegral sbytes)

That is, I removed the threadWaitWrite in favor of just adding a
threadDelay 100 when eAGAIN is encountered.

With this code, I cannot provoke the leak.

Unfortunately this isn't really a solution -- the CPU is pegged at
~50% when I do this and it's not exactly elegant to have a hardcoded
100 us delay (threadDelay takes microseconds) in there. :)

I'm hoping that someone who understands the internals of GHC can chime
in here with some kind of explanation as to if/why/how threadWaitWrite can
fail in this way.

Anyone?

Cheers,


Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-06 Thread Thomas Hartman

me too.

2010/2/5 MightyByte mightyb...@gmail.com:
 I've been seeing a steady stream of similar resource vanished messages
 for as long as I've been running my happstack app.  This message I get
 is this:

 socket: 58: hClose: resource vanished (Broken pipe)

 I run my app from a shell script inside a while true loop, so it
 automatically gets restarted if it crashes.  This incurs no more than
 a few seconds of down time.  Since that is acceptable for my
 application, I've never put much effort into investigating the issue.
 But I don't think the resource vanished error results in program
 termination.  When I have looked into it, I've had similar trouble
 reproducing it.  Clients such as wget and firefox don't seem to cause
 the problem.  If I remember correctly it only happens with IE.

 On Fri, Feb 5, 2010 at 2:56 AM, Bardur Arantsson s...@scientician.net wrote:
 Jeremy Shaw wrote:

 Actually,

 We should start by testing if native sendfile leaks file descriptors even
 when the whole file is sent. We have a test suite, but I am not sure if it
 tests for file handle leaking...


 I should have posted this earlier, but the exact message I'm seeing in the
 case where the Bad Client disconnects is this:

   hums: Network.Socket.SendFile.Linux: resource vanished (Broken pipe)

 Oddly, I haven't been able to reproduce this using a wget client with a ^C
 during transfer. When I disconnect wget with ^C or pkill wget or even
 pkill -9 wget, I get this message:

  hums: Network.Socket.SendFile.Linux: resource vanished (Connection reset by
 peer)

 (and no leak, as observed by lsof | grep hums).

 So there appears to be some vital difference between the handling of the two
 cases.

 Another observation which may be useful:

 Before the sendfile' API change (Handle - FilePath) in sendfile-0.6.x, my
 code used withFile to open the file and to ensure that it was closed. So
 it seems that withBinaryFile *should* also be fine. Unless the Broken pipe
 error somehow escapes the scope without causing a close.

 I don't have time to dig more right now, but I'll try to see if I can find
 out more later.

 Cheers,




[Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-05 Thread Bardur Arantsson


Jeremy Shaw wrote:

Actually,

We should start by testing if native sendfile leaks file descriptors even
when the whole file is sent. We have a test suite, but I am not sure if it
tests for file handle leaking...



I should have posted this earlier, but the exact message I'm seeing in 
the case where the Bad Client disconnects is this:


   hums: Network.Socket.SendFile.Linux: resource vanished (Broken pipe)

Oddly, I haven't been able to reproduce this using a wget client with a 
^C during transfer. When I disconnect wget with ^C or "pkill wget" or 
even "pkill -9 wget", I get this message:


  hums: Network.Socket.SendFile.Linux: resource vanished (Connection 
reset by peer)


(and no leak, as observed by lsof | grep hums).

So there appears to be some vital difference between the handling of the 
two cases.


Another observation which may be useful:

Before the sendfile' API change (Handle -> FilePath) in sendfile-0.6.x, 
my code used withFile to open the file and to ensure that it was 
closed. So it seems that withBinaryFile *should* also be fine. Unless 
the "Broken pipe" error somehow escapes the scope without causing a close.
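
For reference, the expectation that a bracket-style helper runs its cleanup even when an IOError escapes can be checked in isolation. This is a minimal sketch under assumptions: an IORef stands in for the file handle, the error is a made-up userError rather than a real EPIPE, and releaseRunsOnError is a hypothetical name.

```haskell
import Control.Exception (IOException, bracket, try)
import Data.IORef (newIORef, readIORef, writeIORef)

-- | Acquire a stand-in "resource", fail while using it, and report whether
-- the release action still ran. The IORef plays the role of the file handle.
releaseRunsOnError :: IO Bool
releaseRunsOnError = do
  closed <- newIORef False
  outcome <- try (bracket (return closed)                 -- acquire
                          (\ref -> writeIORef ref True)   -- release ("close")
                          (\_ -> ioError (userError "Broken pipe")))
               :: IO (Either IOException ())
  case outcome of
    Left _  -> readIORef closed   -- the error propagated; was it "closed"?
    Right _ -> return False       -- not expected: the body always throws
```

If this holds for ordinary exceptions, a leaked descriptor would suggest that the thread never leaves the blocking call at all, rather than that an exception bypasses the cleanup.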


I don't have time to dig more right now, but I'll try to see if I can 
find out more later.


Cheers,


Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-05 Thread MightyByte

I've been seeing a steady stream of similar "resource vanished" messages
for as long as I've been running my happstack app.  The message I get
is this:

socket: 58: hClose: resource vanished (Broken pipe)

I run my app from a shell script inside a "while true" loop, so it
automatically gets restarted if it crashes.  This incurs no more than
a few seconds of down time.  Since that is acceptable for my
application, I've never put much effort into investigating the issue.
But I don't think the "resource vanished" error results in program
termination.  When I have looked into it, I've had similar trouble
reproducing it.  Clients such as wget and firefox don't seem to cause
the problem.  If I remember correctly it only happens with IE.

On Fri, Feb 5, 2010 at 2:56 AM, Bardur Arantsson s...@scientician.net wrote:
 Jeremy Shaw wrote:

 Actually,

 We should start by testing if native sendfile leaks file descriptors even
 when the whole file is sent. We have a test suite, but I am not sure if it
 tests for file handle leaking...


 I should have posted this earlier, but the exact message I'm seeing in the
 case where the Bad Client disconnects is this:

   hums: Network.Socket.SendFile.Linux: resource vanished (Broken pipe)

 Oddly, I haven't been able to reproduce this using a wget client with a ^C
 during transfer. When I disconnect wget with ^C or pkill wget or even
 pkill -9 wget, I get this message:

  hums: Network.Socket.SendFile.Linux: resource vanished (Connection reset by
 peer)

 (and no leak, as observed by lsof | grep hums).

 So there appears to be some vital difference between the handling of the two
 cases.

 Another observation which may be useful:

 Before the sendfile' API change (Handle - FilePath) in sendfile-0.6.x, my
 code used withFile to open the file and to ensure that it was closed. So
 it seems that withBinaryFile *should* also be fine. Unless the Broken pipe
 error somehow escapes the scope without causing a close.

 I don't have time to dig more right now, but I'll try to see if I can find
 out more later.

 Cheers,



[Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-05 Thread Bardur Arantsson


Thomas Hartman wrote:

Do you have a test script to reproduce the behavior?



Unfortunately not, but the behavior *is* 100% reproducible with
my PS3 client. The production of a leaked FD appears to require a
particularly abrupt disconnect (see my other reply in this thread), so
you're probably safe in most cases.

Cheers,



[Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-05 Thread Bardur Arantsson

In desperation, I've tried to instrument a couple of the functions in 
SendFile:


 sendFile'' :: Socket -> Handle -> Integer -> Integer -> IO ()
 sendFile'' outs inp off count =
     do let out_fd = Fd (fdSocket outs)
        in_fd <- handleToFd inp
        putStrLn ("in_fd=" ++ show in_fd)
        finally (wrapSendFile' _sendFile out_fd in_fd off count)
                (do
                   putStrLn ("SENDFILE DONE " ++ show in_fd)
                )

 sendFile' :: Socket -> FilePath -> Integer -> Integer -> IO ()
 sendFile' outs infp offset count =
     bracket
        (openBinaryFile infp ReadMode)
        (\h -> do
           putStrLn "CLOSING FILE!"
           hClose h
           putStrLn "FILE CLOSED!")
        (\inp -> sendFile'' outs inp offset count)

(Yes, this made me feel dirty.)

Here's the resulting output from around when the file descriptor gets lost:

---
Serving file 'X'...
Sending 674465792 bytes...
in_fd=7
SENDFILE DONE 7
CLOSING FILE!
FILE CLOSED!
hums: Network.Socket.SendFile.Linux: resource vanished (Broken pipe)
Got request for CONTENT for objectId=1700,f2150400

Serving file 'X'...
Sending 672892928 bytes...
in_fd=7
SENDFILE DONE 7
CLOSING FILE!
FILE CLOSED!
hums: Network.Socket.SendFile.Linux: resource vanished (Broken pipe)
Got request for CONTENT for objectId=1700,f2150400

Serving file 'X'...
Sending 670140416 bytes...
in_fd=7

*- What happened here?

Got request for CONTENT for objectId=1700,f2150400

Serving file 'X'...
Sending 667256832 bytes...
in_fd=9
SENDFILE DONE 9
CLOSING FILE!
FILE CLOSED!
hums: Network.Socket.SendFile.Linux: resource vanished (Broken pipe)
Got request for CONTENT for objectId=1700,f2150400

Serving file 'X'...
Sending 665028608 bytes...
in_fd=9
SENDFILE DONE 9
CLOSING FILE!
FILE CLOSED!
hums: Network.Socket.SendFile.Linux: resource vanished (Broken pipe)
Got request for CONTENT for objectId=1700,f2150400

Serving file 'X'...
---


Anyone got any clues as to what might cause the behavior shown at the mark?

The only idea I have is that *something* in the SendFile library kills 
the thread completely (or somehow evades finally), but I have no idea 
what it might be.


Cheers,


Re: [Haskell-cafe] Re: sendfile leaking descriptors on Linux?

2010-02-05 Thread Brandon S. Allbery KF8NH


On Feb 5, 2010, at 02:56 , Bardur Arantsson wrote:
I should have posted this earlier, but the exact message I'm seeing  
in the case where the Bad Client disconnects is this:


  hums: Network.Socket.SendFile.Linux: resource vanished (Broken pipe)

Oddly, I haven't been able to reproduce this using a wget client  
with a ^C during transfer. When I disconnect wget with ^C or  
pkill wget or even pkill -9 wget, I get this message:


 hums: Network.Socket.SendFile.Linux: resource vanished (Connection  
reset by peer)


(and no leak, as observed by lsof | grep hums).



Broken pipe is normally handled as a signal, and is only mapped to  
an error if SIGPIPE is set to SIG_IGN.  I can well imagine that the  
SIGPIPE signal handler isn't closing resources properly; a workaround  
would be to use the System.Posix.Signals API to ignore SIGPIPE, but I  
don't know if that would work as a general solution (it would depend  
on what other uses of pipes/sockets exist).


--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon universityKF8NH




