Re: read(2) and ETIMEDOUT
Thu, Jun 07, 2001 at 20:18:46, gbarr (Graham Barr) wrote about "Re: read(2) and ETIMEDOUT":

> > I'm quite sure ETIMEDOUT is a result of hitting the setsockopt
> > SO_RCVTIMEO value when doing a read.
> I had been thinking along those lines too. But immediately before calling
> read, select said there was data to read, So it should not block, but
> read what data is there and return.

This is a conceptual error on your part: select does _not_ say "there is
data to read", it only says "read() will not block". EOF (when read()
returns 0) and any situation where read() returns -1 and sets errno also
count as "will not block". However, I don't think this misunderstanding
has anything to do with producing the ETIMEDOUTs.

> Also why does this happen only every few hours ? There is a lot of
> data going through these connections maybe the timer for SO_RCVTIMEO
> is not being reset.

You should determine the exact cases where ETIMEDOUT occurs.

netch@iv:/usr/REL4/src/sys/netinet>fgrep ETIMEDOUT *.c
tcp_input.c:tcp_drop(sototcpcb(so2), ETIMEDOUT);
tcp_subr.c: if (errno == ETIMEDOUT && tp->t_softerror)
tcp_timer.c:tp = tcp_drop(tp, ETIMEDOUT);
tcp_timer.c: tp = tcp_drop(tp, ETIMEDOUT);
tcp_timer.c:tp->t_softerror : ETIMEDOUT);

Add debug printf()s with __LINE__, __FILE__, and the variables the stack
uses when it decides to drop the connection. Collect statistics.

/netch

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message
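The point about select() above can be checked with a small sketch. It is not taken from the thread: the function name read_after_peer_close is hypothetical, and it uses a UNIX-domain socketpair() instead of a TCP connection purely to stay self-contained. The peer closes without ever sending data, select() still reports the descriptor as readable, and read() returns 0 (EOF) without blocking.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns what read() yields on a descriptor that
 * select() reported as readable only because the peer closed its end.
 *    0  select() fired and read() returned EOF
 *   -1  setup failed or select() did not report the descriptor
 *   >0  unexpected data
 */
int read_after_peer_close(void)
{
    int sv[2];
    fd_set rfds;
    struct timeval tv = { 1, 0 };       /* upper bound on the wait */
    char buf[16];
    ssize_t n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    close(sv[1]);                       /* peer goes away, no data was sent */

    FD_ZERO(&rfds);
    FD_SET(sv[0], &rfds);
    if (select(sv[0] + 1, &rfds, NULL, NULL, &tv) != 1 ||
        !FD_ISSET(sv[0], &rfds)) {
        close(sv[0]);
        return -1;
    }

    n = read(sv[0], buf, sizeof(buf));  /* does not block */
    close(sv[0]);
    return (int)n;
}
```

A return value of 0 means "readable" was reported although no data existed; a pending socket error is reported to select() the same way.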
Re: read(2) and ETIMEDOUT
On Fri, Jun 08, 2001 at 09:39:15PM +0200, Bernd Walter wrote:
> On Thu, Jun 07, 2001 at 03:20:58PM -0700, Matt Dillon wrote:
> >
> > :
> > :On Thu, Jun 07, 2001 at 10:33:50AM -0700, Matt Dillon wrote:
> > :>
> > :> :
> > :> :Thanks, I will try setting errno, but I don't think it is signals.
> > :> :I have been running truss on the process. The relevant part is
> > :> :
> > :> :gettimeofday(0xbfbffa54,0x0) = 0 (0x0)
> > :> :select(0x50,0x93f8c90,0x0,0x0,0xbfbffa74)= 3 (0x3)
> > :> :read(0x16,0xa2da000,0x8000) ERR#60 'Operation timed out'
> > :> :
> > :> :In fact there are no signals in the whole truss output
> > :> :
> > :> :Graham.
> > :>
> > :>     What type of descriptor is the read being performed on? A TCP
> > :>     connection or, say, reading a file over NFS?
> > :
> > :It is a TCP/IP connection.
> > :
> > :Graham.
> >
> >     You can get this if the TCP connection times out, either through a
> >     keepalive timeout or the protocol hits the maximum number of transmit
> >     retries. I'd have to delve into the cvs logs to see when it was added,
> >     but it seems reasonable. You should treat it simply as an EIO or
> >     something like that.
>
> Keepalives are a good point.
> I know of OS/2 Systems that can't handle them and behave the way you
> describe.
> What system is on the other side?

All the systems are exactly the same.

Graham.
Re: read(2) and ETIMEDOUT
On Thu, Jun 07, 2001 at 03:20:58PM -0700, Matt Dillon wrote:
>
> :
> :On Thu, Jun 07, 2001 at 10:33:50AM -0700, Matt Dillon wrote:
> :>
> :> :
> :> :Thanks, I will try setting errno, but I don't think it is signals.
> :> :I have been running truss on the process. The relevant part is
> :> :
> :> :gettimeofday(0xbfbffa54,0x0) = 0 (0x0)
> :> :select(0x50,0x93f8c90,0x0,0x0,0xbfbffa74)= 3 (0x3)
> :> :read(0x16,0xa2da000,0x8000) ERR#60 'Operation timed out'
> :> :
> :> :In fact there are no signals in the whole truss output
> :> :
> :> :Graham.
> :>
> :>     What type of descriptor is the read being performed on? A TCP
> :>     connection or, say, reading a file over NFS?
> :
> :It is a TCP/IP connection.
> :
> :Graham.
>
>     You can get this if the TCP connection times out, either through a
>     keepalive timeout or the protocol hits the maximum number of transmit
>     retries. I'd have to delve into the cvs logs to see when it was added,
>     but it seems reasonable. You should treat it simply as an EIO or
>     something like that.

Keepalives are a good point.
I know of OS/2 systems that can't handle them and behave the way you
describe.
What system is on the other side?

--
B.Walter COSMO-Project http://www.cosmo-project.de
[EMAIL PROTECTED] Usergroup [EMAIL PROTECTED]
Re: read(2) and ETIMEDOUT
Ian Dowse wrote:
>
> In message <[EMAIL PROTECTED]>, Graham Barr writes:
>
> >Also why does this happen only every few hours ? There is a lot of
> >data going through these connections maybe the timer for SO_RCVTIMEO
> >is not being reset.
> >
> >But then we have another server, with a similar number of clients and
> >data through put, but it does not suffer from this problem.
>
> I suspect that the server seeing this problem has a client that
> occasionally disappears from the network, or for whatever reason
> fails to respond to any packets for a long time (something like 5
> or 10 minutes). I've seen blocking TCP writes return ETIMEDOUT when
> the network between the client and the server goes down. In the
> non-blocking case I think the following can happen:

I believe the proxy ARP normally sent when an interface comes up can
have this effect when a client goes down and someone else gets
its DHCP lease.

You don't often see this on FreeBSD clients after 4.1, since the
gratuitous proxy ARP became broken around then (if you change your
IP address, it won't send the ARP unless you down the interface
first and bring it back up, and it caches bad clone routes, too,
just to make your life miserable).

Probably your lease expiration times are set too low. This is
usually the case in networks where people have transient
connections for things like mobile users, have exhausted
their IP address space, and are trying to conserve it by using
much shorter leases.

A good, real fix for this is to have incredibly long lease
lifetimes (basically, the DHCP server hands out the lease, and if
the computer comes back days later, it gets the same lease). For
this to work, you are probably going to have to make the local
DHCP server give out 10.x addresses, and then NAT the 10.x net
for real Internet connectivity.

Alternately, it could be something completely different. 8-).

-- Terry
Re: read(2) and ETIMEDOUT
:
:On Thu, Jun 07, 2001 at 10:33:50AM -0700, Matt Dillon wrote:
:>
:> :
:> :Thanks, I will try setting errno, but I don't think it is signals.
:> :I have been running truss on the process. The relevant part is
:> :
:> :gettimeofday(0xbfbffa54,0x0) = 0 (0x0)
:> :select(0x50,0x93f8c90,0x0,0x0,0xbfbffa74)= 3 (0x3)
:> :read(0x16,0xa2da000,0x8000) ERR#60 'Operation timed out'
:> :
:> :In fact there are no signals in the whole truss output
:> :
:> :Graham.
:>
:>     What type of descriptor is the read being performed on? A TCP
:>     connection or, say, reading a file over NFS?
:
:It is a TCP/IP connection.
:
:Graham.

    You can get this if the TCP connection times out, either through a
    keepalive timeout or the protocol hits the maximum number of transmit
    retries. I'd have to delve into the cvs logs to see when it was added,
    but it seems reasonable. You should treat it simply as an EIO or
    something like that.

    Generally speaking you should handle return codes from system calls
    by handling the codes you know about and simply assuming that anything
    else is fatal to the particular connection.

    if (systemcall(...) < 0) {
        switch(errno) {
        case EINTR:
        case EAGAIN:
            ... deal with non-blocking situations ...
            break;
        . . .
        default:
            ... assume everything else is a fatal error on the socket ...
            ... close the descriptor and cleanup its state ...
            break;
        }
    }

    This gives you the maximum portability between platforms and between
    releases.

                                        -Matt
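A compilable rendering of the switch sketched above might look like the following. It is only a minimal sketch: the names io_action, IO_RETRY, IO_CLOSE_CONNECTION and classify_io_errno are hypothetical and not from the thread, and the EWOULDBLOCK case is an addition for platforms where that value differs from EAGAIN.

```c
#include <errno.h>

/* Hypothetical result type: what the caller should do after a failed call. */
enum io_action { IO_RETRY, IO_CLOSE_CONNECTION };

/*
 * Classify an errno value saved right after a failed read()/write() on a
 * client socket. Known transient codes mean "try again later"; every other
 * code, including ones read(2) does not document such as ETIMEDOUT, is
 * treated as fatal to that one connection only.
 */
enum io_action classify_io_errno(int err)
{
    switch (err) {
    case EINTR:
    case EAGAIN:
#if defined(EWOULDBLOCK) && EWOULDBLOCK != EAGAIN
    case EWOULDBLOCK:
#endif
        return IO_RETRY;
    default:
        return IO_CLOSE_CONNECTION;
    }
}
```

A caller would save errno immediately after the failing call, pass it in, and on IO_CLOSE_CONNECTION close the descriptor and free that client's state while the rest of the server keeps running.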
Re: read(2) and ETIMEDOUT
While this does sound very plausible,...

The server does not do any writes; data only travels from the clients
to the server.

The clients and the server are connected to the same switch.

The other server, which is similar, is on the same network and is
connected to by the same machines as clients, yet it does not see
any problems.

But thanks for the insight. I will place a sniffer on the port and see
if there are excessive retransmits.

Graham.

On Thu, Jun 07, 2001 at 09:16:19PM +0100, Ian Dowse wrote:
> In message <[EMAIL PROTECTED]>, Graham Barr writes:
>
> >Also why does this happen only every few hours ? There is a lot of
> >data going through these connections maybe the timer for SO_RCVTIMEO
> >is not being reset.
> >
> >But then we have another server, with a similar number of clients and
> >data through put, but it does not suffer from this problem.
>
> I suspect that the server seeing this problem has a client that
> occasionally disappears from the network, or for whatever reason
> fails to respond to any packets for a long time (something like 5
> or 10 minutes). I've seen blocking TCP writes return ETIMEDOUT when
> the network between the client and the server goes down. In the
> non-blocking case I think the following can happen:
>
> 1) Client is connected to server.
> 2) Network goes down, or client is turned off
> 3) Server performs non-blocking write() on socket
> 4) Server uses poll/select/kevent waiting for data from socket
> 5) The write operation times out because no acknowledgements
>    have been received. This occurs after TCP_MAXRXTSHIFT
>    retransmits, so->so_error is set to ETIMEDOUT and the
>    connection is shut down (I haven't read the code very
>    carefully, so the details could be wrong).
> 6) select/poll/kevent notes the EOF condition, and says that
>    the descriptor is ready to read.
> 7) read() returns the real error, which is ETIMEDOUT.
>
> I guess this should possibly be documented in read(2), but in
> practice there are numerous network errors that can be returned
> from read(). Normal practice in single-process servers is to
> consider any unknown errors from read(),write() etc as only
> fatal to that client rather than the whole server.
>
> Ian
Re: read(2) and ETIMEDOUT
In message <[EMAIL PROTECTED]>, Graham Barr writes:

>Also why does this happen only every few hours ? There is a lot of
>data going through these connections maybe the timer for SO_RCVTIMEO
>is not being reset.
>
>But then we have another server, with a similar number of clients and
>data through put, but it does not suffer from this problem.

I suspect that the server seeing this problem has a client that
occasionally disappears from the network, or for whatever reason
fails to respond to any packets for a long time (something like 5
or 10 minutes). I've seen blocking TCP writes return ETIMEDOUT when
the network between the client and the server goes down. In the
non-blocking case I think the following can happen:

1) Client is connected to server.
2) Network goes down, or client is turned off.
3) Server performs non-blocking write() on socket.
4) Server uses poll/select/kevent waiting for data from socket.
5) The write operation times out because no acknowledgements
   have been received. This occurs after TCP_MAXRXTSHIFT
   retransmits; so->so_error is set to ETIMEDOUT and the
   connection is shut down (I haven't read the code very
   carefully, so the details could be wrong).
6) select/poll/kevent notes the EOF condition, and says that
   the descriptor is ready to read.
7) read() returns the real error, which is ETIMEDOUT.

I guess this should possibly be documented in read(2), but in
practice there are numerous network errors that can be returned
from read(). Normal practice in single-process servers is to
consider any unknown errors from read(), write() etc. as fatal
only to that client rather than to the whole server.

Ian
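The pending error described in steps 5 to 7 can also be inspected from user space with the standard SO_ERROR socket option, which returns the socket's pending error and clears it. The sketch below is an illustration under assumptions, not something proposed in the thread: the helper names are hypothetical, and the self-check uses a UNIX-domain socketpair() where no error is pending, since reproducing a real retransmit timeout needs a TCP peer that stops answering.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: fetch (and clear) the pending error on a socket.
 * Returns the pending errno value, 0 if none, or -1 if getsockopt() fails.
 */
int pending_socket_error(int fd)
{
    int err = 0;
    socklen_t len = sizeof(err);

    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return -1;
    return err;
}

/* Self-check on a healthy socket pair: no error is expected to be pending. */
int healthy_pair_pending_error(void)
{
    int sv[2];
    int err;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    err = pending_socket_error(sv[0]);
    close(sv[0]);
    close(sv[1]);
    return err;
}
```

On a TCP connection that timed out as in step 5, the same call would be expected to report ETIMEDOUT, provided an earlier read() has not already consumed the error.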
Re: read(2) and ETIMEDOUT
On Thu, Jun 07, 2001 at 03:09:17PM -0400, Alfred Perlstein wrote:
> * Graham Barr <[EMAIL PROTECTED]> [010607 12:17] wrote:
>
> Since people seem to be helping you in other ways, I'll just
> answer this one:
>
> > So, here is my question. Does anyone know under what circumstance
> > ETIMEDOUT may be returned from read(2) or is this a potential bug
> > somewhere ?
>
> I'm quite sure ETIMEDOUT is a result of hitting the setsockopt
> SO_RCVTIMEO value when doing a read.

I had been thinking along those lines too. But immediately before calling
read, select said there was data to read, so it should not block, but
read what data is there and return.

Also, why does this happen only every few hours? There is a lot of
data going through these connections; maybe the timer for SO_RCVTIMEO
is not being reset.

But then we have another server, with a similar number of clients and
data throughput, but it does not suffer from this problem.

As you can probably tell, we have been tearing our hair out over this one.

Graham.
Re: read(2) and ETIMEDOUT
* Graham Barr <[EMAIL PROTECTED]> [010607 12:17] wrote:

Since people seem to be helping you in other ways, I'll just
answer this one:

> So, here is my question. Does anyone know under what circumstance
> ETIMEDOUT may be returned from read(2) or is this a potential bug
> somewhere ?

I'm quite sure ETIMEDOUT is a result of hitting the setsockopt
SO_RCVTIMEO value when doing a read.

--
-Alfred Perlstein [[EMAIL PROTECTED]]
Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.
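The SO_RCVTIMEO theory is easy to test directly. POSIX specifies that a receive which hits that timeout with no data fails with EAGAIN or EWOULDBLOCK rather than ETIMEDOUT, and the sketch below lets a given system confirm it. The function name errno_after_rcvtimeo is hypothetical, the 100 ms timeout is arbitrary, and a UNIX-domain socketpair() stands in for the TCP connection.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical helper: set a short SO_RCVTIMEO on an idle socket, read
 * from it, and report the errno that read() produced.
 * Returns that errno, 0 if read() unexpectedly succeeded, -1 on setup failure.
 */
int errno_after_rcvtimeo(void)
{
    int sv[2];
    struct timeval tv = { 0, 100000 };  /* 100 ms, arbitrary */
    char buf[16];
    ssize_t n;
    int saved;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (setsockopt(sv[0], SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    errno = 0;
    n = read(sv[0], buf, sizeof(buf));  /* nothing was sent: waits, then fails */
    saved = errno;                      /* save before close() can change it */

    close(sv[0]);
    close(sv[1]);
    return n < 0 ? saved : 0;
}
```

If this reports EAGAIN or EWOULDBLOCK, an ETIMEDOUT seen from read() most likely comes from somewhere else, such as the TCP timers discussed elsewhere in the thread.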
Re: read(2) and ETIMEDOUT
I've seen this behavior in the past. My impression is that it is load
related. If you do a grep on ETIMEDOUT in /usr/src/sys/netinet, you will
see where the tcp stack may return this error. There may be some sysctl
params relating to timers that you can muck with.

Rick

Graham Barr wrote:
> On Thu, Jun 07, 2001 at 10:33:50AM -0700, Matt Dillon wrote:
> >
> > :
> > :Thanks, I will try setting errno, but I don't think it is signals.
> > :I have been running truss on the process. The relevant part is
> > :
> > :gettimeofday(0xbfbffa54,0x0) = 0 (0x0)
> > :select(0x50,0x93f8c90,0x0,0x0,0xbfbffa74)= 3 (0x3)
> > :read(0x16,0xa2da000,0x8000) ERR#60 'Operation timed out'
> > :
> > :In fact there are no signals in the whole truss output
> > :
> > :Graham.
> >
> >     What type of descriptor is the read being performed on? A TCP
> >     connection or, say, reading a file over NFS?
>
> It is a TCP/IP connection.
>
> Graham.
Re: read(2) and ETIMEDOUT
On Thu, Jun 07, 2001 at 10:33:50AM -0700, Matt Dillon wrote:
>
> :
> :Thanks, I will try setting errno, but I don't think it is signals.
> :I have been running truss on the process. The relevant part is
> :
> :gettimeofday(0xbfbffa54,0x0) = 0 (0x0)
> :select(0x50,0x93f8c90,0x0,0x0,0xbfbffa74)= 3 (0x3)
> :read(0x16,0xa2da000,0x8000) ERR#60 'Operation timed out'
> :
> :In fact there are no signals in the whole truss output
> :
> :Graham.
>
>     What type of descriptor is the read being performed on? A TCP
>     connection or, say, reading a file over NFS?

It is a TCP/IP connection.

Graham.
Re: read(2) and ETIMEDOUT
:
:Thanks, I will try setting errno, but I don't think it is signals.
:I have been running truss on the process. The relevant part is
:
:gettimeofday(0xbfbffa54,0x0) = 0 (0x0)
:select(0x50,0x93f8c90,0x0,0x0,0xbfbffa74)= 3 (0x3)
:read(0x16,0xa2da000,0x8000) ERR#60 'Operation timed out'
:
:In fact there are no signals in the whole truss output
:
:Graham.

    What type of descriptor is the read being performed on? A TCP
    connection or, say, reading a file over NFS?

                                        -Matt
Re: read(2) and ETIMEDOUT
Thanks, I will try setting errno, but I don't think it is signals.
I have been running truss on the process. The relevant part is

gettimeofday(0xbfbffa54,0x0) = 0 (0x0)
select(0x50,0x93f8c90,0x0,0x0,0xbfbffa74)= 3 (0x3)
read(0x16,0xa2da000,0x8000) ERR#60 'Operation timed out'

In fact there are no signals in the whole truss output

Graham.

On Thu, Jun 07, 2001 at 09:53:54AM -0700, Matt Dillon wrote:
> :A while ago our systems were upgraded from 4.2 to 4.3-RC, and at
> :this time we started seeing problems that I am having a difficult
> :time tracking down.
> :
> :We have a server process which is connected to by many other
> :machines, each of them streams data in via tcp/ip. These connections
> :are pretty much permanent.
> :
> :All had been running fine for a long time before the upgrade, but
> :now we have a problem with read(2) returning an error ETIMEDOUT,
> :which causes our code to close the connection.
> :
> :The strange thing is that things are fine for a few hours, then
> :all of a sudden we see each of the connections fail with this error.
> :Then after the clients have reconnected, all is fine for a few
> :hours and then they all do it again.
> :
> :The problem I am having in tracking this down is that man 2 read
> :does not specify ETIMEDOUT as an error that can be returned from
> :read(2) and man errno specifies that it would be returned from
> :connect(2) or send(2)
> :
> :So, here is my question. Does anyone know under what circumstance
> :ETIMEDOUT may be returned from read(2) or is this a potential bug
> :somewhere ?
> :
> :Thanks,
> :Graham.
>
>     This seems very odd. I recommend setting errno to 0 prior to calling
>     read() to make sure that it is actually read() that is setting the
>     errno. You should also sift through your code and look closely at any
>     signal handlers you might have - system calls made from inside a signal
>     handler can rip errno right out from under the code the signal
>     interrupted.
>
>                                         -Matt
Re: read(2) and ETIMEDOUT
:A while ago our systems were upgraded from 4.2 to 4.3-RC, and at
:this time we started seeing problems that I am having a difficult
:time tracking down.
:
:We have a server process which is connected to by many other
:machines, each of them streams data in via tcp/ip. These connections
:are pretty much permanent.
:
:All had been running fine for a long time before the upgrade, but
:now we have a problem with read(2) returning an error ETIMEDOUT,
:which causes our code to close the connection.
:
:The strange thing is that things are fine for a few hours, then
:all of a sudden we see each of the connections fail with this error.
:Then after the clients have reconnected, all is fine for a few
:hours and then they all do it again.
:
:The problem I am having in tracking this down is that man 2 read
:does not specify ETIMEDOUT as an error that can be returned from
:read(2) and man errno specifies that it would be returned from
:connect(2) or send(2)
:
:So, here is my question. Does anyone know under what circumstance
:ETIMEDOUT may be returned from read(2) or is this a potential bug
:somewhere ?
:
:Thanks,
:Graham.

    This seems very odd. I recommend setting errno to 0 prior to calling
    read() to make sure that it is actually read() that is setting the
    errno. You should also sift through your code and look closely at any
    signal handlers you might have - system calls made from inside a signal
    handler can rip errno right out from under the code the signal
    interrupted.

                                        -Matt