Re: read(2) and ETIMEDOUT

2001-06-08 Thread Graham Barr

On Fri, Jun 08, 2001 at 09:39:15PM +0200, Bernd Walter wrote:
 On Thu, Jun 07, 2001 at 03:20:58PM -0700, Matt Dillon wrote:
  
  :
  :On Thu, Jun 07, 2001 at 10:33:50AM -0700, Matt Dillon wrote:
  : 
  : :
  : :Thanks, I will try setting errno, but I don't think it is signals.
  : :I have been running truss on the process. The relevant part is
  : :
  : :gettimeofday(0xbfbffa54,0x0) = 0 (0x0)
  : :select(0x50,0x93f8c90,0x0,0x0,0xbfbffa74)= 3 (0x3)
  : :read(0x16,0xa2da000,0x8000)  ERR#60 'Operation timed out'
  : :
  : :In fact there are no signals in the whole truss output
  : :
  : :Graham.
  : 
  : What type of descriptor is the read being performed on?  A TCP
  : connection or, say, reading a file over NFS?
  :
  :It is a TCP/IP connection.
  :
  :Graham.
  
  You can get this if the TCP connection times out, either through a
  keepalive timeout or the protocol hits the maximum number of transmit
  retries.  I'd have to delve into the cvs logs to see when it was added,
  but it seems reasonable.  You should treat it simply as an EIO or
  something like that.
 
 Keepalives are a good point.
 I know of OS/2 Systems that can't handle them and behave the way you
 describe.
 What system is on the other side?

All the systems are exactly the same.

Graham.
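
Since keepalives came up as one possible source of the timeout, here is a
minimal sketch of how a program could check or change the per-socket
keepalive setting with SO_KEEPALIVE. The helper names set_keepalive() and
keepalive_enabled() are made up for illustration and are not from the code
discussed in this thread.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: turn SO_KEEPALIVE on (on != 0) or off (on == 0).
 * Returns 0 on success, -1 on failure with errno set by setsockopt(). */
int set_keepalive(int fd, int on)
{
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
}

/* Hypothetical helper: report whether SO_KEEPALIVE is set on fd.
 * Returns 1 if enabled, 0 if disabled, -1 if getsockopt() fails.
 * Some systems return the raw flag bit rather than 1, hence "!= 0". */
int keepalive_enabled(int fd)
{
    int value = 0;
    socklen_t len = sizeof(value);

    if (getsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &value, &len) == -1)
        return -1;
    return value != 0;
}
```

Turning the option off on one test connection is one way to see whether the
periodic failures follow the keepalive setting, assuming nothing else on the
system forces keepalives on.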

To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message



read(2) and ETIMEDOUT

2001-06-07 Thread Graham Barr

A while ago our systems were upgraded from 4.2 to 4.3-RC, and at
this time we started seeing problems that I am having a difficult
time tracking down.

We have a server process which is connected to by many other
machines, each of them streams data in via tcp/ip. These connections
are pretty much permanent.

All had been running fine for a long time before the upgrade, but
now we have a problem with read(2) returning an error ETIMEDOUT,
which causes our code to close the connection.

The strange thing is that things are fine for a few hours, then
all of a sudden we see each of the connections fail with this error.
Then after the clients have reconnected, all is fine for a few
hours and then they all do it again.

The problem I am having in tracking this down is that man 2 read
does not specify ETIMEDOUT as an error that can be returned from
read(2) and man errno specifies that it would be returned from
connect(2) or send(2)

So, here is my question. Does anyone know under what circumstance
ETIMEDOUT may be returned from read(2) or is this a potential bug
somewhere ?


Thanks,
Graham.




Re: read(2) and ETIMEDOUT

2001-06-07 Thread Graham Barr

Thanks, I will try setting errno, but I don't think it is signals.
I have been running truss on the process. The relevant part is

gettimeofday(0xbfbffa54,0x0) = 0 (0x0)
select(0x50,0x93f8c90,0x0,0x0,0xbfbffa74)= 3 (0x3)
read(0x16,0xa2da000,0x8000)  ERR#60 'Operation timed out'

In fact there are no signals in the whole truss output.

Graham.

On Thu, Jun 07, 2001 at 09:53:54AM -0700, Matt Dillon wrote:
 :A while ago our systems were upgraded from 4.2 to 4.3-RC, and at
 :this time we started seeing problems that I am having a difficult
 :time tracking down.
 :
 :We have a server process which is connected to by many other
 :machines, each of them streams data in via tcp/ip. These connections
 :are pretty much permanent.
 :
 :All had been running fine for a long time before the upgrade, but
 :now we have a problem with read(2) returning an error ETIMEDOUT,
 :which causes our code to close the connection.
 :
 :The strange thing is that things are fine for a few hours, then
 :all of a sudden we see each of the connections fail with this error.
 :Then after the clients have reconnected, all is fine for a few
 :hours and then they all do it again.
 :
 :The problem I am having in tracking this down is that man 2 read
 :does not specify ETIMEDOUT as an error that can be returned from
 :read(2) and man errno specifies that it would be returned from
 :connect(2) or send(2)
 :
 :So, here is my question. Does anyone know under what circumstance
 :ETIMEDOUT may be returned from read(2) or is this a potential bug
 :somewhere ?
 :
 :Thanks,
 :Graham.
 
 This seems very odd.  I recommend setting errno to 0 prior to calling
 read() to make sure that it is actually read() that is setting the
 errno.  You should also sift through your code and look closely at any
 signal handlers you might have - system calls made from inside a signal
 handler can rip errno right out from under the code the signal 
 interrupted.
 
   -Matt
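
A minimal sketch of the two suggestions above, assuming a plain POSIX
environment. The names on_signal() and checked_read() are hypothetical and
only illustrate the idea: clear errno before read() so a stale value cannot
be mistaken for read()'s own error, and have any signal handler that makes
system calls save and restore errno.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical handler: it makes a system call, so it saves errno on
 * entry and restores it on exit, leaving the interrupted code's value
 * untouched. write() is async-signal-safe. */
void on_signal(int signo)
{
    int saved_errno = errno;

    (void)signo;
    (void)write(STDERR_FILENO, "signal\n", 7);
    errno = saved_errno;
}

/* Hypothetical wrapper: clear errno, call read(), and report the error
 * through *err only when read() itself failed (0 otherwise). */
ssize_t checked_read(int fd, void *buf, size_t len, int *err)
{
    ssize_t n;

    errno = 0;
    n = read(fd, buf, len);
    *err = (n == -1) ? errno : 0;
    return n;
}
```

If checked_read() still reports ETIMEDOUT with a handler written this way,
the error is coming from read() and not from a clobbered errno.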




Re: read(2) and ETIMEDOUT

2001-06-07 Thread Graham Barr

On Thu, Jun 07, 2001 at 10:33:50AM -0700, Matt Dillon wrote:
 
 :
 :Thanks, I will try setting errno, but I don't think it is signals.
 :I have been running truss on the process. The relevant part is
 :
 :gettimeofday(0xbfbffa54,0x0) = 0 (0x0)
 :select(0x50,0x93f8c90,0x0,0x0,0xbfbffa74)= 3 (0x3)
 :read(0x16,0xa2da000,0x8000)  ERR#60 'Operation timed out'
 :
 :In fact there are no signals in the whole truss output
 :
 :Graham.
 
 What type of descriptor is the read being performed on?  A TCP
 connection or, say, reading a file over NFS?

It is a TCP/IP connection.

Graham.




Re: read(2) and ETIMEDOUT

2001-06-07 Thread Graham Barr

On Thu, Jun 07, 2001 at 03:09:17PM -0400, Alfred Perlstein wrote:
 * Graham Barr [EMAIL PROTECTED] [010607 12:17] wrote:
 
 Since people seem to be helping you in other ways, I'll just
 answer this one:
 
  So, here is my question. Does anyone know under what circumstance
  ETIMEDOUT may be returned from read(2) or is this a potential bug
  somewhere ?
 
 I'm quite sure ETIMEDOUT is a result of hitting the setsockopt
 SO_RCVTIMEO value when doing a read.

I had been thinking along those lines too. But immediately before calling
read, select said there was data to read, so it should not block; it should
read what data is there and return.

Also, why does this happen only every few hours? There is a lot of
data going through these connections; maybe the timer for SO_RCVTIMEO
is not being reset.

But then we have another server, with a similar number of clients and
data throughput, but it does not suffer from this problem.

As you can probably tell, we have been tearing our hair out over this one.

Graham.
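
One way to test the SO_RCVTIMEO theory is to set a short receive timeout on
an idle socket and look at what read() reports. The sketch below does that;
errno_after_rcvtimeo() is a hypothetical helper written for this note. As
far as I know, an expired receive timeout is reported as EWOULDBLOCK (the
same value as EAGAIN on many systems) and not as ETIMEDOUT, which would
point away from SO_RCVTIMEO as the cause here.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: set a receive timeout of timeout_usec
 * microseconds on fd, attempt one read(), and return the errno that
 * the failing call produced, or 0 if read() succeeded. */
int errno_after_rcvtimeo(int fd, long timeout_usec)
{
    struct timeval tv;
    char buf[16];

    tv.tv_sec = timeout_usec / 1000000;
    tv.tv_usec = timeout_usec % 1000000;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) == -1)
        return errno;

    errno = 0;
    if (read(fd, buf, sizeof(buf)) == -1)
        return errno;
    return 0;
}
```

Calling it on one end of a socketpair() with nothing written to the other
end makes read() wait for the timeout and then fail, so the returned value
shows which error the timeout maps to on the system under test.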




Re: read(2) and ETIMEDOUT

2001-06-07 Thread Graham Barr

While this does sound very plausible, ...

The server does not do any writes, data only travels from the clients
to the server.

The clients and the server are connected to the same switch.

The other server which is similar is on the same network and
is connected to by the same machines as clients, yet it
does not see any problems.

But thanks for the insight. I will place a sniffer on the port
and see if there are excessive retransmits.

Graham.

On Thu, Jun 07, 2001 at 09:16:19PM +0100, Ian Dowse wrote:
 In message [EMAIL PROTECTED], Graham Barr writes:
 
 Also, why does this happen only every few hours? There is a lot of
 data going through these connections; maybe the timer for SO_RCVTIMEO
 is not being reset.
 
 But then we have another server, with a similar number of clients and
 data throughput, but it does not suffer from this problem.
 
 I suspect that the server seeing this problem has a client that
 occasionally disappears from the network, or for whatever reason
 fails to respond to any packets for a long time (something like 5
 or 10 minutes). I've seen blocking TCP writes return ETIMEDOUT when
 the network between the client and the server goes down. In the
 non-blocking case I think the following can happen:
 
   1) Client is connected to server.
   2) Network goes down, or client is turned off
   3) Server performs non-blocking write() on socket
   4) Server uses poll/select/kevent waiting for data from socket
   5) The write operation times out because no acknowledgements
  have been received. This occurs after TCP_MAXRXTSHIFT
  retransmits, so->so_error is set to ETIMEDOUT and the
  connection is shut down (I haven't read the code very
  carefully, so the details could be wrong).
   6) select/poll/kevent notes the EOF condition, and says that
  the descriptor is ready to read.
   7) read() returns the real error, which is ETIMEDOUT.
 
 I guess this should possibly be documented in read(2), but in
 practice there are numerous network errors that can be returned
 from read(). Normal practice in single-process servers is to
 consider any unknown errors from read(),write() etc as only
 fatal to that client rather than the whole server.
 
 Ian
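
The practice described in the last paragraph can be sketched as a small
classification step after each read(). This is only an illustration: the
enum, classify_read() and pending_socket_error() are hypothetical names,
and the policy (retry on EINTR/EAGAIN, drop the client on anything else) is
an assumption about what a single-process server would want, not code from
the server discussed here.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

enum read_action { READ_OK, READ_RETRY, READ_CLOSE_CLIENT };

/* Hypothetical policy: n is the return value of read() on one client
 * socket and err is the errno observed when n == -1. */
enum read_action classify_read(long n, int err)
{
    if (n > 0)
        return READ_OK;
    if (n == 0)
        return READ_CLOSE_CLIENT;   /* orderly EOF from the peer */

    switch (err) {
    case EINTR:
    case EAGAIN:
#if EWOULDBLOCK != EAGAIN
    case EWOULDBLOCK:
#endif
        return READ_RETRY;          /* transient, try again later */
    default:
        /* ETIMEDOUT, ECONNRESET and any unknown error: fatal to this
         * client only, never to the whole server. */
        return READ_CLOSE_CLIENT;
    }
}

/* Hypothetical helper: fetch (and clear) the error pending on a socket
 * via SO_ERROR. Returns 0 if none, or errno if getsockopt() fails. */
int pending_socket_error(int fd)
{
    int err = 0;
    socklen_t len = sizeof(err);

    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == -1)
        return errno;
    return err;
}
```

With this in place, a select() that reports a descriptor readable followed
by read() failing with ETIMEDOUT is handled like any other dead connection:
the server closes that one client and keeps serving the rest.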
