Re: read(2) and ETIMEDOUT

2001-06-10 Thread Valentin Nechayev

 Thu, Jun 07, 2001 at 20:18:46, gbarr (Graham Barr) wrote about Re: read(2) and ETIMEDOUT:

  I'm quite sure ETIMEDOUT is a result of hitting the setsockopt
  SO_RCVTIMEO value when doing a read.
 I had been thinking along those lines too. But immediately before calling
 read, select said there was data to read, So it should not block, but
 read what data is there and return.

This is a conceptual error on your part: select does _not_ say there is
data to read, it only says read() will not block. EOF (when read()
returns 0) and any case where read() returns -1 and sets errno also
count as "ready".

However, I don't think this misunderstanding is what produces the
ETIMEDOUTs.
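
A minimal sketch of that distinction, assuming a caller that wants to tell
the possible outcomes apart. The names poll_and_read and read_outcome are
made up for illustration and do not come from the thread. The point is that
a descriptor select reports as readable can still yield data, EOF, or an
error from read():

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical outcome codes, for illustration only. */
enum read_outcome { GOT_DATA, GOT_EOF, WOULD_BLOCK, GOT_ERROR, NOT_READY };

/*
 * Poll fd once with a zero timeout and, if select reports it readable,
 * call read().  "Readable" only means read() will not block: the result
 * may be data, EOF (0), or -1 with errno set (for example ETIMEDOUT).
 */
enum read_outcome poll_and_read(int fd, char *buf, size_t len, ssize_t *nread)
{
    fd_set rfds;
    struct timeval tv = { 0, 0 };
    ssize_t n;

    assert(buf != NULL && nread != NULL);
    assert(fd >= 0 && fd < FD_SETSIZE);

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    /* A select() failure is folded into NOT_READY to keep the sketch short. */
    if (select(fd + 1, &rfds, NULL, NULL, &tv) <= 0)
        return NOT_READY;

    n = read(fd, buf, len);
    *nread = n;
    if (n > 0)
        return GOT_DATA;
    if (n == 0)
        return GOT_EOF;
    if (errno == EINTR || errno == EAGAIN)
        return WOULD_BLOCK;
    return GOT_ERROR;               /* errno holds the real reason */
}
```

A caller would typically close the connection on GOT_EOF or GOT_ERROR and
log errno in the latter case.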

 Also why does this happen only every few hours ? There is a lot of
 data going through these connections maybe the timer for SO_RCVTIMEO
 is not being reset.

You should determine the exact cases where ETIMEDOUT occurs.

netch@iv:/usr/REL4/src/sys/netinet> fgrep ETIMEDOUT *.c
tcp_input.c:tcp_drop(sototcpcb(so2), ETIMEDOUT);
tcp_subr.c: if (errno == ETIMEDOUT && tp->t_softerror)
tcp_timer.c:tp = tcp_drop(tp, ETIMEDOUT);
tcp_timer.c:  tp = tcp_drop(tp, ETIMEDOUT);
tcp_timer.c:tp->t_softerror : ETIMEDOUT);

Add debug printf()s with __LINE__, __FILE__, and the variables the stack
uses when deciding to drop the connection. Collect statistics.
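
As a rough illustration of that kind of instrumentation, here is a
user-space sketch. The TRACE macro and trace_drop function are hypothetical
names, and the two logged values are only examples of what one might print.
Inside the kernel one would use the kernel's own printf() or log() rather
than fprintf():

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical trace macro: prefixes every message with the source file
 * and line of the call site.  Requires at least one argument after fmt.
 */
#define TRACE(fmt, ...) \
    fprintf(stderr, "%s:%d: " fmt "\n", __FILE__, __LINE__, __VA_ARGS__)

/*
 * Example call site: record the values that led to a drop decision.
 * Returns the number of characters written, or a negative value on error.
 */
int trace_drop(int so_error, int rxtshift)
{
    assert(so_error >= 0);          /* errno values are non-negative */
    return TRACE("dropping connection: so_error=%d rxtshift=%d",
                 so_error, rxtshift);
}
```

Placing one such call next to each tcp_drop(..., ETIMEDOUT) site would show
which path is actually being taken.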


/netch

To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message



Re: read(2) and ETIMEDOUT

2001-06-08 Thread Terry Lambert

Ian Dowse wrote:
 
 In message [EMAIL PROTECTED], Graham Barr writes:
 
 Also why does this happen only every few hours ? There is a lot of
 data going through these connections maybe the timer for SO_RCVTIMEO
 is not being reset.
 
 But then we have another server, with a similar number of clients and
 data through put, but it does not suffer from this problem.
 
 I suspect that the server seeing this problem has a client that
 occasionally disappears from the network, or for whatever reason
 fails to respond to any packets for a long time (something like 5
 or 10 minutes). I've seen blocking TCP writes return ETIMEDOUT when
 the network between the client and the server goes down. In the
 non-blocking case I think the following can happen:

I believe the proxy ARP normally sent on an interface
coming up can have this effect in the case a client goes
down, and someone else gets their DHCP lease.

You don't often see this on FreeBSD clients after 4.1,
since the gratuitous proxy ARP became broken around then
(if you change your IP address, it won't send the ARP
unless you down the interface first and bring it back up,
and it caches bad clone routes, too, just to make your
life miserable).

Probably your lease expiration times are set too low.  This
is usually the case in networks where people have transient
connections for things like mobile users, and have exhausted
their IP address space, and are trying to conserve it by
using much shorter leases.

A good, real fix for this is to have incredibly long lease
lifetimes (basically, the DHCP server hands out the lease,
and if the computer comes back days later, it gets the same
lease).  For this to work, you are probably going to have
to make the local DHCP server give out 10.x addresses, and
then NAT the 10.x net for real Internet connectivity.

Alternately, it could be something completely different.  8-).

-- Terry




Re: read(2) and ETIMEDOUT

2001-06-08 Thread Bernd Walter

On Thu, Jun 07, 2001 at 03:20:58PM -0700, Matt Dillon wrote:
 
 :
 :On Thu, Jun 07, 2001 at 10:33:50AM -0700, Matt Dillon wrote:
 : 
 : :
 : :Thanks, I will try setting errno, but I don't think it is signals.
 : :I have been running truss on the process. The relevant part is
 : :
 : :gettimeofday(0xbfbffa54,0x0) = 0 (0x0)
 : :select(0x50,0x93f8c90,0x0,0x0,0xbfbffa74)= 3 (0x3)
 : :read(0x16,0xa2da000,0x8000)  ERR#60 'Operation timed out'
 : :
 : :In fact there are no signals in the whole truss output
 : :
 : :Graham.
 : 
 : What type of descriptor is the read being performed on?  A TCP
 : connection or, say, a reading a file over NFS?  
 :
 :It is a TCP/IP connection.
 :
 :Graham.
 
 You can get this if the TCP connection times out, either through a
 keepalive timeout or the protocol hits the maximum number of transmit
 retries.  I'd have to delve into the cvs logs to see when it was added,
 but it seems reasonable.  You should treat it simply as an EIO or
 something like that.

Keepalives are a good point.
I know of OS/2 Systems that can't handle them and behave the way you
describe.
What system is on the other side?

-- 
B.Walter  COSMO-Project http://www.cosmo-project.de
[EMAIL PROTECTED] Usergroup   [EMAIL PROTECTED]
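
For reference, keepalive probing is a per-socket option. The sketch below
shows how a program could turn it on and check it. The function names
enable_keepalive and keepalive_enabled are made up for illustration, and
nothing in the thread says whether Graham's server actually sets this option:

```c
#include <assert.h>
#include <sys/socket.h>

/* Turn on TCP keepalive probes for fd.  Returns 0 on success, -1 on error. */
int enable_keepalive(int fd)
{
    int on = 1;

    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on);
}

/* Returns 1 if keepalive is enabled on fd, 0 if not, -1 on error. */
int keepalive_enabled(int fd)
{
    int val = 0;
    socklen_t len = sizeof val;

    if (getsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &val, &len) < 0)
        return -1;
    assert(len == sizeof val);
    return val != 0;
}
```

With keepalive enabled, a peer that stops answering probes eventually causes
the connection to be dropped, and the next read() reports the error.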





Re: read(2) and ETIMEDOUT

2001-06-08 Thread Graham Barr

On Fri, Jun 08, 2001 at 09:39:15PM +0200, Bernd Walter wrote:
 On Thu, Jun 07, 2001 at 03:20:58PM -0700, Matt Dillon wrote:
  
  :
  :On Thu, Jun 07, 2001 at 10:33:50AM -0700, Matt Dillon wrote:
  : 
  : :
  : :Thanks, I will try setting errno, but I don't think it is signals.
  : :I have been running truss on the process. The relevant part is
  : :
  : :gettimeofday(0xbfbffa54,0x0) = 0 (0x0)
  : :select(0x50,0x93f8c90,0x0,0x0,0xbfbffa74)= 3 (0x3)
  : :read(0x16,0xa2da000,0x8000)  ERR#60 'Operation timed out'
  : :
  : :In fact there are no signals in the whole truss output
  : :
  : :Graham.
  : 
  : What type of descriptor is the read being performed on?  A TCP
  : connection or, say, a reading a file over NFS?  
  :
  :It is a TCP/IP connection.
  :
  :Graham.
  
  You can get this if the TCP connection times out, either through a
  keepalive timeout or the protocol hits the maximum number of transmit
  retries.  I'd have to delve into the cvs logs to see when it was added,
  but it seems reasonable.  You should treat it simply as an EIO or
  something like that.
 
 Keepalives are a good point.
 I know of OS/2 Systems that can't handle them and behave the way you
 describe.
 What system is on the other side?

All the systems are exactly the same.

Graham.




read(2) and ETIMEDOUT

2001-06-07 Thread Graham Barr

A while ago our systems were upgraded from 4.2 to 4.3-RC, and at
this time we started seeing problems that I am having a difficult
time tracking down.

We have a server process which is connected to by many other
machines, each of them streams data in via tcp/ip. These connections
are pretty much permanent.

All had been running fine for a long time before the upgrade, but
now we have a problem with read(2) returning an error ETIMEDOUT,
which causes our code to close the connection.

The strange thing is that things are fine for a few hours, then
all of a sudden we see each of the connections fail with this error.
Then after the clients have reconnected, all is fine for a few
hours and then they all do it again.

The problem I am having in tracking this down is that man 2 read
does not specify ETIMEDOUT as an error that can be returned from
read(2) and man errno specifies that it would be returned from
connect(2) or send(2)

So, here is my question. Does anyone know under what circumstance
ETIMEDOUT may be returned from read(2) or is this a potential bug
somewhere ?


Thanks,
Graham.




Re: read(2) and ETIMEDOUT

2001-06-07 Thread Matt Dillon

:A while ago our systems were upgraded from 4.2 to 4.3-RC, and at
:this time we started seeing problems that I am having a difficult
:time tracking down.
:
:We have a server process which is connected to by many other
:machines, each of them streams data in via tcp/ip. These connections
:are pretty much permanent.
:
:All had been running fine for a long time before the upgrade, but
:now we have a problem with read(2) returning an error ETIMEDOUT,
:which causes our code to close the connection.
:
:The strange thing is that things are fine for a few hours, then
:all of a sudden we see each of the connections fail with this error.
:Then after the clients have reconnected, all is fine for a few
:hours and then they all do it again.
:
:The problem I am having in tracking this down is that man 2 read
:does not specify ETIMEDOUT as an error that can be returned from
:read(2) and man errno specifies that it would be returned from
:connect(2) or send(2)
:
:So, here is my question. Does anyone know under what circumstance
:ETIMEDOUT may be returned from read(2) or is this a potential bug
:somewhere ?
:
:Thanks,
:Graham.

This seems very odd.  I recommend setting errno to 0 prior to calling
read() to make sure that it is actually read() that is setting the
errno.  You should also sift through your code and look closely at any
signal handlers you might have - system calls made from inside a signal
handler can rip errno right out from under the code the signal 
interrupted.

-Matt
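
Both suggestions can be sketched in a few lines. The names on_signal and
checked_read are hypothetical. The handler saves and restores errno so that
nothing it calls can disturb the interrupted code, and the wrapper clears
errno before read() so that a stale value cannot be mistaken for a fresh one:

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal;

/* Signal handler that leaves errno exactly as it found it. */
void on_signal(int signo)
{
    int saved_errno = errno;

    (void)signo;
    got_signal = 1;
    /* Any async-signal-safe system call made here could change errno. */
    errno = saved_errno;
}

/*
 * read() wrapper: clears errno first, then captures it immediately after
 * the call.  *err is 0 unless read() itself failed.
 */
ssize_t checked_read(int fd, void *buf, size_t len, int *err)
{
    ssize_t n;

    assert(err != NULL);
    errno = 0;
    n = read(fd, buf, len);
    *err = (n < 0) ? errno : 0;
    return n;
}
```

If checked_read still reports ETIMEDOUT, the value really did come from
read() and not from an earlier call or a handler.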




Re: read(2) and ETIMEDOUT

2001-06-07 Thread Graham Barr

Thanks, I will try setting errno, but I don't think it is signals.
I have been running truss on the process. The relevant part is:

gettimeofday(0xbfbffa54,0x0) = 0 (0x0)
select(0x50,0x93f8c90,0x0,0x0,0xbfbffa74) = 3 (0x3)
read(0x16,0xa2da000,0x8000)  ERR#60 'Operation timed out'

In fact, there are no signals in the whole truss output.

Graham.





Re: read(2) and ETIMEDOUT

2001-06-07 Thread Matt Dillon


:
:Thanks, I will try setting errno, but I don't think it is signals.
:I have been running truss on the process. The relevant part is
:
:gettimeofday(0xbfbffa54,0x0) = 0 (0x0)
:select(0x50,0x93f8c90,0x0,0x0,0xbfbffa74)= 3 (0x3)
:read(0x16,0xa2da000,0x8000)  ERR#60 'Operation timed out'
:
:In fact there are no signals in the whole truss output
:
:Graham.

What type of descriptor is the read being performed on?  A TCP
connection or, say, a file being read over NFS?

-Matt




Re: read(2) and ETIMEDOUT

2001-06-07 Thread Graham Barr

On Thu, Jun 07, 2001 at 10:33:50AM -0700, Matt Dillon wrote:
 
 :
 :Thanks, I will try setting errno, but I don't think it is signals.
 :I have been running truss on the process. The relevant part is
 :
 :gettimeofday(0xbfbffa54,0x0) = 0 (0x0)
 :select(0x50,0x93f8c90,0x0,0x0,0xbfbffa74)= 3 (0x3)
 :read(0x16,0xa2da000,0x8000)  ERR#60 'Operation timed out'
 :
 :In fact there are no signals in the whole truss output
 :
 :Graham.
 
 What type of descriptor is the read being performed on?  A TCP
 connection or, say, a reading a file over NFS?  

It is a TCP/IP connection.

Graham.




Re: read(2) and ETIMEDOUT

2001-06-07 Thread rick norman


I've seen this behavior in the past.  My impression is that it is load related.
If you do a grep on ETIMEDOUT in /usr/src/sys/netinet, you will see where
the tcp stack may return this message.  There may be some sysctl params relating
to timers that you can muck with.

Rick






Re: read(2) and ETIMEDOUT

2001-06-07 Thread Alfred Perlstein

* Graham Barr [EMAIL PROTECTED] [010607 12:17] wrote:

Since people seem to be helping you in other ways, I'll just
answer this one:

 So, here is my question. Does anyone know under what circumstance
 ETIMEDOUT may be returned from read(2) or is this a potential bug
 somewhere ?

I'm quite sure ETIMEDOUT is a result of hitting the setsockopt
SO_RCVTIMEO value when doing a read.

-- 
-Alfred Perlstein [[EMAIL PROTECTED]]
Instead of asking why a piece of software is using 1970s technology,
start asking why software is ignoring 30 years of accumulated wisdom.
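
A small sketch of the option under discussion, with hypothetical helper
names set_recv_timeout and errno_after_idle_read. Note that setsockopt(2)
and POSIX document an expired receive timeout as returning
EWOULDBLOCK/EAGAIN, not ETIMEDOUT, so the sketch can be used to check what
a given system actually reports:

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Set a receive timeout on fd.  Returns 0 on success, -1 on error. */
int set_recv_timeout(int fd, long sec, long usec)
{
    struct timeval tv;

    assert(sec >= 0 && usec >= 0 && usec < 1000000);
    tv.tv_sec = sec;
    tv.tv_usec = usec;
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);
}

/*
 * Try to read one byte and report the errno of a failed read (0 if the
 * read succeeded or hit EOF).  On an idle socket with SO_RCVTIMEO set,
 * this shows which error the timeout produces.
 */
int errno_after_idle_read(int fd)
{
    char c;
    ssize_t n = read(fd, &c, 1);

    return (n < 0) ? errno : 0;
}
```

Running this against an idle connected socket is a quick way to confirm or
rule out SO_RCVTIMEO as the source of a particular errno value.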




Re: read(2) and ETIMEDOUT

2001-06-07 Thread Graham Barr

On Thu, Jun 07, 2001 at 03:09:17PM -0400, Alfred Perlstein wrote:
 * Graham Barr [EMAIL PROTECTED] [010607 12:17] wrote:
 
 Since people seem to be helping you in other ways, I'll just
 answer this one:
 
  So, here is my question. Does anyone know under what circumstance
  ETIMEDOUT may be returned from read(2) or is this a potential bug
  somewhere ?
 
 I'm quite sure ETIMEDOUT is a result of hitting the setsockopt
 SO_RCVTIMEO value when doing a read.

I had been thinking along those lines too. But immediately before calling
read, select said there was data to read, so it should not block, but
read what data is there and return.

Also, why does this happen only every few hours? There is a lot of
data going through these connections; maybe the timer for SO_RCVTIMEO
is not being reset.

But then we have another server, with a similar number of clients and
data throughput, but it does not suffer from this problem.

As you can probably tell, we have been tearing our hair out over this one.

Graham.




Re: read(2) and ETIMEDOUT

2001-06-07 Thread Ian Dowse

In message [EMAIL PROTECTED], Graham Barr writes:

Also why does this happen only every few hours ? There is a lot of
data going through these connections maybe the timer for SO_RCVTIMEO
is not being reset.

But then we have another server, with a similar number of clients and
data through put, but it does not suffer from this problem.

I suspect that the server seeing this problem has a client that
occasionally disappears from the network, or for whatever reason
fails to respond to any packets for a long time (something like 5
or 10 minutes). I've seen blocking TCP writes return ETIMEDOUT when
the network between the client and the server goes down. In the
non-blocking case I think the following can happen:

1) Client is connected to server.
2) Network goes down, or client is turned off
3) Server performs non-blocking write() on socket
4) Server uses poll/select/kevent waiting for data from socket
5) The write operation times out because no acknowledgements
   have been received. This occurs after TCP_MAXRXTSHIFT
   retransmits, so->so_error is set to ETIMEDOUT and the
   connection is shut down (I haven't read the code very
   carefully, so the details could be wrong).
6) select/poll/kevent notes the EOF condition, and says that
   the descriptor is ready to read.
7) read() returns the real error, which is ETIMEDOUT.

I guess this should possibly be documented in read(2), but in
practice there are numerous network errors that can be returned
from read(). Normal practice in single-process servers is to
consider any unknown errors from read(),write() etc as only
fatal to that client rather than the whole server.

Ian
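
The pending error described in steps 5 to 7 can also be fetched directly
with the SO_ERROR socket option instead of waiting for read() to report it.
This is a minimal sketch. The name pending_socket_error is made up for
illustration:

```c
#include <assert.h>
#include <sys/socket.h>

/*
 * Return the error currently pending on the socket (0 if none), or -1 if
 * getsockopt() itself failed.  Reading SO_ERROR also clears the pending
 * error, so call this once and keep the result.
 */
int pending_socket_error(int fd)
{
    int err = 0;
    socklen_t len = sizeof err;

    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return -1;
    assert(len == sizeof err);
    return err;
}
```

Calling this right after select/poll/kevent reports the descriptor ready
would return ETIMEDOUT in the scenario above, assuming the stack behaves as
described.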




Re: read(2) and ETIMEDOUT

2001-06-07 Thread Graham Barr

While this does sound very plausible,...

The server does not do any writes; data only travels from the clients
to the server.

The clients and the server are connected to the same switch.

The other server, which is similar, is on the same network and
is connected to by the same machines as clients, yet it
does not see any problems.

But thanks for the insight. I will place a sniffer on the port
and see if there are excessive retransmits.

Graham.





Re: read(2) and ETIMEDOUT

2001-06-07 Thread Matt Dillon


:
:On Thu, Jun 07, 2001 at 10:33:50AM -0700, Matt Dillon wrote:
: 
: :
: :Thanks, I will try setting errno, but I don't think it is signals.
: :I have been running truss on the process. The relevant part is
: :
: :gettimeofday(0xbfbffa54,0x0) = 0 (0x0)
: :select(0x50,0x93f8c90,0x0,0x0,0xbfbffa74)= 3 (0x3)
: :read(0x16,0xa2da000,0x8000)  ERR#60 'Operation timed out'
: :
: :In fact there are no signals in the whole truss output
: :
: :Graham.
: 
: What type of descriptor is the read being performed on?  A TCP
: connection or, say, a reading a file over NFS?  
:
:It is a TCP/IP connection.
:
:Graham.

You can get this if the TCP connection times out, either through a
keepalive timeout or the protocol hits the maximum number of transmit
retries.  I'd have to delve into the cvs logs to see when it was added,
but it seems reasonable.  You should treat it simply as an EIO or
something like that.

Generally speaking you should handle return codes from system calls by
handling the codes you know about and simply assuming that anything else
is fatal to the particular connection.

    if (systemcall(...) < 0) {
        switch(errno) {
        case EINTR:
        case EAGAIN:
            ... deal with non-blocking situations ...
            .
            .
            .
        default:
            ... assume everything else is a fatal error on the socket ...
            ... close the descriptor and cleanup its state ...
        }
    }

This gives you the maximum portability between platforms and between
releases.

-Matt
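
A compilable variant of that pattern, as a sketch only. The io_action enum
and the after_read function are hypothetical names. It maps the result of a
read() to one of three actions, treating every errno it does not recognise
(ETIMEDOUT included) as fatal to that one connection:

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical action codes for the caller's event loop. */
enum io_action { IO_OK, IO_RETRY, IO_CLOSE };

/*
 * Decide what to do after read() returned n with errno value err.
 * Known transient errors mean "try again later"; EOF and any unknown
 * error mean "close this descriptor and clean up its state".
 */
enum io_action after_read(ssize_t n, int err)
{
    assert(n >= -1);                /* read() returns -1 on error */

    if (n > 0)
        return IO_OK;
    if (n == 0)
        return IO_CLOSE;            /* EOF */

    switch (err) {
    case EINTR:
    case EAGAIN:
#if defined(EWOULDBLOCK) && EWOULDBLOCK != EAGAIN
    case EWOULDBLOCK:
#endif
        return IO_RETRY;
    default:
        return IO_CLOSE;            /* fatal to this connection only */
    }
}
```

Because the default branch never inspects the specific code, an error that
read(2) does not document, such as ETIMEDOUT here, is handled without any
change to the caller.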

