Geez, cut and paste errors in the original post. Fixed here, I
sincerely hope.
On Jun 25, 2009, at 4:03 PM, Rush Manbert wrote:
IN order to test all of the Asio based code I have written, I have
modified StressTest in various ways. One is to allow an --endless
command line argument that causes the part of the test that runs the
clients to be looped forever. This lets me start the server, then
hammer on it overnight.
Once I had this capability, I discovered that if I ran with the
parameters --clients=20 --loop=10 I could cause the test to fail
within a short amount of time. The failure mode is that a client
task catches a Transport exception when it calls one of the
loopEchoXxx() methods. The exception says that ECONNRESET was seen
on a read.
I tested with various client and server configurations:
Client Server Notes Test Result
------ ------ ----------------- ------------
Mac Win XP No failures
WinXP Mac No failures
WinXp WinXP Loopback No failures
<snip>
Mac Mac Loopback Fails
Mac Mac Non-loopback Fails
i.e. only the Mac to Mac tets fail.
The code running on Windows XP is obviously Asio based, but these
test results occur regardless of whether the code running on the Mac
is Asio based (my new stuff) or the original code.
I did a lot of searching on the internet and one page (http://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable
) made me start wondering about the linger settings for the sockets.
The TSocket constructor used by clients initializes the linger
variables like this:
lingerOn_(1),
lingerVal_(0),
The server socket always has linger turned off.
I used tcpdump to capture the packets transmitted when a failure
occurs. For this test I used --clients=2 --loop=1. Here is the trace
for the last 2 clients. I have broken each line of the dump
immediately after the "Src>Dst:" portion, so the flags show up in
column 2 of the next line:
13:59:11.372267 IP localhost.53127 > localhost.xmltec-xmlmail:
S 2349780390:2349780390(0) win 65535 <mss 16344,nop,wscale
1,nop,nop,timestamp 824548119 0,sackOK,eol>
e...@a.@....@.............#..............4....?........
1%..........
13:59:11.372270 IP localhost.53128 > localhost.xmltec-xmlmail:
S 2191687293:2191687293(0) win 65535 <mss 16344,nop,wscale
1,nop,nop,timestamp 824548119 0,sackOK,eol>
e...@.'@....@.............#...~}.........4....?........
1%..........
13:59:11.372304 IP localhost.xmltec-xmlmail > localhost.53127:
S 2047182235:2047182235(0) ack 2349780391 win 65535 <mss
16344,nop,wscale 2,nop,nop,timestamp 824548119 824548119,sackOK,eol>
e...@.}@....@...........#...z............4....?........
1%..1%......
13:59:11.372336 IP localhost.xmltec-xmlmail > localhost.53128:
S 994342982:994342982(0) ack 2191687294 win 65535 <mss
16344,nop,wscale 2,nop,nop,timestamp 824548119 824548119,sackOK,eol>
e...@m.@....@...........#...;DxF..~~.....4....?........
1%..1%......
13:59:11.372345 IP localhost.53127 > localhost.xmltec-xmlmail:
. ack 1 win 40830 <nop,nop,timestamp 824548119 824548119>
e.....@.@.............#.....z......~.(.....
1%..1%..
13:59:11.372353 IP localhost.53128 > localhost.xmltec-xmlmail:
. ack 1 win 40830 <nop,nop,timestamp 824548119 824548119>
e.....@.@.............#...~~;DxG...~.(.....
1%..1%..
13:59:11.372372 IP localhost.xmltec-xmlmail > localhost.53127:
R 2047182236:2047182236(0) win 0
<<<<<<<<---------------------------------LOOK!!!!
E..(<-...@.@...........#...z.......P.......
13:59:11.372377 IP localhost.53127 > localhost.xmltec-xmlmail:
P 1:32(31) ack 1 win 40830 <nop,nop,timestamp 824548119 824548119>
E..S.)@....@.............#.....z......~.g.....
1%..1%..........echoI64....
...........
13:59:11.372385 IP localhost.53128 > localhost.xmltec-xmlmail:
P 1:32(31) ack 1 win 40830 <nop,nop,timestamp 824548119 824548119>
e.....@.@.............#...~~;DxG...~.G.....
1%..1%..........echoI64....
...........
13:59:11.373086 IP localhost.53128 > localhost.xmltec-xmlmail:
R 32:32(0) ack 1 win 40830
E..(i...@.@.............#...~.;DxGP..~....
The line marked with "LOOK!!!!" is where the failure occurs. The
client and server have made their socket connections, but the server
sends a RST packet before the client sends his first data packet.
I'm using the TSimpleServer, and it almost looks as if the server
tried to read before the client wrote any data, and when there was
nothing there the server immediately closed the socket. But that
would only explain this trace if the server had written data to the
socket that had not yet been read by the client and if the server
had linger on with a timeout of zero.
However, I did these experiments in a different order, and because
of that I tried modifying the TSocket constructor so that the linger
variables were initialized like this:
lingerOn_(0),
lingerVal_(0),
which turns linger off for the client socket. After I made this
change, the loopback test on the Mac just runs and runs and runs
with no errors. I also tried having linger on, but with a 5 second
timeout and that also runs without error.
I'm really at a loss to explain this. The tcpdump output makes it
pretty clear that the server immediately sent RST. But if I just
print the port_ value in TSocket::close() and run until a failure
occurs, there is no evidence that the server side closed its socket.
And if I turn off linger on the client side everything works like a
charm and I see no failures. How can the client side settings affect
the server side behavior?
I'm hoping that one of you POSIX sockets guys will have some idea of
what's happening here. It looks like we can sidestep the problem by
setting linger off, or setting it on with some nonzero time if we're
running a client on a Mac, but I would rather understand what's
going on and fix it if possible, rather than hack around it.
- Rush