In order to test all of the Asio-based code I have written, I have
modified StressTest in various ways. One change adds an --endless
command-line argument that causes the part of the test that runs the
clients to loop forever. This lets me start the server and then
hammer on it overnight.
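For reference, here is a minimal sketch of the shape of that change;
the argument parsing and the runClients() helper are hypothetical
stand-ins, not the actual StressTest code:

#include <cstring>
#include <iostream>

// Hypothetical stand-in for the existing phase of the test that
// runs the clients against the server.
void runClients(int clients, int loops) {
  std::cout << "running " << clients << " clients, "
            << loops << " loops each\n";
}

int main(int argc, char** argv) {
  bool endless = false;
  for (int i = 1; i < argc; ++i) {
    if (std::strcmp(argv[i], "--endless") == 0)
      endless = true;  // repeat the client phase until the process is killed
  }
  do {
    runClients(20, 10);
  } while (endless);
  return 0;
}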
Once I had this capability, I discovered that running with
--clients=20 --loop=10 causes the test to fail within a short amount
of time. The failure mode is that a client task catches a Transport
exception when it calls one of the loopEchoXxx() methods; the
exception reports that ECONNRESET was seen on a read.
I tested with various client and server configurations:
Client   Server   Notes          Test Result
------   ------   ------------   -----------
Mac      WinXP                   No failures
WinXP    Mac                     No failures
WinXP    WinXP    Loopback       No failures
Mac      Mac      Loopback       No failures
Mac      Mac      Non-loopback   No failures
The code running on Windows XP is obviously Asio-based, but these test
results occur regardless of whether the code running on the Mac is
Asio-based (my new code) or the original code.
I did a lot of searching on the internet, and one page
(http://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable)
made me start wondering about the linger settings for the sockets.
The TSocket constructor used by clients initializes the linger
variables like this:
lingerOn_(1),   // SO_LINGER enabled
lingerVal_(0),  // with a linger timeout of zero seconds
The server socket always has linger turned off.
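For context, here is roughly what those members translate to at the
socket level (a sketch of standard SO_LINGER usage, not Thrift's exact
code). Linger on with a timeout of zero tells close() to discard any
unsent data and send RST instead of a normal FIN:

#include <sys/socket.h>

void applyLinger(int fd, int lingerOn, int lingerVal) {
  struct linger l;
  l.l_onoff  = lingerOn;   // nonzero = linger enabled on close()
  l.l_linger = lingerVal;  // timeout in seconds; 0 = abortive close (RST)
  setsockopt(fd, SOL_SOCKET, SO_LINGER, &l, sizeof(l));
}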
I used tcpdump to capture the packets transmitted when a failure
occurs. For this test I used --clients=2 --loop=1. Here is the trace
for the last 2 clients. I have broken each line of the dump
immediately after the "Src>Dst:" portion, so the flags show up in
column 2 of the next line:
13:59:11.372267 IP localhost.53127 > localhost.xmltec-xmlmail:
S 2349780390:2349780390(0) win 65535 <mss 16344,nop,wscale
1,nop,nop,timestamp 824548119 0,sackOK,eol>
e...@a.@....@.............#..............4....?........
1%..........
13:59:11.372270 IP localhost.53128 > localhost.xmltec-xmlmail:
S 2191687293:2191687293(0) win 65535 <mss 16344,nop,wscale
1,nop,nop,timestamp 824548119 0,sackOK,eol>
e...@.'@....@.............#...~}.........4....?........
1%..........
13:59:11.372304 IP localhost.xmltec-xmlmail > localhost.53127:
S 2047182235:2047182235(0) ack 2349780391 win 65535 <mss
16344,nop,wscale 2,nop,nop,timestamp 824548119 824548119,sackOK,eol>
e...@.}@....@...........#...z............4....?........
1%..1%......
13:59:11.372336 IP localhost.xmltec-xmlmail > localhost.53128:
S 994342982:994342982(0) ack 2191687294 win 65535 <mss
16344,nop,wscale 2,nop,nop,timestamp 824548119 824548119,sackOK,eol>
e...@m.@....@...........#...;DxF..~~.....4....?........
1%..1%......
13:59:11.372345 IP localhost.53127 > localhost.xmltec-xmlmail:
. ack 1 win 40830 <nop,nop,timestamp 824548119 824548119>
e.....@.@.............#.....z......~.(.....
1%..1%..
13:59:11.372353 IP localhost.53128 > localhost.xmltec-xmlmail:
. ack 1 win 40830 <nop,nop,timestamp 824548119 824548119>
e.....@.@.............#...~~;DxG...~.(.....
1%..1%..
13:59:11.372372 IP localhost.xmltec-xmlmail > localhost.53127:
R 2047182236:2047182236(0) win 0
<<<<<<<<---------------------------------LOOK!!!!
E..(<-...@.@...........#...z.......P.......
13:59:11.372377 IP localhost.53127 > localhost.xmltec-xmlmail:
P 1:32(31) ack 1 win 40830 <nop,nop,timestamp 824548119 824548119>
E..S.)@....@.............#.....z......~.g.....
1%..1%..........echoI64....
...........
13:59:11.372385 IP localhost.53128 > localhost.xmltec-xmlmail:
P 1:32(31) ack 1 win 40830 <nop,nop,timestamp 824548119 824548119>
e.....@.@.............#...~~;DxG...~.G.....
1%..1%..........echoI64....
...........
13:59:11.373086 IP localhost.53128 > localhost.xmltec-xmlmail:
R 32:32(0) ack 1 win 40830
E..(i...@.@.............#...~.;DxGP..~....
The line marked with "LOOK!!!!" is where the failure occurs. The
client and server have completed their socket connections, but the
server sends an RST packet before the client sends its first data
packet. I'm using TSimpleServer, and it almost looks as if the server
tried to read before the client wrote any data, and when nothing was
there the server immediately closed the socket. But that would only
explain this trace if the server had written data to the socket that
the client had not yet read, and if the server had linger on with a
timeout of zero.
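The linger-zero RST mechanism itself can be demonstrated in isolation.
The following self-contained loopback program (my own sketch, assuming
POSIX sockets; error checking omitted for brevity) closes the accepted
socket with linger on and a zero timeout, and the peer's next read
fails with ECONNRESET:

#include <arpa/inet.h>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
  // Loopback listener on an ephemeral port.
  int srv = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = 0;  // let the kernel pick a port
  bind(srv, (sockaddr*)&addr, sizeof(addr));
  listen(srv, 1);
  socklen_t len = sizeof(addr);
  getsockname(srv, (sockaddr*)&addr, &len);

  int cli = socket(AF_INET, SOCK_STREAM, 0);
  connect(cli, (sockaddr*)&addr, sizeof(addr));
  int conn = accept(srv, nullptr, nullptr);

  // Abortive close: linger on, zero timeout -> RST instead of FIN.
  struct linger l = {1, 0};
  setsockopt(conn, SOL_SOCKET, SO_LINGER, &l, sizeof(l));
  close(conn);

  char buf[16];
  ssize_t n = read(cli, buf, sizeof(buf));
  if (n < 0 && errno == ECONNRESET)
    printf("read failed with ECONNRESET, as expected\n");
  else
    printf("read returned %zd (%s)\n", n, strerror(errno));

  close(cli);
  close(srv);
  return 0;
}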
However, I did these experiments in a different order, and because of
that I tried modifying the TSocket constructor so that the linger
variables were initialized like this:
lingerOn_(0),   // SO_LINGER disabled
lingerVal_(0),
which turns linger off for the client socket. After I made this
change, the loopback test on the Mac just runs and runs and runs with
no errors. I also tried having linger on, but with a 5-second timeout,
and that also runs without error.
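To summarize the three client-side settings tried, in SO_LINGER terms
(my own summary with made-up variable names, not code from TSocket):

#include <sys/socket.h>

struct linger original = {1, 0};  // linger on, zero timeout: RST failures under load
struct linger disabled = {0, 0};  // linger off: no failures
struct linger graceful = {1, 5};  // linger on, 5-second timeout: no failures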
I'm really at a loss to explain this. The tcpdump output makes it
pretty clear that the server immediately sent the RST. But if I print
the port_ value in TSocket::close() and run until a failure occurs,
there is no evidence that the server side closed its socket. And if I
turn linger off on the client side, everything works like a charm and
I see no failures. How can client-side settings affect server-side
behavior?
I'm hoping that one of you POSIX sockets guys will have some idea of
what's happening here. It looks like we can sidestep the problem by
turning linger off, or turning it on with some nonzero timeout, when
we're running a client on a Mac, but I would rather understand what's
going on and fix it properly than hack around it.
- Rush