IN order to test all of the Asio based code I have written, I have modified StressTest in various ways. One is to allow an --endless command line argument that causes the part of the test that runs the clients to be looped forever. This lets me start the server, then hammer on it overnight.

Once I had this capability, I discovered that if I ran with the parameters --clients=20 --loop=10 I could cause the test to fail within a short amount of time. The failure mode is that a client task catches a Transport exception when it calls one of the loopEchoXxx() methods. The exception says that ECONNRESET was seen on a read.

I tested with various client and server configurations:
Client  Server  Notes             Test Result
------  ------  ----------------- ------------
Mac     Win XP                    No failures
WinXP   Mac                       No failures
WinXp   WinXP   Loopback          No failures
Mac     Mac     Loopback          No failures
Mac     Mac     Non-loopback      No failures

The code running on Windows XP is obviously Asio based, but these test results occur regardless of whether the code running on the Mac is Asio based (my new stuff) or the original code.

I did a lot of searching on the internet and one page (http://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable ) made me start wondering about the linger settings for the sockets. The TSocket constructor used by clients initializes the linger variables like this:
  lingerOn_(1),
  lingerVal_(0),

The server socket always has linger turned off.

I used tcpdump to capture the packets transmitted when a failure occurs. For this test I used --clients=2 --loop=1. Here is the trace for the last 2 clients. I have broken each line of the dump immediately after the "Src>Dst:" portion, so the flags show up in column 2 of the next line:
13:59:11.372267 IP localhost.53127 > localhost.xmltec-xmlmail:
S 2349780390:2349780390(0) win 65535 <mss 16344,nop,wscale 1,nop,nop,timestamp 824548119 0,sackOK,eol>
e...@a.@....@.............#..............4....?........
1%..........
13:59:11.372270 IP localhost.53128 > localhost.xmltec-xmlmail:
S 2191687293:2191687293(0) win 65535 <mss 16344,nop,wscale 1,nop,nop,timestamp 824548119 0,sackOK,eol>
e...@.'@....@.............#...~}.........4....?........
1%..........
13:59:11.372304 IP localhost.xmltec-xmlmail > localhost.53127:
S 2047182235:2047182235(0) ack 2349780391 win 65535 <mss 16344,nop,wscale 2,nop,nop,timestamp 824548119 824548119,sackOK,eol>
e...@.}@....@...........#...z............4....?........
1%..1%......
13:59:11.372336 IP localhost.xmltec-xmlmail > localhost.53128:
S 994342982:994342982(0) ack 2191687294 win 65535 <mss 16344,nop,wscale 2,nop,nop,timestamp 824548119 824548119,sackOK,eol>
e...@m.@....@...........#...;DxF..~~.....4....?........
1%..1%......
13:59:11.372345 IP localhost.53127 > localhost.xmltec-xmlmail:
 . ack 1 win 40830 <nop,nop,timestamp 824548119 824548119>
e.....@.@.............#.....z......~.(.....
1%..1%..
13:59:11.372353 IP localhost.53128 > localhost.xmltec-xmlmail:
 . ack 1 win 40830 <nop,nop,timestamp 824548119 824548119>
e.....@.@.............#...~~;DxG...~.(.....
1%..1%..
13:59:11.372372 IP localhost.xmltec-xmlmail > localhost.53127:
R 2047182236:2047182236(0) win 0 <<<<<<<<---------------------------------LOOK!!!!
E..(<-...@.@...........#...z.......P.......
13:59:11.372377 IP localhost.53127 > localhost.xmltec-xmlmail:
 P 1:32(31) ack 1 win 40830 <nop,nop,timestamp 824548119 824548119>
E..S.)@....@.............#.....z......~.g.....
1%..1%..........echoI64....
...........
13:59:11.372385 IP localhost.53128 > localhost.xmltec-xmlmail:
 P 1:32(31) ack 1 win 40830 <nop,nop,timestamp 824548119 824548119>
e.....@.@.............#...~~;DxG...~.G.....
1%..1%..........echoI64....
...........
13:59:11.373086 IP localhost.53128 > localhost.xmltec-xmlmail:
 R 32:32(0) ack 1 win 40830
E..(i...@.@.............#...~.;DxGP..~....


The line marked with "LOOK!!!!" is where the failure occurs. The client and server have made their socket connections, but the server sends a RST packet before the client sends his first data packet. I'm using the TSimpleServer, and it almost looks as if the server tried to read before the client wrote any data, and when there was nothing there the server immediately closed the socket. But that would only explain this trace if the server had written data to the socket that had not yet been read by the client and if the server had linger on with a timeout of zero.

However, I did these experiments in a different order, and because of that I tried modifying the TSocket constructor so that the linger variables were initialized like this:
  lingerOn_(0),
  lingerVal_(0),

which turns linger off for the client socket. After I made this change, the loopback test on the Mac just runs and runs and runs with no errors. I also tried having linger on, but with a 5 second timeout and that also runs without error.

I'm really at a loss to explain this. The tcpdump output makes it pretty clear that the server immediately sent RST. But if I just print the port_ value in TSocket::close() and run until a failure occurs, there is no evidence that the server side closed its socket. And if I turn off linger on the client side everything works like a charm and I see no failures. How can the client side settings affect the server side behavior?

I'm hoping that one of you POSIX sockets guys will have some idea of what's happening here. It looks like we can sidestep the problem by setting linger off, or setting it on with some nonzero time if we're running a client on a Mac, but I would rather understand what's going on and fix it if possible, rather than hack around it.

- Rush

Reply via email to