On Feb 14, 2011, at 10:37 AM, Robert Olson wrote:

> 
>> Hello Robert,
>> 
>> On 14/02/2011, at 16:36, Robert Olson wrote:
>> 
>>> The problem we're seeing is that for a particular test script on the
>>> client side, one of the exchanges is failing.
>>> 
>>> Looking at packet traces, I see the client sending a complete
>>> request to cherokee. Cherokee sends the request to the compute
>>> server, but it appears to be truncating the request one packet shy
>>> of finishing it. The compute server then reports a bad parse in
>>> response.
>>> 
>>> The problem initially showed up fairly reliably only when both front
>>> ends were running. If I killed wackamole on one of them (pushing
>>> both IPs over to a single server) the problem vanished.
>> 
>> Did it perform a clean 'three way' close sequence (FIN, FIN+ACK, ACK)?
>> The last packet might be lost if an RST were sent while the connection
>> is being closed.
>> 
>> Cheers!
> 
> Here's the last bit of one of the failed exchanges. ml-mds is the frontend, 
> oak is the compute server. It sure looks like the frontend just decided to 
> close down the connection; its last packet appears to be the FIN. I 
> unfortunately don't appear to have saved any subsequent packets. I'll be 
> trying today to replicate the problem again and get strace output on the 
> frontends as well as detailed packet traces all around.
> 
> Thanks,

No joy yet on getting the problem repeated, but on looking at the traces for 
the successful runs, the initial FIN was sent by the compute server when it 
finished writing its output; the client (cherokee) only sent its FIN in 
response.  We've had some flaky behavior with the network switch that these 
systems use, so I'm not going to rule out hardware issues (though it seems 
weird that a hardware failure would trigger an early FIN).

Aha. I think the key is here:

10:48:00.188691 IP oak.mcs.anl.gov.5104 > ml-mds.mcs.anl.gov.42694: . ack 15521 win 288 <nop,nop,timestamp 1364352976 1359990193>
        0x0000:  4500 0034 68ae 4000 4006 c140 c005 c860  E..4h.@.@..@...`
        0x0010:  c005 c869 13f0 a6c6 7649 376b 6cd7 4707  ...i....vI7kl.G.
        0x0020:  8010 0120 789a 0000 0101 080a 5152 5fd0  ....x.......QR_.
        0x0030:  510f cdb1                                Q...
10:48:17.186797 IP ml-mds.mcs.anl.gov.42694 > oak.mcs.anl.gov.5104: F 15521:15521(0) ack 1 win 46 <nop,nop,timestamp 1360007195 1364352976>
        0x0000:  4500 0034 c7e4 4000 4006 620a c005 c869  E..4..@.@.b....i
        0x0010:  c005 c860 a6c6 13f0 6cd7 4707 7649 376b  ...`....l.G.vI7k
        0x0020:  8011 002e 3721 0000 0101 080a 5110 101b  ....7!......Q...
        0x0030:  5152 5fd0                                QR_.

There's 17 seconds between those packets. The cherokee timeout had been set to 
the default at this point. I bet oak was waiting for its additional data, 
ml-mds wasn't sending it, and cherokee timed out and closed the connection. I 
think this is consistent with dropped packets.
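
As a rough sanity check on that reading, here is a minimal Python sketch that recomputes the gap between the two capture timestamps and decodes the TCP flags byte from the second hex dump. The timestamps and hex words are copied from the trace above; the helper names (gap_seconds, tcp_flags) are made up for illustration, and the code assumes a plain IPv4 header followed directly by TCP, which is what the dump shows.

```python
from datetime import datetime

# Values copied from the two tcpdump records in the trace above.
ACK_TIME = "10:48:00.188691"  # oak -> ml-mds, bare ACK
FIN_TIME = "10:48:17.186797"  # ml-mds -> oak, FIN

FIN_PACKET_HEX = (
    "4500 0034 c7e4 4000 4006 620a c005 c869 "
    "c005 c860 a6c6 13f0 6cd7 4707 7649 376b "
    "8011 002e 3721 0000 0101 080a 5110 101b "
    "5152 5fd0"
)

# Low six bits of the TCP flags byte.
TCP_FLAG_NAMES = {
    0x01: "FIN", 0x02: "SYN", 0x04: "RST",
    0x08: "PSH", 0x10: "ACK", 0x20: "URG",
}


def gap_seconds(earlier, later):
    """Seconds between two same-day tcpdump timestamps (HH:MM:SS.ffffff)."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return delta.total_seconds()


def tcp_flags(packet_hex):
    """Names of the TCP flags set in an IPv4/TCP packet given as hex words."""
    raw = bytes.fromhex(packet_hex.replace(" ", ""))
    ip_header_len = (raw[0] & 0x0F) * 4    # IHL is counted in 32-bit words
    flag_byte = raw[ip_header_len + 13]    # flags sit at byte 13 of the TCP header
    return [name for bit, name in sorted(TCP_FLAG_NAMES.items()) if flag_byte & bit]


if __name__ == "__main__":
    print(f"gap: {gap_seconds(ACK_TIME, FIN_TIME):.3f} s")
    print("flags:", tcp_flags(FIN_PACKET_HEX))
```

This only confirms what tcpdump already printed (a gap just under 17 seconds, and FIN plus ACK on the frontend's last packet); it says nothing about why the data stopped flowing.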

--bob

_______________________________________________
Cherokee mailing list
[email protected]
http://lists.octality.com/listinfo/cherokee
