Hi Naveen,

On Mon, May 24, 2010 at 01:20:42PM -0400, Naveen Ayyagari wrote:
(...)
> When we track that request on the application server, it is received and 
> based on PHP timing of the process it completes in about .02 seconds. The 
> access log for the application server reports it returned a 200 response.
> 
> However haproxy times out at 60 seconds (as it is supposed to), because it 
> never got the response.

Ah, that's a very interesting issue, because normally once the
response is emitted by the server, it's always received by the
client (here haproxy).

> I don't think it is a problem with a specific machine, as I have seen this 
> behavior on each of the application servers, and it is not specific to that 
> particular path either.
> 
> It is unclear why the response does not make it back to haproxy and I am 
> curious how one should go about debugging this issue.

Among the possible causes I can think of:

  - are you sure the server response is complete when emitted? If
    the server only sent part of the headers, you may see the 200
    in its logs but still get a timeout on haproxy, since haproxy
    never sees a complete response.

  - are you sure the response actually left the application server?
    Please run a network capture on the server itself when this
    happens so that you can be sure that everything was sent. Use
    "tcpdump -s0 -w file.cap tcp port XXXX" for this.

  - are you sure you don't have MTU issues on your network? It is
    possible that small packets pass but large ones don't. This can
    happen for instance when a piece of equipment between the
    client and the server (eg: a switch) does not support jumbo
    frames while both ends are using them. You would see that in
    a network capture on the haproxy host, because the large
    packets will never reach it.

  - it is possible, though very unlikely, that either system
    sometimes computes a wrong TCP checksum on incoming or outgoing
    packets (probably the NICs if so) and that some packets are
    always considered bad for this reason. I've only met this case
    once, which is why I say it's very unlikely to be the cause. To
    be sure, you could disable hardware TCP checksumming on both
    sides (using ethtool -K under Linux).
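
As a rough sketch of how the MTU and checksum checks above could be
run by hand: the script below only prints the commands, it does not
execute them. It assumes Linux with iputils ping and ethtool; the
address 192.0.2.10, the interface eth0 and the 9000-byte MTU are
placeholders of mine, not values from your setup, and the helper
names are made up for the example.

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header (no options) and 8 bytes of ICMP header.
icmp_payload() {
    echo $(( $1 - 28 ))
}

# Print (do not run) the checks for one peer.
# $1 = peer address, $2 = local interface, $3 = MTU configured on both ends.
# All three are hypothetical example inputs.
show_checks() {
    peer=$1
    iface=$2
    mtu=$3
    # Full-size ping with fragmentation prohibited: if this fails while a
    # small ping works, something on the path drops large frames.
    echo "ping -M do -c 3 -s $(icmp_payload "$mtu") $peer"
    # Show the current offload settings, then turn off hardware RX/TX
    # checksumming so the kernel computes and verifies checksums itself.
    echo "ethtool -k $iface"
    echo "ethtool -K $iface rx off tx off"
}

show_checks 192.0.2.10 eth0 9000
```

Run the printed ping from the haproxy host towards the server and
from the server towards haproxy, since the problem may only show up
in one direction. If you do disable checksum offloading, remember
that "ethtool -K ... rx on tx on" puts it back.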

Well, I have no more ideas for now; you'll have to get traces if
nothing above matches your issue!

Regards,
Willy

