Hi Willy,

Thanks for your support; it makes me believe I will solve this riddle.
After updating to 1.4.10, syncing the TC2 and LB1 clocks through NTP, and enabling the tcp-smart-connect and tcp-smart-accept options, I have seen significant improvements in server downtime, retries and redispatches. But I still see lots of retries, even though there is only 1 redispatch at TC2.
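
In case it matters, this is roughly how I have them enabled (a minimal sketch only; the section names, addresses and ports below are placeholders, not our real production config):

    # illustrative sketch, not the real production file
    defaults
        mode http
        retries 3
        option redispatch

    frontend www
        bind :80
        # don't wake haproxy up until the client actually sends data (TCP_DEFER_ACCEPT)
        option tcp-smart-accept
        default_backend tomcats

    backend tomcats
        # let the first request ride on the connection's ACK packet when possible
        option tcp-smart-connect
        # layer-7 check; a 302 response is reported as L7OK/302 in LastChk
        option httpchk GET /
        server tc1 10.0.0.1:8080 check
        server tc2 10.0.0.2:8080 check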

>> Now in the new stats page I noticed one thing which was not in 1.3.22:
>> LastChk. But I wonder: tc1 is showing L7OK/302 in 324ms and tc2 is showing
>> L7OK/302 in 104ms, while currently haproxy is running on LB1 and there are
>> 13 retries at TC2.

> The only explanation I can see is a network connection issue. What you
> describe looks like packet loss over the wire. It's possible that one
> of your NICs is dying, or that the network cable or switch port is
> defective.
>
> You should try to perform a file transfer between the machine showing
> issues and another one from the local network to verify this hypothesis.
> If you can't achieve wire speed, it's possible you're having such a
> problem. Then you should first move to another switch port (generally
> easy), then swap the cable with another one (possibly swap the cables
> between your two LBs if they're close) then try another port on the
> machine.
We are in production, and the servers are in a data center, so it won't be possible to swap cables. To check for packet loss I ran ping between LB1 and TC1, and between LB1 and TC2. LB1 to TC1 averaged 0.101 ms and LB1 to TC2 averaged 0.382 ms with 64-byte packets and 0% loss.
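
If it helps, these are the kinds of tests I can run from LB1 (the host names below, and the availability of scp/iperf on these machines, are my assumptions):

    # plain ping uses tiny packets; larger ones stress the path a little more
    ping -c 100 -s 1400 tc2

    # rough throughput check by copying a large file, as you suggested
    scp /tmp/bigfile.bin tc2:/tmp/

    # or, if iperf is installed on both ends:
    iperf -s                 # on tc2
    iperf -c tc2 -t 30       # on lb1: 30-second TCP throughput test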
> Another possible explanation which becomes quite rare nowadays would
> be that you'd be using a forced 100Mbps full duplex port on your switch
> with a gigabit port on your server, which would negotiate half duplex.
> You can check for that with "ethtool eth0" on your LBs and TCs.
I checked: we are using a vSwitch for external and internal server communication, running 1000 Mb full duplex among the virtual servers on each physical machine, and the virtual servers use 1000 Mb full duplex virtual adapters.
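
For what it's worth, this is what I check inside the guests (eth0 is an assumption; the interface name may differ):

    # negotiated speed/duplex as seen by the guest (eth0 is a placeholder)
    ethtool eth0 | grep -E 'Speed|Duplex'

    # error/drop counters on the interface, which would point at packet loss
    ip -s link show eth0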

Following are the current stats:
TC1: Retr 0, Redis 0, Status OPEN, 5h 52m UP, LastChk L7OK/302 in 321 ms, Chk 4, Dwn 1, Dwntime 4m 17s
TC2: Retr 1326, Redis 1, Status OPEN, 4h 1m UP, LastChk L7OK/302 in 87 ms, Chk 90, Dwn 2, Dwntime 26s
Backend: 5d 6m UP
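
(These numbers are from the stats web page; if it's useful, the same fields can be pulled as CSV by appending ";csv" to the stats URI. The URL and the proxy/server names below are placeholders matching the sketch above, not necessarily our real ones:)

    # CSV export of the stats page, filtered to the two tomcat servers and the backend line
    curl -s 'http://lb1/haproxy?stats;csv' | grep -E '^tomcats,(tc1|tc2|BACKEND),'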

Anyway, what does LastChk signify?

Regards,
Amit
----- Original Message ----- From: "Willy Tarreau" <w...@1wt.eu>
To: "Amit Nigam" <amitni...@gobindas.in>
Cc: "Guillaume Bourque" <guillaume.bour...@gmail.com>; <haproxy@formilux.org>
Sent: Monday, December 27, 2010 11:25 AM
Subject: Re: node frequently goes down on another physical machine


Hi Amit,

On Fri, Dec 24, 2010 at 12:24:55PM +0530, Amit Nigam wrote:
> (...)
I see nothing wrong in your configs which could justify your issues.

> Now in the new stats page I noticed one thing which was not in 1.3.22:
> LastChk. But I wonder: tc1 is showing L7OK/302 in 324ms and tc2 is showing
> L7OK/302 in 104ms, while currently haproxy is running on LB1 and there are
> 13 retries at TC2.

The only explanation I can see is a network connection issue. What you
describe looks like packet loss over the wire. It's possible that one
of your NICs is dying, or that the network cable or switch port is
defective.

You should try to perform a file transfer between the machine showing
issues and another one from the local network to verify this hypothesis.
If you can't achieve wire speed, it's possible you're having such a
problem. Then you should first move to another switch port (generally
easy), then swap the cable with another one (possibly swap the cables
between your two LBs if they're close) then try another port on the
machine.

Another possible explanation which becomes quite rare nowadays would
be that you'd be using a forced 100Mbps full duplex port on your switch
with a gigabit port on your server, which would negotiate half duplex.
You can check for that with "ethtool eth0" on your LBs and TCs.

> Also, can this issue be due to time differences between cluster nodes? As I
> have seen there is a time difference of around 2 minutes between physical
> machine 1 VMs and physical machine 2 VMs.

While it's a bad thing to have machines running at different times, I
don't see why it could cause any such issue.

Regards,
Willy



