Hi Amit,

Try a "netstat -in" and see if you have any errors on your interfaces :)
It might help to figure out whether you have a duplex mismatch.

cheers

On Wed, Dec 29, 2010 at 8:04 AM, Amit Nigam <amitni...@gobindas.in> wrote:
> Hi Willy,
>
> Thanks for your support; it makes me believe I will solve this riddle.
> After updating to 1.4.10, syncing TC2 and LB1 times through NTP, and using
> options tcp-smart-connect and tcp-smart-accept, I have seen significant
> improvements in server downtimes, retries and redispatches. But I still see
> lots of retries, though there is only 1 redispatch at TC2.
>
>>> Now in the new stats page I noticed one thing which was not in 1.3.22:
>>> LastChk. But I wonder why tc1 is showing L7OK/302 in 324ms and tc2 is
>>> showing L7OK/302 in 104ms, while haproxy is currently running on LB1 and
>>> there are 13 retries at TC2.
>>
>> The only explanation I can see is a network connection issue. What you
>> describe looks like packet loss over the wire. It's possible that one
>> of your NICs is dying, or that the network cable or switch port is
>> defective.
>>
>> You should try to perform a file transfer between the machine showing
>> issues and another one from the local network to verify this hypothesis.
>> If you can't achieve wire speed, it's possible you're having such a
>> problem. Then you should first move to another switch port (generally
>> easy), then swap the cable with another one (possibly swap the cables
>> between your two LBs if they're close), then try another port on the
>> machine.
>
> We are in production, and the servers are in a data center, so it won't be
> possible to swap cables.
> To ascertain packet loss I ran pings between LB1 and TC1/TC2. The LB1 to
> TC1 average time was 0.101 ms, and LB1 to TC2 was 0.382 ms, with 64-byte
> packets and 0% loss.
>
>> Another possible explanation, which has become quite rare nowadays, would
>> be that you're using a forced 100Mbps full-duplex port on your switch
>> with a gigabit port on your server, which would negotiate half duplex.
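[As an aside, the "netstat -in" tip above can be turned into a quick mechanical check for the packet-loss hypothesis. A minimal sketch, run against made-up sample output since column positions vary by platform; on Linux the RX-ERR/TX-ERR and RX-DRP/TX-DRP columns are the ones to watch:]

```shell
# Hypothetical "netstat -in" output; replace the here-string with the real
# command on your LBs and TCs. The error counts below are invented.
sample='Iface   MTU Met   RX-OK RX-ERR RX-DRP RX-OVR   TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500 0  8293451    214      0      0  7712030      0      0      0 BMRU
lo    16436 0    11230      0      0      0    11230      0      0      0 LRU'

# Flag any interface whose error or drop counters are non-zero
# ($5/$6 = RX errors/drops, $9/$10 = TX errors/drops in this layout).
echo "$sample" | awk 'NR > 1 && ($5 + $6 + $9 + $10) > 0 { print $1, "errors:", $5 + $9, "drops:", $6 + $10 }'
```

[Steadily climbing error counters on one interface would point at the NIC, cable or switch port Willy mentions.]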
>> You can check for that with "ethtool eth0" on your LBs and TCs.
>
> I checked: we are using a vSwitch for external and internal server
> communication, 1000Mb full duplex among the virtual servers on a physical
> machine. The virtual servers are using 1000Mb full-duplex virtual adapters.
>
> Following are the current stats:
> TC1: Retr: 0, Redis: 0, Status OPEN 5h 52m UP, LastChk L7OK/302 in 321 ms,
> Server Chk: 4, Dwn: 1, Dwntime: 4m 17s.
> TC2: Retr: 1326, Redis: 1, Status OPEN 4h 1m UP, LastChk L7OK/302 in 87 ms,
> Server Chk: 90, Dwn: 2, Dwntime: 26s.
> Backend: 5d 6m UP
>
> Anyway, what does LastChk signify?
>
> Regards,
> Amit
>
> ----- Original Message -----
> From: "Willy Tarreau" <w...@1wt.eu>
> To: "Amit Nigam" <amitni...@gobindas.in>
> Cc: "Guillaume Bourque" <guillaume.bour...@gmail.com>; <haproxy@formilux.org>
> Sent: Monday, December 27, 2010 11:25 AM
> Subject: Re: node frequently goes down on another physical machine
>
>> Hi Amit,
>>
>> On Fri, Dec 24, 2010 at 12:24:55PM +0530, Amit Nigam wrote:
>> (...)
>> I see nothing wrong in your configs which could justify your issues.
>>
>>> Now in the new stats page I noticed one thing which was not in 1.3.22:
>>> LastChk. But I wonder why tc1 is showing L7OK/302 in 324ms and tc2 is
>>> showing L7OK/302 in 104ms, while haproxy is currently running on LB1 and
>>> there are 13 retries at TC2.
>>
>> The only explanation I can see is a network connection issue. What you
>> describe looks like packet loss over the wire. It's possible that one
>> of your NICs is dying, or that the network cable or switch port is
>> defective.
>>
>> You should try to perform a file transfer between the machine showing
>> issues and another one from the local network to verify this hypothesis.
>> If you can't achieve wire speed, it's possible you're having such a
>> problem. Then you should first move to another switch port (generally
>> easy), then swap the cable with another one (possibly swap the cables
>> between your two LBs if they're close), then try another port on the
>> machine.
>>
>> Another possible explanation, which has become quite rare nowadays, would
>> be that you're using a forced 100Mbps full-duplex port on your switch
>> with a gigabit port on your server, which would negotiate half duplex.
>> You can check for that with "ethtool eth0" on your LBs and TCs.
>>
>>> Also, can this issue be due to time differences between cluster nodes?
>>> As I have seen, there is a time difference of around 2 minutes between
>>> the physical machine 1 VMs and the physical machine 2 VMs.
>>
>> While it's a bad thing to have machines running at different times, I
>> don't see why it could cause any such issue.
>>
>> Regards,
>> Willy
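[The "ethtool eth0" check Willy suggests can also be scripted. A hedged sketch, parsing a made-up `ethtool` excerpt rather than a live interface; on a real box you would pipe `ethtool eth0` through the same awk. A forced-100/gigabit mismatch typically shows up as "Duplex: Half" on the auto-negotiating side:]

```shell
# Hypothetical "ethtool eth0" excerpt; the values below are invented to show
# what a mismatched link might look like.
sample='Settings for eth0:
    Speed: 100Mb/s
    Duplex: Half
    Auto-negotiation: off'

# Warn whenever the link is not running full duplex.
echo "$sample" | awk -F': ' '/Duplex:/ { if ($2 != "Full") print "WARNING: link is " $2 " duplex" }'
```

[On a healthy gigabit link you would expect "Speed: 1000Mb/s" and "Duplex: Full", and the check would print nothing.]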