On Fri, Dec 18, 2009 at 05:00:38PM -0800, Joe Torsitano wrote:
> Hi Willy,
>
> What's strange is traffic still appears normal, and is, for probably at
> least 99% of the visitors.  Logged traffic remains about normal (hundreds of
> thousands of visitors a day).  I just get a few e-mails asking why the site
> has been down for days or when it will be back.  But I cannot recreate the
> problem.  And I know there are probably people who just don't e-mail and,
> unfortunately, don't come back.

Yes, very possible unfortunately.

> Here is the config file with the IP addresses changed, pretty much the
> default that comes with it...

A few questions that come to mind :

  - What version are you running by the way (haproxy -vv) ? Several cases
    of truncated responses were observed between 1.3.16 and 1.3.18, and
    before 1.3.19 a 502 response could sometimes be sent if the server
    closed too fast. So please ensure you're on 1.3.22. More info here
    about the bugs in your version :

        http://haproxy.1wt.eu/knownbugs-1.3.html

  - Have you tried to look for client errors in the logs ?

  - Have you tried to look in the logs if you could find some of the
    complainers' traces ? Most often, you can check for the same class-B
    or class-C addresses as the IP that posted the mail, and try to
    isolate the accesses by taking the access time into account.

  - Are you sure that 2000 concurrent connections are enough ? You may
    check that in the logs too, as there is a field with connection
    counts.

  - I'm seeing there is no "option httpclose" below. Could you try to add
    it in the defaults section and see if it changes anything ? Before
    doing that, please check that you don't have iptables enabled on your
    haproxy machine.

I'm also thinking about something else. You said that when you don't go
through haproxy you don't get any complaint. Are your systems configured
similarly ? I mean, the very low rate of problems could very well be caused
by some TCP settings which are incompatible with a minority of users running
behind a buggy router/firewall. In order to check this, you could run the
following command on each server (including the one with haproxy) :

    $ sysctl -a | fgrep net.ipv4.tcp

Please verify if tcp_ecn and tcp_window_scaling are at the same values. If
not, start by setting tcp_ecn to 0 on the haproxy server. Then later you can
try to similarly disable tcp_window_scaling, though this one is far less
likely because it's enabled almost everywhere.
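
If it helps, the comparison of tcp_ecn and tcp_window_scaling could be
scripted roughly as below. This is only a sketch: it assumes you have saved
the output of the sysctl command above into one text file per machine, and
the function names (tcp_setting, compare_tcp) and file names are made up for
the example.

```shell
#!/bin/sh
# Sketch only: compares two saved "sysctl -a | fgrep net.ipv4.tcp" dumps.
# The function names and the dump file names are hypothetical.

# Print the value of one sysctl key from a saved dump.
# Usage: tcp_setting <dump-file> <key>
tcp_setting() {
    # Dump lines look like "net.ipv4.tcp_ecn = 0"; strip the blanks
    # around both the key and the value before comparing/printing.
    awk -F'=' -v key="$2" '
        { k = $1; gsub(/[ \t]/, "", k) }
        k == key { v = $2; gsub(/[ \t]/, "", v); print v; exit }
    ' "$1"
}

# Report whether tcp_ecn and tcp_window_scaling match between two dumps.
# Usage: compare_tcp <haproxy-dump> <server-dump>
compare_tcp() {
    for key in net.ipv4.tcp_ecn net.ipv4.tcp_window_scaling; do
        a=$(tcp_setting "$1" "$key")
        b=$(tcp_setting "$2" "$key")
        if [ "$a" = "$b" ]; then
            echo "$key: same ($a)"
        else
            echo "$key: DIFFERENT ($a vs $b)"
        fi
    done
}

# Example (file names are assumptions):
#   sysctl -a 2>/dev/null | fgrep net.ipv4.tcp > tcp-haproxy.txt   # on haproxy
#   sysctl -a 2>/dev/null | fgrep net.ipv4.tcp > tcp-web1.txt      # on a server
#   compare_tcp tcp-haproxy.txt tcp-web1.txt
```

Any line reported as DIFFERENT is a candidate for the change described
above, starting with tcp_ecn on the haproxy machine.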

Also check with "ip route" and "ip address" on all servers if you don't see
a different MTU value on the default route. It's possible that a small part
of your clients are still running a misconfigured PPPoE ADSL line and can't
send/receive full packets. There are still some large sites which deal with
that by setting their MTU to 1492 or even 1452 on the external interface.
But this is less likely.

Regards,
Willy