On Fri, Dec 18, 2009 at 05:00:38PM -0800, Joe Torsitano wrote:
> Hi Willy,
> 
> What's strange is traffic still appears normal, and is, for probably at
> least 99% of the visitors.  Logged traffic remains about normal (hundreds of
> thousands of visitors a day).  I just get a few e-mails asking why the site
> has been down for days or when it will be back.  But I cannot recreate the
> problem.  And I know there are probably people who just don't e-mail and,
> unfortunately, don't come back.

Yes, that's very possible, unfortunately.

> Here is the config file with the IP addresses changed, pretty much the
> default that comes with it...

A few questions that come to mind :
- What version are you running by the way (haproxy -vv) ?
  Several cases of truncated responses were observed between
  1.3.16 and 1.3.18, and sometimes a 502 response could be
  sent if the server closed too fast before 1.3.19. So please
  ensure you're on 1.3.22. More info here about the bugs in
  your version :

    http://haproxy.1wt.eu/knownbugs-1.3.html

- Have you tried to look for client errors in the logs ?

- Have you tried to look in the logs if you could find some of
  the complainers' traces ? Most often, you can check for the
  same class-B or class-C addresses as the IP that posted the
  mail, and try to isolate the accesses by taking the access
  time into account.

- Are you sure that 2000 concurrent connections are enough ?
  You may check that in the logs too, as there is a field
  with connection counts.

- I'm seeing there is no "option httpclose" below. Could you
  try to add it in the defaults section and see if it changes
  anything ? Before doing that, please check that you don't
  have iptables enabled on your haproxy machine.
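
As a rough sketch of the log search suggested above: the helper below
prints the log lines whose client address is in the same class-C (/24)
as a complainer's IP, optionally narrowed to a time string. The function
name, the log path and the sample address are only assumptions for
illustration, and it assumes the client address appears in each log line
preceded by a space, as in haproxy's usual HTTP log format.

```shell
# Hypothetical helper, not part of haproxy.
# usage: same_net_hits <complainer-ip> <logfile> [time-substring]
same_net_hits() {
    ip=$1; logfile=$2; when=${3:-}
    # Keep the first three octets: 203.0.113.57 -> 203.0.113.
    prefix=${ip%.*}.
    # The leading space anchors the prefix at the start of an address,
    # so 203.0.113. does not also match 1203.0.113.
    grep -F " ${prefix}" "$logfile" | grep -F -- "$when"
}

# Example (path and values are assumptions):
#   same_net_hits 203.0.113.57 /var/log/haproxy.log '18/Dec/2009:17'
```

An empty time string matches every line, so leaving out the third argument
lists all hits from that network.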

I'm also thinking about something else. You said that when
you don't go through haproxy you don't get any complaint.
Are your systems configured similarly ? I mean, the very
low rate of problems could very well be caused by some TCP
settings which are incompatible with a minority of users
running behind a buggy router/firewall.

In order to check this, you could run the following command
on each server (including the one with haproxy) :

    $ sysctl -a | fgrep net.ipv4.tcp

Please verify if tcp_ecn and tcp_window_scaling are at the
same values. If not, start by setting tcp_ecn to 0 on
the haproxy server. Then later you can try to similarly
disable tcp_window_scaling, though this one is far less
likely because it's enabled almost everywhere.
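
One possible way to do the comparison, as a sketch only: save the two
settings to a file on each machine, copy the files to one place, and let
a small helper report the differences. The helper name and the file
names are made up for the example; it assumes the files contain sysctl's
usual "key = value" output.

```shell
# On each machine, save the two settings of interest, e.g.:
#   sysctl net.ipv4.tcp_ecn net.ipv4.tcp_window_scaling > "$(hostname).tcp"

# Hypothetical helper: report which of the two settings differ
# between two saved files.
compare_tcp_settings() {
    a=$1; b=$2
    for key in net.ipv4.tcp_ecn net.ipv4.tcp_window_scaling; do
        # The trailing space keeps tcp_ecn from matching longer key names.
        va=$(grep -F "$key " "$a" | sed 's/.*= *//')
        vb=$(grep -F "$key " "$b" | sed 's/.*= *//')
        [ "$va" = "$vb" ] || echo "$key: $va vs $vb"
    done
}

# If tcp_ecn differs, it can be turned off on the haproxy machine (as root):
#   sysctl -w net.ipv4.tcp_ecn=0
```

Note that a value set with "sysctl -w" is lost at reboot unless it is also
written to /etc/sysctl.conf.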

Also check with "ip route" and "ip address" on all servers
if you don't see a different MTU value on the default
route. It's possible that a small part of your clients
are still running a misconfigured PPPoE ADSL line and
can't send/receive full packets. There are still some
large sites who deal with that by setting their MTU to
1492 or even 1452 on the external interface. But this
is less likely.
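
To make the MTU comparison a bit easier, here is a small sketch that
pulls "interface mtu" pairs out of the one-line output of "ip link".
The helper name is made up, and "eth0" in the comments merely stands in
for whatever your external interface is called. The value 1492 is the
usual 1500 minus the 8 bytes of PPPoE overhead.

```shell
# Hypothetical helper: print "interface mtu" pairs from
# `ip -o link show` output read on stdin.
list_mtus() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "mtu") {
                name = $2
                sub(/:$/, "", name)   # strip the trailing colon from "eth0:"
                print name, $(i + 1)
            }
    }'
}

# usage, on each server:
#   ip -o link show | list_mtus
#
# To try a lower MTU on the external interface (as root, eth0 assumed):
#   ip link set dev eth0 mtu 1492
```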

Regards,
Willy

