Hi Marcus,

On Wed, Dec 30, 2009 at 02:04:26PM +0100, Marcus Herou wrote:
> Hi Willy, thanks for your answer it got filtered, that's why I missed it for
> two weeks.

No problem, it happens to me too from time to time.

> Let's start with describing the service.
> 
> We are hosting javascripts of the sizes up to 20K and serve flash and image
> banners as well which of course are larger. That is basically it.. Ad
> Serving.
> 
> On the LB's we have about 2MByte/s per LB  = 2x2MByte/s = 4MByte/s ~30MBit/s
> at peak, that is not the issue.

OK so you're running at approx 100 hits/s per LB on average.
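For what it's worth, that figure is just your bandwidth divided by an assumed
~20 kB average object size (based on the 20K javascripts you mention):

```shell
# Rough estimate: 2 MB/s per LB at ~20 kB per object.
# (20 kB average is an assumption, not a measured value.)
echo "$((2 * 1024 / 20)) hits/s"   # → 102 hits/s
```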

> I've created a little script which parse the "active connections" from the
> HAProxy stat interface and plots it into Cacti, it peaks at 100 (2x100)
> connections per machine which is very little in your world I guess.

It's not "little"; I'd even say it's about average, as most sites run at
very low rates.

> I've attached a plot of tcp-connections as well. Nothing fancy there either
> besides that the number of TIME_WAIT sockets are in the 10000 range (log
> scale)

10000 TIME_WAIT sockets with the default 60-second timeout means roughly 166
sessions per second on average. That's still very reasonable and does not
require any particular tuning.
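If you want to double-check the rate yourself, it's just the TIME_WAIT count
divided by the 60 seconds the kernel keeps a socket in that state (the
counting command in the comment is one way among several):

```shell
# On your boxes, something like
#   netstat -tan | grep -c TIME_WAIT
# gives the live count. Using the ~10000 from your graph:
tw_count=10000
tw_linger=60                                  # Linux TIME_WAIT duration, seconds
echo "$((tw_count / tw_linger)) sessions/s"   # → 166 sessions/s
```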

> Here's the problem:
> 
> Every other day I receive alarms from Pingdom that the service is not
> available and if I watch the syslog I get at about the same timings hints
> about possible SYN flood. At the same timings we receive emails from sites
> using us that our service is damn slow.
> 
> What I feel is that we get "hickups" on the LB's somehow and that requests
> get queued. If I count the number of rows in the access logs on the machines
> behind the LB it decreases at the same timings and with the same factor on
> each machine (perhaps 10-20%) leading me to think that the narrow point is
> not on the backend side.

this means to me that:
  1) your SYN backlog is too short. It defaults to 128 packets per socket on
     Linux (the min of net.core.somaxconn and net.ipv4.tcp_max_syn_backlog),
     so you need to increase them (around 10000 for both always gives me good
     results).
  2) you may be experiencing SYN flood attacks from time to time.
  3) you have not enabled SYN cookies, which can protect against such issues,
     especially during SYN attacks. You can enable them with
     net.ipv4.tcp_syncookies.

If you don't get any attack, #1 should be enough, but #3 is a good complement
that kicks in once #1 is no longer enough.
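As a sketch, the corresponding /etc/sysctl.conf entries would look like the
following (note the sysctl is really spelled tcp_syncookies); running
`sysctl -p` applies them without a reboot:

```
# Point #1: enlarge the SYN backlog (both values are involved)
net.core.somaxconn = 10000
net.ipv4.tcp_max_syn_backlog = 10000

# Point #3: enable SYN cookies as a safety net during floods
net.ipv4.tcp_syncookies = 1
```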

It is also possible that you have not allowed enough connections in haproxy
and that the frontend stays saturated for a long time; in that case the short
backlog from #1 above is still what makes the problem visible. This can be
monitored on haproxy's stats page (the "limit" and "max" columns for the
frontend's sessions).
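For completeness, here is roughly where that limit lives in the haproxy
configuration; the section and backend names below are made up, only the
maxconn lines matter:

```
global
    maxconn 20000            # process-wide connection ceiling

frontend public              # hypothetical name
    bind :80
    maxconn 10000            # shown as "limit" next to "max" on the
                             # stats page for this frontend's sessions
    default_backend adservers
```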

> A little more about the backend servers:
> 
> We have an ad publishing system which pushes data to the web-servers
> enabling them to act almost 100% static; this has been the key thing which
> I tuned some years ago. Initially every request went to a DB but now just a
> simple Hashtable which is replicated from a "master".
> 
> The backend servers have very little to do and consumes very little
> resources:
> Example:
> top - 11:34:23 up 366 days,  1:15,  1 user,  load average: 0.37, 0.25, 0.23
> Tasks:  79 total,   1 running,  78 sleeping,   0 stopped,   0 zombie
> Cpu(s):  0.8%us,  0.5%sy,  0.0%ni, 94.0%id,  4.6%wa,  0.0%hi,  0.2%si,  0.0%st
> Mem:   4052904k total,  4008696k used,    44208k free,   292932k buffers
> Swap:  3903784k total,     9240k used,  3894544k free,  2145340k cached

You should be careful: this one has swapped at least once. Using swap on any
web server is very nasty, as it considerably increases response times.
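A quick way to see how much swap a box is actually using (it reads
/proc/meminfo, so it is Linux-only):

```shell
# SwapTotal - SwapFree = swap currently in use, in kB.
awk '/^SwapTotal:/ {t=$2} /^SwapFree:/ {f=$2} END {print t - f, "kB of swap in use"}' /proc/meminfo
```

Anything persistently non-zero there is worth chasing; tuning vm.swappiness
is one knob, but freeing memory on the server is the real fix.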

Regards,
Willy

