Hi Valentino,

On Thu, Feb 19, 2009 at 11:04:21AM -0800, Valentino Volonghi wrote:
> 
> Hi, I've been trying to use haproxy 1.3.15.7 in front of a couple of
> erlang mochiweb servers in EC2.
> 
> The server alone can deal with about 3000 req/sec and
> I can hit it directly with ab or siege or tsung and see a
> similar result.
> 
> I then tried using nginx in front of the system and it was
> able to reach about the same numbers, although apparently
> it couldn't really improve performance as much as I expected
> and instead it increased latency quite a lot.
> 
> I then went on to try with haproxy, but when I use ab to
> benchmark with 100k connections at 1000 concurrency,
> after 30k requests I see haproxy jumping to 100% CPU usage.
> I tried looking into a strace of what's going on and there are
> many EADDRNOTAVAIL errors, which I suppose means that
> the source ports are exhausted,

You're correct, this is typically what happens in such a case.
The increase in CPU usage is generally caused by two factors:
  - connection retries, which act as a traffic multiplier
  - immediate closure, which causes the load injection tool
    to immediately open a new connection instead of having to
    wait a few ms for the application.
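
If you want to watch it happening live, one quick way (just an
illustration, any equivalent one-liner will do) is to count the sockets
stuck in TIME_WAIT towards the backend on the haproxy machine while ab
runs:

    netstat -ant | grep 127.0.0.1:8000 | grep -c TIME_WAIT

When that number gets close to the size of the local port range,
connect() starts failing with EADDRNOTAVAIL and the CPU usage jumps.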

> even though I increased the available ports with sysctl.

Could you try reducing the number of ports to see if the problem
still shows up after 30k reqs or if it shows up before? Use 5k
ports for instance. I'm asking this because I see that you have
25k for maxconn, and since those numbers are very close, we need
to find out which one triggers the issue.
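
For instance (numbers purely illustrative), something like this would
leave you with roughly 5k source ports:

    sysctl -w net.ipv4.ip_local_port_range="60000 64999"

If ports really are the limiting factor, the failures should then show
up well before you reach maxconn.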

> haproxy configuration is the following:
> 
> global
>     maxconn 25000
>     user haproxy
>     group haproxy
> 
> defaults
>     log global
>     mode    http
>     option  dontlognull
>     option httpclose
>     option forceclose
>     option forwardfor
>     maxconn 25000
>     timeout connect      5000
>     timeout client       2000
>     timeout server       10000
>     timeout http-request 15000
>     balance roundrobin
> 
> listen adserver
>     bind :80
>     server ad1 127.0.0.1:8000 check inter 10000 fall 50 rise 1
> 
>     stats enable
>     stats uri /lb?stats
>     stats realm Haproxy\ Stats
>     stats auth admin:pass
>     stats refresh 5s

Since you have stats enabled, could you check on the stats page
how many sessions are still active on the frontend when the
problem happens?
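
If it helps, you can also fetch the page from the command line while the
injection is running, using the credentials from your config above, for
instance:

    curl -u admin:pass 'http://127.0.0.1/lb?stats' -o stats.html

The interesting numbers are the current and max sessions on the
"adserver" frontend and the error counters on the server line.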

> Reading this list's archives I think I have some of the symptoms
> explained in these mails:
> 
> http://www.formilux.org/archives/haproxy/0901/1670.html

Oops, I don't remember having read this mail, and the poor guy
seems not to have received any response!

> This is caused by connect() failing with EADDRNOTAVAIL, and thus haproxy
> considers the server down.

I'm not that sure about this one. There's no mention of any CPU usage, and
traditionally this is the symptom of the ip_conntrack module being loaded
on the machine.
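
That one is easy to rule out on your machine, e.g. (the exact /proc
paths may differ depending on the kernel):

    lsmod | grep -i conntrack
    cat /proc/sys/net/ipv4/netfilter/ip_conntrack_count \
        /proc/sys/net/ipv4/netfilter/ip_conntrack_max

If the module is loaded and the count hits the max, new connections
start being dropped silently.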

> http://www.formilux.org/archives/haproxy/0901/1735.html
> I think I'm seeing exactly the same issue here.

There could be a common cause, though in John's test it was caused by a
huge client timeout, and maybe the client did not correctly respond to
late packets after the session was closed (typically when a firewall is
running on the client machines), causing the sessions to last for
as long as the client timeout.

> A small strace excerpt:
> 
> socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 18
> fcntl64(18, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
> setsockopt(18, SOL_TCP, TCP_NODELAY, [1], 4) = 0
> connect(18, {sa_family=AF_INET, sin_port=htons(8000), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address)
> close(18)

This one is a nice capture of the problem. The fact that the FD is
very low (18) also makes me think that you will see very few connections
on haproxy's stats page, meaning that some resource in the system
is exhausted (most likely source ports).

> recv(357, 0x9c1acb8, 16384, MSG_NOSIGNAL) = -1 EAGAIN (Resource temporarily unavailable)
> epoll_ctl(0, EPOLL_CTL_ADD, 357, {EPOLLIN, {u32=357, u64=357}}) = 0

This one is not related, it's normal behaviour: running in non-blocking
mode, you try to read, there's nothing yet (EAGAIN), so you register this
FD for polling.

> The last one mostly to show that I'm using epoll, in fact speculative  
> epoll, but even turning it off doesn't solve the issue.

Agreed, it's speculative epoll which produces the two lines above. If you
just use plain epoll, you'll see epoll_ctl(EPOLL_CTL_ADD), then
epoll_wait(), then read(). But this will not change anything about the
connect() issue the system is reporting.

> An interesting problem is that if I use mode tcp instead of mode http  
> this doesn't
> happen, but since it doesn't forward the client IP address (and I  
> can't patch
> an EC2 kernel) I can't do it.

This is because your load tester uses HTTP keep-alive by default, and
tcp mode does not interfere with it. In http mode with "option httpclose"
and "option forceclose", keep-alive is disabled. If you comment out those
two lines, the problem will disappear, but only the first request of
each session will have its x-forwarded-for header added.
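
Just to illustrate the trade-off (not a recommendation for your setup),
the relevant part of the defaults section would then look like:

    defaults
        mode    http
        option  forwardfor
        # option httpclose     # commented out: keep-alive works again and
        # option forceclose    # port usage drops, but x-forwarded-for is
                               # only added to the first request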

> ulimit-n shown by haproxy is 50k sockets, well above maxconn and
> well above the 30k where it breaks.

ulimit-n is slightly above 2*maxconn (you need two FDs per connection,
as well as a few for listeners and server health checks).
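
With your config that's roughly 2 * 25000 = 50000 FDs for the proxied
connections themselves, plus a handful for the listening socket and the
health checks, which matches the ~50k you're seeing.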

> sysctl.conf has the following settings:
> 
> # the following stops low-level messages on console
> kernel.printk = 4 4 1 7
> fs.inotify.max_user_watches = 524288
> # some spoof protection
> net.ipv4.conf.default.rp_filter=1
> net.ipv4.conf.all.rp_filter=1
> # General gigabit tuning:
> net.core.rmem_max = 33554432
> net.core.wmem_max = 33554432
> net.ipv4.tcp_rmem = 4096 16384 33554432
> net.ipv4.tcp_wmem = 4096 16384 33554432
> net.ipv4.tcp_mem = 786432 1048576 26777216

The tcp_mem line above is wrong (though it's not our problem here). It
says that you'll use a minimum of 3 GB, a default of 4 GB and a max of
about 100 GB for the TCP stack. The unit here is pages, not bytes.
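
To put numbers on it, with 4 kB pages: 786432 * 4 kB = 3 GB,
1048576 * 4 kB = 4 GB, and 26777216 * 4 kB is roughly 100 GB. The kernel
already computes reasonable tcp_mem defaults from the amount of RAM at
boot, so unless you have a specific reason it's usually safer to drop
that line.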

> net.ipv4.tcp_max_tw_buckets = 360000
> net.core.netdev_max_backlog = 2500
> vm.min_free_kbytes = 65536
> vm.swappiness = 0
> net.ipv4.ip_local_port_range = 25000 65535

Could you check net.ipv4.tcp_tw_reuse, and set it to 1 if it's zero?
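
Something like this (just the obvious commands):

    sysctl net.ipv4.tcp_tw_reuse
    sysctl -w net.ipv4.tcp_tw_reuse=1

tcp_tw_reuse lets the stack reuse a source port still sitting in
TIME_WAIT for a new outgoing connection, which is exactly what haproxy
needs here when connecting to 127.0.0.1:8000.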

> Everything runs on an ubuntu 8.04 with 2.6.21.7. Is there anything  
> that I get
> spectacularly wrong? Do you need more strace output?

More strace output will not show anything beyond your capture above.
Running "netstat -an" (or the faster "ss -an" if you have it) would
give a lot of information about the state of those 30k sockets. And
please don't forget to check the stats page. You can even save the
page, as it's self-contained in a single HTML file.
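
For example (just one possible way to summarize the states):

    netstat -ant | awk '/^tcp/ {print $6}' | sort | uniq -c
    # or, if your iproute2 is recent enough:
    ss -s

The first command gives a per-state count (I expect you'll see tens of
thousands of TIME_WAIT sockets), the second a quick global summary.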

Regards,
Willy

