Hi Chris,

On Fri, Nov 22, 2013 at 06:40:50PM -0500, Chris Burroughs wrote:
> I am currently trying to migrate a somewhat over-complicated and 
> over-provisioned setup to something simpler and more efficient.  The 
> application servers lots of small HTTP requests (often 204 or 3xx), and 
> gets requests from all sorts of oddly behaving clients or 
> over-aggressive pre-connecting user agents. (Thanks for the help in the 
> client timeout thread!)
> 
> The servers are:
>  * AMD Opteron 4184, 6 cores, 2.8 GHz
>  * 2x gigabit Intel 82576 with the igb driver
>  * CentOS 6 (2.6.32-358.23.2.el6.x86_64)
> 
> Running 1.4.24 with all traffic currently coming from the outside on 
> eth0 and going to the backend servers on eth1.
> 
> The setup that I thought would work well is:
>  * core 0: all network interrupts
>  * core 1: haproxy
>  * core 2-5: httpd for ssl termination

That's what I often do with pretty good results.
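
For reference, a minimal sketch of that pinning on CentOS 6, assuming
the igb queues show up with the interface name in /proc/interrupts
(the interrupt name pattern is an assumption, the pidfile path is the
one from your config):

    # keep irqbalance from overriding manual affinities
    service irqbalance stop

    # pin all eth0/eth1 queue interrupts to core 0 (mask 01)
    for irq in $(awk -F: '/eth[01]/ {print $1}' /proc/interrupts); do
        echo 01 > /proc/irq/$irq/smp_affinity
    done

    # pin the single haproxy process to core 1
    taskset -cp 1 $(cat /var/run/haproxy.pid)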

> After some adventures in interrupt tuning so that core 0 would not be 
> killed by > 100k/s we got this to satisfy roughly 20k req/s.  At this 
> point hatop would report the Queue was growing without bound and maxconn 
> would fill up.  The cause looks simple enough, core 1 was consistently 
> 0.00% idle at that load http://pastebin.com/kuT5uCtP.
> 
> My first question is if the roughly 1:3 usr/sys ratio is "normal" for a 
> well functioning haproxy serving tiny http objects?  See sample config 
> attached.

This is 25% user and 75% system. The user share is on the high side:
you generally see 15-25% user for 75-85% system. But since you have
logs enabled, it's not really surprising, so yes, it's within the norm.
You should be able to improve this slightly by using "http-server-close"
instead of "httpclose"; it will actively close server-side connections
and save a few packets.
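
In your defaults section that would simply be (only the changed line
shown):

    defaults
        option http-server-close    # instead of "option httpclose"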

> While I don't have an empirical basis for comparison on this 
> older hardware, 20k req/s also "seemed" low.

I remember benchmarking another recent Opteron last year (a many-core
7-something) and it performed very poorly, about 18k req/s, much worse
than my 4-year-old Phenom 9950. One of the reasons was that it was
difficult to share the L3 cache between certain cores. I found little
information on the 4184, except on Wikipedia (to be taken with a grain
of salt):

   http://en.wikipedia.org/wiki/Bulldozer_%28microarchitecture%29

Thus it probably suffers from the same design as the 7xxx: you need to
identify the cores belonging to the same module, so that they share the
same L2 cache and sit in the same part of the shared L3 cache, otherwise
the inter-core communications happen via the outside.
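
You can check this directly from sysfs; a quick sketch (index2 is
normally the L2 cache, check .../cache/index2/level if in doubt):

    # for each core, show which CPUs share its L2
    for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
        echo "$cpu: $(cat $cpu/cache/index2/shared_cpu_list)"
    done

hwloc's lstopo shows the same thing in a more readable form if you
have it installed.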

These CPUs seem to be designed for VM hosting, or running highly
threaded Java apps which don't need much FPU. I'm not certain they
were optimized for network processing unfortunately, which is sad
considering that their older brothers were extremely fast at that.

> Because it's what we had used in the old setup I also tried setting 
> nbproc=6 with no IRQ pinning (tests showed IRQs on one core was 
> beneficial with no nbproc).  This was able to handle close to *twice* as 
> many req/s.

That certainly indicates *very* independent cores!

> This was a surprising result because almost every thread 
> mentioning nbproc suggests not using it and I expected at best marginal 
> gains.

Generally you're quickly limited by the network stack's ability to run
connect() calls fast enough; they require a bit of locking to pick a
source port, and that constitutes the highest cost in haproxy. But as
you can see, it depends a lot on the CPU architecture.
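
It doesn't remove the locking cost, but if you stay single-process it
is worth checking that the local port range is wide enough so that the
port search doesn't get harder under load; for example (values below
are only an illustration):

    # range used to pick source ports for outgoing connections
    sysctl net.ipv4.ip_local_port_range

    # widen it if needed
    sysctl -w net.ipv4.ip_local_port_range="1024 65000"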

> Either way I'd prefer to avoid nbproc for stats and all of the 
> typical reasons.

Sure.

> Finally assuming the single process performance can not be further 
> improved I was considering the following setup:
>  * core 0: eth0 interrupts
>  * core 1: haproxy bound to eth0
>  * core 2: eth1 interrupts
>  * core 3: haproxy bound to eth1
>  * core 4-5: ssl terminator

I definitely agree. I know at least one setup which runs fine this way.
It was a two-socket system, each socket with its own NIC and process.
But here you're in the same situation: consider that you have 3
independent CPUs in the same box. The benefit of doing it this way is
that you can still parallelize network interrupts across multiple cores
without having the response traffic come back to the wrong core
(proxies are hell to optimize because of their two sides).

However you must absolutely figure out which core shares its L2 with
which other core. I suspect you'll have core 0 + core 3, core 1 +
core 4, core 2 + core 5, but that's only a guess.

> But I could not find too many examples of similar setups and was unsure 
> if it was a viable long term configuration.

Yes it is viable. The only limit right now is that you'll need to start
two processes. In the future, when listeners reliably support the
"bind-process" keyword, it will even be possible to centralize
everything and have a dedicated stats socket for each.

In the meantime I suggest that you run two processes with almost the
same config, except for the interfaces. Note that haproxy supports
binding to interfaces.
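
For example (file names, pidfile paths and core numbers below are only
an illustration, not taken from your config):

    # one config per interface, each with its own pidfile and stats socket
    haproxy -f /etc/haproxy/haproxy-eth0.cfg
    haproxy -f /etc/haproxy/haproxy-eth1.cfg

    # pin each process to its core (1 for the eth0 side, 3 for the eth1 side)
    taskset -cp 1 $(cat /var/run/haproxy-eth0.pid)
    taskset -cp 3 $(cat /var/run/haproxy-eth1.pid)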

Then there are two possibilities. Either you present two different
addresses to the world, or you present only one and your front router
or L3 switch dispatches the traffic to the two interfaces using ECMP.
This last solution tends to be preferred and more common these days.

Thus you have this:

    NIC1 (eth0) 192.168.0.1
    NIC2 (eth1) 192.168.0.2
    VIP on loopback: 192.168.0.3

L3: route 192.168.0.3 nexthop 192.168.0.1 nexthop 192.168.0.2
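
On the haproxy box itself, the VIP just needs to be present on the
loopback so that the kernel accepts traffic for it (assuming the
addresses above):

    # the VIP lives on lo, packets reach the box via the two routed nexthops
    ip addr add 192.168.0.3/32 dev lo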

In haproxy, you'll have this:

  frontend f
       bind :80 interface eth0
       use_backend b

  backend b
       source 192.168.0.1 interface eth0
       server s1 x.x.x.x:80 check
       server s2 x.x.x.x:80 check
       server s3 x.x.x.x:80 check

And you do the same in the other process with s/eth0/eth1 (and
source 192.168.0.2).

This is handy because you can force the use of process 1 or process 2
by connecting explicitly to NIC1's or NIC2's address, which is
convenient for testing.
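
Something as simple as this does the job (the URL is only a
placeholder):

    curl -v http://192.168.0.1/some-url   # always hits the eth0 process
    curl -v http://192.168.0.2/some-url   # always hits the eth1 process
    curl -v http://192.168.0.3/some-url   # whichever path ECMP picks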

Otherwise, all your config below looks fine.

> global
>     log         127.0.0.1     local4 info
>     chroot      /var/lib/haproxy
>     pidfile     /var/run/haproxy.pid
>     daemon
>     stats socket /var/run/haproxy.socket mode 766
>     maxconn              65530
> 
> 
> defaults
>     mode                 http
>     log                  global
>     option               dontlog-normal
>     option               dontlognull # uncomment for details
>     option               httplog
>     option               httpclose
>     option               contstats
>     timeout client       7s
>     timeout server       4s
>     timeout connect      4s
>     timeout http-request 7s
>     maxconn              65530
> 
> listen stats
>     bind *:81
>     stats enable
>     stats uri /ha-stats
> 
> frontend foo_in
>     bind *:80
>     mode http
>     default_backend foo_nodes
>     log global
>     option forwardfor except 127.0.0.1
> 
> backend foo_nodes
>     balance roundrobin
>     option tcp-smart-connect
>     option httpchk HEAD /live-lb HTTP/1.0
>     server s0 xx.xx.xx.xx:8080 maxconn 4096 maxqueue 1024 check inter 1s fall 5 rise 2
>     server s1 xx.xx.xx.xx:8080 maxconn 4096 maxqueue 1024 check inter 1s fall 5 rise 2
>     server s2 xx.xx.xx.xx:8080 maxconn 4096 maxqueue 1024 check inter 1s fall 5 rise 2
>     server s3 xx.xx.xx.xx:8080 maxconn 4096 maxqueue 1024 check inter 1s fall 5 rise 2
>     server s4 xx.xx.xx.xx:8080 maxconn 4096 maxqueue 1024 check inter 1s fall 5 rise 2
>     server s5 xx.xx.xx.xx:8080 maxconn 4096 maxqueue 1024 check inter 1s fall 5 rise 2

>  haproxy -vv
> HA-Proxy version 1.4.24 2013/06/17
> Copyright 2000-2013 Willy Tarreau <w...@1wt.eu>
> 
> Build options :
>   TARGET  = linux2628
>   CPU     = generic
>   CC      = gcc
>   CFLAGS  = -m64 -march=x86-64 -O2 -g -fno-strict-aliasing
>   OPTIONS = USE_LINUX_SPLICE=1 USE_LINUX_TPROXY=1 USE_STATIC_PCRE=1
> 
> Default settings :
>   maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200
> 
> Encrypted password support via crypt(3): yes
> 
> Available polling systems :
>      sepoll : pref=400,  test result OK
>       epoll : pref=300,  test result OK
>        poll : pref=200,  test result OK
>      select : pref=150,  test result OK
> Total: 4 (4 usable), will use sepoll.
> 

Regards,
Willy

