Hi Chris,

On Fri, Nov 22, 2013 at 06:40:50PM -0500, Chris Burroughs wrote:
> I am currently trying to migrate a somewhat over-complicated and
> over-provisioned setup to something simpler and more efficient. The
> application serves lots of small HTTP requests (often 204 or 3xx), and
> gets requests from all sorts of oddly behaving clients or
> over-aggressive pre-connecting user agents. (Thanks for the help in the
> client timeout thread!)
>
> The servers are:
> * AMD Opteron 4184, 6 cores, 2.8 GHz
> * 2x gigabit Intel 82576 with the igb driver
> * CentOS 6 (2.6.32-358.23.2.el6.x86_64)
>
> Running 1.4.24 with all traffic currently coming from the outside on
> eth0 and going to the backend servers on eth1.
>
> The setup that I thought would work well is:
> * core 0: all network interrupts
> * core 1: haproxy
> * core 2-5: httpd for SSL termination
That's what I often do, with pretty good results.

> After some adventures in interrupt tuning so that core 0 would not be
> killed by > 100k/s we got this to satisfy roughly 20k req/s. At this
> point hatop would report the Queue was growing without bound and maxconn
> would fill up. The cause looks simple enough, core 1 was consistently
> 0.00% idle at that load: http://pastebin.com/kuT5uCtP
>
> My first question is if the roughly 1:3 usr/sys ratio is "normal" for a
> well functioning haproxy serving tiny http objects? See sample config
> attached.

This is 25% user and 75% system. It's on the high side for user time, since you generally get between 15 and 25% user for 75-85% system, but since you have logs enabled it's not really surprising, so yes, it's in the norm. You should be able to improve this slightly by using "http-server-close" instead of "httpclose": it actively closes server-side connections and saves a few packets.

> While I don't have an empirical basis for comparison on this
> older hardware, 20k req/s also "seemed" low.

I remember benchmarking another recent Opteron last year (a many-core 7-something) and it performed very poorly, about 18k req/s, much worse than my 4-year-old Phenom 9950. One of the reasons was that it was difficult to share some L3 cache between certain cores. I found little information on the 4184, except on Wikipedia (to be taken with a grain of salt):

  http://en.wikipedia.org/wiki/Bulldozer_%28microarchitecture%29

Thus it probably suffers from the same design as the 7xxx: you need to identify the cores belonging to the same module, so that they share the same L2 cache and are located in the same part of the shared L3 cache; otherwise inter-core communications happen via the outside. These CPUs seem to be designed for VM hosting, or for running highly threaded Java apps which don't need much FPU.
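As a sketch, the change amounts to swapping one option in the defaults section (option names as supported by haproxy 1.4; the rest of the section is taken from the config attached below):

```
defaults
    mode http
    option http-server-close   # replaces "option httpclose": actively
                               # closes the server side, saves packets
    option httplog
```
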
I'm not certain they were optimized for network processing, unfortunately, which is sad considering that their older brothers were extremely fast at that.

> Because it's what we had used in the old setup I also tried setting
> nbproc=6 with no IRQ pinning (tests showed IRQs on one core was
> beneficial with no nbproc). This was able to handle close to *twice* as
> many req/s.

That certainly indicates *very* independent cores!

> This was a surprising result because almost every thread
> mentioning nbproc suggests not using it and I expected at best marginal
> gains.

Generally you're quickly stuck by the network stack's ability to run fast enough connect() calls, which require a bit of locking to pick a source port, and which constitute the highest cost in haproxy. But as you can see, it depends a lot on the CPU architecture.

> Either way I'd prefer to avoid nbproc for stats and all of the
> typical reasons.

Sure.

> Finally assuming the single process performance can not be further
> improved I was considering the following setup:
> * core 0: eth0 interrupts
> * core 1: haproxy bound to eth0
> * core 2: eth1 interrupts
> * core 3: haproxy bound to eth1
> * core 4-5: ssl terminator

I definitely agree. I know at least one setup which runs fine this way. It was a two-socket system, each socket with its own NIC and process. But here you're in the same situation; consider that you have 3 independent CPUs in the same box. The benefit of doing it this way is that you can still parallelize network interrupts to multiple cores without having the response traffic come to the wrong core (proxies are hell to optimize because of their two sides). However you must absolutely figure out which core shares L2 with which other core. I suspect you'll have core 0 + core 3, core 1 + core 4, core 2 + core 5. But that's only a guess.

> But I could not find too many examples of similar setups and was unsure
> if it was a viable long term configuration.

Yes, it is viable.
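To find out which cores share which cache level rather than guessing, the kernel exposes the topology under sysfs. A minimal sketch, assuming a Linux 2.6.32+ kernel with the standard `cpu/cacheinfo` layout (check the paths on your own machine):

```shell
# Print, for every CPU and every cache it has, which CPUs share it.
# L2 entries sharing a cpu_list identify cores of the same module.
topology=$(
  for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    for cache in "$cpu"/cache/index*; do
      [ -r "$cache/shared_cpu_list" ] || continue
      printf '%s L%s (%s): shared with CPUs %s\n' \
        "${cpu##*/}" \
        "$(cat "$cache/level")" \
        "$(cat "$cache/type")" \
        "$(cat "$cache/shared_cpu_list")"
    done
  done
)
printf '%s\n' "$topology"
```

Once the pairs are known, each process can be pinned with taskset, e.g. `taskset -c 1 haproxy -f /etc/haproxy/eth0.cfg` (the config path is hypothetical), and the NIC IRQs pinned via the corresponding `/proc/irq/*/smp_affinity` entries.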
The only limit right now is that you'll need to start two processes. In the future, when listeners reliably support the "bind-process" keyword, it will even be possible to centralize everything and have a dedicated stats socket for each. In the mean time I suggest that you have two processes with almost the same config except for the interfaces. Note that haproxy supports binding to interfaces.

Then there are two possibilities: either you present two different addresses to the world, or you present only one and your front router or L3 switch dispatches the traffic to the two interfaces using ECMP. This last solution tends to be preferred and more common these days. Thus you have this:

  NIC1 (eth0):     192.168.0.1
  NIC2 (eth1):     192.168.0.2
  VIP on loopback: 192.168.0.3
  L3: route 192.168.0.3 nexthop 192.168.0.1 nexthop 192.168.0.2

In haproxy, you'll have this:

  frontend f
      bind :80 interface eth0
      use_backend b

  backend b
      source 192.168.0.1 interface eth0
      server s1 x.x.x.x:80 check
      server s2 x.x.x.x:80 check
      server s3 x.x.x.x:80 check

And you do the same in the other process with s/eth0/eth1. This is handy because you can force the use of process 1 or process 2 by connecting explicitly to NIC1's or NIC2's address, which is convenient for testing.

Otherwise, all your config below looks fine.
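For completeness, applying the s/eth0/eth1 substitution gives the second process's config (same illustrative addresses as above):

```
frontend f
    bind :80 interface eth1
    use_backend b

backend b
    source 192.168.0.2 interface eth1
    server s1 x.x.x.x:80 check
    server s2 x.x.x.x:80 check
    server s3 x.x.x.x:80 check
```
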
> global
>     log 127.0.0.1 local4 info
>     chroot /var/lib/haproxy
>     pidfile /var/run/haproxy.pid
>     daemon
>     stats socket /var/run/haproxy.socket mode 766
>     maxconn 65530
>
> defaults
>     mode http
>     log global
>     option dontlog-normal
>     option dontlognull # uncomment for details
>     option httplog
>     option httpclose
>     option contstats
>     timeout client 7s
>     timeout server 4s
>     timeout connect 4s
>     timeout http-request 7s
>     maxconn 65530
>
> listen stats
>     bind *:81
>     stats enable
>     stats uri /ha-stats
>
> frontend foo_in
>     bind *:80
>     mode http
>     default_backend foo_nodes
>     log global
>     option forwardfor except 127.0.0.1
>
> backend foo_nodes
>     balance roundrobin
>     option tcp-smart-connect
>     option httpchk HEAD /live-lb HTTP/1.0
>     server s0 xx.xx.xx.xx:8080 maxconn 4096 maxqueue 1024 check inter 1s fall 5 rise 2
>     server s1 xx.xx.xx.xx:8080 maxconn 4096 maxqueue 1024 check inter 1s fall 5 rise 2
>     server s2 xx.xx.xx.xx:8080 maxconn 4096 maxqueue 1024 check inter 1s fall 5 rise 2
>     server s3 xx.xx.xx.xx:8080 maxconn 4096 maxqueue 1024 check inter 1s fall 5 rise 2
>     server s4 xx.xx.xx.xx:8080 maxconn 4096 maxqueue 1024 check inter 1s fall 5 rise 2
>     server s5 xx.xx.xx.xx:8080 maxconn 4096 maxqueue 1024 check inter 1s fall 5 rise 2
>
> haproxy -vv
> HA-Proxy version 1.4.24 2013/06/17
> Copyright 2000-2013 Willy Tarreau <w...@1wt.eu>
>
> Build options :
>   TARGET  = linux2628
>   CPU     = generic
>   CC      = gcc
>   CFLAGS  = -m64 -march=x86-64 -O2 -g -fno-strict-aliasing
>   OPTIONS = USE_LINUX_SPLICE=1 USE_LINUX_TPROXY=1 USE_STATIC_PCRE=1
>
> Default settings :
>   maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200
>
> Encrypted password support via crypt(3): yes
>
> Available polling systems :
>     sepoll : pref=400, test result OK
>      epoll : pref=300, test result OK
>       poll : pref=200, test result OK
>     select : pref=150, test result OK
> Total: 4 (4 usable), will use sepoll.

Regards,
Willy