On 11/23/2013 04:13 AM, Willy Tarreau wrote:
This is 25% user and 75% system. It's on the high side for the user, since
you generally get between 15 and 25% user for 75-85% system, but since you
have logs enabled, it's not really surprising, so yes it's in the norm. You
should be able to slightly improve this by using "http-server-close" instead
of "httpclose". It will actively close server-side connections and save a
few packets.
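For reference, the difference is a single keyword in the defaults or frontend section; a minimal sketch (timeouts and section layout are illustrative, not from the thread):

```
defaults
    mode http
    # "http-server-close" actively closes the server-side connection
    # after each response, instead of the passive close that
    # "option httpclose" performs on both sides.
    option http-server-close   # instead of: option httpclose
    timeout connect 5s
    timeout client  30s
    timeout server  30s
```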


My understanding was that HAProxy 1.4 does not formally support keeping persistent connections to backends while closing connections to clients. However, if the backend servers used keep-alive and HAProxy did not force the connection to close, this would likely work. I thought that continuing to push bytes over a small number of existing TCP connections ought to be cheaper (in terms of packets, interrupts, %sys, etc.) than setting up and tearing down yet more sockets.


While I don't have an empirical basis for comparison on this
older hardware, 20k req/s also "seemed" low.

I remember having benchmarked another recent opteron last year (with
many cores, a 7-something) and it performed very poorly, about 18k/s,
much worse than my 4-years old Phenom 9950. One of the reasons was
that it was difficult to share some L3 cache between certain cores.
I found little information on the 4184, except on Wikipedia (to be
taken with a grain of salt):

    http://en.wikipedia.org/wiki/Bulldozer_%28microarchitecture%29

Thus it probably suffers from the same design as the 7xxx, which is
that you need to identify the cores belonging to the same module, so
that they share the same L2 cache and that they are located in the
same part of the shared L3 cache, otherwise the inter-core communications
happen via the outside.


As far as I can tell from AMD docs and Vincent's handy /sys trick, each of the 6 cores has a fully independent L2 cache, and the chip has a single shared L3 cache.
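The cache/core layout can be checked directly from sysfs; a quick sketch of that kind of inspection (Linux-specific paths; `index2` is usually the L2 cache on these CPUs):

```shell
# For each CPU, print its core id, package id, and which CPUs
# share its L2 cache, all read from standard sysfs topology files.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    name=$(basename "$cpu")
    core=$(cat "$cpu/topology/core_id" 2>/dev/null)
    pkg=$(cat "$cpu/topology/physical_package_id" 2>/dev/null)
    l2=$(cat "$cpu/cache/index2/shared_cpu_list" 2>/dev/null)
    echo "$name: core=$core package=$pkg l2_shared_with=$l2"
done
```

On a Bulldozer-family part, two cores in the same module would report the same L2 `shared_cpu_list`; fully independent L2s show each core alone in its own list.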

I'm not sure I'm following the part about the "same part of the L3 cache". Are you saying that some cores are "closer" to each other on the L3 cache, like NUMA?

These CPUs seem to be designed for VM hosting, or running highly
threaded Java apps which don't need much FPU. I'm not certain they
were optimized for network processing unfortunately, which is sad
considering that their older brothers were extremely fast at that.


"Highly threaded Java apps" happens to be what most of our servers are used for and what we benchmarked for purchasing decisions.

Finally, assuming the single-process performance cannot be further
improved, I was considering the following setup:
  * core 0: eth0 interrupts
  * core 1: haproxy bound to eth0
  * core 2: eth1 interrupts
  * core 3: haproxy bound to eth1
  * core 4-5: ssl terminator
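The plan above boils down to `taskset` for the processes and `/proc/irq/<N>/smp_affinity` masks for the NICs; a dry-run sketch that only prints the commands (IRQ numbers, config paths, and the SSL terminator binary are hypothetical placeholders):

```shell
# Compute the hex affinity bitmask for a comma-separated core list,
# e.g. mask_for "4,5" -> "30" (bits 4 and 5 set).
mask_for() {
    local m=0 c
    for c in ${1//,/ }; do m=$((m | (1 << c))); done
    printf '%x\n' "$m"
}

echo "eth0 IRQs : echo $(mask_for 0) > /proc/irq/<eth0-irq>/smp_affinity"
echo "haproxy 1 : taskset -c 1 haproxy -f /etc/haproxy/eth0.cfg"
echo "eth1 IRQs : echo $(mask_for 2) > /proc/irq/<eth1-irq>/smp_affinity"
echo "haproxy 2 : taskset -c 3 haproxy -f /etc/haproxy/eth1.cfg"
echo "ssl term  : taskset -c 4,5 <ssl-terminator> (mask $(mask_for 4,5))"
```

Keeping each haproxy on the core adjacent to its NIC's interrupt core is the point: the packets a process consumes were just touched by the neighbouring core's softirq handler.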

I definitely agree. I know at least one setup which runs fine this way.
It was a two-socket system, each with its own NIC and process. But here
you're in the same situation, consider that you have 3 independent CPUs
in the same box. The benefit of doing it this way is that you can still
parallelize network interrupts to multiple cores without having the
response traffic come to the wrong core (proxies are hell to optimize
because of their two sides).


This setup (one haproxy per NIC) was able to handle 50% more load than a single haproxy: from about 20k req/s to 30k. This is a very nice bump from what would otherwise be mostly idle CPU cores. We found it very complex to set up at the IP layer, though (which isn't haproxy's fault, but in our particular circumstances it might not be worth it).


But I could not find too many examples of similar setups and was unsure
if it was a viable long term configuration.

Yes it is viable. The only limit right now is that you'll need to start
two processes. In the future, when listeners reliably support the
"bind-process" keyword, it will even be possible to centralize
everything and have a dedicated stats socket for each.

In the meantime I suggest that you run two processes with almost the
same config, except for the interfaces. Note that haproxy supports
binding to interfaces.
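A sketch of what one of the two near-identical configs might look like (file names, backend addresses, and socket paths are illustrative; the eth1 config would differ only in the interface, address, and socket path):

```
# /etc/haproxy/eth0.cfg
global
    stats socket /var/run/haproxy-eth0.sock

frontend fe_eth0
    # bind this instance's traffic to a single interface
    bind :80 interface eth0
    default_backend app

backend app
    server app1 10.0.0.10:8080 check
```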

For reasons that could be completely incidental to our networking, I was unable to get "bind *:80 interface eth0" to work consistently and had to use "bind $IP:80 interface eth0". With the first form, the instance bound to eth0 would answer requests that were coming in on eth1.


Otherwise, all your config below looks fine.

Thank you for looking. I and several of my colleagues have found this thread most helpful.


