Re: SSL, peered sticky tables + nbproc > 1?
Hi Andy,

On Tue, May 27, 2014 at 06:00:37PM +0100, Andrew Phillips wrote:
> Something I overlooked replying to on this thread;
>
> > BTW, I remember you said that you fixed the busy loop by disabling the
> > FD in the speculative event cache, but do you remember how you
> > re-enable it? E.g., if all other processes have accepted some
> > connections, your first process will have to accept new connections
> > again, so that means that its state depends on the others'.
>
> We initially just returned from listener_accept(). This caused us to go
> into a busy spin, as there were always pending speculative reads, so
> fd_nbspec was non-zero in ev_epoll.c, which triggered setting
> wait_time=0.
>
> Looking at the flow in listener_accept(), what we observed happening
> before was that without any of our patches, several processes would wake
> up on a new socket event. The fastest would win the accept() and the
> slower ones would hit the error check in listener.c at line 353:
>
>     353:    if (unlikely(cfd == -1)) {
>                 switch (errno) {
>                 case EAGAIN:
>                 case EINTR:
>                 case ECONNABORTED:
>                     fd_poll_recv(fd);
>                     return;   /* nothing more to accept */
>
> In this case, chasing fd_poll_recv(fd) through the files indicated that
> it cleared the speculative events off the queue, meaning fd_nbspec would
> not be set and wait_time would not get set to 0. So we just added the
> same call to the shm patch refusal path, which solved our problem.
>
> Not sure how that relates to your point about the processes' state
> depending on others, which does not seem to be the case.

Got it, thanks for the explanation! I thought you completely disabled events on this FD, which would be an issue right now. Here, by only disabling speculative events, you only lose the readiness information. That works for level-triggered pollers, but will not work anymore with an edge-triggered poller if/when we switch to EPOLL_ET. But at least I get the picture now.

Thanks!
Willy
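To make the fix concrete, here is a minimal sketch of the refusal path being described. The helper name shm_balance_should_accept() is hypothetical (the actual patch was not posted in this message); only listener_accept() and fd_poll_recv() are real haproxy names:

    /* Sketch only: when the shared-memory balancing logic decides this
     * process should leave the connection to a less loaded sibling, the
     * speculative-read event must be cleared before returning, exactly
     * as the EAGAIN path quoted above does. Otherwise fd_nbspec stays
     * non-zero and ev_epoll.c keeps forcing wait_time=0: a busy spin. */
    void listener_accept(int fd)
    {
        /* ... */
        if (!shm_balance_should_accept()) {
            fd_poll_recv(fd);  /* drop readiness; the level-triggered
                                  poller will report it again if the
                                  connection is still pending */
            return;
        }
        /* ... proceed with accept() as usual ... */
    }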
Re: SSL, peered sticky tables + nbproc > 1?
Willy,

Thanks for the response. I wrote the reply as I read through, so it's interesting to see that we've pursued similar lines of thought about how to solve this problem. I think our workload is very different from 'normal'. We have several quite long-lived connections, with a modest accept() rate of new connections.

> I've just checked the kernel code and indeed it's not a real
> round-robin, it's a hash on the 4-tuple (src/dst/spt/dpt), coupled with
> a pseudo-random mix. But that makes a lot of sense, since a round robin
> would have to perform a memory write to store the index. That said, when
> testing here, I get the same distribution on all servers +/- 0.2% or so.

We'll test more with EL6 latest and SO_REUSEPORT - given the information above, it's possible our test rig may not show the above algorithm to its best, and may not represent production load that well. Ideally, though, a leastconn LB would be best. James has posted our test numbers - they're better and may be good enough for now. And there is always the alternative of maintaining the shm_balance patch internally.

> If you're concerned with long-lived connections, then round robin is not
> the proper choice, you should use a leastconn algorithm instead, which
> will take care of disconnected clients.

Yes, this is essentially the thinking behind the shm_balance patch.

> I definitely agree. I think we should propose some kernel-side
> improvements, such as a way to distribute according to the number of
> established connections per queue instead of hashing or round-robinning,
> but it seems there's no relation between the listen queues and the
> inherited sockets, so it looks hard to get that information even from
> the kernel.

I'd be happy to help here where we can. Any patch that maintains state about the number of connections sent to each socket is likely to be hard to merge into the kernel. The alternative is for haproxy to maintain the count of active sockets per process, and somehow poke that back into the kernel as a hint to send more to a particular socket. That also feels ugly, however. It comes back to either haproxy making routing/load balancing decisions amongst its children, or improving the kernel mechanism that is doing the same job. Haproxy has more information available, and a faster turnaround on new load balancing strategies. So the options are:

1) Come up with a better stateless LB algorithm for the kernel.
2) Maintain counts in the kernel for a least-connections algorithm.
3) Stay as is kernel-wise, but have haproxy play a more active role in
   distributing connections.
4) Do nothing, as it's good enough for most people.

If there's a better way for us to track active connections per server, that at least would help simplify the shm_balance patch.

> That's quite common. That's the reason why we have the tune.maxaccept
> global tunable, in order to prevent some processes from grabbing all
> connections at once, but still that's not perfect because, as you
> noticed, the same process can be notified several times in a row while
> another one was doing something else or was scheduled out, leaving some
> room for anything else. By the way, using cpu-map to bind processes to
> CPUs significantly helps, provided of course that no other system
> process is allowed to run on the same CPUs.

Ok, we'll go back and check that in detail. CPU pinning and SMP IRQ affinity we do as a matter of course.

> Don't worry, I know that very well.
> Some customers are still running some 2.4 kernels that I built for them
> years ago for that reason, and our appliances still ship with 2.6.32 :-)
> So you don't have to justify that you cannot upgrade, I'm the first one
> to defend that position.

Ok, that's reassuring. There are many projects out there that, while wonderful, assume you have the latest version of fedora/ubuntu available.

> > The actconn variable is shared in shared memory. We're using the
> > single-writer model, so each process is the only process writing to
> > its slot in shared memory,
>
> OK found it, indexed based on relative_pid. I thought all processes
> shared the same global actconn, which scared me!

Yes, it wouldn't have worked very well either :-)

> > via what should be an atomic write. Locking should not be required. If
> > there is a concern about write tearing, we can change it to an
> > explicit atomic cmpxchg or similar.
>
> No, that's not needed. The only thing is that having them all in the
> same cache line means that cache lines are bouncing back and forth
> between CPUs, causing latencies to update the values, but that's all.

Good point. We can avoid cache ping-pong if we pad the structure appropriately.

> > If this works as mentioned above, then please consider putting a note
> > in the haproxy documents talking about this for EL customers and what
> > the minimum acceptable kernel revs are to make this work properly.
>
> That's a good point. In fact we've been running with SO_REUSECONN for
> many years, I think my first
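To illustrate the cache-line point, here is a minimal sketch of a padded per-process layout. This is illustrative only: the field and type names are hypothetical, a 64-byte cache line is assumed, and the actual shm_balance structures were not posted in this thread:

    #include <stdint.h>

    #define CACHE_LINE 64   /* assumed cache-line size */
    #define MAX_PROCS  64

    /* One slot per process, indexed by relative_pid - 1. The padding
     * keeps each counter on its own cache line, so a single-writer
     * update by one process does not invalidate the line that holds
     * another process's counter (no cache "ping-pong"). */
    struct proc_slot {
        volatile uint32_t actconn;                 /* written only by its owner */
        char pad[CACHE_LINE - sizeof(uint32_t)];   /* avoid false sharing */
    };

    /* The whole array lives in the shared memory segment; every process
     * writes only its own slot and merely reads the others. */
    struct shm_counters {
        struct proc_slot slot[MAX_PROCS];
    };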
Re: SSL, peered sticky tables + nbproc > 1?
Hi Andy,

On Sun, May 18, 2014 at 03:16:34PM +0100, Andrew Phillips wrote:
> Willy, Thanks for the response. I wrote the reply as I read through, so
> it's interesting to see that we've pursued similar lines of thought
> about how to solve this problem. I think our workload is very different
> from 'normal'. We have several quite long-lived connections, with a
> modest accept() rate of new connections.

That's something that RDP providers see as well. I've also seen a Citrix farm in the past which had to face a difficult issue: supporting many long-lived connections with a very low average accept rate (eg: a few hundred connections per day), but with the goal of being able to accept 20 times more if people had to work from home due to problems getting to their job (eg: transportation services on strike), and to accept all of them at 9am. There was some SSL in the mix to make things funnier.

> > I've just checked the kernel code and indeed it's not a real
> > round-robin, it's a hash on the 4-tuple (src/dst/spt/dpt), coupled
> > with a pseudo-random mix. But that makes a lot of sense, since a round
> > robin would have to perform a memory write to store the index. That
> > said, when testing here, I get the same distribution on all servers
> > +/- 0.2% or so.
>
> We'll test more with EL6 latest and SO_REUSEPORT - given the information
> above, it's possible our test rig may not show the above algorithm to
> its best, and may not represent production load that well.

It will really depend on the total number of connections, in fact. I would not be surprised if the load is highly uneven below 100 or so connections per process, due to the hash. But maybe that could be enough already.

> Ideally, though, a leastconn LB would be best. James has posted our test
> numbers - they're better and may be good enough for now. And there is
> always the alternative of maintaining the shm_balance patch internally.

Sure! If your traffic is not too high, there's something simple you can do which can be *very* efficient. It's being used by at least one RDP provider, but I don't remember which one. The idea is the following: deciphering SSL costs much, especially the handshakes, which you don't want to cause noticeable pauses for all users when they happen. So instead of randomly stacking the connections onto each other into a process pool, there was a front layer in pure TCP mode doing nothing but distributing connections in leastconn. The cost is very low in terms of CPU and even lower in terms of latency. And now with dev25, you have the abstract namespace sockets, which are basically unix sockets with internal names. They're 2.5 times cheaper than TCP sockets. I'm really convinced you should give that a try. It would look like this:

    listen dispatcher
        bind :1234 process 1
        balance leastconn
        server process2 abns@p2 send-proxy
        server process3 abns@p3 send-proxy
        server process4 abns@p4 send-proxy
        server process5 abns@p5 send-proxy

    listen worker
        bind abns@p2 process 2 accept-proxy
        bind abns@p3 process 3 accept-proxy
        bind abns@p4 process 4 accept-proxy
        bind abns@p5 process 5 accept-proxy
        ...

In "worker", simply add "ssl ..." to each bind line if you need to decipher SSL. You can (and should) even check that processes are still alive using a simple check on each server line.

(..)

> > I definitely agree.
> > I think we should propose some kernel-side improvements, such as a way
> > to distribute according to the number of established connections per
> > queue instead of hashing or round-robinning, but it seems there's no
> > relation between the listen queues and the inherited sockets, so it
> > looks hard to get that information even from the kernel.
>
> I'd be happy to help here where we can. Any patch that maintains state
> about the number of connections sent to each socket is likely to be hard
> to merge into the kernel.

Especially if it requires inflating a structure like struct sock.

> The alternative is for haproxy to maintain the count of active sockets
> per process, and somehow poke that back into the kernel as a hint to
> send more to a particular socket.

I agree.

> That also feels ugly, however.

It depends. Said like this, yes, it feels ugly. However, if you reason with a budget, and processes only accept their budget of incoming connections, then it's much different. And with a budget it's not that hard to implement. Basically, you raise all budgets to 1 when they're all 0, you decrease a process's budget when it accepts a connection, you increase its budget when it closes a connection, and you subtract the value of the lowest budget from all of them when they all have a budget greater than 1.

> It comes back to either haproxy making routing/load balancing decisions
> amongst its children, or improving the kernel mechanism that is doing
> the same job. Haproxy has more information available, and a faster
> turnaround on new load balancing strategies.

Yes, and load balancing is its job, though
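A minimal sketch of that budget bookkeeping, following the description above literally (all names are illustrative, locking/atomics are deliberately omitted, and this is not from any actual patch):

    #define NB_PROCS 4

    static int budget[NB_PROCS];   /* would live in shared memory */

    /* A process may accept a new connection only while it has budget. */
    int may_accept(int p)
    {
        return budget[p] > 0;
    }

    static void normalize(void)
    {
        int i, min = budget[0], max = budget[0];

        for (i = 1; i < NB_PROCS; i++) {
            if (budget[i] < min) min = budget[i];
            if (budget[i] > max) max = budget[i];
        }

        if (max == 0) {
            /* all budgets exhausted: allow one connection each again */
            for (i = 0; i < NB_PROCS; i++)
                budget[i] = 1;
        } else if (min > 1) {
            /* everyone is above 1: subtract the lowest budget. This keeps
             * budgets bounded while preserving the differences; the most
             * loaded process (lowest budget) drops to 0 and stops
             * accepting until it closes connections. */
            for (i = 0; i < NB_PROCS; i++)
                budget[i] -= min;
        }
    }

    void on_accept(int p) { budget[p]--; normalize(); }  /* accepted one */
    void on_close(int p)  { budget[p]++; normalize(); }  /* closed one */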
Re: SSL, peered sticky tables + nbproc > 1?
Hi James,

On Tue, May 13, 2014 at 06:00:13PM +0100, James Hogarth wrote:
> Hi Willy,
>
> Please see the response from our Head of Systems below.

Thank you. For ease of discussion, I'm copying him. Andy, please tell me if this is not appropriate.

> On a side note, our initial investigations see better behaviour (ie one
> or two processes don't run away with it all), but the current EL6 kernel
> utilising the SO_REUSEPORT behaviour doesn't appear to do a perfect
> round robin of the processes and consequently can end up a bit
> unbalanced -

I've just checked the kernel code and indeed it's not a real round-robin, it's a hash on the 4-tuple (src/dst/spt/dpt), coupled with a pseudo-random mix. But that makes a lot of sense, since a round robin would have to perform a memory write to store the index. That said, when testing here, I get the same distribution on all servers +/- 0.2% or so.

> and this is especially so for the longer lived connections, depending on
> when a client may disconnect.

If you're concerned with long-lived connections, then round robin is not the proper choice; you should use a leastconn algorithm instead, which will take care of disconnected clients.

> We're in the process of rebasing this code to dev25 and cleaning it up
> as per your suggestions. To give an idea of the difference in behaviour,
> counting the connections per process every 5 seconds whilst ramping up
> the connections in the background (one line per sample, one column per
> process):
>
> Haproxy HEAD on current el6:
>
>    0  0  0  0  0  0  0
>    0  0  1  1  1  0  0
>    1  0  4  1  2  0  0
>    2  2  5  2  3  0  0
>    2  2  6  3  4  2  0
>    2  2  7  3  4  4  2
>    3  3  7  4  5  5  2
>    3  6  8  4  6  6  2
>    3  8  9  4  7  6  2
>    3 10  9  6  7  6  3
>    3 12  9  7  9  7  3
>
> Haproxy HEAD with the new shm_balance patch on current el6:
>
>    0  0  0  0  0  0  0
>    0  0  1  1  1  1  0
>    1  1  1  1  2  1  2
>    2  2  2  2  2  2  2
>    3  3  3  2  3  3  3
>    4  3  4  4  3  4  3
>    5  4  4  4  5  4  4
>    5  5  5  5  5  5  6
>    6  6  6  6  6  5  6
>    7  6  6  7  7  6  7
>    7  8  7  7  7  7  7

But these are very small numbers. Are you really running with numbers *that* low in production, or is it just because you wanted to make a test? I was assuming that you were dealing with thousands or tens of thousands of connections per second, where the in-kernel distribution is really good. I can easily expect it to be off by a few units in a test involving just a few tens of connections, however.

Responding to Andy below:

> We realise that this patch is rushed, and we appreciate the feedback. It
> also is true that we've been working off dev21 and have been working on
> it for a while. If SO_REUSEPORT works well, then it's a far neater
> solution that renders this patch unnecessary.

I definitely agree. I think we should propose some kernel-side improvements, such as a way to distribute according to the number of established connections per queue instead of hashing or round-robinning, but it seems there's no relation between the listen queues and the inherited sockets, so it looks hard to get that information even from the kernel.

> There's some extra explanation here that may help answer some of your
> questions.
>
> > I must confess I don't really understand well what behaviour this
> > shm_balance mode is supposed to provide.
>
> Without this patch, on dev21, on the enterprise linux kernels we have
> tested and run in production, we see that the busiest haproxy process
> will run away and grab most new connections. In effect we get 160%
> capacity over a single process before we start seeing queueing latency.
> The load balancing is very unequal.

That's quite common.
That's the reason why we have the tune.maxaccept global tunable, in order to prevent some processes from grabbing all connections at once. But still, that's not perfect, because as you noticed, the same process can be notified several times in a row while another one was doing something else or was scheduled out, leaving some room for anything else. By the way, using cpu-map to bind processes to CPUs significantly helps, provided of course that no other system process is allowed to run on the same CPUs.

> We need more capacity than that. With this patch, we get uniform
> balancing across 7 processes, giving us almost 700% usable over a single
> process.

For sure, but as explained above, in my opinion round robin is not the best choice for long-lived connections (though it's better than nothing, of course).

> The decision to upgrade to a more recent kernel (particularly if it's
> not an EL kernel) is a difficult one for shops running on enterprise
> versions of linux. Many places stick on a particular point revision for
> a while and only upgrade to newer point releases after requalification.

Don't worry, I know that very well. Some customers are still running some 2.4 kernels that I built for them years ago for that reason, and our appliances
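For reference, cpu-map pinning goes in the global section. A hypothetical 4-process example, in the style of the configs already shown in this thread (the CPU numbers must be chosen to match the machine's topology and the IRQ affinity setup):

    global
        nbproc 4
        # one core per process; keep other system processes (and any NIC
        # IRQs you don't want to share) off these CPUs
        cpu-map 1 0
        cpu-map 2 1
        cpu-map 3 2
        cpu-map 4 3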
Re: SSL, peered sticky tables + nbproc > 1?
Hi James,

On Thu, May 08, 2014 at 08:58:59PM +0100, James Hogarth wrote:
> On 2 May 2014 20:10, Willy Tarreau w...@1wt.eu wrote:
> > You're welcome. I really want to release 1.5-final ASAP, but at least
> > with everything in place so that we can safely fix the minor remaining
> > annoyances. So if we identify quickly that things are still done wrong
> > and need to be addressed before the release (eg: because we'll be
> > forced to change the way some config settings are used), better do it
> > ASAP. Otherwise, if we're sure that a given config behaviour will not
> > change, such fixes can happen in -stable because they won't affect
> > users who do not rely on them.
>
> Alright, in light of the above, here's an RFC patch that's a little WIP
> still ... we've yet to write the documentation on the shm_balance mode,
> but we are running this in a production environment.

I must confess I don't really understand well what behaviour this shm_balance mode is supposed to provide. I'm seeing that the actconn variable seems to be shared between all processes and is incremented and decremented without any form of locking, so I'm a bit scared when you say that it's running in production! More comments below.

> Our environment is dev21 at present, but I just rebased it to the
> tarball snapshot of last night... It compiles against that, but please
> note I've not yet tested it against that! To give you an idea of how to
> use it, here's a sanitised snippet of config:
>
>     global
>         nbproc 4
>         daemon
>         maxconn 4000
>         stats timeout 1d
>         log 127.0.0.1 local2
>         pidfile /var/run/haproxy.pid
>         stats socket /var/run/haproxy.1.sock level admin
>         stats socket /var/run/haproxy.2.sock level admin
>         stats socket /var/run/haproxy.3.sock level admin
>         stats socket /var/run/haproxy.4.sock level admin
>         stats bind-process all
>         shm-balance my_shm_balancer
>
>     listen web-stats-1
>         bind 0.0.0.0:81
>         bind-process 1
>         mode http
>         log global
>         maxconn 10
>         clitimeout 10s
>         srvtimeout 10s
>         contimeout 10s
>         timeout queue 10s
>         stats enable
>         stats refresh 30s
>         stats show-node
>         stats show-legends
>         stats auth admin:password
>         stats uri /haproxy?stats
>
>     listen web-stats-2
>         bind 0.0.0.0:82
>         bind-process 2
>         mode http
>         log global
>         maxconn 10
>         clitimeout 10s
>         srvtimeout 10s
>         contimeout 10s
>         timeout queue 10s
>         stats enable
>         stats refresh 30s
>         stats show-node
>         stats show-legends
>         stats auth admin:password
>         stats uri /haproxy?stats
>
>     listen web-stats-3
>         bind 0.0.0.0:83
>         bind-process 3
>         mode http
>         log global
>         maxconn 10
>         clitimeout 10s
>         srvtimeout 10s
>         contimeout 10s
>         timeout queue 10s
>         stats enable
>         stats refresh 30s
>         stats show-node
>         stats show-legends
>         stats auth admin:password
>         stats uri /haproxy?stats
>
>     listen web-stats-4
>         bind 0.0.0.0:84
>         bind-process 4
>         mode http
>         log global
>         maxconn 10
>         clitimeout 10s
>         srvtimeout 10s
>         contimeout 10s
>         timeout queue 10s
>         stats enable
>         stats refresh 30s
>         stats show-node
>         stats show-legends
>         stats auth admin:password
>         stats uri /haproxy?stats
>
>     listen frontendname
>         bind 0.0.0.0:52000
>         server server 10.0.0.1:27000 id 1 check port 9501
>         option httpchk GET /status HTTP/1.0
>         mode tcp
>
> The haproxy-shm-client can be used to query the shm to see how things
> are loaded and weight/disable/enable threads from processing queries.

But why not use the stats socket instead of adding a second access path to check the status?

> Now why did we do this? When we were testing multiple processes, one
> thing we noted was that the most likely process to accept() was actually
> a bit unintuitive. Rather than being busy causing a 'natural' load
> balancing behaviour, it worked out against this.

Yes, that's the reason why tune.maxaccept is divided by the number of active processes.
With recent kernels (3.9+), the system will automatically round-robin between multiple socket queues bound to the same ip:port. However, this requires multiple sockets. With the latest changes allowing bind-process to go down to the listener (at last!), I realized that in addition to allowing it for the stats socket (the primary goal), it provides an easy way to benefit from this kernel round robin without having to create a bind/unbind sequence as I was planning.

> If a thread was currently on the CPU, it was reasonably likely that it
> would be the first to grab the connection, due to the need for the
> 'idle' ones to context-switch onto the CPU. As a result it was primarily
> only one or two haproxy processes actually picking up the connections,
> and it made for very asymmetrical balancing across processes.

You see this pattern even more often when running local benchmarks. Just run two processes on a dual-core, dual-thread system, then have the load generator on the same system, and you'll see that the load generator disturbs one of the processes more than the other.

> The algorithm looks to see if it is in the least busy 'half bucket' and
> if so will
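To make the "multiple sockets" point concrete, here is a minimal sketch of how each process can get its own accept queue on the same ip:port, using only the standard socket API (SO_REUSEPORT requires Linux 3.9+ or a distribution kernel with the feature backported; error handling is omitted for brevity):

    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    int bind_reuseport(int port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        struct sockaddr_in addr;

    #ifdef SO_REUSEPORT
        /* each process sets this before bind(); the kernel then spreads
         * incoming connections across the per-socket accept queues by
         * hashing the 4-tuple, as discussed earlier in the thread */
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
    #endif
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(port);
        addr.sin_addr.s_addr = htonl(INADDR_ANY);

        bind(fd, (struct sockaddr *)&addr, sizeof(addr));
        listen(fd, 1024);
        return fd;
    }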
Re: SSL, peered sticky tables + nbproc > 1?
Hi,

On Fri, May 02, 2014 at 11:11:39AM -0600, Jeff Zellner wrote:
> Well, I thought wrong -- I see that peered sticky tables absolutely
> don't work with multiple processes, and sticky rules give a warning.
> Would that be a feature on the roadmap? I can see that it's probably
> pretty non-trivial -- but would be super useful, at least for us.

Yes, that's clearly on the roadmap. In order of fixing/improvement, here's what I'd like to see:

  - peers work fine when only one process uses them
  - have the ability to run with explicit peers per process: if you just
    have to declare as many peers sections as processes, it's better than
    nothing
  - have stick-tables (and peers) work in multi-process mode with a
    shared memory system, like we do with SSL contexts

Currently the issue is that all processes try to connect to the remote peer and present the same peer name, resulting in the previous connection being dropped. And incoming connections will only feed one process and not the other ones.

I'd like to be able to do at least #1 for the release. I do think it's doable, because I attempted it 18 months ago and ended up in a complex corner case of inter-proxy dependency calculation, only to realize that we didn't need to have haproxy automatically deduce everything: just let it do what the user wants, and document the limits.

Regards,
Willy
Re: SSL, peered sticky tables + nbproc > 1?
It sounds like Jeff ran out of CPU for SSL termination, which could be addressed as described by Willy here:

  https://www.mail-archive.com/haproxy@formilux.org/msg13104.html

allowing him to stay with a single-process stick table for the actual load balancing.

-Bryan

On Fri, May 2, 2014 at 10:23 AM, Willy Tarreau w...@1wt.eu wrote:
> Hi,
>
> On Fri, May 02, 2014 at 11:11:39AM -0600, Jeff Zellner wrote:
> > Well, I thought wrong -- I see that peered sticky tables absolutely
> > don't work with multiple processes, and sticky rules give a warning.
> > Would that be a feature on the roadmap? I can see that it's probably
> > pretty non-trivial -- but would be super useful, at least for us.
>
> Yes, that's clearly on the roadmap. In order of fixing/improvement,
> here's what I'd like to see:
>
>   - peers work fine when only one process uses them
>   - have the ability to run with explicit peers per process: if you
>     just have to declare as many peers sections as processes, it's
>     better than nothing
>   - have stick-tables (and peers) work in multi-process mode with a
>     shared memory system, like we do with SSL contexts
>
> Currently the issue is that all processes try to connect to the remote
> peer and present the same peer name, resulting in the previous
> connection being dropped. And incoming connections will only feed one
> process and not the other ones.
>
> I'd like to be able to do at least #1 for the release. I do think it's
> doable, because I attempted it 18 months ago and ended up in a complex
> corner case of inter-proxy dependency calculation, only to realize that
> we didn't need to have haproxy automatically deduce everything: just
> let it do what the user wants, and document the limits.
>
> Regards,
> Willy
Re: SSL, peered sticky tables + nbproc > 1?
On Fri, May 02, 2014 at 10:59:00AM -0700, Bryan Talbot wrote:
> It sounds like Jeff ran out of CPU for SSL termination, which could be
> addressed as described by Willy here:
>
>   https://www.mail-archive.com/haproxy@formilux.org/msg13104.html
>
> allowing him to stay with a single-process stick table for the actual
> load balancing.

Yes, that's perfectly possible. And when we have proxy protocol v2 with SSL info, it'll be even better :-)

Willy
Re: SSL, peered sticky tables + nbproc > 1?
On 2 May 2014 19:02, Willy Tarreau w...@1wt.eu wrote:
> On Fri, May 02, 2014 at 10:59:00AM -0700, Bryan Talbot wrote:
> > It sounds like Jeff ran out of CPU for SSL termination, which could be
> > addressed as described by Willy here:
> >
> >   https://www.mail-archive.com/haproxy@formilux.org/msg13104.html
> >
> > allowing him to stay with a single-process stick table for the actual
> > load balancing.
>
> Yes, that's perfectly possible. And when we have proxy protocol v2 with
> SSL info, it'll be even better :-)
>
> Willy

We've done quite a bit of work on this internally recently to provide SSL multiprocess with sane load balancing. There are a couple of small edge cases we've got left, then we were intending to push it up for comments... I've literally just got home, but I'll follow up in the office next week to see how close we are.

James
Re: SSL, peered sticky tables + nbproc > 1?
Great, we'd love to see that. And thanks for the other SSL performance trick. We might be able to make that and some SSL cache tuning work for us as well.

On Fri, May 2, 2014 at 12:23 PM, James Hogarth james.hoga...@gmail.com wrote:
> On 2 May 2014 19:02, Willy Tarreau w...@1wt.eu wrote:
> > On Fri, May 02, 2014 at 10:59:00AM -0700, Bryan Talbot wrote:
> > > It sounds like Jeff ran out of CPU for SSL termination, which could
> > > be addressed as described by Willy here:
> > >
> > >   https://www.mail-archive.com/haproxy@formilux.org/msg13104.html
> > >
> > > allowing him to stay with a single-process stick table for the
> > > actual load balancing.
> >
> > Yes, that's perfectly possible. And when we have proxy protocol v2
> > with SSL info, it'll be even better :-)
> >
> > Willy
>
> We've done quite a bit of work on this internally recently to provide
> SSL multiprocess with sane load balancing. There are a couple of small
> edge cases we've got left, then we were intending to push it up for
> comments... I've literally just got home, but I'll follow up in the
> office next week to see how close we are.
>
> James
Re: SSL, peered sticky tables + nbproc > 1?
Hi James,

On Fri, May 02, 2014 at 07:23:21PM +0100, James Hogarth wrote:
> We've done quite a bit of work on this internally recently to provide
> SSL multiprocess with sane load balancing. There are a couple of small
> edge cases we've got left, then we were intending to push it up for
> comments... I've literally just got home, but I'll follow up in the
> office next week to see how close we are.

You're welcome. I really want to release 1.5-final ASAP, but at least with everything in place so that we can safely fix the minor remaining annoyances. So if we identify quickly that things are still done wrong and need to be addressed before the release (eg: because we'll be forced to change the way some config settings are used), better do it ASAP. Otherwise, if we're sure that a given config behaviour will not change, such fixes can happen in -stable because they won't affect users who do not rely on them.

Best regards,
Willy