Hi James,

On Tue, May 13, 2014 at 06:00:13PM +0100, James Hogarth wrote:
> Hi Willy,
>
> Please see the response from our Head of Systems below.
Thank you. For ease of discussion, I'm copying him. Andy, please tell me if
this is not appropriate.

> On a side note, our initial investigations see better behaviour (ie one
> or two processes don't run away with it all), but the current EL6 kernel
> utilising the SO_REUSEPORT behaviour doesn't appear to do a perfect
> round robin of the processes and consequently can end up a bit
> unbalanced -

I've just checked the kernel code and indeed it's not a real round-robin,
it's a hash on the 4-tuple (src/dst/spt/dpt), coupled with a pseudo-random
mix. But that makes a lot of sense, since a round robin would have to
perform a memory write to store the index. That said, when testing here, I
get the same distribution on all servers, +/- 0.2% or so.

> and this is especially so for the longer lived connections, depending on
> when one client may disconnect.

If you're concerned with long lived connections, then round robin is not
the proper choice; you should use a leastconn algorithm instead, which
will take care of disconnected clients.

> We're in the process of rebasing this code to dev25 and cleaning it up
> as per your suggestions.
>
> To give an idea of the difference in behaviour, counting the connections
> per process every 5 seconds whilst ramping up the connections in the
> background:
>
> Haproxy HEAD on current el6:
>
>  0  0  0  0  0  0  0
>  0  0  1  1  1  0  0
>  1  0  4  1  2  0  0
>  2  2  5  2  3  0  0
>  2  2  6  3  4  2  0
>  2  2  7  3  4  4  2
>  3  3  7  4  5  5  2
>  3  6  8  4  6  6  2
>  3  8  9  4  7  6  2
>  3 10  9  6  7  6  3
>  3 12  9  7  9  7  3
>
> Haproxy HEAD with new shm_balance patch on current el6:
>
>  0  0  0  0  0  0  0
>  0  0  1  1  1  1  0
>  1  1  1  1  2  1  2
>  2  2  2  2  2  2  2
>  3  3  3  2  3  3  3
>  4  3  4  4  3  4  3
>  5  4  4  4  5  4  4
>  5  5  5  5  5  5  6
>  6  6  6  6  6  5  6
>  7  6  6  7  7  6  7
>  7  8  7  7  7  7  7

But these are very small numbers. Are you really running with numbers
*that* low in production, or is it just because you wanted to run a test?
I was assuming that you were dealing with thousands or tens of thousands
of connections per second, where the in-kernel distribution is really
good. I can easily imagine that it may be off by a few units in a test
involving just a few tens of connections, however.

Responding to Andy below:

> We realise that this patch is rushed, and we appreciate the feedback.
> It is also true that we've been working off dev21 and have been working
> on it for a while. If SO_REUSEPORT works well, then it's a far neater
> solution that renders this patch unnecessary.

I definitely agree. I think we should propose some kernel-side
improvements, such as a way to distribute according to the number of
established connections per queue instead of hashing or round-robinning,
but it seems there's no relation between the listen queues and the
inherited sockets, so it looks hard to get that information even from the
kernel.

> There's some extra explanation here that may help answer some of your
> questions.
>
> > I must confess I don't really understand well what behaviour this
> > shm_balance mode is supposed to provide.
>
> Without this patch, on dev21, on the enterprise linux kernels we have
> tested and run in production, we see that the busiest haproxy process
> will run away and grab most new connections. In effect we get 160%
> capacity over a single process before we start seeing queueing latency.
> The load balancing is very unequal.

That's quite common. It's the reason we have the tune.maxaccept global
tunable: it prevents some processes from grabbing all connections at
once. Still, that's not perfect because, as you noticed, the same process
can be notified several times in a row while another one was doing
something else or was scheduled out, leaving some room for anything else.
By the way, using cpu-map to bind processes to CPUs helps significantly,
provided of course that no other system process is allowed to run on the
same CPUs.

> We need more capacity than that.
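To make those two knobs concrete, a purely illustrative global section
could look like the following; the CPU numbers and the maxaccept value are
placeholders to adapt to your 32-core machines, not recommendations:

```
global
    nbproc 7
    # cap how many connections one process accepts per wakeup, so a
    # single process cannot grab everything at once
    tune.maxaccept 8
    # pin each process to its own core (and keep other system processes
    # off those cores)
    cpu-map 1 0
    cpu-map 2 1
    cpu-map 3 2
    cpu-map 4 3
    cpu-map 5 4
    cpu-map 6 5
    cpu-map 7 6
```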
> With this patch, we get uniform balancing across 7 processes, giving us
> almost 700% usable over a single process.

For sure, but as explained above, in my opinion round robin is not the
best choice for long lived connections (though it's better than nothing,
of course).

> The decision to upgrade to a more recent kernel (particularly if it's
> not an EL kernel) is a difficult one for shops running on enterprise
> versions of linux. Many places stick on a particular point revision for
> a while and only upgrade to newer point releases after requalification.

Don't worry, I know that very well. Some customers are still running some
2.4 kernels that I built for them years ago for that reason, and our
appliances still ship with 2.6.32 :-) So you don't have to justify that
you cannot upgrade, I'm the first one to defend that position.

> For a mainline kernel, like 3.9, we would have to consider that
> carefully. We're fairly committed to RPM-based Enterprise linux
> distributions with a long support lifetime.

No problem with that. BTW, RHEL7 will ship with 3.10 if I understood
correctly.

> After your response we have been investigating whether SO_REUSEPORT is
> supported under EL kernels. If it does work (and it looks like this may
> have been backported for 2.6.32-417), then we can upgrade both the
> kernel to that EL kernel and haproxy to dev25, and our problem may be
> solved.

Yes indeed, I discovered that following your e-mail; that's quite good
news for many haproxy users!

> If not, we'd either need to do something special for the loadbalancers
> operating-system-wise to get a more recent kernel (i.e. run ubuntu LTS),
> with the consequent impacts on puppet manifests and monitoring, or run
> unsupported kernels on EL. Neither is attractive.

I'd strongly discourage you from switching to an unsupported kernel or
changing your distro.
You're already running a haproxy version that is not shipped with your
distro, so you should limit the moving parts and only tweak haproxy based
on what your supported distro can do, not the other way around.

> So, we'll test that on our workload and get back to you.

James' test shows that with very few connections it's not a perfect
distribution, but I wasn't expecting that you'd run with that few
connections. Could you please clarify that point?

(...)

> The actconn variable is shared in shared memory. We're using the single
> writer model, so each process is the only process writing to its slot
> in shared memory,

OK, found it, indexed based on relative_pid. I thought all processes
shared the same global actconn, which scared me!

> via what should be an atomic write. Locking should not be required. If
> there is a concern about write tearing, we can change it to an explicit
> atomic cmpxchg or similar.

No, that's not needed. The only thing is that having them all in the same
cache line means that cache lines are bouncing back and forth between
CPUs, causing latencies to update the values, but that's all.

(...)

> The haproxy-shm-client started as a quick and simple way to inspect the
> shared memory region to verify the load balancing was evenly
> distributed. It then grew legs to allow us to enable/disable individual
> processes within an nbproc group.

Hehe, the stats socket started this way as well :-)

> If you'd prefer we did that through the stats socket, then we can
> definitely do that. To be frank, haproxy is advanced C and we felt that
> the less we touched the better, and the more likely it was that elements
> of the patch would get merged.

I understand your point. I'd say that for now I wouldn't care about the
client but more about how the feature itself works in the code.

(...)

> > With recent kernels (3.9+), the system will automatically round-robin
> > between multiple socket queues bound to the same ip:port.
> > However, this requires multiple sockets. With the latest changes
> > allowing the bind-process to go down to the listener (at last!), I
> > realized that in addition to allowing it for the stats socket (primary
> > goal), it provides an easy way to benefit from this kernel's round
> > robin without having to create a bind/unbind sequence as I was
> > planning it.
>
> This is good to know. And makes our patch redundant.
>
> If this works as mentioned above, then please consider putting a note
> in the haproxy documents talking about this for EL customers and what
> the minimum acceptable kernel revs are to make this work properly.

That's a good point. In fact we've been running with SO_REUSEPORT for
many years (I think my first patch probably dates back 10 years now), but
now that the feature is present in the kernel, we should update the doc.

> (referring to our observed traffic pattern)
>
> > You see this pattern even more often when running local benchmarks.
> > Just run two processes on a dual-core, dual-thread system, then have
> > the load generator on the same system, and you'll see that the load
> > generator disturbs one of the processes more than the other one.
>
> Our load balancers are 32-core sandybridge with 4x10 gigabit cards
> each.

OK, so I guess you see more than 1 connection per second :-)

> Our load pattern is long lived TCP connections carrying data that
> cannot be delayed by more than a millisecond. Without the patch, the
> busiest process (we're running with nbproc=7 at present) will take the
> lion's share of new accepts(), until it starts to queue above our
> acceptable minimum latency.
>
> By enforcing a more round robin distribution of the accepts(), we're
> distributing the load and making use of all processes. As mentioned
> above, we're now seeing the scalability we need.
> IMHO (again, on an old kernel and for our workload) the nbproc setting
> is of limited use for scalability of haproxy when you have the "free
> for all" method of load balancing, where each process competes to do
> the accept() and only the winner triumphs.

It tends to be quite limited. Most people dealing with large nbproc
values use it for SSL, or make intensive use of bind-process to spread
the load depending on the frontends. With the recent per-bind process
mapping, it will be much easier.

> From observation, a variable number of other processes will get woken
> up and will hit an error when they reach the accept() call after the
> winner does.

That's the principle. The kernel wakes up all waiters; one does accept()
and the others fail at it. When you're running at a high connection rate,
it can be properly distributed if processes are overloaded, because the
ones already working do not have time to run through accept(). But this
doesn't fit your use case, where you need a very low latency.

Heavily pre-forked products like Apache used a semaphore around the
accept() call so that only one process could accept while the other ones
slept. This slightly improves the scalability, but not that much, because
the contention then moves to the semaphore. A more scalable solution
would be an array of semaphores.

> > But then it will make things worse, because that means that your
> > process that was woken up by poll() to accept a connection and which
> > doesn't want it will suddenly enter a busy loop until another process
> > accepts this connection. I find this counter-productive.
>
> In practice this happens already, without our patch.

Not exactly, there's a subtle difference: the accept() is called and
returns EAGAIN, which marks the fd as pollable again. If we ever want to
switch to event-triggered epoll() instead of level-triggered, this will
not work anymore, because the woken-up process would have pretended to
consume the event while doing nothing instead.
> The kernel seems to wake all or a subset of processes, which compete
> for the accept(). There's a bunch of processes doing unnecessary work.

I agree!

> Our patch doesn't change that. But it does mean that the busiest
> process will not attempt to accept(). One of the others which is woken
> at the same time will call accept().
>
> We did observe a busy spin in an earlier version of the patch. However,
> by clearing the speculative read we seem to have fixed that.

Yes, that was needed. But in an event-triggered polling model, you will
never be notified again about a new accept being available.

> Looking at the pattern of syscalls, it seems there's no busy spin
> (again, that we have observed).
>
> I agree that the kernel 3.9/SO_REUSEPORT option is a better way. This
> was the best we could think of on a kernel that does not possess it.

I really think you should give it a try with a load that matches yours,
because it works really well in my tests.

(...)

> I'm happy to have a phone call or chat with you about some of the
> data/thinking or alternative implementation ideas. As I mentioned, I'd
> really like to avoid a long lived internal patch.

I can understand, I don't like keeping long lived internal patches
either. That said, there are a number of problems that make me feel
uneasy with your patch. One of them is that the load balancing is per
process and not per listener. That means that someone using it with both
SSL and clear traffic, for example, might very well end up with one
process taking all the SSL traffic and other processes sharing the clear
traffic. Also, I think that a leastconn distribution would make much more
sense here.

I spent the day yesterday thinking about all this (hence my late reply).
In fact, we've been thinking about other related improvements for future
versions, such as delegating SSL to some processes, distributing incoming
connections via FD passing over unix sockets, etc.
I realized that one of the problems you faced is knowing how many
processes are still present. While the kernel has refcounts to many
things, it's very difficult for userland to get this piece of vital
information. You solved it using timestamps; at first I was not very
satisfied with the method, but I think it's one of the least ugly ones :-)

The only reliable system-based methods I could come up with to know
immediately how many processes are present are the following:

  - using a semaphore: with a semop(+1, SEM_UNDO), each process announces
    itself. With semctl(GETVAL), you can read the number of subscribers.
    If a process dies, its value is decremented thanks to SEM_UNDO, so
    GETVAL will return the remaining number of processes. I thought about
    using this to get a map of present processes (one bit per relative
    process), but the semval is only an unsigned 16-bit short, which
    limits us to 16 processes; that could be low for something aiming at
    significantly improving scalability.

  - using socketpair(): the idea is the following. The parent process
    first creates as many socket pairs as it will fork children. Each
    child then inherits these sockets and closes the output side of all
    of them except the one attached to its own ID, while polling for
    input on the other side of all the others. Thus, when a process dies,
    all the other ones will be woken up with a read event on the socket
    associated with the defunct. I think that at some point we'll have to
    implement something like this anyway if we want to be able to bounce
    between processes via the stats socket.

I thought about another point: depending on your load, it might make
sense to stack two layers of load balancing. Some people already do that
for SSL. The principle is that the first layer gets woken up in random
order by the system, and distributes the load according to the configured
algorithm to the second layer.
The load at the second layer will be much smoother, and will respect the
first layer's algorithm +/- an offset equivalent to the number of front
nodes. In the past, I wanted to do this with fd-passing to work around
the rough distribution of the kernel sockets. That would have made it
possible to implement a simple round robin mechanism, but it does not
allow us to implement a leastconn system, since the accepting processes
do not know the load of the second layer. Thus a shared memory area is
still needed to keep track of the load. Then, if we have an SHM and a
front process to accept fds, we can imagine that any process can offer an
FD it accepts to any other one based on the load indicated in the SHM.
I'm just not sure of the benefits compared to doing nothing and letting
the other one accept it. Hmmm, yes, in fact there is a small benefit,
which is that it supports leastconn even with the new SO_REUSEPORT of the
recent kernels.

Another point of consideration is that the likelihood of going to threads
instead of processes is growing with time. The reason is simple: with
heavy latency users like SSL, we definitely want the ability to migrate
this heavy traffic to dedicated processes/threads. Unfortunately, doing
so right now means that all the processing is done in a distinct process.
In general it's not an issue, and we can even chain this process to a
central one if more aggregated traffic is desired. But chaining processes
also means that we don't have easy access to the SSL info from the
central process. So all in all, it seems like at some point we'll have to
support threads and implement a thread-aware scheduler (basically the
same as the current one, with a thread map for each task so that threads
can only dequeue the tasks allowed to run on them). This will also solve
the issue we're having with keeping stats across all processes, and will
allow other unexpected snails like gzip to run in threads that do not
affect the latency of the rest of the traffic.
So as you can see, I'm having mixed opinions. I'm thinking that in its
current form your patch is too limited for general use, that a per-bind
leastconn distribution would make a lot more sense, and that it would
still have a limited life if we migrate to threads in 6 months or one
year. Thus my first feeling is that we should try to do our best to see
how your current workload could be supported without an extra patch (i.e.
either by improving the config or by patching haproxy a little bit in a
way that is more acceptable for mainline).

I'm really interested in your opinions on all of this. Please do not
hesitate to share them on the list; there are a number of multi-process
users here, all with different workloads (eg: RDP is sensitive to
connection counts), and it's best if everyone can participate.

Best regards,
Willy