Hi Andy,

On Sun, May 18, 2014 at 03:16:34PM +0100, Andrew Phillips wrote:
> Willy,
>
> Thanks for the response. I wrote the reply as I read through, so it's
> interesting to see that we've pursued similar lines of thought about
> how to solve this problem.
>
> I think our workload is very different from 'normal'. We have several
> quite long lived connections, with a modest accept() rate of new
> connections.
That's something that RDP providers see as well. I've also seen a Citrix farm in the past which had to face a difficult issue: supporting many long-lived connections with a very low average accept rate (eg: a few hundred connections per day), but with the goal of being able to accept 20 times more if people had to work from home due to problems getting to work (eg: transportation services on strike), and to accept all of them at 9am. There was some SSL in the mix to make things funnier.

> > I've just checked the kernel code and indeed it's not a real
> > round-robin, it's a hash on the 4-tuple (src/dst/spt/dpt), coupled
> > with a pseudo-random mix. But that makes a lot of sense, since a
> > round robin would have to perform a memory write to store the index.
> > That said, when testing here, I get the same distribution on all
> > servers +/- 0.2% or so.
>
> We'll test more with EL6 latest and SO_REUSEPORT - given the
> information above it's possible our test rig may not show the above
> algorithm to its best, and may not represent production load that well.

It will really depend on the total number of connections in fact. I would not be surprised if the load is highly uneven below 100 or so connections per process, due to the hash. But maybe that could be enough already.

> Ideally though a least conn LB would be best. James has posted our test
> numbers - they're better and may be "good enough" for now. And there is
> always the alternative of maintaining the shm_balance patch internally.

Sure! If your traffic is not too high, there's something simple you can do which can be *very* efficient. It's being used by at least one RDP provider, but I don't remember which one. The idea is the following: deciphering SSL costs a lot, especially the handshakes, and you don't want them to cause noticeable pauses for all users when they happen.
So instead of randomly stacking the connections onto each other in a process pool, there was a front layer in pure TCP mode doing nothing but distributing connections in leastconn. The cost is very low in terms of CPU and even lower in terms of latency. And now with dev25, you have the abstract namespace sockets, which are basically unix sockets with internal names. They're 2.5 times cheaper than TCP sockets. I'm really convinced you should give that a try. It would look like this :

    listen dispatcher
        bind :1234 process 1
        balance leastconn
        server process2 abns@p2 send-proxy
        server process3 abns@p3 send-proxy
        server process4 abns@p4 send-proxy
        server process5 abns@p5 send-proxy

    listen worker
        bind abns@p2 process 2 accept-proxy
        bind abns@p3 process 3 accept-proxy
        bind abns@p4 process 4 accept-proxy
        bind abns@p5 process 5 accept-proxy
        ...

In "worker", simply add "ssl ..." to each bind line if you need to decipher SSL. You can (and should) also check that processes are still alive using a simple "check" on each server line.

(..)

> > I definitely agree. I think we should propose some kernel-side
> > improvements such as a way to distribute according to the number of
> > established connections per queue instead of hashing or
> > round-robinning, but it seems there's no relation between the listen
> > queues and the inherited sockets, so it looks hard to get that
> > information even from the kernel.
>
> I'd be happy to help here where we can. Any patch that maintains state
> about the number of connections sent to each socket is likely to be
> hard to merge to kernel.

Especially if it requires inflating a structure like struct sock.

> The alternative is for haproxy to maintain the count by process of
> active sockets, and somehow poke that back into the kernel as a hint to
> send more to a particular socket.

I agree.

> That also feels ugly however.

It depends. Said like this, yes, it feels ugly.
However, if you reason with a budget, and processes only accept their budget of incoming connections, then it's quite different. And with a budget it's not that hard to implement. Basically you raise all budgets to 1 when they're all 0, you decrease a process's budget when it accepts a connection, you increase its budget when it closes a connection, and you subtract the value of the lowest budget from all of them when all of them have a budget greater than 1.

> It comes back to either haproxy making routing/load balancing
> decisions amongst its children or improving the kernel mechanism that
> is doing the same job.
> Haproxy has more information available, and a faster turn around on
> new load balancing strategies.

Yes, and load balancing is its job, though it's hard for it to get any global information, while the kernel has everything. Ideally, the kernel would tell us the minimum information we need (eg: how many processes are still bound to the socket, maybe the total count of accepted incoming connections on that socket, etc).

> So options are;
> 1) Come up with a better stateless LB algo for the kernel.
> 2) Maintain counts in kernel for a least connections algo.
> 3) Stay as is kernel wise, but have haproxy play a more active
>    role in distributing connections.
> 4) Do nothing, as its good enough for most people.
>
> If there's a better way for us to track active connections per server
> that at least would help simplify the shm balance patch.

Actually I'm wondering if we couldn't try to abuse epoll. Its API is a real mess in that processes interact badly with each other while most people would expect they would not. I think that by having all processes have access to each other process' epoll FD, we might be able to enable/disable polling from them. It's just a thought, nothing more.

> > That's quite common.
> > That's the reason why we have the tune.maxaccept global tunable, in
> > order to prevent some processes from grabbing all connections at
> > once, but still that's not perfect because as you noticed, a same
> > process can be notified several times in a row while another one was
> > doing something else or was scheduled out, leaving some room for
> > anything else. By the way, using cpu-map to bind processes to CPUs
> > significantly helps, provided of course that no other system process
> > is allowed to run on the same CPUs.
>
> Ok, we'll go back and check that in detail. CPU pinning and SMP IRQ
> affinity we do as a matter of course.

OK.

> > Don't worry, I know that very well. Some customers are still running
> > some 2.4 kernels that I built for them years ago for that reason,
> > and our appliances still ship with 2.6.32 :-) So you don't have to
> > justify that you cannot upgrade, I'm the first one to defend that
> > position.
>
> Ok, that's reassuring. There are many projects out there that while
> wonderful, assume you have the latest version of fedora/ubuntu
> available.

Yes I know, and I'm really irritated by this generalized rush forward that totally ignores the need for stable processing in certain situations.

> > > via what should be an atomic write. Locking should not be
> > > required. If there is a concern about write tearing, we can change
> > > it to an explicit atomic cmpxchg or similar.
> >
> > No that's not needed. The only thing is that having them all in the
> > same cache line means that cache lines are bouncing back and forth
> > between CPUs, causing latencies to update the values, but that's all.
>
> Good point. We can avoid cache ping pong if we pad the structure
> appropriately.

At least it should recover some of the performance that I expect to be lost in very high connection rate scenarios.
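The padding Andy suggests can be sketched quickly. This is only a hypothetical illustration using Python's ctypes as a stand-in for the real C struct in the shared segment, and the 64-byte cache line size is an assumption (typical for x86):

```python
# Hypothetical sketch, not shm_balance code: give each per-process
# counter a full 64-byte slot so every counter sits on its own cache
# line, and one process updating its counter no longer invalidates the
# line holding its neighbours' counters.
import ctypes

CACHE_LINE = 64  # assumed cache line size, typical on x86

class PaddedCounter(ctypes.Structure):
    # 8 bytes of payload, padded out to a whole cache line.
    _fields_ = [
        ("value", ctypes.c_uint64),
        ("_pad", ctypes.c_char * (CACHE_LINE - ctypes.sizeof(ctypes.c_uint64))),
    ]

NPROCS = 4
counters = (PaddedCounter * NPROCS)()  # what would live in the shared segment
```

Each slot then starts on its own 64-byte boundary, at the cost of wasting 56 bytes per process, which is nothing compared to the cache line bouncing it avoids.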
> > > If this works as mentioned above, then please consider putting a
> > > note in the haproxy documents talking about this for EL customers
> > > and what the minimum acceptable kernel revs are to make this work
> > > properly.
> >
> > That's a good point. In fact we've been running with SO_REUSECONN
> > for many years, I think my first patch probably dates 10 years now,
> > but now that the feature is present in the kernel, we should update
> > the doc.
>
> Please do. It makes the difference between something that is very
> lopsided, and arguably of limited benefit (at least on our workloads)
> and approximately even balancing.

OK. I'm adding that to my long todo list!

> > > From observation, a variable number of other processes will get
> > > woken up and will hit an error when they reach the accept() call
> > > after the winner does.
> >
> > That's the principle. The kernel wakes up all waiters, one does
> > accept() and the others fail at it. When you're running at a high
> > connection rate, it can be properly distributed if processes are
> > overloaded, because the ones already working do not have time to
> > run through accept(). But this doesn't fit your use case where you
> > need very low latency.
> >
> > Heavily pre-forked products like Apache used a semaphore around the
> > accept() call so that only one process could accept and the other
> > ones slept. This slightly improves the scalability but not that
> > much, because the contention then moves to the semaphore. A more
> > scalable solution would be an array of semaphores.
>
> I'm not a fan of this approach. I don't like the idea of locking - its
> the enemy of latency and simplicity.

Totally agreed.

> I'll have to check the apache code to see how they're doing that.

From what I remember, it was simple :

    lock()
    accept()
    unlock()

So that was simple, the lock() ensured that only one process was woken up while waiting for the lock, and only this process could perform the accept().
But at high connection rates, that doesn't scale, because during all the time between taking the lock and releasing it, only one connection is accepted, within a delay 2-3 times larger than accept() alone, so you can never saturate your CPU. However, connections are properly round-robinned since processes are generally dequeued in the arrival order of the lock.

> But if all the processes are waking on an event, then testing a

No, it was not waking on an event, it was a preforked model, so only one connection per process. That means that it was only a simple loop :

    while (1) {
        lock()
        accept()
        unlock()
        process()
    }

> semaphore, I fail to see how much faster that would be than the lock
> free approach we've taken with shm balance. You only go to the kernel
> when you have no other choice.

Hint: it was not fast at all, and always left 15-20% of idle CPU when the performance limit was reached. The context switch rate was horrible as well.

> This is probably my ignorance showing. Hints appreciated.

No no, don't feel that way. It's just that people have always had to invent ways to work around their OS's design limitations. I hope that in 5 years people will laugh at your SHM patch, because it will mean that we have much better support from the kernel!

> > > > But then it will make things worse, because that means that your
> > > > process that was woken up by poll() to accept a connection and
> > > > which doesn't want it will suddenly enter a busy loop until
> > > > another process accepts this connection. I find this
> > > > counter-productive.
> > >
> > > In practice this happens already, without our patch.
> >
> > Not exactly, there's a subtle difference : the accept() is called
> > and returns EAGAIN, which marks the fd as pollable again. If we ever
> > want to switch to edge-triggered epoll() instead of level-triggered,
> > this will not work anymore because the woken up process would have
> > pretended to consume the event while doing nothing instead.
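By the way, the lock()/accept()/unlock() loop above is easy to play with. The sketch below is not Apache's code: it uses Python threads instead of forked processes to stay self-contained, but the serialization idea is the same, in that only the lock holder may sit in accept():

```python
# Runnable sketch of lock-serialized accept: several workers share one
# listening socket, and a mutex guarantees that only one of them is in
# accept() at a time. The counters are just here to end the demo.
import socket
import threading

def worker(listener, lock, state, results):
    while True:
        with lock:                        # lock()
            if state["accepted"] >= state["target"]:
                return                    # all connections dispatched
            state["accepted"] += 1
            conn, _ = listener.accept()   # accept() -- one thread at a time
        # unlock() happened when the "with" block ended
        results.append(conn.recv(16))     # process()
        conn.close()

def run_demo(n_workers=3, n_conns=6):
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))       # any free port
    listener.listen(n_conns)
    port = listener.getsockname()[1]

    lock = threading.Lock()
    state = {"accepted": 0, "target": n_conns}
    results = []
    threads = [threading.Thread(target=worker,
                                args=(listener, lock, state, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for _ in range(n_conns):
        c = socket.create_connection(("127.0.0.1", port))
        c.sendall(b"ping")
        c.close()
    for t in threads:
        t.join()
    listener.close()
    return results
```

The lock is exactly why this pattern tops out early at high accept rates: while one worker holds it across accept(), every other worker is parked, which is the idle CPU and context-switch cost described above.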
> Ok, I can see that. In practice I think we're using epoll - don't know
> without checking the code if the versions of haproxy we've been
> testing with are level or ET.

Haproxy currently doesn't enable ET, so you don't run into the problem.

> We do see a number of processes woken - and each socket does get
> accepted.

BTW, I remember you said that you fixed the busy loop by disabling the FD in the speculative event cache, but do you remember how you re-enable it ? Eg, if all other processes have accepted some connections, your first process will have to accept new connections again, so that means that its state depends on the others'.

> That argues in favour of a separate accept() and SSL initialise thread
> and a least connections/round robin/whatever hand off to a pool of
> processes. That however isn't the architecture of haproxy and could be
> quite disruptive to the design.

Not completely disruptive, because we wanted to implement "remote sockets" (which means sockets from another process) and the architecture is supposed to be "easy" to migrate to this design, as there's almost no reference to a local FD anywhere above the connections, or only for minor things such as disabling lingering before closing.

> And would not work for those people who have very many short lived
> processes where the accept thread would become the bottleneck.

Clearly, but with multiple accept threads it's not a problem anymore. That's among the things which make me favor threads over processes for the future.

> Or alternatively the first thread that wins does the accept, and then
> makes the decision as to where it should go. It would either lead back
> to pushing it up stream into the kernel again, or there being an IPC
> mechanism to hand over a socket to a more lightly loaded thread.
> Pushing it up to kernel would be the route if you want to preserve the
> current architecture.
No, fd passing is much better here : one of many threads accepts an incoming connection, decides which worker should take it, and passes it over a unix socket using SCM_RIGHTS.

> > > We did observe a busy spin in an earlier version of the patch.
> > > However by clearing the speculative read we seem to have fixed
> > > that.
> >
> > Yes that was needed. But in an edge-triggered polling model, you
> > will never be notified again about a new accept being available.
>
> Assuming that all the processes that get woken by that event don't
> accept. I've not played with ET in enough detail to know if the kernel
> does a wake one for each event, or just hands it over to a thundering
> herd.

No, in ET mode, it only notifies you when the status changes. So you get notified when there's data for you, and it's up to you to read these data or not and to keep the event in your own cache. But the kernel won't wake you up again until you meet EAGAIN, which re-enables the event. The purpose of ET is to reduce the epoll_ctl(ADD)/epoll_ctl(DEL) dance which happens when transferring large amounts of data between two sides. But haproxy's speculative model allows it to generally not poll on the connections to servers, because data come there very fast, so the kernel buffers are always filled, a recv() almost always manages to fill an internal buffer, and thus we don't need to enable then disable polling. I observed a very low count of epoll_ctl() calls on various workloads, leading me to think that the gain we'd have with ET remains low.

> You may end up with a socket hanging around waiting for the next
> connection before it gets serviced. Which is not good.

Even worse, as described above : you won't ever get notified again until you decide on your own to call accept() and get EAGAIN.

> The fundamental problem here is where the load balancing of
> connections should take place. Pushing it up into the kernel is great,
> simple and works, but limits the strategies available to round robin.
> At least currently.
> Haproxy has the state to make the decisions for more complex
> strategies, but no way of sharing that state with each process at
> present.

That's it. Also, I'd be tempted to think that in general we should avoid complex strategies because they don't scale. I think that round robin, source hash and leastconn are more than enough at this level.

> > > I agree that the kernel 3.9/SO_REUSEPORT option is a better way.
> > > This was the best we could think of on a kernel that does not
> > > possess it.
> >
> > I really think you should give it a try with a load that matches
> > yours, because it works really well in my tests.
>
> We'll find out and get back to you.

OK.

> > That said, there are a number of problems that make me feel uneasy
> > with your patch, one of them is that the load-balancing is per
> > process and not per listener. That means that someone using it with
> > both SSL and clear traffic for example might very well end up with
> > one process taking all the SSL traffic and other processes sharing
> > the clear traffic. Also, I think that a leastconn distribution would
> > make much more sense regarding this.
>
> Fair enough. We have a fairly simple model of a single listener per
> process/service. We have a single service we want to scale. I can see
> that some people may want to have a big pool of processes each of
> which services several different services. How would per listener
> accept leveling look? Would it suffice to level by listener by
> process?

Yes, that's it. You already have an ID per "bind" line. All sockets which belong to the same bind line have the same ID and are processed similarly (that's used for binding to port ranges for example). So probably the idea would be to use the listener ID as an index into the table.
It could be acceptable to limit that to a reasonable number of IDs and refuse to load a config where a "bind" line has too high an ID and is marked as "shm".

> shm_balance terms that would give us an array indexed by service and
> by process.

Indeed.

> I get the feeling that this patch is likely to remain outside the code
> base.

That tends to be my feeling as well, at least until we try to move the current architecture to a threaded one. If for any reason we fail and we see that we're condemned to stay multi-process, the reasoning will be much different.

> If you're more likely to accept it if we do this work, then we could
> consider doing that.

I'd suggest not making it more complex, because for now you're the only users, and you'll be the first victims of the extra complexity, while nobody will thank you for the small extra bonus.

> > I spent the day yesterday thinking about all this (hence my late
> > reply). In fact, we've been thinking about other related
> > improvements for future versions, such as delegating SSL to some
> > processes, and distributing incoming connections via FD passing over
> > unix sockets, etc...
>
> A multithreaded model makes more sense. We're talking about the best
> way to share state.

Yes, that's why it's mentioned later below :-)

> The options are;
>
> Processes/threads share the VM state (go multithreaded)

In fact there are some benefits to using processes instead of threads, for example you can share only part of the VM state and not share the FDs at all. But it also comes with its own pack of trouble.

> Use shared memory
> Use an external service - memcached.

That's not usable here since it would require slow I/Os to know who must do the fast I/O.

> Have an internal message bus that broadcasts updates to a local cache.

Almost the same here.

> > I realized that one of the problems you faced is knowing how many
> > processes are still present.
> > While the kernel has refcounts to many things, it's very difficult
> > for the userland to get this piece of vital information. You solved
> > it using timestamps; at first I was not very satisfied with the
> > method, but I think it's one of the least ugly ones :-)
>
> Thank you. High praise indeed :-). In retrospect I think that may be
> over engineered.

I don't think so, because the whole day it took me to think about it was spent on "but why did they do something that complex ?". And thinking about all of the points (shm, timestamps etc...) I arrived at the same principle due to the lack of kernel help on the subject.

> All the processes are identical. If one dies chances are they're all
> going to die.

Yes, "chances". But not certainty : when you restart, you could end up rejecting connections without that counter. Also, the more processes you have, the higher the risk of getting one killed, and you don't want to end up with a weaker architecture.

> But it seemed easy to add at the time. And if the problem is down to
> customers sending you bad data, then it may have value (although
> experience from running large internet sites tells me that the
> customer is likely to just hit you again and serially knock down each
> process in the pool).

Yes, I can confirm, some people I know deployed haproxy to prevent a shared system from dying when one of their customers' services went mad and hammered the service until all nodes died from the domino effect :-)

> > The only reliable system-based methods I could elaborate to know
> > immediately how many processes are present are the following :
> >
> >  - using a semaphore: with a semop(+1, SEM_UNDO), each process
> >    announces itself. With semctl(GETVAL), you can read the number
> >    of subscribers. If a process dies, its value is decremented
> >    thanks to SEM_UNDO, so GETVAL will return the remaining number
> >    of processes.
> >    I thought about using this to get a map of present processes
> >    (one bit per relative process), but the semval is only a 16-bit
> >    unsigned short, so that limits us to 16 processes, which could
> >    be low for something aiming at significantly improving
> >    scalability.
> >
> >  - using socketpair(). The idea is the following : the parent
> >    process first creates as many socket pairs as it will fork
> >    children. Then the children inherit these sockets and close the
> >    output side of all of them except the one attached to their ID.
> >    They poll for input on the common side of all the others however.
> >    Thus, when a process dies, all other ones will be woken up with a
> >    read event on the socket associated with the defunct. I think
> >    that at some point we'll have to implement something like this if
> >    we want to be able to bounce between processes via the stats
> >    socket anyway.
>
> Both of these seem high overhead and synchronous by involving in line
> syscalls. The low latency side of me naturally shies away from this.

I disagree. semctl(GETVAL) involves a syscall per accept(), but the socketpair method does not. You create the sockets before the fork(), and you forget about them all the time. If a process dies, you're notified because epoll() tells you it happened. So it's a very efficient low-latency method with zero overhead instead.

> They are both robust methods however.
>
> Along these lines is another route - if you go with an acceptor thread
> model and make it the parent - is to set up an fd with signalfd to
> listen for SIGCHLD and call one of the wait() family to get an idea of
> why it died as well as which one.

You don't even need signalfd, you can already register signals in haproxy. But the centralized model with one parent is not resilient. Everything goes out of control if the parent dies; I really don't like such centralized architectures.
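To make the socketpair() idea concrete, here is a small runnable sketch of it. The structure and names are illustrative, not haproxy code: the parent creates one pair per child before fork(), each process keeps only the ends it needs, and when a process dies the kernel closes its end, so the watchers get an EOF (read event) on the matching socket:

```python
# Sketch of death detection through inherited socket pairs: no syscall
# is needed on the fast path; the poller reports an EOF when a peer
# process disappears.
import os
import select
import socket

def detect_dead_child(n_children=3, victim=1):
    pairs = [socket.socketpair() for _ in range(n_children)]

    for i in range(n_children):
        pid = os.fork()
        if pid == 0:
            # Child i keeps only the far end of its own pair.
            for j, (near, far) in enumerate(pairs):
                near.close()
                if j != i:
                    far.close()
            if i != victim:
                pairs[i][1].recv(1)   # block until the parent shuts down
            os._exit(0)               # the victim "crashes" immediately

    # The parent keeps only the near ends.
    for _, far in pairs:
        far.close()

    # The victim's near end becomes readable (EOF) once it exits.
    readable, _, _ = select.select([near for near, _ in pairs], [], [], 5.0)
    dead = [i for i, (near, _) in enumerate(pairs) if near in readable]

    for near, _ in pairs:
        near.close()                  # releases the surviving children
    for _ in range(n_children):
        os.wait()
    return dead
```

Here the parent is the watcher for simplicity, but since every process can inherit all the near ends, each child can watch its siblings the same way, without any central point of failure.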
> However if you want to pass information between processes about
> session state, that sort of heads in the direction of some sort of
> generic IPC or memcached approach where there's a heartbeat message on
> an internal memory or multicast bus. That generic mechanism could
> carry session state messages for failover as well. It also has the
> advantage of being asynchronous, if each process maintains a local
> cache that's populated by pushes from the other processes.

Yes, that's how the peers subsystem works for example. But here we want fast updates for the connection counts, so the shared memory is fine. In an ideal world, accept() would return an array showing the other processes' counts as well, or at least the sum of all processes' counts and the number of processes still bound.

> > I thought about another point : depending on your load, it might
> > make sense to stack two layers of load balancing. Some people
> > already do that for SSL. The principle is that the first layer gets
> > woken up in random order by the system, and distributes the load
> > according to the configured algorithm to the second layer. The load
> > at the second layer will be much smoother, and will respect the
> > first layer's algorithm +/- an offset equivalent to the number of
> > front nodes.
>
> We did consider separating SSL termination from the loadbalancing/
> backend server handling, and having a multilayer architecture.

You can already try it within haproxy as in the example at the top of this e-mail.

> We will probably look at specialist kit like the nitro ssl accelerator
> cards when we need to do this. Right now, we thought that was overkill
> for our scalability requirements.

I'd be very interested in results using the cards. I've always been impressed by the numbers they indicate on asymmetric crypto, but I have no idea how close we can come to that, nor how deep the request queues need to be for this.
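The "shared memory is fine for connection counts" point can be sketched with Python's multiprocessing standing in for a raw shm segment. Everything here is illustrative (this is not how haproxy or the shm_balance patch does it): each worker owns one slot of a shared array, and any process can snapshot the array to make a leastconn-style choice:

```python
# Shared-memory per-process counters: fast local updates, and a cheap
# whole-array read for picking the least loaded process.
import multiprocessing as mp

def worker(counts, idx, n_conns):
    # Record n_conns "established connections" in this worker's slot.
    for _ in range(n_conns):
        with counts.get_lock():
            counts[idx] += 1

def least_loaded(counts):
    snapshot = list(counts)           # one read of the shared state
    return snapshot.index(min(snapshot))

def run_demo():
    ctx = mp.get_context("fork")      # fork keeps the sketch simple
    counts = ctx.Array("i", 3)        # one slot per worker process
    loads = [5, 2, 7]
    procs = [ctx.Process(target=worker, args=(counts, i, n))
             for i, n in enumerate(loads)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return list(counts), least_loaded(counts)
```

The per-array lock here is just multiprocessing's default; as discussed above, with one counter per process and proper padding, plain atomic writes would do.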
> The thought also crossed our minds that actually DSR and SSL
> termination on the fix servers would be the best way of load balancing
> our workload. There are other considerations about our infrastructure
> that prevent us using that at the moment.

In my humble opinion, DSR architectures are becoming obsolete now. They were great in 2000, but services are rarely equivalent for all requests of a single connection, and DSR forces you to ignore L7 information. In your case it might not be an issue, but let's imagine you migrate to a SOAP architecture later; you might then want to selectively direct some traffic to different backends based on the class of service the customer subscribed to and/or what appears in the request.

> Have you ever considered using an alternative software SSL engine to
> openssl?

Yes, Emeric has written an alternative implementation using CyaSSL, but that was long ago, when they didn't support TLSv1.2. But now their stack looks really complete, and he wants to spend some time on this again. Their performance was great, about 2.5 times that of openssl.

> It seems to top out much lower than the SSL engine used by ZXTM for
> example.

The problem with openssl is that everything is converted multiple times inside. Another problem is that it uses write()/read(), and the API to use your own buffers plus send()/recv() is even more obscure than the current one. That's a point where it might be possible that the example above actually improves performance, by reducing the number of PUSH segments on the wire.

> > Also another point of consideration is that the likeliness of going
> > to threads instead of processes is growing with time. The reason is
> > simple: with heavy latency users like SSL, we definitely want the
> > ability to migrate this heavy traffic to dedicated
> > processes/threads. Unfortunately,
>
> Particularly where there may be SSL offload engines/coprocessors
> involved. I don't know if support for those is on your roadmap.
I have a simple patch to enable ssl engines. We never merged it because we never figured out the long list of settings that we need to make configurable, so I preferred to wait for a real need in order to do it correctly. When you test your nitrox cards, I can send it to you.

> > doing so right now means that all the processing is done in a
> > distinct process. In general it's not an issue and we can even chain
> > this process to a central one if a more aggregate traffic is
> > desired. But chaining processes also means that we don't have easy
> > access to the SSL info from the central process. So all in all, it
> > seems like at some point we'll have to support threads and implement
> > a thread-aware scheduler (basically the same as the current one with
> > a thread map for each task so that threads can only dequeue the
> > tasks allowed to run on them). This will also solve the issue we're
> > having with keeping stats across all processes, and will allow other
> > unexpected snails like gzip to run in threads that do not affect the
> > latency of the rest of the traffic.
>
> There's a need to share state. Sharing a memory space would be one
> way, although the design would need to change quite a bit to make
> optimum use of a thread pool model. It would make the code harder, but
> the benefits of sharing information would be there.

I totally agree. There are a number of things we cannot optimize now that we could optimize this way (typically latencies caused by SSL handshakes).

> An alternative would be to keep them as separate processes but use the
> message bus idea for sharing state data/health information if more
> loosely coupled data sharing was needed. The fundamental question that
> decides between the two models is the timeliness of the data. If each
> thread has to know right now with no delay what the others are doing,
> then multithreaded is the way to go.
> If we don't mind working with slightly out of date data - for example
> connection counts - then other multicast/broadcast alternatives could
> work.

Connection counts are a bad example since you need an accurate one for leastconn :-) But I generally agree. Stats tend to be an example of this.

> > So as you can see, I'm having mixed opinions. I'm thinking that in
> > its current form, your patch is too limited for general use, that a
> > per-bind leastconn distribution would make a lot more sense, and
> > that it would still have a limited life if we migrate to threads in
> > 6 months or one year. Thus my first feeling is that we should try to
> > do our best to see how your current workload could be supported
> > without an extra patch (ie either by improving the config or
> > patching haproxy a little bit in a way that is more acceptable for
> > mainline).
>
> Fair enough. We'll maintain the patch in parallel then for now. Where
> we can simplify and merge we will. We'll see if reuseconn and RR is
> "good enough". If you like the internal message bus idea we'll see if
> we can help with that - or any other direction where you think we
> could add value.

With 1.5-final coming soon, I'd really like to attack the multi-thread model early, because we don't want to do it too late, when people start sending their contributions and everything conflicts. So I think we should avoid adding extra communication channels between the processes, and instead add support for MT.

Anyway, that's an interesting and useful discussion Andy, thank you very much for your insights and for explaining the shortcomings you're facing; it definitely helps improving the long term design.

Best regards,
Willy