Willy,

 Thanks for the response. I wrote the reply as I read through, so it's
interesting to see that we've pursued similar lines of thought about how
to solve this problem. 

 I think our workload is very different from 'normal'. We have several
quite long-lived connections, with a modest accept() rate of new
connections.

> I've just checked the kernel code and indeed it's not a real round-robin,
> it's a hash on the 4-tuple (src/dst/spt/dpt), coupled with a pseudo-random
> mix. But that makes a lot of sense, since a round robin would have to perform
> a memory write to store the index. That said, when testing here, I get the
> same distribution on all servers +/- 0.2% or so.
> 
 We'll test further with the latest EL6 kernel and SO_REUSEPORT - given
the information above, it's possible our test rig doesn't show that
algorithm at its best, and may not represent production load that well.
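
 For reference, a minimal sketch of what we'd be turning on for that
test - one listening socket per process, each with SO_REUSEPORT set
(assumes a 3.9+ kernel or a distro backport; error handling trimmed):

    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #ifndef SO_REUSEPORT
    #define SO_REUSEPORT 15  /* value on most archs; older headers lack it */
    #endif

    /* Each process binds its own socket to the same port; the kernel
     * then spreads incoming connections across all such sockets. */
    int make_listener(unsigned short port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        struct sockaddr_in sin;

        if (fd < 0)
            return -1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_addr.s_addr = htonl(INADDR_ANY);
        sin.sin_port = htons(port);

        if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
            listen(fd, 1024) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }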

 Ideally, though, we'd have a leastconn LB. James has posted our test
numbers - they're better and may be "good enough" for now. And there is
always the alternative of maintaining the shm_balance patch internally.


> If you're concerned with long lived connections, then round robin is not the
> proper choice, you should use a leastconn algorithm instead, which will take
> care of disconnected clients.

  Yes, this is essentially the thinking behind the shm_balance patch. 
 
> I definitely agree. I think we should propose some kernel-side improvements
> such as a way to distribute according to number of established connections
> per queue instead of hashing or round-robinning, but it seems there's no
> relation between the listen queues and the inherited sockets, so it looks
> hard to get that information even from the kernel.

  I'd be happy to help here where we can. Any patch that maintains state
about the number of connections sent to each socket is likely to be hard
to merge into the kernel.

 The alternative is for haproxy to maintain a per-process count of
active sockets, and somehow poke that back into the kernel as a hint to
send more to a particular socket. That also feels ugly, however. It
comes back to either haproxy making routing/load-balancing decisions
amongst its children, or improving the kernel mechanism that is doing
the same job.
 Haproxy has more information available, and a faster turnaround on new
load-balancing strategies.

  So the options are:
   1) Come up with a better stateless LB algorithm for the kernel.
   2) Maintain counts in the kernel for a least-connections algorithm.
   3) Stay as is kernel-wise, but have haproxy play a more active
      role in distributing connections.
   4) Do nothing, as it's good enough for most people.

If there's a better way for us to track active connections per server,
that at least would help simplify the shm_balance patch.

> That's quite common. That's the reason why we have the tune.maxaccept global
> tunable, in order to prevent some processes from grabbing all connections at
> once, but still that's not perfect because as you noticed, a same process can
> be notified several times in a row while another one was doing something else
> or was scheduled out, leaving some room for anything else. By the way, using
> cpu-map to bind processes to CPUs significantly helps, provided of course
> that no other system process is allowed to run on the same CPUs.

  OK, we'll go back and check that in detail. We already do CPU pinning
and SMP IRQ affinity as a matter of course.

> Don't worry, I know that very well. Some customers are still running some 2.4
> kernels that I built for them years ago for that reason, and our appliances
> still ship with 2.6.32 :-) So you don't have to justify that you cannot 
> upgrade,
> I'm the first one to defend that position.

  OK, that's reassuring. There are many projects out there that, while
wonderful, assume you have the latest version of Fedora/Ubuntu
available.


> > The actconn variable is shared in shared memory. We're using the single
> > writer model, so each process is the only process writing to its slot
> > in shared memory,
> 
> OK found it, indexed based on relative_pid. I thought all processes shared
> the same global actconn which scared me!

  Yes, it wouldn't have worked very well either :-)

> > via what should be an atomic write. Locking should not
> > be required. If there is a concern about write tearing, we can change it
> > to an explicit atomic cmpxchg or similar.
> 
> No that's not needed. The only thing is that having them all in the same
> cache line means that cache lines are bouncing back and forth between CPUs,
> causing latencies to update the values, but that's all.

  Good point. We can avoid the cache-line ping-pong if we pad the
structure appropriately.
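
 A minimal sketch of that padding, assuming 64-byte cache lines; the
field names are illustrative rather than the actual shm_balance layout:

    #include <stdint.h>

    #define CACHE_LINE 64

    /* One slot per process, each on its own cache line, so a process
     * updating its counters never dirties another process's line. */
    struct proc_slot {
        volatile uint32_t actconn;    /* written only by the owning process */
        volatile uint32_t heartbeat;  /* e.g. last-updated timestamp */
        char pad[CACHE_LINE - 2 * sizeof(uint32_t)];
    } __attribute__((aligned(CACHE_LINE)));

    struct shm_area {
        struct proc_slot slot[64];    /* indexed by relative_pid - 1 */
    };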


> > If this works as mentioned above, then please consider putting a note in
> > the haproxy
> > documents talking about this for EL customers and what the minimum
> > acceptable kernel
> > revs are to make this work properly.
> 
> That's a good point. In fact we've been running with SO_REUSECONN for
> many years, I think my first patch probably dates 10 years now, but now
> that the feature is present in the kernel, we should update the doc.

 Please do. It makes the difference between something that is very
lopsided, and arguably of limited benefit (at least on our workloads),
and approximately even balancing.


> > From observation, a variable number of other processes
> > will get woken up and will hit an error when they reach the accept()
> > call after the winner does.
> 
> That's the principle. The kernel wakes up all waiters and one does
> accept() and others fail at it. When you're running at a high
> connection rate, it can be properly distributed if processes are
> overloaded, because the ones already working do not have time to
> run through accept(). But this doesn't fit in your use case where
> you need a very low latency.
> 
> Heavily pre-forked products like Apache used a semaphore around the accept()
> call so that only one process could accept and the other ones slept. This
> slightly improves the scalability but not that much, because the contention
> then comes to the semaphore. A more scalable solution would be an array of
> semaphores.

 I'm not a fan of this approach. I don't like the idea of locking - it's
the enemy of latency and simplicity. I'll have to check the Apache code
to see how they're doing that. But if all the processes are waking on an
event and then testing a semaphore, I fail to see how that would be much
faster than the lock-free approach we've taken with shm_balance. You
only go to the kernel when you have no other choice.

 This is probably my ignorance showing. Hints appreciated. 

> > > But then it will make things worse, because that means that your
> > > process that was woken up by poll() to accept a connection and which
> > > doesn't want it will suddenly enter a busy loop until another process
> > > accepts this connection. I find this counter-productive.
> > 
> >  In practice this happens already, without our patch.
> 
> Not exactly, there's a subtle difference : the accept() is called and
> returns EAGAIN, which marks the fd as pollable again. If we ever want
> to switch to event-triggered epoll() instead of level-triggered, this
> will not work anymore because the woken up process would have pretended
> to consume the event while doing nothing instead.

 OK, I can see that. In practice I think we're using epoll - I don't
know, without checking the code, whether the versions of haproxy we've
been testing with are level- or edge-triggered. We do see a number of
processes woken - and each connection does get accepted.
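
 For what it's worth, a sketch of the drain-until-EAGAIN loop an
edge-triggered listener would need (hypothetical helper, assuming a
non-blocking listening socket already registered with EPOLLIN|EPOLLET;
not how haproxy is actually structured):

    #include <errno.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* With EPOLLET there is no further notification for connections
     * already queued, so the woken process must accept until EAGAIN. */
    void drain_accept(int listen_fd)
    {
        for (;;) {
            int cfd = accept(listen_fd, NULL, NULL);

            if (cfd < 0) {
                if (errno == EAGAIN || errno == EWOULDBLOCK)
                    break;  /* queue drained; epoll fires on next arrival */
                break;      /* real error: handle/log it in real code */
            }
            /* placeholder: a real program would start a session on cfd */
            close(cfd);
        }
    }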

 That argues in favour of a separate accept() and SSL-initialise thread,
with a least-connections/round-robin/whatever hand-off to a pool of
processes. That, however, isn't the architecture of haproxy and could be
quite disruptive to the design. And it would not work for those people
who have very many short-lived connections, where the accept thread
would become the bottleneck.

  Or, alternatively, the first thread that wins does the accept, and
then makes the decision as to where it should go. That either leads back
to pushing it upstream into the kernel again, or to there being an IPC
mechanism to hand over a socket to a more lightly loaded thread. Pushing
it up to the kernel would be the route if you want to preserve the
current architecture.
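
 A sketch of that hand-over mechanism, assuming a pre-established unix
socketpair between the processes and SCM_RIGHTS fd passing (hypothetical
helper, not haproxy code; the receiver picks the fd out of the control
message with recvmsg()):

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send one accepted socket to a more lightly loaded process over a
     * pre-established AF_UNIX channel. */
    int send_fd(int chan, int fd_to_pass)
    {
        char dummy = 'F';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {
            struct cmsghdr hdr;
            char buf[CMSG_SPACE(sizeof(int))];
        } cmsgbuf;
        struct msghdr msg;
        struct cmsghdr *cmsg;

        memset(&msg, 0, sizeof(msg));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cmsgbuf.buf;
        msg.msg_controllen = sizeof(cmsgbuf.buf);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;  /* "this message carries an fd" */
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

        return sendmsg(chan, &msg, 0) == 1 ? 0 : -1;
    }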

> >  The kernel seems to wake all or a subset of processes, which compete
> > for the accept(). There's a bunch of processes doing unnecessary work.
> 
> I agree!
> 
> >  Our patch doesn't change that. But it does mean that the busiest
> > process will not attempt to accept(). One of the others which is woken
> > at the same time will call accept.
> > 
> >  We did observe a busy spin in an earlier version of the patch. However
> > by clearing the speculative read we seem to have fixed that.
> 
> Yes that was needed. But in an event-triggered polling model, you will
> never be notified again about a new accept being available.

  Assuming that none of the processes woken by that event accept. I've
not played with ET in enough detail to know whether the kernel does a
wake-one for each event, or just hands it over to a thundering herd. You
may end up with a connection sitting in the queue, waiting for the next
arrival before it gets serviced, which is not good.

  The fundamental problem here is where the load balancing of
connections across the pool should take place. Pushing it up into the
kernel is great - simple, and it works - but it limits the strategies
available to round-robin. Haproxy has the state to make decisions for
more complex strategies, but no way of sharing that state with each
process at present.

> > Looking at
> > the pattern of syscalls, it seems there's no busy spin (again that we
> > have observed).
> > 
> >  I agree that the kernel 3.9/SO_REUSEPORT option is a better way.
> > This was the best we could think of on a kernel that does not possess it.
> 
> I really think you should give it a try with a load that matches yours,
> because it works really well in my tests.

 We'll find out and get back to you. 

> That said, there are a number of problems that make me feel uneasy with
> your patch, one of them is that the load-balancing is per process and not
> per listener. That means that someone using it with both SSL and clear
> traffic for example might very well end up with one process taking all
> the SSL traffic and other processes sharing the clear traffic. Also, I
> think that a leastconn distribution would make much more sense regarding
> this.

 Fair enough. We have a fairly simple model of a single listener per
process/service, and a single service we want to scale. I can see that
some people may want a big pool of processes, each of which serves
several different services. How would per-listener accept levelling
look? Would it suffice to level per listener, per process? In
shm_balance terms that would give us an array indexed by service and by
process.
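
 Something like this, in other words (sketch only, names hypothetical;
the same cache-line padding as above would still apply in practice):

    #include <stdint.h>

    #define MAX_LISTENERS 16
    #define MAX_PROCS     64

    /* conns[l][p]: connections listener l has handed to process p.
     * Each process only ever writes its own column (single writer),
     * and levelling decisions are made per listener row. */
    struct shm_level {
        volatile uint32_t conns[MAX_LISTENERS][MAX_PROCS];
    };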

 I get the feeling that this patch is likely to remain outside the code
base. If doing this work would make you more likely to accept it, we
could consider doing that.

> 
> I spent the day yesterday thinking about all this (hence my late reply).
> In fact, we've been thinking about other related improvements for future
> versions, such as delegating SSL to some processes, and distributing
> incoming connections via FD passing over unix sockets, etc...

 A multithreaded model makes more sense. We're talking about the best
way to share state. The options are:

 Processes/threads share the VM state (go multithreaded).
 Use shared memory (a minimal sketch follows below).
 Use an external service, e.g. memcached.
 Have an internal message bus that broadcasts updates to a local cache.
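
 For the shared-memory option, the setup itself is small: a region
mapped MAP_SHARED|MAP_ANONYMOUS by the parent before fork() is visible
to every child, with no external service involved (sketch, no error
handling):

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    /* Map once in the parent, before fork(); every child inherits the
     * mapping and sees the same memory. */
    void *make_shared_area(size_t len)
    {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }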

> 
> I realized that one of the problems you faced is knowing how many processes
> are still present. While the kernel has refcounts to many things, it's very
> difficult for the userland to get this piece of vital information. You solved
> it using timestamps, at first I was not very satisfied with the method, but I
> think it's one of the least ugly ones :-)

 Thank you. High praise indeed :-). In retrospect I think that may be
over-engineered. All the processes are identical, so if one dies chances
are they're all going to die. But it seemed easy to add at the time. And
if the problem is down to customers sending you bad data, then it may
have value (although experience from running large internet sites tells
me that the customer is likely to just hit you again and serially knock
down each process in the pool).
 
> The only reliable system-based methods I could elaborate to know immediately
> how many processes are present are the following :
> 
>   - using a semaphore: with a semop(+1, SEM_UNDO), each process announces
>     itself. With semctl(GETVAL), you can read the number of subscribers. If
>     a process dies, its value is decremented thanks to SEM_UNDO, so GETVAL
>     will return the remaining numbre of processes. I thought about using
>     this to get a map of present processes (one bit per relative process),
>     but the semval is only a 16-bit short unsigned, so that limits to 16
>     processes, which could be low for something aiming at significantly
>     improving scalability.
> 
>   - using socketpair(). The idea is the following : the parent process
>     first creates as many socket pairs as it will fork children. Then
>     each children inherit these sockets and close the output side of all
>     those except the one attached to their ID. They poll for input on the
>     common side of all others however. Thus, when a process dies, all
>     other ones will be woken up with a read event on the socket associated
>     to the defunct. I think that at some point we'll have to implement
>     something like this if we want to be able to bounce between processes
>     via the stats socket anyway.
> 
   Both of these seem high-overhead and synchronous, since they involve
inline syscalls. The low-latency side of me naturally shies away from
that. They are both robust methods, however.
 
   Along these lines there is another route - if you go with an
acceptor-thread model and make it the parent - which is to set up an fd
with signalfd() to listen for SIGCHLD, and call one of the wait() family
to learn which child died as well as why.
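
 A minimal sketch of that, assuming the acceptor parent owns all the
children (signalfd() needs 2.6.22+, so it is there on EL6):

    #include <signal.h>
    #include <sys/signalfd.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Block normal SIGCHLD delivery and return an fd that becomes
     * readable when a child dies, so it can sit in the poll loop. */
    int child_watch_fd(void)
    {
        sigset_t mask;

        sigemptyset(&mask);
        sigaddset(&mask, SIGCHLD);
        sigprocmask(SIG_BLOCK, &mask, NULL);
        return signalfd(-1, &mask, SFD_NONBLOCK | SFD_CLOEXEC);
    }

    /* Called when the signalfd is readable: drain it, then reap every
     * dead child to learn which one died and why. */
    void reap_children(int sfd)
    {
        struct signalfd_siginfo si;
        int status;
        pid_t pid;

        while (read(sfd, &si, sizeof(si)) == sizeof(si))
            ;  /* drain queued SIGCHLD notifications */
        while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
            (void)status;  /* pid = which child; status = why it died */
        }
    }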

  However, if you want to pass information between processes about
session state, that heads in the direction of some sort of generic IPC
or memcached-style approach, where there's a heartbeat message on an
internal memory or multicast bus. That generic mechanism could carry
session-state messages for failover as well. It also has the advantage
of being asynchronous, if each process maintains a local cache that's
populated by pushes from the other processes.

> I thought about another point : depending on your load, it might make sense
> to stack two layers of load balancing. Some people already do that for SSL.
> The principle is that the first layer gets woken up in random order by the
> system, and distributes the load according to the configured algorithm to
> the second layer. The load at the second layer will be much smoother, and
> will respect the first layer's algorithm +/- an offset equivalent to the
> number of front nodes.

  We did consider separating SSL termination from the load-balancing/
backend server handling, and having a multi-layer architecture. We will
probably look at specialist kit like the Nitro SSL accelerator cards
when we need to do this. Right now we think that's overkill for our
scalability requirements.

  The thought also crossed our minds that DSR and SSL termination on the
FIX servers themselves would actually be the best way of load balancing
our workload. There are other considerations about our infrastructure
that prevent us from using that at the moment.

  Have you ever considered using an alternative software SSL engine to
OpenSSL? It seems to top out much lower than the SSL engine used by
ZXTM, for example.

> In the past, I wanted to do this with fd-passing to work around the rough
> distribution of the kernel sockets. That would have made it possible to
> implement a simple round robin mechanism, but that does not allow us to
> implement a leastconn system since the accepting processes do not know the
> load of the second layer. Thus a shared memory is still needed to keep
> track of the load.
> 
> Then, if we have an SHM and a front process to accept fds, we can imagine
> that any process can offer an FD it accepts to any other one based on the
> load indicated in the SHM. I'm just not sure of the benefits compared to
> doing nothing and letting the other one accept it. Hmmm yes in fact there
> is a small benefit which is that it supports leastconn even with the new
> SO_REUSEPORT of the recent kernels.

 Then we're thinking along similar lines. 

> Also another point of consideration is that the likeliness of going to
> threads instead of processes is growing with time. The reason is simple,
> with heavy latency users like SSL, we definitely want the ability to
> migrate this heavy traffic to dedicated proceses/threads. Unfortunately,

  Particularly where there may be SSL offload engines/coprocessors
involved. I don't know if support for those is on your roadmap.

> doing so right now means that all the processing is done in a distinct
> process. In general it's not an issue and we can even chain this process
> to a central one if a more aggregate traffic is desired. But chaining
> processes also means that we don't have an easy access to thee SSL info
> from the central process. So all in all, it seems like at some point we'll
> have to support threads and implement a thread-aware scheduler (basically
> the same as the current one with a thread map for each task so that threads
> can only dequeue the tasks allowed to run on them). This will also solve
> the issue we're having with keeping stats across all processes, and will
> allow other unexpected snails like gzip to run in threads that do not
> affect the latency of the rest of the traffic.

  There's a need to share state. Sharing a memory space would be one
way, although the design would need to change quite a bit to make
optimum use of a thread pool model. It would make the code harder, but
the benefits of sharing information would be there. 

  An alternative would be to keep them as separate processes but use the
message bus idea for sharing state data/health information, if more
loosely coupled data sharing is enough. The fundamental question that
decides between the two models is the timeliness of the data. If each
thread has to know right now, with no delay, what the others are doing,
then multithreaded is the way to go. If we don't mind working with
slightly out-of-date data - for example connection counts - then the
multicast/broadcast alternatives could work.

> 
> So as you can see, I'm having mixed opinions. I'm thinking that in its
> current form, your patch is too limited for general use, that a per-bind
> leastconn distribution would make a lot more sense, and that it would
> still have a limited life if we migrate to threads in 6 months or one
> year. Thus my first feeling is that we should try to do our best to see
> how your current workload could be supported without an extra patch (ie
> either by improving the config or patching haproxy a little bit in a way
> that is more acceptable for mainline).

  Fair enough. We'll maintain the patch in parallel for now, then. Where
we can simplify and merge, we will. We'll see if SO_REUSEPORT and
round-robin are "good enough". If you like the internal message bus idea
we'll see if we can help with that - or with any other direction where
you think we could add value.

> I'm really interested in your opinions on all of this. Please do not
> hesitate to share them on the list, there are a number of multi-process
> users here, all with different workloads (eg: RDP is sensible to connections
> count) and it's best if everyone can participate.

  Happy to help and give back. 

 Andy

-- 
Andrew Phillips
Head of Systems

Direct: +44 (0)203 192 2509
Mobile: +44 (0)7595 242 900

LMAX, Yellow Building, 1A Nicholas Road,  London, W11 4AN


