Hi Andy,

On Sun, May 18, 2014 at 03:16:34PM +0100, Andrew Phillips wrote:
> Willy,
> 
>  Thanks for the response. I wrote the reply as I read through, so it's
> interesting to see that we've pursued similar lines of thought about how
> to solve this problem. 
> 
>  I think our workload is very different from 'normal'. We have several 
> quite long lived connections, with a modest accept() rate of new
> connections. 

That's something RDP providers see as well. I've also seen a Citrix farm
in the past which faced a difficult problem: supporting many long-lived
connections with a very low average accept rate (e.g. a few hundred
connections per day), but with the goal of being able to accept 20 times
more if people had to work from home because they couldn't get to the
office (e.g. a transportation strike), and to accept all of them at 9am.
There was some SSL in the mix to make things funnier.

> > I've just checked the kernel code and indeed it's not a real round-robin,
> > it's a hash on the 4-tuple (src/dst/spt/dpt), coupled with a pseudo-random
> > mix. But that makes a lot of sense, since a round robin would have to 
> > perform
> > a memory write to store the index. That said, when testing here, I get the
> > same distribution on all servers +/- 0.2% or so.
> > 
>  We'll test more with EL6 latest and SO_REUSEPORT - given the
> information above it's possible our test rig may not show the above
> algorithm to its best, and may not represent production load that well. 

It will really depend on the total number of connections, in fact. I would
not be surprised if the load is highly uneven below 100 or so connections
per process due to the hash. But maybe that's already enough.
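Just to illustrate the effect, here is a toy simulation (not the kernel's
actual hash): throwing a small number of connections at a few sockets at
random already shows a large spread between the least and most loaded
process, and the 4-tuple hash behaves the same way for random source ports.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int counts[8] = { 0 }, min, max, i;

        srand(42);
        for (i = 0; i < 200; i++)
            counts[rand() % 8]++;        /* stand-in for the 4-tuple hash */

        min = max = counts[0];
        for (i = 1; i < 8; i++) {
            if (counts[i] < min) min = counts[i];
            if (counts[i] > max) max = counts[i];
        }
        printf("min=%d max=%d connections per process\n", min, max);
        return 0;
    }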

>  Ideally though a least conn LB would be best. James has posted our test
> numbers - they're better and may be "good enough" for now. And there is
> always the alternative of maintaining the shm_balance patch internally. 

Sure!

If your traffic is not too high, there's something simple you can do which
can be *very* efficient. It's being used by at least one RDP provider, but
I don't remember which one. The idea is the following: deciphering SSL is
expensive, especially the handshakes, which you don't want to cause
noticeable pauses for all users when they happen. So instead of randomly
stacking connections on top of each other in a process pool, there was a
front layer in pure TCP mode doing nothing but distributing connections in
leastconn. The cost is very low in terms of CPU and even lower in terms of
latency. And now with dev25, you have the abstract namespace sockets, which
are basically unix sockets with internal names. They're 2.5 times cheaper
than TCP sockets. I'm really convinced you should give that a try. It
would look like this:

    listen dispatcher
       bind :1234 process 1
       balance leastconn
       server process2 abns@p2 send-proxy
       server process3 abns@p3 send-proxy
       server process4 abns@p4 send-proxy
       server process5 abns@p5 send-proxy

    listen worker
       bind abns@p2 process 2 accept-proxy
       bind abns@p3 process 3 accept-proxy
       bind abns@p4 process 4 accept-proxy
       bind abns@p5 process 5 accept-proxy
       ...

In "worker", simply add "ssl ..." to each "bind" line if you need to
decipher SSL. You can (and should) also verify that the worker processes
are still alive with a simple "check" on each server line of "dispatcher".
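It could look roughly like this (the certificate path is just a
placeholder):

    listen dispatcher
       bind :1234 process 1
       balance leastconn
       server process2 abns@p2 send-proxy check
       server process3 abns@p3 send-proxy check
       ...

    listen worker
       bind abns@p2 process 2 accept-proxy ssl crt /path/to/cert.pem
       bind abns@p3 process 3 accept-proxy ssl crt /path/to/cert.pem
       ...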

(..)
> > I definitely agree. I think we should propose some kernel-side improvements
> > such as a way to distribute according to number of established connections
> > per queue instead of hashing or round-robinning, but it seems there's no
> > relation between the listen queues and the inherited sockets, so it looks
> > hard to get that information even from the kernel.
> 
>   I'd be happy to help here where we can. Any patch that maintains state
> about the number of connections sent to each socket is likely to be hard
> to merge to kernel. 

Especially if it requires inflating a structure like struct sock.

>  The alternative is for haproxy to maintain the count by process of
> active sockets, and somehow poke that back into the kernel as a hint to
> send more to a particular socket.

I agree.

> That also feels ugly however.

It depends. Put like that, yes, it feels ugly. However, if you reason
in terms of a budget, with each process only accepting its budget of
incoming connections, then it's quite different. And with a budget it's
not that hard to implement. Basically you raise all budgets to 1 when
they're all 0, you decrease a process's budget when it accepts a
connection, you increase its budget when it closes a connection, and
you subtract the value of the lowest budget from all of them when they
all have a budget greater than 1.
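A minimal sketch of that budget logic, assuming a per-process counter
array in shared memory (names are illustrative only, this is not haproxy
code, and a real multi-process version would need atomic updates):

    #define NBPROC 8

    static int budget[NBPROC];        /* would live in a MAP_SHARED region */

    static int min_budget(void)
    {
        int i, min = budget[0];

        for (i = 1; i < NBPROC; i++)
            if (budget[i] < min)
                min = budget[i];
        return min;
    }

    static int all_zero(void)
    {
        int i;

        for (i = 0; i < NBPROC; i++)
            if (budget[i] != 0)
                return 0;
        return 1;
    }

    /* called before accept(): non-zero means this process may take one */
    static int may_accept(int self)
    {
        int i;

        if (all_zero())                       /* everyone exhausted: refill */
            for (i = 0; i < NBPROC; i++)
                budget[i] = 1;
        return budget[self] > 0;
    }

    static void on_accept(int self)
    {
        budget[self]--;                       /* one slot consumed */
    }

    static void on_close(int self)
    {
        int i, min;

        budget[self]++;                       /* one slot released */
        min = min_budget();
        if (min > 1)                          /* all above 1: renormalize */
            for (i = 0; i < NBPROC; i++)
                budget[i] -= min;
    }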

> It comes
> back to either haproxy making routing/load balancing decisions amongst
> its children or improving the kernel mechanism that is doing the same
> job.
>  Haproxy has more information available, and a faster turn around on new
> load balancing strategies. 

Yes, and load balancing is its job, though it's hard for it to get any
global information while the kernel has everything. Ideally, the kernel
would tell us the minimum information we need (e.g. how many processes
are still bound to the socket, maybe the total count of accepted incoming
connections on that socket, etc.).

>   So options are;
>    1) Come up with a better stateless LB algo for the kernel. 
>    2) Maintain counts in kernel for a least connections algo. 
>    3) Stay as is kernel wise, but have haproxy play a more active
>       role in distributing connections. 
>    4) Do nothing, as its good enough for most people. 
> 
> If there's a better way for us to track active connections per server
> that at least would help simplify the shm balance patch. 

Actually I'm wondering if we couldn't try to abuse epoll. Its API is a
real mess in that processes interact with each other in ways most people
would not expect. I think that by having all processes keep access to each
other's epoll FD, we might be able to enable/disable polling on their
behalf. It's just a thought, nothing more.
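For what it's worth, here is a tiny standalone illustration of that
property (not haproxy code, error handling omitted): an epoll instance
created before fork() refers to a single shared interest list, so a
sibling process can add or remove the listener from it and thereby
enable or disable the other process's accept notifications.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        struct sockaddr_in sa = { .sin_family = AF_INET, .sin_port = htons(1234) };
        struct epoll_event ev = { .events = EPOLLIN };
        int lfd, ep;

        lfd = socket(AF_INET, SOCK_STREAM, 0);
        bind(lfd, (struct sockaddr *)&sa, sizeof(sa));
        listen(lfd, 128);

        ep = epoll_create1(0);
        ev.data.fd = lfd;
        epoll_ctl(ep, EPOLL_CTL_ADD, lfd, &ev);

        if (fork() == 0) {
            /* child: remove the listener from the *shared* interest list;
             * the parent will no longer be woken up for new connections */
            epoll_ctl(ep, EPOLL_CTL_DEL, lfd, NULL);
            _exit(0);
        }

        /* parent: with the entry deleted by the child, this only times out */
        epoll_wait(ep, &ev, 1, 1000);
        return 0;
    }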

> > That's quite common. That's the reason why we have the tune.maxaccept global
> > tunable, in order to prevent some processes from grabbing all connections at
> > once, but still that's not perfect because as you noticed, a same process 
> > can
> > be notified several times in a row while another one was doing something 
> > else
> > or was scheduled out, leaving some room for anything else. By the way, using
> > cpu-map to bind processes to CPUs significantly helps, provided of course
> > that no other system process is allowed to run on the same CPUs.
> 
>   Ok, we'll go back and check that in detail. CPU pinning and SMP IRQ
> affinity we do as a matter of course. 

OK.

> > Don't worry, I know that very well. Some customers are still running some 
> > 2.4
> > kernels that I built for them years ago for that reason, and our appliances
> > still ship with 2.6.32 :-) So you don't have to justify that you cannot 
> > upgrade,
> > I'm the first one to defend that position.
> 
>   Ok, that's reassuring. There are many projects out there that while
> wonderful, assume you have the latest version of fedora/ubuntu
> available.

Yes, I know, and I'm really irritated by this generalized rush forward
that totally ignores the need for stable environments in certain
situations.

> > > via what should be an atomic write. Locking should not
> > > be required. If there is a concern about write tearing, we can change it
> > > to an explicit atomic cmpxchg or similar.
> > 
> > No that's not needed. The only thing is that having them all in the same
> > cache line means that cache lines are bouncing back and forth between CPUs,
> > causing latencies to update the values, but that's all.
> 
>   Good point. We can avoid cache ping pong if we pad the structure
> appropriately. 

At least it should recover some of the performance that I expect to be
lost in very high connection rate scenarios.

> > > If this works as mentioned above, then please consider putting a note in
> > > the haproxy
> > > documents talking about this for EL customers and what the minimum
> > > acceptable kernel
> > > revs are to make this work properly.
> > 
> > That's a good point. In fact we've been running with SO_REUSECONN for
> > many years, I think my first patch probably dates 10 years now, but now
> > that the feature is present in the kernel, we should update the doc.
> 
>  Please do. It makes the difference between something that is very
> lopsided, and arguably of limited benefit (at least on our workloads)
> and approximately even balancing. 

OK. I'm adding that to my long todo list!

> > > From observation, a variable number of other processes
> > > will get woken up and will hit an error when they reach the accept()
> > > call after the winner does.
> > 
> > That's the principle. The kernel wakes up all waiters and one does
> > accept() and others fail at it. When you're running at a high
> > connection rate, it can be properly distributed if processes are
> > overloaded, because the ones already working do not have time to
> > run through accept(). But this doesn't fit in your use case where
> > you need a very low latency.
> > 
> > Heavily pre-forked products like Apache used a semaphore around the accept()
> > call so that only one process could accept and the other ones slept. This
> > slightly improves the scalability but not that much, because the contention
> > then comes to the semaphore. A more scalable solution would be an array of
> > semaphores.
> 
>  I'm not a fan of this approach. I don't like the idea of locking - its
> the enemy of latency and simplicity.

Totally agreed.

> I'll have to check the apache code to see how they're doing that.

From what I remember, it was simple:

    lock()
    accept()
    unlock()

So it was simple: the lock() ensured that only one process was woken up
while the others waited for the lock, and only that process could perform
the accept(). But at high connection rates, that doesn't scale, because
for the whole time between taking the lock and releasing it, only one
connection is accepted, within a delay 2-3 times larger than accept()
alone, so you can never saturate your CPUs. However, connections are
properly round-robinned since processes are generally dequeued in the
order they queued for the lock.

> But if all the processes are waking on an event, then testing a

No, it was not waking on an event, it was a preforked model, so only
one connection per process at a time. That means it was only a simple
loop:

  while (1) {
    lock()
    accept()
    unlock()
    process()
  }
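In C it would look roughly like the sketch below (flock() is used here
for brevity; Apache actually offered several serialization backends such
as fcntl locks and SysV semaphores):

    #include <sys/file.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* lock_fd is a file opened by the parent before fork() */
    static void worker_loop(int listen_fd, int lock_fd)
    {
        int cfd;

        while (1) {
            flock(lock_fd, LOCK_EX);           /* only one process proceeds */
            cfd = accept(listen_fd, NULL, NULL);
            flock(lock_fd, LOCK_UN);           /* hand over to the next one */

            if (cfd < 0)
                continue;
            /* process the connection entirely, then close it */
            close(cfd);
        }
    }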

> semaphore, I fail to see how much faster that
> would be than the lock free approach we've taken with shm balance. You
> only go to the kernel when you have no other choice. 

Hint: it was not fast at all, and always left 15-20% of idle CPU when
the performance limit was reached. The context switch rate was horrible
as well.

>  This is probably my ignorance showing. Hints appreciated. 

No no, don't feel that way. It's just that people always had to invent
ways to work around their OS's design limitations. I hope that in 5 years
people will make fun of your SHM patch because it will mean that we'll
have much better support from the kernel!

> > > > But then it will make things worse, because that means that your
> > > > process that was woken up by poll() to accept a connection and which
> > > > doesn't want it will suddenly enter a busy loop until another process
> > > > accepts this connection. I find this counter-productive.
> > > 
> > >  In practice this happens already, without our patch.
> > 
> > Not exactly, there's a subtle difference : the accept() is called and
> > returns EAGAIN, which marks the fd as pollable again. If we ever want
> > to switch to event-triggered epoll() instead of level-triggered, this
> > will not work anymore because the woken up process would have pretended
> > to consume the event while doing nothing instead.
> 
>  Ok, I can see that. In practice I think we're using epoll - don't know
> without checking the code if the versions of haproxy we've been testing
> with are level or ET.

Haproxy currently doesn't enable ET, so you don't run into the problem.

> We do see a number of processes woken - and each
> socket does get accepted. 

BTW, I remember you said that you fixed the busy loop by disabling the
FD in the speculative event cache, but do you remember how you re-enable
it? E.g. if all the other processes have accepted some connections, your
first process will have to accept new connections again, which means its
state depends on the others'.

>  That argues in favour of a separate accept() and SSL initialise thread
> and a least connections/round robin/whatever hand off to a pool of
> processes. That however isn't the architecture of haproxy and could be
> quite disruptive to the design.

Not completely disruptive, because we already wanted to implement "remote
sockets" (i.e. sockets owned by another process), and the architecture is
supposed to be "easy" to migrate to this design, as there's almost no
reference to a local FD anywhere above the connection layer, except for
minor things such as disabling lingering before closing.

> And would not work for those people who
> have very many short lived processes where the accept thread would
> become the bottle neck. 

Clearly, but with multiple accept threads it's not a problem anymore.
That's among the things which make me favor threads over processes
for the future.

>   Or alternatively the first thread that wins does the accept, and then
> does the decision as to where it should go. It would either lead back to
> pushing it upstream into the kernel again, or there being an IPC
> mechanism to hand over a socket to a more lightly loaded thread. Pushing
> it up to kernel would be the route if you want to preserve the current
> architecture. 

No, FD passing is much better here: one of many threads accepts an incoming
connection, decides which worker should take it, and passes it over a unix
socket using SCM_RIGHTS.
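For reference, the sending side of that is a classic few lines of
sendmsg() with an SCM_RIGHTS control message (a sketch, error handling
omitted):

    #include <string.h>
    #include <sys/socket.h>

    /* pass <fd> to the process at the other end of unix socket <sock> */
    static int send_fd(int sock, int fd)
    {
        char byte = 0;
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
        union {
            struct cmsghdr hdr;
            char buf[CMSG_SPACE(sizeof(int))];
        } u;
        struct msghdr msg;
        struct cmsghdr *cmsg;

        memset(&msg, 0, sizeof(msg));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = u.buf;
        msg.msg_controllen = sizeof(u.buf);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
    }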

> > >  We did observe a busy spin in an earlier version of the patch. However
> > > by clearing the speculative read we seem to have fixed that.
> > 
> > Yes that was needed. But in an event-triggered polling model, you will
> > never be notified again about a new accept being available.
> 
>   Assuming that all the processes that get woken by that event don't
> accept. I've not played with ET in enough detail to know if the kernel
> does a wake one for each event, or just hands it over to a thundering
> herd.

No, in ET mode you are only notified when the status changes. So you get
notified when there's data for you, and it's up to you to read that data
or not and to keep the event in your own cache. But the kernel won't wake
you up again until you hit EAGAIN, which re-arms the event. The purpose of
ET is to reduce the epoll_ctl(ADD)/epoll_ctl(DEL) dance which happens when
transferring large amounts of data between two sides. But haproxy's
speculative model allows it to generally not poll on the connections to
servers, because data arrive there very fast, so the kernel buffers are
always filled and a recv() almost always manages to fill an internal
buffer, and thus we don't need to enable then disable polling. I observed
a very low count of epoll_ctl() calls on various workloads, leading me
to think that the gain we'd get from ET remains low.
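To make the constraint concrete, this is roughly what ET imposes on the
reading side (a sketch): once notified, you must drain until EAGAIN or
you will never be woken up again for that FD.

    #include <errno.h>
    #include <unistd.h>

    /* fd was registered with EPOLLIN | EPOLLET */
    static void drain_fd(int fd)
    {
        char buf[16384];
        ssize_t n;

        while (1) {
            n = read(fd, buf, sizeof(buf));
            if (n > 0)
                continue;                  /* consume and keep reading */
            if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                break;                     /* drained: event is re-armed */
            break;                         /* 0 = peer closed, or error */
        }
    }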

> You may end up with a socket hanging around waiting for the
> next connection before it gets serviced. Which is not good. 

Even worse, as described above: you won't ever get notified again
until you decide on your own to call accept() and get EAGAIN.

>   The fundamental problem here is where the load balancing of
> connections across the pool should take place. Pushing it up into the
> kernel is great, simple and works, but limits the strategies available
> to round robin.

At least currently.

> Haproxy has the state to make the decisions for more complex strategies,
> but no way of sharing that state with each process at present.

That's it. Also, I'd be tempted to think that in general we should avoid
complex strategies because they don't scale. I think that round robin,
source hash and leastconn are more than enough at this level.

> > >  I agree that the kernel 3.9/SO_REUSEPORT option is a better way.
> > > This was the best we could think of on a kernel that does not possess it.
> > 
> > I really think you should give it a try with a load that matches yours,
> > because it works really well in my tests.
> 
>  We'll find out and get back to you. 

OK.

> > That said, there are a number of problems that make me feel uneasy with
> > your patch, one of them is that the load-balancing is per process and not
> > per listener. That means that someone using it with both SSL and clear
> > traffic for example might very well end up with one process taking all
> > the SSL traffic and other processes sharing the clear traffic. Also, I
> > think that a leastconn distribution would make much more sense regarding
> > this.
> 
>  Fair enough. We have a fairly simple model of a single listener per
> process/service. We have a single service we want to scale. I can see
> that some people may want to have a big pool of processes each of which
> services several different services. How would per listener accept
> leveling look? Would it suffice to level by listener by process?

Yes, that's it. You already have an ID per "bind line". All sockets
which belong to the same bind line have the same ID and are processed
similarly (that's used for binding to port ranges for example). So
probably the idea would be to use the listener ID as an index into
the table. It could be acceptable to limit that to a
reasonable number of IDs and refuse to load a config where a "bind"
line has too high an ID and is marked as "shm".

> shm_balance terms that would give us an array indexed by service and by
> process. 

Indeed.

>  I get the feeling that this patch is likely to remain outside the code
> base.

That tends to be my feeling as well, at least until we try to move the
current architecture to a threaded one. If for any reason we fail and
we see that we're condemned to stay multi-process, the reasoning will
be much different.

> If you're more likely to accept it if we do this work, then we
> could consider doing that. 

I'd suggest not making it more complex, because for now you're the only
users, and you'd be the first victims of the extra complexity, while
nobody will thank you for the small extra benefit.

> > I spent the day yesterday thinking about all this (hence my late reply).
> > In fact, we've been thinking about other related improvements for future
> > versions, such as delegating SSL to some processes, and distributing
> > incoming connections via FD passing over unix sockets, etc...
> 
>  A multithreaded model makes more sense. We're talking about the best
> way to share state.

Yes, that's why it's mentioned further below :-)

> The options are;
> 
>  Processes/threads share the VM state (go multithreaded)

In fact there are some benefits to using processes instead of threads,
for example you can share only part of the VM state and not share the
FDs at all. But it also comes with its own set of problems.

>  Use shared memory 
>  Use an external service - memcached. 

That's not usable here since it would require slow I/Os to know who must
do the fast I/O.

>  Have an internal message bus that broadcasts updates to a local cache. 

Almost the same here.

> > I realized that one of the problems you faced is knowing how many processes
> > are still present. While the kernel has refcounts to many things, it's very
> > difficult for the userland to get this piece of vital information. You 
> > solved
> > it using timestamps, at first I was not very satisfied with the method, but 
> > I
> > think it's one of the least ugly ones :-)
> 
>  Thank you. High praise indeed :-). In retrospect I think that may be
> over engineered.

I don't think so, because the whole day it took me to think about it was
spent wondering "but why did they do something that complex?". And after
thinking about all of the points (shm, timestamps, etc.) I arrived at the
same principle, due to the lack of kernel help on the subject.

> All the processes are identical. If one dies chances
> are they're all going to die.

Yes, "chances". But not certainty: when you restart, you could end up
rejecting connections without that counter. Also, the more processes you
have, the higher the risk of one getting killed, and you don't want to
end up with a weaker architecture.

> But it seemed easy to add at the time. And
> if the problem is down to customers sending you bad data, then it may
> have value (although experience from running large internet sites tells
> me that the customer is likely to just hit you again and serially knock
> down each process in the pool).

Yes, I can confirm, some people I know deployed haproxy to prevent a shared
system from dying when one of their customers' services went mad and hammered
the service until all nodes died from the domino effect :-)
 
> > The only reliable system-based methods I could elaborate to know immediately
> > how many processes are present are the following :
> > 
> >   - using a semaphore: with a semop(+1, SEM_UNDO), each process announces
> >     itself. With semctl(GETVAL), you can read the number of subscribers. If
> >     a process dies, its value is decremented thanks to SEM_UNDO, so GETVAL
> >     will return the remaining number of processes. I thought about using
> >     this to get a map of present processes (one bit per relative process),
> >     but the semval is only a 16-bit short unsigned, so that limits to 16
> >     processes, which could be low for something aiming at significantly
> >     improving scalability.
> > 
> >   - using socketpair(). The idea is the following : the parent process
> >     first creates as many socket pairs as it will fork children. Then
> >     each children inherit these sockets and close the output side of all
> >     those except the one attached to their ID. They poll for input on the
> >     common side of all others however. Thus, when a process dies, all
> >     other ones will be woken up with a read event on the socket associated
> >     to the defunct. I think that at some point we'll have to implement
> >     something like this if we want to be able to bounce between processes
> >     via the stats socket anyway.
> > 
>    Both of these seem high overhead and synchronous by involving in line
> syscalls. The low latency side of me naturally shies away from this.

I disagree, at least for the second one. semctl(GETVAL) involves a
syscall per accept(), but the socketpair method does not. You create the
sockets before the fork() and then forget about them. If a process dies,
you're notified because epoll() tells you so. So it's a very efficient
low-latency method with essentially zero overhead.
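A rough sketch of that socketpair trick, in case it helps picture it
(the process count and names are arbitrary, error handling omitted):

    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NBPROC 4

    int main(void)
    {
        int pair[NBPROC][2];
        int id, i;
        char c;

        for (i = 0; i < NBPROC; i++)
            socketpair(AF_UNIX, SOCK_STREAM, 0, pair[i]);

        for (id = 0; id < NBPROC; id++) {
            if (fork() == 0) {
                /* child <id>: keep only our own "announce" end open */
                for (i = 0; i < NBPROC; i++)
                    if (i != id)
                        close(pair[i][0]);

                /* in haproxy the pair[i][1] ends (i != id) would be
                 * registered in the poller; as a trivial demonstration,
                 * just block until process <id+1> dies (EOF on its pair) */
                if (id < NBPROC - 1)
                    read(pair[id + 1][1], &c, 1);
                _exit(0);
            }
        }

        /* parent: drop its copies so that a child's exit is the last
         * close of its announce end */
        for (i = 0; i < NBPROC; i++)
            close(pair[i][0]);

        for (i = 0; i < NBPROC; i++)
            wait(NULL);
        return 0;
    }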

> They are both robust methods however.
> 
>    Along these lines is another route - if you go with an acceptor
> thread model and make it the parent - is to set up an fd with signalfd
> to listen for SIGCHLD and call one of the wait() family to get an idea
> of why it died as well as which one. 

You don't even need signalfd, you can already register signals in haproxy.
But the centralized model with one parent is not resilient. Everything goes
out of control if the parent dies, I really don't like such centralized
architectures.

>   However if you want to pass information between processes about
> session state, that sort of heads in the direction of some sort of
> generic IPC or  memcached approach where there's a heartbeat message on
> an internal memory or multicast bus. That generic mechanism could carry
> session state messages for failover as well. It also has the advantage
> of being asynchronous, if each process maintains a local cache that's
> populated by pushes from the other processes. 

Yes, that's how the peers subsystem works, for example. But here we want
fast updates for the connection counts, so the shared memory is fine. In
an ideal world, accept() would return an array showing the other
processes' counts as well, or at least the sum of all processes' counts
and the number of processes still bound.

> > I thought about another point : depending on your load, it might make sense
> > to stack two layers of load balancing. Some people already do that for SSL.
> > The principle is that the first layer gets woken up in random order by the
> > system, and distributes the load according to the configured algorithm to
> > the second layer. The load at the second layer will be much smoother, and
> > will respect the first layer's algorithm +/- an offset equivalent to the
> > number of front nodes.
> 
>   We did consider separating SSL termination from the loadbalancing/
> backend server handling, and having a multilayer architecture.

You can already try it within haproxy as in the example at the top of
this e-mail.

> We will
> probably look at specialist kit like the nitro ssl accelerator cards
> when we need to do this. Right now, we thought that was overkill for our
> scalability requirements. 

I'd be very interested in results using the cards. I've always been
impressed by the numbers they indicate on asymmetric crypto, but I
have no idea how close we can come to that and how deep the request
queues need to be for this.

>   The thought also crossed our minds that actually DSR and SSL
> termination on the fix servers would be the best way of load balancing
> our workload. There are other considerations about our infrastructure
> that prevent us using that at the moment. 

In my humble opinion, DSR architectures are becoming obsolete now. They
were great in 2000 but services are rarely equivalent for all requests
of a single connection, and DSR forces you to ignore L7 information.
In your case it might not be an issue, but imagine you migrate to
a SOAP architecture later: you might want to selectively direct
some traffic to different backends based on the class of service the
customer subscribed to and/or what appears in the request.

>   Have you ever considered using an alternative software SSL engine to
> openssl?

Yes, Emeric has written an alternative implementation using CyaSSL, but
that was long ago, when they didn't yet support TLSv1.2. Now that their
stack looks really complete, he wants to spend some time on this again.
Their performance was great, about 2.5 times that of openssl.

> It seems to top out much lower than the SSL engine used by ZXTM
> for example. 

The problem with openssl is that everything is converted multiple times
internally. Another problem is that it uses write()/read(), and the API
for using your own buffers with send()/recv() is even more obscure than
the current one. That's a point where the example above might actually
improve performance by reducing the number of PUSH segments on the wire.
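For the record, the "own buffers" route mentioned above would presumably
be something like OpenSSL's BIO pairs; a heavily abbreviated sketch
(setup, handshake and error handling left out):

    #include <openssl/bio.h>
    #include <openssl/ssl.h>
    #include <sys/socket.h>

    /* hand the internal side of a BIO pair to the SSL object and keep
     * the network side for ourselves */
    static BIO *attach_bio_pair(SSL *ssl)
    {
        BIO *int_bio, *net_bio;

        BIO_new_bio_pair(&int_bio, 0, &net_bio, 0);
        SSL_set_bio(ssl, int_bio, int_bio);
        return net_bio;
    }

    /* after SSL_write()/SSL_read(), ciphertext queued by the library is
     * pulled out of the network BIO and sent on the socket by hand
     * (real code must handle partial sends and the receive direction) */
    static void flush_ciphertext(BIO *net_bio, int fd)
    {
        char buf[16384];
        int n;

        while ((n = BIO_read(net_bio, buf, sizeof(buf))) > 0)
            send(fd, buf, n, 0);
    }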

> > Also another point of consideration is that the likeliness of going to
> > threads instead of processes is growing with time. The reason is simple,
> > with heavy latency users like SSL, we definitely want the ability to
> > migrate this heavy traffic to dedicated proceses/threads. Unfortunately,
> 
>   Particularly where there may be SSL offload engines/coprocessors
> involved. I don't know if support for those is on your roadmap.

I have a simple patch to enable ssl engines. We never merged it because we
never figured out the long list of settings we'd need to make configurable,
so I preferred to wait for a real need to do it correctly instead. When you
test your nitrox cards, I can send it to you.

> > doing so right now means that all the processing is done in a distinct
> > process. In general it's not an issue and we can even chain this process
> > to a central one if a more aggregate traffic is desired. But chaining
> > processes also means that we don't have easy access to the SSL info
> > from the central process. So all in all, it seems like at some point we'll
> > have to support threads and implement a thread-aware scheduler (basically
> > the same as the current one with a thread map for each task so that threads
> > can only dequeue the tasks allowed to run on them). This will also solve
> > the issue we're having with keeping stats across all processes, and will
> > allow other unexpected snails like gzip to run in threads that do not
> > affect the latency of the rest of the traffic.
> 
>   There's a need to share state. Sharing a memory space would be one
> way, although the design would need to change quite a bit to make
> optimum use of a thread pool model. It would make the code harder, but
> the benefits of sharing information would be there. 

I totally agree. There are a number of things we cannot optimize now
that we could optimize this way (typically latencies caused by SSL
handshakes).

>   An alternative would be to keep them as separate processes but use the
> message bus idea for sharing state data/health information if a more
> loosely coupled data sharing was needed. The fundamental question that
> decides between the two models is the timeliness of the data. If each
> thread has to know right now with no delay what the others are doing,
> then multithreaded is the way to go. If we don't mind working with
> slightly out of date data - for example connection counts - then other
> multicast/broadcast alternatives could work. 

Connection counts are a bad example since you need an accurate one for
leastconn :-)  But I generally agree. Stats tend to be an example of
this.

> > So as you can see, I'm having mixed opinions. I'm thinking that in its
> > current form, your patch is too limited for general use, that a per-bind
> > leastconn distribution would make a lot more sense, and that it would
> > still have a limited life if we migrate to threads in 6 months or one
> > year. Thus my first feeling is that we should try to do our best to see
> > how your current workload could be supported without an extra patch (ie
> > either by improving the config or patching haproxy a little bit in a way
> > that is more acceptable for mainline).
> 
>   Fair enough. We'll maintain the patch in parallel then for now. Where
> we can simplify and merge we will. We'll see if reuseconn and RR is
> "good enough". If you like the internal message bus idea we'll see if we
> can help with that - or any other direction where you think we could add
> value. 

With 1.5-final coming soon, I'd really like to attack the multi-thread
model early, because we don't want to do it too late, once people start
sending their contributions and everything conflicts. So I think we
should avoid adding extra communication channels between the processes
and instead add support for MT.

Anyway, this is an interesting and useful discussion, Andy. Thank you
very much for your insights and for explaining the shortcomings you're
facing; it definitely helps improve the long-term design.

Best regards,
Willy

