Hi James,

On Tue, May 13, 2014 at 06:00:13PM +0100, James Hogarth wrote:
> Hi Willy,
>
> Please see the response from our Head of Systems below.
Thank you. For ease of discussion, I'm copying him. Andy, please tell me if
this is not appropriate.

> On a side note, our initial investigations see better behaviour (ie one
> or two processes don't run away with it all), but the current EL6 kernel
> utilising the SO_REUSEPORT behaviour doesn't appear to do a perfect
> round robin of the processes and consequently can end up a bit
> unbalanced -

I've just checked the kernel code and indeed it's not a real round-robin,
it's a hash on the 4-tuple (src/dst/spt/dpt), coupled with a pseudo-random
mix. But that makes a lot of sense, since a round robin would have to
perform a memory write to store the index. That said, when testing here, I
get the same distribution on all servers, +/- 0.2% or so.

> and this is especially so for the longer lived connections, depending on
> when one client may disconnect.

If you're concerned with long lived connections, then round robin is not
the proper choice; you should use a leastconn algorithm instead, which
will take care of disconnected clients.

> We're in the process of rebasing this code to dev25 and cleaning it up
> as per your suggestions.
>
> To give an idea of the difference in behaviour, counting the connections
> per process every 5 seconds whilst ramping up the connections in the
> background:
>
> Haproxy HEAD on current el6:
>
>  0  0  0  0  0  0  0
>  0  0  1  1  1  0  0
>  1  0  4  1  2  0  0
>  2  2  5  2  3  0  0
>  2  2  6  3  4  2  0
>  2  2  7  3  4  4  2
>  3  3  7  4  5  5  2
>  3  6  8  4  6  6  2
>  3  8  9  4  7  6  2
>  3 10  9  6  7  6  3
>  3 12  9  7  9  7  3
>
> Haproxy HEAD with new shm_balance patch on current el6:
>
>  0  0  0  0  0  0  0
>  0  0  1  1  1  1  0
>  1  1  1  1  2  1  2
>  2  2  2  2  2  2  2
>  3  3  3  2  3  3  3
>  4  3  4  4  3  4  3
>  5  4  4  4  5  4  4
>  5  5  5  5  5  5  6
>  6  6  6  6  6  5  6
>  7  6  6  7  7  6  7
>  7  8  7  7  7  7  7

But these are very small numbers. Are you really running with numbers
*that* low in production, or is it just because you wanted to run a test?
I was assuming that you were dealing with thousands or tens of thousands
of connections per second, where the in-kernel distribution is really
good. I can easily imagine that it may be off by a few units in a test
involving just a few tens of connections, however.

Responding to Andy below:

> We realise that this patch is rushed, and we appreciate the feedback.
> It is also true that we've been working off dev21 and have been working
> on it for a while. If SO_REUSEPORT works well, then it's a far neater
> solution that renders this patch unnecessary.

I definitely agree. I think we should propose some kernel-side
improvements, such as a way to distribute according to the number of
established connections per queue instead of hashing or round-robinning,
but it seems there's no relation between the listen queues and the
inherited sockets, so it looks hard to get that information even from the
kernel.

> There's some extra explanation here that may help answer some of your
> questions.
>
> > I must confess I don't really understand well what behaviour this
> > shm_balance mode is supposed to provide.
>
> Without this patch, on dev21, on the enterprise linux kernels we have
> tested and run in production, we see that the busiest haproxy process
> will run away and grab most new connections. In effect we get 160%
> capacity over a single process before we start seeing queueing latency.
> The load balancing is very unequal.

That's quite common. It's the reason we have the tune.maxaccept global
tunable: it prevents some processes from grabbing all connections at
once. Still, that's not perfect because, as you noticed, the same process
can be notified several times in a row while another one was doing
something else or was scheduled out, leaving some room for anything else.
By the way, using cpu-map to bind processes to CPUs helps significantly,
provided of course that no other system process is allowed to run on the
same CPUs.

> We need more capacity than that.
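To make those two knobs concrete, a purely illustrative global section
could look like the following; the CPU numbers and the maxaccept value are
placeholders to adapt to your 32-core machines, not recommendations:

```
global
    nbproc 7
    # cap how many connections one process accepts per wakeup, so a
    # single process cannot grab everything at once
    tune.maxaccept 8
    # pin each process to its own core (and keep other system processes
    # off those cores)
    cpu-map 1 0
    cpu-map 2 1
    cpu-map 3 2
    cpu-map 4 3
    cpu-map 5 4
    cpu-map 6 5
    cpu-map 7 6
```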
> With this patch, we get uniform balancing across 7 processes, giving us
> almost 700% usable over a single process.

For sure, but as explained above, in my opinion round robin is not the
best choice for long lived connections (though it's better than nothing,
of course).

> The decision to upgrade to a more recent kernel (particularly if it's
> not an EL kernel) is a difficult one for shops running on enterprise
> versions of linux. Many places stick on a particular point revision for
> a while and only upgrade to newer point releases after requalification.

Don't worry, I know that very well. Some customers are still running some
2.4 kernels that I built for them years ago for that reason, and our
appliances still ship with 2.6.32 :-) So you don't have to justify that
you cannot upgrade, I'm the first one to defend that position.

> For a mainline kernel, like 3.9, we would have to consider that
> carefully. We're fairly committed to RPM-based Enterprise linux
> distributions with a long support lifetime.

No problem with that. BTW, RHEL7 will ship with 3.10 if I understood
correctly.

> After your response we have been investigating whether SO_REUSEPORT is
> supported under EL kernels. If it does work (and it looks like this may
> have been backported for 2.6.32-417), then we can upgrade both the
> kernel to that EL kernel and haproxy to dev25, and our problem may be
> solved.

Yes indeed, I discovered that following your e-mail; that's quite good
news for many haproxy users!

> If not, we'd either need to do something special for the loadbalancers
> operating-system-wise to get a more recent kernel (i.e. run ubuntu LTS),
> with the consequent impacts on puppet manifests and monitoring, or run
> unsupported kernels on EL. Neither is attractive.

I'd strongly discourage you from switching to an unsupported kernel or
changing your distro.
You're already running a haproxy version that is not shipped with your
distro, so you should limit the moving parts and only tweak haproxy based
on what your supported distro can do, not the other way around.

> So, we'll test that on our workload and get back to you.

James' test shows that with very few connections it's not a perfect
distribution, but I wasn't expecting that you'd run with that few
connections. Could you please clarify that point?

(...)

> The actconn variable is shared in shared memory. We're using the single
> writer model, so each process is the only process writing to its slot
> in shared memory,

OK, found it, indexed based on relative_pid. I thought all processes
shared the same global actconn, which scared me!

> via what should be an atomic write. Locking should not be required. If
> there is a concern about write tearing, we can change it to an explicit
> atomic cmpxchg or similar.

No, that's not needed. The only thing is that having them all in the same
cache line means that cache lines are bouncing back and forth between
CPUs, causing latencies to update the values, but that's all.

(...)

> The haproxy-shm-client started as a quick and simple way to inspect the
> shared memory region to verify the load balancing was evenly
> distributed. It then grew legs to allow us to enable/disable individual
> processes within an nbproc group.

Hehe, the stats socket started this way as well :-)

> If you'd prefer we did that through the stats socket, then we can
> definitely do that. To be frank, haproxy is advanced C and we felt that
> the less we touched the better, and the more likely it was that elements
> of the patch would get merged.

I understand your point. I'd say that for now I wouldn't care about the
client but more about how the feature itself works in the code.

(...)

> > With recent kernels (3.9+), the system will automatically round-robin
> > between multiple socket queues bound to the same ip:port.
> > However, this requires multiple sockets. With the latest changes
> > allowing the bind-process to go down to the listener (at last!), I
> > realized that in addition to allowing it for the stats socket (primary
> > goal), it provides an easy way to benefit from this kernel's round
> > robin without having to create a bind/unbind sequence as I was
> > planning it.
>
> This is good to know. And makes our patch redundant.
>
> If this works as mentioned above, then please consider putting a note
> in the haproxy documents talking about this for EL customers and what
> the minimum acceptable kernel revs are to make this work properly.

That's a good point. In fact we've been running with SO_REUSEPORT for
many years (I think my first patch probably dates back 10 years now), but
now that the feature is present in the kernel, we should update the doc.

> (referring to our observed traffic pattern)
>
> > You see this pattern even more often when running local benchmarks.
> > Just run two processes on a dual-core, dual-thread system, then have
> > the load generator on the same system, and you'll see that the load
> > generator disturbs one of the processes more than the other one.
>
> Our load balancers are 32-core sandybridge with 4x10 gigabit cards
> each.

OK, so I guess you see more than 1 connection per second :-)

> Our load pattern is long lived TCP connections carrying data that
> cannot be delayed by more than a millisecond. Without the patch, the
> busiest process (we're running with nbproc=7 at present) will take the
> lion's share of new accepts(), until it starts to queue above our
> acceptable minimum latency.
>
> By enforcing a more round robin distribution of the accepts(), we're
> distributing the load and making use of all processes. As mentioned
> above, we're now seeing the scalability we need.
> IMHO (again, on an old kernel and for our workload) the nbproc setting
> is of limited use for scalability of haproxy when you have the "free
> for all" method of load balancing, where each process competes to do
> the accept() and only the winner triumphs.

It tends to be quite limited. Most people dealing with large nbproc
values use it for SSL, or make intensive use of bind-process to spread
the load depending on the frontends. With the recent per-bind process
mapping, it will be much easier.

> From observation, a variable number of other processes will get woken
> up and will hit an error when they reach the accept() call after the
> winner does.

That's the principle. The kernel wakes up all waiters; one does accept()
and the others fail at it. When you're running at a high connection rate,
it can be properly distributed if processes are overloaded, because the
ones already working do not have time to run through accept(). But this
doesn't fit your use case, where you need a very low latency.

Heavily pre-forked products like Apache used a semaphore around the
accept() call so that only one process could accept while the other ones
slept. This slightly improves the scalability, but not that much, because
the contention then moves to the semaphore. A more scalable solution
would be an array of semaphores.

> > But then it will make things worse, because that means that your
> > process that was woken up by poll() to accept a connection and which
> > doesn't want it will suddenly enter a busy loop until another process
> > accepts this connection. I find this counter-productive.
>
> In practice this happens already, without our patch.

Not exactly, there's a subtle difference: the accept() is called and
returns EAGAIN, which marks the fd as pollable again. If we ever want to
switch to event-triggered epoll() instead of level-triggered, this will
not work anymore, because the woken-up process would have pretended to
consume the event while doing nothing instead.
> The kernel seems to wake all or a subset of processes, which compete
> for the accept(). There's a bunch of processes doing unnecessary work.

I agree!

> Our patch doesn't change that. But it does mean that the busiest
> process will not attempt to accept(). One of the others which is woken
> at the same time will call accept().
>
> We did observe a busy spin in an earlier version of the patch. However,
> by clearing the speculative read we seem to have fixed that.

Yes, that was needed. But in an event-triggered polling model, you will
never be notified again about a new accept being available.

> Looking at the pattern of syscalls, it seems there's no busy spin
> (again, that we have observed).
>
> I agree that the kernel 3.9/SO_REUSEPORT option is a better way. This
> was the best we could think of on a kernel that does not possess it.

I really think you should give it a try with a load that matches yours,
because it works really well in my tests.

(...)

> I'm happy to have a phone call or chat with you about some of the
> data/thinking or alternative implementation ideas. As I mentioned, I'd
> really like to avoid a long lived internal patch.

I can understand, I don't like keeping long lived internal patches
either. That said, there are a number of problems that make me feel
uneasy with your patch. One of them is that the load balancing is per
process and not per listener. That means that someone using it with both
SSL and clear traffic, for example, might very well end up with one
process taking all the SSL traffic and other processes sharing the clear
traffic. Also, I think that a leastconn distribution would make much more
sense here.

I spent the day yesterday thinking about all this (hence my late reply).
In fact, we've been thinking about other related improvements for future
versions, such as delegating SSL to some processes, distributing incoming
connections via FD passing over unix sockets, etc.
I realized that one of the problems you faced is knowing how many
processes are still present. While the kernel has refcounts to many
things, it's very difficult for userland to get this piece of vital
information. You solved it using timestamps; at first I was not very
satisfied with the method, but I think it's one of the least ugly ones :-)

The only reliable system-based methods I could come up with to know
immediately how many processes are present are the following:

  - using a semaphore: with a semop(+1, SEM_UNDO), each process announces
    itself. With semctl(GETVAL), you can read the number of subscribers.
    If a process dies, its value is decremented thanks to SEM_UNDO, so
    GETVAL will return the remaining number of processes. I thought about
    using this to get a map of present processes (one bit per relative
    process), but the semval is only an unsigned 16-bit short, which
    limits us to 16 processes; that could be low for something aiming at
    significantly improving scalability.

  - using socketpair(): the idea is the following. The parent process
    first creates as many socket pairs as it will fork children. Each
    child then inherits these sockets and closes the output side of all
    of them except the one attached to its own ID, while polling for
    input on the other side of all the others. Thus, when a process dies,
    all the other ones will be woken up with a read event on the socket
    associated with the defunct. I think that at some point we'll have to
    implement something like this anyway if we want to be able to bounce
    between processes via the stats socket.

I thought about another point: depending on your load, it might make
sense to stack two layers of load balancing. Some people already do that
for SSL. The principle is that the first layer gets woken up in random
order by the system, and distributes the load according to the configured
algorithm to the second layer.
The load at the second layer will be much smoother, and will respect the
first layer's algorithm +/- an offset equivalent to the number of front
nodes. In the past, I wanted to do this with fd-passing to work around
the rough distribution of the kernel sockets. That would have made it
possible to implement a simple round robin mechanism, but it does not
allow us to implement a leastconn system, since the accepting processes
do not know the load of the second layer. Thus a shared memory area is
still needed to keep track of the load. Then, if we have an SHM and a
front process to accept fds, we can imagine that any process can offer an
FD it accepts to any other one based on the load indicated in the SHM.
I'm just not sure of the benefits compared to doing nothing and letting
the other one accept it. Hmmm, yes, in fact there is a small benefit,
which is that it supports leastconn even with the new SO_REUSEPORT of the
recent kernels.

Another point of consideration is that the likelihood of going to threads
instead of processes is growing with time. The reason is simple: with
heavy latency users like SSL, we definitely want the ability to migrate
this heavy traffic to dedicated processes/threads. Unfortunately, doing
so right now means that all the processing is done in a distinct process.
In general it's not an issue, and we can even chain this process to a
central one if more aggregated traffic is desired. But chaining processes
also means that we don't have easy access to the SSL info from the
central process. So all in all, it seems like at some point we'll have to
support threads and implement a thread-aware scheduler (basically the
same as the current one, with a thread map for each task so that threads
can only dequeue the tasks allowed to run on them). This will also solve
the issue we're having with keeping stats across all processes, and will
allow other unexpected snails like gzip to run in threads that do not
affect the latency of the rest of the traffic.
So as you can see, I'm having mixed opinions. I'm thinking that in its
current form your patch is too limited for general use, that a per-bind
leastconn distribution would make a lot more sense, and that it would
still have a limited life if we migrate to threads in 6 months or one
year. Thus my first feeling is that we should try to do our best to see
how your current workload could be supported without an extra patch (i.e.
either by improving the config or by patching haproxy a little bit in a
way that is more acceptable for mainline).

I'm really interested in your opinions on all of this. Please do not
hesitate to share them on the list; there are a number of multi-process
users here, all with different workloads (eg: RDP is sensitive to
connection counts), and it's best if everyone can participate.

Best regards,
Willy