On Thu, Mar 24, 2016 at 2:00 PM, Willy Tarreau <w...@1wt.eu> wrote:
> The pattern is :
>
>   t0 : unprivileged processes 1 and 2 are listening to the same port
>        (sock1@pid1) (sock2@pid2)
>        <------ listening ------>
>
>   t1 : new processes are started to replace the old ones
>        (sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4)
>        <------ listening ------> <------ listening ------>
>
>   t2 : new processes signal the old ones they must stop
>        (sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4)
>        <------- draining ------> <------ listening ------>
>
>   t3 : pids 1 and 2 have finished, they go away
>                                  (sock3@pid3) (sock4@pid4)
>         <------ gone ----->      <------ listening ------>

To address the documentation issues, I'd like to reference the following:
- The filter.txt document in the kernel tree:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/networking/filter.txt
- It uses (and extends) the BPF instruction set defined in the
original BSD BPF paper: http://www.tcpdump.org/papers/bpf-usenix93.pdf
- The kernel headers define all of the user-space structures used:
  * 
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/filter.h
  * 
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/bpf.h

I've been trying to come up with an example BPF program for use in the
example Willy gave earlier in this thread (using 4 points in time and
describing one process with two listening sockets replacing another
with two listening sockets).  Everything except the last step is
pretty straight forward using what is currently available in the
kernel.  I'm using random distribution for simplicity, but you could
easily do something smarter using more information about the specific
hardware:

t0: Evenly distrubute load to two SO_REUSEPORT sockets in a single process:
  ld rand
  mod #2
  ret a

t1: Fork a new process, create two new listening sockets in the same
group. Even after calling listen(), but before updating the BPF
program, only the first two sockets will see new connections.  The
program is trivially modified to use all 4.
  ld rand
  mod #4
  ret a

t2: Stop sending new connections to the first two sockets (draining)
  ld rand
  mod #2
  add #2
  ret a

t3: Close the first two sockets and only use the last two.  This is
the tricky step.  Before this point, the sockets are numbered 0
through 3 from the perspective of the BPF program (in the order
listen() was called).  As soon as socket 0 is closed, the last socket
in the list replaces it (what was 3 becomes 0).  When socket 1 is
closed, socket 2 moves into that position.  The assumptions about the
socket indexes in the BPF program need to change as the indexes change
as a result of closing them.

Even if you use an EBPF map as a level of indirection here, you still
have the issue that the socket indexes change as a result of some of
them leaving the group.  I'm not sure yet how to properly fix this,
but it will probably mean changing the way the socket indexing
works...  The current scheme is really an implementation detail
optimized for efficiency.  It may be worth modifying or creating a
mode which results in a stable mapping.  This will probably be
necessary for any scheme which expects sockets to regularly enter or
leave the group.

Reply via email to