Re: [PATCH for 5.9 v2 1/4] futex: introduce FUTEX_SWAP operation

2020-08-11 Thread Rik van Riel
On Tue, 2020-08-04 at 14:31 +0200, pet...@infradead.org wrote:
> On Mon, Aug 03, 2020 at 03:15:07PM -0700, Peter Oskolkov wrote:
> > A simplified/idealized use case: imagine a multi-user service
> > application
> > (e.g. a DBMS) that has to implement the following user CPU quota
> > policy:
> 
> So the last posting made hackernews, and there a bunch of people
> expressed far more interest in coroutines, which, if I'm not mistaken,
> can also be implemented using all this.
> 
> Would that not make for a far simpler and more convincing use-case?

Also just worker threads in general. Instead of waking up
an idle worker thread every time a new request comes in,
why not allow worker threads to indicate "I'm almost done",
and the main/receiving thread to enqueue the new request,
to be handled when the worker thread is done with the current
request?

> I really think we want to have block/resume detection sorted before
> this goes anywhere; I also strongly feel BPF should not be used for
> functional interfaces like that.
> 
> That is, I want to see a complete interface before committing to
> an ABI that we're stuck with.

The worker case above also needs that figured out, for the
(slow path, but still common) case where the worker thread has
actually gone to sleep while the work got enqueued, and then
needs to be woken up anyway.

Not sure that is a direction you are interested in, given
the description, but it could make coroutines and worker
threads more efficient :)
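
A minimal sketch of the handoff described above, using C11 atomics and
plain FUTEX_WAIT/FUTEX_WAKE for the slow path. The queue helpers, the
state encoding, and the futex wrapper are illustrative assumptions,
not part of the patch:

#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

struct request;
extern void enqueue(struct request *r);     /* illustrative queue ops */
extern struct request *dequeue(void);
extern void handle(struct request *r);

static _Atomic uint32_t worker_state;       /* 0 = asleep, 1 = running */

static long futex(uint32_t *uaddr, int op, uint32_t val)
{
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Receiving thread: enqueue, and only pay for a wakeup if the worker
 * actually went to sleep (the slow path mentioned above). */
void submit(struct request *r)
{
        enqueue(r);
        if (atomic_exchange(&worker_state, 1) == 0)
                futex((uint32_t *)&worker_state, FUTEX_WAKE, 1);
}

/* Worker: drain the queue; publish "asleep" and re-check before waiting
 * so a request enqueued in that window is never missed. */
void worker(void)
{
        for (;;) {
                struct request *r = dequeue();

                if (!r) {
                        atomic_store(&worker_state, 0);
                        if (!(r = dequeue())) {
                                futex((uint32_t *)&worker_state,
                                      FUTEX_WAIT, 0);
                                continue;
                        }
                        atomic_store(&worker_state, 1);
                }
                handle(r);
        }
}

The "I'm almost done" hint would be a third state that the receiving
thread prefers over waking an idle worker; it is omitted here for
brevity.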

-- 
All Rights Reversed.




Re: [PATCH for 5.9 v2 1/4] futex: introduce FUTEX_SWAP operation

2020-08-04 Thread peterz
On Mon, Aug 03, 2020 at 03:15:07PM -0700, Peter Oskolkov wrote:
> A simplified/idealized use case: imagine a multi-user service application
> (e.g. a DBMS) that has to implement the following user CPU quota
> policy:

So the last posting made hackernews, and there a bunch of people
expressed far more interest in coroutines, which, if I'm not mistaken,
can also be implemented using all this.

Would that not make for a far simpler and more convincing use-case?
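
For reference, a hedged sketch of a coroutine switch on top of the
proposed operation. The argument order follows this series (wait on
uaddr, wake one waiter on uaddr2); the opcode value and the wrapper
are assumptions that go away if the ABI changes:

#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef FUTEX_SWAP
#define FUTEX_SWAP 13   /* opcode value used by this patch series */
#endif

/* One kernel thread per coroutine, each parked on its own futex word. */
struct coro {
        uint32_t word;  /* stays 0; we always wait expecting 0 */
};

/* Wake whoever waits on @next and put the caller to sleep on @self in
 * a single syscall; the point of FUTEX_SWAP is that the kernel can
 * context-switch directly to the wakee on the fast path. */
static long coro_yield_to(struct coro *self, struct coro *next)
{
        return syscall(SYS_futex, &self->word, FUTEX_SWAP, 0,
                       NULL /* timeout */, &next->word, 0);
}

Compared with a FUTEX_WAKE followed by a FUTEX_WAIT, this saves a
syscall and avoids waking the other thread onto a remote CPU while
the current one goes idle.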

> - block detection: when a task blocks in the kernel (on a network
>   read, for example), the userspace scheduler is notified and
>   schedules (resumes or swaps into) a pending task in the newly available
>   CPU slot;
> - wake detection: when a task wakes from a previously blocking kernel
>   operation (e.g. can now process some data on a network socket), the
>   userspace scheduler is notified and can now schedule the task to
>   run on a CPU when a CPU is available and the task can use it according
>   to its scheduling policy.
> 
> (Technically, block/wake detection is still experimental and not
> used widely: as we control the userspace, we can actually determine
> blocking/waking syscalls without kernel support).
> 
> Internally we currently use kernel patches that are too "intrusive" to be
> included in a general-purpose Linux kernel, so we are exploring ways to
> upstream this functionality.
> 
> The easiest/least intrusive approach that we have come up with is this:
> 
> - block/resume map perfectly to futex wait/wake;
> - switch_to thus maps to FUTEX_SWAP;
> - block and wake detection can be done either through tracing
>   or by introducing new BPF attach points (when a task blocks or wakes,
>   a BPF program is triggered that then communicates with the userspace);
> - the BPF attach points are per task, and the task needs to "opt in"
>   (i.e. all other tasks suffer just an additional pointer comparison
>   on block/wake);
> - the BPF programs triggered on block/wake should be able to perform
>   futex ops (e.g. wake a designated userspace scheduling task) - this
>   probably indicates that tracing is not enough, and a new BPF prog type
>   is needed.

I really think we want to have block/resume detection sorted before this
goes anywhere; I also strongly feel BPF should not be used for
functional interfaces like that.

That is, I want to see a complete interface before committing to an ABI
that we're stuck with.

I also want to see userspace that goes along with it; like with
sys_membarrier() / liburcu and sys_rseq() / librseq (which seems to be
heading for glibc).

Also, and this seems to be the crux of the whole endeavour, you want to
allow your 'fibers' to block, which is what makes
{make,swap,get,set}context() unsuited for your needs and gives rise to
the whole block/resume issue above.

Also, I want words on the interaction between resume notification and
wake-up preemption. That is, how do you envision managing the
interaction between the two schedulers?


All in all, I don't think you're even close to having something
mergeable.


[PATCH for 5.9 v2 1/4] futex: introduce FUTEX_SWAP operation

2020-08-03 Thread Peter Oskolkov
From: Peter Oskolkov 

As Paul Turner presented at LPC in 2013, Google has developed
an M:N userspace threading subsystem backed by a Google-private
SwitchTo Linux kernel API. This subsystem provides latency-sensitive
services at Google with fine-grained user-space control/scheduling
over what is running when, and it is used widely internally
(known as "schedulers" or "fibers").

A simplified/idealized use case: imagine a multi-user service application
(e.g. a DBMS) that has to implement the following user CPU quota
policy (a toy encoding of the policy follows the list):
- each user (these are DBMS users, not Linux users) can purchase
  certain amounts of expensive, but guaranteed, low-latency
  CPU quota (as a % of total CPUs available to the service), and a certain
  amount of cheap high-latency CPU quota;
- quotas are enforced per second;
- each user RPC/request to the service can specify whether this is
  a latency-critical request that should use the user's low-latency quota,
  and be immediately rejected if the quota for this second is exhausted;
- requests can be high-latency-tolerant: these should use only the
  high-latency quota;
- a request can also be latency-tolerant: it should use the low-latency
  quota if available, or the high-latency quota if the low-latency quota
  is exhausted;
- all "sold" (= allocated) low-latency quotas can go up to, but not exceed,
  70% of all available CPUs (i.e. no over-subscription);
- high-latency quotas are oversubscribed;
- user isolation: misbehaving users should not affect the serving
  latency of users with available low-latency quotas;
- soft deadlines/timeouts: each request/RPC can specify that it must
  be served within a certain deadline (let's say the minimum deadline
  is 10 milliseconds) or dropped if the deadline is exceeded;
- each user request can potentially spawn several processing threads/tasks,
  and do child RPCs to remote services; these threads/tasks should
  also be covered by this quota/policy;
- user requests should be served somewhat in-order: requests that use
  the same quota tiers that arrive earlier should be granted CPU before
  requests that arrive later ("arrival order scheduling").
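
To make the quota mechanics concrete, here is a toy encoding of the
admission decision; the names, microsecond units, and per-second
replenishment are illustrative assumptions based only on the list
above:

#include <stdbool.h>
#include <stdint.h>

/* Per-user budgets, replenished once per second. */
struct user_quota {
        uint64_t low_lat_us;    /* guaranteed; all users' sum <= 70% of CPUs */
        uint64_t high_lat_us;   /* cheap, oversubscribed */
};

enum req_class {
        REQ_LATENCY_CRITICAL,   /* low-latency quota only, else reject */
        REQ_HIGH_LAT_ONLY,      /* high-latency quota only */
        REQ_LATENCY_TOLERANT,   /* low-latency if available, else high */
};

/* Admit one request costing @cost_us of CPU time, or reject it. */
static bool admit(struct user_quota *q, enum req_class c, uint64_t cost_us)
{
        switch (c) {
        case REQ_LATENCY_CRITICAL:
                if (q->low_lat_us < cost_us)
                        return false;   /* rejected immediately */
                q->low_lat_us -= cost_us;
                return true;
        case REQ_HIGH_LAT_ONLY:
                if (q->high_lat_us < cost_us)
                        return false;
                q->high_lat_us -= cost_us;
                return true;
        case REQ_LATENCY_TOLERANT:
                if (q->low_lat_us >= cost_us) {
                        q->low_lat_us -= cost_us;
                        return true;
                }
                if (q->high_lat_us >= cost_us) {
                        q->high_lat_us -= cost_us;
                        return true;
                }
                return false;
        }
        return false;
}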

There are many services at Google that implement a variant of the
scheduling policy outlined above. In reality there are more
priorities/quota tiers, there is service-internal maintenance work that
can be either high or low priority, and sometimes FIFO/LIFO/round-robin
scheduling is used in addition to arrival-order scheduling (for example,
LIFO scheduling has better cache locality in certain scenarios). User
isolation within a process and latency improvements are the main
benefits, on top of the ability to implement complex quota/scheduling
policies.

What is important is that these scheduling policies are driven by
userspace schedulers built on top of these basic kernel primitives
(a rough futex-based sketch appears after the list):
- block: block the current thread/task (with a timeout);
- resume: resume some previously blocked task (should commute
  with block, i.e. racing block/resume pairs should behave
  exactly as if the wake arrived after the block);
- switch_to: block the current thread, resume some previously
  blocked task (behaves exactly as wake(remote), block(current), but
  optimized to do a fast context switch on the fast path);
- block detection: when a task blocks in the kernel (on a network
  read, for example), the userspace scheduler is notified and
  schedules (resumes or swaps into) a pending task in the newly available
  CPU slot;
- wake detection: when a task wakes from a previously blocking kernel
  operation (e.g. can now process some data on a network socket), the
  userspace scheduler is notified and can now schedule the task to
  run on a CPU when a CPU is available and the task can use it according
  to its scheduling policy.

(Technically, block/wake detection is still experimental and not
used widely: as we control the userspace, we can actually determine
blocking/waking syscalls without kernel support).
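
A rough userspace sketch of the three synchronous primitives in futex
terms. The token protocol, the names, and the FUTEX_SWAP opcode value
are assumptions; the proposed op itself is not in mainline:

#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#ifndef FUTEX_SWAP
#define FUTEX_SWAP 13   /* opcode value used by this patch series */
#endif

static _Thread_local uint32_t me;       /* this task's futex word */

/* block: park until a resume "token" arrives, or the timeout expires.
 * Consuming the token before sleeping is what makes block and resume
 * commute: a resume that raced ahead of us is never lost. */
static void block(const struct timespec *timeout)
{
        while (__atomic_exchange_n(&me, 0, __ATOMIC_ACQUIRE) == 0) {
                if (syscall(SYS_futex, &me, FUTEX_WAIT, 0, timeout,
                            NULL, 0) == -1 && errno == ETIMEDOUT)
                        return;
        }
}

/* resume: deposit a token, then wake the task if it already sleeps. */
static void resume(uint32_t *other)
{
        __atomic_store_n(other, 1, __ATOMIC_RELEASE);
        syscall(SYS_futex, other, FUTEX_WAKE, 1, NULL, NULL, 0);
}

/* switch_to: semantically resume(other) + block(NULL), merged into one
 * syscall so the kernel can switch straight to @other on the fast path.
 * (A production version needs the same token re-check as block().) */
static void switch_to(uint32_t *other)
{
        __atomic_store_n(other, 1, __ATOMIC_RELEASE);
        syscall(SYS_futex, &me, FUTEX_SWAP, 0, NULL, other, 0);
}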

Internally we currently use kernel patches that are too "intrusive" to be
included in a general-purpose Linux kernel, so we are exploring ways to
upstream this functionality.

The easiest/least intrusive approach that we have come up with is this:

- block/resume map perfectly to futex wait/wake;
- switch_to thus maps to FUTEX_SWAP;
- block and wake detection can be done either through tracing
  or by introducing new BPF attach points (when a task blocks or wakes,
  a BPF program is triggered that then communicates with the userspace);
- the BPF attach points are per task, and the task needs to "opt in"
  (i.e. all other tasks suffer just an additional pointer comparison
  on block/wake);
- the BPF programs triggered on block/wake should be able to perform
  futex ops (e.g. wake a designated userspace scheduling task) - this
  probably indicates that tracing is not enough, and a new BPF prog type
  is needed.

In addition to the above, another common use case for FUTEX_SWAP is
message passing à la RPC
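
Such a message-passing round trip might look roughly like this: the
client swaps to the server with a request and the server swaps back
with the reply, so a round trip costs two FUTEX_SWAP calls instead of
four wake/wait calls. The buffer, helpers, and word protocol are
illustrative assumptions, and a real version needs the usual futex
value protocol against lost wakeups:

#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef FUTEX_SWAP
#define FUTEX_SWAP 13   /* opcode value used by this patch series */
#endif

extern void prepare_request(char *buf); /* illustrative helpers */
extern void handle_request(char *buf);
extern void consume_reply(char *buf);

static uint32_t client_word, server_word;
static char msg[512];                   /* shared request/reply buffer */

/* Wait on @wait_on (expecting 0) and wake one waiter on @wake. */
static long swap_to(uint32_t *wait_on, uint32_t *wake)
{
        return syscall(SYS_futex, wait_on, FUTEX_SWAP, 0, NULL, wake, 0);
}

void client_call(void)
{
        prepare_request(msg);
        swap_to(&client_word, &server_word);    /* run the server; sleep */
        consume_reply(msg);                     /* server swapped back */
}

void server_loop(void)
{
        /* park until the first request swaps in */
        syscall(SYS_futex, &server_word, FUTEX_WAIT, 0, NULL, NULL, 0);
        for (;;) {
                handle_request(msg);
                /* wake the client and wait for the next request,
                 * in a single call */
                swap_to(&server_word, &client_word);
        }
}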