On Tue, Jul 30, 2019 at 06:06:02PM -0400, Gabriel Krisman Bertazi wrote:
> This is a new futex operation, called FUTEX_WAIT_MULTIPLE, which allows
> a thread to wait on several futexes at the same time, and be awoken by
> any of them.  In a sense, it implements one of the features that was
> supported by pooling on the old FUTEX_FD interface.
> 
> My use case for this operation lies in Wine, where we want to implement
> a similar interface available in Windows, used mainly for event
> handling.  The wine folks have an implementation that uses eventfd, but
> it suffers from FD exhaustion (I was told they have application that go
> to the order of multi-milion FDs), and higher CPU utilization.

So is multi-million the range we expect for @count ?

If so, we're having problems, see below.

> In time, we are also proposing modifications to glibc and libpthread to
> make this feature available for Linux native multithreaded applications
> using libpthread, which can benefit from the behavior of waiting on any
> of a group of futexes.
> 
> In particular, using futexes in our Wine use case reduced the CPU
> utilization by 4% for the game Beat Saber and by 1.5% for the game
> Shadow of Tomb Raider, both running over Proton (a wine based solution
> for Windows emulation), when compared to the eventfd interface. This
> implementation also doesn't rely of file descriptors, so it doesn't risk
> overflowing the resource.
> 
> Technically, the existing FUTEX_WAIT implementation can be easily
> reworked by using do_futex_wait_multiple with a count of one, and I
> have a patch showing how it works.  I'm not proposing it, since
> futex is such a tricky code, that I'd be more confortable to have
> FUTEX_WAIT_MULTIPLE running upstream for a couple development cycles,
> before considering modifying FUTEX_WAIT.
> 
> From an implementation perspective, the futex list is passed as an array
> of (pointer,value,bitset) to the kernel, which will enqueue all of them
> and sleep if none was already triggered. It returns a hint of which
> futex caused the wake up event to userspace, but the hint doesn't
> guarantee that is the only futex triggered.  Before calling the syscall
> again, userspace should traverse the list, trying to re-acquire any of
> the other futexes, to prevent an immediate -EWOULDBLOCK return code from
> the kernel.

> Signed-off-by: Zebediah Figura <z.figur...@gmail.com>
> Signed-off-by: Steven Noonan <ste...@valvesoftware.com>
> Signed-off-by: Pierre-Loup A. Griffais <pgriff...@valvesoftware.com>
> Signed-off-by: Gabriel Krisman Bertazi <kris...@collabora.com>

That is not a valid SoB chain.

> ---
>  include/uapi/linux/futex.h |   7 ++
>  kernel/futex.c             | 161 ++++++++++++++++++++++++++++++++++++-
>  2 files changed, 164 insertions(+), 4 deletions(-)
> 
> diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
> index a89eb0accd5e..2401c4cf5095 100644
> --- a/include/uapi/linux/futex.h
> +++ b/include/uapi/linux/futex.h

> @@ -150,4 +151,10 @@ struct robust_list_head {
>    (((op & 0xf) << 28) | ((cmp & 0xf) << 24)          \
>     | ((oparg & 0xfff) << 12) | (cmparg & 0xfff))
>  
> +struct futex_wait_block {
> +     __u32 __user *uaddr;
> +     __u32 val;
> +     __u32 bitset;
> +};

That is not compat invariant and I see a distinct lack of compat code in
this patch.

> diff --git a/kernel/futex.c b/kernel/futex.c
> index 91f3db335c57..2623e8f152cd 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c

no function comment in sight

> +static int do_futex_wait_multiple(struct futex_wait_block *wb,
> +                               u32 count, unsigned int flags,
> +                               ktime_t *abs_time)
> +{
> +

(spurious empty line)

> +     struct hrtimer_sleeper timeout, *to;
> +     struct futex_hash_bucket *hb;
> +     struct futex_q *qs = NULL;
> +     int ret;
> +     int i;
> +
> +     qs = kcalloc(count, sizeof(struct futex_q), GFP_KERNEL);
> +     if (!qs)
> +             return -ENOMEM;

This will not work for @count ~ 1e6, or rather, MAX_ORDER is 11, so we
can, at most, allocate 4096 << 11 bytes, and since sizeof(futex_q) ==
112, that gives: ~75k objects.

Also; this is the only actual limit placed on @count.

Jann, Al, this also allows a single task to increment i_count or
mm_count by ~75k, which might be really awesome for refcount smashing
attacks.

> +
> +     to = futex_setup_timer(abs_time, &timeout, flags,
> +                            current->timer_slack_ns);
> + retry:

(wrongly indented label)

> +     for (i = 0; i < count; i++) {
> +             qs[i].key = FUTEX_KEY_INIT;
> +             qs[i].bitset = wb[i].bitset;
> +
> +             ret = get_futex_key(wb[i].uaddr, flags & FLAGS_SHARED,
> +                                 &qs[i].key, FUTEX_READ);
> +             if (unlikely(ret != 0)) {
> +                     for (--i; i >= 0; i--)
> +                             put_futex_key(&qs[i].key);
> +                     goto out;
> +             }
> +     }
> +
> +     set_current_state(TASK_INTERRUPTIBLE);
> +
> +     for (i = 0; i < count; i++) {
> +             ret = __futex_wait_setup(wb[i].uaddr, wb[i].val,
> +                                      flags, &qs[i], &hb);
> +             if (ret) {
> +                     /* Drop the failed key directly.  keys 0..(i-1)
> +                      * will be put by unqueue_me.
> +                      */

(broken comment style)

> +                     put_futex_key(&qs[i].key);
> +
> +                     /* Undo the partial work we did. */
> +                     for (--i; i >= 0; i--)
> +                             unqueue_me(&qs[i]);
> +
> +                     __set_current_state(TASK_RUNNING);
> +                     if (ret > 0)
> +                             goto retry;
> +                     goto out;
> +             }
> +
> +             /* We can't hold to the bucket lock when dealing with
> +              * the next futex. Queue ourselves now so we can unlock
> +              * it before moving on.
> +              */

(broken comment style)

> +             queue_me(&qs[i], hb);
> +     }
> +
> +     if (to)
> +             hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
> +
> +     /* There is no easy to way to check if we are wake already on
> +      * multiple futexes without waking through each one of them.  So
> +      * just sleep and let the scheduler handle it.
> +      */

(broken comment style)

> +     if (!to || to->task)
> +             freezable_schedule();
> +
> +     __set_current_state(TASK_RUNNING);
> +
> +     ret = -ETIMEDOUT;
> +     /* If we were woken (and unqueued), we succeeded. */
> +     for (i = 0; i < count; i++)
> +             if (!unqueue_me(&qs[i]))
> +                     ret = i;

(missing {})

> +
> +     /* Succeed wakeup */
> +     if (ret >= 0)
> +             goto out;
> +
> +     /* Woken by triggered timeout */
> +     if (to && !to->task)
> +             goto out;
> +
> +     /*
> +      * We expect signal_pending(current), but we might be the
> +      * victim of a spurious wakeup as well.
> +      */

(curiously correct comment style -- which makes the patch
self-inconsistent)

> +     if (!signal_pending(current))
> +             goto retry;

I think that if you invest in a few helper functions; the above can be
reduced and written more like a normal wait loop.

> +
> +     ret = -ERESTARTSYS;
> +     if (!abs_time)
> +             goto out;
> +
> +     ret = -ERESTART_RESTARTBLOCK;
> + out:

(wrong label indent)

> +     if (to) {
> +             hrtimer_cancel(&to->timer);
> +             destroy_hrtimer_on_stack(&to->timer);
> +     }
> +
> +     kfree(qs);
> +     return ret;
> +}
> +

distinct lack of function comments

> +static int futex_wait_multiple(u32 __user *uaddr, unsigned int flags,
> +                            u32 count, ktime_t *abs_time)
> +{
> +     struct futex_wait_block *wb;
> +     struct restart_block *restart;
> +     int ret;
> +
> +     if (!count)
> +             return -EINVAL;
> +
> +     wb = kcalloc(count, sizeof(struct futex_wait_block), GFP_KERNEL);
> +     if (!wb)
> +             return -ENOMEM;
> +
> +     if (copy_from_user(wb, uaddr,
> +                        count * sizeof(struct futex_wait_block))) {
> +             ret = -EFAULT;
> +             goto out;
> +     }

I'm thinking we can do away with this giant copy and do it one at a time
from the other function, just extend the storage allocated there to
store whatever values are still required later.

Do we want to impose alignment constraints on uaddr?

> +     ret = do_futex_wait_multiple(wb, count, flags, abs_time);
> +
> +     if (ret == -ERESTART_RESTARTBLOCK) {
> +             restart = &current->restart_block;
> +             restart->fn = futex_wait_restart;
> +             restart->futex.uaddr = uaddr;
> +             restart->futex.val = count;
> +             restart->futex.time = *abs_time;
> +             restart->futex.flags = (flags | FLAGS_HAS_TIMEOUT |
> +                                     FLAGS_WAKE_MULTIPLE);
> +     }
> +
> +out:

(inconsistent correctly indented label)

> +     kfree(wb);
> +     return ret;
> +}

Reply via email to