----- On Jul 28, 2017, at 4:55 AM, Peter Zijlstra wrote:

> On Thu, Jul 27, 2017 at 05:13:14PM -0400, Mathieu Desnoyers wrote:
>> +static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
>> +            struct mm_struct *oldmm)
> 
> That is a bit of a mouth-full...
> 
>> +{
>> +    if (!IS_ENABLED(CONFIG_MEMBARRIER))
>> +            return;
>> +    /*
>> +     * __schedule()
>> +     *   finish_task_switch()
>> +     *    if (mm)
>> +     *      mmdrop(mm)
>> +     *        atomic_dec_and_test()
>         *
>> +     * takes care of issuing a memory barrier when oldmm is
>> +     * non-NULL. We also don't need the barrier when switching to a
>> +     * kernel thread, nor when we switch between threads belonging
>> +     * to the same process.
>> +     */
>> +    if (likely(oldmm || !mm || mm == oldmm))
>> +            return;
>> +    /*
>> +     * When switching between processes, membarrier expedited
>> +     * private requires a memory barrier after we set the current
>> +     * task.
>> +     */
>> +    smp_mb();
>> +}
> 
> And because of what it complements, I would have expected the callsite:
> 
>> @@ -2737,6 +2763,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
>>  
>>      mm = next->mm;
>>      oldmm = prev->active_mm;
>> +    membarrier_expedited_mb_after_set_current(mm, oldmm);
>>      /*
>>       * For paravirt, this is coupled with an exit in switch_to to
>>       * combine the page table reload and the switch backend into
> 
> to be in finish_task_switch(), something like:
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e9785f7aed75..33f34a201255 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2641,8 +2641,18 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>       finish_arch_post_lock_switch();
> 
>       fire_sched_in_preempt_notifiers(current);
> +
> +     /*
> +      * For CONFIG_MEMBARRIER we need a full memory barrier after the
> +      * rq->curr assignment. Not all architectures have one in either
> +      * switch_to() or switch_mm() so we use (and complement) the one
> +      * implied by mmdrop()'s atomic_dec_and_test().
> +      */
>       if (mm)
>               mmdrop(mm);
> +     else if (IS_ENABLED(CONFIG_MEMBARRIER))
> +             smp_mb();
> +
>       if (unlikely(prev_state == TASK_DEAD)) {
>               if (prev->sched_class->task_dead)
>                       prev->sched_class->task_dead(prev);
> 
> 
> I realize this is sub-optimal if we're switching to a kernel thread, so
> it might want some work, then again, a whole bunch of architectures
> don't in fact need this extra barrier at all.

As discussed on IRC, I plan to go instead for:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 01e3b881ab3a..dd677fb2ee92 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2636,6 +2636,11 @@ static struct rq *finish_task_switch(struct task_struct *prev)
        vtime_task_switch(prev);
        perf_event_task_sched_in(prev, current);
        finish_lock_switch(rq, prev);
+       /*
+        * The membarrier system call requires a full memory barrier
+        * after storing to rq->curr, before going back to user-space.
+        */
+       smp_mb__after_unlock_lock();
        finish_arch_post_lock_switch();
 
        fire_sched_in_preempt_notifiers(current);

Which is free on most architectures, except those defining
CONFIG_ARCH_WEAK_RELEASE_ACQUIRE. CCing PPC maintainers.
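
To illustrate why this is free on most architectures, here is a minimal
userspace sketch of the conditional definition as I understand it. The
smp_mb() stand-in built on C11 fences, the UNLOCK_LOCK_BARRIER_IS_FREE
macro and the context_switch_model() helper are assumptions made for this
example only; they are not kernel code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Assumption: userspace stand-in for the kernel's smp_mb(). */
#define smp_mb()	atomic_thread_fence(memory_order_seq_cst)

/*
 * Sketch of the conditional definition: a real barrier is only emitted
 * when the architecture's unlock+lock sequence is weaker than a full
 * barrier; everywhere else the macro compiles to nothing.
 */
#ifdef CONFIG_ARCH_WEAK_RELEASE_ACQUIRE
#define smp_mb__after_unlock_lock()	smp_mb()
#define UNLOCK_LOCK_BARRIER_IS_FREE	false	/* hypothetical, for the demo */
#else
#define smp_mb__after_unlock_lock()	do { } while (0)
#define UNLOCK_LOCK_BARRIER_IS_FREE	true	/* hypothetical, for the demo */
#endif

/*
 * Hypothetical model of the ordering the membarrier system call needs:
 * the store to rq->curr (done in __schedule() in the kernel), the
 * runqueue unlock (finish_lock_switch()), then the barrier before
 * returning to user-space. Returns true when the barrier costs nothing.
 */
static bool context_switch_model(atomic_int *rq_curr, int next)
{
	atomic_store_explicit(rq_curr, next, memory_order_relaxed);
	/* The runqueue lock would be released here. */
	smp_mb__after_unlock_lock();
	return UNLOCK_LOCK_BARRIER_IS_FREE;
}
```

Building the sketch with -DCONFIG_ARCH_WEAK_RELEASE_ACQUIRE switches the
macro to the full-barrier variant.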

> 
>> +static void membarrier_private_expedited(void)
>> +{
>> +    int cpu, this_cpu;
>> +    bool fallback = false;
>> +    cpumask_var_t tmpmask;
>> +
>> +    if (num_online_cpus() == 1)
>> +            return;
>> +
>> +    /*
>> +     * Matches memory barriers around rq->curr modification in
>> +     * scheduler.
>> +     */
>> +    smp_mb();       /* system call entry is not a mb. */
>> +
> 
> Weren't you going to put in a comment on that GFP_NOWAIT thing?

I only added it to the uapi header. Adding this to the implementation
too:

+       /*
+        * Expedited membarrier commands guarantee that they won't
+        * block, hence the GFP_NOWAIT allocation flag and fallback
+        * implementation.
+        */



> 
>> +    if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> 
> You really want: zalloc_cpumask_var().

ok

> 
>> +            /* Fallback for OOM. */
>> +            fallback = true;
>> +    }
>> +
>> +    /*
>> +     * Skipping the current CPU is OK even though we can be
>> +     * migrated at any point. The current CPU, at the point where we
>> +     * read raw_smp_processor_id(), is ensured to be in program
>> +     * order with respect to the caller thread. Therefore, we can
>> +     * skip this CPU from the iteration.
>> +     */
>> +    this_cpu = raw_smp_processor_id();
> 
> So if instead you do the below, that is still true, but you have the
> opportunity to skip moar CPUs, then again, if you migrate the wrong way
> you'll end up not skipping yourself.. ah well.

Chances of skipping more CPUs in the face of migration are better if we
do the check in the loop as you suggest. Will do.

> 
>> +    cpus_read_lock();
>> +    for_each_online_cpu(cpu) {
>> +            struct task_struct *p;
>> +
>               if (cpu == raw_smp_processor_id())
>                       continue;
> 
>> +            rcu_read_lock();
>> +            p = task_rcu_dereference(&cpu_rq(cpu)->curr);
>> +            if (p && p->mm == current->mm) {
>> +                    if (!fallback)
>> +                            __cpumask_set_cpu(cpu, tmpmask);
>> +                    else
>> +                            smp_call_function_single(cpu, ipi_mb, NULL, 1);
>> +            }
>> +            rcu_read_unlock();
>> +    }
>> +    cpus_read_unlock();
> 
> This ^, wants to go after that v
> 
>> +    if (!fallback) {
>> +            smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
>> +            free_cpumask_var(tmpmask);
>> +    }
> 
> Because otherwise the bits in your tmpmask might no longer match the
> online state.

Good point, thanks!
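
For reference, a minimal userspace model of the loop with the changes
agreed above folded in: the current CPU is skipped inside the loop, the
mask starts zeroed, and the batched IPIs go out before the hotplug lock
is dropped. Everything here (struct cpu_model, ipi_mb_model(),
membarrier_model(), the 8-CPU limit) is hypothetical and only mirrors
the control flow; it is not the kernel implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 8	/* assumption: small fixed CPU count for the model */

/* Hypothetical model of one runqueue: online state and mm id of rq->curr. */
struct cpu_model {
	bool online;
	int curr_mm;	/* 0 models a kernel thread (no mm) */
};

static uint32_t ipi_sent;	/* one bit per CPU that got the model IPI */

static void ipi_mb_model(int cpu)
{
	ipi_sent |= UINT32_C(1) << cpu;
}

/*
 * have_mask == false models a failed GFP_NOWAIT allocation, which takes
 * the one-IPI-per-CPU fallback. Returns the number of IPIs sent.
 */
static int membarrier_model(const struct cpu_model *cpus, int this_cpu,
			    int mm, bool have_mask)
{
	uint32_t tmpmask = 0;	/* zeroed, as zalloc_cpumask_var() would do */
	int cpu, sent = 0;

	/* cpus_read_lock() would be taken here. */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!cpus[cpu].online)
			continue;
		if (cpu == this_cpu)	/* the kernel re-reads this each iteration */
			continue;
		if (cpus[cpu].curr_mm != mm)
			continue;
		if (have_mask) {
			tmpmask |= UINT32_C(1) << cpu;
		} else {
			ipi_mb_model(cpu);
			sent++;
		}
	}
	/* Batched IPIs are sent while the online state is still stable. */
	for (cpu = 0; have_mask && cpu < NR_CPUS; cpu++) {
		if (tmpmask & (UINT32_C(1) << cpu)) {
			ipi_mb_model(cpu);
			sent++;
		}
	}
	/* cpus_read_unlock() would be dropped here, after the IPIs. */
	return sent;
}
```

Both paths reach the same set of CPUs; the mask path only batches the
IPIs into one call.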

Mathieu

> 
>> +
>> +    /*
>> +     * Memory barrier on the caller thread _after_ we finished
>> +     * waiting for the last IPI. Matches memory barriers around
>> +     * rq->curr modification in scheduler.
>> +     */
>> +    smp_mb();       /* exit from system call is not a mb */
>> +}

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
