On Wed, Jul 26, 2017 at 08:41:10AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 26, 2017 at 09:41:28AM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 25, 2017 at 04:59:36PM -0700, Paul E. McKenney wrote:
> > Sure, but SCHED_OTHER auto throttles in that if there's anything else to
> > run, you get to wait. So you can't generate an IPI storm with it. Also,
> > again, we can be limited to a subset of CPUs.
>
> OK, what is its auto-throttle policy?  One round of IPIs per jiffy or
> some such?

No. It's called wakeup latency :-) Your SCHED_OTHER task will not get to
insta-run all the time. If there are other tasks already running, we'll
not IPI unless it should preempt. If it's idle, nobody cares..

> Does this auto-throttling also apply if the user is running a CPU-bound
> SCHED_BATCH or SCHED_IDLE task on each CPU, and periodically waking up
> one of a large group of SCHED_OTHER tasks, where the SCHED_OTHER tasks
> immediately sleep upon being awakened?

SCHED_BATCH is even more likely to suffer wakeup latency since it will
never preempt anything.

> OK, and the rq->curr assignment is in common code, correct?  Does this
> allow the IPI-only-requesting-process approach to live entirely within
> common code?

That is the idea.

> The 2010 email thread ended up with sys_membarrier() acquiring the
> runqueue lock for each CPU,

Yes, that's something I'm not happy with. Machine-wide banging of that
lock will be a performance no-no.

> because doing otherwise meant adding code to the scheduler fastpath.

And that's obviously another thing I'm not happy with either.

> Don't we still need to do this?
>
>	https://marc.info/?l=linux-kernel&m=126341138408407&w=2
>	https://marc.info/?l=linux-kernel&m=126349766324224&w=2

I don't know.. those seem focussed on mm_cpumask() and we can't use that
per Will's email. So I think we need to think anew on this, start from
the ground up.

What is missing for this:

	static void ipi_mb(void *info)
	{
		smp_mb();	// IPIs should be serializing but paranoid
	}

	sys_membarrier()
	{
		smp_mb();	// because sysenter isn't an unconditional mb

		for_each_online_cpu(cpu) {
			struct task_struct *p;

			rcu_read_lock();
			p = task_rcu_dereference(&cpu_curr(cpu));
			if (p && p->mm == current->mm)
				__cpumask_set_cpu(cpu, cpus);
			rcu_read_unlock();
		}

		on_each_cpu_mask(cpus, ipi_mb, NULL, 1); // does local smp_mb() too
	}

VS

	__schedule()
	{
		spin_lock(&rq->lock);
		smp_mb__after_spinlock();	// really full mb implied

		/* lots */

		if (likely(prev != next)) {
			rq->curr = next;

			context_switch()
			{
				switch_mm();
				switch_to();	// neither need imply a barrier

				spin_unlock(&rq->lock);
			}
		}
	}

So I think we need either switch_mm() or switch_to() to imply a full
barrier for this to work, otherwise we get:

	CPU0				CPU1

					lock rq->lock
					mb
					rq->curr = A
					unlock rq->lock

					lock rq->lock
					mb

	sys_membarrier()
	  mb
	  for_each_online_cpu()
	    p = A	// no match, no IPI
	  mb
					rq->curr = B
					unlock rq->lock

And that's bad, because now CPU0 doesn't have an MB happening _after_
sys_membarrier() if B matches.

So without audit, I only know of PPC and Alpha not having a barrier in
either switch_*().

x86 obviously has barriers all over the place, arm has a super duper
heavy barrier in switch_to().

One option might be to resurrect spin_unlock_wait(), although to use
that here is really ugly too, but it would avoid thrashing the rq->lock.
I think it'd end up having to look like:

	rq = cpu_rq(cpu);
again:
	rcu_read_lock();
	p = task_rcu_dereference(&rq->curr);
	if (p) {
		raw_spin_unlock_wait(&rq->lock);
		q = task_rcu_dereference(&rq->curr);
		if (q != p) {
			rcu_read_unlock();
			goto again;
		}
	}
	...

... which is just about as horrible as it looks.
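
To make the sys_membarrier() sketch above a bit more concrete, here is a
filled-in version of the same idea. It is only a sketch, not a proposal:
membarrier_private_expedited() is a placeholder name, the on-stack cpumask_t
ignores CONFIG_CPUMASK_OFFSTACK and CPU hotplug, it leans on the
sched-internal cpu_curr() helper, and it still has the ordering hole shown
in the CPU0/CPU1 diagram above.

	static void ipi_mb(void *info)
	{
		smp_mb();	/* IPIs should be serializing, but be paranoid */
	}

	/* placeholder name, not an existing kernel function */
	static int membarrier_private_expedited(void)
	{
		cpumask_t cpus;		/* on-stack; only sane for small NR_CPUS */
		int cpu;

		cpumask_clear(&cpus);

		smp_mb();	/* syscall entry is not an unconditional full mb */

		for_each_online_cpu(cpu) {
			struct task_struct *p;

			rcu_read_lock();
			p = task_rcu_dereference(&cpu_curr(cpu));	/* sched-internal helper */
			if (p && p->mm == current->mm)
				__cpumask_set_cpu(cpu, &cpus);
			rcu_read_unlock();
		}

		/* runs ipi_mb() on the local CPU too, so we get our own smp_mb() */
		on_each_cpu_mask(&cpus, ipi_mb, NULL, 1);

		return 0;
	}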

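On the switch_mm()/switch_to() barrier requirement: what the diagram boils
down to is that there must be a full memory barrier between the
rq->curr = next store and the next task's subsequent memory accesses, and
on PPC and Alpha nothing in switch_*() currently provides one. Purely as an
illustration of where such a barrier would have to sit in the __schedule()
skeleton above (smp_mb__after_rq_curr() is a made-up name, not an existing
primitive; x86 and arm already get the equivalent from their existing
switch_*() barriers):

	__schedule()
	{
		spin_lock(&rq->lock);
		smp_mb__after_spinlock();	/* full mb, as in the skeleton above */

		if (likely(prev != next)) {
			rq->curr = next;

			/*
			 * Hypothetical: a full barrier here would close the
			 * window in the diagram; without it (or an equivalent
			 * barrier in switch_mm()/switch_to()), B can run on
			 * CPU1 unordered against sys_membarrier() on CPU0,
			 * which already decided, based on seeing A, not to
			 * IPI this CPU.
			 */
			smp_mb__after_rq_curr();	/* made-up name */

			context_switch();	/* switch_mm(), switch_to(), unlock */
		}
	}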

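And for completeness, the spin_unlock_wait() variant wrapped into the same
mask-building loop; the elided part presumably ends with the same ->mm check
as the first sketch. Same caveats as before: raw_spin_unlock_wait() would
need to be kept alive, the function name is a placeholder, and
cpu_rq()/rq->curr are sched-internal.

	/* placeholder name, not an existing kernel function */
	static void membarrier_fill_mask(struct cpumask *cpus)
	{
		int cpu;

		for_each_online_cpu(cpu) {
			struct rq *rq = cpu_rq(cpu);
			struct task_struct *p, *q;

	again:
			rcu_read_lock();
			p = task_rcu_dereference(&rq->curr);
			if (p) {
				/* wait for any in-flight __schedule() to drop rq->lock */
				raw_spin_unlock_wait(&rq->lock);

				/* rq->curr changed under us; retry this CPU */
				q = task_rcu_dereference(&rq->curr);
				if (q != p) {
					rcu_read_unlock();
					goto again;
				}
			}
			if (p && p->mm == current->mm)
				__cpumask_set_cpu(cpu, cpus);
			rcu_read_unlock();
		}
	}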