On Wed, Jul 26, 2017 at 08:41:10AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 26, 2017 at 09:41:28AM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 25, 2017 at 04:59:36PM -0700, Paul E. McKenney wrote:
> > Sure, but SCHED_OTHER auto throttles in that if there's anything else to
> > run, you get to wait. So you can't generate an IPI storm with it. Also,
> > again, we can be limited to a subset of CPUs.
>
> OK, what is its auto-throttle policy?  One round of IPIs per jiffy or
> some such?

No. It's called wakeup latency :-) Your SCHED_OTHER task will not get to
insta-run all the time. If there are other tasks already running, we'll
not IPI unless it should preempt. If it's idle, nobody cares..

> Does this auto-throttling also apply if the user is running a CPU-bound
> SCHED_BATCH or SCHED_IDLE task on each CPU, and periodically waking up
> one of a large group of SCHED_OTHER tasks, where the SCHED_OTHER tasks
> immediately sleep upon being awakened?

SCHED_BATCH is even more likely to suffer wakeup latency since it will
never preempt anything.

> OK, and the rq->curr assignment is in common code, correct?  Does this
> allow the IPI-only-requesting-process approach to live entirely within
> common code?

That is the idea.

> The 2010 email thread ended up with sys_membarrier() acquiring the
> runqueue lock for each CPU,

Yes, that's something I'm not happy with. Machine-wide banging of that
lock will be a performance no-no.

> because doing otherwise meant adding code to the scheduler fastpath.

And that's obviously another thing I'm not happy with either.

> Don't we still need to do this?
>
>	https://marc.info/?l=linux-kernel&m=126341138408407&w=2
>	https://marc.info/?l=linux-kernel&m=126349766324224&w=2

I don't know.. those seem focussed on mm_cpumask() and we can't use that
per Will's email. So I think we need to think anew on this, start from
the ground up.

What is missing for this:

	static void ipi_mb(void *info)
	{
		smp_mb();	// IPIs should be serializing but paranoid
	}

	sys_membarrier()
	{
		smp_mb();	// because sysenter isn't an unconditional mb

		for_each_online_cpu(cpu) {
			struct task_struct *p;

			rcu_read_lock();
			p = task_rcu_dereference(&cpu_curr(cpu));
			if (p && p->mm == current->mm)
				__cpumask_set_cpu(cpu, cpus);
			rcu_read_unlock();
		}

		on_each_cpu_mask(cpus, ipi_mb, NULL, 1); // does local smp_mb() too
	}

VS

	__schedule()
	{
		spin_lock(&rq->lock);
		smp_mb__after_spinlock();	// really full mb implied

		/* lots */

		if (likely(prev != next)) {
			rq->curr = next;

			context_switch()
			{
				switch_mm();
				switch_to();	// neither need imply a barrier

				spin_unlock(&rq->lock);
			}
		}
	}

So I think we need either switch_mm() or switch_to() to imply a full
barrier for this to work, otherwise we get:

	CPU0				CPU1

					lock rq->lock
					mb
					rq->curr = A
					unlock rq->lock

					lock rq->lock
					mb

	sys_membarrier()
	  mb
	  for_each_online_cpu()
	    p = A	// no match, no IPI
	  mb
					rq->curr = B
					unlock rq->lock

And that's bad, because now CPU0 doesn't have an MB happening _after_
sys_membarrier() if B matches.

So without audit, I only know of PPC and Alpha not having a barrier in
either switch_*().

x86 obviously has barriers all over the place, arm has a super duper
heavy barrier in switch_to().

One option might be to resurrect spin_unlock_wait(), although to use
that here is really ugly too, but it would avoid thrashing the rq->lock.
I think it'd end up having to look like:

	rq = cpu_rq(cpu);
again:
	rcu_read_lock();
	p = task_rcu_dereference(&rq->curr);
	if (p) {
		raw_spin_unlock_wait(&rq->lock);
		q = task_rcu_dereference(&rq->curr);
		if (q != p) {
			rcu_read_unlock();
			goto again;
		}
	}
	...

... which is just about as horrible as it looks.
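
To make the sys_membarrier() sketch above a bit more concrete, here is a
filled-in version of the same idea. It is only a sketch, not a proposal:
membarrier_private_expedited() is a placeholder name, the on-stack cpumask_t
ignores CONFIG_CPUMASK_OFFSTACK and CPU hotplug, it leans on the
sched-internal cpu_curr() helper, and it still has the ordering hole shown
in the CPU0/CPU1 diagram above.

	static void ipi_mb(void *info)
	{
		smp_mb();	/* IPIs should be serializing, but be paranoid */
	}

	/* placeholder name, not an existing kernel function */
	static int membarrier_private_expedited(void)
	{
		cpumask_t cpus;		/* on-stack; only sane for small NR_CPUS */
		int cpu;

		cpumask_clear(&cpus);

		smp_mb();	/* syscall entry is not an unconditional full mb */

		for_each_online_cpu(cpu) {
			struct task_struct *p;

			rcu_read_lock();
			p = task_rcu_dereference(&cpu_curr(cpu));	/* sched-internal helper */
			if (p && p->mm == current->mm)
				__cpumask_set_cpu(cpu, &cpus);
			rcu_read_unlock();
		}

		/* runs ipi_mb() on the local CPU too, so we get our own smp_mb() */
		on_each_cpu_mask(&cpus, ipi_mb, NULL, 1);

		return 0;
	}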

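On the switch_mm()/switch_to() barrier requirement: what the diagram boils
down to is that there must be a full memory barrier between the
rq->curr = next store and the next task's subsequent memory accesses, and
on PPC and Alpha nothing in switch_*() currently provides one. Purely as an
illustration of where such a barrier would have to sit in the __schedule()
skeleton above (smp_mb__after_rq_curr() is a made-up name, not an existing
primitive; x86 and arm already get the equivalent from their existing
switch_*() barriers):

	__schedule()
	{
		spin_lock(&rq->lock);
		smp_mb__after_spinlock();	/* full mb, as in the skeleton above */

		if (likely(prev != next)) {
			rq->curr = next;

			/*
			 * Hypothetical: a full barrier here would close the
			 * window in the diagram; without it (or an equivalent
			 * barrier in switch_mm()/switch_to()), B can run on
			 * CPU1 unordered against sys_membarrier() on CPU0,
			 * which already decided, based on seeing A, not to
			 * IPI this CPU.
			 */
			smp_mb__after_rq_curr();	/* made-up name */

			context_switch();	/* switch_mm(), switch_to(), unlock */
		}
	}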

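And for completeness, the spin_unlock_wait() variant wrapped into the same
mask-building loop; the elided part presumably ends with the same ->mm check
as the first sketch. Same caveats as before: raw_spin_unlock_wait() would
need to be kept alive, the function name is a placeholder, and
cpu_rq()/rq->curr are sched-internal.

	/* placeholder name, not an existing kernel function */
	static void membarrier_fill_mask(struct cpumask *cpus)
	{
		int cpu;

		for_each_online_cpu(cpu) {
			struct rq *rq = cpu_rq(cpu);
			struct task_struct *p, *q;

	again:
			rcu_read_lock();
			p = task_rcu_dereference(&rq->curr);
			if (p) {
				/* wait for any in-flight __schedule() to drop rq->lock */
				raw_spin_unlock_wait(&rq->lock);

				/* rq->curr changed under us; retry this CPU */
				q = task_rcu_dereference(&rq->curr);
				if (q != p) {
					rcu_read_unlock();
					goto again;
				}
			}
			if (p && p->mm == current->mm)
				__cpumask_set_cpu(cpu, cpus);
			rcu_read_unlock();
		}
	}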