Re: [PATCH -v2 1/5] sched: Fix ttwu() race
On Thu, Jul 23, 2020 at 10:11:28PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 23, 2020 at 08:41:03PM +0100, Chris Wilson wrote:
> 
> > I am very sorry for the wild goose chase.
> 
> *phew*... all good then. I was starting to go a little ga-ga trying to
> make sense of things.
> 
> Arguably we should probably do something like:
> 
> @@ -4555,7 +4572,7 @@ asmlinkage __visible void __sched preempt_schedule_irq(void)
>  int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
>  			  void *key)
>  {
> -	return try_to_wake_up(curr->private, mode, wake_flags);
> +	return try_to_wake_up(curr->private, mode, wake_flags & WF_SYNC);
>  }
>  EXPORT_SYMBOL(default_wake_function);

If you do:

Tested-by: Paul E. McKenney

This was about nine hours of each of the default rcutorture scenarios.

							Thanx, Paul

> Since I don't think anybody uses anything other than WF_SYNC, ever. And
> the rest of the WF_flags are used internally.
> 
> Thanks Chris!
Re: [PATCH -v2 1/5] sched: Fix ttwu() race
On Thu, Jul 23, 2020 at 08:41:03PM +0100, Chris Wilson wrote:
> I am very sorry for the wild goose chase.

*phew*... all good then. I was starting to go a little ga-ga trying to
make sense of things.

Arguably we should probably do something like:

@@ -4555,7 +4572,7 @@ asmlinkage __visible void __sched preempt_schedule_irq(void)
 int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
 			  void *key)
 {
-	return try_to_wake_up(curr->private, mode, wake_flags);
+	return try_to_wake_up(curr->private, mode, wake_flags & WF_SYNC);
 }
 EXPORT_SYMBOL(default_wake_function);

Since I don't think anybody uses anything other than WF_SYNC, ever. And
the rest of the WF_flags are used internally.

Thanks Chris!
Re: [PATCH -v2 1/5] sched: Fix ttwu() race
Quoting Peter Zijlstra (2020-07-23 19:28:41)
> On Wed, Jul 22, 2020 at 10:57:56AM +0100, Chris Wilson wrote:
> 
> > Perhaps more damning is that I can replace WF_ON_CPU with p->on_cpu to
> > suppress the warning:
> 
> *argh*, I'm starting to go mad...
> 
> Chris, could you please try the below patch?

ttwu-IPI-self: 1==1, p->on_cpu=0;0, task_cpu(p)=1;1
ttwu-IPI-self: 1==1, p->on_cpu=0;0, task_cpu(p)=1;1
ttwu-IPI-self: 0==0, p->on_cpu=0;0, task_cpu(p)=0;0
ttwu-IPI-self: 3==3, p->on_cpu=0;0, task_cpu(p)=3;3
ttwu-IPI-self: 2==2, p->on_cpu=0;0, task_cpu(p)=2;2
ttwu-IPI-self: 1==1, p->on_cpu=0;0, task_cpu(p)=1;1
ttwu-IPI-self: 2==2, p->on_cpu=0;0, task_cpu(p)=2;2
ttwu-IPI-self: 2==2, p->on_cpu=0;0, task_cpu(p)=2;2
ttwu-IPI-self: 2==2, p->on_cpu=0;0, task_cpu(p)=2;2

> Can you also confirm that if you do:
> 
>   $ echo NO_TTWU_QUEUE_ON_CPU > /debug/sched_features

With

  sched_feat_disable(10):TTWU_QUEUE_ON_CPU

the pr_alert is still being hit:

ttwu-IPI-self: 3==3, p->on_cpu=0;0, task_cpu(p)=3;3

At which point, it dawns on me. Mea culpa, stray bits being passed into
default_wake_function. I am very sorry for the wild goose chase.
-Chris
Re: [PATCH -v2 1/5] sched: Fix ttwu() race
On Wed, Jul 22, 2020 at 10:57:56AM +0100, Chris Wilson wrote:
> Perhaps more damning is that I can replace WF_ON_CPU with p->on_cpu to
> suppress the warning:

*argh*, I'm starting to go mad...

Chris, could you please try the below patch?

Can you also confirm that if you do:

  $ echo NO_TTWU_QUEUE_ON_CPU > /debug/sched_features

or wherever else system-doofus mounts debugfs these days, the issue no
longer manifests? Because if I don't get a handle on this soon we might
have to disable this thing for now :/

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a2a244af9a537..8218779734288 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2430,13 +2430,15 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
 	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
 }
 
-static inline bool ttwu_queue_cond(int cpu, int wake_flags)
+static inline bool ttwu_queue_cond(struct task_struct *p, int cpu, int wake_flags)
 {
+	int this_cpu = smp_processor_id();
+
 	/*
 	 * If the CPU does not share cache, then queue the task on the
 	 * remote rqs wakelist to avoid accessing remote data.
 	 */
-	if (!cpus_share_cache(smp_processor_id(), cpu))
+	if (!cpus_share_cache(this_cpu, cpu))
 		return true;
 
@@ -2445,15 +2447,30 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
 	 * the soon-to-be-idle CPU as the current CPU is likely busy.
 	 * nr_running is checked to avoid unnecessary task stacking.
 	 */
-	if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
-		return true;
+	if (wake_flags & WF_ON_CPU) {
+
+		if (unlikely(cpu == this_cpu)) {
+			int on_cpu = READ_ONCE(p->on_cpu);
+			int cpu1 = task_cpu(p);
+
+			smp_rmb();
+			smp_cond_load_acquire(&p->on_cpu, !VAL);
+
+			pr_alert("ttwu-IPI-self: %d==%d, p->on_cpu=%d;0, task_cpu(p)=%d;%d\n",
+				 cpu, this_cpu, on_cpu, cpu1, task_cpu(p));
+
+			return false;
+		}
+
+		return cpu_rq(cpu)->nr_running <= 1;
+	}
 
 	return false;
 }
 
 static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
 {
-	if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
+	if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(p, cpu, wake_flags)) {
 		if (WARN_ON_ONCE(cpu == smp_processor_id()))
 			return false;
 
@@ -2713,7 +2730,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 * to ensure we observe the correct CPU on which the task is currently
 	 * scheduling.
 	 */
-	if (smp_load_acquire(&p->on_cpu) &&
+	if (sched_feat(TTWU_QUEUE_ON_CPU) && smp_load_acquire(&p->on_cpu) &&
 	    ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
 		goto unlock;
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 7481cd96f3915..b231a840c3eba 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -50,6 +50,7 @@ SCHED_FEAT(NONTASK_CAPACITY, true)
  * using the scheduler IPI. Reduces rq->lock contention/bounces.
  */
 SCHED_FEAT(TTWU_QUEUE, true)
+SCHED_FEAT(TTWU_QUEUE_ON_CPU, true)
 
 /*
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
Re: [PATCH -v2 1/5] sched: Fix ttwu() race
Quoting pet...@infradead.org (2020-07-21 12:37:19)
> On Tue, Jul 21, 2020 at 11:49:05AM +0100, Chris Wilson wrote:
> > Quoting Peter Zijlstra (2020-06-22 11:01:23)
> > > @@ -2378,6 +2385,9 @@ static inline bool ttwu_queue_cond(int c
> > >  static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
> > >  {
> > >         if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
> > > +               if (WARN_ON_ONCE(cpu == smp_processor_id()))
> > > +                       return false;
> > > +
> > >                 sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> > >                 __ttwu_queue_wakelist(p, cpu, wake_flags);
> > >                 return true;
> > 
> > We've been hitting this warning frequently, but have never seen the
> > rcu-torture-esque oops ourselves.
> 
> How easy is it to hit this? What, if anything, can I do to make my own
> computer go bang?

I tried reproducing it in a mockup, hrtimer + irq_work + waitqueue, but
it remains elusive. It pops up in an obscure HW test where we are
exercising timeout handling for rogue HW.

> > <4> [181.766705] RIP: 0010:ttwu_queue_wakelist+0xbc/0xd0
> > <4> [181.766710] Code: 00 00 00 5b 5d 41 5c 41 5d c3 31 c0 5b 5d 41 5c 41 5d c3 31 c0 f6 c3 08 74 f2 48 c7 c2 00 ad 03 00 83 7c 11 40 01 77 e4 eb 80 <0f> 0b 31 c0 eb dc 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 bf 17
> > <4> [181.766726] RSP: 0018:c9003e08 EFLAGS: 00010046
> > <4> [181.766733] RAX:  RBX:  RCX: 888276a0
> > <4> [181.766740] RDX: 0003ad00 RSI: 8232045b RDI: 8233103e
> > <4> [181.766747] RBP:  R08:  R09: 0001
> > <4> [181.766754] R10: d3fa25c3 R11: 53712267 R12: 88825b912940
> > <4> [181.766761] R13:  R14: 0087 R15: 0003ad00
> > <4> [181.766769] FS: () GS:888276a0() knlGS:
> > <4> [181.766777] CS: 0010 DS: ES: CR0: 80050033
> > <4> [181.766783] CR2: 55b8245814e0 CR3: 05610003 CR4: 003606f0
> > <4> [181.766790] Call Trace:
> > <4> [181.766794]
> > <4> [181.766798]  try_to_wake_up+0x21b/0x690
> > <4> [181.766805]  autoremove_wake_function+0xc/0x50
> > <4> [181.766858]  __i915_sw_fence_complete+0x1ee/0x250 [i915]
> > <4> [181.766912]  dma_i915_sw_fence_wake+0x2d/0x40 [i915]
> 
> Please, don't trim oopses..
> 
> > We are seeing this on the ttwu_queue() path, so with p->on_cpu=0, and the
> > warning is cleared up by
> > 
> > -       if (WARN_ON_ONCE(cpu == smp_processor_id()))
> > +       if (WARN_ON_ONCE(p->on_cpu && cpu == smp_processor_id()))
> > 
> > which would appear to restore the old behaviour for ttwu_queue() and
> > seem to be consistent with the intent of this patch. Hopefully this
> > helps identify the problem correctly.
> 
> Hurmph, that's actively wrong. We should never queue to self, as that
> would result in self-IPI, which is not possible on a bunch of archs. It
> works for you because x86 can in fact do that.
> 
> So ttwu_queue_cond() will only return true when:
> 
>  - target-cpu and current-cpu do not share cache;
>    so it cannot be this condition, because you _always_
>    share cache with yourself.
> 
>  - when WF_ON_CPU and target-cpu has nr_running <= 1;
>    which means p->on_cpu == true.
> 
> So now you have cpu == smp_processor_id() && p->on_cpu == 1, however
> your modified WARN contradicts that.
> 
> *puzzle*

Perhaps more damning is that I can replace WF_ON_CPU with p->on_cpu to
suppress the warning:

-static inline bool ttwu_queue_cond(int cpu, int wake_flags)
+static inline bool ttwu_queue_cond(struct task_struct *p, int cpu, int wake_flags)
 {
 	/*
 	 * If the CPU does not share cache, then queue the task on the
@@ -2370,7 +2370,7 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
 	 * the soon-to-be-idle CPU as the current CPU is likely busy.
 	 * nr_running is checked to avoid unnecessary task stacking.
 	 */
-	if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
+	if (p->on_cpu && cpu_rq(cpu)->nr_running <= 1)
 		return true;
 
 	return false;
@@ -2378,7 +2378,7 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
 static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
 {
-	if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
+	if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(p, cpu, wake_flags)) {
 		if (WARN_ON_ONCE(cpu == smp_processor_id()))
 			return false;
Re: [PATCH -v2 1/5] sched: Fix ttwu() race
On Tue, Jul 21, 2020 at 11:49:05AM +0100, Chris Wilson wrote:
> Quoting Peter Zijlstra (2020-06-22 11:01:23)
> > @@ -2378,6 +2385,9 @@ static inline bool ttwu_queue_cond(int c
> >  static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
> >  {
> >         if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
> > +               if (WARN_ON_ONCE(cpu == smp_processor_id()))
> > +                       return false;
> > +
> >                 sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> >                 __ttwu_queue_wakelist(p, cpu, wake_flags);
> >                 return true;
> 
> We've been hitting this warning frequently, but have never seen the
> rcu-torture-esque oops ourselves.

How easy is it to hit this? What, if anything, can I do to make my own
computer go bang?

> <4> [181.766705] RIP: 0010:ttwu_queue_wakelist+0xbc/0xd0
> <4> [181.766710] Code: 00 00 00 5b 5d 41 5c 41 5d c3 31 c0 5b 5d 41 5c 41 5d c3 31 c0 f6 c3 08 74 f2 48 c7 c2 00 ad 03 00 83 7c 11 40 01 77 e4 eb 80 <0f> 0b 31 c0 eb dc 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 bf 17
> <4> [181.766726] RSP: 0018:c9003e08 EFLAGS: 00010046
> <4> [181.766733] RAX:  RBX:  RCX: 888276a0
> <4> [181.766740] RDX: 0003ad00 RSI: 8232045b RDI: 8233103e
> <4> [181.766747] RBP:  R08:  R09: 0001
> <4> [181.766754] R10: d3fa25c3 R11: 53712267 R12: 88825b912940
> <4> [181.766761] R13:  R14: 0087 R15: 0003ad00
> <4> [181.766769] FS: () GS:888276a0() knlGS:
> <4> [181.766777] CS: 0010 DS: ES: CR0: 80050033
> <4> [181.766783] CR2: 55b8245814e0 CR3: 05610003 CR4: 003606f0
> <4> [181.766790] Call Trace:
> <4> [181.766794]
> <4> [181.766798]  try_to_wake_up+0x21b/0x690
> <4> [181.766805]  autoremove_wake_function+0xc/0x50
> <4> [181.766858]  __i915_sw_fence_complete+0x1ee/0x250 [i915]
> <4> [181.766912]  dma_i915_sw_fence_wake+0x2d/0x40 [i915]

Please, don't trim oopses..

> We are seeing this on the ttwu_queue() path, so with p->on_cpu=0, and the
> warning is cleared up by
> 
> -       if (WARN_ON_ONCE(cpu == smp_processor_id()))
> +       if (WARN_ON_ONCE(p->on_cpu && cpu == smp_processor_id()))
> 
> which would appear to restore the old behaviour for ttwu_queue() and
> seem to be consistent with the intent of this patch. Hopefully this
> helps identify the problem correctly.

Hurmph, that's actively wrong. We should never queue to self, as that
would result in self-IPI, which is not possible on a bunch of archs. It
works for you because x86 can in fact do that.

So ttwu_queue_cond() will only return true when:

 - target-cpu and current-cpu do not share cache;
   so it cannot be this condition, because you _always_
   share cache with yourself.

 - when WF_ON_CPU and target-cpu has nr_running <= 1;
   which means p->on_cpu == true.

So now you have cpu == smp_processor_id() && p->on_cpu == 1, however
your modified WARN contradicts that.

*puzzle*
Re: [PATCH -v2 1/5] sched: Fix ttwu() race
Quoting Peter Zijlstra (2020-06-22 11:01:23)
> @@ -2378,6 +2385,9 @@ static inline bool ttwu_queue_cond(int c
>  static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
>  {
>         if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
> +               if (WARN_ON_ONCE(cpu == smp_processor_id()))
> +                       return false;
> +
>                 sched_clock_cpu(cpu); /* Sync clocks across CPUs */
>                 __ttwu_queue_wakelist(p, cpu, wake_flags);
>                 return true;

We've been hitting this warning frequently, but have never seen the
rcu-torture-esque oops ourselves.

<4> [181.766705] RIP: 0010:ttwu_queue_wakelist+0xbc/0xd0
<4> [181.766710] Code: 00 00 00 5b 5d 41 5c 41 5d c3 31 c0 5b 5d 41 5c 41 5d c3 31 c0 f6 c3 08 74 f2 48 c7 c2 00 ad 03 00 83 7c 11 40 01 77 e4 eb 80 <0f> 0b 31 c0 eb dc 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 bf 17
<4> [181.766726] RSP: 0018:c9003e08 EFLAGS: 00010046
<4> [181.766733] RAX:  RBX:  RCX: 888276a0
<4> [181.766740] RDX: 0003ad00 RSI: 8232045b RDI: 8233103e
<4> [181.766747] RBP:  R08:  R09: 0001
<4> [181.766754] R10: d3fa25c3 R11: 53712267 R12: 88825b912940
<4> [181.766761] R13:  R14: 0087 R15: 0003ad00
<4> [181.766769] FS: () GS:888276a0() knlGS:
<4> [181.766777] CS: 0010 DS: ES: CR0: 80050033
<4> [181.766783] CR2: 55b8245814e0 CR3: 05610003 CR4: 003606f0
<4> [181.766790] Call Trace:
<4> [181.766794]
<4> [181.766798]  try_to_wake_up+0x21b/0x690
<4> [181.766805]  autoremove_wake_function+0xc/0x50
<4> [181.766858]  __i915_sw_fence_complete+0x1ee/0x250 [i915]
<4> [181.766912]  dma_i915_sw_fence_wake+0x2d/0x40 [i915]

We are seeing this on the ttwu_queue() path, so with p->on_cpu=0, and the
warning is cleared up by

-       if (WARN_ON_ONCE(cpu == smp_processor_id()))
+       if (WARN_ON_ONCE(p->on_cpu && cpu == smp_processor_id()))

which would appear to restore the old behaviour for ttwu_queue() and
seem to be consistent with the intent of this patch. Hopefully this
helps identify the problem correctly.
-Chris
Re: [PATCH -v2 1/5] sched: Fix ttwu() race
*sigh*, this one should actually build and I got a smatch report that
there's an uninitialized usage of @cpu, so I shuffled that around a bit.

---
Subject: sched: Fix ttwu() race
From: Peter Zijlstra
Date: Mon, 22 Jun 2020 12:01:23 +0200

Paul reported rcutorture occasionally hitting a NULL deref:

  sched_ttwu_pending()
    ttwu_do_wakeup()
      check_preempt_curr() := check_preempt_wakeup()
        find_matching_se()
          is_same_group()
            if (se->cfs_rq == pse->cfs_rq) <-- *BOOM*

Debugging showed that this only appears to happen when we take the new
code-path from commit:

  2ebb17717550 ("sched/core: Offload wakee task activation if it the wakee is descheduling")

and only when @cpu == smp_processor_id(). Something which should not be
possible, because p->on_cpu can only be true for remote tasks. Similarly,
without the new code-path from commit:

  c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")

this would've unconditionally hit:

  smp_cond_load_acquire(&p->on_cpu, !VAL);

and if: 'cpu == smp_processor_id() && p->on_cpu' is possible, this would
result in an instant live-lock (with IRQs disabled), something that
hasn't been reported.

The NULL deref can be explained however if the task_cpu(p) load at the
beginning of try_to_wake_up() returns an old value, and this old value
happens to be smp_processor_id(). Further assume that the p->on_cpu load
accurately returns 1, it really is still running, just not here.

Then, when we enqueue the task locally, we can crash in exactly the
observed manner because p->se.cfs_rq != rq->cfs_rq, because p's cfs_rq
is from the wrong CPU, therefore we'll iterate into the non-existent
parents and NULL deref.
The closest semi-plausible scenario I've managed to contrive is somewhat
elaborate (then again, actual reproduction takes many CPU hours of
rcutorture, so it can't be anything obvious):

	X->cpu = 1
	rq(1)->curr = X

	CPU0			CPU1			CPU2

				// switch away from X
				LOCK rq(1)->lock
				smp_mb__after_spinlock
				dequeue_task(X)
				X->on_rq = 0
				switch_to(Z)
				X->on_cpu = 0
				UNLOCK rq(1)->lock

							// migrate X to cpu 0
							LOCK rq(1)->lock
							dequeue_task(X)
							set_task_cpu(X, 0)
							X->cpu = 0
							UNLOCK rq(1)->lock

							LOCK rq(0)->lock
							enqueue_task(X)
							X->on_rq = 1
							UNLOCK rq(0)->lock

	// switch to X
	LOCK rq(0)->lock
	smp_mb__after_spinlock
	switch_to(X)
	X->on_cpu = 1
	UNLOCK rq(0)->lock

	// X goes sleep
	X->state = TASK_UNINTERRUPTIBLE
	smp_mb();
							// wake X
							ttwu()
							LOCK X->pi_lock
							smp_mb__after_spinlock
							if (p->state)
							  cpu = X->cpu; // =? 1

							smp_rmb()

	// X calls schedule()
	LOCK rq(0)->lock
	smp_mb__after_spinlock
	dequeue_task(X)
	X->on_rq = 0
							if (p->on_rq)

							smp_rmb();

							if (p->on_cpu &&
							    ttwu_queue_wakelist(..)) [*]

							smp_cond_load_acquire(&p->on_cpu, !VAL)

							cpu = select_task_rq(X, X->wake_cpu, ...)
							if (X->cpu != cpu)
	switch_to(Y)
	X->on_cpu = 0
	UNLOCK rq(0)->lock

However I'm having trouble convincing myself that's actually possible on
x86_64 -- after all, every LOCK implies an smp_mb there, so if ttwu
observes ->state != RUNNING, it must also observe ->cpu != 1.

(Most of the previous ttwu() races were found on very large PowerPC)

Nevertheless, this