Re: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr()
On Thu, Nov 12, 2020 at 07:31:12PM +, Valentin Schneider wrote: > I think what is happening here is: > > affine_move_task() > // task_running() case > stop_one_cpu() > wait_for_completion(>done); > > and this is !PREEMPT, so the stopper can very well hit: > > migration_cpu_stop() > // Task moved between unlocks and scheduling the stopper > task_rq(p) != rq && > // task_running() case > dest_cpu >= 0 > > => no complete_all(), ever :( Damn... > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 02076e6d3792..fad0a8e62aca 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -1923,7 +1923,7 @@ static int migration_cpu_stop(void *data) > else > p->wake_cpu = dest_cpu; > > - } else if (dest_cpu < 0) { > + } else if (dest_cpu < 0 || pending) { > /* >* This happens when we get migrated between migrate_enable()'s >* preempt_enable() and scheduling the stopper task. At that > @@ -1933,6 +1933,17 @@ static int migration_cpu_stop(void *data) >* more likely. >*/ > > + /* > + * The task moved before the stopper got to run. We're holding > + * ->pi_lock, so the allowed mask is stable - if it got > + * somewhere allowed, we're done. > + */ > + if (pending && cpumask_test_cpu(task_cpu(p), p->cpus_ptr)) { > + p->migration_pending = NULL; > + complete = true; > + goto out; > + } > + > /* >* When this was migrate_enable() but we no longer have an >* @pending, a concurrent SCA 'fixed' things and we should be Agreed, this is very clearly a missing case and the proposed solution seems straight forward enough; but I'm struggling to convince my sleep deprived brain we're actually complete now. I'll continue staring at it a little more. Could you make it into a proper patch?
Re: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr()
On 12/11/20 20:37, Qian Cai wrote: > On Thu, 2020-11-12 at 19:31 +, Valentin Schneider wrote: >> a) Do you also get this on CONFIG_PREEMPT=y? > > This also happens with: > > CONFIG_PREEMPT=y > CONFIG_PREEMPTION=y > CONFIG_PREEMPT_RCU=y > CONFIG_PREEMPT_NOTIFIERS=y > CONFIG_DEBUG_PREEMPT=y > CONFIG_PREEMPTIRQ_TRACEPOINTS=y > Hmph, it should be much less likely to happen with PREEMPT=y, but isn't per se impossible. Thanks for giving it a shot. > [ 1235.044945][ T330] INFO: task trinity-c4:60050 blocked for more than 245 > seconds. > [ 1235.052540][ T330] Not tainted 5.10.0-rc3-next-20201112+ #2 > [ 1235.058774][ T330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [ 1235.067392][ T330] task:trinity-c4 state:D stack:26880 pid:60050 > ppid: 1722 flags:0x4000 > [ 1235.076505][ T330] Call Trace: > [ 1235.079680][ T330] __schedule (kernel/sched/core.c:4272 > kernel/sched/core.c:5019) > [ 1235.083971][ T330] ? __sched_text_start (kernel/sched/core.c:4901) > [ 1235.088721][ T330] schedule (kernel/sched/core.c:5099 (discriminator 1)) > [ 1235.092661][ T330] schedule_timeout (kernel/time/timer.c:1848) > [ 1235.097399][ T330] ? usleep_range (kernel/time/timer.c:1833) > [ 1235.101945][ T330] ? wait_for_completion (kernel/sched/completion.c:85 > kernel/sched/completion.c:106 kernel/sched/completion.c:117 > kernel/sched/completion.c:138) > [ 1235.107156][ T330] ? lock_downgrade (kernel/locking/lockdep.c:5443) > [ 1235.111883][ T330] ? rcu_read_unlock (./include/linux/rcupdate.h:692 > (discriminator 5)) > [ 1235.116561][ T330] ? do_raw_spin_lock (./arch/x86/include/asm/atomic.h:202 > ./include/asm-generic/atomic-instrumented.h:707 > ./include/asm-generic/qspinlock.h:82 kernel/locking/spinlock_debug.c:113) > [ 1235.121459][ T330] ? _raw_spin_unlock_irq > (./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 > ./include/linux/spinlock_api_smp.h:168 kernel/locking/spinlock.c:199) > [ 1235.126601][ T330] wait_for_completion (kernel/sched/completion.c:86 > kernel/sched/completion.c:106 kernel/sched/completion.c:117 > kernel/sched/completion.c:138) > [ 1235.131591][ T330] ? wait_for_completion_interruptible > (kernel/sched/completion.c:137) > [ 1235.138013][ T330] ? _raw_spin_unlock_irqrestore > (./include/linux/spinlock_api_smp.h:160 kernel/locking/spinlock.c:191) > [ 1235.143698][ T330] affine_move_task (./include/linux/instrumented.h:101 > ./include/asm-generic/atomic-instrumented.h:220 > ./include/linux/refcount.h:272 ./include/linux/refcount.h:315 > ./include/linux/refcount.h:333 kernel/sched/core.c:2263) > [ 1235.148451][ T330] ? move_queued_task (kernel/sched/core.c:2151) > [ 1235.153351][ T330] ? update_curr (kernel/sched/sched.h:1176 > kernel/sched/fair.c:845) > [ 1235.157848][ T330] ? enqueue_entity (kernel/sched/fair.c:4247) > [ 1235.162658][ T330] ? set_next_task_fair > (./arch/x86/include/asm/jump_label.h:25 (discriminator 2) > ./include/linux/jump_label.h:200 (discriminator 2) kernel/sched/fair.c:4567 > (discriminator 2) kernel/sched/fair.c:4683 (discriminator 2) > kernel/sched/fair.c:10953 (discriminator 2)) > [ 1235.167667][ T330] __set_cpus_allowed_ptr (kernel/sched/core.c:2353) > [ 1235.172905][ T330] ? affine_move_task (kernel/sched/core.c:2287) > [ 1235.177826][ T330] ? _raw_spin_unlock_irqrestore > (./include/linux/spinlock_api_smp.h:160 kernel/locking/spinlock.c:191) > [ 1235.183501][ T330] sched_setaffinity (kernel/sched/core.c:6460) > [ 1235.188345][ T330] ? __ia32_sys_sched_getattr (kernel/sched/core.c:6393) > [ 1235.193937][ T330] ? _copy_from_user > (./arch/x86/include/asm/uaccess_64.h:46 > ./arch/x86/include/asm/uaccess_64.h:52 lib/usercopy.c:16) > [ 1235.198605][ T330] __x64_sys_sched_setaffinity (kernel/sched/core.c:6511 > kernel/sched/core.c:6500 kernel/sched/core.c:6500) > [ 1235.204291][ T330] ? sched_setaffinity (kernel/sched/core.c:6500) > [ 1235.209324][ T330] ? syscall_enter_from_user_mode > (./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 > kernel/entry/common.c:98) > [ 1235.215133][ T330] do_syscall_64 (arch/x86/entry/common.c:46) > [ 1235.219431][ T330] entry_SYSCALL_64_after_hwframe > (arch/x86/entry/entry_64.S:127) > [ 1235.225251][ T330] RIP: 0033:0x7fb102b1178d > >> b) Could you try the below? > > It is running good so far on multiple systems. I'll keep it running and report > back if it happens again. Thanks! All of this is somewhat fragile, so I'll want to have another look with a fresher mind; if the diff makes a difference at least it'll mean I wasn't completely off.
Re: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr()
On Thu, 2020-11-12 at 19:31 +, Valentin Schneider wrote: > a) Do you also get this on CONFIG_PREEMPT=y? This also happens with: CONFIG_PREEMPT=y CONFIG_PREEMPTION=y CONFIG_PREEMPT_RCU=y CONFIG_PREEMPT_NOTIFIERS=y CONFIG_DEBUG_PREEMPT=y CONFIG_PREEMPTIRQ_TRACEPOINTS=y [ 1235.044945][ T330] INFO: task trinity-c4:60050 blocked for more than 245 seconds. [ 1235.052540][ T330] Not tainted 5.10.0-rc3-next-20201112+ #2 [ 1235.058774][ T330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1235.067392][ T330] task:trinity-c4 state:D stack:26880 pid:60050 ppid: 1722 flags:0x4000 [ 1235.076505][ T330] Call Trace: [ 1235.079680][ T330] __schedule (kernel/sched/core.c:4272 kernel/sched/core.c:5019) [ 1235.083971][ T330] ? __sched_text_start (kernel/sched/core.c:4901) [ 1235.088721][ T330] schedule (kernel/sched/core.c:5099 (discriminator 1)) [ 1235.092661][ T330] schedule_timeout (kernel/time/timer.c:1848) [ 1235.097399][ T330] ? usleep_range (kernel/time/timer.c:1833) [ 1235.101945][ T330] ? wait_for_completion (kernel/sched/completion.c:85 kernel/sched/completion.c:106 kernel/sched/completion.c:117 kernel/sched/completion.c:138) [ 1235.107156][ T330] ? lock_downgrade (kernel/locking/lockdep.c:5443) [ 1235.111883][ T330] ? rcu_read_unlock (./include/linux/rcupdate.h:692 (discriminator 5)) [ 1235.116561][ T330] ? do_raw_spin_lock (./arch/x86/include/asm/atomic.h:202 ./include/asm-generic/atomic-instrumented.h:707 ./include/asm-generic/qspinlock.h:82 kernel/locking/spinlock_debug.c:113) [ 1235.121459][ T330] ? _raw_spin_unlock_irq (./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 ./include/linux/spinlock_api_smp.h:168 kernel/locking/spinlock.c:199) [ 1235.126601][ T330] wait_for_completion (kernel/sched/completion.c:86 kernel/sched/completion.c:106 kernel/sched/completion.c:117 kernel/sched/completion.c:138) [ 1235.131591][ T330] ? wait_for_completion_interruptible (kernel/sched/completion.c:137) [ 1235.138013][ T330] ? _raw_spin_unlock_irqrestore (./include/linux/spinlock_api_smp.h:160 kernel/locking/spinlock.c:191) [ 1235.143698][ T330] affine_move_task (./include/linux/instrumented.h:101 ./include/asm-generic/atomic-instrumented.h:220 ./include/linux/refcount.h:272 ./include/linux/refcount.h:315 ./include/linux/refcount.h:333 kernel/sched/core.c:2263) [ 1235.148451][ T330] ? move_queued_task (kernel/sched/core.c:2151) [ 1235.153351][ T330] ? update_curr (kernel/sched/sched.h:1176 kernel/sched/fair.c:845) [ 1235.157848][ T330] ? enqueue_entity (kernel/sched/fair.c:4247) [ 1235.162658][ T330] ? set_next_task_fair (./arch/x86/include/asm/jump_label.h:25 (discriminator 2) ./include/linux/jump_label.h:200 (discriminator 2) kernel/sched/fair.c:4567 (discriminator 2) kernel/sched/fair.c:4683 (discriminator 2) kernel/sched/fair.c:10953 (discriminator 2)) [ 1235.167667][ T330] __set_cpus_allowed_ptr (kernel/sched/core.c:2353) [ 1235.172905][ T330] ? affine_move_task (kernel/sched/core.c:2287) [ 1235.177826][ T330] ? _raw_spin_unlock_irqrestore (./include/linux/spinlock_api_smp.h:160 kernel/locking/spinlock.c:191) [ 1235.183501][ T330] sched_setaffinity (kernel/sched/core.c:6460) [ 1235.188345][ T330] ? __ia32_sys_sched_getattr (kernel/sched/core.c:6393) [ 1235.193937][ T330] ? _copy_from_user (./arch/x86/include/asm/uaccess_64.h:46 ./arch/x86/include/asm/uaccess_64.h:52 lib/usercopy.c:16) [ 1235.198605][ T330] __x64_sys_sched_setaffinity (kernel/sched/core.c:6511 kernel/sched/core.c:6500 kernel/sched/core.c:6500) [ 1235.204291][ T330] ? sched_setaffinity (kernel/sched/core.c:6500) [ 1235.209324][ T330] ? syscall_enter_from_user_mode (./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 kernel/entry/common.c:98) [ 1235.215133][ T330] do_syscall_64 (arch/x86/entry/common.c:46) [ 1235.219431][ T330] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:127) [ 1235.225251][ T330] RIP: 0033:0x7fb102b1178d > b) Could you try the below? It is running good so far on multiple systems. I'll keep it running and report back if it happens again.
Re: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr()
On Thu, 2020-11-12 at 19:31 +, Valentin Schneider wrote: > One thing I don't get: that trace shows refcount_dec_and_test() > (kernel/sched/core.c:2263) happening before the wait_for_completion(). It's > not the case in the below trace. Yes, that is normal. Sometimes, the decoding is a bit off not sure because of some debugging options like KASAN obscures it. > a) Do you also get this on CONFIG_PREEMPT=y? I don't know. None of the systems here has that, but I could probably try. > b) Could you try the below? Let me run it and report. > > --- > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 02076e6d3792..fad0a8e62aca 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -1923,7 +1923,7 @@ static int migration_cpu_stop(void *data) > else > p->wake_cpu = dest_cpu; > > - } else if (dest_cpu < 0) { > + } else if (dest_cpu < 0 || pending) { > /* >* This happens when we get migrated between migrate_enable()'s >* preempt_enable() and scheduling the stopper task. At that > @@ -1933,6 +1933,17 @@ static int migration_cpu_stop(void *data) >* more likely. >*/ > > + /* > + * The task moved before the stopper got to run. We're holding > + * ->pi_lock, so the allowed mask is stable - if it got > + * somewhere allowed, we're done. > + */ > + if (pending && cpumask_test_cpu(task_cpu(p), p->cpus_ptr)) { > + p->migration_pending = NULL; > + complete = true; > + goto out; > + } > + > /* >* When this was migrate_enable() but we no longer have an >* @pending, a concurrent SCA 'fixed' things and we should be >
Re: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr()
On 12/11/20 18:01, Qian Cai wrote: > On Thu, 2020-11-12 at 17:26 +, Valentin Schneider wrote: >> On 12/11/20 16:38, Qian Cai wrote: >> > Some syscall fuzzing from an unprivileged user starts to trigger this below >> > since this commit first appeared in the linux-next today. Does it ring any >> > bells? >> > >> >> What's the .config? I'm interested in >> CONFIG_PREEMPT >> CONFIG_PREEMPT_RT >> CONFIG_SMP > > https://cailca.coding.net/public/linux/mm/git/files/master/arm64.config > > # CONFIG_PREEMPT is not set > CONFIG_SMP=y > So that's CONFIG_PREEMPT_NONE=y > Also, I have been able to reproduce this on powerpc as well just now. > [...] > > [12065.101987][ T1310] __switch_to (arch/arm64/kernel/process.c:580) > [12065.106227][ T1310] __schedule (kernel/sched/core.c:4272 > kernel/sched/core.c:5019) > [12065.110505][ T1310] schedule (./arch/arm64/include/asm/current.h:19 > (discriminator 1) ./arch/arm64/include/asm/preempt.h:53 (discriminator 1) > kernel/sched/core.c:5099 (discriminator 1)) > [12065.114562][ T1310] schedule_timeout (kernel/time/timer.c:1848) > [12065.119275][ T1310] wait_for_completion (kernel/sched/completion.c:85 > kernel/sched/completion.c:106 kernel/sched/completion.c:117 > kernel/sched/completion.c:138) > [12065.124257][ T1310] affine_move_task (./include/linux/instrumented.h:101 > ./include/asm-generic/atomic-instrumented.h:220 > ./include/linux/refcount.h:272 ./include/linux/refcount.h:315 > ./include/linux/refcount.h:333 kernel/sched/core.c:2263) > [12065.129013][ T1310] __set_cpus_allowed_ptr (kernel/sched/core.c:2353) > [12065.134248][ T1310] sched_setaffinity (kernel/sched/core.c:6460) > [12065.139088][ T1310] __arm64_sys_sched_setaffinity > (kernel/sched/core.c:6511 kernel/sched/core.c:6500 kernel/sched/core.c:6500) > [12065.144972][ T1310] do_el0_svc (arch/arm64/kernel/syscall.c:36 > arch/arm64/kernel/syscall.c:48 arch/arm64/kernel/syscall.c:159 > arch/arm64/kernel/syscall.c:205) > [12065.149165][ T1310] el0_sync_handler (arch/arm64/kernel/entry-common.c:236 > arch/arm64/kernel/entry-common.c:254) > [12065.153876][ T1310] el0_sync (arch/arm64/kernel/entry.S:741) > Thanks! One thing I don't get: that trace shows refcount_dec_and_test() (kernel/sched/core.c:2263) happening before the wait_for_completion(). It's not the case in the below trace. > == powerpc == > [18060.020301][ T676] [c000200014227670] [c0a6d1e8] > __func__.5350+0x1220e0/0x181338 unreliable > [18060.020333][ T676] [c000200014227850] [c001a278] __switch_to > (arch/powerpc/kernel/process.c:1273) > [18060.020351][ T676] [c0002000142278c0] [c08f3e94] __schedule > (kernel/sched/core.c:4269 kernel/sched/core.c:5019) > [18060.020377][ T676] [c000200014227990] [c08f4638] schedule > (./include/asm-generic/preempt.h:59 (discriminator 1) > kernel/sched/core.c:5099 (discriminator 1)) > [18060.020394][ T676] [c0002000142279c0] [c08fbd34] schedule_timeout > (kernel/time/timer.c:1847) > [18060.020420][ T676] [c000200014227ac0] [c08f6398] > wait_for_completion (kernel/sched/completion.c:85 > kernel/sched/completion.c:106 kernel/sched/completion.c:117 > kernel/sched/completion.c:138) > [18060.020455][ T676] [c000200014227b30] [c0100fd4] affine_move_task > (kernel/sched/core.c:2261) > [18060.020481][ T676] [c000200014227c90] [c0101444] > __set_cpus_allowed_ptr (kernel/sched/core.c:2353) > [18060.020507][ T676] [c000200014227d00] [c0106eac] sched_setaffinity > (kernel/sched/core.c:6460) > [18060.020533][ T676] [c000200014227d70] [c0107134] > sys_sched_setaffinity (kernel/sched/core.c:6511 kernel/sched/core.c:6500) > [18060.020559][ T676] [c000200014227dc0] [c002a6d8] > system_call_exception (arch/powerpc/kernel/syscall_64.c:111) > [18060.020585][ T676] [c000200014227e20] [c000d0a8] > system_call_common (arch/powerpc/kernel/entry_64.S:302) I take back what I said in that previous email; we could have gone through the task_running() stop_one_cpu() and *then* hit the wait_for_completion(>done); and that is actually the only case that makes sense to me here, because the !task_running() one will do the completion before waiting (that kernel has no way to make a task Migration Disabled). I think what is happening here is: affine_move_task() // task_running() case stop_one_cpu() wait_for_completion(>done); and this is !PREEMPT, so the stopper can very well hit: migration_cpu_stop() // Task moved between unlocks and scheduling the stopper task_rq(p) != rq && // task_running() case dest_cpu >= 0 => no complete_all(), ever :( This is an annoying case because we didn't have to bother about it before; a rq mismatch meant the task was fine, because we modified ->cpus_allowed_mask prior. With migrate_disable(), we have to chase after the bloody task because we have to preempt it to get a stable is_migration_disabled() reading. It could have been Migration
Re: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr()
On Thu, 2020-11-12 at 17:26 +, Valentin Schneider wrote: > On 12/11/20 16:38, Qian Cai wrote: > > Some syscall fuzzing from an unprivileged user starts to trigger this below > > since this commit first appeared in the linux-next today. Does it ring any > > bells? X86 in a KVM guest as well. guest .config: https://cailca.coding.net/public/linux/mm/git/files/master/x86.config To reproduce: # /usr/libexec/qemu-kvm -name kata -cpu host -smp 48 -m 48g -hda rhel-8.3- x86_64-kvm.img.qcow2 -cdrom kata.iso -nic user,hostfwd=tcp::-:22 -nographic == inside the guest === # git clone https://e.coding.net/cailca/linux/mm # cd mm; make # ./random -x 0-100 -f [17213.432777][ T348] INFO: task trinity-c7:216885 can't die for more than 122 seconds. [17213.434895][ T348] task:trinity-c7 state:D stack:27088 pid:216885 ppid:103237 flags:0x4004 [17213.437297][ T348] Call Trace: [17213.438142][ T348] __schedule (kernel/sched/core.c:4272 kernel/sched/core.c:5019) [17213.439256][ T348] ? __sched_text_start (kernel/sched/core.c:4901) [17213.440477][ T348] schedule (./arch/x86/include/asm/current.h:15 (discriminator 1) ./include/linux/sched.h:1892 (discriminator 1) kernel/sched/core.c:5100 (discriminator 1)) [17213.441501][ T348] schedule_timeout (kernel/time/timer.c:1848) [17213.442834][ T348] ? usleep_range (kernel/time/timer.c:1833) [17213.444070][ T348] ? wait_for_completion (kernel/sched/completion.c:85 kernel/sched/completion.c:106 kernel/sched/completion.c:117 kernel/sched/completion.c:138) [17213.445457][ T348] ? lock_downgrade (kernel/locking/lockdep.c:5443) [17213.446695][ T348] ? rcu_read_unlock (./include/linux/rcupdate.h:692 (discriminator 5)) [17213.447911][ T348] ? do_raw_spin_lock (./arch/x86/include/asm/atomic.h:202 ./include/asm-generic/atomic-instrumented.h:707 ./include/asm-generic/qspinlock.h:82 kernel/locking/spinlock_debug.c:113) [17213.449190][ T348] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4036 kernel/locking/lockdep.c:4096 kernel/locking/lockdep.c:4048) [17213.450714][ T348] ? _raw_spin_unlock_irq (./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 ./include/linux/spinlock_api_smp.h:168 kernel/locking/spinlock.c:199) [17213.452042][ T348] wait_for_completion (kernel/sched/completion.c:86 kernel/sched/completion.c:106 kernel/sched/completion.c:117 kernel/sched/completion.c:138) [17213.453468][ T348] ? wait_for_completion_interruptible (kernel/sched/completion.c:137) [17213.455152][ T348] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4036 kernel/locking/lockdep.c:4096 kernel/locking/lockdep.c:4048) [17213.456651][ T348] ? _raw_spin_unlock_irqrestore (./include/linux/spinlock_api_smp.h:160 kernel/locking/spinlock.c:191) [17213.458115][ T348] affine_move_task (./include/linux/instrumented.h:101 ./include/asm-generic/atomic-instrumented.h:220 ./include/linux/refcount.h:272 ./include/linux/refcount.h:315 ./include/linux/refcount.h:333 kernel/sched/core.c:2263) [17213.459313][ T348] ? move_queued_task (kernel/sched/core.c:2151) [17213.460553][ T348] ? update_curr (kernel/sched/sched.h:1176 kernel/sched/fair.c:845) [17213.461684][ T348] ? enqueue_entity (kernel/sched/fair.c:4247) [17213.463001][ T348] ? set_next_task_fair (./arch/x86/include/asm/jump_label.h:25 (discriminator 2) ./include/linux/jump_label.h:200 (discriminator 2) kernel/sched/fair.c:4567 (discriminator 2) kernel/sched/fair.c:4683 (discriminator 2) kernel/sched/fair.c:10953 (discriminator 2)) [17213.464294][ T348] __set_cpus_allowed_ptr (kernel/sched/core.c:2353) [17213.465668][ T348] ? affine_move_task (kernel/sched/core.c:2287) [17213.466952][ T348] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4036 kernel/locking/lockdep.c:4096 kernel/locking/lockdep.c:4048) [17213.468452][ T348] ? _raw_spin_unlock_irqrestore (./include/linux/spinlock_api_smp.h:160 kernel/locking/spinlock.c:191) [17213.469908][ T348] sched_setaffinity (kernel/sched/core.c:6460) [17213.471127][ T348] ? __ia32_sys_sched_getattr (kernel/sched/core.c:6393) [17213.472644][ T348] ? _copy_from_user (./arch/x86/include/asm/uaccess_64.h:46 ./arch/x86/include/asm/uaccess_64.h:52 lib/usercopy.c:16) [17213.473850][ T348] __x64_sys_sched_setaffinity (kernel/sched/core.c:6511 kernel/sched/core.c:6500 kernel/sched/core.c:6500) [17213.475307][ T348] ? sched_setaffinity (kernel/sched/core.c:6500) [17213.476542][ T348] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4036 kernel/locking/lockdep.c:4096 kernel/locking/lockdep.c:4048) [17213.477991][ T348] ? syscall_enter_from_user_mode (./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 kernel/entry/common.c:98) [17213.479428][ T348] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:50 (discriminator 22)) [17213.480642][ T348] do_syscall_64 (arch/x86/entry/common.c:46) [17213.481706][ T348] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:127)
Re: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr()
On Thu, 2020-11-12 at 17:26 +, Valentin Schneider wrote: > On 12/11/20 16:38, Qian Cai wrote: > > Some syscall fuzzing from an unprivileged user starts to trigger this below > > since this commit first appeared in the linux-next today. Does it ring any > > bells? > > > > What's the .config? I'm interested in > CONFIG_PREEMPT > CONFIG_PREEMPT_RT > CONFIG_SMP https://cailca.coding.net/public/linux/mm/git/files/master/arm64.config # CONFIG_PREEMPT is not set CONFIG_SMP=y Also, I have been able to reproduce this on powerpc as well just now. > > From a quick look it seems that tree doesn't have Thomas' "generalization" of > migrate_disable(), so if this doesn't have PREEMPT_RT we could forget about > migrate_disable() for now. > > > [12065.065837][ T1310] INFO: task trinity-c30:91730 blocked for more than > > 368 seconds. > > [12065.073524][ T1310] Tainted: G L5.10.0-rc3-next- > > 20201112 #2 > > [12065.081076][ T1310] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > > disables this message. > > [12065.089648][ T1310] task:trinity-c30 state:D stack:26576 pid:91730 > > ppid: 82688 flags:0x > > [12065.098818][ T1310] Call trace: > > [12065.101987][ T1310] __switch_to+0xf0/0x1a8 > > [12065.106227][ T1310] __schedule+0x6ec/0x1708 > > [12065.110505][ T1310] schedule+0x1bc/0x3b0 > > [12065.114562][ T1310] schedule_timeout+0x3c4/0x4c0 > > [12065.119275][ T1310] wait_for_completion+0x13c/0x248 > > [12065.124257][ T1310] affine_move_task+0x410/0x688 > > (inlined by) affine_move_task at kernel/sched/core.c:2261 > > [12065.129013][ T1310] __set_cpus_allowed_ptr+0x1b4/0x370 > > [12065.134248][ T1310] sched_setaffinity+0x4f0/0x7e8 > > [12065.139088][ T1310] __arm64_sys_sched_setaffinity+0x1f4/0x2a0 > > [12065.144972][ T1310] do_el0_svc+0x124/0x228 > > [12065.149165][ T1310] el0_sync_handler+0x208/0x384 > > [12065.153876][ T1310] el0_sync+0x140/0x180 > > [12065.157971][ T1310] > > So that's a task changing the affinity of some task (either itself or > another; I can't say without a decoded stacktrace), and then blocking on a > wait_for_completion() that apparently never happens. > > I don't see stop_one_cpu() in the trace, so I assume it's the !task_running > case, for which the completion should be completed before getting to the > wait (unless we *do* have migrate_disable()). > > Could you please run scripts/decode_stacktrace.sh on the above? [12065.101987][ T1310] __switch_to (arch/arm64/kernel/process.c:580) [12065.106227][ T1310] __schedule (kernel/sched/core.c:4272 kernel/sched/core.c:5019) [12065.110505][ T1310] schedule (./arch/arm64/include/asm/current.h:19 (discriminator 1) ./arch/arm64/include/asm/preempt.h:53 (discriminator 1) kernel/sched/core.c:5099 (discriminator 1)) [12065.114562][ T1310] schedule_timeout (kernel/time/timer.c:1848) [12065.119275][ T1310] wait_for_completion (kernel/sched/completion.c:85 kernel/sched/completion.c:106 kernel/sched/completion.c:117 kernel/sched/completion.c:138) [12065.124257][ T1310] affine_move_task (./include/linux/instrumented.h:101 ./include/asm-generic/atomic-instrumented.h:220 ./include/linux/refcount.h:272 ./include/linux/refcount.h:315 ./include/linux/refcount.h:333 kernel/sched/core.c:2263) [12065.129013][ T1310] __set_cpus_allowed_ptr (kernel/sched/core.c:2353) [12065.134248][ T1310] sched_setaffinity (kernel/sched/core.c:6460) [12065.139088][ T1310] __arm64_sys_sched_setaffinity (kernel/sched/core.c:6511 kernel/sched/core.c:6500 kernel/sched/core.c:6500) [12065.144972][ T1310] do_el0_svc (arch/arm64/kernel/syscall.c:36 arch/arm64/kernel/syscall.c:48 arch/arm64/kernel/syscall.c:159 arch/arm64/kernel/syscall.c:205) [12065.149165][ T1310] el0_sync_handler (arch/arm64/kernel/entry-common.c:236 arch/arm64/kernel/entry-common.c:254) [12065.153876][ T1310] el0_sync (arch/arm64/kernel/entry.S:741) == powerpc == [18060.020301][ T676] [c000200014227670] [c0a6d1e8] __func__.5350+0x1220e0/0x181338 unreliable [18060.020333][ T676] [c000200014227850] [c001a278] __switch_to (arch/powerpc/kernel/process.c:1273) [18060.020351][ T676] [c0002000142278c0] [c08f3e94] __schedule (kernel/sched/core.c:4269 kernel/sched/core.c:5019) [18060.020377][ T676] [c000200014227990] [c08f4638] schedule (./include/asm-generic/preempt.h:59 (discriminator 1) kernel/sched/core.c:5099 (discriminator 1)) [18060.020394][ T676] [c0002000142279c0] [c08fbd34] schedule_timeout (kernel/time/timer.c:1847) [18060.020420][ T676] [c000200014227ac0] [c08f6398] wait_for_completion (kernel/sched/completion.c:85 kernel/sched/completion.c:106 kernel/sched/completion.c:117 kernel/sched/completion.c:138) [18060.020455][ T676] [c000200014227b30] [c0100fd4] affine_move_task (kernel/sched/core.c:2261) [18060.020481][ T676] [c000200014227c90] [c0101444] __set_cpus_allowed_ptr (kernel/sched/core.c:2353) [18060.020507][ T676] [c000200014227d00]
Re: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr()
On 12/11/20 16:38, Qian Cai wrote: > Some syscall fuzzing from an unprivileged user starts to trigger this below > since this commit first appeared in the linux-next today. Does it ring any > bells? > What's the .config? I'm interested in CONFIG_PREEMPT CONFIG_PREEMPT_RT CONFIG_SMP >From a quick look it seems that tree doesn't have Thomas' "generalization" of migrate_disable(), so if this doesn't have PREEMPT_RT we could forget about migrate_disable() for now. > [12065.065837][ T1310] INFO: task trinity-c30:91730 blocked for more than 368 > seconds. > [12065.073524][ T1310] Tainted: G L > 5.10.0-rc3-next-20201112 #2 > [12065.081076][ T1310] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [12065.089648][ T1310] task:trinity-c30 state:D stack:26576 pid:91730 > ppid: 82688 flags:0x > [12065.098818][ T1310] Call trace: > [12065.101987][ T1310] __switch_to+0xf0/0x1a8 > [12065.106227][ T1310] __schedule+0x6ec/0x1708 > [12065.110505][ T1310] schedule+0x1bc/0x3b0 > [12065.114562][ T1310] schedule_timeout+0x3c4/0x4c0 > [12065.119275][ T1310] wait_for_completion+0x13c/0x248 > [12065.124257][ T1310] affine_move_task+0x410/0x688 > (inlined by) affine_move_task at kernel/sched/core.c:2261 > [12065.129013][ T1310] __set_cpus_allowed_ptr+0x1b4/0x370 > [12065.134248][ T1310] sched_setaffinity+0x4f0/0x7e8 > [12065.139088][ T1310] __arm64_sys_sched_setaffinity+0x1f4/0x2a0 > [12065.144972][ T1310] do_el0_svc+0x124/0x228 > [12065.149165][ T1310] el0_sync_handler+0x208/0x384 > [12065.153876][ T1310] el0_sync+0x140/0x180 > [12065.157971][ T1310] So that's a task changing the affinity of some task (either itself or another; I can't say without a decoded stacktrace), and then blocking on a wait_for_completion() that apparently never happens. I don't see stop_one_cpu() in the trace, so I assume it's the !task_running case, for which the completion should be completed before getting to the wait (unless we *do* have migrate_disable()). Could you please run scripts/decode_stacktrace.sh on the above? > [12065.157971][ T1310] Showing all locks held in the system: > [12065.166401][ T1310] 1 lock held by khungtaskd/1310: > [12065.171288][ T1310] #0: 800018d0cb40 (rcu_read_lock){}-{1:2}, at: > rcu_lock_acquire.constprop.56+0x0/0x38 > [12065.182210][ T1310] 4 locks held by trinity-main/82688: > [12065.187515][ T1310] 2 locks held by kworker/u513:3/82813: > [12065.192922][ T1310] #0: 00419d38 > ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x69c/0x18c8 > [12065.203890][ T1310] #1: 122bfd40 > ((work_completion)(>work)){+.+.}-{0:0}, at: __update_idle_core+0xa8/0x460 > [12065.214916][ T1310] 1 lock held by trinity-c35/137168: > [12065.220061][ T1310] #0: 0087ce767898 (>ldisc_sem){}-{0:0}, > at: ldsem_down_read+0x3c/0x48 > [12065.229483][ T1310] 3 locks held by trinity-c61/137611: > [12065.234757][ T1310] 1 lock held by trinity-c7/137630: > [12065.239828][ T1310] 1 lock held by trinity-c57/137714: > [12065.242612][T137611] futex_wake_op: trinity-c61 tries to shift op by 1008; > fix this program > [12065.245012][ T1310] 1 lock held by trinity-c52/137771: > [12065.258538][ T1310] 2 locks held by trinity-c42/137835: > [12065.263783][ T1310] 4 locks held by trinity-c22/137868: > [12065.269051][ T1310] #0: 000e78503798 (>lock){-.-.}-{2:2}, at: > newidle_balance+0x92c/0xd78 > [12065.278155][ T1310] #1: 0087ce767930 > (>atomic_write_lock){+.+.}-{3:3}, at: tty_write_lock+0x30/0x58 > [12065.288317][ T1310] #2: 800018d0cb40 (rcu_read_lock){}-{1:2}, at: > __mutex_lock+0x24c/0x1310 > [12065.297592][ T1310] #3: 800018d0cb40 (rcu_read_lock){}-{1:2}, at: > lock_page_memcg+0x98/0x240 > [12065.307026][ T1310] 2 locks held by trinity-c34/137896: > [12065.312266][ T1310] #0: 000e78463798 (>lock){-.-.}-{2:2}, at: > __schedule+0x22c/0x1708 > [12065.321023][ T1310] #1: 800018d0cb40 (rcu_read_lock){}-{1:2}, at: > __update_idle_core+0xa8/0x460 > [12065.330663][ T1310] 2 locks held by trinity-c43/137909: > [12065.335996][ T1310] 1 lock held by trinity-c24/137910: > [12065.341164][ T1310] 1 lock held by trinity-c1/137954: > [12065.346272][ T1310] 1 lock held by trinity-c49/138020: > [12065.351425][ T1310] 1 lock held by trinity-c10/138021: > [12065.356649][ T1310] 1 lock held by trinity-c32/138039: > [12065.361813][ T1310] 4 locks held by trinity-c36/138042: > [12065.367129][ T1310] 2 locks held by trinity-c14/138061: > [12065.372378][ T1310] 2 locks held by trinity-c38/138070: > [12065.377688][ T1310] 1 lock held by trinity-c50/138074: > [12065.382885][ T1310] 1 lock held by trinity-c12/138085: > [12065.388186][ T1310] 1 lock held by trinity-c4/138087: > [12065.393272][ T1310] 3 locks held by trinity-c6/138091: > [12065.398492][ T1310] 2 locks held by trinity-c48/138095: > [12065.403757][ T1310] 2 locks held by trinity-c62/138097: > [12065.409045][ T1310] 2 locks held by
Re: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr()
On Fri, 2020-10-23 at 12:12 +0200, Peter Zijlstra wrote: > Concurrent migrate_disable() and set_cpus_allowed_ptr() has > interesting features. We rely on set_cpus_allowed_ptr() to not return > until the task runs inside the provided mask. This expectation is > exported to userspace. > > This means that any set_cpus_allowed_ptr() caller must wait until > migrate_enable() allows migrations. > > At the same time, we don't want migrate_enable() to schedule, due to > patterns like: > > preempt_disable(); > migrate_disable(); > ... > migrate_enable(); > preempt_enable(); > > And: > > raw_spin_lock(); > spin_unlock(); > > this means that when migrate_enable() must restore the affinity > mask, it cannot wait for completion thereof. Luck will have it that > that is exactly the case where there is a pending > set_cpus_allowed_ptr(), so let that provide storage for the async stop > machine. > > Much thanks to Valentin who used TLA+ most effective and found lots of > 'interesting' cases. > > Signed-off-by: Peter Zijlstra (Intel) > --- > include/linux/sched.h |1 > kernel/sched/core.c | 234 +++- > -- > 2 files changed, 205 insertions(+), 30 deletions(-) Some syscall fuzzing from an unprivileged user starts to trigger this below since this commit first appeared in the linux-next today. Does it ring any bells? [12065.065837][ T1310] INFO: task trinity-c30:91730 blocked for more than 368 seconds. [12065.073524][ T1310] Tainted: G L 5.10.0-rc3-next-20201112 #2 [12065.081076][ T1310] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [12065.089648][ T1310] task:trinity-c30 state:D stack:26576 pid:91730 ppid: 82688 flags:0x [12065.098818][ T1310] Call trace: [12065.101987][ T1310] __switch_to+0xf0/0x1a8 [12065.106227][ T1310] __schedule+0x6ec/0x1708 [12065.110505][ T1310] schedule+0x1bc/0x3b0 [12065.114562][ T1310] schedule_timeout+0x3c4/0x4c0 [12065.119275][ T1310] wait_for_completion+0x13c/0x248 [12065.124257][ T1310] affine_move_task+0x410/0x688 (inlined by) affine_move_task at kernel/sched/core.c:2261 [12065.129013][ T1310] __set_cpus_allowed_ptr+0x1b4/0x370 [12065.134248][ T1310] sched_setaffinity+0x4f0/0x7e8 [12065.139088][ T1310] __arm64_sys_sched_setaffinity+0x1f4/0x2a0 [12065.144972][ T1310] do_el0_svc+0x124/0x228 [12065.149165][ T1310] el0_sync_handler+0x208/0x384 [12065.153876][ T1310] el0_sync+0x140/0x180 [12065.157971][ T1310] [12065.157971][ T1310] Showing all locks held in the system: [12065.166401][ T1310] 1 lock held by khungtaskd/1310: [12065.171288][ T1310] #0: 800018d0cb40 (rcu_read_lock){}-{1:2}, at: rcu_lock_acquire.constprop.56+0x0/0x38 [12065.182210][ T1310] 4 locks held by trinity-main/82688: [12065.187515][ T1310] 2 locks held by kworker/u513:3/82813: [12065.192922][ T1310] #0: 00419d38 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x69c/0x18c8 [12065.203890][ T1310] #1: 122bfd40 ((work_completion)(>work)){+.+.}-{0:0}, at: __update_idle_core+0xa8/0x460 [12065.214916][ T1310] 1 lock held by trinity-c35/137168: [12065.220061][ T1310] #0: 0087ce767898 (>ldisc_sem){}-{0:0}, at: ldsem_down_read+0x3c/0x48 [12065.229483][ T1310] 3 locks held by trinity-c61/137611: [12065.234757][ T1310] 1 lock held by trinity-c7/137630: [12065.239828][ T1310] 1 lock held by trinity-c57/137714: [12065.242612][T137611] futex_wake_op: trinity-c61 tries to shift op by 1008; fix this program [12065.245012][ T1310] 1 lock held by trinity-c52/137771: [12065.258538][ T1310] 2 locks held by trinity-c42/137835: [12065.263783][ T1310] 4 locks held by trinity-c22/137868: [12065.269051][ T1310] #0: 000e78503798 (>lock){-.-.}-{2:2}, at: newidle_balance+0x92c/0xd78 [12065.278155][ T1310] #1: 0087ce767930 (>atomic_write_lock){+.+.}-{3:3}, at: tty_write_lock+0x30/0x58 [12065.288317][ T1310] #2: 800018d0cb40 (rcu_read_lock){}-{1:2}, at: __mutex_lock+0x24c/0x1310 [12065.297592][ T1310] #3: 800018d0cb40 (rcu_read_lock){}-{1:2}, at: lock_page_memcg+0x98/0x240 [12065.307026][ T1310] 2 locks held by trinity-c34/137896: [12065.312266][ T1310] #0: 000e78463798 (>lock){-.-.}-{2:2}, at: __schedule+0x22c/0x1708 [12065.321023][ T1310] #1: 800018d0cb40 (rcu_read_lock){}-{1:2}, at: __update_idle_core+0xa8/0x460 [12065.330663][ T1310] 2 locks held by trinity-c43/137909: [12065.335996][ T1310] 1 lock held by trinity-c24/137910: [12065.341164][ T1310] 1 lock held by trinity-c1/137954: [12065.346272][ T1310] 1 lock held by trinity-c49/138020: [12065.351425][ T1310] 1 lock held by trinity-c10/138021: [12065.356649][ T1310] 1 lock held by trinity-c32/138039: [12065.361813][ T1310] 4 locks held by trinity-c36/138042: [12065.367129][ T1310] 2 locks held by trinity-c14/138061: [12065.372378][ T1310] 2 locks held by trinity-c38/138070: [12065.377688][ T1310] 1 lock held by