Re: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr()

2020-11-13 Thread Peter Zijlstra
On Thu, Nov 12, 2020 at 07:31:12PM +, Valentin Schneider wrote:

> I think what is happening here is:
> 
>   affine_move_task()
>   // task_running() case
>   stop_one_cpu()
>   wait_for_completion(&pending->done);
> 
> and this is !PREEMPT, so the stopper can very well hit:
> 
>   migration_cpu_stop()
> // Task moved between unlocks and scheduling the stopper
> task_rq(p) != rq &&
> // task_running() case
> dest_cpu >= 0
> 
> => no complete_all(), ever :(

Damn...

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 02076e6d3792..fad0a8e62aca 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1923,7 +1923,7 @@ static int migration_cpu_stop(void *data)
>   else
>   p->wake_cpu = dest_cpu;
>  
> - } else if (dest_cpu < 0) {
> + } else if (dest_cpu < 0 || pending) {
>   /*
>* This happens when we get migrated between migrate_enable()'s
>* preempt_enable() and scheduling the stopper task. At that
> @@ -1933,6 +1933,17 @@ static int migration_cpu_stop(void *data)
>* more likely.
>*/
>  
> + /*
> +  * The task moved before the stopper got to run. We're holding
> +  * ->pi_lock, so the allowed mask is stable - if it got
> +  * somewhere allowed, we're done.
> +  */
> + if (pending && cpumask_test_cpu(task_cpu(p), p->cpus_ptr)) {
> + p->migration_pending = NULL;
> + complete = true;
> + goto out;
> + }
> +
>   /*
>* When this was migrate_enable() but we no longer have an
>* @pending, a concurrent SCA 'fixed' things and we should be

Agreed, this is very clearly a missing case and the proposed solution
seems straightforward enough; but I'm struggling to convince my
sleep-deprived brain we're actually complete now.

I'll continue staring at it a little more. Could you make it into a
proper patch?
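
For my own reading, the branch in migration_cpu_stop() with your diff
applied would then look roughly like this (just the two hunks above pasted
together, existing comments trimmed -- a sketch, not the literal result):

	} else if (dest_cpu < 0 || pending) {
		/*
		 * This happens when we get migrated between migrate_enable()'s
		 * preempt_enable() and scheduling the stopper task. At that
		 * point we're a regular task again ... [existing comment]
		 */

		/*
		 * The task moved before the stopper got to run. We're holding
		 * ->pi_lock, so the allowed mask is stable - if it got
		 * somewhere allowed, we're done.
		 */
		if (pending && cpumask_test_cpu(task_cpu(p), p->cpus_ptr)) {
			p->migration_pending = NULL;
			complete = true;
			goto out;
		}

		/*
		 * When this was migrate_enable() but we no longer have an
		 * @pending, a concurrent SCA 'fixed' things ... [existing comment]
		 */
		...
	}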


Re: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr()

2020-11-12 Thread Valentin Schneider


On 12/11/20 20:37, Qian Cai wrote:
> On Thu, 2020-11-12 at 19:31 +, Valentin Schneider wrote:
>> a) Do you also get this on CONFIG_PREEMPT=y?
>
> This also happens with:
>
> CONFIG_PREEMPT=y
> CONFIG_PREEMPTION=y
> CONFIG_PREEMPT_RCU=y
> CONFIG_PREEMPT_NOTIFIERS=y
> CONFIG_DEBUG_PREEMPT=y
> CONFIG_PREEMPTIRQ_TRACEPOINTS=y
>

Hmph, it should be much less likely to happen with PREEMPT=y, but isn't per
se impossible. Thanks for giving it a shot.

> [ 1235.044945][  T330] INFO: task trinity-c4:60050 blocked for more than 245 
> seconds.
> [ 1235.052540][  T330]   Not tainted 5.10.0-rc3-next-20201112+ #2
> [ 1235.058774][  T330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> disables this message.
> [ 1235.067392][  T330] task:trinity-c4  state:D stack:26880 pid:60050 
> ppid:  1722 flags:0x4000
> [ 1235.076505][  T330] Call Trace:
> [ 1235.079680][ T330] __schedule (kernel/sched/core.c:4272 
> kernel/sched/core.c:5019) 
> [ 1235.083971][ T330] ? __sched_text_start (kernel/sched/core.c:4901) 
> [ 1235.088721][ T330] schedule (kernel/sched/core.c:5099 (discriminator 1)) 
> [ 1235.092661][ T330] schedule_timeout (kernel/time/timer.c:1848) 
> [ 1235.097399][ T330] ? usleep_range (kernel/time/timer.c:1833) 
> [ 1235.101945][ T330] ? wait_for_completion (kernel/sched/completion.c:85 
> kernel/sched/completion.c:106 kernel/sched/completion.c:117 
> kernel/sched/completion.c:138) 
> [ 1235.107156][ T330] ? lock_downgrade (kernel/locking/lockdep.c:5443) 
> [ 1235.111883][ T330] ? rcu_read_unlock (./include/linux/rcupdate.h:692 
> (discriminator 5)) 
> [ 1235.116561][ T330] ? do_raw_spin_lock (./arch/x86/include/asm/atomic.h:202 
> ./include/asm-generic/atomic-instrumented.h:707 
> ./include/asm-generic/qspinlock.h:82 kernel/locking/spinlock_debug.c:113) 
> [ 1235.121459][ T330] ? _raw_spin_unlock_irq 
> (./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 
> ./include/linux/spinlock_api_smp.h:168 kernel/locking/spinlock.c:199) 
> [ 1235.126601][ T330] wait_for_completion (kernel/sched/completion.c:86 
> kernel/sched/completion.c:106 kernel/sched/completion.c:117 
> kernel/sched/completion.c:138) 
> [ 1235.131591][ T330] ? wait_for_completion_interruptible 
> (kernel/sched/completion.c:137) 
> [ 1235.138013][ T330] ? _raw_spin_unlock_irqrestore 
> (./include/linux/spinlock_api_smp.h:160 kernel/locking/spinlock.c:191) 
> [ 1235.143698][ T330] affine_move_task (./include/linux/instrumented.h:101 
> ./include/asm-generic/atomic-instrumented.h:220 
> ./include/linux/refcount.h:272 ./include/linux/refcount.h:315 
> ./include/linux/refcount.h:333 kernel/sched/core.c:2263) 
> [ 1235.148451][ T330] ? move_queued_task (kernel/sched/core.c:2151) 
> [ 1235.153351][ T330] ? update_curr (kernel/sched/sched.h:1176 
> kernel/sched/fair.c:845) 
> [ 1235.157848][ T330] ? enqueue_entity (kernel/sched/fair.c:4247) 
> [ 1235.162658][ T330] ? set_next_task_fair 
> (./arch/x86/include/asm/jump_label.h:25 (discriminator 2) 
> ./include/linux/jump_label.h:200 (discriminator 2) kernel/sched/fair.c:4567 
> (discriminator 2) kernel/sched/fair.c:4683 (discriminator 2) 
> kernel/sched/fair.c:10953 (discriminator 2)) 
> [ 1235.167667][ T330] __set_cpus_allowed_ptr (kernel/sched/core.c:2353) 
> [ 1235.172905][ T330] ? affine_move_task (kernel/sched/core.c:2287) 
> [ 1235.177826][ T330] ? _raw_spin_unlock_irqrestore 
> (./include/linux/spinlock_api_smp.h:160 kernel/locking/spinlock.c:191) 
> [ 1235.183501][ T330] sched_setaffinity (kernel/sched/core.c:6460) 
> [ 1235.188345][ T330] ? __ia32_sys_sched_getattr (kernel/sched/core.c:6393) 
> [ 1235.193937][ T330] ? _copy_from_user 
> (./arch/x86/include/asm/uaccess_64.h:46 
> ./arch/x86/include/asm/uaccess_64.h:52 lib/usercopy.c:16) 
> [ 1235.198605][ T330] __x64_sys_sched_setaffinity (kernel/sched/core.c:6511 
> kernel/sched/core.c:6500 kernel/sched/core.c:6500) 
> [ 1235.204291][ T330] ? sched_setaffinity (kernel/sched/core.c:6500) 
> [ 1235.209324][ T330] ? syscall_enter_from_user_mode 
> (./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 
> kernel/entry/common.c:98) 
> [ 1235.215133][ T330] do_syscall_64 (arch/x86/entry/common.c:46) 
> [ 1235.219431][ T330] entry_SYSCALL_64_after_hwframe 
> (arch/x86/entry/entry_64.S:127) 
> [ 1235.225251][  T330] RIP: 0033:0x7fb102b1178d
>
>> b) Could you try the below?
>
> It is running well so far on multiple systems. I'll keep it running and report
> back if it happens again.

Thanks! All of this is somewhat fragile, so I'll want to have another look
with a fresher mind; if the diff makes a difference at least it'll mean
I wasn't completely off.


Re: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr()

2020-11-12 Thread Qian Cai
On Thu, 2020-11-12 at 19:31 +, Valentin Schneider wrote:
> a) Do you also get this on CONFIG_PREEMPT=y?

This also happens with:

CONFIG_PREEMPT=y
CONFIG_PREEMPTION=y
CONFIG_PREEMPT_RCU=y
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_DEBUG_PREEMPT=y
CONFIG_PREEMPTIRQ_TRACEPOINTS=y

[ 1235.044945][  T330] INFO: task trinity-c4:60050 blocked for more than 245 
seconds.
[ 1235.052540][  T330]   Not tainted 5.10.0-rc3-next-20201112+ #2
[ 1235.058774][  T330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 1235.067392][  T330] task:trinity-c4  state:D stack:26880 pid:60050 ppid: 
 1722 flags:0x4000
[ 1235.076505][  T330] Call Trace:
[ 1235.079680][ T330] __schedule (kernel/sched/core.c:4272 
kernel/sched/core.c:5019) 
[ 1235.083971][ T330] ? __sched_text_start (kernel/sched/core.c:4901) 
[ 1235.088721][ T330] schedule (kernel/sched/core.c:5099 (discriminator 1)) 
[ 1235.092661][ T330] schedule_timeout (kernel/time/timer.c:1848) 
[ 1235.097399][ T330] ? usleep_range (kernel/time/timer.c:1833) 
[ 1235.101945][ T330] ? wait_for_completion (kernel/sched/completion.c:85 
kernel/sched/completion.c:106 kernel/sched/completion.c:117 
kernel/sched/completion.c:138) 
[ 1235.107156][ T330] ? lock_downgrade (kernel/locking/lockdep.c:5443) 
[ 1235.111883][ T330] ? rcu_read_unlock (./include/linux/rcupdate.h:692 
(discriminator 5)) 
[ 1235.116561][ T330] ? do_raw_spin_lock (./arch/x86/include/asm/atomic.h:202 
./include/asm-generic/atomic-instrumented.h:707 
./include/asm-generic/qspinlock.h:82 kernel/locking/spinlock_debug.c:113) 
[ 1235.121459][ T330] ? _raw_spin_unlock_irq 
(./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 
./include/linux/spinlock_api_smp.h:168 kernel/locking/spinlock.c:199) 
[ 1235.126601][ T330] wait_for_completion (kernel/sched/completion.c:86 
kernel/sched/completion.c:106 kernel/sched/completion.c:117 
kernel/sched/completion.c:138) 
[ 1235.131591][ T330] ? wait_for_completion_interruptible 
(kernel/sched/completion.c:137) 
[ 1235.138013][ T330] ? _raw_spin_unlock_irqrestore 
(./include/linux/spinlock_api_smp.h:160 kernel/locking/spinlock.c:191) 
[ 1235.143698][ T330] affine_move_task (./include/linux/instrumented.h:101 
./include/asm-generic/atomic-instrumented.h:220 ./include/linux/refcount.h:272 
./include/linux/refcount.h:315 ./include/linux/refcount.h:333 
kernel/sched/core.c:2263) 
[ 1235.148451][ T330] ? move_queued_task (kernel/sched/core.c:2151) 
[ 1235.153351][ T330] ? update_curr (kernel/sched/sched.h:1176 
kernel/sched/fair.c:845) 
[ 1235.157848][ T330] ? enqueue_entity (kernel/sched/fair.c:4247) 
[ 1235.162658][ T330] ? set_next_task_fair 
(./arch/x86/include/asm/jump_label.h:25 (discriminator 2) 
./include/linux/jump_label.h:200 (discriminator 2) kernel/sched/fair.c:4567 
(discriminator 2) kernel/sched/fair.c:4683 (discriminator 2) 
kernel/sched/fair.c:10953 (discriminator 2)) 
[ 1235.167667][ T330] __set_cpus_allowed_ptr (kernel/sched/core.c:2353) 
[ 1235.172905][ T330] ? affine_move_task (kernel/sched/core.c:2287) 
[ 1235.177826][ T330] ? _raw_spin_unlock_irqrestore 
(./include/linux/spinlock_api_smp.h:160 kernel/locking/spinlock.c:191) 
[ 1235.183501][ T330] sched_setaffinity (kernel/sched/core.c:6460) 
[ 1235.188345][ T330] ? __ia32_sys_sched_getattr (kernel/sched/core.c:6393) 
[ 1235.193937][ T330] ? _copy_from_user (./arch/x86/include/asm/uaccess_64.h:46 
./arch/x86/include/asm/uaccess_64.h:52 lib/usercopy.c:16) 
[ 1235.198605][ T330] __x64_sys_sched_setaffinity (kernel/sched/core.c:6511 
kernel/sched/core.c:6500 kernel/sched/core.c:6500) 
[ 1235.204291][ T330] ? sched_setaffinity (kernel/sched/core.c:6500) 
[ 1235.209324][ T330] ? syscall_enter_from_user_mode 
(./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 
kernel/entry/common.c:98) 
[ 1235.215133][ T330] do_syscall_64 (arch/x86/entry/common.c:46) 
[ 1235.219431][ T330] entry_SYSCALL_64_after_hwframe 
(arch/x86/entry/entry_64.S:127) 
[ 1235.225251][  T330] RIP: 0033:0x7fb102b1178d

> b) Could you try the below?

It is running well so far on multiple systems. I'll keep it running and report
back if it happens again.



Re: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr()

2020-11-12 Thread Qian Cai
On Thu, 2020-11-12 at 19:31 +, Valentin Schneider wrote:
> One thing I don't get: that trace shows refcount_dec_and_test()
> (kernel/sched/core.c:2263) happening before the wait_for_completion(). It's
> not the case in the below trace.

Yes, that is normal. Sometimes the decoding is a bit off; I'm not sure why, but
debugging options like KASAN may obscure it.

> a) Do you also get this on CONFIG_PREEMPT=y?

I don't know. None of the systems here has that, but I could probably try.

> b) Could you try the below?

Let me run it and report.

> 
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 02076e6d3792..fad0a8e62aca 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1923,7 +1923,7 @@ static int migration_cpu_stop(void *data)
>   else
>   p->wake_cpu = dest_cpu;
>  
> - } else if (dest_cpu < 0) {
> + } else if (dest_cpu < 0 || pending) {
>   /*
>* This happens when we get migrated between migrate_enable()'s
>* preempt_enable() and scheduling the stopper task. At that
> @@ -1933,6 +1933,17 @@ static int migration_cpu_stop(void *data)
>* more likely.
>*/
>  
> + /*
> +  * The task moved before the stopper got to run. We're holding
> +  * ->pi_lock, so the allowed mask is stable - if it got
> +  * somewhere allowed, we're done.
> +  */
> + if (pending && cpumask_test_cpu(task_cpu(p), p->cpus_ptr)) {
> + p->migration_pending = NULL;
> + complete = true;
> + goto out;
> + }
> +
>   /*
>* When this was migrate_enable() but we no longer have an
>* @pending, a concurrent SCA 'fixed' things and we should be
> 



Re: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr()

2020-11-12 Thread Valentin Schneider


On 12/11/20 18:01, Qian Cai wrote:
> On Thu, 2020-11-12 at 17:26 +, Valentin Schneider wrote:
>> On 12/11/20 16:38, Qian Cai wrote:
>> > Some syscall fuzzing from an unprivileged user starts to trigger this below
>> > since this commit first appeared in the linux-next today. Does it ring any
>> > bells?
>> >
>>
>> What's the .config? I'm interested in
>> CONFIG_PREEMPT
>> CONFIG_PREEMPT_RT
>> CONFIG_SMP
>
> https://cailca.coding.net/public/linux/mm/git/files/master/arm64.config
>
> # CONFIG_PREEMPT is not set
> CONFIG_SMP=y
>

So that's CONFIG_PREEMPT_NONE=y

> Also, I have been able to reproduce this on powerpc as well just now.
>
[...]
>
> [12065.101987][ T1310] __switch_to (arch/arm64/kernel/process.c:580)
> [12065.106227][ T1310] __schedule (kernel/sched/core.c:4272 
> kernel/sched/core.c:5019)
> [12065.110505][ T1310] schedule (./arch/arm64/include/asm/current.h:19 
> (discriminator 1) ./arch/arm64/include/asm/preempt.h:53 (discriminator 1) 
> kernel/sched/core.c:5099 (discriminator 1))
> [12065.114562][ T1310] schedule_timeout (kernel/time/timer.c:1848)
> [12065.119275][ T1310] wait_for_completion (kernel/sched/completion.c:85 
> kernel/sched/completion.c:106 kernel/sched/completion.c:117 
> kernel/sched/completion.c:138)
> [12065.124257][ T1310] affine_move_task (./include/linux/instrumented.h:101 
> ./include/asm-generic/atomic-instrumented.h:220 
> ./include/linux/refcount.h:272 ./include/linux/refcount.h:315 
> ./include/linux/refcount.h:333 kernel/sched/core.c:2263)
> [12065.129013][ T1310] __set_cpus_allowed_ptr (kernel/sched/core.c:2353)
> [12065.134248][ T1310] sched_setaffinity (kernel/sched/core.c:6460)
> [12065.139088][ T1310] __arm64_sys_sched_setaffinity 
> (kernel/sched/core.c:6511 kernel/sched/core.c:6500 kernel/sched/core.c:6500)
> [12065.144972][ T1310] do_el0_svc (arch/arm64/kernel/syscall.c:36 
> arch/arm64/kernel/syscall.c:48 arch/arm64/kernel/syscall.c:159 
> arch/arm64/kernel/syscall.c:205)
> [12065.149165][ T1310] el0_sync_handler (arch/arm64/kernel/entry-common.c:236 
> arch/arm64/kernel/entry-common.c:254)
> [12065.153876][ T1310] el0_sync (arch/arm64/kernel/entry.S:741)
>

Thanks!

One thing I don't get: that trace shows refcount_dec_and_test()
(kernel/sched/core.c:2263) happening before the wait_for_completion(). It's
not the case in the below trace.

> == powerpc ==
> [18060.020301][ T676] [c000200014227670] [c0a6d1e8] 
> __func__.5350+0x1220e0/0x181338 unreliable
> [18060.020333][ T676] [c000200014227850] [c001a278] __switch_to 
> (arch/powerpc/kernel/process.c:1273)
> [18060.020351][ T676] [c0002000142278c0] [c08f3e94] __schedule 
> (kernel/sched/core.c:4269 kernel/sched/core.c:5019)
> [18060.020377][ T676] [c000200014227990] [c08f4638] schedule 
> (./include/asm-generic/preempt.h:59 (discriminator 1) 
> kernel/sched/core.c:5099 (discriminator 1))
> [18060.020394][ T676] [c0002000142279c0] [c08fbd34] schedule_timeout 
> (kernel/time/timer.c:1847)
> [18060.020420][ T676] [c000200014227ac0] [c08f6398] 
> wait_for_completion (kernel/sched/completion.c:85 
> kernel/sched/completion.c:106 kernel/sched/completion.c:117 
> kernel/sched/completion.c:138)
> [18060.020455][ T676] [c000200014227b30] [c0100fd4] affine_move_task 
> (kernel/sched/core.c:2261)
> [18060.020481][ T676] [c000200014227c90] [c0101444] 
> __set_cpus_allowed_ptr (kernel/sched/core.c:2353)
> [18060.020507][ T676] [c000200014227d00] [c0106eac] sched_setaffinity 
> (kernel/sched/core.c:6460)
> [18060.020533][ T676] [c000200014227d70] [c0107134] 
> sys_sched_setaffinity (kernel/sched/core.c:6511 kernel/sched/core.c:6500)
> [18060.020559][ T676] [c000200014227dc0] [c002a6d8] 
> system_call_exception (arch/powerpc/kernel/syscall_64.c:111)
> [18060.020585][ T676] [c000200014227e20] [c000d0a8] 
> system_call_common (arch/powerpc/kernel/entry_64.S:302)

I take back what I said in that previous email; we could have gone through
the task_running() stop_one_cpu() and *then* hit the

  wait_for_completion(&pending->done);

and that is actually the only case that makes sense to me here, because the
!task_running() one will do the completion before waiting (that kernel has
no way to make a task Migration Disabled).

I think what is happening here is:

  affine_move_task()
  // task_running() case
  stop_one_cpu()
  wait_for_completion(&pending->done);

and this is !PREEMPT, so the stopper can very well hit:

  migration_cpu_stop()
// Task moved between unlocks and scheduling the stopper
task_rq(p) != rq &&
// task_running() case
dest_cpu >= 0

=> no complete_all(), ever :(

This is an annoying case because we didn't have to bother with it before;
an rq mismatch meant the task was fine, because we had modified
->cpus_allowed_mask beforehand.
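
To spell that out (a rough sketch of the pre-series ordering as I remember
it, not the literal code): the mask was updated before we decided whether to
kick the stopper, so by the time migration_cpu_stop() saw a stale rq the
task could only have landed somewhere allowed:

	__set_cpus_allowed_ptr(p, new_mask, ...)
	{
		/* p->pi_lock and rq->lock held */
		do_set_cpus_allowed(p, new_mask);   /* ->cpus_ptr updated *first* */

		if (task_running(rq, p))
			stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
		else if (task_on_rq_queued(p))
			rq = move_queued_task(rq, rf, p, dest_cpu);
	}

	/* ... so in migration_cpu_stop(): */
	if (task_rq(p) != rq) {
		/*
		 * The task already went through the scheduler with new_mask
		 * in place, i.e. it sits somewhere allowed - nothing to do.
		 */
	}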

With migrate_disable(), we have to chase after the bloody task because
we have to preempt it to get a stable is_migration_disabled() reading. It
could have been Migration 

Re: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr()

2020-11-12 Thread Qian Cai
On Thu, 2020-11-12 at 17:26 +, Valentin Schneider wrote:
> On 12/11/20 16:38, Qian Cai wrote:
> > Some syscall fuzzing from an unprivileged user starts to trigger this below
> > since this commit first appeared in the linux-next today. Does it ring any
> > bells?

X86 in a KVM guest as well.

guest .config: 
https://cailca.coding.net/public/linux/mm/git/files/master/x86.config

To reproduce:

# /usr/libexec/qemu-kvm -name kata -cpu host -smp 48 -m 48g -hda rhel-8.3-
x86_64-kvm.img.qcow2 -cdrom kata.iso -nic user,hostfwd=tcp::-:22 -nographic

== inside the guest ===
# git clone https://e.coding.net/cailca/linux/mm
# cd mm; make
# ./random -x 0-100 -f

[17213.432777][ T348] INFO: task trinity-c7:216885 can't die for more than 122 
seconds.
[17213.434895][ T348] task:trinity-c7  state:D stack:27088 pid:216885 
ppid:103237 flags:0x4004
[17213.437297][ T348] Call Trace:
[17213.438142][ T348] __schedule (kernel/sched/core.c:4272 
kernel/sched/core.c:5019) 
[17213.439256][ T348] ? __sched_text_start (kernel/sched/core.c:4901) 
[17213.440477][ T348] schedule (./arch/x86/include/asm/current.h:15 
(discriminator 1) ./include/linux/sched.h:1892 (discriminator 1) 
kernel/sched/core.c:5100 (discriminator 1)) 
[17213.441501][ T348] schedule_timeout (kernel/time/timer.c:1848) 
[17213.442834][ T348] ? usleep_range (kernel/time/timer.c:1833) 
[17213.444070][ T348] ? wait_for_completion (kernel/sched/completion.c:85 
kernel/sched/completion.c:106 kernel/sched/completion.c:117 
kernel/sched/completion.c:138) 
[17213.445457][ T348] ? lock_downgrade (kernel/locking/lockdep.c:5443) 
[17213.446695][ T348] ? rcu_read_unlock (./include/linux/rcupdate.h:692 
(discriminator 5)) 
[17213.447911][ T348] ? do_raw_spin_lock (./arch/x86/include/asm/atomic.h:202 
./include/asm-generic/atomic-instrumented.h:707 
./include/asm-generic/qspinlock.h:82 kernel/locking/spinlock_debug.c:113) 
[17213.449190][ T348] ? lockdep_hardirqs_on_prepare 
(kernel/locking/lockdep.c:4036 kernel/locking/lockdep.c:4096 
kernel/locking/lockdep.c:4048) 
[17213.450714][ T348] ? _raw_spin_unlock_irq 
(./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 
./include/linux/spinlock_api_smp.h:168 kernel/locking/spinlock.c:199) 
[17213.452042][ T348] wait_for_completion (kernel/sched/completion.c:86 
kernel/sched/completion.c:106 kernel/sched/completion.c:117 
kernel/sched/completion.c:138) 
[17213.453468][ T348] ? wait_for_completion_interruptible 
(kernel/sched/completion.c:137) 
[17213.455152][ T348] ? lockdep_hardirqs_on_prepare 
(kernel/locking/lockdep.c:4036 kernel/locking/lockdep.c:4096 
kernel/locking/lockdep.c:4048) 
[17213.456651][ T348] ? _raw_spin_unlock_irqrestore 
(./include/linux/spinlock_api_smp.h:160 kernel/locking/spinlock.c:191) 
[17213.458115][ T348] affine_move_task (./include/linux/instrumented.h:101 
./include/asm-generic/atomic-instrumented.h:220 ./include/linux/refcount.h:272 
./include/linux/refcount.h:315 ./include/linux/refcount.h:333 
kernel/sched/core.c:2263) 
[17213.459313][ T348] ? move_queued_task (kernel/sched/core.c:2151) 
[17213.460553][ T348] ? update_curr (kernel/sched/sched.h:1176 
kernel/sched/fair.c:845) 
[17213.461684][ T348] ? enqueue_entity (kernel/sched/fair.c:4247) 
[17213.463001][ T348] ? set_next_task_fair 
(./arch/x86/include/asm/jump_label.h:25 (discriminator 2) 
./include/linux/jump_label.h:200 (discriminator 2) kernel/sched/fair.c:4567 
(discriminator 2) kernel/sched/fair.c:4683 (discriminator 2) 
kernel/sched/fair.c:10953 (discriminator 2)) 
[17213.464294][ T348] __set_cpus_allowed_ptr (kernel/sched/core.c:2353) 
[17213.465668][ T348] ? affine_move_task (kernel/sched/core.c:2287) 
[17213.466952][ T348] ? lockdep_hardirqs_on_prepare 
(kernel/locking/lockdep.c:4036 kernel/locking/lockdep.c:4096 
kernel/locking/lockdep.c:4048) 
[17213.468452][ T348] ? _raw_spin_unlock_irqrestore 
(./include/linux/spinlock_api_smp.h:160 kernel/locking/spinlock.c:191) 
[17213.469908][ T348] sched_setaffinity (kernel/sched/core.c:6460) 
[17213.471127][ T348] ? __ia32_sys_sched_getattr (kernel/sched/core.c:6393) 
[17213.472644][ T348] ? _copy_from_user (./arch/x86/include/asm/uaccess_64.h:46 
./arch/x86/include/asm/uaccess_64.h:52 lib/usercopy.c:16) 
[17213.473850][ T348] __x64_sys_sched_setaffinity (kernel/sched/core.c:6511 
kernel/sched/core.c:6500 kernel/sched/core.c:6500) 
[17213.475307][ T348] ? sched_setaffinity (kernel/sched/core.c:6500) 
[17213.476542][ T348] ? lockdep_hardirqs_on_prepare 
(kernel/locking/lockdep.c:4036 kernel/locking/lockdep.c:4096 
kernel/locking/lockdep.c:4048) 
[17213.477991][ T348] ? syscall_enter_from_user_mode 
(./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 
kernel/entry/common.c:98) 
[17213.479428][ T348] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:50 
(discriminator 22)) 
[17213.480642][ T348] do_syscall_64 (arch/x86/entry/common.c:46) 
[17213.481706][ T348] entry_SYSCALL_64_after_hwframe 
(arch/x86/entry/entry_64.S:127) 

Re: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr()

2020-11-12 Thread Qian Cai
On Thu, 2020-11-12 at 17:26 +, Valentin Schneider wrote:
> On 12/11/20 16:38, Qian Cai wrote:
> > Some syscall fuzzing from an unprivileged user starts to trigger this below
> > since this commit first appeared in the linux-next today. Does it ring any
> > bells?
> > 
> 
> What's the .config? I'm interested in
> CONFIG_PREEMPT
> CONFIG_PREEMPT_RT
> CONFIG_SMP

https://cailca.coding.net/public/linux/mm/git/files/master/arm64.config

# CONFIG_PREEMPT is not set
CONFIG_SMP=y

Also, I have been able to reproduce this on powerpc as well just now.

> 
> From a quick look it seems that tree doesn't have Thomas' "generalization" of
> migrate_disable(), so if this doesn't have PREEMPT_RT we could forget about
> migrate_disable() for now.
> 
> > [12065.065837][ T1310] INFO: task trinity-c30:91730 blocked for more than
> > 368 seconds.
> > [12065.073524][ T1310]   Tainted: G L5.10.0-rc3-next-
> > 20201112 #2
> > [12065.081076][ T1310] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [12065.089648][ T1310] task:trinity-c30 state:D stack:26576 pid:91730
> > ppid: 82688 flags:0x
> > [12065.098818][ T1310] Call trace:
> > [12065.101987][ T1310]  __switch_to+0xf0/0x1a8
> > [12065.106227][ T1310]  __schedule+0x6ec/0x1708
> > [12065.110505][ T1310]  schedule+0x1bc/0x3b0
> > [12065.114562][ T1310]  schedule_timeout+0x3c4/0x4c0
> > [12065.119275][ T1310]  wait_for_completion+0x13c/0x248
> > [12065.124257][ T1310]  affine_move_task+0x410/0x688
> > (inlined by) affine_move_task at kernel/sched/core.c:2261
> > [12065.129013][ T1310]  __set_cpus_allowed_ptr+0x1b4/0x370
> > [12065.134248][ T1310]  sched_setaffinity+0x4f0/0x7e8
> > [12065.139088][ T1310]  __arm64_sys_sched_setaffinity+0x1f4/0x2a0
> > [12065.144972][ T1310]  do_el0_svc+0x124/0x228
> > [12065.149165][ T1310]  el0_sync_handler+0x208/0x384
> > [12065.153876][ T1310]  el0_sync+0x140/0x180
> > [12065.157971][ T1310]
> 
> So that's a task changing the affinity of some task (either itself or
> another; I can't say without a decoded stacktrace), and then blocking on a
> wait_for_completion() that apparently never happens.
> 
> I don't see stop_one_cpu() in the trace, so I assume it's the !task_running
> case, for which the completion should be completed before getting to the
> wait (unless we *do* have migrate_disable()).
> 
> Could you please run scripts/decode_stacktrace.sh on the above?

[12065.101987][ T1310] __switch_to (arch/arm64/kernel/process.c:580) 
[12065.106227][ T1310] __schedule (kernel/sched/core.c:4272 
kernel/sched/core.c:5019) 
[12065.110505][ T1310] schedule (./arch/arm64/include/asm/current.h:19 
(discriminator 1) ./arch/arm64/include/asm/preempt.h:53 (discriminator 1) 
kernel/sched/core.c:5099 (discriminator 1)) 
[12065.114562][ T1310] schedule_timeout (kernel/time/timer.c:1848) 
[12065.119275][ T1310] wait_for_completion (kernel/sched/completion.c:85 
kernel/sched/completion.c:106 kernel/sched/completion.c:117 
kernel/sched/completion.c:138) 
[12065.124257][ T1310] affine_move_task (./include/linux/instrumented.h:101 
./include/asm-generic/atomic-instrumented.h:220 ./include/linux/refcount.h:272 
./include/linux/refcount.h:315 ./include/linux/refcount.h:333 
kernel/sched/core.c:2263) 
[12065.129013][ T1310] __set_cpus_allowed_ptr (kernel/sched/core.c:2353) 
[12065.134248][ T1310] sched_setaffinity (kernel/sched/core.c:6460) 
[12065.139088][ T1310] __arm64_sys_sched_setaffinity (kernel/sched/core.c:6511 
kernel/sched/core.c:6500 kernel/sched/core.c:6500) 
[12065.144972][ T1310] do_el0_svc (arch/arm64/kernel/syscall.c:36 
arch/arm64/kernel/syscall.c:48 arch/arm64/kernel/syscall.c:159 
arch/arm64/kernel/syscall.c:205) 
[12065.149165][ T1310] el0_sync_handler (arch/arm64/kernel/entry-common.c:236 
arch/arm64/kernel/entry-common.c:254) 
[12065.153876][ T1310] el0_sync (arch/arm64/kernel/entry.S:741)

== powerpc ==
[18060.020301][ T676] [c000200014227670] [c0a6d1e8] 
__func__.5350+0x1220e0/0x181338 unreliable 
[18060.020333][ T676] [c000200014227850] [c001a278] __switch_to 
(arch/powerpc/kernel/process.c:1273) 
[18060.020351][ T676] [c0002000142278c0] [c08f3e94] __schedule 
(kernel/sched/core.c:4269 kernel/sched/core.c:5019) 
[18060.020377][ T676] [c000200014227990] [c08f4638] schedule 
(./include/asm-generic/preempt.h:59 (discriminator 1) kernel/sched/core.c:5099 
(discriminator 1)) 
[18060.020394][ T676] [c0002000142279c0] [c08fbd34] schedule_timeout 
(kernel/time/timer.c:1847) 
[18060.020420][ T676] [c000200014227ac0] [c08f6398] wait_for_completion 
(kernel/sched/completion.c:85 kernel/sched/completion.c:106 
kernel/sched/completion.c:117 kernel/sched/completion.c:138) 
[18060.020455][ T676] [c000200014227b30] [c0100fd4] affine_move_task 
(kernel/sched/core.c:2261) 
[18060.020481][ T676] [c000200014227c90] [c0101444] 
__set_cpus_allowed_ptr (kernel/sched/core.c:2353) 
[18060.020507][ T676] [c000200014227d00] 

Re: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr()

2020-11-12 Thread Valentin Schneider


On 12/11/20 16:38, Qian Cai wrote:
> Some syscall fuzzing from an unprivileged user starts to trigger this below
> since this commit first appeared in the linux-next today. Does it ring any
> bells?
>

What's the .config? I'm interested in
CONFIG_PREEMPT
CONFIG_PREEMPT_RT
CONFIG_SMP

From a quick look it seems that tree doesn't have Thomas' "generalization" of
migrate_disable(), so if this doesn't have PREEMPT_RT we could forget about
migrate_disable() for now.

> [12065.065837][ T1310] INFO: task trinity-c30:91730 blocked for more than 368 
> seconds.
> [12065.073524][ T1310]   Tainted: G L
> 5.10.0-rc3-next-20201112 #2
> [12065.081076][ T1310] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> disables this message.
> [12065.089648][ T1310] task:trinity-c30 state:D stack:26576 pid:91730 
> ppid: 82688 flags:0x
> [12065.098818][ T1310] Call trace:
> [12065.101987][ T1310]  __switch_to+0xf0/0x1a8
> [12065.106227][ T1310]  __schedule+0x6ec/0x1708
> [12065.110505][ T1310]  schedule+0x1bc/0x3b0
> [12065.114562][ T1310]  schedule_timeout+0x3c4/0x4c0
> [12065.119275][ T1310]  wait_for_completion+0x13c/0x248
> [12065.124257][ T1310]  affine_move_task+0x410/0x688
> (inlined by) affine_move_task at kernel/sched/core.c:2261
> [12065.129013][ T1310]  __set_cpus_allowed_ptr+0x1b4/0x370
> [12065.134248][ T1310]  sched_setaffinity+0x4f0/0x7e8
> [12065.139088][ T1310]  __arm64_sys_sched_setaffinity+0x1f4/0x2a0
> [12065.144972][ T1310]  do_el0_svc+0x124/0x228
> [12065.149165][ T1310]  el0_sync_handler+0x208/0x384
> [12065.153876][ T1310]  el0_sync+0x140/0x180
> [12065.157971][ T1310]

So that's a task changing the affinity of some task (either itself or
another; I can't say without a decoded stacktrace), and then blocking on a
wait_for_completion() that apparently never happens.

I don't see stop_one_cpu() in the trace, so I assume it's the !task_running
case, for which the completion should be completed before getting to the
wait (unless we *do* have migrate_disable()).
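
For reference, the two paths I have in mind look roughly like this (a
heavily trimmed paraphrase of affine_move_task(), not the literal code):

	affine_move_task(rq, p, rf, dest_cpu, flags)
	{
		/* pending / refcount setup elided */

		if (task_running(rq, p) || p->state == TASK_WAKING) {
			/* chase the running task with the stopper... */
			stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
		} else {
			/*
			 * ...otherwise move it (if queued) and complete now,
			 * modulo migrate_disable(), which this kernel can't do.
			 */
			if (task_on_rq_queued(p))
				rq = move_queued_task(rq, rf, p, dest_cpu);
			complete_all(&pending->done);
		}

		/* ...and only then block */
		wait_for_completion(&pending->done);
	}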

Could you please run scripts/decode_stacktrace.sh on the above?

> [12065.157971][ T1310] Showing all locks held in the system:
> [12065.166401][ T1310] 1 lock held by khungtaskd/1310:
> [12065.171288][ T1310]  #0: 800018d0cb40 (rcu_read_lock){}-{1:2}, at: 
> rcu_lock_acquire.constprop.56+0x0/0x38
> [12065.182210][ T1310] 4 locks held by trinity-main/82688:
> [12065.187515][ T1310] 2 locks held by kworker/u513:3/82813:
> [12065.192922][ T1310]  #0: 00419d38 
> ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x69c/0x18c8
> [12065.203890][ T1310]  #1: 122bfd40 
> ((work_completion)(>work)){+.+.}-{0:0}, at: __update_idle_core+0xa8/0x460
> [12065.214916][ T1310] 1 lock held by trinity-c35/137168:
> [12065.220061][ T1310]  #0: 0087ce767898 (>ldisc_sem){}-{0:0}, 
> at: ldsem_down_read+0x3c/0x48
> [12065.229483][ T1310] 3 locks held by trinity-c61/137611:
> [12065.234757][ T1310] 1 lock held by trinity-c7/137630:
> [12065.239828][ T1310] 1 lock held by trinity-c57/137714:
> [12065.242612][T137611] futex_wake_op: trinity-c61 tries to shift op by 1008; 
> fix this program
> [12065.245012][ T1310] 1 lock held by trinity-c52/137771:
> [12065.258538][ T1310] 2 locks held by trinity-c42/137835:
> [12065.263783][ T1310] 4 locks held by trinity-c22/137868:
> [12065.269051][ T1310]  #0: 000e78503798 (>lock){-.-.}-{2:2}, at: 
> newidle_balance+0x92c/0xd78
> [12065.278155][ T1310]  #1: 0087ce767930 
> (>atomic_write_lock){+.+.}-{3:3}, at: tty_write_lock+0x30/0x58
> [12065.288317][ T1310]  #2: 800018d0cb40 (rcu_read_lock){}-{1:2}, at: 
> __mutex_lock+0x24c/0x1310
> [12065.297592][ T1310]  #3: 800018d0cb40 (rcu_read_lock){}-{1:2}, at: 
> lock_page_memcg+0x98/0x240
> [12065.307026][ T1310] 2 locks held by trinity-c34/137896:
> [12065.312266][ T1310]  #0: 000e78463798 (>lock){-.-.}-{2:2}, at: 
> __schedule+0x22c/0x1708
> [12065.321023][ T1310]  #1: 800018d0cb40 (rcu_read_lock){}-{1:2}, at: 
> __update_idle_core+0xa8/0x460
> [12065.330663][ T1310] 2 locks held by trinity-c43/137909:
> [12065.335996][ T1310] 1 lock held by trinity-c24/137910:
> [12065.341164][ T1310] 1 lock held by trinity-c1/137954:
> [12065.346272][ T1310] 1 lock held by trinity-c49/138020:
> [12065.351425][ T1310] 1 lock held by trinity-c10/138021:
> [12065.356649][ T1310] 1 lock held by trinity-c32/138039:
> [12065.361813][ T1310] 4 locks held by trinity-c36/138042:
> [12065.367129][ T1310] 2 locks held by trinity-c14/138061:
> [12065.372378][ T1310] 2 locks held by trinity-c38/138070:
> [12065.377688][ T1310] 1 lock held by trinity-c50/138074:
> [12065.382885][ T1310] 1 lock held by trinity-c12/138085:
> [12065.388186][ T1310] 1 lock held by trinity-c4/138087:
> [12065.393272][ T1310] 3 locks held by trinity-c6/138091:
> [12065.398492][ T1310] 2 locks held by trinity-c48/138095:
> [12065.403757][ T1310] 2 locks held by trinity-c62/138097:
> [12065.409045][ T1310] 2 locks held by 

Re: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr()

2020-11-12 Thread Qian Cai
On Fri, 2020-10-23 at 12:12 +0200, Peter Zijlstra wrote:
> Concurrent migrate_disable() and set_cpus_allowed_ptr() have
> interesting features. We rely on set_cpus_allowed_ptr() to not return
> until the task runs inside the provided mask. This expectation is
> exported to userspace.
> 
> This means that any set_cpus_allowed_ptr() caller must wait until
> migrate_enable() allows migrations.
> 
> At the same time, we don't want migrate_enable() to schedule, due to
> patterns like:
> 
>   preempt_disable();
>   migrate_disable();
>   ...
>   migrate_enable();
>   preempt_enable();
> 
> And:
> 
>   raw_spin_lock(&B);
>   spin_unlock(&A);
> 
> this means that when migrate_enable() must restore the affinity
> mask, it cannot wait for completion thereof. Luck will have it that
> that is exactly the case where there is a pending
> set_cpus_allowed_ptr(), so let that provide storage for the async stop
> machine.
> 
> Much thanks to Valentin, who used TLA+ most effectively and found lots of
> 'interesting' cases.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  include/linux/sched.h |    1 
>  kernel/sched/core.c   |  234 +++---
>  2 files changed, 205 insertions(+), 30 deletions(-)

Some syscall fuzzing from an unprivileged user starts to trigger this below
since this commit first appeared in the linux-next today. Does it ring any
bells?

[12065.065837][ T1310] INFO: task trinity-c30:91730 blocked for more than 368 
seconds.
[12065.073524][ T1310]   Tainted: G L
5.10.0-rc3-next-20201112 #2
[12065.081076][ T1310] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[12065.089648][ T1310] task:trinity-c30 state:D stack:26576 pid:91730 ppid: 
82688 flags:0x
[12065.098818][ T1310] Call trace:
[12065.101987][ T1310]  __switch_to+0xf0/0x1a8
[12065.106227][ T1310]  __schedule+0x6ec/0x1708
[12065.110505][ T1310]  schedule+0x1bc/0x3b0
[12065.114562][ T1310]  schedule_timeout+0x3c4/0x4c0
[12065.119275][ T1310]  wait_for_completion+0x13c/0x248
[12065.124257][ T1310]  affine_move_task+0x410/0x688
(inlined by) affine_move_task at kernel/sched/core.c:2261
[12065.129013][ T1310]  __set_cpus_allowed_ptr+0x1b4/0x370
[12065.134248][ T1310]  sched_setaffinity+0x4f0/0x7e8
[12065.139088][ T1310]  __arm64_sys_sched_setaffinity+0x1f4/0x2a0
[12065.144972][ T1310]  do_el0_svc+0x124/0x228
[12065.149165][ T1310]  el0_sync_handler+0x208/0x384
[12065.153876][ T1310]  el0_sync+0x140/0x180
[12065.157971][ T1310] 
[12065.157971][ T1310] Showing all locks held in the system:
[12065.166401][ T1310] 1 lock held by khungtaskd/1310:
[12065.171288][ T1310]  #0: 800018d0cb40 (rcu_read_lock){}-{1:2}, at: 
rcu_lock_acquire.constprop.56+0x0/0x38
[12065.182210][ T1310] 4 locks held by trinity-main/82688:
[12065.187515][ T1310] 2 locks held by kworker/u513:3/82813:
[12065.192922][ T1310]  #0: 00419d38 
((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x69c/0x18c8
[12065.203890][ T1310]  #1: 122bfd40 
((work_completion)(>work)){+.+.}-{0:0}, at: __update_idle_core+0xa8/0x460
[12065.214916][ T1310] 1 lock held by trinity-c35/137168:
[12065.220061][ T1310]  #0: 0087ce767898 (>ldisc_sem){}-{0:0}, at: 
ldsem_down_read+0x3c/0x48
[12065.229483][ T1310] 3 locks held by trinity-c61/137611:
[12065.234757][ T1310] 1 lock held by trinity-c7/137630:
[12065.239828][ T1310] 1 lock held by trinity-c57/137714:
[12065.242612][T137611] futex_wake_op: trinity-c61 tries to shift op by 1008; 
fix this program
[12065.245012][ T1310] 1 lock held by trinity-c52/137771:
[12065.258538][ T1310] 2 locks held by trinity-c42/137835:
[12065.263783][ T1310] 4 locks held by trinity-c22/137868:
[12065.269051][ T1310]  #0: 000e78503798 (>lock){-.-.}-{2:2}, at: 
newidle_balance+0x92c/0xd78
[12065.278155][ T1310]  #1: 0087ce767930 
(>atomic_write_lock){+.+.}-{3:3}, at: tty_write_lock+0x30/0x58
[12065.288317][ T1310]  #2: 800018d0cb40 (rcu_read_lock){}-{1:2}, at: 
__mutex_lock+0x24c/0x1310
[12065.297592][ T1310]  #3: 800018d0cb40 (rcu_read_lock){}-{1:2}, at: 
lock_page_memcg+0x98/0x240
[12065.307026][ T1310] 2 locks held by trinity-c34/137896:
[12065.312266][ T1310]  #0: 000e78463798 (>lock){-.-.}-{2:2}, at: 
__schedule+0x22c/0x1708
[12065.321023][ T1310]  #1: 800018d0cb40 (rcu_read_lock){}-{1:2}, at: 
__update_idle_core+0xa8/0x460
[12065.330663][ T1310] 2 locks held by trinity-c43/137909:
[12065.335996][ T1310] 1 lock held by trinity-c24/137910:
[12065.341164][ T1310] 1 lock held by trinity-c1/137954:
[12065.346272][ T1310] 1 lock held by trinity-c49/138020:
[12065.351425][ T1310] 1 lock held by trinity-c10/138021:
[12065.356649][ T1310] 1 lock held by trinity-c32/138039:
[12065.361813][ T1310] 4 locks held by trinity-c36/138042:
[12065.367129][ T1310] 2 locks held by trinity-c14/138061:
[12065.372378][ T1310] 2 locks held by trinity-c38/138070:
[12065.377688][ T1310] 1 lock held by