On Wed, Jan 17, 2024 at 12:31:33PM -0800, Andrew Morton wrote:
> On Wed, 17 Jan 2024 06:16:36 +0000 Chen Zhongjin <chenzhong...@huawei.com> wrote:
> 
> > There is a deadlock scenario in kprobe_optimizer():
> > 
> >  pid A                           pid B                    pid C
> >  kprobe_optimizer()              do_exit()                perf_kprobe_init()
> >  mutex_lock(&kprobe_mutex)       exit_tasks_rcu_start()   mutex_lock(&kprobe_mutex)
> >  synchronize_rcu_tasks()         zap_pid_ns_processes()   // waiting kprobe_mutex
> >  // waiting tasks_rcu_exit_srcu  kernel_wait4()
> >                                  // waiting pid C exit
> > 
> > To avoid this deadlock loop, use synchronize_rcu_tasks_rude() in
> > kprobe_optimizer() rather than synchronize_rcu_tasks().
> > synchronize_rcu_tasks_rude() can also promise that all preempted tasks
> > have scheduled, but it will not wait for tasks_rcu_exit_srcu.
Hello, Chen Zhongjin,

Relying on the non-preemptibility of the last portion of the do_exit()
function does appear to be the basis for a much better solution than
what we currently have.  So at the very least, thank you for the idea!!!
I feel a bit silly for not having thought of it myself.  ;-)

However, just invoking synchronize_rcu_tasks_rude() will be bad for
both energy efficiency and real-time response.  This is due to the fact
that synchronize_rcu_tasks_rude() sends an IPI to each and every online
CPU, almost none of which will be in the non-preemptible tail of
do_exit() at any given time.  These additional unnecessary IPIs will
drain battery when they hit idle CPUs and degrade real-time response
when they hit CPUs running aggressive real-time applications.  Which
might not make people happy.

So, would you be willing to use RCU's do_exit() hooks and RCU's hook
into the scheduler (either rcu_note_context_switch() or rcu_tasks_qs(),
whichever would work better) to maintain a per-CPU variable that is a
pointer to the task in the non-preemptible tail of do_exit() if there
is one, or NULL otherwise?  This would get us the deadlock-avoidance
simplicity of your underlying idea, while avoiding (almost all of) the
energy-efficiency and real-time-response issues with your patch.

This does require a bit of memory-ordering care, so if you would prefer
that I do the implementation, just let me know.  (The memory-ordering
trick is to use "smp_mb(); smp_load_acquire();" to sample this variable
and "smp_store_release(); smp_mb();" to update it.)  A rough sketch of
this approach appears below the quoted text.

							Thanx, Paul

> > Signed-off-by: Chen Zhongjin <chenzhong...@huawei.com>
> 
> Thanks.  Should we backport this fix into earlier kernels?  If so,
> are we able to identify a suitable Fixes: target?
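
To make the per-CPU-pointer approach described above a bit more concrete,
here is a rough, untested sketch.  The variable name (exit_tail_task) and
the helper functions are purely illustrative, not existing kernel symbols,
and the exact hook placement (the do_exit() side versus the
rcu_note_context_switch()/rcu_tasks_qs() side) would still need to be
worked out against the current code:

/*
 * Rough sketch only: exit_tail_task and these helpers are illustrative
 * names, not existing kernel symbols.
 */
#include <linux/percpu.h>
#include <linux/sched.h>

/* Task currently in the non-preemptible tail of do_exit() on this CPU,
 * or NULL if there is no such task. */
static DEFINE_PER_CPU(struct task_struct *, exit_tail_task);

/* Called from RCU's do_exit() hook once the exiting task has entered
 * the non-preemptible tail of do_exit(). */
static void exit_tail_set(struct task_struct *t)
{
	/* Release store, then full barrier, so that a concurrent
	 * sampler cannot miss this task. */
	smp_store_release(this_cpu_ptr(&exit_tail_task), t);
	smp_mb();
}

/* Called from the scheduler hook (rcu_note_context_switch() or
 * rcu_tasks_qs()): once this CPU has context-switched, it is no
 * longer running an exit tail. */
static void exit_tail_clear(void)
{
	smp_store_release(this_cpu_ptr(&exit_tail_task), NULL);
	smp_mb();
}

/* Called from the Tasks-RCU grace-period machinery to check whether
 * the specified CPU is currently running a task in the non-preemptible
 * tail of do_exit(). */
static struct task_struct *exit_tail_sample(int cpu)
{
	smp_mb();  /* Pairs with the updaters' smp_mb(). */
	return smp_load_acquire(per_cpu_ptr(&exit_tail_task, cpu));
}

The barrier placement simply follows the "smp_mb(); smp_load_acquire();"
/ "smp_store_release(); smp_mb();" pattern mentioned above; the finer
memory-ordering points would of course need careful review.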