On Tue, 2024-10-29 at 09:29 -0700, Paul E. McKenney wrote:
> External email : Please do not click links or open attachments until you
> have verified the sender or the content.
>
>
> On Tue, Oct 29, 2024 at 02:20:51AM +0000, Cheng-Jui Wang (王正睿) wrote:
> > On Mon, 2024-10-28 at 17:22 -0700, Paul E. McKenney wrote:
> > > The result is that the current leaf rcu_node structure's ->lock is
> > > acquired only if a stack backtrace might be needed from the current CPU,
> > > and is held across only that CPU's backtrace.  As a result, if there are
> >
> > After upgrading our device to kernel-6.11, we encountered a lockup
> > scenario under stall warning.
> > I had prepared a patch to submit, but I noticed that this series has
> > already addressed some of the issues, though it hasn't been merged into
> > the mainline yet.  So, I decided to reply to this series to discuss how
> > to fix it before pushing.  Here is the lockup scenario we encountered:
> >
> > Device: arm64 with only 8 cores.
> > One CPU holds rnp->lock in rcu_dump_cpu_stacks() while trying to dump
> > the other CPUs, and it waits for each corresponding CPU to dump its
> > backtrace, with a 10-second timeout.
> >
> >   __delay()
> >   __const_udelay()
> >   nmi_trigger_cpumask_backtrace()
> >   arch_trigger_cpumask_backtrace()
> >   trigger_single_cpu_backtrace()
> >   dump_cpu_task()
> >   rcu_dump_cpu_stacks()  <- holding rnp->lock
> >   print_other_cpu_stall()
> >   check_cpu_stall()
> >   rcu_pending()
> >   rcu_sched_clock_irq()
> >   update_process_times()
> >
> > However, the other 7 CPUs are waiting for rnp->lock on the path to
> > report a quiescent state.
> >
> >   queued_spin_lock_slowpath()
> >   queued_spin_lock()
> >   do_raw_spin_lock()
> >   __raw_spin_lock_irqsave()
> >   _raw_spin_lock_irqsave()
> >   rcu_report_qs_rdp()
> >   rcu_check_quiescent_state()
> >   rcu_core()
> >   rcu_core_si()
> >   handle_softirqs()
> >   __do_softirq()
> >   ____do_softirq()
> >   call_on_irq_stack()
> >
> > Since the arm64 architecture uses IPIs instead of true NMIs to implement
> > arch_trigger_cpumask_backtrace(), spin_lock_irqsave() disables
> > interrupts, which is enough to block this IPI request.
> > Therefore, if other CPUs start waiting for the lock before receiving
> > the IPI, a semi-deadlock scenario like the following occurs:
> >
> > CPU 0                      CPU 1                      CPU 2
> > -----                      -----                      -----
> > lock_irqsave(rnp->lock)
> >                            lock_irqsave(rnp->lock)
> >                            <can't receive IPI>
> > <send ipi to CPU 1>
> > <wait CPU 1 for 10s>
> >                                                       lock_irqsave(rnp->lock)
> >                                                       <can't receive IPI>
> > <send ipi to CPU 2>
> > <wait CPU 2 for 10s>
> > ...
> >
> >
> > In our scenario, with 7 CPUs to dump, the lockup lasts nearly 70
> > seconds, preventing subsequent useful logs from being printed and
> > leading to a watchdog timeout and system reboot.
> >
> > This series re-acquires the lock after each dump, significantly
> > reducing the lock-holding time.  However, since it still holds the
> > lock while dumping a CPU's backtrace, there is still a chance for two
> > CPUs to wait for each other for 10 seconds, which is still too long.
> > So, I would like to ask: is it necessary to dump the backtrace within
> > the spinlock section?
> > If not, especially now that lockless checks are possible, maybe it can
> > be changed as follows?
> >
> > -		if (!(data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)))
> > -			continue;
> > -		raw_spin_lock_irqsave_rcu_node(rnp, flags);
> > -		if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) {
> > +		if (data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)) {
> >  			if (cpu_is_offline(cpu))
> >  				pr_err("Offline CPU %d blocking current GP.\n", cpu);
> >  			else
> >  				dump_cpu_task(cpu);
> >  		}
> >  	}
> > -	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> >
> > Or should this be considered an arm64 issue, meaning arm64 should
> > switch to true NMIs and otherwise not use
> > nmi_trigger_cpumask_backtrace()?
>
> Thank you for looking into this!
>
> We do assume that nmi_trigger_cpumask_backtrace() uses true NMIs, so,
> yes, nmi_trigger_cpumask_backtrace() should use true NMIs, just like
> the name says.  ;-)

In the comments of the following patch, the arm64 maintainers have
differing views on the use of nmi_trigger_cpumask_backtrace().  I'm a
bit confused and unsure which perspective is more reasonable.
https://lore.kernel.org/all/20230906090246.v13.4.Ie6c132b96ebbbcddbf6954b9469ed40a6960343c@changeid/

> /*
>  * NOTE: though nmi_trigger_cpumask_backtrace() has "nmi_" in the name,
>  * nothing about it truly needs to be implemented using an NMI, it's
>  * just that it's _allowed_ to work with NMIs. If ipi_should_be_nmi()
>  * returned false our backtrace attempt will just use a regular IPI.
>  */

> Alternatively, arm64 could continue using nmi_trigger_cpumask_backtrace()
> with normal interrupts (for example, on SoCs not implementing true NMIs),
> but have a short timeout (maybe a few jiffies?) after which it returns
> false (and presumably also cancels the backtrace request so that when
> the non-NMI interrupt eventually does happen, its handler simply returns
> without backtracing).  This should be implemented using atomics to avoid
> deadlock issues.  This alternative approach would provide accurate arm64
> backtraces in the common case where interrupts are enabled, but allow
> a graceful fallback to remote tracing otherwise.
>
> Would you be interested in working on this issue, whatever solution the
> arm64 maintainers end up preferring?
>

The 10-second timeout is hard-coded in nmi_trigger_cpumask_backtrace().
It is shared code and not architecture-specific.  Currently, I haven't
thought of a feasible solution.  I have also CC'd the authors of the
aforementioned patch to see if they have any other ideas.

Regarding the RCU stall warning, I think the purpose of acquiring
`rnp->lock` is to protect the rnp->qsmask variable rather than to protect
the `dump_cpu_task()` operation, right?  Therefore, there is no need to
call dump_cpu_task() while holding the lock.  While holding the spinlock,
we can store the CPUs that need to be dumped into a cpumask, and then
dump them all at once after releasing the lock.
Here is my temporary solution, used locally on top of kernel-6.11 (a
sketch of the resulting function with this change applied follows at the
end of this message):

+	cpumask_var_t mask;
+	bool mask_ok;

+	mask_ok = zalloc_cpumask_var(&mask, GFP_ATOMIC);
 	rcu_for_each_leaf_node(rnp) {
 		raw_spin_lock_irqsave_rcu_node(rnp, flags);
 		for_each_leaf_node_possible_cpu(rnp, cpu)
 			if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) {
 				if (cpu_is_offline(cpu))
 					pr_err("Offline CPU %d blocking current GP.\n", cpu);
+				else if (mask_ok)
+					cpumask_set_cpu(cpu, mask);
 				else
 					dump_cpu_task(cpu);
 			}
 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 	}
+	if (mask_ok) {
+		if (!trigger_cpumask_backtrace(mask)) {
+			for_each_cpu(cpu, mask)
+				dump_cpu_task(cpu);
+		}
+		free_cpumask_var(mask);
+	}

After applying this, I haven't encountered the lockup issue for five
days, whereas it used to occur about once a day.

> 						Thanx, Paul
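For easier reading, here is a sketch of rcu_dump_cpu_stacks() with the
above change applied.  The function signature and the local variable
declarations are assumed from the kernel-6.11 version of
kernel/rcu/tree_stall.h and are not part of the diff itself; only the
lines marked "+" above are the actual change.

/* Sketch: kernel-6.11 rcu_dump_cpu_stacks() with the deferred-dump cpumask applied. */
static void rcu_dump_cpu_stacks(void)
{
	int cpu;
	unsigned long flags;
	struct rcu_node *rnp;
	cpumask_var_t mask;
	bool mask_ok;

	/* GFP_ATOMIC: the stall-warning path can run in interrupt context. */
	mask_ok = zalloc_cpumask_var(&mask, GFP_ATOMIC);
	rcu_for_each_leaf_node(rnp) {
		raw_spin_lock_irqsave_rcu_node(rnp, flags);
		for_each_leaf_node_possible_cpu(rnp, cpu)
			if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) {
				if (cpu_is_offline(cpu))
					pr_err("Offline CPU %d blocking current GP.\n", cpu);
				else if (mask_ok)
					/* Defer the dump until rnp->lock is dropped. */
					cpumask_set_cpu(cpu, mask);
				else
					/* Allocation failed: dump under the lock as before. */
					dump_cpu_task(cpu);
			}
		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
	}
	if (mask_ok) {
		/* Backtrace the deferred CPUs with no rnp->lock held. */
		if (!trigger_cpumask_backtrace(mask)) {
			for_each_cpu(cpu, mask)
				dump_cpu_task(cpu);
		}
		free_cpumask_var(mask);
	}
}

With this shape, a CPU that cannot receive the backtrace IPI can still
make the dumping CPU wait up to 10 seconds, but that wait no longer
happens while holding rnp->lock, so the other CPUs' quiescent-state
reports are not blocked behind the dump.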