On Tue, 2024-10-29 at 09:29 -0700, Paul E. McKenney wrote:
> 
> On Tue, Oct 29, 2024 at 02:20:51AM +0000, Cheng-Jui Wang (王正睿) wrote:
> > On Mon, 2024-10-28 at 17:22 -0700, Paul E. McKenney wrote:
> > > The result is that the current leaf rcu_node structure's ->lock is
> > > acquired only if a stack backtrace might be needed from the current CPU,
> > > and is held across only that CPU's backtrace. As a result, if there are
> > 
> > After upgrading our devices to kernel-6.11, we encountered a lockup
> > scenario during RCU stall warnings.
> > I had prepared a patch to submit, but I noticed that this series
> > already addresses some of the issues, though it hasn't been merged
> > into mainline yet. So I decided to reply to this series to discuss
> > how to fix it before pushing. Here is the lockup scenario we
> > encountered:
> > 
> > Device: arm64 with only 8 cores
> > One CPU holds rnp->lock in rcu_dump_cpu_stacks() while trying to dump
> > the other CPUs, waiting up to 10 seconds for each target CPU to dump
> > its backtrace:
> > 
> >    __delay()
> >    __const_udelay()
> >    nmi_trigger_cpumask_backtrace()
> >    arch_trigger_cpumask_backtrace()
> >    trigger_single_cpu_backtrace()
> >    dump_cpu_task()
> >    rcu_dump_cpu_stacks()  <- holding rnp->lock
> >    print_other_cpu_stall()
> >    check_cpu_stall()
> >    rcu_pending()
> >    rcu_sched_clock_irq()
> >    update_process_times()
> > 
> > Meanwhile, the other 7 CPUs are waiting for rnp->lock on the path to
> > report a quiescent state:
> > 
> >    queued_spin_lock_slowpath()
> >    queued_spin_lock()
> >    do_raw_spin_lock()
> >    __raw_spin_lock_irqsave()
> >    _raw_spin_lock_irqsave()
> >    rcu_report_qs_rdp()
> >    rcu_check_quiescent_state()
> >    rcu_core()
> >    rcu_core_si()
> >    handle_softirqs()
> >    __do_softirq()
> >    ____do_softirq()
> >    call_on_irq_stack()
> > 
> > Since arm64 implements arch_trigger_cpumask_backtrace() with a
> > regular IPI rather than a true NMI, the interrupts disabled by
> > spin_lock_irqsave() are enough to block that IPI request.
> > Therefore, if the other CPUs start waiting for the lock before
> > receiving the IPI, a semi-deadlock scenario like the following
> > occurs:
> > 
> > CPU0                    CPU1                    CPU2
> > -----                   -----                   -----
> > lock_irqsave(rnp->lock)
> >                         lock_irqsave(rnp->lock)
> >                         <can't receive IPI>
> > <send ipi to CPU 1>
> > <wait CPU 1 for 10s>
> >                                                 lock_irqsave(rnp->lock)
> >                                                 <can't receive IPI>
> > <send ipi to CPU 2>
> > <wait CPU 2 for 10s>
> > ...
> > 
> > 
> > In our scenario, with 7 CPUs to dump, the lockup lasts nearly 70
> > seconds, preventing subsequent useful logs from being printed and
> > ultimately leading to a watchdog timeout and system reboot.
> > 
> > This series re-acquires the lock around each dump, significantly
> > reducing the lock-holding time. However, since the lock is still
> > held while dumping a CPU's backtrace, two CPUs can still end up
> > waiting on each other for 10 seconds, which is too long. So I would
> > like to ask: is it necessary to dump the backtrace within the
> > spinlock section?
> > If not, especially now that the qsmask check can be done locklessly,
> > maybe it can be changed as follows?
> > 
> > -			if (!(data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)))
> > -				continue;
> > -			raw_spin_lock_irqsave_rcu_node(rnp, flags);
> > -			if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) {
> > +			if (data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)) {
> > 				if (cpu_is_offline(cpu))
> > 					pr_err("Offline CPU %d blocking current GP.\n", cpu);
> > 				else
> > 					dump_cpu_task(cpu);
> > 			}
> > -			raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> > 		}
> > 
> > Or should this be considered an arm64 issue, meaning arm64 should
> > either switch to true NMIs or stop using
> > nmi_trigger_cpumask_backtrace()?
> 
> Thank you for looking into this!
> 
> We do assume that nmi_trigger_cpumask_backtrace() uses true NMIs, so,
> yes, nmi_trigger_cpumask_backtrace() should use true NMIs, just like
> the name says.  ;-)
In the comments of the following patch, the arm64 maintainers took a
different view of nmi_trigger_cpumask_backtrace(). I'm a bit confused
and unsure which perspective is more reasonable.

https://lore.kernel.org/all/20230906090246.v13.4.Ie6c132b96ebbbcddbf6954b9469ed40a6960343c@changeid/

> /*
>  * NOTE: though nmi_trigger_cpumask_backtrace() has "nmi_" in the name,
>  * nothing about it truly needs to be implemented using an NMI, it's
>  * just that it's _allowed_ to work with NMIs. If ipi_should_be_nmi()
>  * returned false our backtrace attempt will just use a regular IPI.
>  */

> Alternatively, arm64 could continue using nmi_trigger_cpumask_backtrace()
> with normal interrupts (for example, on SoCs not implementing true NMIs),
> but have a short timeout (maybe a few jiffies?) after which it returns
> false (and presumably also cancels the backtrace request so that when
> the non-NMI interrupt eventually does happen, its handler simply returns
> without backtracing).  This should be implemented using atomics to avoid
> deadlock issues.  This alternative approach would provide accurate arm64
> backtraces in the common case where interrupts are enabled, but allow
> a graceful fallback to remote tracing otherwise.
> 
> Would you be interested in working on this issue, whatever solution the
> arm64 maintainers end up preferring?
> 

The 10-second timeout is hard-coded in nmi_trigger_cpumask_backtrace(),
which is shared code, not architecture-specific. I haven't yet worked
out a feasible solution, so I have also CC'd the authors of the
aforementioned patch to see if they have any other ideas.
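
To make the discussion concrete, here is a rough, untested sketch of
the short-timeout fallback Paul describes. The function name, the
~10ms budget, and the shared "pending" mask are assumptions made up
for illustration, not existing kernel APIs. The idea is that the
requester and the (possibly non-NMI) IPI handler race to clear a CPU's
bit with cpumask_test_and_clear_cpu(), so a handler that runs too late
finds its bit already gone and returns without dumping:

#include <linux/cpumask.h>
#include <linux/delay.h>
#include <linux/printk.h>

/* Sketch only -- not the actual lib/nmi_backtrace.c code. */
static bool backtrace_with_short_timeout(cpumask_t *pending,
					 void (*raise)(cpumask_t *mask))
{
	bool all_done;
	int cpu, i;

	raise(pending);		/* send the (possibly non-NMI) IPIs */

	/* Wait ~10ms instead of the current 10 seconds. */
	for (i = 0; i < 100; i++) {
		if (cpumask_empty(pending))
			break;
		udelay(100);
	}

	all_done = cpumask_empty(pending);

	/*
	 * Atomically revoke any requests that were not picked up in
	 * time.  A late handler would do
	 * cpumask_test_and_clear_cpu(smp_processor_id(), pending)
	 * before dumping, so whichever side clears the bit first owns
	 * it: no deadlock, and no double dump.
	 */
	for_each_cpu(cpu, pending)
		if (cpumask_test_and_clear_cpu(cpu, pending))
			pr_warn("CPU %d did not respond to backtrace request\n", cpu);

	return all_done;
}

On a false return, the caller could fall back to dump_cpu_task() for
the CPUs that never responded, similar to how dump_cpu_task() already
falls back to a remote task dump when trigger_single_cpu_backtrace()
fails.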

Regarding the RCU stall warning, I think the purpose of acquiring
rnp->lock is to protect the rnp->qsmask variable rather than the
dump_cpu_task() operation, right?
If so, there is no need to call dump_cpu_task() while holding the
lock.
While holding the spinlock, we can record the CPUs that need to be
dumped in a cpumask, and then dump them all at once after releasing
the lock.
Here is my temporary solution, used locally, based on kernel-6.11:

+	cpumask_var_t mask;
+	bool mask_ok;

+	mask_ok = zalloc_cpumask_var(&mask, GFP_ATOMIC);
 	rcu_for_each_leaf_node(rnp) {
 		raw_spin_lock_irqsave_rcu_node(rnp, flags);
 		for_each_leaf_node_possible_cpu(rnp, cpu)
 			if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) {
 				if (cpu_is_offline(cpu))
 					pr_err("Offline CPU %d blocking current GP.\n", cpu);
+				else if (mask_ok)
+					cpumask_set_cpu(cpu, mask);
 				else
 					dump_cpu_task(cpu);
 			}
 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 	}
+	if (mask_ok) {
+		if (!trigger_cpumask_backtrace(mask)) {
+			for_each_cpu(cpu, mask)
+				dump_cpu_task(cpu);
+		}
+		free_cpumask_var(mask);
+	}
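
Note that trigger_cpumask_backtrace() returns false when the
architecture does not provide arch_trigger_cpumask_backtrace(), in
which case the fallback loop above dumps each CPU with
dump_cpu_task() outside the lock; and if zalloc_cpumask_var() fails,
the code simply degrades to the old behavior of dumping under the
lock.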

After applying this, I haven't encountered the lockup issue for five
days, whereas it used to occur about once a day.

>                                                         Thanx, Paul
