On Fri, Nov 01, 2024 at 07:41:27AM +0000, Cheng-Jui Wang (王正睿) wrote:
> On Wed, 2024-10-30 at 06:54 -0700, Paul E. McKenney wrote:
> > > > Alternatively, arm64 could continue using nmi_trigger_cpumask_backtrace()
> > > > with normal interrupts (for example, on SoCs not implementing true NMIs),
> > > > but have a short timeout (maybe a few jiffies?) after which it returns
> > > > false (and presumably also cancels the backtrace request so that when
> > > > the non-NMI interrupt eventually does happen, its handler simply returns
> > > > without backtracing).  This should be implemented using atomics to avoid
> > > > deadlock issues.  This alternative approach would provide accurate arm64
> > > > backtraces in the common case where interrupts are enabled, but allow
> > > > a graceful fallback to remote tracing otherwise.
> > > > 
> > > > Would you be interested in working this issue, whatever solution the
> > > > arm64 maintainers end up preferring?
> > > 
> > > The 10-second timeout is hard-coded in nmi_trigger_cpumask_backtrace().
> > > It is shared code and not architecture-specific. Currently, I haven't
> > > thought of a feasible solution. I have also CC'd the authors of the
> > > aforementioned patch to see if they have any other ideas.
> > 
> > It should be possible for arm64 to have an architecture-specific hook
> > that enables them to use a much shorter timeout.  Or, to eventually
> > switch to real NMIs.
> 
> There is already another thread discussing the timeout issue, but I
> still have some questions about RCU. To avoid mixing the discussions, I
> am starting this separate thread to discuss RCU.
> 
> > > Regarding the RCU stall warning, I think the purpose of acquiring
> > > `rnp->lock` is to protect the rnp->qsmask variable rather than to protect
> > > the `dump_cpu_task()` operation, right?
> > 
> > As noted below, it is also to prevent false-positive stack dumps.
> > 
> > > Therefore, there is no need to call dump_cpu_task() while holding the
> > > lock.
> > > When holding the spinlock, we can store the CPUs that need to be dumped
> > > into a cpumask, and then dump them all at once after releasing the
> > > lock.
> > > Here is my temporary solution used locally based on kernel-6.11.
> > > 
> > > +     cpumask_var_t mask;
> > > +     bool mask_ok;
> > > 
> > > +     mask_ok = zalloc_cpumask_var(&mask, GFP_ATOMIC);
> > >       rcu_for_each_leaf_node(rnp) {
> > >               raw_spin_lock_irqsave_rcu_node(rnp, flags);
> > >               for_each_leaf_node_possible_cpu(rnp, cpu)
> > >                       if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) {
> > >                               if (cpu_is_offline(cpu))
> > >                                       pr_err("Offline CPU %d blocking current GP.\n", cpu);
> > > +                             else if (mask_ok)
> > > +                                     cpumask_set_cpu(cpu, mask);
> > >                               else
> > >                                       dump_cpu_task(cpu);
> > >                       }
> > >               raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> > >       }
> > > +     if (mask_ok) {
> > > +             if (!trigger_cpumask_backtrace(mask)) {
> > > +                     for_each_cpu(cpu, mask)
> > > +                             dump_cpu_task(cpu);
> > > +             }
> > > +             free_cpumask_var(mask);
> > > +     }
> > > 
> > > After applying this, I haven't encountered the lockup issue for five
> > > days, whereas it used to occur about once a day.
> > 
> > We used to do it this way, and the reason that we changed was to avoid
> > false-positive (and very confusing) stack dumps in the surprisingly
> > common case where the act of dumping the first stack caused the stalled
> > grace period to end.
> > 
> > So sorry, but we really cannot go back to doing it that way.
> > 
> >                                                         Thanx, Paul
> 
> Let me clarify: the reason for the issue mentioned above is that it
> pre-determines all the CPUs to be dumped before starting the dump
> process. Then, dumping the first stack caused the stalled grace period
> to end. Subsequently, many CPUs that do not need to be dumped (false
> positives) are dumped.
> 
> So, to prevent false positives, it should be about excluding those CPUs
> that do not need to be dumped, right? Therefore, the action that truly helps
> is actually "releasing the lock after each dump (allowing other CPUs to
> update qsmask) and rechecking (gp_seq and qsmask) to confirm whether to
> continue dumping".
> 
> I think holding the lock while dumping CPUs does not help prevent false
> positives; it only blocks those CPUs waiting for the lock (e.g., CPUs
> about to report qs). For CPUs that do not interact with this lock,
> holding it should not have any impact. Did I miss anything?

Yes.

The stalled CPU could unstall itself just after the lock was released,
so that the stack dump would be from some random irrelevant and confusing
point in the code.  This would not be a good thing.  In contrast, with the
lock held, the stalled CPU cannot fully exit its RCU read-side critical
section, so that the dump has at least some relevance.
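
As a rough illustration of that last point, here is a toy user-space C11
model, not kernel code.  Everything in it is made up for this sketch
(struct toy_node, the atomic_flag standing in for rnp->lock, and the
helper names), and it assumes that a CPU must acquire the node lock
before it can clear its own ->qsmask bit:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for a leaf rcu_node: a lock plus a mask of blocking CPUs. */
struct toy_node {
	atomic_flag lock;	/* hypothetical stand-in for rnp->lock */
	unsigned long qsmask;	/* bit set: that CPU still blocks the GP */
};

/* Returns true if the lock was acquired, false if it was already held. */
static inline bool toy_trylock(struct toy_node *n)
{
	return !atomic_flag_test_and_set_explicit(&n->lock,
						  memory_order_acquire);
}

static inline void toy_unlock(struct toy_node *n)
{
	atomic_flag_clear_explicit(&n->lock, memory_order_release);
}

/* Stalled CPU's side: clearing its bit requires the lock (assumption). */
static inline bool try_report_qs(struct toy_node *n, int cpu)
{
	if (!toy_trylock(n))
		return false;	/* Dumper holds the lock: still blocking. */
	n->qsmask &= ~(1UL << cpu);
	toy_unlock(n);
	return true;
}

/* Dumper's side: meaningful only while the dumper holds the lock. */
static inline bool cpu_still_blocking(const struct toy_node *n, int cpu)
{
	return !!(n->qsmask & (1UL << cpu));
}
```

In this model, while the dumper holds the lock, try_report_qs() fails
and the CPU's bit stays set, so a dump taken at that point is of a CPU
that is still blocking the grace period.  Once the lock is dropped, the
bit can be cleared at any time and the same check no longer says
anything about what a later dump would show.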

Let's please instead confine the change to architecture-specific code
that chooses to use interrupts instead of NMIs, as suggested in my
previous email.  If there is more than one such architecture (arm64 and
arm32?), they can of course share code, if appropriate.
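
For whatever it is worth, the cancel-on-timeout handshake might look
something like the following user-space C11 sketch.  This is only a
model under stated assumptions: the bt_* names and the state values are
hypothetical, the poll count stands in for a timeout of a few jiffies,
and none of this is existing kernel API.  The requester posts a request,
waits briefly, and then tries to withdraw it with a compare-and-exchange.
The interrupt handler claims the request the same way, so exactly one
side wins and a late handler simply returns without backtracing:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical request states for one target CPU. */
enum bt_state { BT_IDLE, BT_REQUESTED, BT_DUMPING, BT_DONE };

struct bt_request {
	_Atomic int state;
};

/* Requester: ask the target CPU to dump its own stack. */
static inline void bt_post(struct bt_request *r)
{
	atomic_store(&r->state, BT_REQUESTED);
}

/* Handler: returns true only if the request is still pending. */
static inline bool bt_handler_claim(struct bt_request *r)
{
	int expected = BT_REQUESTED;

	return atomic_compare_exchange_strong(&r->state, &expected,
					      BT_DUMPING);
}

/* Handler: mark the dump as finished. */
static inline void bt_handler_done(struct bt_request *r)
{
	atomic_store(&r->state, BT_DONE);
}

/* Requester: withdraw the request; fails if the handler claimed it. */
static inline bool bt_cancel(struct bt_request *r)
{
	int expected = BT_REQUESTED;

	return atomic_compare_exchange_strong(&r->state, &expected, BT_IDLE);
}

/*
 * Requester: poll up to max_polls times for completion, then try to
 * cancel.  Returns false if the request was cancelled, in which case
 * the caller would fall back to remote tracing.  Returns true if the
 * target finished, or at least started, dumping its own stack.
 */
static inline bool bt_wait_or_cancel(struct bt_request *r, int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		if (atomic_load(&r->state) == BT_DONE)
			return true;
	}
	return !bt_cancel(r);
}
```

Because both transitions out of BT_REQUESTED are compare-and-exchange
operations, no lock is needed, which is what avoids the deadlock issues.
What the requester should do when the handler has claimed the request
but not yet finished is a design choice that this sketch leaves open.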

                                                        Thanx, Paul
