On 6/3/2025 11:37 PM, Qi Xi wrote:
> Hi Joel,
>
> The patch works as expected. Previously, the issue triggered a soft lockup
> within ~10 minutes, but after applying the fix, the system ran stably for 30+
> minutes without any issues.
Great to hear! Thanks for testing. We will roll this into a proper patch and
provide it once we have something ready.

 - Joel

>
> Thanks,
> Qi
>
> On 2025/6/4 11:25, Xiongfeng Wang wrote:
>> On 2025/6/4 9:35, Joel Fernandes wrote:
>>> On Tue, Jun 03, 2025 at 03:22:42PM -0400, Joel Fernandes wrote:
>>>> On 6/3/2025 3:03 PM, Joel Fernandes wrote:
>>>>> On 6/3/2025 2:59 PM, Joel Fernandes wrote:
>>>>>> On Fri, May 30, 2025 at 09:55:45AM +0800, Xiongfeng Wang wrote:
>>>>>>> Hi Joel,
>>>>>>>
>>>>>>> On 2025/5/29 0:30, Joel Fernandes wrote:
>>>>>>>> On Wed, May 21, 2025 at 5:43 AM Xiongfeng Wang
>>>>>>>> <wangxiongfe...@huawei.com> wrote:
>>>>>>>>> Hi RCU experts,
>>>>>>>>>
>>>>>>>>> When I ran syzkaller on Linux 6.6 with CONFIG_PREEMPT_RCU enabled,
>>>>>>>>> I got the following soft lockup. The call trace is too long, so I
>>>>>>>>> put it at the end. The issue can also be reproduced on the latest
>>>>>>>>> kernel.
>>>>>>>>>
>>>>>>>>> The issue is as follows: CPU3 is waiting for a spin_lock, which is
>>>>>>>>> held by CPU1, but CPU1 is stuck in the following dead loop.
>>>>>>>>>
>>>>>>>>> irq_exit()
>>>>>>>>>   __irq_exit_rcu()
>>>>>>>>>     /* in_hardirq() returns false after this */
>>>>>>>>>     preempt_count_sub(HARDIRQ_OFFSET)
>>>>>>>>>     tick_irq_exit()
>>>>>>>>>       tick_nohz_irq_exit()
>>>>>>>>>         tick_nohz_stop_sched_tick()
>>>>>>>>>           trace_tick_stop() /* a bpf prog is hooked on this trace point */
>>>>>>>>>             __bpf_trace_tick_stop()
>>>>>>>>>               bpf_trace_run2()
>>>>>>>>>                 rcu_read_unlock_special()
>>>>>>>>>                   /* will send an IPI to itself */
>>>>>>>>>                   irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
>>>>>>>>>
>>>>>>>>> /* after interrupts are enabled again, the irq_work is run */
>>>>>>>>> asm_sysvec_irq_work()
>>>>>>>>>   sysvec_irq_work()
>>>>>>>>>     irq_exit() /* after handling the irq_work, we enter irq_exit() again */
>>>>>>>>>       __irq_exit_rcu()
>>>>>>>>>         ...skip...
>>>>>>>>>         /* we queue an irq_work again, and enter a dead loop */
>>>>>>>>>         irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
>>>
>>> The following is a candidate fix (among other fixes being
>>> considered/discussed). The change is to check whether context tracking
>>> thinks we're in an IRQ and, if so, avoid the irq_work. IMO, this should
>>> be rare enough that it shouldn't be an issue, and it is dangerous to
>>> self-IPI repeatedly while we're exiting an IRQ anyway.
>>>
>>> Thoughts? Xiongfeng, do you want to try it?
>>
>> Thanks a lot for the fast response. My colleague is testing the
>> modification. She will feed back the result.
>>
>> Thanks,
>> Xiongfeng
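[The shape of the loop quoted above can be modeled in a few lines of
stand-alone C. This is only a sketch: the mock_* names are hypothetical
stand-ins, not kernel code, and a small cap is added purely so the demo
terminates; the kernel has no such cap, hence the lockup.]

#include <stdbool.h>
#include <stdio.h>

static bool defer_qs_iw_pending;
static unsigned long self_ipis;

static void mock_irq_exit(void);

/* Stand-in for rcu_read_unlock_special() queueing the deferred-QS IPI. */
static void mock_rcu_read_unlock_special(void)
{
	if (!defer_qs_iw_pending) {
		defer_qs_iw_pending = true;
		self_ipis++;
	}
}

/* Stand-in for the irq_work handler: clears the flag, then exits the IRQ. */
static void mock_irq_work_run(void)
{
	defer_qs_iw_pending = false;
	mock_irq_exit();	/* handling the IPI re-enters the same exit path */
}

/* Stand-in for irq_exit() -> tick_irq_exit() -> bpf -> unlock_special(). */
static void mock_irq_exit(void)
{
	mock_rcu_read_unlock_special();
	if (defer_qs_iw_pending && self_ipis < 10)	/* cap: demo only */
		mock_irq_work_run();
}

int main(void)
{
	mock_irq_exit();	/* the first IRQ exit kicks off the cycle */
	printf("self-IPIs sent: %lu (unbounded without the cap)\n", self_ipis);
	return 0;
}

[This prints "self-IPIs sent: 10"; removing the cap recurses without bound,
the user-space analogue of the endless self-IPI on IRQ exit.]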
>>
>>> Btw, I could easily reproduce it as a boot hang by doing:
>>>
>>> --- a/kernel/softirq.c
>>> +++ b/kernel/softirq.c
>>> @@ -638,6 +638,10 @@ void irq_enter(void)
>>>
>>>  static inline void tick_irq_exit(void)
>>>  {
>>> +	rcu_read_lock();
>>> +	WRITE_ONCE(current->rcu_read_unlock_special.b.need_qs, true);
>>> +	rcu_read_unlock();
>>> +
>>>  #ifdef CONFIG_NO_HZ_COMMON
>>>  	int cpu = smp_processor_id();
>>>
>>> ---8<-----------------------
>>>
>>> From: Joel Fernandes <joelagn...@nvidia.com>
>>> Subject: [PATCH] Do not schedule irq_work when IRQ is exiting
>>>
>>> Signed-off-by: Joel Fernandes <joelagn...@nvidia.com>
>>> ---
>>>  include/linux/context_tracking_irq.h |  2 ++
>>>  kernel/context_tracking.c            | 12 ++++++++++++
>>>  kernel/rcu/tree_plugin.h             |  3 ++-
>>>  3 files changed, 16 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/context_tracking_irq.h b/include/linux/context_tracking_irq.h
>>> index 197916ee91a4..35a5ad971514 100644
>>> --- a/include/linux/context_tracking_irq.h
>>> +++ b/include/linux/context_tracking_irq.h
>>> @@ -9,6 +9,7 @@ void ct_irq_enter_irqson(void);
>>>  void ct_irq_exit_irqson(void);
>>>  void ct_nmi_enter(void);
>>>  void ct_nmi_exit(void);
>>> +bool ct_in_irq(void);
>>>  #else
>>>  static __always_inline void ct_irq_enter(void) { }
>>>  static __always_inline void ct_irq_exit(void) { }
>>> @@ -16,6 +17,7 @@ static inline void ct_irq_enter_irqson(void) { }
>>>  static inline void ct_irq_exit_irqson(void) { }
>>>  static __always_inline void ct_nmi_enter(void) { }
>>>  static __always_inline void ct_nmi_exit(void) { }
>>> +static inline bool ct_in_irq(void) { return false; }
>>>  #endif
>>>
>>>  #endif
>>> diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
>>> index fb5be6e9b423..8e8055cf04af 100644
>>> --- a/kernel/context_tracking.c
>>> +++ b/kernel/context_tracking.c
>>> @@ -392,6 +392,18 @@ noinstr void ct_irq_exit(void)
>>>  	ct_nmi_exit();
>>>  }
>>>
>>> +/**
>>> + * ct_in_irq - check if CPU is currently in a tracked IRQ context.
>>> + *
>>> + * Returns true if ct_irq_enter() has been called and ct_irq_exit()
>>> + * has not yet been called. This indicates the CPU is currently
>>> + * processing an interrupt.
>>> + */
>>> +bool ct_in_irq(void)
>>> +{
>>> +	return ct_nmi_nesting() != 0;
>>> +}
>>> +
>>>  /*
>>>   * Wrapper for ct_irq_enter() where interrupts are enabled.
>>>   *
>>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>>> index 3c0bbbbb686f..a3eebd4c841e 100644
>>> --- a/kernel/rcu/tree_plugin.h
>>> +++ b/kernel/rcu/tree_plugin.h
>>> @@ -673,7 +673,8 @@ static void rcu_read_unlock_special(struct task_struct *t)
>>>  		set_tsk_need_resched(current);
>>>  		set_preempt_need_resched();
>>>  		if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
>>> -		    expboost && !rdp->defer_qs_iw_pending && cpu_online(rdp->cpu)) {
>>> +		    expboost && !rdp->defer_qs_iw_pending && cpu_online(rdp->cpu) &&
>>> +		    !ct_in_irq()) {
>>>  			// Get scheduler to re-evaluate and call hooks.
>>>  			// If !IRQ_WORK, FQS scan will eventually IPI.
>>>  			if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) &&
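[Why the proposed guard breaks the cycle can likewise be sketched in
stand-alone C, assuming, as the patch relies on, that ct_in_irq() reports a
nonzero context-tracking nesting count. The key ordering is that irq_exit()
reaches tick_irq_exit() via __irq_exit_rcu() before it calls ct_irq_exit(),
so the count is still nonzero at the point where the self-IPI would be
queued. The mock_* names are hypothetical stand-ins, not kernel code.]

#include <stdbool.h>
#include <stdio.h>

static long ct_nesting;			/* stand-in for ct_nmi_nesting() */
static bool defer_qs_iw_pending;
static unsigned long self_ipis;

/* Stand-in for the proposed ct_in_irq(). */
static bool mock_ct_in_irq(void)
{
	return ct_nesting != 0;
}

/* The self-IPI decision with the patch's added guard. */
static void mock_rcu_read_unlock_special(void)
{
	if (!defer_qs_iw_pending && !mock_ct_in_irq()) {
		defer_qs_iw_pending = true;
		self_ipis++;
	}
}

static void mock_irq_exit(void)
{
	/*
	 * tick_irq_exit() is reached from __irq_exit_rcu(), before
	 * ct_irq_exit() runs, so the nesting count is still nonzero
	 * here and the guard suppresses the self-IPI.
	 */
	mock_rcu_read_unlock_special();
	ct_nesting--;			/* ct_irq_exit() */
}

int main(void)
{
	ct_nesting++;			/* ct_irq_enter() at IRQ entry */
	mock_irq_exit();
	printf("self-IPIs sent: %lu\n", self_ipis);	/* 0: cycle broken */
	return 0;
}

[The queueing path is reached exactly where it would be in tick_irq_exit(),
but the guard sees a nonzero nesting count and declines to queue, so the
program reports zero self-IPIs. The irq_work is simply deferred until a
later rcu_read_unlock_special() that runs outside IRQ exit, or until the
FQS scan IPIs the CPU.]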