On Fri, Sep 15, 2017 at 04:44:38PM +0530, Neeraj Upadhyay wrote:
> Hi,
>
> We have one query regarding the behavior of RCU expedited grace period,
> for scenario where resched_cpu() in sync_sched_exp_handler() fails to
> acquire the rq lock and returns w/o setting the need_resched. In this
> case, how do we ensure that the CPU notify rcu about the
> end of sched grace period (schedule() -> __schedule() ->
> rcu_note_context_switch(cpu) -> rcu_sched_qs()) , for cases where tick
> is stopped on that CPU. Is it implied from the rq lock acquisition
> failure, that the owner of the rq lock will enforce context switch?
> For which scenarios in RCU paths (as the function is used only in RCU
> code), we need trylock check in resched_cpu()?
>
> void resched_cpu(int cpu)
> {
>         struct rq *rq = cpu_rq(cpu);
>         unsigned long flags;
>
>         if (!raw_spin_trylock_irqsave(&rq->lock, flags))
>                 return;
>         resched_curr(rq);
>         raw_spin_unlock_irqrestore(&rq->lock, flags);
> }
>
>
> This issue was observed in below scenario, where one of the CPUs (CPU1)
> started synchronize_sched_expedited and sent IPI to CPU5, which is in
> the idle path but handled sync_sched_exp_handler() IPI before
> rcu_idle_enter().
> As resched_cpu() failed to acquire the rq lock, need_resched was not set,
> and CPU went to idle; resulting in expedited stall getting reported
> by CPU1.
>
> Below is the scenario:
>
> • CPU1 is waiting for expedited wait to complete:
>   sync_rcu_exp_select_cpus
>       rdp->exp_dynticks_snap & 0x1   // returns 1 for CPU5
>       IPI sent to CPU5
>
>   synchronize_sched_expedited_wait
>       ret = swait_event_timeout(
>               rsp->expedited_wq,
>               sync_rcu_preempt_exp_done(rnp_root),
>               jiffies_stall);
>
>   expmask = 0x20 , and CPU 5 is in idle path (in cpuidle_enter())
>
>
>
> • CPU5 handles IPI and fails to acquire rq lock.
>
>   Handles IPI
>       sync_sched_exp_handler
>           resched_cpu
>               returns while failing to try lock acquire rq->lock
>               need_resched is not set
>
> • CPU5 calls rcu_idle_enter() and as need_resched is not set, goes to
>   idle (schedule() is not called).
>
> • CPU 1 reports RCU stall.

Good catch and good detective work!!!

I will be working on a fix this week, hopefully involving resched_cpu()
getting a return value so that I can track who needs a later retry.

							Thanx, Paul
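
For illustration only, here is a rough userspace sketch of the return-value
idea mentioned above. It is not the actual kernel fix. It assumes a
pthread mutex as a stand-in for rq->lock and a plain flag as a stand-in
for resched_curr(); the names resched_cpu_try() and resched_cpus() and
the cut-down struct rq are hypothetical and exist only for this example.
The point is just that a failed trylock is reported to the caller, which
can then record the CPU in a mask and retry it later instead of silently
dropping the request.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the kernel's struct rq. */
struct rq {
	pthread_mutex_t lock;	/* stands in for rq->lock */
	bool need_resched;	/* stands in for the effect of resched_curr() */
};

/*
 * Sketch of resched_cpu() with a return value (an assumption, not the
 * real kernel signature): true if need_resched was set, false if the
 * lock was contended and the caller must retry later.
 */
static bool resched_cpu_try(struct rq *rq)
{
	if (pthread_mutex_trylock(&rq->lock) != 0)
		return false;		/* lock held elsewhere, nothing was set */
	rq->need_resched = true;
	pthread_mutex_unlock(&rq->lock);
	return true;
}

/*
 * Hypothetical caller: try every CPU whose bit is set in mask and
 * return the mask of CPUs that still need a later retry.
 */
static unsigned long resched_cpus(struct rq *rqs, int ncpus, unsigned long mask)
{
	unsigned long retry = 0;
	int cpu;

	assert(ncpus <= (int)(8 * sizeof(unsigned long)));
	for (cpu = 0; cpu < ncpus; cpu++) {
		if (!(mask & (1UL << cpu)))
			continue;
		if (!resched_cpu_try(&rqs[cpu]))
			retry |= 1UL << cpu;	/* remember who needs a retry */
	}
	return retry;
}
```

With expmask = 0x20 as in the report, a contended lock on CPU 5 would
leave bit 5 set in the returned mask, so the waiter could call
resched_cpus() again with that mask rather than waiting for a stall.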