Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

Paul E. McKenney Sat, 10 Aug 2013 08:09:07 -0700

On Sat, Aug 10, 2013 at 11:43:59AM +0800, Lai Jiangshan wrote:
> Hi, Steven
> 
> I was considering rtmutex's lock->wait_lock is a scheduler lock,
> But it is not, and it is just a spinlock of process context.
> I hope you change it to a spinlock of irq context.
> 
> 1) it causes rcu read site more deadlockable, example:
> x is a spinlock of softirq context.
> 
> CPU1                                  cpu2(rcu boost)
> rcu_read_lock()                               rt_mutext_lock()
> <preemption and reschedule back>      raw_spin_lock(lock->wait_lock)
> spin_lock_bh(x)                               <interrupt and doing softirq 
> after irq>
> rcu_read_unlock()                         do_softirq()
>   rcu_read_unlock_special()
>     rt_mutext_unlock()
>       raw_spin_lock(lock->wait_lock)      spin_lock_bh(x)  DEADLOCK
> 
> This example can happen on any one of these code:
>       without my patchset
>       with my patchset
>       with my patchset but reverted patch2
> 
> 2) why it causes more deadlockable: it extends the range of suspect locks.
> #DEFINE suspect locks: any lock can be (chained) nested in 
> rcu_read_unlock_special().
> 
> So the suspect locks are: rnp->lock, scheduler locks, rtmutex's 
> lock->wait_lock,
> locks in prink()/WARN_ON() and the locks which can be chained/indirectly 
> nested
> in the above locks.
> 
> If the rtmutex's lock->wait_lock is a spinlock of irq context, all suspect 
> locks are
> some spinlocks of irq context.
> 
> If the rtmutex's lock->wait_lock is a spinlock of process context, suspect 
> locks
> will be extended to, all spinlocks of irq context, all spinlocks of softirq 
> context,
> and (all spinlocks of process context but nested in rtmutex's 
> lock->wait_lock).
> 
> We can see from the definition, if rcu_read_unlock_special() is called from
> any suspect lock, it may be deadlock like the example. the rtmutex's 
> lock->wait_lock
> extends the range of suspect locks, it causes more deadlockable.
> 
> 3) How my algorithm works, why smaller range of suspect locks help us.
> Since rcu_read_unlock_special() can't be called from suspect locks context,
> we should defer rcu_read_unlock_special() when in these contexts.
> It is hard to find out current context is suspect locks context or not,
> but we can determine it based on irq/softirq/process context.
> 
> if all suspect locks are some spinlocks of irq context:
>       if (irqs_disabled) /* we may be in suspect locks context */
>               defer rcu_read_unlock_special().
> 
> if all suspect locks are some spinlocks of irq/softirq/process context:
>       if (irqs_disabled || in_atomic()) /* we may be in suspect locks context 
> */
>               defer rcu_read_unlock_special().
> In this case, the deferring becomes large more, I can't accept it.
> So I have to narrow the range of suspect locks. Two choices:
> A) don't call rt_mutex_unlock() from rcu_read_unlock(), only call it
>    from rcu_preempt_not_context_switch(). we need to rework these
>    two functions and it will add complexity to RCU, and it also still
>    adds some probability of deferring.


One advantage of bh-disable locks is that enabling bh checks
TIF_NEED_RESCHED, so that there is no deferring beyond that
needed by bh disable.  The same of course applies to preempt_disable().

So one approach is to defer when rcu_read_unlock_special() is entered
with either preemption or bh disabled.  Your current set_need_resched()
trick would work fine in this case.  Unfortunately, re-enabling interrupts
does -not- check TIF_NEED_RESCHED, which is why we have latency problems
in that case.  (Hence my earlier question about making self-IPI safe
on all arches, which would result in an interrupt as soon as interrupts
were re-enabled.)

Another possibility is to defer only when preemption or bh are disabled
on entry ro rcu_read_unlock_special(), but to retain the current
(admittedly ugly) nesting rules for the scheduler locks.

> B) change rtmutex's lock->wait_lock to irqs-disabled.

I have to defer to Steven on this one.

                                                        Thanx, Paul

> 4) In the view of rtmutex, I think it will be better if ->wait_lock is 
> irqs-disabled.
>    A) like trylock of mutex/rw_sem, we may call rt_mutex_trylock() in irq in 
> future.
>    B) the critical section of ->wait_lock is short,
>       making it irqs-disabled don't hurts responsibility/latency.
>    C) almost all time of the critical section of ->wait_lock is irqs-disabled
>       (due to task->pi_lock), I think converting whole time of the critical 
> section
>       of ->wait_lock to irqs-disabled is OK.
> 
> So I hope you change rtmutex's lock->wait_lock.
> 
> Any feedback from anyone is welcome.
> 
> Thanks,
> Lai
> 
> On 08/09/2013 04:40 AM, Paul E. McKenney wrote:
> > On Wed, Aug 07, 2013 at 06:25:01PM +0800, Lai Jiangshan wrote:
> >> Background)
> >>
> >> Although all articles declare that rcu read site is deadlock-immunity.
> >> It is not true for rcu-preempt, it will be deadlock if rcu read site
> >> overlaps with scheduler lock.
> >>
> >> ec433f0c, 10f39bb1 and 016a8d5b just partially solve it. But rcu read site
> >> is still not deadlock-immunity. And the problem described in 016a8d5b
> >> is still existed(rcu_read_unlock_special() calls wake_up).
> >>
> >> Aim)
> >>
> >> We want to fix the problem forever, we want to keep rcu read site
> >> is deadlock-immunity as books say.
> >>
> >> How)
> >>
> >> The problem is solved by "if rcu_read_unlock_special() is called inside
> >> any lock which can be (chained) nested in rcu_read_unlock_special(),
> >> we defer rcu_read_unlock_special()".
> >> This kind locks include rnp->lock, scheduler locks, perf ctx->lock, locks
> >> in printk()/WARN_ON() and all locks nested in these locks or chained nested
> >> in these locks.
> >>
> >> The problem is reduced to "how to distinguish all these locks(context)",
> >> We don't distinguish all these locks, we know that all these locks
> >> should be nested in local_irqs_disable().
> >>
> >> we just consider if rcu_read_unlock_special() is called in irqs-disabled
> >> context, it may be called in these suspect locks, we should defer
> >> rcu_read_unlock_special().
> >>
> >> The algorithm enlarges the probability of deferring, but the probability
> >> is still very very low.
> >>
> >> Deferring does add a small overhead, but it offers us:
> >>    1) really deadlock-immunity for rcu read site
> >>    2) remove the overhead of the irq-work(250 times per second in avg.)
> > 
> > One problem here -- it may take quite some time for a set_need_resched()
> > to take effect.  This is especially a problem for RCU priority boosting,
> > but can also needlessly delay preemptible-RCU grace periods because
> > local_irq_restore() and friends don't check the TIF_NEED_RESCHED bit.
> > 
> > OK, alternatives...
> > 
> > o   Keep the current rule saying that if the scheduler is going
> >     to exit an RCU read-side critical section while holding
> >     one of its spinlocks, preemption has to have been disabled
> >     throughout the full duration of that critical section.
> >     Well, we can certainly do this, but it would be nice to get
> >     rid of this rule.
> > 
> > o   Use per-CPU variables, possibly injecting delay.  This has ugly
> >     disadvantages as noted above.
> > 
> > o   irq_work_queue() can wait a jiffy (or on some architectures,
> >     quite a bit longer) before actually doing anything.
> > 
> > o   raise_softirq() is more immediate and is an easy change, but
> >     adds a softirq vector -- which people are really trying to
> >     get rid of.  Also, wakeup_softirqd() calls things that acquire
> >     the scheduler locks, which is exactly what we were trying to
> >     avoid doing.
> > 
> > o   invoke_rcu_core() can invoke raise_softirq() as above.
> > 
> > o   IPI to self.  From what I can see, not all architectures
> >     support this.  Easy to fake if you have at least two CPUs,
> >     but not so good from an OS jitter viewpoint...
> > 
> > o   Add a check to local_irq_disable() and friends.  I would guess
> >     that this suggestion would not make architecture maintainers
> >     happy.
> > 
> > Other thoughts?
> > 
> >                                                     Thanx, Paul
> > 
> >> Signed-off-by: Lai Jiangshan <[email protected]>
> >> ---
> >>  include/linux/rcupdate.h |    2 +-
> >>  kernel/rcupdate.c        |    2 +-
> >>  kernel/rcutree_plugin.h  |   47 
> >> +++++++++++++++++++++++++++++++++++++++++----
> >>  3 files changed, 44 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> >> index 4b14bdc..00b4220 100644
> >> --- a/include/linux/rcupdate.h
> >> +++ b/include/linux/rcupdate.h
> >> @@ -180,7 +180,7 @@ extern void synchronize_sched(void);
> >>
> >>  extern void __rcu_read_lock(void);
> >>  extern void __rcu_read_unlock(void);
> >> -extern void rcu_read_unlock_special(struct task_struct *t);
> >> +extern void rcu_read_unlock_special(struct task_struct *t, bool unlock);
> >>  void synchronize_rcu(void);
> >>
> >>  /*
> >> diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
> >> index cce6ba8..33b89a3 100644
> >> --- a/kernel/rcupdate.c
> >> +++ b/kernel/rcupdate.c
> >> @@ -90,7 +90,7 @@ void __rcu_read_unlock(void)
> >>  #endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
> >>            barrier();  /* assign before ->rcu_read_unlock_special load */
> >>            if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
> >> -                  rcu_read_unlock_special(t);
> >> +                  rcu_read_unlock_special(t, true);
> >>            barrier();  /* ->rcu_read_unlock_special load before assign */
> >>            t->rcu_read_lock_nesting = 0;
> >>    }
> >> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> >> index fc8b36f..997b424 100644
> >> --- a/kernel/rcutree_plugin.h
> >> +++ b/kernel/rcutree_plugin.h
> >> @@ -242,15 +242,16 @@ static void rcu_preempt_note_context_switch(int cpu)
> >>                                   ? rnp->gpnum
> >>                                   : rnp->gpnum + 1);
> >>            raw_spin_unlock_irqrestore(&rnp->lock, flags);
> >> -  } else if (t->rcu_read_lock_nesting < 0 &&
> >> -             !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN) &&
> >> -             t->rcu_read_unlock_special) {
> >> +  } else if (t->rcu_read_lock_nesting == 0 ||
> >> +             (t->rcu_read_lock_nesting < 0 &&
> >> +             !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN))) {
> >>
> >>            /*
> >>             * Complete exit from RCU read-side critical section on
> >>             * behalf of preempted instance of __rcu_read_unlock().
> >>             */
> >> -          rcu_read_unlock_special(t);
> >> +          if (t->rcu_read_unlock_special)
> >> +                  rcu_read_unlock_special(t, false);
> >>    }
> >>
> >>    /*
> >> @@ -333,7 +334,7 @@ static struct list_head *rcu_next_node_entry(struct 
> >> task_struct *t,
> >>   * notify RCU core processing or task having blocked during the RCU
> >>   * read-side critical section.
> >>   */
> >> -void rcu_read_unlock_special(struct task_struct *t)
> >> +void rcu_read_unlock_special(struct task_struct *t, bool unlock)
> >>  {
> >>    int empty;
> >>    int empty_exp;
> >> @@ -364,6 +365,42 @@ void rcu_read_unlock_special(struct task_struct *t)
> >>
> >>    /* Clean up if blocked during RCU read-side critical section. */
> >>    if (special & RCU_READ_UNLOCK_BLOCKED) {
> >> +          /*
> >> +           * If rcu read lock overlaps with scheduler lock,
> >> +           * rcu_read_unlock_special() may lead to deadlock:
> >> +           *
> >> +           * rcu_read_lock();
> >> +           * preempt_schedule[_irq]() (when preemption)
> >> +           * scheduler lock; (or some other locks can be (chained) nested
> >> +           *                  in rcu_read_unlock_special()/rnp->lock)
> >> +           * access and check rcu data
> >> +           * rcu_read_unlock();
> >> +           *   rcu_read_unlock_special();
> >> +           *     wake_up();                 DEAD LOCK
> >> +           *
> >> +           * To avoid all these kinds of deadlock, we should quit
> >> +           * rcu_read_unlock_special() here and defer it to
> >> +           * rcu_preempt_note_context_switch() or next outmost
> >> +           * rcu_read_unlock() if we consider this case may happen.
> >> +           *
> >> +           * Although we can't know whether current _special()
> >> +           * is nested in scheduler lock or not. But we know that
> >> +           * irqs are always disabled in this case. so we just quit
> >> +           * and defer it to rcu_preempt_note_context_switch()
> >> +           * when irqs are disabled.
> >> +           *
> >> +           * It means we always defer _special() when it is
> >> +           * nested in irqs disabled context, but
> >> +           *      (special & RCU_READ_UNLOCK_BLOCKED) &&
> >> +           *      irqs_disabled_flags(flags)
> >> +           * is still unlikely to be true.
> >> +           */
> >> +          if (unlikely(unlock && irqs_disabled_flags(flags))) {
> >> +                  set_need_resched();
> >> +                  local_irq_restore(flags);
> >> +                  return;
> >> +          }
> >> +
> >>            t->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_BLOCKED;
> >>
> >>            /*
> >> -- 
> >> 1.7.4.4
> >>
> > 
> > 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

Reply via email to