On Sat, Aug 10, 2013 at 08:07:15AM -0700, Paul E. McKenney wrote:
> On Sat, Aug 10, 2013 at 11:43:59AM +0800, Lai Jiangshan wrote:
> > Hi, Steven
> > 
> > I had been assuming that rtmutex's lock->wait_lock was a scheduler lock,
> > but it is not; it is just a process-context spinlock.
> > I hope you will change it to an irq-context spinlock.
> > 
> > 1) It makes the rcu read site more prone to deadlock.  Example:
> >    x is a softirq-context spinlock.
> > 
> > 	CPU1					CPU2 (rcu boost)
> > 	rcu_read_lock()				rt_mutex_lock()
> > 	<preemption and reschedule back>	  raw_spin_lock(lock->wait_lock)
> > 	spin_lock_bh(x)				<interrupt and doing softirq
> > 						 after the irq>
> > 	rcu_read_unlock()			do_softirq()
> > 	  rcu_read_unlock_special()
> > 	    rt_mutex_unlock()
> > 	      raw_spin_lock(lock->wait_lock)	  spin_lock_bh(x)   DEADLOCK
> > 
> > This example can happen with any of these:
> > 	without my patchset
> > 	with my patchset
> > 	with my patchset but with patch 2 reverted
> > 
> > 2) Why it makes deadlock more likely: it extends the range of suspect locks.
> > #DEFINE suspect locks: any lock that can be (chained) nested in
> > rcu_read_unlock_special().
> > 
> > So the suspect locks are: rnp->lock, scheduler locks, rtmutex's
> > lock->wait_lock, locks in printk()/WARN_ON(), and any locks that can be
> > chained or indirectly nested in the above locks.
> > 
> > If rtmutex's lock->wait_lock is an irq-context spinlock, then all suspect
> > locks are irq-context spinlocks.
> > 
> > If rtmutex's lock->wait_lock is a process-context spinlock, the suspect
> > locks are extended to: all irq-context spinlocks, all softirq-context
> > spinlocks, and (all process-context spinlocks nested in rtmutex's
> > lock->wait_lock).
> > 
> > We can see from the definition that if rcu_read_unlock_special() is called
> > while any suspect lock is held, it may deadlock as in the example above.
> > rtmutex's lock->wait_lock extends the range of suspect locks, which makes
> > deadlock more likely.
> > 
> > 3) How my algorithm works, and why a smaller range of suspect locks helps us.
> > Since rcu_read_unlock_special() can't be called from suspect-lock context,
> > we should defer rcu_read_unlock_special() when in these contexts.
> > It is hard to find out whether the current context is a suspect-lock
> > context or not, but we can approximate it based on irq/softirq/process
> > context.
> > 
> > If all suspect locks are irq-context spinlocks:
> > 
> > 	if (irqs_disabled)	/* we may be in suspect-lock context */
> > 		defer rcu_read_unlock_special().
> > 
> > If the suspect locks include irq/softirq/process-context spinlocks:
> > 
> > 	if (irqs_disabled || in_atomic())	/* we may be in suspect-lock context */
> > 		defer rcu_read_unlock_special().
> > 
> > In this case the deferring becomes much more frequent, which I can't
> > accept.  So I have to narrow the range of suspect locks.  Two choices:
> > 
> > A) don't call rt_mutex_unlock() from rcu_read_unlock(), only call it
> >    from rcu_preempt_note_context_switch().  We would need to rework these
> >    two functions; it would add complexity to RCU, and it would still
> >    add some probability of deferring.
> 
> One advantage of bh-disable locks is that enabling bh checks
> TIF_NEED_RESCHED, so that there is no deferring beyond that
> needed by bh disable.  The same of course applies to preempt_disable().
> 
> So one approach is to defer when rcu_read_unlock_special() is entered
> with either preemption or bh disabled.  Your current set_need_resched()
> trick would work fine in this case.
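
As a rough sketch of the test being proposed here (illustrative only; the
helper name is invented and this is not code from any posted patch), the
"entered with preemption or bh disabled" condition can be read straight out
of preempt_count():

#include <linux/preempt.h>
#include <linux/hardirq.h>

/*
 * Illustrative helper (not from the patch): true if rcu_read_unlock_special()
 * was entered with preemption or bh disabled.  In that case the special
 * processing would be deferred via the set_need_resched() trick, and the
 * deferral ends no later than the preempt_enable()/local_bh_enable() that
 * re-enables preemption or bh, since both of those check TIF_NEED_RESCHED.
 */
static inline bool rcu_unlock_entered_preempt_or_bh_disabled(void)
{
	return (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) != 0;
}

The patch at the end of this thread instead keys off irqs_disabled_flags(),
which is exactly the case noted next as not getting prompt rescheduling.
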
> 
> Unfortunately, re-enabling interrupts does -not- check TIF_NEED_RESCHED,
> which is why we have latency problems in that case.  (Hence my earlier
> question about making self-IPI safe on all arches, which would result in
> an interrupt as soon as interrupts were re-enabled.)
> 
> Another possibility is to defer only when preemption or bh are disabled
> on entry to rcu_read_unlock_special(), but to retain the current
> (admittedly ugly) nesting rules for the scheduler locks.
> 
> > B) change rtmutex's lock->wait_lock to irqs-disabled.
> 
> I have to defer to Steven on this one.
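
For reference, option B amounts to taking ->wait_lock with interrupts
disabled throughout.  A minimal sketch of the shape of that conversion
(illustrative only; the real change would touch every ->wait_lock site in
kernel/rtmutex.c, and the function below is made up):

#include <linux/rtmutex.h>
#include <linux/spinlock.h>

/* Hypothetical example of an irq-safe ->wait_lock critical section. */
static void rt_mutex_example_slowpath(struct rt_mutex *lock)
{
	unsigned long flags;

	/* Was: raw_spin_lock(&lock->wait_lock); */
	raw_spin_lock_irqsave(&lock->wait_lock, flags);

	/* ... existing waiter/pi handling would run here unchanged ... */

	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
}

With that, ->wait_lock would be safe to take from the irq-disabled contexts
in which rcu_read_unlock_special() can run, so it would drop out of the
suspect-lock set defined above.
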
C) Remove support for RCU priority boosting.  I am reluctant to do
   this, but...

							Thanx, Paul

> > 4) In the view of rtmutex, I think it would be better if ->wait_lock
> >    were irqs-disabled.
> >    A) Like the trylock of mutex/rw_sem, we may want to call
> >       rt_mutex_trylock() from irq context in the future.
> >    B) The critical section of ->wait_lock is short, so making it
> >       irqs-disabled doesn't hurt responsiveness/latency.
> >    C) Almost all of the ->wait_lock critical section already runs
> >       irqs-disabled (due to task->pi_lock), so I think converting the
> >       whole ->wait_lock critical section to irqs-disabled is OK.
> > 
> > So I hope you change rtmutex's lock->wait_lock.
> > 
> > Any feedback from anyone is welcome.
> > 
> > Thanks,
> > Lai
> > 
> > On 08/09/2013 04:40 AM, Paul E. McKenney wrote:
> > > On Wed, Aug 07, 2013 at 06:25:01PM +0800, Lai Jiangshan wrote:
> > >> Background)
> > >>
> > >> Although all the articles declare that the rcu read site is
> > >> deadlock-immune, this is not true for rcu-preempt: it will deadlock
> > >> if the rcu read site overlaps with a scheduler lock.
> > >>
> > >> ec433f0c, 10f39bb1 and 016a8d5b only partially solve it.  The rcu read
> > >> site is still not deadlock-immune, and the problem described in
> > >> 016a8d5b still exists (rcu_read_unlock_special() calls wake_up()).
> > >>
> > >> Aim)
> > >>
> > >> We want to fix the problem for good; we want to keep the rcu read site
> > >> deadlock-immune, as the books say.
> > >>
> > >> How)
> > >>
> > >> The problem is solved by "if rcu_read_unlock_special() is called inside
> > >> any lock which can be (chained) nested in rcu_read_unlock_special(),
> > >> we defer rcu_read_unlock_special()".
> > >> These locks include rnp->lock, scheduler locks, perf ctx->lock, locks
> > >> in printk()/WARN_ON(), and all locks nested or chained nested in these
> > >> locks.
> > >>
> > >> The problem is reduced to "how to distinguish all these locks (contexts)".
> > >> We don't distinguish all these locks individually; we know that all of
> > >> them must be nested inside local_irq_disable().
> > >>
> > >> So if rcu_read_unlock_special() is called in an irqs-disabled context,
> > >> it may have been called under one of these suspect locks, and we should
> > >> defer rcu_read_unlock_special().
> > >>
> > >> The algorithm enlarges the probability of deferring, but the probability
> > >> is still very, very low.
> > >>
> > >> Deferring does add a small overhead, but it offers us:
> > >>	1) real deadlock-immunity for the rcu read site
> > >>	2) removal of the overhead of the irq-work (250 times per second on average)
> > > 
> > > One problem here -- it may take quite some time for a set_need_resched()
> > > to take effect.  This is especially a problem for RCU priority boosting,
> > > but can also needlessly delay preemptible-RCU grace periods because
> > > local_irq_restore() and friends don't check the TIF_NEED_RESCHED bit.
> > > 
> > > OK, alternatives...
> > > 
> > > o	Keep the current rule saying that if the scheduler is going
> > >	to exit an RCU read-side critical section while holding
> > >	one of its spinlocks, preemption has to have been disabled
> > >	throughout the full duration of that critical section.
> > >	Well, we can certainly do this, but it would be nice to get
> > >	rid of this rule.
> > > 
> > > o	Use per-CPU variables, possibly injecting delay.  This has ugly
> > >	disadvantages as noted above.
> > > 
> > > o	irq_work_queue() can wait a jiffy (or on some architectures,
> > >	quite a bit longer) before actually doing anything.
> > > 
> > > o	raise_softirq() is more immediate and is an easy change, but
> > >	adds a softirq vector -- which people are really trying to
> > >	get rid of.  Also, wakeup_softirqd() calls things that acquire
> > >	the scheduler locks, which is exactly what we were trying to
> > >	avoid doing.
> > > 
> > > o	invoke_rcu_core() can invoke raise_softirq() as above.
> > > 
> > > o	IPI to self.  From what I can see, not all architectures
> > >	support this.  Easy to fake if you have at least two CPUs,
> > >	but not so good from an OS jitter viewpoint...
> > > 
> > > o	Add a check to local_irq_disable() and friends.  I would guess
> > >	that this suggestion would not make architecture maintainers
> > >	happy.
> > > 
> > > Other thoughts?
> > > 
> > >							Thanx, Paul
> > > 
> > >> Signed-off-by: Lai Jiangshan <la...@cn.fujitsu.com>
> > >> ---
> > >>  include/linux/rcupdate.h |    2 +-
> > >>  kernel/rcupdate.c        |    2 +-
> > >>  kernel/rcutree_plugin.h  |   47 +++++++++++++++++++++++++++++++++++++++++----
> > >>  3 files changed, 44 insertions(+), 7 deletions(-)
> > >>
> > >> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > >> index 4b14bdc..00b4220 100644
> > >> --- a/include/linux/rcupdate.h
> > >> +++ b/include/linux/rcupdate.h
> > >> @@ -180,7 +180,7 @@ extern void synchronize_sched(void);
> > >>  
> > >>  extern void __rcu_read_lock(void);
> > >>  extern void __rcu_read_unlock(void);
> > >> -extern void rcu_read_unlock_special(struct task_struct *t);
> > >> +extern void rcu_read_unlock_special(struct task_struct *t, bool unlock);
> > >>  void synchronize_rcu(void);
> > >>  
> > >>  /*
> > >> diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
> > >> index cce6ba8..33b89a3 100644
> > >> --- a/kernel/rcupdate.c
> > >> +++ b/kernel/rcupdate.c
> > >> @@ -90,7 +90,7 @@ void __rcu_read_unlock(void)
> > >>  #endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
> > >>  		barrier();  /* assign before ->rcu_read_unlock_special load */
> > >>  		if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
> > >> -			rcu_read_unlock_special(t);
> > >> +			rcu_read_unlock_special(t, true);
> > >>  		barrier();  /* ->rcu_read_unlock_special load before assign */
> > >>  		t->rcu_read_lock_nesting = 0;
> > >>  	}
> > >> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> > >> index fc8b36f..997b424 100644
> > >> --- a/kernel/rcutree_plugin.h
> > >> +++ b/kernel/rcutree_plugin.h
> > >> @@ -242,15 +242,16 @@ static void rcu_preempt_note_context_switch(int cpu)
> > >>  			       ? rnp->gpnum
> > >>  			       : rnp->gpnum + 1);
> > >>  		raw_spin_unlock_irqrestore(&rnp->lock, flags);
> > >> -	} else if (t->rcu_read_lock_nesting < 0 &&
> > >> -		   !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN) &&
> > >> -		   t->rcu_read_unlock_special) {
> > >> +	} else if (t->rcu_read_lock_nesting == 0 ||
> > >> +		   (t->rcu_read_lock_nesting < 0 &&
> > >> +		    !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN))) {
> > >>  
> > >>  		/*
> > >>  		 * Complete exit from RCU read-side critical section on
> > >>  		 * behalf of preempted instance of __rcu_read_unlock().
> > >>  		 */
> > >> -		rcu_read_unlock_special(t);
> > >> +		if (t->rcu_read_unlock_special)
> > >> +			rcu_read_unlock_special(t, false);
> > >>  	}
> > >>  
> > >>  	/*
> > >> @@ -333,7 +334,7 @@ static struct list_head *rcu_next_node_entry(struct task_struct *t,
> > >>   * notify RCU core processing or task having blocked during the RCU
> > >>   * read-side critical section.
> > >>   */
> > >> -void rcu_read_unlock_special(struct task_struct *t)
> > >> +void rcu_read_unlock_special(struct task_struct *t, bool unlock)
> > >>  {
> > >>  	int empty;
> > >>  	int empty_exp;
> > >> @@ -364,6 +365,42 @@ void rcu_read_unlock_special(struct task_struct *t)
> > >>  
> > >>  	/* Clean up if blocked during RCU read-side critical section. */
> > >>  	if (special & RCU_READ_UNLOCK_BLOCKED) {
> > >> +		/*
> > >> +		 * If rcu read lock overlaps with scheduler lock,
> > >> +		 * rcu_read_unlock_special() may lead to deadlock:
> > >> +		 *
> > >> +		 * rcu_read_lock();
> > >> +		 * preempt_schedule[_irq]() (when preemption)
> > >> +		 * scheduler lock; (or some other locks can be (chained) nested
> > >> +		 *                  in rcu_read_unlock_special()/rnp->lock)
> > >> +		 * access and check rcu data
> > >> +		 * rcu_read_unlock();
> > >> +		 *   rcu_read_unlock_special();
> > >> +		 *     wake_up();  DEAD LOCK
> > >> +		 *
> > >> +		 * To avoid all these kinds of deadlock, we should quit
> > >> +		 * rcu_read_unlock_special() here and defer it to
> > >> +		 * rcu_preempt_note_context_switch() or next outmost
> > >> +		 * rcu_read_unlock() if we consider this case may happen.
> > >> +		 *
> > >> +		 * Although we can't know whether current _special()
> > >> +		 * is nested in scheduler lock or not. But we know that
> > >> +		 * irqs are always disabled in this case. so we just quit
> > >> +		 * and defer it to rcu_preempt_note_context_switch()
> > >> +		 * when irqs are disabled.
> > >> +		 *
> > >> +		 * It means we always defer _special() when it is
> > >> +		 * nested in irqs disabled context, but
> > >> +		 *	(special & RCU_READ_UNLOCK_BLOCKED) &&
> > >> +		 *	irqs_disabled_flags(flags)
> > >> +		 * is still unlikely to be true.
> > >> +		 */
> > >> +		if (unlikely(unlock && irqs_disabled_flags(flags))) {
> > >> +			set_need_resched();
> > >> +			local_irq_restore(flags);
> > >> +			return;
> > >> +		}
> > >> +
> > >>  		t->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_BLOCKED;
> > >>  
> > >>  		/*
> > >> -- 
> > >> 1.7.4.4
> > >>
> > > 
> > 
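
The patch above defers using the narrow test (unlock && irqs_disabled_flags(flags)).
To see why the size of the suspect-lock set matters, the two deferral
conditions contrasted in Lai's point 3 can be sketched roughly as follows
(illustrative helpers, not part of the patch):

#include <linux/hardirq.h>
#include <linux/irqflags.h>

/* Narrow test: the suspect locks are all irq-context spinlocks. */
static inline bool defer_special_irq_only(unsigned long flags)
{
	return irqs_disabled_flags(flags);	/* rarely true */
}

/*
 * Broad test: the suspect locks also include softirq- and process-context
 * spinlocks (e.g. a process-context rtmutex ->wait_lock), so deferral must
 * also trigger whenever preemption or bh is disabled.
 */
static inline bool defer_special_any_atomic(unsigned long flags)
{
	return irqs_disabled_flags(flags) || in_atomic();	/* much more often true */
}

The broad test defers far more often, which is the deferral cost Lai says he
cannot accept, and is why he asks for ->wait_lock to become irqs-disabled so
that the narrow test suffices.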