On 08/09/2013 04:40 AM, Paul E. McKenney wrote:
> On Wed, Aug 07, 2013 at 06:25:01PM +0800, Lai Jiangshan wrote:
>> Background)
>>
>> Although all the documentation declares that RCU read-side critical
>> sections are deadlock-immune, this is not true for rcu-preempt: a
>> deadlock can occur if an RCU read-side critical section overlaps with
>> a scheduler lock.
>>
>> Commits ec433f0c, 10f39bb1 and 016a8d5b only partially solve this; the
>> RCU read site is still not deadlock-immune, and the problem described
>> in 016a8d5b still exists (rcu_read_unlock_special() calls wake_up()).
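>>
>> For illustration, the problematic pattern looks roughly like this (a
>> sketch only; "scheduler_lock" stands for any lock that the wakeup done
>> from rcu_read_unlock_special() may itself need, e.g. the rq/pi locks):
>>
>> 	static DEFINE_RAW_SPINLOCK(scheduler_lock);
>>
>> 	static void deadlock_sketch(void)
>> 	{
>> 		unsigned long flags;
>>
>> 		rcu_read_lock();
>> 		/* Suppose the task is preempted here, so
>> 		 * RCU_READ_UNLOCK_BLOCKED gets set for it. */
>> 		raw_spin_lock_irqsave(&scheduler_lock, flags);
>> 		/* ... access and check rcu data ... */
>> 		rcu_read_unlock();
>> 		/*   -> rcu_read_unlock_special()
>> 		 *        -> wake_up()
>> 		 *             -> needs scheduler_lock again: DEADLOCK */
>> 		raw_spin_unlock_irqrestore(&scheduler_lock, flags);
>> 	}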
>>
>> Aim)
>>
>> We want to fix the problem once and for all, so that the RCU read site
>> is deadlock-immune, as the books say.
>>
>> How)
>>
>> The problem is solved by the rule "if rcu_read_unlock_special() is
>> called while holding any lock which can be (transitively) nested inside
>> rcu_read_unlock_special(), we defer rcu_read_unlock_special()".
>> Such locks include rnp->lock, the scheduler locks, perf's ctx->lock,
>> locks taken in printk()/WARN_ON(), and all locks nested, directly or
>> transitively, inside these locks.
>>
>> The problem is thus reduced to "how do we recognize all these lock
>> contexts?"  We don't try to distinguish each of these locks
>> individually; instead we rely on the fact that all of them must be
>> acquired with local irqs disabled.
>>
>> So if rcu_read_unlock_special() is called in an irqs-disabled context,
>> it may be running under one of these suspect locks, and we defer
>> rcu_read_unlock_special().
>>
>> This heuristic increases the probability of deferring, but that
>> probability is still very low.
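>>
>> In code, the deferral boils down to the following check near the top of
>> the RCU_READ_UNLOCK_BLOCKED handling in rcu_read_unlock_special() (this
>> is the core of the patch below; the new "unlock" argument says whether
>> we were called from the unlock path rather than from the
>> context-switch path):
>>
>> 	if (unlikely(unlock && irqs_disabled_flags(flags))) {
>> 		/* Possibly nested inside a suspect lock: do no wakeups
>> 		 * here.  rcu_preempt_note_context_switch() or the next
>> 		 * outermost rcu_read_unlock() will finish the job. */
>> 		set_need_resched();
>> 		local_irq_restore(flags);
>> 		return;
>> 	}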
>>
>> Deferring does add a small overhead, but it gives us:
>>      1) genuine deadlock immunity for the RCU read site
>>      2) removal of the irq-work overhead (about 250 invocations per
>>         second on average; that irq-work path is sketched below)
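>>
>> For reference, the irq-work deferral referred to in 2) works roughly as
>> follows (a sketch with made-up names, not the exact kernel code): the
>> wakeup is pushed out of the lock-held context and run later from
>> irq-work context.
>>
>> 	static DECLARE_WAIT_QUEUE_HEAD(sketch_gp_wq);
>> 	static struct irq_work sketch_wakeup_work;
>>
>> 	static void sketch_wakeup(struct irq_work *work)
>> 	{
>> 		/* Runs later from irq-work context, with no rnp->lock or
>> 		 * scheduler lock held, so the wakeup cannot deadlock. */
>> 		wake_up(&sketch_gp_wq);
>> 	}
>>
>> 	static void sketch_init(void)
>> 	{
>> 		init_irq_work(&sketch_wakeup_work, sketch_wakeup);
>> 	}
>>
>> 	/* Called where a direct wake_up() would be deadlock-prone. */
>> 	static void sketch_deferred_wakeup(void)
>> 	{
>> 		irq_work_queue(&sketch_wakeup_work);
>> 	}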
> 
> One problem here -- it may take quite some time for a set_need_resched()
> to take effect.  This is especially a problem for RCU priority boosting,
> but can also needlessly delay preemptible-RCU grace periods because
> local_irq_restore() and friends don't check the TIF_NEED_RESCHED bit.


The final effect of deboosting (rt_mutex_unlock()) is also achieved via
set_need_resched()/set_tsk_need_resched(), so set_need_resched() is
sufficient for the RCU priority-boosting issue here.

However, since rcu_read_unlock_special() is deferred, it does take quite
some time for the quiescent-state report to take effect.
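
For reference, set_need_resched() essentially just sets TIF_NEED_RESCHED
on the current task; nothing acts on the flag until the next scheduling
point, which is why the deferred work can take a while to run:

	/* roughly the definition in include/linux/sched.h at the time */
	static inline void set_need_resched(void)
	{
		set_thread_flag(TIF_NEED_RESCHED);
	}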


> 
> OK, alternatives...
> 
> o     Keep the current rule saying that if the scheduler is going
>       to exit an RCU read-side critical section while holding
>       one of its spinlocks, preemption has to have been disabled

Since the rtmutex's lock->wait_lock is acquired with neither irqs nor bhs
disabled, the spinlocks covered by this rule include the scheduler locks,
the rtmutex's lock->wait_lock, and every lock that can be acquired in irq
or softirq context.

So this rule does not apply only to the scheduler locks; it would have to
be applied to almost every spinlock in the kernel.
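
Under that rule, nearly every such lock user would need a pattern like the
following (a sketch; "some_lock" stands for rq->lock, the rtmutex's
->wait_lock, and so on):

	static DEFINE_RAW_SPINLOCK(some_lock);

	static void rule_sketch(void)
	{
		unsigned long flags;

		preempt_disable();
		rcu_read_lock();
		raw_spin_lock_irqsave(&some_lock, flags);
		/* ... */
		rcu_read_unlock();
		/* Safe: we were never preempted inside the critical
		 * section, so RCU_READ_UNLOCK_BLOCKED is never set and
		 * rcu_read_unlock() does no deadlock-prone wakeups. */
		raw_spin_unlock_irqrestore(&some_lock, flags);
		preempt_enable();
	}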

I find it hard to accept that the RCU read site is not deadlock-immune.

Thanks,
Lai

>       throughout the full duration of that critical section.
>       Well, we can certainly do this, but it would be nice to get
>       rid of this rule.
> 
> o     Use per-CPU variables, possibly injecting delay.  This has ugly
>       disadvantages as noted above.
> 
> o     irq_work_queue() can wait a jiffy (or on some architectures,
>       quite a bit longer) before actually doing anything.
> 
> o     raise_softirq() is more immediate and is an easy change, but
>       adds a softirq vector -- which people are really trying to
>       get rid of.  Also, wakeup_softirqd() calls things that acquire
>       the scheduler locks, which is exactly what we were trying to
>       avoid doing.
> 
> o     invoke_rcu_core() can invoke raise_softirq() as above.
> 
> o     IPI to self.  From what I can see, not all architectures
>       support this.  Easy to fake if you have at least two CPUs,
>       but not so good from an OS jitter viewpoint...
> 
> o     Add a check to local_irq_disable() and friends.  I would guess
>       that this suggestion would not make architecture maintainers
>       happy.
> 
> Other thoughts?
> 
>                                                       Thanx, Paul
> 
>> Signed-off-by: Lai Jiangshan <la...@cn.fujitsu.com>
>> ---
>>  include/linux/rcupdate.h |    2 +-
>>  kernel/rcupdate.c        |    2 +-
>>  kernel/rcutree_plugin.h  |   47 +++++++++++++++++++++++++++++++++++++++++----
>>  3 files changed, 44 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
>> index 4b14bdc..00b4220 100644
>> --- a/include/linux/rcupdate.h
>> +++ b/include/linux/rcupdate.h
>> @@ -180,7 +180,7 @@ extern void synchronize_sched(void);
>>
>>  extern void __rcu_read_lock(void);
>>  extern void __rcu_read_unlock(void);
>> -extern void rcu_read_unlock_special(struct task_struct *t);
>> +extern void rcu_read_unlock_special(struct task_struct *t, bool unlock);
>>  void synchronize_rcu(void);
>>
>>  /*
>> diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
>> index cce6ba8..33b89a3 100644
>> --- a/kernel/rcupdate.c
>> +++ b/kernel/rcupdate.c
>> @@ -90,7 +90,7 @@ void __rcu_read_unlock(void)
>>  #endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
>>              barrier();  /* assign before ->rcu_read_unlock_special load */
>>              if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
>> -                    rcu_read_unlock_special(t);
>> +                    rcu_read_unlock_special(t, true);
>>              barrier();  /* ->rcu_read_unlock_special load before assign */
>>              t->rcu_read_lock_nesting = 0;
>>      }
>> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
>> index fc8b36f..997b424 100644
>> --- a/kernel/rcutree_plugin.h
>> +++ b/kernel/rcutree_plugin.h
>> @@ -242,15 +242,16 @@ static void rcu_preempt_note_context_switch(int cpu)
>>                                     ? rnp->gpnum
>>                                     : rnp->gpnum + 1);
>>              raw_spin_unlock_irqrestore(&rnp->lock, flags);
>> -    } else if (t->rcu_read_lock_nesting < 0 &&
>> -               !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN) &&
>> -               t->rcu_read_unlock_special) {
>> +    } else if (t->rcu_read_lock_nesting == 0 ||
>> +               (t->rcu_read_lock_nesting < 0 &&
>> +               !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN))) {
>>
>>              /*
>>               * Complete exit from RCU read-side critical section on
>>               * behalf of preempted instance of __rcu_read_unlock().
>>               */
>> -            rcu_read_unlock_special(t);
>> +            if (t->rcu_read_unlock_special)
>> +                    rcu_read_unlock_special(t, false);
>>      }
>>
>>      /*
>> @@ -333,7 +334,7 @@ static struct list_head *rcu_next_node_entry(struct task_struct *t,
>>   * notify RCU core processing or task having blocked during the RCU
>>   * read-side critical section.
>>   */
>> -void rcu_read_unlock_special(struct task_struct *t)
>> +void rcu_read_unlock_special(struct task_struct *t, bool unlock)
>>  {
>>      int empty;
>>      int empty_exp;
>> @@ -364,6 +365,42 @@ void rcu_read_unlock_special(struct task_struct *t)
>>
>>      /* Clean up if blocked during RCU read-side critical section. */
>>      if (special & RCU_READ_UNLOCK_BLOCKED) {
>> +            /*
>> +             * If an RCU read-side critical section overlaps with a scheduler
>> +             * lock, rcu_read_unlock_special() may lead to deadlock:
>> +             *
>> +             * rcu_read_lock();
>> +             * preempt_schedule[_irq]() (on preemption)
>> +             * scheduler lock; (or some other lock which can be (chained) nested
>> +             *                  in rcu_read_unlock_special()/rnp->lock)
>> +             * access and check rcu data
>> +             * rcu_read_unlock();
>> +             *   rcu_read_unlock_special();
>> +             *     wake_up();                 DEADLOCK
>> +             *
>> +             * To avoid this kind of deadlock, we bail out of
>> +             * rcu_read_unlock_special() here and defer it to
>> +             * rcu_preempt_note_context_switch() or to the next outermost
>> +             * rcu_read_unlock() whenever this case may occur.
>> +             *
>> +             * We cannot know whether the current _special() call is nested
>> +             * inside a scheduler lock or not, but we do know that irqs are
>> +             * always disabled in that case.  So we simply bail out and defer
>> +             * the work to rcu_preempt_note_context_switch() whenever irqs
>> +             * are disabled.
>> +             *
>> +             * This means we always defer _special() when it is nested in an
>> +             * irqs-disabled context, but
>> +             *      (special & RCU_READ_UNLOCK_BLOCKED) &&
>> +             *      irqs_disabled_flags(flags)
>> +             * is still unlikely to be true.
>> +             */
>> +            if (unlikely(unlock && irqs_disabled_flags(flags))) {
>> +                    set_need_resched();
>> +                    local_irq_restore(flags);
>> +                    return;
>> +            }
>> +
>>              t->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_BLOCKED;
>>
>>              /*
>> -- 
>> 1.7.4.4
>>
> 
> 
