On Wed, Feb 12, 2014 at 07:49:07AM -0800, Andy Lutomirski wrote:
> On Wed, Feb 12, 2014 at 2:13 AM, Peter Zijlstra <pet...@infradead.org> wrote:
> > On Tue, Feb 11, 2014 at 02:34:11PM -0800, Andy Lutomirski wrote:
> >> On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner <t...@linutronix.de> wrote:
> >> >> A small number of reschedule interrupts appear to be due to a race:
> >> >> both resched_task and wake_up_idle_cpu do, essentially:
> >> >>
> >> >> set_tsk_need_resched(t);
> >> >> smp_mb();
> >> >> if (!tsk_is_polling(t))
> >> >>   smp_send_reschedule(cpu);
> >> >>
> >> >> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU
> >> >> is too quick (which isn't surprising if it was in C0 or C1), then it
> >> >> could *clear* TS_POLLING before tsk_is_polling is read.
> >
> > Yeah we have the wrong default for the idle loops.. it should default to
> > polling and only switch to !polling at the very last moment if it really
> > needs an interrupt to wake.
> 
> I might be missing something, but won't that break the scheduler? 

Only for the idle task; all other tasks will have it !polling.

But note how the current generic idle loop does:

  if (!current_clr_polling_and_test()) {
        ...
        if (cpuidle_idle_call())
                arch_cpu_idle();
        ...
  }

This means that it still runs a metric ton of code, right up to the
mwait with !polling, and then at the mwait we switch it back to polling.

Completely daft.

> Since rq->lock is held, the resched calls could check the rq state
> (curr == idle, maybe) to distinguish these cases.

Not enough; but I'm afraid I confused you with the above.

My suggestion was really more that we should call into the cpuidle/arch
idle code with polling set, and only right before we hit hlt/wfi/etc..
should we clear the polling bit.
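
As a rough userspace model of that ordering (a sketch only: C11 atomics
stand in for the thread-info flags, and idle_may_halt() and the
IDLE_* names are made up for illustration, they are not kernel APIs):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits for this sketch; not the kernel's layout. */
#define IDLE_NEED_RESCHED 0x1u
#define IDLE_POLLING      0x2u

/*
 * Called right before the halt instruction (hlt/wfi/etc.).
 * Returns true if the CPU may halt, false if a wakeup already arrived.
 */
static bool idle_may_halt(atomic_uint *flags)
{
	/* Assumption: the idle code ran with polling set up to this point. */
	assert(atomic_load(flags) & IDLE_POLLING);

	/* Clear polling at the last moment; fetch_and returns the old word. */
	unsigned int old = atomic_fetch_and(flags, ~IDLE_POLLING);

	/*
	 * If NEED_RESCHED was set while we were still polling, the waker
	 * skipped the IPI, so we must not halt. Any waker that runs after
	 * the fetch_and sees !polling and sends the IPI instead.
	 */
	return !(old & IDLE_NEED_RESCHED);
}
```

The caller would halt only when this returns true; otherwise it falls
out of the idle loop and reschedules.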

> > It can't we're holding its rq->lock.
> 
> Exactly.  AFAICT the only reason that any of this code holds rq->lock
> (especially ttwu_queue_remote, which I seem to call a few thousand
> times per second) is because the only way to make a cpu reschedule
> involves playing with per-task flags.  If the flags were per-rq or
> per-cpu instead, then rq->lock wouldn't be needed.  If this were all
> done locklessly, then I think either a full cmpxchg or some fairly
> careful use of full barriers would be needed, but I bet that cmpxchg
> is still considerably faster than a spinlock plus a set_bit.

Ahh, that's what you're saying. Yes we should be able to do something
clever there.
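
To illustrate the lockless variant, here is a minimal userspace sketch,
assuming NEED_RESCHED and POLLING share one per-cpu word (which is not
the case today). The names RQ_NEED_RESCHED, RQ_POLLING and
set_nr_and_ipi_needed() are hypothetical, and it uses a single
fetch-or rather than a cmpxchg loop:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits sharing one per-cpu word. */
#define RQ_NEED_RESCHED 0x1u
#define RQ_POLLING      0x2u

static_assert((RQ_NEED_RESCHED & RQ_POLLING) == 0,
	      "flags must be distinct bits");

/*
 * Set NEED_RESCHED and report whether the target still needs an IPI.
 * The fetch-or is one atomic read-modify-write, so the polling state we
 * test is the state at the instant NEED_RESCHED became visible; there is
 * no window in which the target can clear polling between set and test.
 */
static bool set_nr_and_ipi_needed(atomic_uint *flags)
{
	unsigned int old = atomic_fetch_or(flags, RQ_NEED_RESCHED);

	return !(old & RQ_POLLING);
}
```

The waker would call smp_send_reschedule() only when this returns true.
This only works if the idle side re-checks NEED_RESCHED after it clears
polling.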

Something like the below is, I think, as close as we can come without
major surgery, i.e. without moving TIF_NEED_RESCHED and POLLING into a
per-cpu variable.

I might have messed it up though; brain seems to have given out for the
day :/

---
 kernel/sched/core.c  | 17 +++++++++++++----
 kernel/sched/idle.c  | 21 +++++++++++++--------
 kernel/sched/sched.h |  5 ++++-
 3 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fb9764fbc537..a5b64040c21d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -529,7 +529,7 @@ void resched_task(struct task_struct *p)
        }
 
        /* NEED_RESCHED must be visible before we test polling */
-       smp_mb();
+       smp_mb__after_clear_bit();
        if (!tsk_is_polling(p))
                smp_send_reschedule(cpu);
 }
@@ -1476,12 +1476,15 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
 }
 
 #ifdef CONFIG_SMP
-static void sched_ttwu_pending(void)
+void sched_ttwu_pending(void)
 {
        struct rq *rq = this_rq();
        struct llist_node *llist = llist_del_all(&rq->wake_list);
        struct task_struct *p;
 
+       if (!llist)
+               return;
+
        raw_spin_lock(&rq->lock);
 
        while (llist) {
@@ -1536,8 +1539,14 @@ void scheduler_ipi(void)
 
 static void ttwu_queue_remote(struct task_struct *p, int cpu)
 {
-       if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list))
-               smp_send_reschedule(cpu);
+       struct rq *rq = cpu_rq(cpu);
+
+       if (llist_add(&p->wake_entry, &rq->wake_list)) {
+               set_tsk_need_resched(rq->idle);
+               smp_mb__after_clear_bit();
+               if (!tsk_is_polling(rq->idle) || rq->curr != rq->idle)
+                       smp_send_reschedule(cpu);
+       }
 }
 
 bool cpus_share_cache(int this_cpu, int that_cpu)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 14ca43430aee..bd8ed2d2f2f7 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -105,19 +105,24 @@ static void cpu_idle_loop(void)
                                } else {
                                        local_irq_enable();
                                }
-                               __current_set_polling();
                        }
                        arch_cpu_idle_exit();
-                       /*
-                        * We need to test and propagate the TIF_NEED_RESCHED
-                        * bit here because we might not have send the
-                        * reschedule IPI to idle tasks.
-                        */
-                       if (tif_need_resched())
-                               set_preempt_need_resched();
                }
+
+               /*
+                * We must clear polling before running sched_ttwu_pending().
+                * Otherwise it becomes possible to have entries added in
+                * ttwu_queue_remote() and still not get an IPI to process
+                * them.
+                */
+               __current_clr_polling();
+
+               set_preempt_need_resched();
+               sched_ttwu_pending();
+
                tick_nohz_idle_exit();
                schedule_preempt_disabled();
+               __current_set_polling();
        }
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1bf34c257d3b..b59dbdb135d8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1157,9 +1157,10 @@ extern const struct sched_class rt_sched_class;
 extern const struct sched_class fair_sched_class;
 extern const struct sched_class idle_sched_class;
 
-
 #ifdef CONFIG_SMP
 
+extern void sched_ttwu_pending(void);
+
 extern void update_group_power(struct sched_domain *sd, int cpu);
 
 extern void trigger_load_balance(struct rq *rq);
@@ -1170,6 +1171,8 @@ extern void idle_exit_fair(struct rq *this_rq);
 
 #else  /* CONFIG_SMP */
 
+static inline void sched_ttwu_pending(void) { }
+
 static inline void idle_balance(int cpu, struct rq *rq)
 {
 }
