On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner <t...@linutronix.de> wrote:
> On Tue, 11 Feb 2014, Andy Lutomirski wrote:
>
> Just adding Peter for now, as I'm too tired to grok the issue right
> now.
>
>> Rumor has it that Linux 3.13 was supposed to get rid of all the silly
>> rescheduling interrupts.  It doesn't, although it does seem to have
>> improved the situation.
>>
>> A small number of reschedule interrupts appear to be due to a race:
>> both resched_task and wake_up_idle_cpu do, essentially:
>>
>> set_tsk_need_resched(t);
>> smp_mb();
>> if (!tsk_is_polling(t))
>>   smp_send_reschedule(cpu);
>>
>> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU
>> is too quick (which isn't surprising if it was in C0 or C1), then it
>> could *clear* TS_POLLING before tsk_is_polling is read.
>>
>> Is there a good reason that TIF_NEED_RESCHED is in thread->flags and
>> TS_POLLING is in thread->status?  Couldn't both of these be in the
>> same field in something like struct rq?  That would allow a real
>> atomic op here.
>>
>> The more serious issue is that AFAICS default_wake_function is
>> completely missing the polling check.  It goes through
>> ttwu_queue_remote, which unconditionally sends an interrupt.
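To spell out the race described above between the waker and a
fast-waking polling CPU, here is a rough timeline sketch (not actual
kernel code; exactly how the idle loop notices the flag, a plain poll
loop vs. monitor/mwait, is an assumption):

/*
 *  waker CPU                        target CPU (polling in idle)
 *  ---------                        ----------------------------
 *  set_tsk_need_resched(t);
 *                                   poll/mwait notices the flag,
 *                                   exits idle, clears TS_POLLING
 *  smp_mb();
 *  if (!tsk_is_polling(t))          <- now reads false ...
 *      smp_send_reschedule(cpu);    <- ... so a spurious IPI is sent
 *                                      to a CPU that already woke up
 */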

There would be an extra benefit of moving the resched-related bits to
some per-cpu structure: it would allow lockless wakeups.
ttwu_queue_remote, and probably all of the other reschedule-a-cpu
functions, could do something like:

if (...) {
  old = atomic_read(resched_flags(cpu));
  while (true) {
    if (old & RESCHED_NEED_RESCHED)
      return;  /* already marked; the target will reschedule anyway */
    if (!(old & RESCHED_POLLING)) {
      smp_send_reschedule(cpu);  /* not polling, so the IPI is needed */
      return;
    }
    /* Target is polling: setting the flag is enough to wake it. */
    new = old | RESCHED_NEED_RESCHED;
    prev = atomic_cmpxchg(resched_flags(cpu), old, new);
    if (prev == old)
      return;  /* flag set while still polling; no IPI needed */
    old = prev;  /* lost a race; re-read and retry */
  }
}
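
For symmetry, the polling side of that protocol might look something
like the sketch below.  This is only a sketch under the same
assumption of a per-cpu flags word; resched_flags(), RESCHED_POLLING
and RESCHED_NEED_RESCHED are the made-up names from above, and the
monitor/mwait details are elided:

/*
 * Hypothetical idle-side counterpart: advertise that we are polling,
 * wait for the flag, then clear both bits in a single atomic update
 * so that a waker can never observe a stale RESCHED_POLLING after
 * polling has actually stopped.
 */
static void resched_poll_idle(int cpu)
{
  atomic_t *f = resched_flags(cpu);
  int old, new;

  /* Advertise polling, preserving any already-pending request. */
  do {
    old = atomic_read(f);
    new = old | RESCHED_POLLING;
  } while (atomic_cmpxchg(f, old, new) != old);

  while (!(atomic_read(f) & RESCHED_NEED_RESCHED))
    cpu_relax();  /* or monitor/mwait on the flags word */

  /* Leave idle: drop both bits atomically. */
  do {
    old = atomic_read(f);
    new = old & ~(RESCHED_POLLING | RESCHED_NEED_RESCHED);
  } while (atomic_cmpxchg(f, old, new) != old);
}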

The point being that, with the current location of the flags, either
an interrupt needs to be sent or something needs to be done to prevent
rq->curr from disappearing.  (It probably doesn't matter if the
current task changes, because TS_POLLING will be clear, but what if
the task goes away entirely?)

All that being said, it looks like ttwu_queue_remote doesn't actually
work if the IPI isn't sent.  The attached patch appears to work (and
reduces total rescheduling IPIs by a large amount for my workload),
but I don't really think it's worthy of being applied...

--Andy
From 9dfa6a99e5eb5ab0bc3a4d6beb599ba0f2f633af Mon Sep 17 00:00:00 2001
Message-Id: <9dfa6a99e5eb5ab0bc3a4d6beb599ba0f2f633af.1392157722.git.l...@amacapital.net>
From: Andy Lutomirski <l...@amacapital.net>
Date: Tue, 11 Feb 2014 14:26:46 -0800
Subject: [PATCH] sched: Try to avoid sending an IPI in ttwu_queue_remote

This is an experimental patch.  It should probably not be applied.

Signed-off-by: Andy Lutomirski <l...@amacapital.net>
---
 kernel/sched/core.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a88f4a4..fc7b048 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1475,20 +1475,23 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
 }
 
 #ifdef CONFIG_SMP
-static void sched_ttwu_pending(void)
+static void __sched_ttwu_pending(struct rq *rq)
 {
-	struct rq *rq = this_rq();
 	struct llist_node *llist = llist_del_all(&rq->wake_list);
 	struct task_struct *p;
 
-	raw_spin_lock(&rq->lock);
-
 	while (llist) {
 		p = llist_entry(llist, struct task_struct, wake_entry);
 		llist = llist_next(llist);
 		ttwu_do_activate(rq, p, 0);
 	}
+}
 
+static void sched_ttwu_pending(void)
+{
+	struct rq *rq = this_rq();
+	raw_spin_lock(&rq->lock);
+	__sched_ttwu_pending(rq);
 	raw_spin_unlock(&rq->lock);
 }
 
@@ -1536,8 +1539,15 @@ void scheduler_ipi(void)
 
 static void ttwu_queue_remote(struct task_struct *p, int cpu)
 {
-	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list))
-		smp_send_reschedule(cpu);
+	struct rq *rq = cpu_rq(cpu);
+
+	if (llist_add(&p->wake_entry, &rq->wake_list)) {
+		unsigned long flags;
+
+		raw_spin_lock_irqsave(&rq->lock, flags);
+		resched_task(cpu_curr(cpu));
+		raw_spin_unlock_irqrestore(&rq->lock, flags);
+	}
 }
 
 bool cpus_share_cache(int this_cpu, int that_cpu)
@@ -2525,6 +2535,8 @@ need_resched:
 	smp_mb__before_spinlock();
 	raw_spin_lock_irq(&rq->lock);
 
+	__sched_ttwu_pending(rq);
+
 	switch_count = &prev->nivcsw;
 	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
 		if (unlikely(signal_pending_state(prev->state, prev))) {
-- 
1.8.5.3
