Re: Too many rescheduling interrupts (still!)

2014-02-12 Thread Peter Zijlstra
On Wed, Feb 12, 2014 at 10:19:42AM -0800, Andy Lutomirski wrote:
> >  static void ttwu_queue_remote(struct task_struct *p, int cpu)
> >  {
> > -   if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list))
> > -   smp_send_reschedule(cpu);
> > +   struct rq *rq = cpu_rq(cpu);
> > +
> > +   if (llist_add(&p->wake_entry, &rq->wake_list)) {
> > +   set_tsk_need_resched(rq->idle);
> > +   smp_mb__after_clear_bit();
> > +   if (!tsk_is_polling(rq->idle) || rq->curr != rq->idle)
> > +   smp_send_reschedule(cpu);
> > +   }
> 
> At the very least this needs a comment pointing out that rq->lock is
> intentionally not taken.  This makes my brain hurt a little :)

Oh absolutely; I wanted to write one, but couldn't get a straight story
so gave up for now.

> > +   /*
> > +* We must clear polling before running sched_ttwu_pending().
> > +* Otherwise it becomes possible to have entries added in
> > +* ttwu_queue_remote() and still not get an IPI to process
> > +* them.
> > +*/
> > +   __current_clr_polling();
> > +
> > +   set_preempt_need_resched();
> > +   sched_ttwu_pending();
> > +
> > tick_nohz_idle_exit();
> > schedule_preempt_disabled();
> > +   __current_set_polling();
> 
> I wonder if this side has enough barriers to make this work.

sched_ttwu_pending() does xchg() as first op and thereby orders itself
against the clr_polling.
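
To spell out the pairing, here is a minimal userspace analogue in C11
atomics (a sketch only: `polling' stands in for TS_POLLING and
`wake_list' for rq->wake_list; the kernel code above is the real thing):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool polling;     /* TS_POLLING analogue */
static atomic_int  wake_list;   /* rq->wake_list analogue: nonzero = pending */

/* Waker side, as in ttwu_queue_remote(): queue the entry, then decide
 * whether an IPI is needed. */
static bool queue_remote_needs_ipi(void)
{
        atomic_fetch_add(&wake_list, 1);                /* llist_add() */
        atomic_thread_fence(memory_order_seq_cst);      /* smp_mb() */
        return !atomic_load(&polling);                  /* tsk_is_polling() */
}

/* Idle side: clear polling, then drain.  The atomic_exchange() mirrors
 * the xchg() inside llist_del_all(); being a full barrier it cannot be
 * reordered before the polling store, so at least one side always
 * observes the other's write. */
static int idle_drain_pending(void)
{
        atomic_store(&polling, false);          /* __current_clr_polling() */
        return atomic_exchange(&wake_list, 0);  /* sched_ttwu_pending() */
}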


I'll need a fresh brain for your proposal.. will read it again in the
morning.


Re: Too many rescheduling interrupts (still!)

2014-02-12 Thread Andy Lutomirski
On Wed, Feb 12, 2014 at 8:39 AM, Peter Zijlstra  wrote:
> On Wed, Feb 12, 2014 at 07:49:07AM -0800, Andy Lutomirski wrote:
>> On Wed, Feb 12, 2014 at 2:13 AM, Peter Zijlstra  wrote:
>> Exactly.  AFAICT the only reason that any of this code holds rq->lock
>> (especially ttwu_queue_remote, which I seem to call a few thousand
>> times per second) is because the only way to make a cpu reschedule
>> involves playing with per-task flags.  If the flags were per-rq or
>> per-cpu instead, then rq->lock wouldn't be needed.  If this were all
>> done locklessly, then I think either a full cmpxchg or some fairly
>> careful use of full barriers would be needed, but I bet that cmpxchg
>> is still considerably faster than a spinlock plus a set_bit.
>
> Ahh, that's what you're saying. Yes we should be able to do something
> clever there.
>
> Something like the below is I think as close as we can come without
> major surgery and moving TIF_NEED_RESCHED and POLLING into a per-cpu
> variable.
>
> I might have messed it up though; brain seems to have given out for the
> day :/
>
> ---
>  kernel/sched/core.c  | 17 +
>  kernel/sched/idle.c  | 21 +
>  kernel/sched/sched.h |  5 -
>  3 files changed, 30 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fb9764fbc537..a5b64040c21d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -529,7 +529,7 @@ void resched_task(struct task_struct *p)
> }
>
> /* NEED_RESCHED must be visible before we test polling */
> -   smp_mb();
> +   smp_mb__after_clear_bit();
> if (!tsk_is_polling(p))
> smp_send_reschedule(cpu);
>  }
> @@ -1476,12 +1476,15 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
>  }
>
>  #ifdef CONFIG_SMP
> -static void sched_ttwu_pending(void)
> +void sched_ttwu_pending(void)
>  {
> struct rq *rq = this_rq();
> struct llist_node *llist = llist_del_all(&rq->wake_list);
> struct task_struct *p;
>
> +   if (!llist)
> +   return;
> +
> raw_spin_lock(&rq->lock);
>
> while (llist) {
> @@ -1536,8 +1539,14 @@ void scheduler_ipi(void)
>
>  static void ttwu_queue_remote(struct task_struct *p, int cpu)
>  {
> -   if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list))
> -   smp_send_reschedule(cpu);
> +   struct rq *rq = cpu_rq(cpu);
> +
> +   if (llist_add(&p->wake_entry, &rq->wake_list)) {
> +   set_tsk_need_resched(rq->idle);
> +   smp_mb__after_clear_bit();
> +   if (!tsk_is_polling(rq->idle) || rq->curr != rq->idle)
> +   smp_send_reschedule(cpu);
> +   }

At the very least this needs a comment pointing out that rq->lock is
intentionally not taken.  This makes my brain hurt a little :)

>  }
>
>  bool cpus_share_cache(int this_cpu, int that_cpu)
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 14ca43430aee..bd8ed2d2f2f7 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -105,19 +105,24 @@ static void cpu_idle_loop(void)
> } else {
> local_irq_enable();
> }
> -   __current_set_polling();
> }
> arch_cpu_idle_exit();
> -   /*
> -* We need to test and propagate the TIF_NEED_RESCHED
> > -* bit here because we might not have sent the
> -* reschedule IPI to idle tasks.
> -*/
> -   if (tif_need_resched())
> -   set_preempt_need_resched();
> }
> +
> +   /*
> +* We must clear polling before running sched_ttwu_pending().
> +* Otherwise it becomes possible to have entries added in
> +* ttwu_queue_remote() and still not get an IPI to process
> +* them.
> +*/
> +   __current_clr_polling();
> +
> +   set_preempt_need_resched();
> +   sched_ttwu_pending();
> +
> tick_nohz_idle_exit();
> schedule_preempt_disabled();
> +   __current_set_polling();

I wonder if this side has enough barriers to make this work.


I'll see if I have a few free minutes (yeah right!) to try out the
major surgery approach.  I think I can do it without even cmpxchg.
Basically, there would be a percpu variable idlepoll_state with three
values: IDLEPOLL_NOT_POLLING, IDLEPOLL_WOKEN, and IDLEPOLL_POLLING.

The polling idle code does:

idlepoll_state = IDLEPOLL_POLLING;
smp_mb();
check for ttwu and need_resched;
mwait, poll, or whatever until idlepoll_state != IDLEPOLL_POLLING;
idlepoll_state = IDLEPOLL_NOT_POLLING;
smp_mb();
check for ttwu and need_resched;
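
A sketch of the matching wake side (C11-atomic style; idlepoll_state and
the IDLEPOLL_* values are the proposal above, while the helper itself
and its use of cmpxchg are only illustrative - it may indeed be doable
without cmpxchg):

#include <stdatomic.h>

enum idlepoll { IDLEPOLL_NOT_POLLING, IDLEPOLL_WOKEN, IDLEPOLL_POLLING };

extern void smp_send_reschedule(int cpu);       /* real kernel function */

static void idlepoll_wake(_Atomic enum idlepoll *state, int cpu)
{
        enum idlepoll old = IDLEPOLL_POLLING;

        /* If the target is polling, flip POLLING -> WOKEN; its
         * mwait/poll loop sees the store and no interrupt is needed. */
        if (atomic_compare_exchange_strong(state, &old, IDLEPOLL_WOKEN))
                return;

        /* Not polling: fall back to the IPI.  (old == IDLEPOLL_WOKEN
         * means someone else already woke it; nothing to do.) */
        if (old == IDLEPOLL_NOT_POLLING)
                smp_send_reschedule(cpu);
}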

The idle non-poll

Re: Too many rescheduling interrupts (still!)

2014-02-12 Thread Peter Zijlstra
On Wed, Feb 12, 2014 at 06:46:39PM +0100, Frederic Weisbecker wrote:
> Ok but if the target is idle, dynticks and not polling, we don't have the
> choice but to send an IPI, right? I'm talking about this kind of case.

Yes; but Andy doesn't seem concerned with such hardware (!x86).

Anything x86 (except ancient stuff) is effectively polling and wakes up
from the TIF_NEED_RESCHED write.
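
(The mechanism behind that: monitor arms the monitor hardware on the
cache line holding the thread flags, and mwait resumes on any write to
that line.  Roughly, and only as an illustration of the shape of the
x86 polling idle path, not the kernel's exact code:)

/* Arm the monitor on the flags word, re-check, then mwait.  A remote
 * CPU's set_tsk_need_resched() writes that cache line and resumes us,
 * no IPI required. */
static void mwait_poll_idle(unsigned long *flags_word)
{
        while (!need_resched()) {
                __monitor(flags_word, 0, 0);    /* watch the flags line */
                smp_mb();
                if (need_resched())
                        break;
                __mwait(0, 0);                  /* wake on write or irq */
        }
}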



Re: Too many rescheduling interrupts (still!)

2014-02-12 Thread Frederic Weisbecker
On Wed, Feb 12, 2014 at 05:43:56PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 12, 2014 at 04:59:52PM +0100, Frederic Weisbecker wrote:
> > 2014-02-12 11:13 GMT+01:00 Peter Zijlstra :
> > > On Tue, Feb 11, 2014 at 02:34:11PM -0800, Andy Lutomirski wrote:
> > >> On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner  
> > >> wrote:
> > >> >> A small number of reschedule interrupts appear to be due to a race:
> > >> >> both resched_task and wake_up_idle_cpu do, essentially:
> > >> >>
> > >> >> set_tsk_need_resched(t);
> > >> >> smp_mb();
> > >> >> if (!tsk_is_polling(t))
> > >> >>   smp_send_reschedule(cpu);
> > >> >>
> > >> >> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU
> > >> >> is too quick (which isn't surprising if it was in C0 or C1), then it
> > >> >> could *clear* TS_POLLING before tsk_is_polling is read.
> > >
> > > Yeah we have the wrong default for the idle loops.. it should default to
> > > polling and only switch to !polling at the very last moment if it really
> > > needs an interrupt to wake.
> > >
> > > Changing this requires someone (probably me again :/) to audit all arch
> > > cpu idle drivers/functions.
> > 
> > Looking at wake_up_idle_cpu(), we set need_resched and send the IPI.
> > On the other end, the CPU wakes up, exits the idle loop and even goes
> > to the scheduler while there is probably no task to schedule.
> > 
> > I wonder if this is all necessary. All we need is the timer to be
> > handled by the dynticks code to re-evaluate the next tick. So calling
> > irq_exit() -> tick_nohz_irq_exit() from the scheduler_ipi() should be
> > enough.
> 
> No no, the idea was to NOT send IPIs. So falling out of idle by writing
> TIF_NEED_RESCHED and having the idle loop fixup the timers on its way
> back to idle is what you want.

Ok but if the target is idle, dynticks and not polling, we don't have the choice
but to send an IPI, right? I'm talking about this kind of case.


Re: Too many rescheduling interrupts (still!)

2014-02-12 Thread Peter Zijlstra
On Wed, Feb 12, 2014 at 04:59:52PM +0100, Frederic Weisbecker wrote:
> 2014-02-12 11:13 GMT+01:00 Peter Zijlstra :
> > On Tue, Feb 11, 2014 at 02:34:11PM -0800, Andy Lutomirski wrote:
> >> On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner  wrote:
> >> >> A small number of reschedule interrupts appear to be due to a race:
> >> >> both resched_task and wake_up_idle_cpu do, essentially:
> >> >>
> >> >> set_tsk_need_resched(t);
> > >> >> smp_mb();
> >> >> if (!tsk_is_polling(t))
> >> >>   smp_send_reschedule(cpu);
> >> >>
> >> >> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU
> >> >> is too quick (which isn't surprising if it was in C0 or C1), then it
> >> >> could *clear* TS_POLLING before tsk_is_polling is read.
> >
> > Yeah we have the wrong default for the idle loops.. it should default to
> > polling and only switch to !polling at the very last moment if it really
> > needs an interrupt to wake.
> >
> > Changing this requires someone (probably me again :/) to audit all arch
> > cpu idle drivers/functions.
> 
> Looking at wake_up_idle_cpu(), we set need_resched and send the IPI.
> On the other end, the CPU wakes up, exits the idle loop and even goes
> to the scheduler while there is probably no task to schedule.
> 
> I wonder if this is all necessary. All we need is the timer to be
> handled by the dynticks code to re-evaluate the next tick. So calling
> irq_exit() -> tick_nohz_irq_exit() from the scheduler_ipi() should be
> enough.

No no, the idea was to NOT send IPIs. So falling out of idle by writing
TIF_NEED_RESCHED and having the idle loop fixup the timers on its way
back to idle is what you want.


Re: Too many rescheduling interrupts (still!)

2014-02-12 Thread Peter Zijlstra
On Wed, Feb 12, 2014 at 07:49:07AM -0800, Andy Lutomirski wrote:
> On Wed, Feb 12, 2014 at 2:13 AM, Peter Zijlstra  wrote:
> > On Tue, Feb 11, 2014 at 02:34:11PM -0800, Andy Lutomirski wrote:
> >> On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner  wrote:
> >> >> A small number of reschedule interrupts appear to be due to a race:
> >> >> both resched_task and wake_up_idle_cpu do, essentially:
> >> >>
> >> >> set_tsk_need_resched(t);
> >> >> smp_mb();
> >> >> if (!tsk_is_polling(t))
> >> >>   smp_send_reschedule(cpu);
> >> >>
> >> >> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU
> >> >> is too quick (which isn't surprising if it was in C0 or C1), then it
> >> >> could *clear* TS_POLLING before tsk_is_polling is read.
> >
> > Yeah we have the wrong default for the idle loops.. it should default to
> > polling and only switch to !polling at the very last moment if it really
> > needs an interrupt to wake.
> 
> I might be missing something, but won't that break the scheduler? 

for the idle task.. all other tasks will have it !polling.

But note how the current generic idle loop does:

        if (!current_clr_polling_and_test()) {
                ...
                if (cpuidle_idle_call())
                        arch_cpu_idle();
                ...
        }

This means that it still runs a metric ton of code, right up to the
mwait with !polling, and then at the mwait we switch it back to polling.

Completely daft.

> Since rq->lock is held, the resched calls could check the rq state
> (curr == idle, maybe) to distinguish these cases.

Not enough; but I'm afraid I confused you with the above.

My suggestion was really more that we should call into the cpuidle/arch
idle code with polling set, and only right before we hit hlt/wfi/etc..
should we clear the polling bit.
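
Concretely, something with this shape (illustrative only, not a real
patch; the helpers named are the existing kernel ones):

static void arch_cpu_idle_sketch(void)
{
        /* still polling here: a remote set_tsk_need_resched() suffices
         * to wake us, so no IPI is needed up to this point */
        if (!current_clr_polling_and_test())
                safe_halt();            /* hlt with interrupts enabled */
        else
                local_irq_enable();
        __current_set_polling();
}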

> > It can't we're holding its rq->lock.
> 
> Exactly.  AFAICT the only reason that any of this code holds rq->lock
> (especially ttwu_queue_remote, which I seem to call a few thousand
> times per second) is because the only way to make a cpu reschedule
> involves playing with per-task flags.  If the flags were per-rq or
> per-cpu instead, then rq->lock wouldn't be needed.  If this were all
> done locklessly, then I think either a full cmpxchg or some fairly
> careful use of full barriers would be needed, but I bet that cmpxchg
> is still considerably faster than a spinlock plus a set_bit.

Ahh, that's what you're saying. Yes we should be able to do something
clever there.

Something like the below is I think as close as we can come without
major surgery and moving TIF_NEED_RESCHED and POLLING into a per-cpu
variable.

I might have messed it up though; brain seems to have given out for the
day :/

---
 kernel/sched/core.c  | 17 +
 kernel/sched/idle.c  | 21 +
 kernel/sched/sched.h |  5 -
 3 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fb9764fbc537..a5b64040c21d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -529,7 +529,7 @@ void resched_task(struct task_struct *p)
}
 
/* NEED_RESCHED must be visible before we test polling */
-   smp_mb();
+   smp_mb__after_clear_bit();
if (!tsk_is_polling(p))
smp_send_reschedule(cpu);
 }
@@ -1476,12 +1476,15 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
 }
 
 #ifdef CONFIG_SMP
-static void sched_ttwu_pending(void)
+void sched_ttwu_pending(void)
 {
struct rq *rq = this_rq();
struct llist_node *llist = llist_del_all(&rq->wake_list);
struct task_struct *p;
 
+   if (!llist)
+   return;
+
raw_spin_lock(&rq->lock);
 
while (llist) {
@@ -1536,8 +1539,14 @@ void scheduler_ipi(void)
 
 static void ttwu_queue_remote(struct task_struct *p, int cpu)
 {
-   if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list))
-   smp_send_reschedule(cpu);
+   struct rq *rq = cpu_rq(cpu);
+
+   if (llist_add(&p->wake_entry, &rq->wake_list)) {
+   set_tsk_need_resched(rq->idle);
+   smp_mb__after_clear_bit();
+   if (!tsk_is_polling(rq->idle) || rq->curr != rq->idle)
+   smp_send_reschedule(cpu);
+   }
 }
 
 bool cpus_share_cache(int this_cpu, int that_cpu)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 14ca43430aee..bd8ed2d2f2f7 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -105,19 +105,24 @@ static void cpu_idle_loop(void)
} else {
local_irq_enable();
}
-   __current_set_polling();
}
arch_cpu_idle_exit();
-   /*
-* We need to test and propagate the TIF_NEED_RESCHED
-* bit here because we might not have sent the
-* reschedule IPI to idle tasks.
-*/
-   if (tif_need_resched())
-   set_preempt_need_resched();
}
+
+   /*
+* We must clear polling before running sched_ttwu_pending().
+* Otherwise it becomes possible to have entries added in
+* ttwu_queue_remote() and still not get an IPI to process
+* them.
+*/
+   __current_clr_polling();
+
+   set_preempt_need_resched();
+   sched_ttwu_pending();
+
tick_nohz_idle_exit();
schedule_preempt_disabled();
+   __current_set_polling();

Re: Too many rescheduling interrupts (still!)

2014-02-12 Thread Frederic Weisbecker
2014-02-12 11:13 GMT+01:00 Peter Zijlstra :
> On Tue, Feb 11, 2014 at 02:34:11PM -0800, Andy Lutomirski wrote:
>> On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner  wrote:
>> >> A small number of reschedule interrupts appear to be due to a race:
>> >> both resched_task and wake_up_idle_cpu do, essentially:
>> >>
>> >> set_tsk_need_resched(t);
>> >> smp_mb();
>> >> if (!tsk_is_polling(t))
>> >>   smp_send_reschedule(cpu);
>> >>
>> >> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU
>> >> is too quick (which isn't surprising if it was in C0 or C1), then it
>> >> could *clear* TS_POLLING before tsk_is_polling is read.
>
> Yeah we have the wrong default for the idle loops.. it should default to
> polling and only switch to !polling at the very last moment if it really
> needs an interrupt to wake.
>
> Changing this requires someone (probably me again :/) to audit all arch
> cpu idle drivers/functions.

Looking at wake_up_idle_cpu(), we set need_resched and send the IPI.
On the other end, the CPU wakes up, exits the idle loop and even goes
to the scheduler while there is probably no task to schedule.

I wonder if this is all necessary. All we need is the timer to be
handled by the dynticks code to re-evaluate the next tick. So calling
irq_exit() -> tick_nohz_irq_exit() from the scheduler_ipi() should be
enough.

We could use a specific flag set before smp_send_reschedule() and read
in scheduler_ipi() entry to check if we need irq_entry()/irq_exit().
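
Something with this shape, say (a sketch; got_nohz_kick() is a stand-in
for testing whatever flag the sender would set before
smp_send_reschedule(), everything else named is existing kernel code):

void scheduler_ipi_sketch(void)
{
        if (llist_empty(&this_rq()->wake_list) && !got_nohz_kick())
                return;         /* cheap path: no irq_enter()/irq_exit() */

        irq_enter();
        sched_ttwu_pending();
        irq_exit();             /* -> tick_nohz_irq_exit(), re-evaluates the tick */
}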


Re: Too many rescheduling interrupts (still!)

2014-02-12 Thread Andy Lutomirski
On Wed, Feb 12, 2014 at 2:13 AM, Peter Zijlstra  wrote:
> On Tue, Feb 11, 2014 at 02:34:11PM -0800, Andy Lutomirski wrote:
>> On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner  wrote:
>> >> A small number of reschedule interrupts appear to be due to a race:
>> >> both resched_task and wake_up_idle_cpu do, essentially:
>> >>
>> >> set_tsk_need_resched(t);
>> >> smp_mb();
>> >> if (!tsk_is_polling(t))
>> >>   smp_send_reschedule(cpu);
>> >>
>> >> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU
>> >> is too quick (which isn't surprising if it was in C0 or C1), then it
>> >> could *clear* TS_POLLING before tsk_is_polling is read.
>
> Yeah we have the wrong default for the idle loops.. it should default to
> polling and only switch to !polling at the very last moment if it really
> needs an interrupt to wake.

I might be missing something, but won't that break the scheduler?  If
tsk_is_polling always returns true on mwait-capable systems, then
other cpus won't be able to use the polling bit to distinguish between
the idle state (where setting need_resched is enough) and the non-idle
state (where the IPI is needed to preempt whatever task is running).

Since rq->lock is held, the resched calls could check the rq state
(curr == idle, maybe) to distinguish these cases.


>
>> There would be an extra benefit of moving the resched-related bits to
>> some per-cpu structure: it would allow lockless wakeups.
>> ttwu_queue_remote, and probably all of the other reschedule-a-cpu
>> functions, could do something like:
>>
>> if (...) {
>>         old = atomic_read(resched_flags(cpu));
>>         while (true) {
>>                 if (old & RESCHED_NEED_RESCHED)
>>                         return;
>>                 if (!(old & RESCHED_POLLING)) {
>>                         smp_send_reschedule(cpu);
>>                         return;
>>                 }
>>                 new = old | RESCHED_NEED_RESCHED;
>>                 old = atomic_cmpxchg(resched_flags(cpu), old, new);
>>         }
>> }
>
> That looks hideously expensive.. for no apparent reason.
>
> Sending that IPI isn't _that_ bad, esp if we get the false-positive
> window smaller than it is now (it's far too wide because of the wrong
> default state).
>
>> The point being that, with the current location of the flags, either
>> an interrupt needs to be sent or something needs to be done to prevent
>> rq->curr from disappearing.  (It probably doesn't matter if the
>> current task changes, because TS_POLLING will be clear, but what if
>> the task goes away entirely?)
>
> It can't we're holding its rq->lock.

Exactly.  AFAICT the only reason that any of this code holds rq->lock
(especially ttwu_queue_remote, which I seem to call a few thousand
times per second) is because the only way to make a cpu reschedule
involves playing with per-task flags.  If the flags were per-rq or
per-cpu instead, then rq->lock wouldn't be needed.  If this were all
done locklessly, then I think either a full cmpxchg or some fairly
careful use of full barriers would be needed, but I bet that cmpxchg
is still considerably faster than a spinlock plus a set_bit.

>
>> All that being said, it looks like ttwu_queue_remote doesn't actually
>> work if the IPI isn't sent.  The attached patch appears to work (and
>> reduces total rescheduling IPIs by a large amount for my workload),
>> but I don't really think it's worthy of being applied...
>
> We can do something similar though; we can move sched_ttwu_pending()
> into the generic idle loop, right next to set_preempt_need_resched().

Oh, right -- either the IPI or the idle code is guaranteed to happen
soon.  (But wouldn't setting TS_POLLING always break this, too?)

--Andy


Re: Too many rescheduling interrupts (still!)

2014-02-12 Thread Peter Zijlstra
On Tue, Feb 11, 2014 at 02:34:11PM -0800, Andy Lutomirski wrote:
> On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner  wrote:
> >> A small number of reschedule interrupts appear to be due to a race:
> >> both resched_task and wake_up_idle_cpu do, essentially:
> >>
> >> set_tsk_need_resched(t);
> >> smp_mb();
> >> if (!tsk_is_polling(t))
> >>   smp_send_reschedule(cpu);
> >>
> >> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU
> >> is too quick (which isn't surprising if it was in C0 or C1), then it
> >> could *clear* TS_POLLING before tsk_is_polling is read.

Yeah we have the wrong default for the idle loops.. it should default to
polling and only switch to !polling at the very last moment if it really
needs an interrupt to wake.

Changing this requires someone (probably me again :/) to audit all arch
cpu idle drivers/functions.

> >> Is there a good reason that TIF_NEED_RESCHED is in thread->flags and
> >> TS_POLLING is in thread->status?  Couldn't both of these be in the
> >> same field in something like struct rq?  That would allow a real
> >> atomic op here.

I don't see the value of an atomic op there; but many archs already have
this, grep for TIF_POLLING.

> >> The more serious issue is that AFAICS default_wake_function is
> >> completely missing the polling check.  It goes through
> >> ttwu_queue_remote, which unconditionally sends an interrupt.

Yah, because it does more than just wake the CPU; at the time we didn't
have a generic idle path, we could cure things now though.

> There would be an extra benefit of moving the resched-related bits to
> some per-cpu structure: it would allow lockless wakeups.
> ttwu_queue_remote, and probably all of the other reschedule-a-cpu
> functions, could do something like:
> 
> if (...) {
>         old = atomic_read(resched_flags(cpu));
>         while (true) {
>                 if (old & RESCHED_NEED_RESCHED)
>                         return;
>                 if (!(old & RESCHED_POLLING)) {
>                         smp_send_reschedule(cpu);
>                         return;
>                 }
>                 new = old | RESCHED_NEED_RESCHED;
>                 old = atomic_cmpxchg(resched_flags(cpu), old, new);
>         }
> }

That looks hideously expensive.. for no apparent reason.

Sending that IPI isn't _that_ bad, esp if we get the false-positive
window smaller than it is now (it's far too wide because of the wrong 
default state).

> The point being that, with the current location of the flags, either
> an interrupt needs to be sent or something needs to be done to prevent
> rq->curr from disappearing.  (It probably doesn't matter if the
> current task changes, because TS_POLLING will be clear, but what if
> the task goes away entirely?)

It can't we're holding its rq->lock.

> All that being said, it looks like ttwu_queue_remote doesn't actually
> work if the IPI isn't sent.  The attached patch appears to work (and
> reduces total rescheduling IPIs by a large amount for my workload),
> but I don't really think it's worthy of being applied...

We can do something similar though; we can move sched_ttwu_pending()
into the generic idle loop, right next to set_preempt_need_resched().


Re: Too many rescheduling interrupts (still!)

2014-02-11 Thread Andy Lutomirski
On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner  wrote:
> On Tue, 11 Feb 2014, Andy Lutomirski wrote:
>
> Just adding Peter for now, as I'm too tired to grok the issue right
> now.
>
>> Rumor has it that Linux 3.13 was supposed to get rid of all the silly
>> rescheduling interrupts.  It doesn't, although it does seem to have
>> improved the situation.
>>
>> A small number of reschedule interrupts appear to be due to a race:
>> both resched_task and wake_up_idle_cpu do, essentially:
>>
>> set_tsk_need_resched(t);
>> smp_mb();
>> if (!tsk_is_polling(t))
>>   smp_send_reschedule(cpu);
>>
>> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU
>> is too quick (which isn't surprising if it was in C0 or C1), then it
>> could *clear* TS_POLLING before tsk_is_polling is read.
>>
>> Is there a good reason that TIF_NEED_RESCHED is in thread->flags and
>> TS_POLLING is in thread->status?  Couldn't both of these be in the
>> same field in something like struct rq?  That would allow a real
>> atomic op here.
>>
>> The more serious issue is that AFAICS default_wake_function is
>> completely missing the polling check.  It goes through
>> ttwu_queue_remote, which unconditionally sends an interrupt.

There would be an extra benefit of moving the resched-related bits to
some per-cpu structure: it would allow lockless wakeups.
ttwu_queue_remote, and probably all of the other reschedule-a-cpu
functions, could do something like:

if (...) {
        old = atomic_read(resched_flags(cpu));
        while (true) {
                if (old & RESCHED_NEED_RESCHED)
                        return;
                if (!(old & RESCHED_POLLING)) {
                        smp_send_reschedule(cpu);
                        return;
                }
                new = old | RESCHED_NEED_RESCHED;
                old = atomic_cmpxchg(resched_flags(cpu), old, new);
        }
}
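
Spelled out as compilable code, the same idea looks like this (a sketch:
RESCHED_* and the flags word are the hypothetical per-cpu state proposed
above, and the success check on the cmpxchg is made explicit):

#include <stdatomic.h>

#define RESCHED_NEED_RESCHED    0x1
#define RESCHED_POLLING         0x2

extern void smp_send_reschedule(int cpu);       /* real kernel function */

/* Set NEED_RESCHED in the target cpu's per-cpu flags word; send the
 * IPI only when the target is not polling. */
static void resched_cpu_lockless(atomic_int *flags, int cpu)
{
        int old = atomic_load(flags);

        for (;;) {
                if (old & RESCHED_NEED_RESCHED)
                        return;                 /* already requested */
                if (!(old & RESCHED_POLLING)) {
                        smp_send_reschedule(cpu);
                        return;
                }
                /* target is polling: the flag write alone wakes it */
                if (atomic_compare_exchange_weak(flags, &old,
                                old | RESCHED_NEED_RESCHED))
                        return;
                /* cmpxchg failed: old was reloaded, re-evaluate */
        }
}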

The point being that, with the current location of the flags, either
an interrupt needs to be sent or something needs to be done to prevent
rq->curr from disappearing.  (It probably doesn't matter if the
current task changes, because TS_POLLING will be clear, but what if
the task goes away entirely?)

All that being said, it looks like ttwu_queue_remote doesn't actually
work if the IPI isn't sent.  The attached patch appears to work (and
reduces total rescheduling IPIs by a large amount for my workload),
but I don't really think it's worthy of being applied...

--Andy
From 9dfa6a99e5eb5ab0bc3a4d6beb599ba0f2f633af Mon Sep 17 00:00:00 2001
Message-Id: <9dfa6a99e5eb5ab0bc3a4d6beb599ba0f2f633af.1392157722.git.l...@amacapital.net>
From: Andy Lutomirski 
Date: Tue, 11 Feb 2014 14:26:46 -0800
Subject: [PATCH] sched: Try to avoid sending an IPI in ttwu_queue_remote

This is an experimental patch.  It should probably not be applied.

Signed-off-by: Andy Lutomirski 
---
 kernel/sched/core.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a88f4a4..fc7b048 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1475,20 +1475,23 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
 }
 
 #ifdef CONFIG_SMP
-static void sched_ttwu_pending(void)
+static void __sched_ttwu_pending(struct rq *rq)
 {
-	struct rq *rq = this_rq();
 	struct llist_node *llist = llist_del_all(&rq->wake_list);
 	struct task_struct *p;
 
-	raw_spin_lock(&rq->lock);
-
 	while (llist) {
 		p = llist_entry(llist, struct task_struct, wake_entry);
 		llist = llist_next(llist);
 		ttwu_do_activate(rq, p, 0);
 	}
+}
 
+static void sched_ttwu_pending(void)
+{
+	struct rq *rq = this_rq();
+	raw_spin_lock(&rq->lock);
+	__sched_ttwu_pending(rq);
 	raw_spin_unlock(&rq->lock);
 }
 
@@ -1536,8 +1539,15 @@ void scheduler_ipi(void)
 
 static void ttwu_queue_remote(struct task_struct *p, int cpu)
 {
-	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list))
-		smp_send_reschedule(cpu);
+	struct rq *rq = cpu_rq(cpu);
+
+	if (llist_add(&p->wake_entry, &rq->wake_list)) {
+		unsigned long flags;
+
+		raw_spin_lock_irqsave(&rq->lock, flags);
+		resched_task(cpu_curr(cpu));
+		raw_spin_unlock_irqrestore(&rq->lock, flags);
+	}
 }
 
 bool cpus_share_cache(int this_cpu, int that_cpu)
@@ -2525,6 +2535,8 @@ need_resched:
 	smp_mb__before_spinlock();
 	raw_spin_lock_irq(&rq->lock);
 
+	__sched_ttwu_pending(rq);
+
 	switch_count = &prev->nivcsw;
 	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
 		if (unlikely(signal_pending_state(prev->state, prev))) {
-- 
1.8.5.3



Re: Too many rescheduling interrupts (still!)

2014-02-11 Thread Thomas Gleixner
On Tue, 11 Feb 2014, Andy Lutomirski wrote:

Just adding Peter for now, as I'm too tired to grok the issue right
now.

> Rumor has it that Linux 3.13 was supposed to get rid of all the silly
> rescheduling interrupts.  It doesn't, although it does seem to have
> improved the situation.
> 
> A small number of reschedule interrupts appear to be due to a race:
> both resched_task and wake_up_idle_cpu do, essentially:
> 
> set_tsk_need_resched(t);
> smp_mb();
> if (!tsk_is_polling(t))
>   smp_send_reschedule(cpu);
> 
> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU
> is too quick (which isn't surprising if it was in C0 or C1), then it
> could *clear* TS_POLLING before tsk_is_polling is read.
> 
> Is there a good reason that TIF_NEED_RESCHED is in thread->flags and
> TS_POLLING is in thread->status?  Couldn't both of these be in the
> same field in something like struct rq?  That would allow a real
> atomic op here.
> 
> The more serious issue is that AFAICS default_wake_function is
> completely missing the polling check.  It goes through
> ttwu_queue_remote, which unconditionally sends an interrupt.
> 
> --Andy
> 


Too many rescheduling interrupts (still!)

2014-02-11 Thread Andy Lutomirski
Rumor has it that Linux 3.13 was supposed to get rid of all the silly
rescheduling interrupts.  It doesn't, although it does seem to have
improved the situation.

A small number of reschedule interrupts appear to be due to a race:
both resched_task and wake_up_idle_cpu do, essentially:

set_tsk_need_resched(t);
smp_mb();
if (!tsk_is_polling(t))
  smp_send_reschedule(cpu);

The problem is that set_tsk_need_resched wakes the CPU and, if the CPU
is too quick (which isn't surprising if it was in C0 or C1), then it
could *clear* TS_POLLING before tsk_is_polling is read.
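
Written out as a userspace analogue with C11 atomics (`polling' for
TS_POLLING, `nr' for TIF_NEED_RESCHED; a sketch of the interleaving,
not kernel code):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool polling = true;      /* TS_POLLING analogue */
static atomic_bool nr      = false;     /* TIF_NEED_RESCHED analogue */

/* Waker CPU, as in resched_task()/wake_up_idle_cpu(). */
static bool waker_needs_ipi(void)
{
        atomic_store(&nr, true);                        /* set_tsk_need_resched() */
        atomic_thread_fence(memory_order_seq_cst);      /* smp_mb() */
        return !atomic_load(&polling);                  /* tsk_is_polling() */
}

/* Idle CPU: the nr store above already woke it out of mwait.  If this
 * store runs before the waker's load, waker_needs_ipi() returns true
 * and a redundant IPI is sent to a CPU that is already awake. */
static void idle_wakes_up(void)
{
        atomic_store(&polling, false);
}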

Is there a good reason that TIF_NEED_RESCHED is in thread->flags and
TS_POLLING is in thread->status?  Couldn't both of these be in the
same field in something like struct rq?  That would allow a real
atomic op here.

The more serious issue is that AFAICS default_wake_function is
completely missing the polling check.  It goes through
ttwu_queue_remote, which unconditionally sends an interrupt.

--Andy


Re: Rescheduling interrupts

2008-01-23 Thread Matt Mackall

On Wed, 2008-01-23 at 09:53 +0100, Andi Kleen wrote:
> Ingo Molnar <[EMAIL PROTECTED]> writes:
> 
> > that would probably be the case if it's multiple sockets - but for 
> > multiple cores exactly the opposite is true: the sooner _both_ cores 
> > finish processing, the deeper power state the CPU can reach. 
> 
> That's only true on setups where the cores don't have 
> separate sleep states. But that's not generally true anymore.
> e.g. AMD Fam10h has completely separate power planes for
> the cores and I believe newer Intel CPUs can also let their 
> cores go to at least some sleep states independently (although
> the deepest sleep modes still require all cores idle) 

I think we can expect everyone to rapidly evolve towards full
independence of core power states. In fact, it wouldn't surprise me if
we eventually get to the point of shutting down individual functional
units like the FPU.

-- 
Mathematics is the supreme nostalgia of our time.



Re: Rescheduling interrupts

2008-01-22 Thread Ingo Molnar

* Matt Mackall <[EMAIL PROTECTED]> wrote:

> > amarokapp does wake up threads every 20 milliseconds - that could 
> > explain it. It's probably Xorg running on one core, amarokapp on the 
> > other core. That's already 100 reschedules/sec.
> 
> That suggests we want an "anti-load-balancing" heuristic when CPU 
> usage is very low. Migrating everything onto one core when we're close 
> to idle will save power and probably reduce latencies.

that would probably be the case if it's multiple sockets - but for 
multiple cores exactly the opposite is true: the sooner _both_ cores 
finish processing, the deeper power state the CPU can reach. So effective 
and immediate spreading of workloads amongst multiple cores - especially 
with shared L2 caches where the cost of migration is low - helps power 
consumption. (and it obviously helps latencies and bandwidth)

Ingo


Re: Rescheduling interrupts

2008-01-22 Thread Matt Mackall

On Tue, 2008-01-22 at 17:05 +0100, Ingo Molnar wrote:
> * S.Çağlar Onur <[EMAIL PROTECTED]> wrote:
> 
> > > My theory is that for whatever reason we get "repeat" IPIs: multiple 
> > > reschedule IPIs although the other CPU only initiated one.
> > 
> > Ok, please see http://cekirdek.pardus.org.tr/~caglar/dmesg.3rd :)
> 
> hm, the IPI sending and receiving is nicely paired up:
> 
> [  625.795008] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#1:
> [  625.795223] IPI (@native_smp_send_reschedule) from task amarokapp:2882 on CPU#1:
> 
> amarokapp does wake up threads every 20 milliseconds - that could 
> explain it. It's probably Xorg running on one core, amarokapp on the 
> other core. That's already 100 reschedules/sec.

That suggests we want an "anti-load-balancing" heuristic when CPU usage
is very low. Migrating everything onto one core when we're close to idle
will save power and probably reduce latencies.

-- 
Mathematics is the supreme nostalgia of our time.



Re: Rescheduling interrupts

2008-01-22 Thread S.Çağlar Onur
Hi;

On Tue, 22 Jan 2008, S.Çağlar Onur wrote: 
> > hm, the IPI sending and receiving is nicely paired up:
> > 
> > [  625.795008] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#1:
> > [  625.795223] IPI (@native_smp_send_reschedule) from task amarokapp:2882 on CPU#1:
> > 
> > amarokapp does wake up threads every 20 milliseconds - that could 
> > explain it. It's probably Xorg running on one core, amarokapp on the 
> > other core. That's already 100 reschedules/sec.
> 
> Heh, killing amarok ends up with following;
> 
>  PowerTOP version 1.9   (C) 2007 Intel Corporation
> 
> Cn                Avg residency     P-states (frequencies)
> C0 (cpu running)        ( 0,9%)
> C1                0,0ms ( 0,0%)
> C2                0,2ms ( 0,0%)
> C3                5,1ms (99,1%)
> 
> 
> Wakeups-from-idle per second : 197,8    interval: 10,0s
> no ACPI power usage estimate available
> 
> Top causes for wakeups:
>   34,7% (130,7)   USB device  3-2 : HP Integrated Module (Broadcom Corp)
>   26,5% (100,0): uhci_hcd:usb3
>5,8% ( 22,0)  java : futex_wait (hrtimer_wakeup)
>5,3% ( 20,0): iwl3945
>4,1% ( 15,4)   USB device  2-2 : Microsoft Wireless Optical Mouse .00 
> (Microsoft)
>2,9% ( 11,0): libata
>2,7% ( 10,1): extra timer interrupt
>2,7% ( 10,0)      java : schedule_timeout (process_timeout)
>2,7% ( 10,0)  : scan_async (ehci_watchdog)
>2,4% (  9,0)   : Rescheduling interrupts
>2,1% (  8,0): usb_hcd_poll_rh_status (rh_timer_func)
>1,7% (  6,4): uhci_hcd:usb2
>1,7% (  6,4) artsd : schedule_timeout (process_timeout)
>0,6% (  2,1): ohci1394, uhci_hcd:usb4, nvidia
>0,5% (  2,0)  : clocksource_check_watchdog 
> (clocksource_watchdog)
>0,5% (  1,7)wpa_supplicant : schedule_timeout (process_timeout)
>0,3% (  1,0)kicker : schedule_timeout (process_timeout)
>0,3% (  1,0)  kwin : schedule_timeout (process_timeout)
>0,3% (  1,0)  kdesktop : schedule_timeout (process_timeout)
>0,3% (  1,0)   klipper : schedule_timeout (process_timeout)
>0,3% (  1,0)  kwrapper : do_nanosleep (hrtimer_wakeup)
>0,3% (  1,0) X : nv_start_rc_timer (nv_kern_rc_timer)
 
By the way, logging out from KDE also suffers from the same problem; this time
kdm migrated to CPU1 and powertop reports ~300 wakeups for " :
Rescheduling interrupts" again.

[...]
[ 2058.246692] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#0:
[ 2058.246737] IPI (@native_smp_send_reschedule) from task kdm_greet:6122 on CPU#1:
[ 2058.246812] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#1:
[ 2058.278947] IPI (@native_smp_send_reschedule) from task X:2073 on CPU#0:
[ 2058.279070] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#0:
[ 2058.279175] IPI (@native_smp_send_reschedule) from task kdm_greet:6122 on CPU#1:
[ 2058.279251] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#1:
[ 2058.279301] IPI (@native_smp_send_reschedule) from task X:2073 on CPU#0:
[ 2058.279377] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#0:
[ 2058.279425] IPI (@native_smp_send_reschedule) from task kdm_greet:6122 on CPU#1:
[ 2058.279503] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#1:
[ 2058.279565] IPI (@native_smp_send_reschedule) from task X:2073 on CPU#0:
[ 2058.279637] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#0:
[ 2058.279683] IPI (@native_smp_send_reschedule) from task kdm_greet:6122 on CPU#1:
[ 2058.279758] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#1:
[ 2058.311903] IPI (@native_smp_send_reschedule) from task X:2073 on CPU#0:
[ 2058.312028] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#0:
[...]

Cheers
-- 
S.Çağlar Onur <[EMAIL PROTECTED]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!


Re: Rescheduling interrupts

2008-01-22 Thread S.Çağlar Onur
On Tue, 22 Jan 2008, Ingo Molnar wrote: 
> 
> * S.Çağlar Onur <[EMAIL PROTECTED]> wrote:
> 
> > > My theory is that for whatever reason we get "repeat" IPIs: multiple 
> > > reschedule IPIs although the other CPU only initiated one.
> > 
> > Ok, please see http://cekirdek.pardus.org.tr/~caglar/dmesg.3rd :)
> 
> hm, the IPI sending and receiving is nicely paired up:
> 
> [  625.795008] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#1:
> [  625.795223] IPI (@native_smp_send_reschedule) from task amarokapp:2882 on CPU#1:
> 
> amarokapp does wake up threads every 20 milliseconds - that could 
> explain it. It's probably Xorg running on one core, amarokapp on the 
> other core. That's already 100 reschedules/sec.

Heh, killing amarok ends up with following;

 PowerTOP version 1.9   (C) 2007 Intel Corporation

Cn                Avg residency     P-states (frequencies)
C0 (cpu running)        ( 0,9%)
C1                0,0ms ( 0,0%)
C2                0,2ms ( 0,0%)
C3                5,1ms (99,1%)


Wakeups-from-idle per second : 197,8    interval: 10,0s
no ACPI power usage estimate available

Top causes for wakeups:
  34,7% (130,7)   USB device  3-2 : HP Integrated Module (Broadcom Corp)
  26,5% (100,0): uhci_hcd:usb3
   5,8% ( 22,0)  java : futex_wait (hrtimer_wakeup)
   5,3% ( 20,0): iwl3945
   4,1% ( 15,4)   USB device  2-2 : Microsoft Wireless Optical Mouse .00 
(Microsoft)
   2,9% ( 11,0): libata
   2,7% ( 10,1): extra timer interrupt
   2,7% ( 10,0)  java : schedule_timeout (process_timeout)
   2,7% ( 10,0)  : scan_async (ehci_watchdog)
   2,4% (  9,0)   : Rescheduling interrupts
   2,1% (  8,0): usb_hcd_poll_rh_status (rh_timer_func)
   1,7% (  6,4): uhci_hcd:usb2
   1,7% (  6,4) artsd : schedule_timeout (process_timeout)
   0,6% (  2,1): ohci1394, uhci_hcd:usb4, nvidia
   0,5% (  2,0)  : clocksource_check_watchdog 
(clocksource_watchdog)
   0,5% (  1,7)wpa_supplicant : schedule_timeout (process_timeout)
   0,3% (  1,0)kicker : schedule_timeout (process_timeout)
   0,3% (  1,0)  kwin : schedule_timeout (process_timeout)
   0,3% (  1,0)  kdesktop : schedule_timeout (process_timeout)
   0,3% (  1,0)   klipper : schedule_timeout (process_timeout)
   0,3% (  1,0)  kwrapper : do_nanosleep (hrtimer_wakeup)
   0,3% (  1,0) X : nv_start_rc_timer (nv_kern_rc_timer)

-- 
S.Çağlar Onur <[EMAIL PROTECTED]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!


Re: Rescheduling interrupts

2008-01-22 Thread Ingo Molnar

* S.Çağlar Onur <[EMAIL PROTECTED]> wrote:

> > My theory is that for whatever reason we get "repeat" IPIs: multiple 
> > reschedule IPIs although the other CPU only initiated one.
> 
> Ok, please see http://cekirdek.pardus.org.tr/~caglar/dmesg.3rd :)

hm, the IPI sending and receiving is nicely paired up:

[  625.795008] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#1:
[  625.795223] IPI (@native_smp_send_reschedule) from task amarokapp:2882 on CPU#1:

amarokapp does wake up threads every 20 milliseconds - that could 
explain it. It's probably Xorg running on one core, amarokapp on the 
other core. That's already 100 reschedules/sec.

Ingo


Re: Rescheduling interrupts

2008-01-22 Thread S.Çağlar Onur
Hi;

On Tue, 22 Jan 2008, Ingo Molnar wrote: 
> 
> also, this might reduce the number of cross-CPU wakeups on near-idle 
> systems:
> 
>   echo 1 > /sys/devices/system/cpu/sched_mc_power_savings
> 
> [ or if it doesnt, it should ;) ]
> 
>   Ingo

Seems like nothing changed:


zangetsu ~ # cat /sys/devices/system/cpu/sched_mc_power_savings
1



Powertop still reports ~300 wakeups for " : Rescheduling interrupts"

 PowerTOP version 1.9   (C) 2007 Intel Corporation

Cn                Avg residency     P-states (frequencies)
C0 (cpu running)        ( 4,8%)
C1                0,0ms ( 0,0%)
C2                0,2ms ( 2,4%)
C3                2,4ms (92,8%)


Wakeups-from-idle per second : 495,2    interval: 3,0s
no ACPI power usage estimate available

Top causes for wakeups:
  40,0% (330,7)   : Rescheduling interrupts
  12,3% (102,0)   USB device  3-2 : HP Integrated Module (Broadcom Corp)
  12,1% (100,0): uhci_hcd:usb3
   8,0% ( 66,3): extra timer interrupt
   7,0% ( 58,0) amarokapp : schedule_timeout (process_timeout)
   4,0% ( 33,0): uhci_hcd:usb2



and this is what the system is doing while powertop reports the above:

USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
root 1  0.1  0.0   1512   532 ?Ss   17:41   0:00 init [3]  
root 2  0.0  0.0  0 0 ?S<   17:41   0:00 [kthreadd]
root 3  0.0  0.0  0 0 ?S<   17:41   0:00 [migration/0]
root 4  0.0  0.0  0 0 ?S<   17:41   0:00 [ksoftirqd/0]
root 5  0.0  0.0  0 0 ?S<   17:41   0:00 [migration/1]
root 6  0.0  0.0  0 0 ?S<   17:41   0:00 [ksoftirqd/1]
root 7  0.0  0.0  0 0 ?S<   17:41   0:00 [events/0]
root 8  0.0  0.0  0 0 ?S<   17:41   0:00 [events/1]
root 9  0.0  0.0  0 0 ?S<   17:41   0:00 [khelper]
root10  0.0  0.0  0 0 ?S<   17:41   0:00 [kblockd/0]
root11  0.0  0.0  0 0 ?S<   17:41   0:00 [kblockd/1]
root12  0.0  0.0  0 0 ?S<   17:41   0:00 [kacpid]
root13  0.0  0.0  0 0 ?S<   17:41   0:00 [kacpi_notify]
root14  0.0  0.0  0 0 ?S<   17:41   0:00 [cqueue/0]
root15  0.0  0.0  0 0 ?S<   17:41   0:00 [cqueue/1]
root16  0.0  0.0  0 0 ?S<   17:41   0:00 [kseriod]
root17  0.0  0.0  0 0 ?S17:41   0:00 [pdflush]
root18  0.0  0.0  0 0 ?S17:41   0:00 [pdflush]
root19  0.0  0.0  0 0 ?S<   17:41   0:00 [kswapd0]
root20  0.0  0.0  0 0 ?S<   17:41   0:00 [aio/0]
root21  0.0  0.0  0 0 ?S<   17:41   0:00 [aio/1]
root22  0.0  0.0  0 0 ?S<   17:41   0:00 [kpsmoused]
root42  0.0  0.0  0 0 ?S<   17:41   0:00 [khpsbpkt]
root46  0.0  0.0  0 0 ?S<   17:41   0:00 [knodemgrd_0]
root55  0.0  0.0  0 0 ?S<   17:41   0:00 [ata/0]
root56  0.0  0.0  0 0 ?S<   17:41   0:00 [ata/1]
root57  0.0  0.0  0 0 ?S<   17:41   0:00 [ata_aux]
root61  0.0  0.0  0 0 ?S<   17:41   0:00 [scsi_eh_0]
root62  0.0  0.0  0 0 ?S<   17:41   0:00 [scsi_eh_1]
root63  0.0  0.0  0 0 ?S<   17:41   0:00 [scsi_eh_2]
root64  0.0  0.0  0 0 ?S<   17:41   0:00 [scsi_eh_3]
root70  0.0  0.0  0 0 ?S<   17:41   0:00 [ksuspend_usbd]
root71  0.0  0.0  0 0 ?S<   17:41   0:00 [khubd]
root80  0.0  0.0  0 0 ?S<   17:41   0:00 [scsi_eh_4]
root81  0.0  0.0  0 0 ?S<   17:41   0:00 [scsi_eh_5]
root   159  0.0  0.0  0 0 ?S<   17:41   0:00 [kjournald]
root   194  0.0  0.0   2452  1304 ?S
[...]

-- 
S.Çağlar Onur <[EMAIL PROTECTED]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!


Re: Rescheduling interrupts

2008-01-22 Thread S.Çağlar Onur
Hi;

On Tue, 22 Jan 2008, Ingo Molnar wrote: 
> * S.Çağlar Onur <[EMAIL PROTECTED]> wrote:
> > I grabbed the logs twice to make sure I caught the needed info. The 1st [1]
> > was generated while "Rescheduling interrupts" was waking up ~200 times and
> > the 2nd while it was at ~350 wakeups.
> >
> > [1] http://cekirdek.pardus.org.tr/~caglar/dmesg.1st
> > [2] http://cekirdek.pardus.org.tr/~caglar/dmesg.2nd
>
> thanks, these seem to be mostly normal wakeups from standard tasks:
>
>  IPI from task kdm_greet:2118 on CPU#0:
>  IPI from task X:2079 on CPU#1:
>  IPI from task kdm_greet:2118 on CPU#0:
>  IPI from task hald-addon-inpu:2009 on CPU#1:
>  IPI from task events/0:7 on CPU#1:
>  IPI from task bash:2129 on CPU#0:
>  IPI from task kdm_greet:2118 on CPU#0:
>  IPI from task events/0:7 on CPU#1:
>  IPI from task events/0:7 on CPU#1:
>  IPI from task events/0:7 on CPU#1:
>  IPI from task bash:3902 on CPU#1:
>  IPI from task bash:3902 on CPU#1:
>  IPI from task amarokapp:3423 on CPU#1:
>  IPI from task amarokapp:3423 on CPU#1:
>  IPI from task amarokapp:3423 on CPU#1:
>  IPI from task X:2079 on CPU#0:
>  IPI from task yakuake:3422 on CPU#0:
>  IPI from task X:2079 on CPU#1:
>  IPI from task amarokapp:3423 on CPU#1:
>  IPI from task amarokapp:3423 on CPU#1:
>
> could you also add a similar IPI printouts (with the same panic_timeout
> logic) to arch/x86/kernel/smp_32.c's smp_reschedule_interrupt() function
> - while still keeping the other printouts too?
>
> Could you also enable PRINTK_TIME timestamps, so that we can see the
> timings? (And do a "dmesg -n 1" so that the printks happen fast and the
> timings are accurate.) I'd suggest to increase CONFIG_LOG_BUF_SHIFT to
> 20, so that your dmesg buffer is large enough. Plus try to capture 100
> events, ok?
>
> My theory is that for whatever reason we get "repeat" IPIs: multiple
> reschedule IPIs although the other CPU only initiated one.

Ok, please see http://cekirdek.pardus.org.tr/~caglar/dmesg.3rd :)

Cheers
-- 
S.Çağlar Onur <[EMAIL PROTECTED]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!


Re: Rescheduling interrupts

2008-01-22 Thread Ingo Molnar

also, this might reduce the number of cross-CPU wakeups on near-idle 
systems:

  echo 1 > /sys/devices/system/cpu/sched_mc_power_savings

[ or if it doesnt, it should ;) ]

Ingo


Re: Rescheduling interrupts

2008-01-22 Thread Ingo Molnar

* S.Çağlar Onur <[EMAIL PROTECTED]> wrote:

> I grabbed the logs twice to make sure I caught the needed info. The 1st [1]
> was generated while "Rescheduling interrupts" was waking up ~200 times and
> the 2nd while it was at ~350 wakeups.
> 
> [1] http://cekirdek.pardus.org.tr/~caglar/dmesg.1st 
> [2] http://cekirdek.pardus.org.tr/~caglar/dmesg.2nd

thanks, these seem to be mostly normal wakeups from standard tasks:

 IPI from task kdm_greet:2118 on CPU#0:
 IPI from task X:2079 on CPU#1:
 IPI from task kdm_greet:2118 on CPU#0:
 IPI from task hald-addon-inpu:2009 on CPU#1:
 IPI from task events/0:7 on CPU#1:
 IPI from task bash:2129 on CPU#0:
 IPI from task kdm_greet:2118 on CPU#0:
 IPI from task events/0:7 on CPU#1:
 IPI from task events/0:7 on CPU#1:
 IPI from task events/0:7 on CPU#1:
 IPI from task bash:3902 on CPU#1:
 IPI from task bash:3902 on CPU#1:
 IPI from task amarokapp:3423 on CPU#1:
 IPI from task amarokapp:3423 on CPU#1:
 IPI from task amarokapp:3423 on CPU#1:
 IPI from task X:2079 on CPU#0:
 IPI from task yakuake:3422 on CPU#0:
 IPI from task X:2079 on CPU#1:
 IPI from task amarokapp:3423 on CPU#1:
 IPI from task amarokapp:3423 on CPU#1:

could you also add a similar IPI printouts (with the same panic_timeout 
logic) to arch/x86/kernel/smp_32.c's smp_reschedule_interrupt() function 
- while still keeping the other printouts too?

Could you also enable PRINTK_TIME timestamps, so that we can see the 
timings? (And do a "dmesg -n 1" so that the printks happen fast and the 
timings are accurate.) I'd suggest to increase CONFIG_LOG_BUF_SHIFT to 
20, so that your dmesg buffer is large enough. Plus try to capture 100 
events, ok?

My theory is that for whatever reason we get "repeat" IPIs: multiple 
reschedule IPIs although the other CPU only initiated one.

Ingo


Re: Rescheduling interrupts

2008-01-22 Thread S.Çağlar Onur
Hi;

On Tue, 22 Jan 2008, Ingo Molnar wrote: 
> * S.Çağlar Onur <[EMAIL PROTECTED]> wrote:
> 
> > Top causes for wakeups:
> >   59,9% (238,4)   : Rescheduling interrupts
> > ^^
> >   14,7% ( 58,6) amarokapp : schedule_timeout (process_timeout)
> 
> hm, would be nice to figure out what causes these IPIs. Could you stick 
> something like this into arch/x86/kernel/smp_32.c's 
> smp_send_reschedule() function [this is the function that generates the 
> IPI]:
> 
> static void native_smp_send_reschedule(int cpu)
> {
>         WARN_ON(cpu_is_offline(cpu));
>         send_IPI_mask(cpumask_of_cpu(cpu), RESCHEDULE_VECTOR);
>         if (panic_timeout > 0) {
>                 panic_timeout--;
>                 printk("IPI from task %s:%d on CPU#%d:\n",
>                        current->comm, current->pid, cpu);
>                 dump_stack();
>         }
> }
> 
> NOTE: if you run an SMP kernel then first remove these two lines from 
> kernel/printk.c:
> 
>         if (!oops_in_progress && waitqueue_active(&log_wait))
>                 wake_up_interruptible(&log_wait);
> 
> otherwise you'll get lockups. (the IPI is sent while holding the 
> runqueue lock, so the printks will lock up)
> 
> then wait for the bad condition to occur on your system and generate a 
> stream of ~10 backtraces, via:
> 
>   echo 10 > /proc/sys/kernel/panic
> 
> you should be getting 10 immediate backtraces - please send them to us. 
> The backtraces should show the place that generates the wakeups. [turn 
> on CONFIG_FRAME_POINTERS=y to get high quality backtraces.]
>  
> If you do _not_ get 10 immediate backtraces, then something in the 
> system is generating such IPIs outside of the scheduler's control. That 
> would suggest some other sort of borkage.
> 
>   Ingo

I grabbed the logs twice to make sure I caught the needed info. The 1st [1]
was generated while "Rescheduling interrupts" was waking up ~200 times and
the 2nd while it was at ~350 wakeups.

[1] http://cekirdek.pardus.org.tr/~caglar/dmesg.1st 
[2] http://cekirdek.pardus.org.tr/~caglar/dmesg.2nd

Cheers
-- 
S.Çağlar Onur <[EMAIL PROTECTED]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!


Re: Rescheduling interrupts

2008-01-22 Thread Ingo Molnar

* S.Çağlar Onur <[EMAIL PROTECTED]> wrote:

> Top causes for wakeups:
>   59,9% (238,4)   : Rescheduling interrupts
> ^^
>   14,7% ( 58,6) amarokapp : schedule_timeout (process_timeout)

hm, would be nice to figure out what causes these IPIs. Could you stick 
something like this into arch/x86/kernel/smp_32.c's 
smp_send_reschedule() function [this is the function that generates the 
IPI]:

static void native_smp_send_reschedule(int cpu)
{
        WARN_ON(cpu_is_offline(cpu));
        send_IPI_mask(cpumask_of_cpu(cpu), RESCHEDULE_VECTOR);
        if (panic_timeout > 0) {
                panic_timeout--;
                printk("IPI from task %s:%d on CPU#%d:\n",
                       current->comm, current->pid, cpu);
                dump_stack();
        }
}

NOTE: if you run an SMP kernel then first remove these two lines from 
kernel/printk.c:

        if (!oops_in_progress && waitqueue_active(&log_wait))
                wake_up_interruptible(&log_wait);

otherwise you'll get lockups. (the IPI is sent while holding the 
runqueue lock, so the printks will lock up)

then wait for the bad condition to occur on your system and generate a 
stream of ~10 backtraces, via:

echo 10 > /proc/sys/kernel/panic

you should be getting 10 immediate backtraces - please send them to us. 
The backtraces should show the place that generates the wakeups. [turn 
on CONFIG_FRAME_POINTERS=y to get high quality backtraces.]
 
If you do _not_ get 10 immediate backtraces, then something in the 
system is generating such IPIs outside of the scheduler's control. That 
would suggest some other sort of borkage.

Ingo


Rescheduling interrupts

2008-01-21 Thread S.Çağlar Onur
Hi;

With Linus's latest git, powertop reports the following while the system is
nearly 100% idle:

PowerTOP version 1.9   (C) 2007 Intel Corporation

Cn                Avg residency     P-states (frequencies)
C0 (cpu running)        ( 6,3%)     1,84 Ghz    0,4%
C1                0,0ms ( 0,0%)     1333 Mhz    0,0%
C2                0,1ms ( 0,5%)     1000 Mhz   99,6%
C3                3,7ms (93,2%)


Wakeups-from-idle per second : 306,8    interval: 10,0s
Power usage (5 minute ACPI estimate) :  23,1 W (0,5 hours left)

Top causes for wakeups:
  59,9% (238,4)   : Rescheduling interrupts
^^
  14,7% ( 58,6) amarokapp : schedule_timeout (process_timeout)
   5,5% ( 21,9)  java : futex_wait (hrtimer_wakeup)
   5,0% ( 19,8): iwl3945
   2,5% ( 10,0)  java : schedule_timeout (process_timeout)
   2,5% ( 10,0)  : ehci_work (ehci_watchdog)
   2,5% ( 10,0): extra timer interrupt
   1,6% (  6,4) artsd : schedule_timeout (process_timeout)
   1,0% (  4,0): usb_hcd_poll_rh_status (rh_timer_func)
   0,5% (  2,0): ohci1394, uhci_hcd:usb4, nvidia
   0,5% (  2,0)  : clocksource_check_watchdog 
(clocksource_watchdog)
   0,5% (  2,0)  kwin : schedule_timeout (process_timeout)
   0,5% (  1,9)wpa_supplicant : schedule_timeout (process_timeout)
   0,3% (  1,2)  kdesktop : schedule_timeout (process_timeout)
   0,3% (  1,0)  kwrapper : do_nanosleep (hrtimer_wakeup)
   0,3% (  1,0)   klipper : schedule_timeout (process_timeout)
   0,3% (  1,0) artsd : do_setitimer (it_real_fn)
   0,3% (  1,0) gpg-agent : schedule_timeout (process_timeout)
   0,3% (  1,0) X : nv_start_rc_timer (nv_kern_rc_timer)
   0,3% (  1,0)kicker : schedule_timeout (process_timeout)
   0,1% (  0,5)   iwl3945 : ieee80211_authenticate 
(ieee80211_sta_timer)
   0,1% (  0,5)  : neigh_table_init_no_netlink 
(neigh_periodic_timer)

This " : Rescheduling interrupts" causes at least 200 wakeups 
(sometimes i see ~400 wakeups) for me and a quick google search yields [1], 
but i didn't see this reported to LKML, so here it is :).

If anything else is needed please yell...

[1] http://www.mail-archive.com/[EMAIL PROTECTED]/msg01009.html
-- 
S.Çağlar Onur <[EMAIL PROTECTED]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!

