Re: Too many rescheduling interrupts (still!)
On Wed, Feb 12, 2014 at 10:19:42AM -0800, Andy Lutomirski wrote: > > static void ttwu_queue_remote(struct task_struct *p, int cpu) > > { > > - if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) > > - smp_send_reschedule(cpu); > > + struct rq *rq = cpu_rq(cpu); > > + > > + if (llist_add(&p->wake_entry, &rq->wake_list)) { > > + set_tsk_need_resched(rq->idle); > > + smp_mb__after_clear_bit(); > > + if (!tsk_is_polling(rq->idle) || rq->curr != rq->idle) > > + smp_send_reschedule(cpu); > > + } > > At the very least this needs a comment pointing out that rq->lock is > intentionally not taken. This makes my brain hurt a little :) Oh absolutely; I wanted to write one, but couldn't get a straight story so gave up for now. > > + /* > > +* We must clear polling before running > > sched_ttwu_pending(). > > +* Otherwise it becomes possible to have entries added in > > +* ttwu_queue_remote() and still not get an IPI to process > > +* them. > > +*/ > > + __current_clr_polling(); > > + > > + set_preempt_need_resched(); > > + sched_ttwu_pending(); > > + > > tick_nohz_idle_exit(); > > schedule_preempt_disabled(); > > + __current_set_polling(); > > I wonder if this side has enough barriers to make this work. sched_ttwu_pending() does xchg() as first op and thereby orders itself against the clr_polling. I'll need a fresh brain for your proposal.. will read it again in the morning. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
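For reference, the ordering argument being made here is the classic store-buffer pattern: the idle CPU clears its polling flag and then does an xchg() on the wake list (inside llist_del_all()), while the waker adds to the wake list and then tests the polling flag. A rough sketch of the two sides, using the names from the patch above -- this is an illustration of the reasoning, not the actual kernel code:

	/*
	 *   idle CPU                             waking CPU
	 *   --------                             ----------
	 *   __current_clr_polling();             llist_add(&p->wake_entry, &rq->wake_list);
	 *   llist_del_all(&rq->wake_list);       smp_mb__after_clear_bit();
	 *     (xchg(), i.e. a full barrier)      if (!tsk_is_polling(rq->idle))
	 *                                                smp_send_reschedule(cpu);
	 *
	 * Either the idle CPU's xchg() observes the freshly added entry and
	 * processes it locally, or the waker observes the cleared polling bit
	 * and sends the IPI -- the entry cannot be both queued and missed.
	 */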
Re: Too many rescheduling interrupts (still!)
On Wed, Feb 12, 2014 at 8:39 AM, Peter Zijlstra wrote: > On Wed, Feb 12, 2014 at 07:49:07AM -0800, Andy Lutomirski wrote: >> On Wed, Feb 12, 2014 at 2:13 AM, Peter Zijlstra wrote: >> Exactly. AFAICT the only reason that any of this code holds rq->lock >> (especially ttwu_queue_remote, which I seem to call a few thousand >> times per second) is because the only way to make a cpu reschedule >> involves playing with per-task flags. If the flags were per-rq or >> per-cpu instead, then rq->lock wouldn't be needed. If this were all >> done locklessly, then I think either a full cmpxchg or some fairly >> careful use of full barriers would be needed, but I bet that cmpxchg >> is still considerably faster than a spinlock plus a set_bit. > > Ahh, that's what you're saying. Yes we should be able to do something > clever there. > > Something like the below is I think as close as we can come without > major surgery and moving TIF_NEED_RESCHED and POLLING into a per-cpu > variable. > > I might have messed it up though; brain seems to have given out for the > day :/ > > --- > kernel/sched/core.c | 17 + > kernel/sched/idle.c | 21 + > kernel/sched/sched.h | 5 - > 3 files changed, 30 insertions(+), 13 deletions(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index fb9764fbc537..a5b64040c21d 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -529,7 +529,7 @@ void resched_task(struct task_struct *p) > } > > /* NEED_RESCHED must be visible before we test polling */ > - smp_mb(); > + smp_mb__after_clear_bit(); > if (!tsk_is_polling(p)) > smp_send_reschedule(cpu); > } > @@ -1476,12 +1476,15 @@ static int ttwu_remote(struct task_struct *p, int > wake_flags) > } > > #ifdef CONFIG_SMP > -static void sched_ttwu_pending(void) > +void sched_ttwu_pending(void) > { > struct rq *rq = this_rq(); > struct llist_node *llist = llist_del_all(&rq->wake_list); > struct task_struct *p; > > + if (!llist) > + return; > + > raw_spin_lock(&rq->lock); > > while (llist) { > @@ -1536,8 +1539,14 @@ void scheduler_ipi(void) > > static void ttwu_queue_remote(struct task_struct *p, int cpu) > { > - if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) > - smp_send_reschedule(cpu); > + struct rq *rq = cpu_rq(cpu); > + > + if (llist_add(&p->wake_entry, &rq->wake_list)) { > + set_tsk_need_resched(rq->idle); > + smp_mb__after_clear_bit(); > + if (!tsk_is_polling(rq->idle) || rq->curr != rq->idle) > + smp_send_reschedule(cpu); > + } At the very least this needs a comment pointing out that rq->lock is intentionally not taken. This makes my brain hurt a little :) > } > > bool cpus_share_cache(int this_cpu, int that_cpu) > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c > index 14ca43430aee..bd8ed2d2f2f7 100644 > --- a/kernel/sched/idle.c > +++ b/kernel/sched/idle.c > @@ -105,19 +105,24 @@ static void cpu_idle_loop(void) > } else { > local_irq_enable(); > } > - __current_set_polling(); > } > arch_cpu_idle_exit(); > - /* > -* We need to test and propagate the TIF_NEED_RESCHED > -* bit here because we might not have send the > -* reschedule IPI to idle tasks. > -*/ > - if (tif_need_resched()) > - set_preempt_need_resched(); > } > + > + /* > +* We must clear polling before running sched_ttwu_pending(). > +* Otherwise it becomes possible to have entries added in > +* ttwu_queue_remote() and still not get an IPI to process > +* them. 
> +*/ > + __current_clr_polling(); > + > + set_preempt_need_resched(); > + sched_ttwu_pending(); > + > tick_nohz_idle_exit(); > schedule_preempt_disabled(); > + __current_set_polling(); I wonder if this side has enough barriers to make this work. I'll see if I have a few free minutes (yeah right!) to try out the major surgery approach. I think I can do it without even cmpxchg. Basically, there would be a percpu variable idlepoll_state with three values: IDLEPOLL_NOT_POLLING, IDLEPOLL_WOKEN, and IDLEPOLL_POLLING. The polling idle code does: idlepoll_state = IDLEPOLL_POLLING; smp_mb(); check for ttwu and need_resched; mwait, poll, or whatever until idlepoll_state != IDLEPOLL_POLLING; idlepoll_state = IDLEPOLL_NOT_POLLING; smp_mb(); check for ttwu and need_resched; The idle non-poll
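To make the three-state proposal a little more concrete, one possible shape of it is sketched below. Only the constant names (IDLEPOLL_NOT_POLLING, IDLEPOLL_WOKEN, IDLEPOLL_POLLING) come from the message above; the per-cpu variable, the helpers and their placement are guesses for illustration, not a real patch:

	/* Purely illustrative sketch of the proposed per-cpu idle-poll state. */
	enum { IDLEPOLL_NOT_POLLING, IDLEPOLL_WOKEN, IDLEPOLL_POLLING };

	static DEFINE_PER_CPU(atomic_t, idlepoll_state);

	/* Waker side: request a reschedule of @cpu without taking its rq->lock. */
	static void idlepoll_kick_cpu(int cpu)
	{
		atomic_t *st = &per_cpu(idlepoll_state, cpu);
		int old = atomic_read(st);

		for (;;) {
			int prev;

			if (old == IDLEPOLL_WOKEN)
				return;				/* already flagged by someone */
			if (old == IDLEPOLL_NOT_POLLING) {
				smp_send_reschedule(cpu);	/* deep idle: IPI required */
				return;
			}
			/* IDLEPOLL_POLLING: the poll/mwait loop will notice the change */
			prev = atomic_cmpxchg(st, IDLEPOLL_POLLING, IDLEPOLL_WOKEN);
			if (prev == IDLEPOLL_POLLING)
				return;
			old = prev;				/* raced; look again */
		}
	}

	/* Polling-idle side, following the pseudocode in the message: */
	static void idlepoll_poll_idle(void)
	{
		atomic_t *st = this_cpu_ptr(&idlepoll_state);

		atomic_set(st, IDLEPOLL_POLLING);
		smp_mb();
		/* check for queued ttwu work and need_resched here */
		while (atomic_read(st) == IDLEPOLL_POLLING && !need_resched())
			cpu_relax();		/* or mwait/monitor on the state word */
		atomic_set(st, IDLEPOLL_NOT_POLLING);
		smp_mb();
		/* re-check ttwu work and need_resched before really sleeping */
	}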
Re: Too many rescheduling interrupts (still!)
On Wed, Feb 12, 2014 at 06:46:39PM +0100, Frederic Weisbecker wrote: > Ok but if the target is idle, dynticks and not polling, we don't have the choice > but to send an IPI, right? I'm talking about this kind of case. Yes; but Andy doesn't seem concerned with such hardware (!x86). Anything x86 (except ancient stuff) is effectively polling and wakes up from the TIF_NEED_RESCHED write.
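The reason the plain TIF_NEED_RESCHED store is enough on such hardware is MONITOR/MWAIT: the idle routine arms the monitor on the cacheline holding the idle task's thread flags, so a remote set_tsk_need_resched() write wakes it without any interrupt. Roughly, as a simplified sketch of the mwait-style idle path rather than the exact kernel code:

	static void mwait_idle_sketch(void)
	{
		/* arm the monitor on the cacheline containing TIF_NEED_RESCHED */
		__monitor(&current_thread_info()->flags, 0, 0);
		smp_mb();
		if (!need_resched())
			__mwait(0, 0);	/* any store to that line wakes us up */
		/* a remote set_tsk_need_resched() is exactly such a store */
	}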
Re: Too many rescheduling interrupts (still!)
On Wed, Feb 12, 2014 at 05:43:56PM +0100, Peter Zijlstra wrote: > On Wed, Feb 12, 2014 at 04:59:52PM +0100, Frederic Weisbecker wrote: > > 2014-02-12 11:13 GMT+01:00 Peter Zijlstra : > > > On Tue, Feb 11, 2014 at 02:34:11PM -0800, Andy Lutomirski wrote: > > >> On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner > > >> wrote: > > >> >> A small number of reschedule interrupts appear to be due to a race: > > >> >> both resched_task and wake_up_idle_cpu do, essentially: > > >> >> > > >> >> set_tsk_need_resched(t); > > >> >> smb_mb(); > > >> >> if (!tsk_is_polling(t)) > > >> >> smp_send_reschedule(cpu); > > >> >> > > >> >> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU > > >> >> is too quick (which isn't surprising if it was in C0 or C1), then it > > >> >> could *clear* TS_POLLING before tsk_is_polling is read. > > > > > > Yeah we have the wrong default for the idle loops.. it should default to > > > polling and only switch to !polling at the very last moment if it really > > > needs an interrupt to wake. > > > > > > Changing this requires someone (probably me again :/) to audit all arch > > > cpu idle drivers/functions. > > > > Looking at wake_up_idle_cpu(), we set need_resched and send the IPI. > > On the other end, the CPU wakes up, exits the idle loop and even goes > > to the scheduler while there is probably no task to schedule. > > > > I wonder if this is all necessary. All we need is the timer to be > > handled by the dynticks code to re-evaluate the next tick. So calling > > irq_exit() -> tick_nohz_irq_exit() from the scheduler_ipi() should be > > enough. > > No no, the idea was to NOT send IPIs. So falling out of idle by writing > TIF_NEED_RESCHED and having the idle loop fixup the timers on its way > back to idle is what you want. Ok but if the target is idle, dynticks and not polling, we don't have the choice but to send an IPI, right? I'm talking about this kind of case. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Too many rescheduling interrupts (still!)
On Wed, Feb 12, 2014 at 04:59:52PM +0100, Frederic Weisbecker wrote: > 2014-02-12 11:13 GMT+01:00 Peter Zijlstra : > > On Tue, Feb 11, 2014 at 02:34:11PM -0800, Andy Lutomirski wrote: > >> On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner > >> wrote: > >> >> A small number of reschedule interrupts appear to be due to a race: > >> >> both resched_task and wake_up_idle_cpu do, essentially: > >> >> > >> >> set_tsk_need_resched(t); > >> >> smb_mb(); > >> >> if (!tsk_is_polling(t)) > >> >> smp_send_reschedule(cpu); > >> >> > >> >> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU > >> >> is too quick (which isn't surprising if it was in C0 or C1), then it > >> >> could *clear* TS_POLLING before tsk_is_polling is read. > > > > Yeah we have the wrong default for the idle loops.. it should default to > > polling and only switch to !polling at the very last moment if it really > > needs an interrupt to wake. > > > > Changing this requires someone (probably me again :/) to audit all arch > > cpu idle drivers/functions. > > Looking at wake_up_idle_cpu(), we set need_resched and send the IPI. > On the other end, the CPU wakes up, exits the idle loop and even goes > to the scheduler while there is probably no task to schedule. > > I wonder if this is all necessary. All we need is the timer to be > handled by the dynticks code to re-evaluate the next tick. So calling > irq_exit() -> tick_nohz_irq_exit() from the scheduler_ipi() should be > enough. No no, the idea was to NOT send IPIs. So falling out of idle by writing TIF_NEED_RESCHED and having the idle loop fixup the timers on its way back to idle is what you want. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Too many rescheduling interrupts (still!)
On Wed, Feb 12, 2014 at 07:49:07AM -0800, Andy Lutomirski wrote: > On Wed, Feb 12, 2014 at 2:13 AM, Peter Zijlstra wrote: > > On Tue, Feb 11, 2014 at 02:34:11PM -0800, Andy Lutomirski wrote: > >> On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner > >> wrote: > >> >> A small number of reschedule interrupts appear to be due to a race: > >> >> both resched_task and wake_up_idle_cpu do, essentially: > >> >> > >> >> set_tsk_need_resched(t); > >> >> smb_mb(); > >> >> if (!tsk_is_polling(t)) > >> >> smp_send_reschedule(cpu); > >> >> > >> >> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU > >> >> is too quick (which isn't surprising if it was in C0 or C1), then it > >> >> could *clear* TS_POLLING before tsk_is_polling is read. > > > > Yeah we have the wrong default for the idle loops.. it should default to > > polling and only switch to !polling at the very last moment if it really > > needs an interrupt to wake. > > I might be missing something, but won't that break the scheduler? for the idle task.. all other tasks will have it !polling. But note how the current generic idle loop does: if (!current_clr_polling_and_test()) { ... if (cpuidle_idle_call()) arch_cpu_idle(); ... } This means that it still runs a metric ton of code, right up to the mwait with !polling, and then at the mwait we switch it back to polling. Completely daft. > Since rq->lock is held, the resched calls could check the rq state > (curr == idle, maybe) to distinguish these cases. Not enough; but I'm afraid I confused you with the above. My suggestion was really more that we should call into the cpuidle/arch idle code with polling set, and only right before we hit hlt/wfi/etc.. should we clear the polling bit. > > It can't we're holding its rq->lock. > > Exactly. AFAICT the only reason that any of this code holds rq->lock > (especially ttwu_queue_remote, which I seem to call a few thousand > times per second) is because the only way to make a cpu reschedule > involves playing with per-task flags. If the flags were per-rq or > per-cpu instead, then rq->lock wouldn't be needed. If this were all > done locklessly, then I think either a full cmpxchg or some fairly > careful use of full barriers would be needed, but I bet that cmpxchg > is still considerably faster than a spinlock plus a set_bit. Ahh, that's what you're saying. Yes we should be able to do something clever there. Something like the below is I think as close as we can come without major surgery and moving TIF_NEED_RESCHED and POLLING into a per-cpu variable. 
I might have messed it up though; brain seems to have given out for the day :/ --- kernel/sched/core.c | 17 + kernel/sched/idle.c | 21 + kernel/sched/sched.h | 5 - 3 files changed, 30 insertions(+), 13 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index fb9764fbc537..a5b64040c21d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -529,7 +529,7 @@ void resched_task(struct task_struct *p) } /* NEED_RESCHED must be visible before we test polling */ - smp_mb(); + smp_mb__after_clear_bit(); if (!tsk_is_polling(p)) smp_send_reschedule(cpu); } @@ -1476,12 +1476,15 @@ static int ttwu_remote(struct task_struct *p, int wake_flags) } #ifdef CONFIG_SMP -static void sched_ttwu_pending(void) +void sched_ttwu_pending(void) { struct rq *rq = this_rq(); struct llist_node *llist = llist_del_all(&rq->wake_list); struct task_struct *p; + if (!llist) + return; + raw_spin_lock(&rq->lock); while (llist) { @@ -1536,8 +1539,14 @@ void scheduler_ipi(void) static void ttwu_queue_remote(struct task_struct *p, int cpu) { - if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) - smp_send_reschedule(cpu); + struct rq *rq = cpu_rq(cpu); + + if (llist_add(&p->wake_entry, &rq->wake_list)) { + set_tsk_need_resched(rq->idle); + smp_mb__after_clear_bit(); + if (!tsk_is_polling(rq->idle) || rq->curr != rq->idle) + smp_send_reschedule(cpu); + } } bool cpus_share_cache(int this_cpu, int that_cpu) diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index 14ca43430aee..bd8ed2d2f2f7 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -105,19 +105,24 @@ static void cpu_idle_loop(void) } else { local_irq_enable(); } - __current_set_polling(); } arch_cpu_idle_exit(); - /* -* We need to test and propagate the TIF_NEED_RESCHED -* bit here because we might not have send the -* reschedule IPI to i
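For contrast with the interim patch above, the "major surgery" direction described earlier in this message -- keep the idle task marked polling all the way through the cpuidle/arch idle code, and clear the bit only in an idle method that genuinely needs an interrupt, at the very last moment -- would look roughly like the sketch below. Helper names follow the then-current tree; everything else is illustrative:

	static void cpu_idle_loop_sketch(void)
	{
		__current_set_polling();		/* idle task: polling by default */

		while (1) {
			while (!need_resched())
				cpuidle_idle_call();	/* drivers run with polling still set */

			/* TIF_NEED_RESCHED was set remotely; no IPI was needed */
			set_preempt_need_resched();
			schedule_preempt_disabled();
		}
	}

	/* Only an idle routine that really sleeps on interrupts drops polling,
	 * and only immediately before the hlt/wfi: */
	static void arch_cpu_idle_halt_sketch(void)
	{
		if (!current_clr_polling_and_test())
			safe_halt();			/* hlt with IRQs enabled */
		else
			local_irq_enable();
		__current_set_polling();
	}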
Re: Too many rescheduling interrupts (still!)
2014-02-12 11:13 GMT+01:00 Peter Zijlstra : > On Tue, Feb 11, 2014 at 02:34:11PM -0800, Andy Lutomirski wrote: >> On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner wrote: >> >> A small number of reschedule interrupts appear to be due to a race: >> >> both resched_task and wake_up_idle_cpu do, essentially: >> >> >> >> set_tsk_need_resched(t); >> >> smb_mb(); >> >> if (!tsk_is_polling(t)) >> >> smp_send_reschedule(cpu); >> >> >> >> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU >> >> is too quick (which isn't surprising if it was in C0 or C1), then it >> >> could *clear* TS_POLLING before tsk_is_polling is read. > > Yeah we have the wrong default for the idle loops.. it should default to > polling and only switch to !polling at the very last moment if it really > needs an interrupt to wake. > > Changing this requires someone (probably me again :/) to audit all arch > cpu idle drivers/functions. Looking at wake_up_idle_cpu(), we set need_resched and send the IPI. On the other end, the CPU wakes up, exits the idle loop and even goes to the scheduler while there is probably no task to schedule. I wonder if this is all necessary. All we need is the timer to be handled by the dynticks code to re-evaluate the next tick. So calling irq_exit() -> tick_nohz_irq_exit() from the scheduler_ipi() should be enough. We could use a specific flag set before smp_send_reschedule() and read in scheduler_ipi() entry to check if we need irq_entry()/irq_exit(). -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
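As an illustration of that last idea, the wakeup path could tag the target before sending the IPI, and scheduler_ipi() would only do the irq_entry()/irq_exit() dance (and hence the nohz re-evaluation) when the tag is set. Everything below is hypothetical -- the flag name and helpers are invented for the example, not an existing kernel interface:

	static DEFINE_PER_CPU(bool, ipi_wants_tick_reeval);	/* invented name */

	static void wake_up_idle_cpu_sketch(int cpu)
	{
		per_cpu(ipi_wants_tick_reeval, cpu) = true;	/* tell the target why it is poked */
		smp_wmb();
		smp_send_reschedule(cpu);
	}

	void scheduler_ipi_sketch(void)
	{
		if (this_cpu_read(ipi_wants_tick_reeval)) {
			this_cpu_write(ipi_wants_tick_reeval, false);
			irq_enter();
			irq_exit();	/* irq_exit() -> tick_nohz_irq_exit() re-evaluates the next tick */
		}
		sched_ttwu_pending();
	}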
Re: Too many rescheduling interrupts (still!)
On Wed, Feb 12, 2014 at 2:13 AM, Peter Zijlstra wrote: > On Tue, Feb 11, 2014 at 02:34:11PM -0800, Andy Lutomirski wrote: >> On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner wrote: >> >> A small number of reschedule interrupts appear to be due to a race: >> >> both resched_task and wake_up_idle_cpu do, essentially: >> >> >> >> set_tsk_need_resched(t); >> >> smb_mb(); >> >> if (!tsk_is_polling(t)) >> >> smp_send_reschedule(cpu); >> >> >> >> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU >> >> is too quick (which isn't surprising if it was in C0 or C1), then it >> >> could *clear* TS_POLLING before tsk_is_polling is read. > > Yeah we have the wrong default for the idle loops.. it should default to > polling and only switch to !polling at the very last moment if it really > needs an interrupt to wake. I might be missing something, but won't that break the scheduler? If tsk_is_polling always returns true on mwait-capable systems, then other cpus won't be able to use the polling bit to distinguish between the idle state (where setting need_resched is enough) and the non-idle state (where the IPI is needed to preempt whatever task is running). Since rq->lock is held, the resched calls could check the rq state (curr == idle, maybe) to distinguish these cases. > >> There would be an extra benefit of moving the resched-related bits to >> some per-cpu structure: it would allow lockless wakeups. >> ttwu_queue_remote, and probably all of the other reschedule-a-cpu >> functions, could do something like: >> >> if (...) { >> old = atomic_read(resched_flags(cpu)); >> while(true) { >> if (old & RESCHED_NEED_RESCHED) >> return; >> if (!(old & RESCHED_POLLING)) { >> smp_send_reschedule(cpu); >> return; >> } >> new = old | RESCHED_NEED_RESCHED; >> old = atomic_cmpxchg(resched_flags(cpu), old, new); >> } >> } > > That looks hideously expensive.. for no apparent reason. > > Sending that IPI isn't _that_ bad, esp if we get the false-positive > window smaller than it is now (its far too wide because of the wrong > default state). > >> The point being that, with the current location of the flags, either >> an interrupt needs to be sent or something needs to be done to prevent >> rq->curr from disappearing. (It probably doesn't matter if the >> current task changes, because TS_POLLING will be clear, but what if >> the task goes away entirely?) > > It can't we're holding its rq->lock. Exactly. AFAICT the only reason that any of this code holds rq->lock (especially ttwu_queue_remote, which I seem to call a few thousand times per second) is because the only way to make a cpu reschedule involves playing with per-task flags. If the flags were per-rq or per-cpu instead, then rq->lock wouldn't be needed. If this were all done locklessly, then I think either a full cmpxchg or some fairly careful use of full barriers would be needed, but I bet that cmpxchg is still considerably faster than a spinlock plus a set_bit. > >> All that being said, it looks like ttwu_queue_remote doesn't actually >> work if the IPI isn't sent. The attached patch appears to work (and >> reduces total rescheduling IPIs by a large amount for my workload), >> but I don't really think it's worthy of being applied... > > We can do something similar though; we can move sched_ttwu_pending() > into the generic idle loop, right next to set_preempt_need_resched(). Oh, right -- either the IPI or the idle code is guaranteed to happen soon. (But wouldn't setting TS_POLLING always break this, too?) 
--Andy
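Spelled out as code, the lockless wakeup being argued for might look like the sketch below. The RESCHED_* bits come from the pseudocode quoted above, and the per-cpu word stands in for its resched_flags(cpu) accessor; the one deliberate change is that the need-resched bit is always recorded and only the IPI is skipped for polling targets. This is an illustration, not a proposed patch:

	#define RESCHED_NEED_RESCHED	0x01
	#define RESCHED_POLLING		0x02

	static DEFINE_PER_CPU(atomic_t, resched_flags);

	static void resched_cpu_lockless(int cpu)
	{
		atomic_t *flags = &per_cpu(resched_flags, cpu);
		int old = atomic_read(flags);

		for (;;) {
			int prev;

			if (old & RESCHED_NEED_RESCHED)
				return;				/* request already pending */
			prev = atomic_cmpxchg(flags, old,
					      old | RESCHED_NEED_RESCHED);
			if (prev != old) {
				old = prev;			/* raced; re-evaluate */
				continue;
			}
			if (!(old & RESCHED_POLLING))
				smp_send_reschedule(cpu);	/* target is not watching the flag word */
			return;
		}
	}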
Re: Too many rescheduling interrupts (still!)
On Tue, Feb 11, 2014 at 02:34:11PM -0800, Andy Lutomirski wrote: > On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner wrote: > >> A small number of reschedule interrupts appear to be due to a race: > >> both resched_task and wake_up_idle_cpu do, essentially: > >> > >> set_tsk_need_resched(t); > >> smb_mb(); > >> if (!tsk_is_polling(t)) > >> smp_send_reschedule(cpu); > >> > >> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU > >> is too quick (which isn't surprising if it was in C0 or C1), then it > >> could *clear* TS_POLLING before tsk_is_polling is read. Yeah we have the wrong default for the idle loops.. it should default to polling and only switch to !polling at the very last moment if it really needs an interrupt to wake. Changing this requires someone (probably me again :/) to audit all arch cpu idle drivers/functions. > >> Is there a good reason that TIF_NEED_RESCHED is in thread->flags and > >> TS_POLLING is in thread->status? Couldn't both of these be in the > >> same field in something like struct rq? That would allow a real > >> atomic op here. I don't see the value of an atomic op there; but many archs already have this, grep for TIF_POLLING. > >> The more serious issue is that AFAICS default_wake_function is > >> completely missing the polling check. It goes through > >> ttwu_queue_remote, which unconditionally sends an interrupt. Yah, because it does more than just wake the CPU; at the time we didn't have a generic idle path, we could cure things now though. > There would be an extra benefit of moving the resched-related bits to > some per-cpu structure: it would allow lockless wakeups. > ttwu_queue_remote, and probably all of the other reschedule-a-cpu > functions, could do something like: > > if (...) { > old = atomic_read(resched_flags(cpu)); > while(true) { > if (old & RESCHED_NEED_RESCHED) > return; > if (!(old & RESCHED_POLLING)) { > smp_send_reschedule(cpu); > return; > } > new = old | RESCHED_NEED_RESCHED; > old = atomic_cmpxchg(resched_flags(cpu), old, new); > } > } That looks hideously expensive.. for no apparent reason. Sending that IPI isn't _that_ bad, esp if we get the false-positive window smaller than it is now (its far too wide because of the wrong default state). > The point being that, with the current location of the flags, either > an interrupt needs to be sent or something needs to be done to prevent > rq->curr from disappearing. (It probably doesn't matter if the > current task changes, because TS_POLLING will be clear, but what if > the task goes away entirely?) It can't we're holding its rq->lock. > All that being said, it looks like ttwu_queue_remote doesn't actually > work if the IPI isn't sent. The attached patch appears to work (and > reduces total rescheduling IPIs by a large amount for my workload), > but I don't really think it's worthy of being applied... We can do something similar though; we can move sched_ttwu_pending() into the generic idle loop, right next to set_preempt_need_resched(). -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Too many rescheduling interrupts (still!)
On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner wrote: > On Tue, 11 Feb 2014, Andy Lutomirski wrote: > > Just adding Peter for now, as I'm too tired to grok the issue right > now. > >> Rumor has it that Linux 3.13 was supposed to get rid of all the silly >> rescheduling interrupts. It doesn't, although it does seem to have >> improved the situation. >> >> A small number of reschedule interrupts appear to be due to a race: >> both resched_task and wake_up_idle_cpu do, essentially: >> >> set_tsk_need_resched(t); >> smb_mb(); >> if (!tsk_is_polling(t)) >> smp_send_reschedule(cpu); >> >> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU >> is too quick (which isn't surprising if it was in C0 or C1), then it >> could *clear* TS_POLLING before tsk_is_polling is read. >> >> Is there a good reason that TIF_NEED_RESCHED is in thread->flags and >> TS_POLLING is in thread->status? Couldn't both of these be in the >> same field in something like struct rq? That would allow a real >> atomic op here. >> >> The more serious issue is that AFAICS default_wake_function is >> completely missing the polling check. It goes through >> ttwu_queue_remote, which unconditionally sends an interrupt. There would be an extra benefit of moving the resched-related bits to some per-cpu structure: it would allow lockless wakeups. ttwu_queue_remote, and probably all of the other reschedule-a-cpu functions, could do something like: if (...) { old = atomic_read(resched_flags(cpu)); while(true) { if (old & RESCHED_NEED_RESCHED) return; if (!(old & RESCHED_POLLING)) { smp_send_reschedule(cpu); return; } new = old | RESCHED_NEED_RESCHED; old = atomic_cmpxchg(resched_flags(cpu), old, new); } } The point being that, with the current location of the flags, either an interrupt needs to be sent or something needs to be done to prevent rq->curr from disappearing. (It probably doesn't matter if the current task changes, because TS_POLLING will be clear, but what if the task goes away entirely?) All that being said, it looks like ttwu_queue_remote doesn't actually work if the IPI isn't sent. The attached patch appears to work (and reduces total rescheduling IPIs by a large amount for my workload), but I don't really think it's worthy of being applied... --Andy From 9dfa6a99e5eb5ab0bc3a4d6beb599ba0f2f633af Mon Sep 17 00:00:00 2001 Message-Id: <9dfa6a99e5eb5ab0bc3a4d6beb599ba0f2f633af.1392157722.git.l...@amacapital.net> From: Andy Lutomirski Date: Tue, 11 Feb 2014 14:26:46 -0800 Subject: [PATCH] sched: Try to avoid sending an IPI in ttwu_queue_remote This is an experimental patch. It should probably not be applied. 
Signed-off-by: Andy Lutomirski --- kernel/sched/core.c | 24 ++-- 1 file changed, 18 insertions(+), 6 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index a88f4a4..fc7b048 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1475,20 +1475,23 @@ static int ttwu_remote(struct task_struct *p, int wake_flags) } #ifdef CONFIG_SMP -static void sched_ttwu_pending(void) +static void __sched_ttwu_pending(struct rq *rq) { - struct rq *rq = this_rq(); struct llist_node *llist = llist_del_all(&rq->wake_list); struct task_struct *p; - raw_spin_lock(&rq->lock); - while (llist) { p = llist_entry(llist, struct task_struct, wake_entry); llist = llist_next(llist); ttwu_do_activate(rq, p, 0); } +} +static void sched_ttwu_pending(void) +{ + struct rq *rq = this_rq(); + raw_spin_lock(&rq->lock); + __sched_ttwu_pending(rq); raw_spin_unlock(&rq->lock); } @@ -1536,8 +1539,15 @@ void scheduler_ipi(void) static void ttwu_queue_remote(struct task_struct *p, int cpu) { - if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) - smp_send_reschedule(cpu); + struct rq *rq = cpu_rq(cpu); + + if (llist_add(&p->wake_entry, &rq->wake_list)) { + unsigned long flags; + + raw_spin_lock_irqsave(&rq->lock, flags); + resched_task(cpu_curr(cpu)); + raw_spin_unlock_irqrestore(&rq->lock, flags); + } } bool cpus_share_cache(int this_cpu, int that_cpu) @@ -2525,6 +2535,8 @@ need_resched: smp_mb__before_spinlock(); raw_spin_lock_irq(&rq->lock); + __sched_ttwu_pending(rq); + switch_count = &prev->nivcsw; if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) { if (unlikely(signal_pending_state(prev->state, prev))) { -- 1.8.5.3
Re: Too many rescheduling interrupts (still!)
On Tue, 11 Feb 2014, Andy Lutomirski wrote: Just adding Peter for now, as I'm too tired to grok the issue right now. > Rumor has it that Linux 3.13 was supposed to get rid of all the silly > rescheduling interrupts. It doesn't, although it does seem to have > improved the situation. > > A small number of reschedule interrupts appear to be due to a race: > both resched_task and wake_up_idle_cpu do, essentially: > > set_tsk_need_resched(t); > smb_mb(); > if (!tsk_is_polling(t)) > smp_send_reschedule(cpu); > > The problem is that set_tsk_need_resched wakes the CPU and, if the CPU > is too quick (which isn't surprising if it was in C0 or C1), then it > could *clear* TS_POLLING before tsk_is_polling is read. > > Is there a good reason that TIF_NEED_RESCHED is in thread->flags and > TS_POLLING is in thread->status? Couldn't both of these be in the > same field in something like struct rq? That would allow a real > atomic op here. > > The more serious issue is that AFAICS default_wake_function is > completely missing the polling check. It goes through > ttwu_queue_remote, which unconditionally sends an interrupt. > > --Andy > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Too many rescheduling interrupts (still!)
Rumor has it that Linux 3.13 was supposed to get rid of all the silly rescheduling interrupts. It doesn't, although it does seem to have improved the situation.

A small number of reschedule interrupts appear to be due to a race: both resched_task and wake_up_idle_cpu do, essentially:

	set_tsk_need_resched(t);
	smp_mb();
	if (!tsk_is_polling(t))
		smp_send_reschedule(cpu);

The problem is that set_tsk_need_resched wakes the CPU and, if the CPU is too quick (which isn't surprising if it was in C0 or C1), then it could *clear* TS_POLLING before tsk_is_polling is read.

Is there a good reason that TIF_NEED_RESCHED is in thread->flags and TS_POLLING is in thread->status? Couldn't both of these be in the same field in something like struct rq? That would allow a real atomic op here.

The more serious issue is that AFAICS default_wake_function is completely missing the polling check. It goes through ttwu_queue_remote, which unconditionally sends an interrupt.

--Andy
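Written out as an interleaving, the race produces a redundant IPI: the flag write has already woken the target, but the polling test is made against state that is stale by then. Assuming an mwait-style polling idle loop:

	1. waker: set_tsk_need_resched(t); smp_mb();
	2. idle CPU: wakes up immediately (the flag store hit the monitored
	   cacheline), clears TS_POLLING, and goes on to reschedule.
	3. waker: tsk_is_polling(t) now reads false.
	4. waker: smp_send_reschedule(cpu) -- an IPI that was no longer needed,
	   showing up as an extra "Rescheduling interrupts" wakeup.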
Re: Rescheduling interrupts
On Wed, 2008-01-23 at 09:53 +0100, Andi Kleen wrote: > Ingo Molnar <[EMAIL PROTECTED]> writes: > > > that would probably be the case if it's multiple sockets - but for > > multiple cores exactly the opposite is true: the sooner _both_ cores > > finish processing, the deeper power use the CPU can reach. > > That's only true on setups where the cores don't have > separate sleep states. But that's not generally true anymore. > e.g. AMD Fam10h has completely separate power planes for > the cores and I believe newer Intel CPUs can also let their > cores go to at least some sleep states independently (although > the deepest sleep modi still require all cores idle) I think we can expect everyone to rapidly evolve towards full independence of core power states. In fact, it wouldn't surprise me if we eventually get to the point of shutting down individual functional units like the FPU. -- Mathematics is the supreme nostalgia of our time. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Rescheduling interrupts
* Matt Mackall <[EMAIL PROTECTED]> wrote: > > amarokapp does wake up threads every 20 microseconds - that could > > explain it. It's probably Xorg running on one core, amarokapp on the > > other core. That's already 100 reschedules/sec. > > That suggests we want an "anti-load-balancing" heuristic when CPU > usage is very low. Migrating everything onto one core when we're close > to idle will save power and probably reduce latencies. that would probably be the case if it's multiple sockets - but for multiple cores exactly the opposite is true: the sooner _both_ cores finish processing, the deeper power use the CPU can reach. So effective and immediate spreading of workloads amongst multiple cores - especially with shared L2 caches where the cost of migration is low, helps power consumption. (and it obviously helps latencies and bandwith) Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Rescheduling interrupts
On Tue, 2008-01-22 at 17:05 +0100, Ingo Molnar wrote: > * S.Çağlar Onur <[EMAIL PROTECTED]> wrote: > > > > My theory is that for whatever reason we get "repeat" IPIs: multiple > > > reschedule IPIs although the other CPU only initiated one. > > > > Ok, please see http://cekirdek.pardus.org.tr/~caglar/dmesg.3rd :) > > hm, the IPI sending and receiving is nicely paired up: > > [ 625.795008] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#1: > [ 625.795223] IPI (@native_smp_send_reschedule) from task amarokapp:2882 on > CPU#1: > > amarokapp does wake up threads every 20 microseconds - that could > explain it. It's probably Xorg running on one core, amarokapp on the > other core. That's already 100 reschedules/sec. That suggests we want an "anti-load-balancing" heuristic when CPU usage is very low. Migrating everything onto one core when we're close to idle will save power and probably reduce latencies. -- Mathematics is the supreme nostalgia of our time. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Rescheduling interrupts
Hi; 22 Oca 2008 Sal tarihinde, S.Çağlar Onur şunları yazmıştı: > > hm, the IPI sending and receiving is nicely paired up: > > > > [ 625.795008] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#1: > > [ 625.795223] IPI (@native_smp_send_reschedule) from task amarokapp:2882 > > on CPU#1: > > > > amarokapp does wake up threads every 20 microseconds - that could > > explain it. It's probably Xorg running on one core, amarokapp on the > > other core. That's already 100 reschedules/sec. > > Heh, killing amarok ends up with following; > > PowerTOP version 1.9 (C) 2007 Intel Corporation > > CnAvg residency P-states (frequencies) > C0 (cpu running)( 0,9%) > C10,0ms ( 0,0%) > C20,2ms ( 0,0%) > C35,1ms (99,1%) > > > Wakeups-from-idle per second : 197,8interval: 10,0s > no ACPI power usage estimate available > > Top causes for wakeups: > 34,7% (130,7) USB device 3-2 : HP Integrated Module (Broadcom Corp) > 26,5% (100,0): uhci_hcd:usb3 >5,8% ( 22,0) java : futex_wait (hrtimer_wakeup) >5,3% ( 20,0): iwl3945 >4,1% ( 15,4) USB device 2-2 : Microsoft Wireless Optical Mouse .00 > (Microsoft) >2,9% ( 11,0): libata >2,7% ( 10,1): extra timer interrupt >2,7% ( 10,0) java : schedule_timeout (process_timeout) >2,7% ( 10,0) : scan_async (ehci_watchdog) >2,4% ( 9,0) : Rescheduling interrupts >2,1% ( 8,0): usb_hcd_poll_rh_status (rh_timer_func) >1,7% ( 6,4): uhci_hcd:usb2 >1,7% ( 6,4) artsd : schedule_timeout (process_timeout) >0,6% ( 2,1): ohci1394, uhci_hcd:usb4, nvidia >0,5% ( 2,0) : clocksource_check_watchdog > (clocksource_watchdog) >0,5% ( 1,7)wpa_supplicant : schedule_timeout (process_timeout) >0,3% ( 1,0)kicker : schedule_timeout (process_timeout) >0,3% ( 1,0) kwin : schedule_timeout (process_timeout) >0,3% ( 1,0) kdesktop : schedule_timeout (process_timeout) >0,3% ( 1,0) klipper : schedule_timeout (process_timeout) >0,3% ( 1,0) kwrapper : do_nanosleep (hrtimer_wakeup) >0,3% ( 1,0) X : nv_start_rc_timer (nv_kern_rc_timer) By the way loging out from KDE also suffers same problem, this time kdm migrated to CPU1 and powertop reports ~300 wakeups for " : Rescheduling interrupts" again. [...] [ 2058.246692] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#0: [ 2058.246737] IPI (@native_smp_send_reschedule) from task kdm_greet:6122 on CPU#1: [ 2058.246812] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#1: [ 2058.278947] IPI (@native_smp_send_reschedule) from task X:2073 on CPU#0: [ 2058.279070] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#0: [ 2058.279175] IPI (@native_smp_send_reschedule) from task kdm_greet:6122 on CPU#1: [ 2058.279251] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#1: [ 2058.279301] IPI (@native_smp_send_reschedule) from task X:2073 on CPU#0: [ 2058.279377] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#0: [ 2058.279425] IPI (@native_smp_send_reschedule) from task kdm_greet:6122 on CPU#1: [ 2058.279503] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#1: [ 2058.279565] IPI (@native_smp_send_reschedule) from task X:2073 on CPU#0: [ 2058.279637] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#0: [ 2058.279683] IPI (@native_smp_send_reschedule) from task kdm_greet:6122 on CPU#1: [ 2058.279758] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#1: [ 2058.311903] IPI (@native_smp_send_reschedule) from task X:2073 on CPU#0: [ 2058.312028] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#0: [...] 
Cheers -- S.Çağlar Onur <[EMAIL PROTECTED]> http://cekirdek.pardus.org.tr/~caglar/ Linux is like living in a teepee. No Windows, no Gates and an Apache in house!
Re: Rescheduling interrupts
22 Oca 2008 Sal tarihinde, Ingo Molnar şunları yazmıştı: > > * S.Çağlar Onur <[EMAIL PROTECTED]> wrote: > > > > My theory is that for whatever reason we get "repeat" IPIs: multiple > > > reschedule IPIs although the other CPU only initiated one. > > > > Ok, please see http://cekirdek.pardus.org.tr/~caglar/dmesg.3rd :) > > hm, the IPI sending and receiving is nicely paired up: > > [ 625.795008] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#1: > [ 625.795223] IPI (@native_smp_send_reschedule) from task amarokapp:2882 on > CPU#1: > > amarokapp does wake up threads every 20 microseconds - that could > explain it. It's probably Xorg running on one core, amarokapp on the > other core. That's already 100 reschedules/sec. Heh, killing amarok ends up with following; PowerTOP version 1.9 (C) 2007 Intel Corporation CnAvg residency P-states (frequencies) C0 (cpu running)( 0,9%) C10,0ms ( 0,0%) C20,2ms ( 0,0%) C35,1ms (99,1%) Wakeups-from-idle per second : 197,8interval: 10,0s no ACPI power usage estimate available Top causes for wakeups: 34,7% (130,7) USB device 3-2 : HP Integrated Module (Broadcom Corp) 26,5% (100,0): uhci_hcd:usb3 5,8% ( 22,0) java : futex_wait (hrtimer_wakeup) 5,3% ( 20,0): iwl3945 4,1% ( 15,4) USB device 2-2 : Microsoft Wireless Optical Mouse .00 (Microsoft) 2,9% ( 11,0): libata 2,7% ( 10,1): extra timer interrupt 2,7% ( 10,0) java : schedule_timeout (process_timeout) 2,7% ( 10,0) : scan_async (ehci_watchdog) 2,4% ( 9,0) : Rescheduling interrupts 2,1% ( 8,0): usb_hcd_poll_rh_status (rh_timer_func) 1,7% ( 6,4): uhci_hcd:usb2 1,7% ( 6,4) artsd : schedule_timeout (process_timeout) 0,6% ( 2,1): ohci1394, uhci_hcd:usb4, nvidia 0,5% ( 2,0) : clocksource_check_watchdog (clocksource_watchdog) 0,5% ( 1,7)wpa_supplicant : schedule_timeout (process_timeout) 0,3% ( 1,0)kicker : schedule_timeout (process_timeout) 0,3% ( 1,0) kwin : schedule_timeout (process_timeout) 0,3% ( 1,0) kdesktop : schedule_timeout (process_timeout) 0,3% ( 1,0) klipper : schedule_timeout (process_timeout) 0,3% ( 1,0) kwrapper : do_nanosleep (hrtimer_wakeup) 0,3% ( 1,0) X : nv_start_rc_timer (nv_kern_rc_timer) -- S.Çağlar Onur <[EMAIL PROTECTED]> http://cekirdek.pardus.org.tr/~caglar/ Linux is like living in a teepee. No Windows, no Gates and an Apache in house! -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Rescheduling interrupts
* S.Çağlar Onur <[EMAIL PROTECTED]> wrote: > > My theory is that for whatever reason we get "repeat" IPIs: multiple > > reschedule IPIs although the other CPU only initiated one. > > Ok, please see http://cekirdek.pardus.org.tr/~caglar/dmesg.3rd :) hm, the IPI sending and receiving is nicely paired up: [ 625.795008] IPI (@smp_reschedule_interrupt) from task swapper:0 on CPU#1: [ 625.795223] IPI (@native_smp_send_reschedule) from task amarokapp:2882 on CPU#1: amarokapp does wake up threads every 20 microseconds - that could explain it. It's probably Xorg running on one core, amarokapp on the other core. That's already 100 reschedules/sec. Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Rescheduling interrupts
Hi; 22 Oca 2008 Sal tarihinde, Ingo Molnar şunları yazmıştı: > > also, this might reduce the number of cross-CPU wakeups on near-idle > systems: > > echo 1 > /sys/devices/system/cpu/sched_mc_power_savings > > [ or if it doesnt, it should ;) ] > > Ingo > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > Seems like nothing changes zangetsu ~ # cat /sys/devices/system/cpu/sched_mc_power_savings 1 Powertop still reports ~300 wakeups for " : Rescheduling interrupts" PowerTOP version 1.9 (C) 2007 Intel Corporation CnAvg residency P-states (frequencies) C0 (cpu running)( 4,8%) C10,0ms ( 0,0%) C20,2ms ( 2,4%) C32,4ms (92,8%) Wakeups-from-idle per second : 495,2interval: 3,0s no ACPI power usage estimate available Top causes for wakeups: 40,0% (330,7) : Rescheduling interrupts 12,3% (102,0) USB device 3-2 : HP Integrated Module (Broadcom Corp) 12,1% (100,0): uhci_hcd:usb3 8,0% ( 66,3): extra timer interrupt 7,0% ( 58,0) amarokapp : schedule_timeout (process_timeout) 4,0% ( 33,0): uhci_hcd:usb2 and this is what system is doing while powertop reports above; USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND root 1 0.1 0.0 1512 532 ?Ss 17:41 0:00 init [3] root 2 0.0 0.0 0 0 ?S< 17:41 0:00 [kthreadd] root 3 0.0 0.0 0 0 ?S< 17:41 0:00 [migration/0] root 4 0.0 0.0 0 0 ?S< 17:41 0:00 [ksoftirqd/0] root 5 0.0 0.0 0 0 ?S< 17:41 0:00 [migration/1] root 6 0.0 0.0 0 0 ?S< 17:41 0:00 [ksoftirqd/1] root 7 0.0 0.0 0 0 ?S< 17:41 0:00 [events/0] root 8 0.0 0.0 0 0 ?S< 17:41 0:00 [events/1] root 9 0.0 0.0 0 0 ?S< 17:41 0:00 [khelper] root10 0.0 0.0 0 0 ?S< 17:41 0:00 [kblockd/0] root11 0.0 0.0 0 0 ?S< 17:41 0:00 [kblockd/1] root12 0.0 0.0 0 0 ?S< 17:41 0:00 [kacpid] root13 0.0 0.0 0 0 ?S< 17:41 0:00 [kacpi_notify] root14 0.0 0.0 0 0 ?S< 17:41 0:00 [cqueue/0] root15 0.0 0.0 0 0 ?S< 17:41 0:00 [cqueue/1] root16 0.0 0.0 0 0 ?S< 17:41 0:00 [kseriod] root17 0.0 0.0 0 0 ?S17:41 0:00 [pdflush] root18 0.0 0.0 0 0 ?S17:41 0:00 [pdflush] root19 0.0 0.0 0 0 ?S< 17:41 0:00 [kswapd0] root20 0.0 0.0 0 0 ?S< 17:41 0:00 [aio/0] root21 0.0 0.0 0 0 ?S< 17:41 0:00 [aio/1] root22 0.0 0.0 0 0 ?S< 17:41 0:00 [kpsmoused] root42 0.0 0.0 0 0 ?S< 17:41 0:00 [khpsbpkt] root46 0.0 0.0 0 0 ?S< 17:41 0:00 [knodemgrd_0] root55 0.0 0.0 0 0 ?S< 17:41 0:00 [ata/0] root56 0.0 0.0 0 0 ?S< 17:41 0:00 [ata/1] root57 0.0 0.0 0 0 ?S< 17:41 0:00 [ata_aux] root61 0.0 0.0 0 0 ?S< 17:41 0:00 [scsi_eh_0] root62 0.0 0.0 0 0 ?S< 17:41 0:00 [scsi_eh_1] root63 0.0 0.0 0 0 ?S< 17:41 0:00 [scsi_eh_2] root64 0.0 0.0 0 0 ?S< 17:41 0:00 [scsi_eh_3] root70 0.0 0.0 0 0 ?S< 17:41 0:00 [ksuspend_usbd] root71 0.0 0.0 0 0 ?S< 17:41 0:00 [khubd] root80 0.0 0.0 0 0 ?S< 17:41 0:00 [scsi_eh_4] root81 0.0 0.0 0 0 ?S< 17:41 0:00 [scsi_eh_5] root 159 0.0 0.0 0 0 ?S< 17:41 0:00 [kjournald] root 194 0.0 0.0 2452 1304 ?S http://cekirdek.pardus.org.tr/~caglar/ Linux is like living in a teepee. No Windows, no Gates and an Apache in house! -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Rescheduling interrupts
Hi; 22 Oca 2008 Sal tarihinde, Ingo Molnar şunları yazmıştı: > * S.Çağlar Onur <[EMAIL PROTECTED]> wrote: > > I grabbed the logs two times to make sure to catch needed info. 1st [1] > > one is generated while "Rescheduling interrupts" wakeups ~200 times and > > 2nd one generated for ~350 wakeups. > > > > [1] http://cekirdek.pardus.org.tr/~caglar/dmesg.1st > > [2] http://cekirdek.pardus.org.tr/~caglar/dmesg.2nd > > thanks, these seem to be mostly normal wakeups from standard tasks: > > IPI from task kdm_greet:2118 on CPU#0: > IPI from task X:2079 on CPU#1: > IPI from task kdm_greet:2118 on CPU#0: > IPI from task hald-addon-inpu:2009 on CPU#1: > IPI from task events/0:7 on CPU#1: > IPI from task bash:2129 on CPU#0: > IPI from task kdm_greet:2118 on CPU#0: > IPI from task events/0:7 on CPU#1: > IPI from task events/0:7 on CPU#1: > IPI from task events/0:7 on CPU#1: > IPI from task bash:3902 on CPU#1: > IPI from task bash:3902 on CPU#1: > IPI from task amarokapp:3423 on CPU#1: > IPI from task amarokapp:3423 on CPU#1: > IPI from task amarokapp:3423 on CPU#1: > IPI from task X:2079 on CPU#0: > IPI from task yakuake:3422 on CPU#0: > IPI from task X:2079 on CPU#1: > IPI from task amarokapp:3423 on CPU#1: > IPI from task amarokapp:3423 on CPU#1: > > could you also add a similar IPI printouts (with the same panic_timeout > logic) to arch/x86/kernel/smp_32.c's smp_reschedule_interrupt() function > - while still keeping the other printouts too? > > Could you also enable PRINTK_TIME timestamps, so that we can see the > timings? (And do a "dmesg -n 1" so that the printks happen fast and the > timings are accurate.) I'd suggest to increase CONFIG_LOG_BUF_SHIFT to > 20, so that your dmesg buffer is large enough. Plus try to capture 100 > events, ok? > > My theory is that for whatever reason we get "repeat" IPIs: multiple > reschedule IPIs although the other CPU only initiated one. Ok, please see http://cekirdek.pardus.org.tr/~caglar/dmesg.3rd :) Cheers -- S.Çağlar Onur <[EMAIL PROTECTED]> http://cekirdek.pardus.org.tr/~caglar/ Linux is like living in a teepee. No Windows, no Gates and an Apache in house! -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Rescheduling interrupts
also, this might reduce the number of cross-CPU wakeups on near-idle systems:

	echo 1 > /sys/devices/system/cpu/sched_mc_power_savings

[ or if it doesn't, it should ;) ]

	Ingo
Re: Rescheduling interrupts
* S.Çağlar Onur <[EMAIL PROTECTED]> wrote: > I grabbed the logs two times to make sure to catch needed info. 1st [1] one > is generated while "Rescheduling interrupts" wakeups ~200 times and 2nd one > generated for ~350 wakeups. > > [1] http://cekirdek.pardus.org.tr/~caglar/dmesg.1st > [2] http://cekirdek.pardus.org.tr/~caglar/dmesg.2nd thanks, these seem to be mostly normal wakeups from standard tasks: IPI from task kdm_greet:2118 on CPU#0: IPI from task X:2079 on CPU#1: IPI from task kdm_greet:2118 on CPU#0: IPI from task hald-addon-inpu:2009 on CPU#1: IPI from task events/0:7 on CPU#1: IPI from task bash:2129 on CPU#0: IPI from task kdm_greet:2118 on CPU#0: IPI from task events/0:7 on CPU#1: IPI from task events/0:7 on CPU#1: IPI from task events/0:7 on CPU#1: IPI from task bash:3902 on CPU#1: IPI from task bash:3902 on CPU#1: IPI from task amarokapp:3423 on CPU#1: IPI from task amarokapp:3423 on CPU#1: IPI from task amarokapp:3423 on CPU#1: IPI from task X:2079 on CPU#0: IPI from task yakuake:3422 on CPU#0: IPI from task X:2079 on CPU#1: IPI from task amarokapp:3423 on CPU#1: IPI from task amarokapp:3423 on CPU#1: could you also add a similar IPI printouts (with the same panic_timeout logic) to arch/x86/kernel/smp_32.c's smp_reschedule_interrupt() function - while still keeping the other printouts too? Could you also enable PRINTK_TIME timestamps, so that we can see the timings? (And do a "dmesg -n 1" so that the printks happen fast and the timings are accurate.) I'd suggest to increase CONFIG_LOG_BUF_SHIFT to 20, so that your dmesg buffer is large enough. Plus try to capture 100 events, ok? My theory is that for whatever reason we get "repeat" IPIs: multiple reschedule IPIs although the other CPU only initiated one. Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
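The matching receive-side instrumentation Ingo asks for could look something like the sketch below, against the arch/x86/kernel/smp_32.c of that era; the exact prototype in a given tree may differ:

	void smp_reschedule_interrupt(struct pt_regs *regs)
	{
		ack_APIC_irq();
		if (panic_timeout > 0) {
			panic_timeout--;
			printk("IPI (@%s) from task %s:%d on CPU#%d:\n",
			       __func__, current->comm, current->pid,
			       smp_processor_id());
			dump_stack();
		}
		__get_cpu_var(irq_stat).irq_resched_count++;
	}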
Re: Rescheduling interrupts
Hi; 22 Oca 2008 Sal tarihinde, Ingo Molnar şunları yazmıştı: > * S.Çağlar Onur <[EMAIL PROTECTED]> wrote: > > > Top causes for wakeups: > > 59,9% (238,4) : Rescheduling interrupts > > ^^ > > 14,7% ( 58,6) amarokapp : schedule_timeout (process_timeout) > > hm, would be nice to figure out what causes these IPIs. Could you stick > something like this into arch/x86/kernel/smp_32.c's > smp_send_reschedule() function [this is the function that generates the > IPI]: > > static void native_smp_send_reschedule(int cpu) > { > WARN_ON(cpu_is_offline(cpu)); > send_IPI_mask(cpumask_of_cpu(cpu), RESCHEDULE_VECTOR); > if (panic_timeout > 0) { > panic_timeout--; > printk("IPI from task %s:%d on CPU#%d:\n", > current->comm, current->pid, cpu); > dump_stack(); > } > } > > NOTE: if you run an SMP kernel then first remove these two lines from > kernel/printk.c: > > if (!oops_in_progress && waitqueue_active(&log_wait)) > wake_up_interruptible(&log_wait); > > otherwise you'll get lockups. (the IPI is sent while holding the > runqueue lock, so the printks will lock up) > > then wait for the bad condition to occur on your system and generate a > stream of ~10 backtraces, via: > > echo 10 > /proc/sys/kernel/panic > > you should be getting 10 immediate backtraces - please send them to us. > The backtraces should show the place that generates the wakeups. [turn > on CONFIG_FRAME_POINTERS=y to get high quality backtraces.] > > If you do _not_ get 10 immediate backtraces, then something in the > system is generating such IPIs outside of the scheduler's control. That > would suggest some other sort of borkage. > > Ingo I grabbed the logs two times to make sure to catch needed info. 1st [1] one is generated while "Rescheduling interrupts" wakeups ~200 times and 2nd one generated for ~350 wakeups. [1] http://cekirdek.pardus.org.tr/~caglar/dmesg.1st [2] http://cekirdek.pardus.org.tr/~caglar/dmesg.2nd Cheers -- S.Çağlar Onur <[EMAIL PROTECTED]> http://cekirdek.pardus.org.tr/~caglar/ Linux is like living in a teepee. No Windows, no Gates and an Apache in house! -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Rescheduling interrupts
* S.Çağlar Onur <[EMAIL PROTECTED]> wrote: > Top causes for wakeups: > 59,9% (238,4) : Rescheduling interrupts > ^^ > 14,7% ( 58,6) amarokapp : schedule_timeout (process_timeout) hm, would be nice to figure out what causes these IPIs. Could you stick something like this into arch/x86/kernel/smp_32.c's smp_send_reschedule() function [this is the function that generates the IPI]: static void native_smp_send_reschedule(int cpu) { WARN_ON(cpu_is_offline(cpu)); send_IPI_mask(cpumask_of_cpu(cpu), RESCHEDULE_VECTOR); if (panic_timeout > 0) { panic_timeout--; printk("IPI from task %s:%d on CPU#%d:\n", current->comm, current->pid, cpu); dump_stack(); } } NOTE: if you run an SMP kernel then first remove these two lines from kernel/printk.c: if (!oops_in_progress && waitqueue_active(&log_wait)) wake_up_interruptible(&log_wait); otherwise you'll get lockups. (the IPI is sent while holding the runqueue lock, so the printks will lock up) then wait for the bad condition to occur on your system and generate a stream of ~10 backtraces, via: echo 10 > /proc/sys/kernel/panic you should be getting 10 immediate backtraces - please send them to us. The backtraces should show the place that generates the wakeups. [turn on CONFIG_FRAME_POINTERS=y to get high quality backtraces.] If you do _not_ get 10 immediate backtraces, then something in the system is generating such IPIs outside of the scheduler's control. That would suggest some other sort of borkage. Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Rescheduling interrupts
Hi; With Linus's latest git, powertop reports following while system nearly %100 idle; PowerTOP version 1.9 (C) 2007 Intel Corporation CnAvg residency P-states (frequencies) C0 (cpu running)( 6,3%) 1,84 Ghz 0,4% C10,0ms ( 0,0%) 1333 Mhz 0,0% C20,1ms ( 0,5%) 1000 Mhz99,6% C33,7ms (93,2%) Wakeups-from-idle per second : 306,8interval: 10,0s Power usage (5 minute ACPI estimate) : 23,1 W (0,5 hours left) Top causes for wakeups: 59,9% (238,4) : Rescheduling interrupts ^^ 14,7% ( 58,6) amarokapp : schedule_timeout (process_timeout) 5,5% ( 21,9) java : futex_wait (hrtimer_wakeup) 5,0% ( 19,8): iwl3945 2,5% ( 10,0) java : schedule_timeout (process_timeout) 2,5% ( 10,0) : ehci_work (ehci_watchdog) 2,5% ( 10,0): extra timer interrupt 1,6% ( 6,4) artsd : schedule_timeout (process_timeout) 1,0% ( 4,0): usb_hcd_poll_rh_status (rh_timer_func) 0,5% ( 2,0): ohci1394, uhci_hcd:usb4, nvidia 0,5% ( 2,0) : clocksource_check_watchdog (clocksource_watchdog) 0,5% ( 2,0) kwin : schedule_timeout (process_timeout) 0,5% ( 1,9)wpa_supplicant : schedule_timeout (process_timeout) 0,3% ( 1,2) kdesktop : schedule_timeout (process_timeout) 0,3% ( 1,0) kwrapper : do_nanosleep (hrtimer_wakeup) 0,3% ( 1,0) klipper : schedule_timeout (process_timeout) 0,3% ( 1,0) artsd : do_setitimer (it_real_fn) 0,3% ( 1,0) gpg-agent : schedule_timeout (process_timeout) 0,3% ( 1,0) X : nv_start_rc_timer (nv_kern_rc_timer) 0,3% ( 1,0)kicker : schedule_timeout (process_timeout) 0,1% ( 0,5) iwl3945 : ieee80211_authenticate (ieee80211_sta_timer) 0,1% ( 0,5) : neigh_table_init_no_netlink (neigh_periodic_timer) This " : Rescheduling interrupts" causes at least 200 wakeups (sometimes i see ~400 wakeups) for me and a quick google search yields [1], but i didn't see this reported to LKML, so here it is :). If anything else is needed please yell... [1] http://www.mail-archive.com/[EMAIL PROTECTED]/msg01009.html -- S.Çağlar Onur <[EMAIL PROTECTED]> http://cekirdek.pardus.org.tr/~caglar/ Linux is like living in a teepee. No Windows, no Gates and an Apache in house! signature.asc Description: This is a digitally signed message part.