Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-13 Thread nigel
Hi.

Ingo Molnar wrote:
> * [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> 
> Just out of curiosity, could you try the appended cumulative patch 
> and report .clock_warps, .clock_overflows and .clock_underflows as 
> you did.
 With those patches, CONFIG_NO_HZ works just fine.
>> Could these patches also help with hibernation issues? I'm trying 
>> x86_64+NO_HZ, and seeing activity delayed during the atomic copy and 
>> afterwards until I manually generate interrupts (by pressing keys).
> 
> i dont think that should be related to cpu_clock() use. Does the patch 
> below make any difference? (or could you try x86.git to get the whole 
> stack of x86 changes that we have at the moment.) Here's the coordinates 
> for x86.git:

Sorry for the delay in replying. Something seems to help, but I haven't
managed to identify what yet. I don't think it was the patch appended
because I'm on UP. If you care, I'll see if I can find the time to look
more carefully.

Nigel

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-13 Thread nigel
Hi.

Ingo Molnar wrote:
 * [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 
 Just out of curiosity, could you try the appended cumulative patch 
 and report .clock_warps, .clock_overflows and .clock_underflows as 
 you did.
 With those patches, CONFIG_NO_HZ works just fine.
 Could these patches also help with hibernation issues? I'm trying 
 x86_64+NO_HZ, and seeing activity delayed during the atomic copy and 
 afterwards until I manually generate interrupts (by pressing keys).
 
 i dont think that should be related to cpu_clock() use. Does the patch 
 below make any difference? (or could you try x86.git to get the whole 
 stack of x86 changes that we have at the moment.) Here's the coordinates 
 for x86.git:

Sorry for the delay in replying. Something seems to help, but I haven't
managed to identify what yet. I don't think it was the patch appended
because I'm on UP. If you care, I'll see if I can find the time to look
more carefully.

Nigel

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Guillaume Chazarain
Guillaume Chazarain <[EMAIL PROTECTED]> wrote:

> FYI, I'm currently trying to track down where rq->clock started to
> overflow with nohz=off, and it seems to be before 2.6.23, so my patches
> are not at fault ;-) Or maybe I am dreaming and it was always
> overflowing. Investigating ...

And the winner is:

commit 529c77261bccd9d37f110f58b0753d95beaa9fa2
Author: Ingo Molnar <[EMAIL PROTECTED]>
Date:   Fri Aug 10 23:05:11 2007 +0200

sched: improve rq-clock overflow logic

improve the rq-clock overflow logic: limit the absolute rq->clock
delta since the last scheduler tick, instead of limiting the delta
itself.

tested by Arjan van de Ven - whole laptop was misbehaving due to
an incorrectly calibrated cpu_khz confusing sched_clock().

Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
Signed-off-by: Arjan van de Ven <[EMAIL PROTECTED]>

diff --git a/kernel/sched.c b/kernel/sched.c
index b0afd8d..6247e4a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -263,6 +263,7 @@ struct rq {
 
unsigned int clock_warps, clock_overflows;
unsigned int clock_unstable_events;
+   u64 tick_timestamp;
 
atomic_t nr_iowait;
 
@@ -341,8 +342,11 @@ static void __update_rq_clock(struct rq *rq)
/*
 * Catch too large forward jumps too:
 */
-   if (unlikely(delta > 2*TICK_NSEC)) {
-   clock++;
+   if (unlikely(clock + delta > rq->tick_timestamp + TICK_NSEC)) {
+   if (clock < rq->tick_timestamp + TICK_NSEC)
+   clock = rq->tick_timestamp + TICK_NSEC;
+   else
+   clock++;
rq->clock_overflows++;
} else {
if (unlikely(delta > rq->clock_max_delta))
@@ -3308,9 +3312,16 @@ void scheduler_tick(void)
int cpu = smp_processor_id();
struct rq *rq = cpu_rq(cpu);
struct task_struct *curr = rq->curr;
+   u64 next_tick = rq->tick_timestamp + TICK_NSEC;
 
spin_lock(>lock);
__update_rq_clock(rq);
+   /*
+* Let rq->clock advance by at least TICK_NSEC:
+*/
+   if (unlikely(rq->clock < next_tick))
+   rq->clock = next_tick;
+   rq->tick_timestamp = rq->clock;
update_cpu_load(rq);
if (curr != rq->idle) /* FIXME: needed? */
curr->sched_class->task_tick(rq, curr);


Seems like I originally was not the only one seeing 2 jiffies jumps ;-)
I'll adapt my patches.

-- 
Guillaume
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Guillaume Chazarain
Ingo Molnar <[EMAIL PROTECTED]> wrote:

> ok. I have applied all but this one

Hmm, I couldn't find them in mingo/linux-2.6-sched-devel.git.

> i think it's much simpler to do what i have below. Could you try it on 
> your box? Or if it is using ACPI idle - in that case the callbacks 
> should already be there and there should be no need for further fixups.
> 
> Subject: x86: idle wakeup event in the HLT loop

I use ACPI, so this patch has no effect.

FYI, I'm currently trying to track down where rq->clock started to
overflow with nohz=off, and it seems to be before 2.6.23, so my patches
are not at fault ;-) Or maybe I am dreaming and it was always
overflowing. Investigating ...

-- 
Guillaume
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread David Dillow

On Fri, 2008-01-11 at 10:34 +0100, Ingo Molnar wrote:
> * David Dillow <[EMAIL PROTECTED]> wrote:
> 
> > Ingo, Thomas added as I think this is related to 
> > sched.c:__update_rq_clock()'s checking for forward time warps.
> 
> yep, we've got some fixes in this area. Do blktrace timestamps work fine 
> in v2.6.23, with NOHZ?

Do you still want this tested, given that your ktime change works? If
so, it will likely be next week before I can devote some cycles to it --
it'll take installing a new distro on the machines.
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Ingo Molnar

* Guillaume Chazarain <[EMAIL PROTECTED]> wrote:

> David Dillow <[EMAIL PROTECTED]> wrote:
> 
> > Patched kernel, nohz=off:
> >   .clock_underflows  : 213887
> 
> A little bit of warning about these patches, they are WIP, that's why 
> I did not send them earlier. It regress nohz=off.

ok. I have applied all but this one:

> A bit of context: these patches aim at making sure cpu_clock() on my
> laptop (cpufreq enabled) never overflows/underflows/warps with
> CONFIG_NOHZ enabled. With these patches, I have a few hundreds
> overflows and underflows during early bootup, and then nothing :-)

cool :-)

> > sched: Fix rq->clock overflows detection with CONFIG_NO_HZ
> 
> I think this one is the most important for David, but unfortunately it
> has some problems.
> 
> > +static inline u64 max_skipped_ticks(struct rq *rq)
> > +{
> > +   return nohz_on(cpu_of(rq)) ? jiffies - rq->last_tick_seen + 2 : 1;
> > +}
> 
> Here, I initially wrote rq->last_tick_seen + 1 but experiments showed
> that +2 was needed as I really saw deltas of 2 milliseconds.
> 
> These patches have two objectives:
>  - taking into account that jiffies are not always incremented by 1
> thanks to nohz
>  - as the tick is stopped and restarted it may not tick at the exact
> expected moment, so allow a window of 1 jiffie. If the tick occurs
> during the right jiffy, we know the TSC is more precise than the tick
> so don't correct the clock.

i think it's much simpler to do what i have below. Could you try it on 
your box? Or if it is using ACPI idle - in that case the callbacks 
should already be there and there should be no need for further fixups.

> And the problem is that I seem to need a window of 2 jiffies, so I need
> some help.
> 
> > sched: make sure jiffies is up to date before calling 
> > __update_rq_clock()
> 
> This is one is needed too but I'm less confident in its validity.
> 
> > scheduler_tick() is not called every jiffies
> 
> This one is a bit ugly and seems to break nohz=off.

ok, i took this one out.

Ingo

>
Subject: x86: idle wakeup event in the HLT loop
From: Ingo Molnar <[EMAIL PROTECTED]>

do a proper idle-wakeup event on HLT as well - some CPUs stop the TSC
in HLT too, not just when going through the ACPI methods.

(the ACPI idle code already does this.)

[ update the 64-bit side too, as noticed by Jiri Slaby. ]

Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
---
 arch/x86/kernel/process_32.c |   15 ---
 arch/x86/kernel/process_64.c |   13 ++---
 2 files changed, 22 insertions(+), 6 deletions(-)

Index: linux-x86.q/arch/x86/kernel/process_32.c
===
--- linux-x86.q.orig/arch/x86/kernel/process_32.c
+++ linux-x86.q/arch/x86/kernel/process_32.c
@@ -113,10 +113,19 @@ void default_idle(void)
smp_mb();
 
local_irq_disable();
-   if (!need_resched())
+   if (!need_resched()) {
+   ktime_t t0, t1;
+   u64 t0n, t1n;
+
+   t0 = ktime_get();
+   t0n = ktime_to_ns(t0);
safe_halt();/* enables interrupts racelessly */
-   else
-   local_irq_enable();
+   local_irq_disable();
+   t1 = ktime_get();
+   t1n = ktime_to_ns(t1);
+   sched_clock_idle_wakeup_event(t1n - t0n);
+   }
+   local_irq_enable();
current_thread_info()->status |= TS_POLLING;
} else {
/* loop is done by the caller */
Index: linux-x86.q/arch/x86/kernel/process_64.c
===
--- linux-x86.q.orig/arch/x86/kernel/process_64.c
+++ linux-x86.q/arch/x86/kernel/process_64.c
@@ -116,9 +116,16 @@ static void default_idle(void)
smp_mb();
local_irq_disable();
if (!need_resched()) {
-   /* Enables interrupts one instruction before HLT.
-  x86 special cases this so there is no race. */
-   safe_halt();
+   ktime_t t0, t1;
+   u64 t0n, t1n;
+
+   t0 = ktime_get();
+   t0n = ktime_to_ns(t0);
+   safe_halt();/* enables interrupts racelessly */
+   local_irq_disable();
+   t1 = ktime_get();
+   t1n = ktime_to_ns(t1);
+   sched_clock_idle_wakeup_event(t1n - t0n);
} else
local_irq_enable();
current_thread_info()->status |= TS_POLLING;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Guillaume Chazarain
David Dillow <[EMAIL PROTECTED]> wrote:

> Patched kernel, nohz=off:
>   .clock_underflows  : 213887

A little bit of warning about these patches, they are WIP, that's why I
did not send them earlier. It regress nohz=off.

A bit of context: these patches aim at making sure cpu_clock() on my
laptop (cpufreq enabled) never overflows/underflows/warps with
CONFIG_NOHZ enabled. With these patches, I have a few hundreds
overflows and underflows during early bootup, and then nothing :-)

Ingo Molnar <[EMAIL PROTECTED]> wrote:

> they are from the scheduler git tree (except the first debug patch), but 
> queued up for v2.6.25 at the moment.

You are talking about "x86: scale cyc_2_nsec according to CPU
frequency" here, but I don't think it is at stakes here as David has:

> CONFIG_CPU_FREQ is not set

Let me review my patches myself to give a bit of context:

> sched: monitor clock underflows in /proc/sched_debug

This, I'd like to have it in .25 just for convenience.

> x86: scale cyc_2_nsec according to CPU frequency

You already know that one ;-)

> sched: fix rq->clock warps on frequency changes

This is a bugfix for .25 once the previous patch is applied. I don't
think it helps David, but it could help blktrace users with cpufreq
enabled.

> sched: Fix rq->clock overflows detection with CONFIG_NO_HZ

I think this one is the most important for David, but unfortunately it
has some problems.

> +static inline u64 max_skipped_ticks(struct rq *rq)
> +{
> + return nohz_on(cpu_of(rq)) ? jiffies - rq->last_tick_seen + 2 : 1;
> +}

Here, I initially wrote rq->last_tick_seen + 1 but experiments showed
that +2 was needed as I really saw deltas of 2 milliseconds.

These patches have two objectives:
 - taking into account that jiffies are not always incremented by 1
thanks to nohz
 - as the tick is stopped and restarted it may not tick at the exact
expected moment, so allow a window of 1 jiffie. If the tick occurs
during the right jiffy, we know the TSC is more precise than the tick
so don't correct the clock.

And the problem is that I seem to need a window of 2 jiffies, so I need
some help.

> sched: make sure jiffies is up to date before calling __update_rq_clock()

This is one is needed too but I'm less confident in its validity.

> scheduler_tick() is not called every jiffies

This one is a bit ugly and seems to break nohz=off.

> - if (unlikely(rq->clock < next_tick)) {
> + if (unlikely(rq->clock < next_tick - nohz_on(cpu) * TICK_NSEC)) {

No, I'm not proud of this :-(

Thanks.

-- 
Guillaume
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Jens Axboe
On Fri, Jan 11 2008, Ingo Molnar wrote:
> 
> * Jens Axboe <[EMAIL PROTECTED]> wrote:
> 
> > > they are from the scheduler git tree (except the first debug patch), 
> > > but queued up for v2.6.25 at the moment.
> > 
> > So this means that blktrace will be broken with CONFIG_NO_HZ for 
> > 2.6.24? That's clearly a regression.
> 
> 64-bit CONFIG_NO_HZ is a new feature in v2.6.24. If it happens on 32-bit 
> too and it didnt happen in v2.6.23 32-bit then it's a regression.

If blktrace worked in 2.6.23 and it doesn't in 2.6.24 because of some
option that isn't immediately apparent, then it's a regression. Period.

> all this comes from blktrace's original decision of using sched_clock()
> :-) It's not a global timesource and it's not trivial to turn it into a
> halfways usable global timesource.

Hey, it was a high res time source and the only one easily available :)
I'm fine with using another timesource, I'll take suggestions or patches
any day!

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> > With those patches, CONFIG_NO_HZ works just fine.
> 
> could you please try the two patches below, do they fix the problem as 
> well? They got a ton of testing in x86.git in the past ~2 months and 
> we could perhaps still push them into v2.6.24.

plus, if needed, perhaps this patch below from Guillaume ontop of it.

Ingo

>
Subject: sched: fix rq->clock warps on frequency changes
From: Guillaume Chazarain <[EMAIL PROTECTED]>

sched: fix rq->clock warps on frequency changes

Fix 2bacec8c318ca0418c0ee9ac662ee44207765dd4
(sched: touch softlockup watchdog after idling) that reintroduced warps
on frequency changes. touch_softlockup_watchdog() calls __update_rq_clock
that checks rq->clock for warps, so call it after adjusting rq->clock.

Signed-off-by: Guillaume Chazarain <[EMAIL PROTECTED]>
Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
---
 kernel/sched.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-x86.q/kernel/sched.c
===
--- linux-x86.q.orig/kernel/sched.c
+++ linux-x86.q/kernel/sched.c
@@ -668,7 +668,6 @@ void sched_clock_idle_wakeup_event(u64 d
struct rq *rq = cpu_rq(smp_processor_id());
u64 now = sched_clock();
 
-   touch_softlockup_watchdog();
rq->idle_clock += delta_ns;
/*
 * Override the previous timestamp and ignore all
@@ -680,6 +679,7 @@ void sched_clock_idle_wakeup_event(u64 d
rq->prev_clock_raw = now;
rq->clock += delta_ns;
spin_unlock(>lock);
+   touch_softlockup_watchdog();
 }
 EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Ingo Molnar

* David Dillow <[EMAIL PROTECTED]> wrote:

> > Just out of curiosity, could you try the appended cumulative patch 
> > and report .clock_warps, .clock_overflows and .clock_underflows as 
> > you did.
> 
> With those patches, CONFIG_NO_HZ works just fine.

could you please try the two patches below, do they fix the problem as 
well? They got a ton of testing in x86.git in the past ~2 months and we 
could perhaps still push them into v2.6.24.

Ingo

-->
Subject: x86: scale cyc_2_nsec according to CPU frequency
From: "Guillaume Chazarain" <[EMAIL PROTECTED]>

scale the sched_clock() cyc_2_nsec scaling factor according to
CPU frequency changes.

[ [EMAIL PROTECTED]: simplified it and fixed it for SMP. ]

Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
Signed-off-by: Thomas Gleixner <[EMAIL PROTECTED]>
---
 arch/x86/kernel/tsc_32.c |   43 ++-
 arch/x86/kernel/tsc_64.c |   57 ++-
 include/asm-x86/timer.h  |   23 ++
 3 files changed, 102 insertions(+), 21 deletions(-)

Index: linux-x86.q/arch/x86/kernel/tsc_32.c
===
--- linux-x86.q.orig/arch/x86/kernel/tsc_32.c
+++ linux-x86.q/arch/x86/kernel/tsc_32.c
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -80,13 +81,31 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
  *
  * [EMAIL PROTECTED] "math is hard, lets go shopping!"
  */
-unsigned long cyc2ns_scale __read_mostly;
 
-#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
+DEFINE_PER_CPU(unsigned long, cyc2ns);
 
-static inline void set_cyc2ns_scale(unsigned long cpu_khz)
+static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 {
-   cyc2ns_scale = (100 << CYC2NS_SCALE_FACTOR)/cpu_khz;
+   unsigned long flags, prev_scale, *scale;
+   unsigned long long tsc_now, ns_now;
+
+   local_irq_save(flags);
+   sched_clock_idle_sleep_event();
+
+   scale = _cpu(cyc2ns, cpu);
+
+   rdtscll(tsc_now);
+   ns_now = __cycles_2_ns(tsc_now);
+
+   prev_scale = *scale;
+   if (cpu_khz)
+   *scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+
+   /*
+* Start smoothly with the new frequency:
+*/
+   sched_clock_idle_wakeup_event(0);
+   local_irq_restore(flags);
 }
 
 /*
@@ -239,7 +258,9 @@ time_cpufreq_notifier(struct notifier_bl
ref_freq, freq->new);
if (!(freq->flags & CPUFREQ_CONST_LOOPS)) {
tsc_khz = cpu_khz;
-   set_cyc2ns_scale(cpu_khz);
+   preempt_disable();
+   set_cyc2ns_scale(cpu_khz, smp_processor_id());
+   preempt_enable();
/*
 * TSC based sched_clock turns
 * to junk w/ cpufreq
@@ -367,6 +388,8 @@ static inline void check_geode_tsc_relia
 
 void __init tsc_init(void)
 {
+   int cpu;
+
if (!cpu_has_tsc || tsc_disable)
goto out_no_tsc;
 
@@ -380,7 +403,15 @@ void __init tsc_init(void)
(unsigned long)cpu_khz / 1000,
(unsigned long)cpu_khz % 1000);
 
-   set_cyc2ns_scale(cpu_khz);
+   /*
+* Secondary CPUs do not run through tsc_init(), so set up
+* all the scale factors for all CPUs, assuming the same
+* speed as the bootup CPU. (cpufreq notifiers will fix this
+* up if their speed diverges)
+*/
+   for_each_possible_cpu(cpu)
+   set_cyc2ns_scale(cpu_khz, cpu);
+
use_tsc_delay();
 
/* Check and install the TSC clocksource */
Index: linux-x86.q/arch/x86/kernel/tsc_64.c
===
--- linux-x86.q.orig/arch/x86/kernel/tsc_64.c
+++ linux-x86.q/arch/x86/kernel/tsc_64.c
@@ -10,6 +10,7 @@
 
 #include 
 #include 
+#include 
 
 static int notsc __initdata = 0;
 
@@ -18,16 +19,48 @@ EXPORT_SYMBOL(cpu_khz);
 unsigned int tsc_khz;
 EXPORT_SYMBOL(tsc_khz);
 
-static unsigned int cyc2ns_scale __read_mostly;
+/* Accelerators for sched_clock()
+ * convert from cycles(64bits) => nanoseconds (64bits)
+ *  basic equation:
+ * ns = cycles / (freq / ns_per_sec)
+ * ns = cycles * (ns_per_sec / freq)
+ * ns = cycles * (10^9 / (cpu_khz * 10^3))
+ * ns = cycles * (10^6 / cpu_khz)
+ *
+ * Then we use scaling math (suggested by [EMAIL PROTECTED]) to get:
+ * ns = cycles * (10^6 * SC / cpu_khz) / SC
+ * ns = cycles * cyc2ns_scale / SC
+ *
+ * And since SC is a constant power of two, we can convert the div
+ *  into a shift.
+ *
+ *  We can use khz divisor instead of mhz to keep a better precision, since
+ *  

Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Ingo Molnar

* Jens Axboe <[EMAIL PROTECTED]> wrote:

> > they are from the scheduler git tree (except the first debug patch), 
> > but queued up for v2.6.25 at the moment.
> 
> So this means that blktrace will be broken with CONFIG_NO_HZ for 
> 2.6.24? That's clearly a regression.

64-bit CONFIG_NO_HZ is a new feature in v2.6.24. If it happens on 32-bit 
too and it didnt happen in v2.6.23 32-bit then it's a regression.

all this comes from blktrace's original decision of using sched_clock()
:-) It's not a global timesource and it's not trivial to turn it into a
halfways usable global timesource.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Ingo Molnar

* David Dillow <[EMAIL PROTECTED]> wrote:

> Ingo, Thomas added as I think this is related to 
> sched.c:__update_rq_clock()'s checking for forward time warps.

yep, we've got some fixes in this area. Do blktrace timestamps work fine 
in v2.6.23, with NOHZ?

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Jens Axboe
On Fri, Jan 11 2008, Ingo Molnar wrote:
> 
> * Ingo Molnar <[EMAIL PROTECTED]> wrote:
> 
> > > Thanks for reporting this. Guillaume, did you write this patch? We 
> > > need to get it into 2.6.24-rc7 asap. Let me know if I should take 
> > > care of that, or if it's already queued up elsewhere.
> > 
> > they are from the scheduler git tree (except the first debug patch), 
> > but queued up for v2.6.25 at the moment.
> 
> some of them are not - Guillaume has not sent them. I'll sort it out.

Thanks Ingo!

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Ingo Molnar

* [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

> >>> Just out of curiosity, could you try the appended cumulative patch 
> >>> and report .clock_warps, .clock_overflows and .clock_underflows as 
> >>> you did.
> >> With those patches, CONFIG_NO_HZ works just fine.
> 
> Could these patches also help with hibernation issues? I'm trying 
> x86_64+NO_HZ, and seeing activity delayed during the atomic copy and 
> afterwards until I manually generate interrupts (by pressing keys).

i dont think that should be related to cpu_clock() use. Does the patch 
below make any difference? (or could you try x86.git to get the whole 
stack of x86 changes that we have at the moment.) Here's the coordinates 
for x86.git:

--{ x86.git instructions }-->

# Add Linus's tree as a remote
git remote --add linus
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

# Add Ingo's tree as a remote
git remote --add x86
git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git

# With that setup, just run the following to get any changes you
# don't have.  It will also notice any new branches Ingo/Linus
# add to their repo.  Look in .git/config afterwards, the format
# to add new remotes is easy to figure out.
git remote update

Ingo

--->
Subject: x86: kick CPUS that might be sleeping in cpus_idle_wait
From: Steven Rostedt <[EMAIL PROTECTED]>

Sometimes cpu_idle_wait gets stuck because it might miss CPUS that are
already in idle, have no tasks waiting to run and have no interrupts going
to them.  This is common on bootup when switching cpu idle governors.

This patch gives those CPUS that don't check in an IPI kick.

Background:
---
I notice this while developing the mcount patches, that every once in a
while the system would hang. Looking deeper, the hang was always at boot
up when registering init_menu of the cpu_idle menu governor. Talking
with Thomas Gliexner, we discovered that one of the CPUS had no timer
events scheduled for it and it was in idle (running with NO_HZ). So the
CPU would not set the cpu_idle_state bit.

Hitting sysrq-t a few times would eventually route the interrupt to the
stuck CPU and the system would continue.

Note, I would have used the PDA isidle but that is set after the
cpu_idle_state bit is cleared, and would leave a window open where we
may miss being kicked.

hmm, looking closer at this, we still have a small race window between
clearing the cpu_idle_state and disabling interrupts (hence the RFC).

CPU0:  CPU 1:
  -   -
 cpu_idle_wait(): cpu_idle():
  |   __cpu_cpu_var(is_idle) = 1;
  |   if (__get_cpu_var(cpu_idle_state)) /* == 0 */
 per_cpu(cpu_idle_state, 1) = 1; |
 if (per_cpu(is_idle, 1)) /* == 1 */ |
 smp_call_function(1)|
  | receives ipi and runs do_nothing.
 wait on map == empty   idle();
   /* waits forever */

So really we need interrupts off for most of this then. One might think
that we could simply clear the cpu_idle_state from do_nothing, but I'm
assuming that cpu_idle governors can be removed, and this might cause a
race that a governor might be used after the module was removed.

Venki said:

  I think your RFC patch is the right solution here.  As I see it, there is
  no race with your RFC patch.  As long as you call a dummy smp_call_function
  on all CPUs, we should be OK.  We can get rid of cpu_idle_state and the
  current wait forever logic altogether with dummy smp_call_function.  And so
  there wont be any wait forever scenario.

  The whole point of cpu_idle_wait() is to make all CPUs come out of idle
  loop atleast once.  The caller will use cpu_idle_wait something like this.

  // Want to change idle handler

  - Switch global idle handler to always present default_idle

  - call cpu_idle_wait so that all cpus come out of idle for an instant
and stop using old idle pointer and start using default idle

  - Change the idle handler to a new handler

  - optional cpu_idle_wait if you want all cpus to start using the new
handler immediately.

Maybe the below 1s patch is safe bet for .24.  But for .25, I would say we
just replace all complicated logic by simple dummy smp_call_function and
remove cpu_idle_state altogether.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
Cc: Venkatesh Pallipadi <[EMAIL PROTECTED]>
Cc: Andi Kleen <[EMAIL PROTECTED]>
Cc: Len Brown <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
---
 arch/x86/kernel/process_32.c |   11 +++
 arch/x86/kernel/process_64.c |   11 +++
 2 files changed, 22 insertions(+)

Index: linux-x86.q/arch/x86/kernel/process_32.c
===
--- linux-x86.q.orig/arch/x86/kernel/process_32.c
+++ 

Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> > Thanks for reporting this. Guillaume, did you write this patch? We 
> > need to get it into 2.6.24-rc7 asap. Let me know if I should take 
> > care of that, or if it's already queued up elsewhere.
> 
> they are from the scheduler git tree (except the first debug patch), 
> but queued up for v2.6.25 at the moment.

some of them are not - Guillaume has not sent them. I'll sort it out.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread nigel
Hi.

Jens Axboe wrote:
> On Thu, Jan 10 2008, David Dillow wrote:
>> On Thu, 2008-01-10 at 23:44 +0100, Guillaume Chazarain wrote:
>>> David Dillow <[EMAIL PROTECTED]> wrote:
>>>
 At the moment, I'm not sure how to track this farther, or how to fix it
 properly. Any advice would be appreciated.
>>> Just out of curiosity, could you try the appended cumulative patch and
>>> report .clock_warps, .clock_overflows and .clock_underflows as you did.
>> With those patches, CONFIG_NO_HZ works just fine.

Could these patches also help with hibernation issues? I'm trying
x86_64+NO_HZ, and seeing activity delayed during the atomic copy and
afterwards until I manually generate interrupts (by pressing keys).

Regards,

Nigel

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Jens Axboe
On Fri, Jan 11 2008, Ingo Molnar wrote:
> 
> * Jens Axboe <[EMAIL PROTECTED]> wrote:
> 
> > Thanks for reporting this. Guillaume, did you write this patch? We 
> > need to get it into 2.6.24-rc7 asap. Let me know if I should take care 
> > of that, or if it's already queued up elsewhere.
> 
> they are from the scheduler git tree (except the first debug patch), but 
> queued up for v2.6.25 at the moment.

So this means that blktrace will be broken with CONFIG_NO_HZ for 2.6.24?
That's clearly a regression.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Ingo Molnar

* Jens Axboe <[EMAIL PROTECTED]> wrote:

> Thanks for reporting this. Guillaume, did you write this patch? We 
> need to get it into 2.6.24-rc7 asap. Let me know if I should take care 
> of that, or if it's already queued up elsewhere.

they are from the scheduler git tree (except the first debug patch), but 
queued up for v2.6.25 at the moment.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Jens Axboe
On Thu, Jan 10 2008, David Dillow wrote:
> 
> On Thu, 2008-01-10 at 23:44 +0100, Guillaume Chazarain wrote:
> > David Dillow <[EMAIL PROTECTED]> wrote:
> > 
> > > At the moment, I'm not sure how to track this farther, or how to fix it
> > > properly. Any advice would be appreciated.
> > 
> > Just out of curiosity, could you try the appended cumulative patch and
> > report .clock_warps, .clock_overflows and .clock_underflows as you did.
> 
> With those patches, CONFIG_NO_HZ works just fine.
> 
> Patched kernel, nohz=off:
> now at 214257.820809 msecs
>   .clock : 214212.727559
>   .idle_clock: 0.00
>   .prev_clock_raw: 244569.402345
>   .clock_warps   : 0
>   .clock_overflows   : 577
>   .clock_underflows  : 213887
>   .clock_deep_idle_events: 4
>   .clock_max_delta   : 0.999830
> 
> Patched kernel, nohz=on:
> now at 248931.524381 msecs
>   .clock : 248745.808465
>   .idle_clock: 0.00
>   .prev_clock_raw: 270911.098507
>   .clock_warps   : 0
>   .clock_overflows   : 69
>   .clock_underflows  : 784
>   .clock_deep_idle_events: 4
>   .clock_max_delta   : 100.639397
> 
> Running my disk test, blktrace is getting the proper timestamps now with
> CONFIG_NO_HZ.

Thanks for reporting this. Guillaume, did you write this patch? We need
to get it into 2.6.24-rc7 asap. Let me know if I should take care of
that, or if it's already queued up elsewhere.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Jens Axboe
On Thu, Jan 10 2008, David Dillow wrote:
 
 On Thu, 2008-01-10 at 23:44 +0100, Guillaume Chazarain wrote:
  David Dillow [EMAIL PROTECTED] wrote:
  
   At the moment, I'm not sure how to track this farther, or how to fix it
   properly. Any advice would be appreciated.
  
  Just out of curiosity, could you try the appended cumulative patch and
  report .clock_warps, .clock_overflows and .clock_underflows as you did.
 
 With those patches, CONFIG_NO_HZ works just fine.
 
 Patched kernel, nohz=off:
 now at 214257.820809 msecs
   .clock : 214212.727559
   .idle_clock: 0.00
   .prev_clock_raw: 244569.402345
   .clock_warps   : 0
   .clock_overflows   : 577
   .clock_underflows  : 213887
   .clock_deep_idle_events: 4
   .clock_max_delta   : 0.999830
 
 Patched kernel, nohz=on:
 now at 248931.524381 msecs
   .clock : 248745.808465
   .idle_clock: 0.00
   .prev_clock_raw: 270911.098507
   .clock_warps   : 0
   .clock_overflows   : 69
   .clock_underflows  : 784
   .clock_deep_idle_events: 4
   .clock_max_delta   : 100.639397
 
 Running my disk test, blktrace is getting the proper timestamps now with
 CONFIG_NO_HZ.

Thanks for reporting this. Guillaume, did you write this patch? We need
to get it into 2.6.24-rc7 asap. Let me know if I should take care of
that, or if it's already queued up elsewhere.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Ingo Molnar

* Jens Axboe [EMAIL PROTECTED] wrote:

 Thanks for reporting this. Guillaume, did you write this patch? We 
 need to get it into 2.6.24-rc7 asap. Let me know if I should take care 
 of that, or if it's already queued up elsewhere.

they are from the scheduler git tree (except the first debug patch), but 
queued up for v2.6.25 at the moment.

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Jens Axboe
On Fri, Jan 11 2008, Ingo Molnar wrote:
 
 * Jens Axboe [EMAIL PROTECTED] wrote:
 
  Thanks for reporting this. Guillaume, did you write this patch? We 
  need to get it into 2.6.24-rc7 asap. Let me know if I should take care 
  of that, or if it's already queued up elsewhere.
 
 they are from the scheduler git tree (except the first debug patch), but 
 queued up for v2.6.25 at the moment.

So this means that blktrace will be broken with CONFIG_NO_HZ for 2.6.24?
That's clearly a regression.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread nigel
Hi.

Jens Axboe wrote:
 On Thu, Jan 10 2008, David Dillow wrote:
 On Thu, 2008-01-10 at 23:44 +0100, Guillaume Chazarain wrote:
 David Dillow [EMAIL PROTECTED] wrote:

 At the moment, I'm not sure how to track this farther, or how to fix it
 properly. Any advice would be appreciated.
 Just out of curiosity, could you try the appended cumulative patch and
 report .clock_warps, .clock_overflows and .clock_underflows as you did.
 With those patches, CONFIG_NO_HZ works just fine.

Could these patches also help with hibernation issues? I'm trying
x86_64+NO_HZ, and seeing activity delayed during the atomic copy and
afterwards until I manually generate interrupts (by pressing keys).

Regards,

Nigel

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Ingo Molnar

* Ingo Molnar [EMAIL PROTECTED] wrote:

  Thanks for reporting this. Guillaume, did you write this patch? We 
  need to get it into 2.6.24-rc7 asap. Let me know if I should take 
  care of that, or if it's already queued up elsewhere.
 
 they are from the scheduler git tree (except the first debug patch), 
 but queued up for v2.6.25 at the moment.

some of them are not - Guillaume has not sent them. I'll sort it out.

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Ingo Molnar

* Ingo Molnar [EMAIL PROTECTED] wrote:

  With those patches, CONFIG_NO_HZ works just fine.
 
 could you please try the two patches below, do they fix the problem as 
 well? They got a ton of testing in x86.git in the past ~2 months and 
 we could perhaps still push them into v2.6.24.

plus, if needed, perhaps this patch below from Guillaume ontop of it.

Ingo


Subject: sched: fix rq-clock warps on frequency changes
From: Guillaume Chazarain [EMAIL PROTECTED]

sched: fix rq-clock warps on frequency changes

Fix 2bacec8c318ca0418c0ee9ac662ee44207765dd4
(sched: touch softlockup watchdog after idling) that reintroduced warps
on frequency changes. touch_softlockup_watchdog() calls __update_rq_clock
that checks rq-clock for warps, so call it after adjusting rq-clock.

Signed-off-by: Guillaume Chazarain [EMAIL PROTECTED]
Signed-off-by: Ingo Molnar [EMAIL PROTECTED]
---
 kernel/sched.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-x86.q/kernel/sched.c
===
--- linux-x86.q.orig/kernel/sched.c
+++ linux-x86.q/kernel/sched.c
@@ -668,7 +668,6 @@ void sched_clock_idle_wakeup_event(u64 d
struct rq *rq = cpu_rq(smp_processor_id());
u64 now = sched_clock();
 
-   touch_softlockup_watchdog();
rq-idle_clock += delta_ns;
/*
 * Override the previous timestamp and ignore all
@@ -680,6 +679,7 @@ void sched_clock_idle_wakeup_event(u64 d
rq-prev_clock_raw = now;
rq-clock += delta_ns;
spin_unlock(rq-lock);
+   touch_softlockup_watchdog();
 }
 EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
 
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Ingo Molnar

* David Dillow [EMAIL PROTECTED] wrote:

 Ingo, Thomas added as I think this is related to 
 sched.c:__update_rq_clock()'s checking for forward time warps.

yep, we've got some fixes in this area. Do blktrace timestamps work fine 
in v2.6.23, with NOHZ?

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Jens Axboe
On Fri, Jan 11 2008, Ingo Molnar wrote:
 
 * Ingo Molnar [EMAIL PROTECTED] wrote:
 
   Thanks for reporting this. Guillaume, did you write this patch? We 
   need to get it into 2.6.24-rc7 asap. Let me know if I should take 
   care of that, or if it's already queued up elsewhere.
  
  they are from the scheduler git tree (except the first debug patch), 
  but queued up for v2.6.25 at the moment.
 
 some of them are not - Guillaume has not sent them. I'll sort it out.

Thanks Ingo!

-- 
Jens Axboe

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Guillaume Chazarain
David Dillow [EMAIL PROTECTED] wrote:

 Patched kernel, nohz=off:
   .clock_underflows  : 213887

A little bit of warning about these patches, they are WIP, that's why I
did not send them earlier. It regress nohz=off.

A bit of context: these patches aim at making sure cpu_clock() on my
laptop (cpufreq enabled) never overflows/underflows/warps with
CONFIG_NOHZ enabled. With these patches, I have a few hundreds
overflows and underflows during early bootup, and then nothing :-)

Ingo Molnar [EMAIL PROTECTED] wrote:

 they are from the scheduler git tree (except the first debug patch), but 
 queued up for v2.6.25 at the moment.

You are talking about x86: scale cyc_2_nsec according to CPU
frequency here, but I don't think it is at stakes here as David has:

 CONFIG_CPU_FREQ is not set

Let me review my patches myself to give a bit of context:

 sched: monitor clock underflows in /proc/sched_debug

This, I'd like to have it in .25 just for convenience.

 x86: scale cyc_2_nsec according to CPU frequency

You already know that one ;-)

 sched: fix rq-clock warps on frequency changes

This is a bugfix for .25 once the previous patch is applied. I don't
think it helps David, but it could help blktrace users with cpufreq
enabled.

 sched: Fix rq-clock overflows detection with CONFIG_NO_HZ

I think this one is the most important for David, but unfortunately it
has some problems.

 +static inline u64 max_skipped_ticks(struct rq *rq)
 +{
 + return nohz_on(cpu_of(rq)) ? jiffies - rq-last_tick_seen + 2 : 1;
 +}

Here, I initially wrote rq-last_tick_seen + 1 but experiments showed
that +2 was needed as I really saw deltas of 2 milliseconds.

These patches have two objectives:
 - taking into account that jiffies are not always incremented by 1
thanks to nohz
 - as the tick is stopped and restarted it may not tick at the exact
expected moment, so allow a window of 1 jiffie. If the tick occurs
during the right jiffy, we know the TSC is more precise than the tick
so don't correct the clock.

And the problem is that I seem to need a window of 2 jiffies, so I need
some help.

 sched: make sure jiffies is up to date before calling __update_rq_clock()

This is one is needed too but I'm less confident in its validity.

 scheduler_tick() is not called every jiffies

This one is a bit ugly and seems to break nohz=off.

 - if (unlikely(rq-clock  next_tick)) {
 + if (unlikely(rq-clock  next_tick - nohz_on(cpu) * TICK_NSEC)) {

No, I'm not proud of this :-(

Thanks.

-- 
Guillaume
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Ingo Molnar

* Guillaume Chazarain [EMAIL PROTECTED] wrote:

 David Dillow [EMAIL PROTECTED] wrote:
 
  Patched kernel, nohz=off:
.clock_underflows  : 213887
 
 A little bit of warning about these patches, they are WIP, that's why 
 I did not send them earlier. It regress nohz=off.

ok. I have applied all but this one:

 A bit of context: these patches aim at making sure cpu_clock() on my
 laptop (cpufreq enabled) never overflows/underflows/warps with
 CONFIG_NOHZ enabled. With these patches, I have a few hundreds
 overflows and underflows during early bootup, and then nothing :-)

cool :-)

  sched: Fix rq-clock overflows detection with CONFIG_NO_HZ
 
 I think this one is the most important for David, but unfortunately it
 has some problems.
 
  +static inline u64 max_skipped_ticks(struct rq *rq)
  +{
  +   return nohz_on(cpu_of(rq)) ? jiffies - rq-last_tick_seen + 2 : 1;
  +}
 
 Here, I initially wrote rq-last_tick_seen + 1 but experiments showed
 that +2 was needed as I really saw deltas of 2 milliseconds.
 
 These patches have two objectives:
  - taking into account that jiffies are not always incremented by 1
 thanks to nohz
  - as the tick is stopped and restarted it may not tick at the exact
 expected moment, so allow a window of 1 jiffie. If the tick occurs
 during the right jiffy, we know the TSC is more precise than the tick
 so don't correct the clock.

i think it's much simpler to do what i have below. Could you try it on 
your box? Or if it is using ACPI idle - in that case the callbacks 
should already be there and there should be no need for further fixups.

 And the problem is that I seem to need a window of 2 jiffies, so I need
 some help.
 
  sched: make sure jiffies is up to date before calling 
  __update_rq_clock()
 
 This is one is needed too but I'm less confident in its validity.
 
  scheduler_tick() is not called every jiffies
 
 This one is a bit ugly and seems to break nohz=off.

ok, i took this one out.

Ingo


Subject: x86: idle wakeup event in the HLT loop
From: Ingo Molnar [EMAIL PROTECTED]

do a proper idle-wakeup event on HLT as well - some CPUs stop the TSC
in HLT too, not just when going through the ACPI methods.

(the ACPI idle code already does this.)

[ update the 64-bit side too, as noticed by Jiri Slaby. ]

Signed-off-by: Ingo Molnar [EMAIL PROTECTED]
---
 arch/x86/kernel/process_32.c |   15 ---
 arch/x86/kernel/process_64.c |   13 ++---
 2 files changed, 22 insertions(+), 6 deletions(-)

Index: linux-x86.q/arch/x86/kernel/process_32.c
===
--- linux-x86.q.orig/arch/x86/kernel/process_32.c
+++ linux-x86.q/arch/x86/kernel/process_32.c
@@ -113,10 +113,19 @@ void default_idle(void)
smp_mb();
 
local_irq_disable();
-   if (!need_resched())
+   if (!need_resched()) {
+   ktime_t t0, t1;
+   u64 t0n, t1n;
+
+   t0 = ktime_get();
+   t0n = ktime_to_ns(t0);
safe_halt();/* enables interrupts racelessly */
-   else
-   local_irq_enable();
+   local_irq_disable();
+   t1 = ktime_get();
+   t1n = ktime_to_ns(t1);
+   sched_clock_idle_wakeup_event(t1n - t0n);
+   }
+   local_irq_enable();
current_thread_info()-status |= TS_POLLING;
} else {
/* loop is done by the caller */
Index: linux-x86.q/arch/x86/kernel/process_64.c
===
--- linux-x86.q.orig/arch/x86/kernel/process_64.c
+++ linux-x86.q/arch/x86/kernel/process_64.c
@@ -116,9 +116,16 @@ static void default_idle(void)
smp_mb();
local_irq_disable();
if (!need_resched()) {
-   /* Enables interrupts one instruction before HLT.
-  x86 special cases this so there is no race. */
-   safe_halt();
+   ktime_t t0, t1;
+   u64 t0n, t1n;
+
+   t0 = ktime_get();
+   t0n = ktime_to_ns(t0);
+   safe_halt();/* enables interrupts racelessly */
+   local_irq_disable();
+   t1 = ktime_get();
+   t1n = ktime_to_ns(t1);
+   sched_clock_idle_wakeup_event(t1n - t0n);
} else
local_irq_enable();
current_thread_info()-status |= TS_POLLING;
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Ingo Molnar

* David Dillow [EMAIL PROTECTED] wrote:

  Just out of curiosity, could you try the appended cumulative patch 
  and report .clock_warps, .clock_overflows and .clock_underflows as 
  you did.
 
 With those patches, CONFIG_NO_HZ works just fine.

could you please try the two patches below, do they fix the problem as 
well? They got a ton of testing in x86.git in the past ~2 months and we 
could perhaps still push them into v2.6.24.

Ingo

--
Subject: x86: scale cyc_2_nsec according to CPU frequency
From: Guillaume Chazarain [EMAIL PROTECTED]

scale the sched_clock() cyc_2_nsec scaling factor according to
CPU frequency changes.

[ [EMAIL PROTECTED]: simplified it and fixed it for SMP. ]

Signed-off-by: Ingo Molnar [EMAIL PROTECTED]
Signed-off-by: Thomas Gleixner [EMAIL PROTECTED]
---
 arch/x86/kernel/tsc_32.c |   43 ++-
 arch/x86/kernel/tsc_64.c |   57 ++-
 include/asm-x86/timer.h  |   23 ++
 3 files changed, 102 insertions(+), 21 deletions(-)

Index: linux-x86.q/arch/x86/kernel/tsc_32.c
===
--- linux-x86.q.orig/arch/x86/kernel/tsc_32.c
+++ linux-x86.q/arch/x86/kernel/tsc_32.c
@@ -5,6 +5,7 @@
 #include linux/jiffies.h
 #include linux/init.h
 #include linux/dmi.h
+#include linux/percpu.h
 
 #include asm/delay.h
 #include asm/tsc.h
@@ -80,13 +81,31 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
  *
  * [EMAIL PROTECTED] math is hard, lets go shopping!
  */
-unsigned long cyc2ns_scale __read_mostly;
 
-#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
+DEFINE_PER_CPU(unsigned long, cyc2ns);
 
-static inline void set_cyc2ns_scale(unsigned long cpu_khz)
+static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 {
-   cyc2ns_scale = (100  CYC2NS_SCALE_FACTOR)/cpu_khz;
+   unsigned long flags, prev_scale, *scale;
+   unsigned long long tsc_now, ns_now;
+
+   local_irq_save(flags);
+   sched_clock_idle_sleep_event();
+
+   scale = per_cpu(cyc2ns, cpu);
+
+   rdtscll(tsc_now);
+   ns_now = __cycles_2_ns(tsc_now);
+
+   prev_scale = *scale;
+   if (cpu_khz)
+   *scale = (NSEC_PER_MSEC  CYC2NS_SCALE_FACTOR)/cpu_khz;
+
+   /*
+* Start smoothly with the new frequency:
+*/
+   sched_clock_idle_wakeup_event(0);
+   local_irq_restore(flags);
 }
 
 /*
@@ -239,7 +258,9 @@ time_cpufreq_notifier(struct notifier_bl
ref_freq, freq-new);
if (!(freq-flags  CPUFREQ_CONST_LOOPS)) {
tsc_khz = cpu_khz;
-   set_cyc2ns_scale(cpu_khz);
+   preempt_disable();
+   set_cyc2ns_scale(cpu_khz, smp_processor_id());
+   preempt_enable();
/*
 * TSC based sched_clock turns
 * to junk w/ cpufreq
@@ -367,6 +388,8 @@ static inline void check_geode_tsc_relia
 
 void __init tsc_init(void)
 {
+   int cpu;
+
if (!cpu_has_tsc || tsc_disable)
goto out_no_tsc;
 
@@ -380,7 +403,15 @@ void __init tsc_init(void)
(unsigned long)cpu_khz / 1000,
(unsigned long)cpu_khz % 1000);
 
-   set_cyc2ns_scale(cpu_khz);
+   /*
+* Secondary CPUs do not run through tsc_init(), so set up
+* all the scale factors for all CPUs, assuming the same
+* speed as the bootup CPU. (cpufreq notifiers will fix this
+* up if their speed diverges)
+*/
+   for_each_possible_cpu(cpu)
+   set_cyc2ns_scale(cpu_khz, cpu);
+
use_tsc_delay();
 
/* Check and install the TSC clocksource */
Index: linux-x86.q/arch/x86/kernel/tsc_64.c
===
--- linux-x86.q.orig/arch/x86/kernel/tsc_64.c
+++ linux-x86.q/arch/x86/kernel/tsc_64.c
@@ -10,6 +10,7 @@
 
 #include asm/hpet.h
 #include asm/timex.h
+#include asm/timer.h
 
 static int notsc __initdata = 0;
 
@@ -18,16 +19,48 @@ EXPORT_SYMBOL(cpu_khz);
 unsigned int tsc_khz;
 EXPORT_SYMBOL(tsc_khz);
 
-static unsigned int cyc2ns_scale __read_mostly;
+/* Accelerators for sched_clock()
+ * convert from cycles(64bits) = nanoseconds (64bits)
+ *  basic equation:
+ * ns = cycles / (freq / ns_per_sec)
+ * ns = cycles * (ns_per_sec / freq)
+ * ns = cycles * (10^9 / (cpu_khz * 10^3))
+ * ns = cycles * (10^6 / cpu_khz)
+ *
+ * Then we use scaling math (suggested by [EMAIL PROTECTED]) to get:
+ * ns = cycles * (10^6 * SC / cpu_khz) / SC
+ * ns = cycles * cyc2ns_scale / SC
+ *
+ * And since SC is a constant power of two, we can convert the div
+ *  into a shift.
+ *
+ *  

Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Ingo Molnar

* [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

  Just out of curiosity, could you try the appended cumulative patch 
  and report .clock_warps, .clock_overflows and .clock_underflows as 
  you did.
  With those patches, CONFIG_NO_HZ works just fine.
 
 Could these patches also help with hibernation issues? I'm trying 
 x86_64+NO_HZ, and seeing activity delayed during the atomic copy and 
 afterwards until I manually generate interrupts (by pressing keys).

i dont think that should be related to cpu_clock() use. Does the patch 
below make any difference? (or could you try x86.git to get the whole 
stack of x86 changes that we have at the moment.) Here's the coordinates 
for x86.git:

--{ x86.git instructions }--

# Add Linus's tree as a remote
git remote --add linus
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

# Add Ingo's tree as a remote
git remote --add x86
git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git

# With that setup, just run the following to get any changes you
# don't have.  It will also notice any new branches Ingo/Linus
# add to their repo.  Look in .git/config afterwards, the format
# to add new remotes is easy to figure out.
git remote update

Ingo

---
Subject: x86: kick CPUS that might be sleeping in cpus_idle_wait
From: Steven Rostedt [EMAIL PROTECTED]

Sometimes cpu_idle_wait gets stuck because it might miss CPUS that are
already in idle, have no tasks waiting to run and have no interrupts going
to them.  This is common on bootup when switching cpu idle governors.

This patch gives those CPUS that don't check in an IPI kick.

Background:
---
I notice this while developing the mcount patches, that every once in a
while the system would hang. Looking deeper, the hang was always at boot
up when registering init_menu of the cpu_idle menu governor. Talking
with Thomas Gliexner, we discovered that one of the CPUS had no timer
events scheduled for it and it was in idle (running with NO_HZ). So the
CPU would not set the cpu_idle_state bit.

Hitting sysrq-t a few times would eventually route the interrupt to the
stuck CPU and the system would continue.

Note, I would have used the PDA isidle but that is set after the
cpu_idle_state bit is cleared, and would leave a window open where we
may miss being kicked.

hmm, looking closer at this, we still have a small race window between
clearing the cpu_idle_state and disabling interrupts (hence the RFC).

CPU0:  CPU 1:
  -   -
 cpu_idle_wait(): cpu_idle():
  |   __cpu_cpu_var(is_idle) = 1;
  |   if (__get_cpu_var(cpu_idle_state)) /* == 0 */
 per_cpu(cpu_idle_state, 1) = 1; |
 if (per_cpu(is_idle, 1)) /* == 1 */ |
 smp_call_function(1)|
  | receives ipi and runs do_nothing.
 wait on map == empty   idle();
   /* waits forever */

So really we need interrupts off for most of this then. One might think
that we could simply clear the cpu_idle_state from do_nothing, but I'm
assuming that cpu_idle governors can be removed, and this might cause a
race that a governor might be used after the module was removed.

Venki said:

  I think your RFC patch is the right solution here.  As I see it, there is
  no race with your RFC patch.  As long as you call a dummy smp_call_function
  on all CPUs, we should be OK.  We can get rid of cpu_idle_state and the
  current wait forever logic altogether with dummy smp_call_function.  And so
  there wont be any wait forever scenario.

  The whole point of cpu_idle_wait() is to make all CPUs come out of idle
  loop atleast once.  The caller will use cpu_idle_wait something like this.

  // Want to change idle handler

  - Switch global idle handler to always present default_idle

  - call cpu_idle_wait so that all cpus come out of idle for an instant
and stop using old idle pointer and start using default idle

  - Change the idle handler to a new handler

  - optional cpu_idle_wait if you want all cpus to start using the new
handler immediately.

Maybe the below 1s patch is safe bet for .24.  But for .25, I would say we
just replace all complicated logic by simple dummy smp_call_function and
remove cpu_idle_state altogether.

Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
Cc: Venkatesh Pallipadi [EMAIL PROTECTED]
Cc: Andi Kleen [EMAIL PROTECTED]
Cc: Len Brown [EMAIL PROTECTED]
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
Signed-off-by: Ingo Molnar [EMAIL PROTECTED]
---
 arch/x86/kernel/process_32.c |   11 +++
 arch/x86/kernel/process_64.c |   11 +++
 2 files changed, 22 insertions(+)

Index: linux-x86.q/arch/x86/kernel/process_32.c
===
--- linux-x86.q.orig/arch/x86/kernel/process_32.c
+++ linux-x86.q/arch/x86/kernel/process_32.c
@@ -214,6 

Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Ingo Molnar

* Jens Axboe [EMAIL PROTECTED] wrote:

  they are from the scheduler git tree (except the first debug patch), 
  but queued up for v2.6.25 at the moment.
 
 So this means that blktrace will be broken with CONFIG_NO_HZ for 
 2.6.24? That's clearly a regression.

64-bit CONFIG_NO_HZ is a new feature in v2.6.24. If it happens on 32-bit 
too and it didnt happen in v2.6.23 32-bit then it's a regression.

all this comes from blktrace's original decision of using sched_clock()
:-) It's not a global timesource and it's not trivial to turn it into a
halfways usable global timesource.

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Jens Axboe
On Fri, Jan 11 2008, Ingo Molnar wrote:
 
 * Jens Axboe [EMAIL PROTECTED] wrote:
 
   they are from the scheduler git tree (except the first debug patch), 
   but queued up for v2.6.25 at the moment.
  
  So this means that blktrace will be broken with CONFIG_NO_HZ for 
  2.6.24? That's clearly a regression.
 
 64-bit CONFIG_NO_HZ is a new feature in v2.6.24. If it happens on 32-bit 
 too and it didnt happen in v2.6.23 32-bit then it's a regression.

If blktrace worked in 2.6.23 and it doesn't in 2.6.24 because of some
option that isn't immediately apparent, then it's a regression. Period.

 all this comes from blktrace's original decision of using sched_clock()
 :-) It's not a global timesource and it's not trivial to turn it into a
 halfways usable global timesource.

Hey, it was a high res time source and the only one easily available :)
I'm fine with using another timesource, I'll take suggestions or patches
any day!

-- 
Jens Axboe

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread David Dillow

On Fri, 2008-01-11 at 10:34 +0100, Ingo Molnar wrote:
 * David Dillow [EMAIL PROTECTED] wrote:
 
  Ingo, Thomas added as I think this is related to 
  sched.c:__update_rq_clock()'s checking for forward time warps.
 
 yep, we've got some fixes in this area. Do blktrace timestamps work fine 
 in v2.6.23, with NOHZ?

Do you still want this tested, given that your ktime change works? If
so, it will likely be next week before I can devote some cycles to it --
it'll take installing a new distro on the machines.
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Guillaume Chazarain
Ingo Molnar [EMAIL PROTECTED] wrote:

 ok. I have applied all but this one

Hmm, I couldn't find them in mingo/linux-2.6-sched-devel.git.

 i think it's much simpler to do what i have below. Could you try it on 
 your box? Or if it is using ACPI idle - in that case the callbacks 
 should already be there and there should be no need for further fixups.
 
 Subject: x86: idle wakeup event in the HLT loop

I use ACPI, so this patch has no effect.

FYI, I'm currently trying to track down where rq-clock started to
overflow with nohz=off, and it seems to be before 2.6.23, so my patches
are not at fault ;-) Or maybe I am dreaming and it was always
overflowing. Investigating ...

-- 
Guillaume
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-11 Thread Guillaume Chazarain
Guillaume Chazarain [EMAIL PROTECTED] wrote:

 FYI, I'm currently trying to track down where rq-clock started to
 overflow with nohz=off, and it seems to be before 2.6.23, so my patches
 are not at fault ;-) Or maybe I am dreaming and it was always
 overflowing. Investigating ...

And the winner is:

commit 529c77261bccd9d37f110f58b0753d95beaa9fa2
Author: Ingo Molnar [EMAIL PROTECTED]
Date:   Fri Aug 10 23:05:11 2007 +0200

sched: improve rq-clock overflow logic

improve the rq-clock overflow logic: limit the absolute rq-clock
delta since the last scheduler tick, instead of limiting the delta
itself.

tested by Arjan van de Ven - whole laptop was misbehaving due to
an incorrectly calibrated cpu_khz confusing sched_clock().

Signed-off-by: Ingo Molnar [EMAIL PROTECTED]
Signed-off-by: Arjan van de Ven [EMAIL PROTECTED]

diff --git a/kernel/sched.c b/kernel/sched.c
index b0afd8d..6247e4a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -263,6 +263,7 @@ struct rq {
 
unsigned int clock_warps, clock_overflows;
unsigned int clock_unstable_events;
+   u64 tick_timestamp;
 
atomic_t nr_iowait;
 
@@ -341,8 +342,11 @@ static void __update_rq_clock(struct rq *rq)
/*
 * Catch too large forward jumps too:
 */
-   if (unlikely(delta  2*TICK_NSEC)) {
-   clock++;
+   if (unlikely(clock + delta  rq-tick_timestamp + TICK_NSEC)) {
+   if (clock  rq-tick_timestamp + TICK_NSEC)
+   clock = rq-tick_timestamp + TICK_NSEC;
+   else
+   clock++;
rq-clock_overflows++;
} else {
if (unlikely(delta  rq-clock_max_delta))
@@ -3308,9 +3312,16 @@ void scheduler_tick(void)
int cpu = smp_processor_id();
struct rq *rq = cpu_rq(cpu);
struct task_struct *curr = rq-curr;
+   u64 next_tick = rq-tick_timestamp + TICK_NSEC;
 
spin_lock(rq-lock);
__update_rq_clock(rq);
+   /*
+* Let rq-clock advance by at least TICK_NSEC:
+*/
+   if (unlikely(rq-clock  next_tick))
+   rq-clock = next_tick;
+   rq-tick_timestamp = rq-clock;
update_cpu_load(rq);
if (curr != rq-idle) /* FIXME: needed? */
curr-sched_class-task_tick(rq, curr);


Seems like I originally was not the only one seeing 2 jiffies jumps ;-)
I'll adapt my patches.

-- 
Guillaume
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-10 Thread David Dillow

On Thu, 2008-01-10 at 23:44 +0100, Guillaume Chazarain wrote:
> David Dillow <[EMAIL PROTECTED]> wrote:
> 
> > At the moment, I'm not sure how to track this farther, or how to fix it
> > properly. Any advice would be appreciated.
> 
> Just out of curiosity, could you try the appended cumulative patch and
> report .clock_warps, .clock_overflows and .clock_underflows as you did.

With those patches, CONFIG_NO_HZ works just fine.

Patched kernel, nohz=off:
now at 214257.820809 msecs
  .clock : 214212.727559
  .idle_clock: 0.00
  .prev_clock_raw: 244569.402345
  .clock_warps   : 0
  .clock_overflows   : 577
  .clock_underflows  : 213887
  .clock_deep_idle_events: 4
  .clock_max_delta   : 0.999830

Patched kernel, nohz=on:
now at 248931.524381 msecs
  .clock : 248745.808465
  .idle_clock: 0.00
  .prev_clock_raw: 270911.098507
  .clock_warps   : 0
  .clock_overflows   : 69
  .clock_underflows  : 784
  .clock_deep_idle_events: 4
  .clock_max_delta   : 100.639397

Running my disk test, blktrace is getting the proper timestamps now with
CONFIG_NO_HZ.

Thanks!
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-10 Thread Guillaume Chazarain
David Dillow <[EMAIL PROTECTED]> wrote:

> At the moment, I'm not sure how to track this farther, or how to fix it
> properly. Any advice would be appreciated.

Just out of curiosity, could you try the appended cumulative patch and
report .clock_warps, .clock_overflows and .clock_underflows as you did.

Thanks.

commit 20fa02359d971bdb820d238184fabd42d8018e4f
Author: Guillaume Chazarain <[EMAIL PROTECTED]>
Date:   Thu Jan 10 23:36:43 2008 +0100

sched: monitor clock underflows in /proc/sched_debug

We monitor clock overflows, let's also monitor clock underflows.

Signed-off-by: Guillaume Chazarain <[EMAIL PROTECTED]>

diff --git a/kernel/sched.c b/kernel/sched.c
index 37cf07a..cab9756 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -317,7 +317,7 @@ struct rq {
u64 clock, prev_clock_raw;
s64 clock_max_delta;
 
-   unsigned int clock_warps, clock_overflows;
+   unsigned int clock_warps, clock_overflows, clock_underflows;
u64 idle_clock;
unsigned int clock_deep_idle_events;
u64 tick_timestamp;
@@ -3485,8 +3485,10 @@ void scheduler_tick(void)
/*
 * Let rq->clock advance by at least TICK_NSEC:
 */
-   if (unlikely(rq->clock < next_tick))
+   if (unlikely(rq->clock < next_tick)) {
rq->clock = next_tick;
+   rq->clock_underflows++;
+   }
rq->tick_timestamp = rq->clock;
update_cpu_load(rq);
if (curr != rq->idle) /* FIXME: needed? */
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 80fbbfc..9e5de09 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -179,6 +179,7 @@ static void print_cpu(struct seq_file *m, int cpu)
PN(prev_clock_raw);
P(clock_warps);
P(clock_overflows);
+   P(clock_underflows);
P(clock_deep_idle_events);
PN(clock_max_delta);
P(cpu_load[0]);

commit c146421cae64bb626714dc951fa39b55d2f819c1
Author: Guillaume Chazarain <[EMAIL PROTECTED]>
Date:   Wed Jan 2 14:10:17 2008 +0100

commit 60c6397ce4e8c9fd7feaeaef4167ace71c3949c8

x86: scale cyc_2_nsec according to CPU frequency

scale the sched_clock() cyc_2_nsec scaling factor according to
CPU frequency changes.

[ [EMAIL PROTECTED]: simplified it and fixed it for SMP. ]

Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
Signed-off-by: Thomas Gleixner <[EMAIL PROTECTED]>

diff --git a/arch/x86/kernel/tsc_32.c b/arch/x86/kernel/tsc_32.c
index 9ebc0da..00bb4c1 100644
--- a/arch/x86/kernel/tsc_32.c
+++ b/arch/x86/kernel/tsc_32.c
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -80,13 +81,31 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
  *
  * [EMAIL PROTECTED] "math is hard, lets go shopping!"
  */
-unsigned long cyc2ns_scale __read_mostly;
 
-#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
+DEFINE_PER_CPU(unsigned long, cyc2ns);
 
-static inline void set_cyc2ns_scale(unsigned long cpu_khz)
+static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 {
-   cyc2ns_scale = (100 << CYC2NS_SCALE_FACTOR)/cpu_khz;
+   unsigned long flags, prev_scale, *scale;
+   unsigned long long tsc_now, ns_now;
+
+   local_irq_save(flags);
+   sched_clock_idle_sleep_event();
+
+   scale = _cpu(cyc2ns, cpu);
+
+   rdtscll(tsc_now);
+   ns_now = __cycles_2_ns(tsc_now);
+
+   prev_scale = *scale;
+   if (cpu_khz)
+   *scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+
+   /*
+* Start smoothly with the new frequency:
+*/
+   sched_clock_idle_wakeup_event(0);
+   local_irq_restore(flags);
 }
 
 /*
@@ -239,7 +258,9 @@ time_cpufreq_notifier(struct notifier_block *nb, unsigned 
long val, void *data)
ref_freq, freq->new);
if (!(freq->flags & CPUFREQ_CONST_LOOPS)) {
tsc_khz = cpu_khz;
-   set_cyc2ns_scale(cpu_khz);
+   preempt_disable();
+   set_cyc2ns_scale(cpu_khz, smp_processor_id());
+   preempt_enable();
/*
 * TSC based sched_clock turns
 * to junk w/ cpufreq
@@ -367,6 +388,8 @@ static inline void check_geode_tsc_reliable(void) { }
 
 void __init tsc_init(void)
 {
+   int cpu;
+
if (!cpu_has_tsc || tsc_disable)
goto out_no_tsc;
 
@@ -380,7 +403,15 @@ void __init tsc_init(void)
(unsigned long)cpu_khz / 1000,
(unsigned long)cpu_khz % 1000);
 
-   set_cyc2ns_scale(cpu_khz);
+   /*
+* Secondary CPUs do not run through tsc_init(), so set up
+* all the scale factors for all CPUs, assuming the same
+* 

Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-10 Thread David Dillow
Ingo, Thomas added as I think this is related to
sched.c:__update_rq_clock()'s checking for forward time warps.

On Wed, 2008-01-09 at 17:48 -0500, David Dillow wrote:
> While trying to gain some insight into a disk issue, I found that
> blktrace/blkparse was giving me bogus traces -- I was seeing requests
> complete before they were even dispatched or queued even! I had thought
> that maybe this was an issue with SMP on the box, but when running with
> 'maxcpus=1', it told me that my 53 second test run only took 3.5
> seconds.
> 
> I started tracking this down, and upon looking at cpu_clock(), and found
> that it uses sched_clock(), which is based on jiffies. At this point I
> had an ahah! moment and remembered that I had NO_HZ enabled.

[I did figure out that the sched_clock() jiffies implementation in
sched.c is a fallback.]

A few pieces of info I forgot in the original message -- this is on an
quad-processor, dual core Opteron box, running 2.6.24-rc7 x86_64. I'm
currently booting it with maxcpus=1, though at this point it is just a
hold-over from my initial bug hunting.

I can provide a full .config/dmesg if needed, but off the top:
CONFIG_NO_HZ=y
CONFIG_HZ=1000
CONFIG_CPU_FREQ is not set
CONFIG_CPU_IDLE makes no difference

hpet is the current clocksource

When booting with "nohz=off", rq->clock and rq->prev_clock_raw
from /proc/sched_debug track nicely with ktime_get() ("now is at ...
msecs"). rq->clock_overflows is non-zero, but increases relatively
slowly -- 650 for ~355 seconds.

With "nohz=on", rq->prev_clock_raw still tracks nicely with ktime_get(),
but rq->clock is moving very slowly and rq->clock_overflows is
incrementing fairly rapidly -- 53718 for ~357 seconds.

Either way gives a rq->max_delta of 0.999844 -- ms, I presume.

rq->clock_overflows is only incremented in sched.c:__update_rq_clock(),
and only when delta pushes clock more than TICK_NSEC past the current
tick_timestamp. I'm assuming this happens when the CPU's been idle for a
bit, with everything waiting for IO (4 to 5ms in the ticked blktrace),
so there's been no updates of rq->clock from sched_clock().

At the moment, I'm not sure how to track this farther, or how to fix it
properly. Any advice would be appreciated.

Thanks!
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-10 Thread David Dillow
Ingo, Thomas added as I think this is related to
sched.c:__update_rq_clock()'s checking for forward time warps.

On Wed, 2008-01-09 at 17:48 -0500, David Dillow wrote:
 While trying to gain some insight into a disk issue, I found that
 blktrace/blkparse was giving me bogus traces -- I was seeing requests
 complete before they were even dispatched or queued even! I had thought
 that maybe this was an issue with SMP on the box, but when running with
 'maxcpus=1', it told me that my 53 second test run only took 3.5
 seconds.
 
 I started tracking this down, and upon looking at cpu_clock(), and found
 that it uses sched_clock(), which is based on jiffies. At this point I
 had an ahah! moment and remembered that I had NO_HZ enabled.

[I did figure out that the sched_clock() jiffies implementation in
sched.c is a fallback.]

A few pieces of info I forgot in the original message -- this is on an
quad-processor, dual core Opteron box, running 2.6.24-rc7 x86_64. I'm
currently booting it with maxcpus=1, though at this point it is just a
hold-over from my initial bug hunting.

I can provide a full .config/dmesg if needed, but off the top:
CONFIG_NO_HZ=y
CONFIG_HZ=1000
CONFIG_CPU_FREQ is not set
CONFIG_CPU_IDLE makes no difference

hpet is the current clocksource

When booting with nohz=off, rq-clock and rq-prev_clock_raw
from /proc/sched_debug track nicely with ktime_get() (now is at ...
msecs). rq-clock_overflows is non-zero, but increases relatively
slowly -- 650 for ~355 seconds.

With nohz=on, rq-prev_clock_raw still tracks nicely with ktime_get(),
but rq-clock is moving very slowly and rq-clock_overflows is
incrementing fairly rapidly -- 53718 for ~357 seconds.

Either way gives a rq-max_delta of 0.999844 -- ms, I presume.

rq-clock_overflows is only incremented in sched.c:__update_rq_clock(),
and only when delta pushes clock more than TICK_NSEC past the current
tick_timestamp. I'm assuming this happens when the CPU's been idle for a
bit, with everything waiting for IO (4 to 5ms in the ticked blktrace),
so there's been no updates of rq-clock from sched_clock().

At the moment, I'm not sure how to track this farther, or how to fix it
properly. Any advice would be appreciated.

Thanks!
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-10 Thread Guillaume Chazarain
David Dillow [EMAIL PROTECTED] wrote:

 At the moment, I'm not sure how to track this farther, or how to fix it
 properly. Any advice would be appreciated.

Just out of curiosity, could you try the appended cumulative patch and
report .clock_warps, .clock_overflows and .clock_underflows as you did.

Thanks.

commit 20fa02359d971bdb820d238184fabd42d8018e4f
Author: Guillaume Chazarain [EMAIL PROTECTED]
Date:   Thu Jan 10 23:36:43 2008 +0100

sched: monitor clock underflows in /proc/sched_debug

We monitor clock overflows, let's also monitor clock underflows.

Signed-off-by: Guillaume Chazarain [EMAIL PROTECTED]

diff --git a/kernel/sched.c b/kernel/sched.c
index 37cf07a..cab9756 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -317,7 +317,7 @@ struct rq {
u64 clock, prev_clock_raw;
s64 clock_max_delta;
 
-   unsigned int clock_warps, clock_overflows;
+   unsigned int clock_warps, clock_overflows, clock_underflows;
u64 idle_clock;
unsigned int clock_deep_idle_events;
u64 tick_timestamp;
@@ -3485,8 +3485,10 @@ void scheduler_tick(void)
/*
 * Let rq-clock advance by at least TICK_NSEC:
 */
-   if (unlikely(rq-clock  next_tick))
+   if (unlikely(rq-clock  next_tick)) {
rq-clock = next_tick;
+   rq-clock_underflows++;
+   }
rq-tick_timestamp = rq-clock;
update_cpu_load(rq);
if (curr != rq-idle) /* FIXME: needed? */
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 80fbbfc..9e5de09 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -179,6 +179,7 @@ static void print_cpu(struct seq_file *m, int cpu)
PN(prev_clock_raw);
P(clock_warps);
P(clock_overflows);
+   P(clock_underflows);
P(clock_deep_idle_events);
PN(clock_max_delta);
P(cpu_load[0]);

commit c146421cae64bb626714dc951fa39b55d2f819c1
Author: Guillaume Chazarain [EMAIL PROTECTED]
Date:   Wed Jan 2 14:10:17 2008 +0100

commit 60c6397ce4e8c9fd7feaeaef4167ace71c3949c8

x86: scale cyc_2_nsec according to CPU frequency

scale the sched_clock() cyc_2_nsec scaling factor according to
CPU frequency changes.

[ [EMAIL PROTECTED]: simplified it and fixed it for SMP. ]

Signed-off-by: Ingo Molnar [EMAIL PROTECTED]
Signed-off-by: Thomas Gleixner [EMAIL PROTECTED]

diff --git a/arch/x86/kernel/tsc_32.c b/arch/x86/kernel/tsc_32.c
index 9ebc0da..00bb4c1 100644
--- a/arch/x86/kernel/tsc_32.c
+++ b/arch/x86/kernel/tsc_32.c
@@ -5,6 +5,7 @@
 #include linux/jiffies.h
 #include linux/init.h
 #include linux/dmi.h
+#include linux/percpu.h
 
 #include asm/delay.h
 #include asm/tsc.h
@@ -80,13 +81,31 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
  *
  * [EMAIL PROTECTED] math is hard, lets go shopping!
  */
-unsigned long cyc2ns_scale __read_mostly;
 
-#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
+DEFINE_PER_CPU(unsigned long, cyc2ns);
 
-static inline void set_cyc2ns_scale(unsigned long cpu_khz)
+static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 {
-   cyc2ns_scale = (100  CYC2NS_SCALE_FACTOR)/cpu_khz;
+   unsigned long flags, prev_scale, *scale;
+   unsigned long long tsc_now, ns_now;
+
+   local_irq_save(flags);
+   sched_clock_idle_sleep_event();
+
+   scale = per_cpu(cyc2ns, cpu);
+
+   rdtscll(tsc_now);
+   ns_now = __cycles_2_ns(tsc_now);
+
+   prev_scale = *scale;
+   if (cpu_khz)
+   *scale = (NSEC_PER_MSEC  CYC2NS_SCALE_FACTOR)/cpu_khz;
+
+   /*
+* Start smoothly with the new frequency:
+*/
+   sched_clock_idle_wakeup_event(0);
+   local_irq_restore(flags);
 }
 
 /*
@@ -239,7 +258,9 @@ time_cpufreq_notifier(struct notifier_block *nb, unsigned 
long val, void *data)
ref_freq, freq-new);
if (!(freq-flags  CPUFREQ_CONST_LOOPS)) {
tsc_khz = cpu_khz;
-   set_cyc2ns_scale(cpu_khz);
+   preempt_disable();
+   set_cyc2ns_scale(cpu_khz, smp_processor_id());
+   preempt_enable();
/*
 * TSC based sched_clock turns
 * to junk w/ cpufreq
@@ -367,6 +388,8 @@ static inline void check_geode_tsc_reliable(void) { }
 
 void __init tsc_init(void)
 {
+   int cpu;
+
if (!cpu_has_tsc || tsc_disable)
goto out_no_tsc;
 
@@ -380,7 +403,15 @@ void __init tsc_init(void)
(unsigned long)cpu_khz / 1000,
(unsigned long)cpu_khz % 1000);
 
-   set_cyc2ns_scale(cpu_khz);
+   /*
+* Secondary CPUs do not run through tsc_init(), so set up
+* all the scale factors for 

Re: CONFIG_NO_HZ breaks blktrace timestamps

2008-01-10 Thread David Dillow

On Thu, 2008-01-10 at 23:44 +0100, Guillaume Chazarain wrote:
 David Dillow [EMAIL PROTECTED] wrote:
 
  At the moment, I'm not sure how to track this farther, or how to fix it
  properly. Any advice would be appreciated.
 
 Just out of curiosity, could you try the appended cumulative patch and
 report .clock_warps, .clock_overflows and .clock_underflows as you did.

With those patches, CONFIG_NO_HZ works just fine.

Patched kernel, nohz=off:
now at 214257.820809 msecs
  .clock : 214212.727559
  .idle_clock: 0.00
  .prev_clock_raw: 244569.402345
  .clock_warps   : 0
  .clock_overflows   : 577
  .clock_underflows  : 213887
  .clock_deep_idle_events: 4
  .clock_max_delta   : 0.999830

Patched kernel, nohz=on:
now at 248931.524381 msecs
  .clock : 248745.808465
  .idle_clock: 0.00
  .prev_clock_raw: 270911.098507
  .clock_warps   : 0
  .clock_overflows   : 69
  .clock_underflows  : 784
  .clock_deep_idle_events: 4
  .clock_max_delta   : 100.639397

Running my disk test, blktrace is getting the proper timestamps now with
CONFIG_NO_HZ.

Thanks!
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/