Re: Use of absolute timeouts for oneshot timers

2007-03-11 Thread Thomas Gleixner
On Sat, 2007-03-10 at 16:42 -0800, Jeremy Fitzhardinge wrote:
> Thomas Gleixner wrote:
> > It's simply enforced in NO_HZ, HIGHRES mode as we operate in absolute
> > time, which is read back from the clocksource, even if we use a relative
> > value for real hardware clock event devices to program the next event.
> > We calculate the delta between the absolute event and now. So we never
> > get an accumulating error.
> >
> > What problem are you observing?
> 
> Actually, two things.  There were the unexpected pauses during boot,
> which are trivially fixable by not using the Xen periodic timer and
> using the single-shot fallback.
> 
> But I'm making the more general observation that if you use an absolute
> rather than relative time to set the single-shot timeout, then you have
> to deal with a long-term cumulative drift between the kernel's monotonic
> time and the hypervisor's monotonic time.  This can happen even if your
> clocksource is derived directly from the hypervisor monotonic time,
> because running ntp will warp the kernel's time, and so it will drift
> with respect to the hypervisor clock.  You can only avoid this by 1) not
> allowing adjtime, or 2) making those same adjtime warps to the
> hypervisor time.  Neither of these is a good general solution.

Sigh, yes. Using a relative time for the next event is probably the
least ugly solution.

tglx



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Use of absolute timeouts for oneshot timers

2007-03-10 Thread Jeremy Fitzhardinge
Thomas Gleixner wrote:
> It's simply enforced in NO_HZ, HIGHRES mode as we operate in absolute
> time, which is read back from the clocksource, even if we use a relative
> value for real hardware clock event devices to program the next event.
> We calculate the delta between the absolute event and now. So we never
> get an accumulating error.
>
> What problem are you observing?

Actually, two things.  There were the unexpected pauses during boot,
which are trivially fixable by not using the Xen periodic timer and
using the single-shot fallback.

But I'm making the more general observation that if you use an absolute
rather than relative time to set the single-shot timeout, then you have
to deal with a long-term cumulative drift between the kernel's monotonic
time and the hypervisor's monotonic time.  This can happen even if your
clocksource is derived directly from the hypervisor monotonic time,
because running ntp will warp the kernel's time, and so it will drift
with respect to the hypervisor clock.  You can only avoid this by 1) not
allowing adjtime, or 2) making those same adjtime warps to the
hypervisor time.  Neither of these is a good general solution.

Therefore, the only useful way to set a single-shot timer is by using
relative rather than absolute time, and making sure the delta is not too
large.  The guest and hypervisor may (and in general, will) have
drifting clocks, but the error will never be too large to deal with.
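
As a rough sketch (the bounds and names here are hypothetical, not the
actual Xen clockevent code), the clamping would look something like:

```c
#include <stdint.h>

/* Illustrative bounds, not real Xen values */
#define MIN_DELTA_NS 100ULL                  /* never program an event in the past */
#define MAX_DELTA_NS (4ULL * 1000000000ULL)  /* cap drift exposure at ~4s */

/* Convert an absolute kernel expiry into a clamped relative delta, so
 * guest/hypervisor drift only ever produces a small local error. */
static uint64_t oneshot_delta_ns(uint64_t expires_ns, uint64_t now_ns)
{
	uint64_t delta = expires_ns > now_ns ? expires_ns - now_ns : 0;

	if (delta < MIN_DELTA_NS)
		delta = MIN_DELTA_NS;
	if (delta > MAX_DELTA_NS)
		delta = MAX_DELTA_NS;
	return delta;
}
```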

J


Re: Use of absolute timeouts for oneshot timers

2007-03-10 Thread Jeremy Fitzhardinge
Thomas Gleixner wrote:
> The clocksource is not used until the clocksource is installed. Also the
> periodic mode during boot, when the clock event device supports periodic
> mode, is not reading the time. It relies on the clock event device
> getting it straight.

Yes.  This could be one source of error, where I compute the offset
hypervisor_time - ktime_get(), but ktime_get() may drift with respect to
hypervisor time while using a periodic jiffies timebase.

> Once we switch to NO_HZ or HIGHRES the clock event device is directly
> coupled to the clocksource.
>   
OK.  Erm, but not in the sense that you always choose the xen/hpet/lapic
clocksource+clockevent together; there's no direct linkage between the
two kinds of device.  But there's the coupling where the clocksource is
always used to directly measure the clockevent's behaviour.

> Once we switched over to the clocksource, everything should be in
> perfect sync.
>   

Assuming that the clocksource and the clockevent device have
close-enough timebases.

>> Or perhaps this is a property of the whole clock subsystem: that
>> clockevents must be paired with clocksources.  But it's not obvious to me
>> that this is enforced, or even acknowledged.
>> 
>
> It's simply enforced in NO_HZ, HIGHRES mode as we operate in absolute
> time, which is read back from the clocksource, even if we use a relative
> value for real hardware clock event devices to program the next event.
> We calculate the delta between the absolute event and now. So we never
> get an accumulating error.
>   

Right, but if the clocksource and the clockevent device have a relative
drift, and we use the clocksource to compute that we need a 500ns delay
but the clockevent device ends up delivering the oneshot event 750ns (or
250ns) later, then things are going to be locally upset, even if the
overshoot is taken into account the next time the clockevent oneshot is
programmed.  (Of course, you'd hope the drift would never really be that
bad, and 2^32 ns only gives you a ~4s window to screw up.)
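
To make "locally upset, but not cumulative" concrete, here's a toy model
(the 1.5x rate and 500ns period are made-up numbers):

```c
/* Toy model: the clockevent device runs `rate` times slower than the
 * clocksource (rate > 1.0 means events fire late).  Each one-shot is
 * programmed as a relative delta toward a fixed absolute target, so an
 * overshoot shrinks the next delta instead of accumulating. */
static double max_local_error(double rate, int nevents)
{
	double now = 0.0, worst = 0.0;

	for (int i = 1; i <= nevents; i++) {
		double target = i * 500.0;	/* absolute expiry, ns */
		double delta  = target - now;	/* what we program */
		now += delta * rate;		/* when the device really fires */
		if (now - target > worst)
			worst = now - target;	/* per-event error only */
	}
	return worst;
}
```

With a 50% fast device the worst error is the first event's 250ns
overshoot; later errors stay bounded rather than growing without limit.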

> What problem are you observing?
>   

Unexpected pauses during boot.  I think the real problem is that Xen
periodic timer events are not delivered unless the vcpu is actually
running (ie, they're specifically intended for timeslicing rather than
general periodic events).  Perhaps the real fix in this case is to just
remove the periodic feature flag.


J


Re: Use of absolute timeouts for oneshot timers

2007-03-10 Thread Thomas Gleixner
On Sat, 2007-03-10 at 14:52 -0800, Jeremy Fitzhardinge wrote:
> When booting under Xen, you'll get this if you're using both the xen
> clocksource and clockevent drivers.  However, it seems that during boot
> on a NO_HZ HIGHRES_TIMERS system, the kernel does not use the Xen
> clocksource until it switches to highres timer mode.  This means that
> during boot the kernel's monotonic clock is drifting with respect to the
> hypervisor, and all timeouts are unreliable.

The clocksource is not used until the clocksource is installed. Also,
the periodic mode during boot, when the clock event device supports
periodic mode, does not read the time; it relies on the clock event
device getting it straight. That's not a big deal during boot, and on a
kernel with NO_HZ=n and HIGHRES=n the periodic tick only updates
jiffies. If the only clocksource is jiffies, then we have to live with
it, and we do not switch to NO_HZ/HIGHRES, as we would lose track of
time.

Once we switch to NO_HZ or HIGHRES the clock event device is directly
coupled to the clocksource.

> Initially I was just computing the kernel-hypervisor offset at boot
> time, but then I changed it to recompute it every time the timer mode
> changes.  However, this didn't really help, and I was still getting
> unpredictable timeouts during boot.  I've changed it to just compute the
> hypervisor absolute time directly using the delta each time the oneshot
> timer is set, which will definitely be reliable (if the kernel and
> hypervisor have drifting timebases then the meaning of an X ns delta will be
> different, but at least that's a local error rather than a long-term
> cumulative error).

We do not really care up to the point where the high resolution
clocksource (e.g. TSC, PM-Timer or HPET on real hardware) becomes
active. Early boot is fragile, and we switch over to the high res
clocksource and highres/nohz when things have stabilized.

> My analysis might be wrong here (I suspect the Xen periodic timer may
> have unexpected behaviour), but the overall conclusion still stands:
> using an absolute timeout only works if the kernel and hypervisor have
> non-drifting timebases.  I think it's too fragile for a clockevent
> implementation to assume that a particular clocksource is in use to get
> reliable results.

Once we switched over to the clocksource, everything should be in
perfect sync.

> Or perhaps this is a property of the whole clock subsystem: that
> clockevents must be paired with clocksources.  But it's not obvious to me
> that this is enforced, or even acknowledged.

It's simply enforced in NO_HZ, HIGHRES mode as we operate in absolute
time, which is read back from the clocksource, even if we use a relative
value for real hardware clock event devices to program the next event.
We calculate the delta between the absolute event and now. So we never
get an accumulating error.
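
Schematically (illustrative names, not the actual kernel code), the
absolute bookkeeping looks something like:

```c
#include <stdint.h>

/* Schematic: absolute bookkeeping for a periodic tick.  `*expiry_ns` is
 * advanced in absolute time; the hardware only ever sees the freshly
 * measured delta against the clocksource, so per-event programming
 * latency cannot accumulate across events. */
static uint64_t next_tick_delta(uint64_t *expiry_ns, uint64_t period_ns,
				uint64_t now_ns)
{
	/* advance the absolute expiry past `now`, keeping the period grid */
	while (*expiry_ns <= now_ns)
		*expiry_ns += period_ns;

	return *expiry_ns - now_ns;	/* relative value for the hardware */
}
```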

What problem are you observing?

tglx




Use of absolute timeouts for oneshot timers

2007-03-10 Thread Jeremy Fitzhardinge
I've been thinking a bit more about how useful an absolute timeout is
for a oneshot timer in a virtual environment.

In principle, absolute times are generally preferable.  A relative
timeout means "timeout in X ns from now", but the meaning of "now" is
ambiguous, particularly if the vcpu can be preempted at any time, which
means the determination of "now" can be arbitrarily deferred.

However, an absolute time is only meaningful if the kernel and
hypervisor are operating off the same timebase (ie, no drift).  In
general, the kernel's monotonic timer is going to start from 0ns when
the virtual machine is booted, and the hypervisor's is going to start at
0ns when the hypervisor is booted.  If they're operating off the same
timebase, then in principle you can work out a constant offset between
the two, and use that for converting a kernel absolute time into a
hypervisor absolute time.
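
A sketch of that constant-offset scheme (hypothetical helper names; the
real Xen interfaces differ):

```c
#include <stdint.h>

/* One-time calibration: hypervisor and kernel clocks read back-to-back.
 * Only valid while neither timebase is warped (e.g. by NTP). */
static int64_t hv_offset_ns;

static void calibrate_offset(uint64_t hv_now_ns, uint64_t kernel_now_ns)
{
	hv_offset_ns = (int64_t)(hv_now_ns - kernel_now_ns);
}

/* Convert a kernel absolute expiry into a hypervisor absolute expiry.
 * Unsigned wrap-around makes a negative offset work out correctly. */
static uint64_t kernel_to_hv_abs(uint64_t kernel_abs_ns)
{
	return kernel_abs_ns + (uint64_t)hv_offset_ns;
}
```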

When booting under Xen, you'll get this if you're using both the xen
clocksource and clockevent drivers.  However, it seems that during boot
on a NO_HZ HIGHRES_TIMERS system, the kernel does not use the Xen
clocksource until it switches to highres timer mode.  This means that
during boot the kernel's monotonic clock is drifting with respect to the
hypervisor, and all timeouts are unreliable.

Initially I was just computing the kernel-hypervisor offset at boot
time, but then I changed it to recompute it every time the timer mode
changes.  However, this didn't really help, and I was still getting
unpredictable timeouts during boot.  I've changed it to just compute the
hypervisor absolute time directly using the delta each time the oneshot
timer is set, which will definitely be reliable (if the kernel and
hypervisor have drifting timebases then the meaning of an X ns delta will be
different, but at least that's a local error rather than a long-term
cumulative error).

My analysis might be wrong here (I suspect the Xen periodic timer may
have unexpected behaviour), but the overall conclusion still stands:
using an absolute timeout only works if the kernel and hypervisor have
non-drifting timebases.  I think it's too fragile for a clockevent
implementation to assume that a particular clocksource is in use to get
reliable results.

Or perhaps this is a property of the whole clock subsystem: that
clockevents must be paired with clocksources.  But it's not obvious to me
that this is enforced, or even acknowledged.

(Of course, if the drift can be characterized, then you can compensate
for it, but this seems too complex to be the right answer.  And drift
compensation is numerically much simpler for small 32-bit deltas
compared to 64-bit absolute times.)
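
For instance, the usual fixed-point mult/shift trick (the concrete
values below are purely illustrative) scales a 32-bit delta with a
single 64-bit multiply, which would overflow for a 64-bit absolute time:

```c
#include <stdint.h>

/* Scale a 32-bit delta by a drift factor of mult / 2^shift.  The
 * intermediate product fits in 64 bits only because the input delta is
 * 32-bit; applying the same trick to a 64-bit absolute timestamp would
 * overflow the multiply. */
static uint32_t scale_delta(uint32_t delta_ns, uint32_t mult,
			    unsigned int shift)
{
	return (uint32_t)(((uint64_t)delta_ns * mult) >> shift);
}
```

With shift = 24, mult = 1 << 24 is the identity, and mult = 3 << 23
models a device running 1.5x slow.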

J