Re: [collectd] long sleeps when using collectd and ntpd

2013-09-24 Thread Florian Forster
Hello Lode,

On Tue, Sep 24, 2013 at 10:53:02AM +0200, danta wrote:
> Whenever ntpd sets the clock in the past (because of a clock drift),
> collectd sleeps until it is back at its 'normal' time.

> When trying to debug the problem, we found that the do_loop function
> in collectd.c determines the amount of time to sleep based on times
> obtained from the "gettimeofday" function. Was this a design choice?
> Wouldn't it be better to use a monotonic clock?

It would be possible to use a monotonic clock _on Linux_, but according
to POSIX (i.e. to be portable) one must use a real-time clock. Also, I
don't think this is what causes the behavior; the problem is a bit more
complicated.

Each callback, after it returns, is put into a heap which is sorted by
the absolute, real time when it should be called next. I.e. the loop in
src/collectd.c is still waking up periodically, but next to no work is
done inside this loop, especially not the reads.
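
A minimal sketch of the effect described above, not collectd's actual code:
callbacks sit in a heap keyed by the absolute wall-clock time at which they
are next due, and the loop only runs those whose time has passed. FakeClock,
run_due and the 10-second interval are hypothetical names and values chosen
for illustration, and times are plain seconds since midnight.

```python
import heapq


class FakeClock:
    """Hypothetical stand-in for the wall clock so a step can be simulated."""

    def __init__(self, now):
        self.now = now

    def time(self):
        return self.now


def run_due(heap, clock, interval):
    """Run every callback whose scheduled wall-clock time has passed."""
    ran = []
    while heap and heap[0][0] <= clock.time():
        due, name = heapq.heappop(heap)
        ran.append(name)
        # Assumption: the next run is scheduled relative to the absolute
        # time at which this run was due.
        heapq.heappush(heap, (due + interval, name))
    return ran


clock = FakeClock(36000)            # 10:00:00
heap = [(36000, "cpu")]
first = run_due(heap, clock, 10)    # runs once, rescheduled for 10:00:10
clock.now = 35100                   # clock stepped back to 09:45:00
stalled = run_due(heap, clock, 10)  # nothing is due yet
```

After the step the loop keeps waking up, but run_due returns an empty list
until the wall clock reaches 10:00:10 again, which matches the "long sleep"
that was reported.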

If you were to "fix" this, you'd for example end up with the metric for
12:00:00 being followed by the metric for 11:45:10, i.e. almost
15 minutes into the "past". This would result in collectd refusing this
measurement because it is "too old". This would continue until the time
has progressed enough.

Last, but not least, metrics _have_ to use the wall clock time.
Otherwise you won't be able to correlate between metrics and alerts /
behavior you observe.

Hope this helps, best regards,
—octo
-- 
collectd – The system statistics collection daemon
Website: http://collectd.org
Google+: http://collectd.org/+
GitHub:  https://github.com/collectd
Twitter: http://twitter.com/collectd


___
collectd mailing list
collectd@verplant.org
http://mailman.verplant.org/listinfo/collectd


Re: [collectd] long sleeps when using collectd and ntpd

2013-09-24 Thread Thomas Harold

On 9/24/2013 4:53 AM, danta wrote:
> We have some problems using collectd and ntpd.
>
> Whenever ntpd sets the clock in the past (because of a clock drift),
> collectd sleeps until it is back at its 'normal' time.
> To give an example:
> - Suppose it's 10:00am,
> - ntpd sees a large clock drift and sets the clock to 09:45am
> - collectd will sleep till 10:00am



https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Virtualization/chap-Virtualization-KVM_guest_timing_management.html

http://kb.vmware.com/kb/1006427

How big is the time step?  You may want to look at the -g and -x options 
that are being passed to the ntpd daemon.  Under RHEL, these are set in 
/etc/sysconfig/ntpd.


http://www.eecis.udel.edu/~mills/ntp/html/ntpd.html

And make sure that there is no periodic process running "ntpdate" 
somewhere in the background.  You may want to check through the logfiles 
to figure out whether it is ntpd that is stepping the clock or some 
other process.





Re: [collectd] long sleeps when using collectd and ntpd

2013-09-24 Thread Yves Mettier

Hello,

I have already noticed this problem too.
Forget about ntpd. Some may try to understand what ntpd has to do with 
it and maybe suggest solutions for ntpd. The problem is the time on the 
machine.


How to reproduce:
0/ Disable ntpd (so you can see that this has nothing to do with ntpd)
1/ Set the date&time in the future on your server (it's "t0", set it to
t1, for example t1="t0 + 2 hours")
2/ Wait 2 or 3 minutes so that collectd collects and sends data to the
main collectd server (and breaks your rrd files)
3/ Set the date&time back to the correct value (t0 + the time elapsed
since step 1)
4/ No data will be written to the rrd files until the real date&time
reaches t1 (in our example, we lose 2 hours of data).


The problem is inside the rrd files.
You can run "rrdtool last .rrd" and see the timestamp of
the last value written.
As far as I know about rrd files and rrdtool, you cannot write data with
timestamps before that value.
So if you set a bad date&time in the future, it will continue to work.
But when you set a date&time in the past (or, as in the example, back
from the future), it's broken.
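
To illustrate the rule above, here is a toy model in Python. It is not the
rrdtool API: TinyStore and its update method are hypothetical, and the only
behavior modeled is that an update is refused unless its timestamp is newer
than the last one written.

```python
class TinyStore:
    """Toy model of a store that only accepts strictly newer timestamps."""

    def __init__(self):
        self.last = None
        self.rows = []

    def update(self, timestamp, value):
        """Return True if the value was written, False if it was refused."""
        if self.last is not None and timestamp <= self.last:
            return False
        self.last = timestamp
        self.rows.append((timestamp, value))
        return True


t0 = 1000                 # arbitrary "correct" time in seconds
t1 = t0 + 7200            # clock wrongly set two hours ahead
store = TinyStore()
wrote_future = store.update(t1 + 60, 1.0)    # accepted while clock is ahead
wrote_fixed = store.update(t0 + 180, 2.0)    # refused after the clock is fixed
wrote_later = store.update(t1 + 61, 3.0)     # accepted once real time passes t1
```

In this model every sample taken between the correction and t1 + 60 is
refused, which is the two-hour gap from the example.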


It is a bigger problem when the date&time goes crazy and jumps years ahead.
When this happens, you can consider your rrd file as corrupted.



Well, here is my version of the description of the problem.
If anybody has an idea on how to fix it, I'm interested too.

Regards,
Yves




--
- Homepage   - http://ymettier.free.fr -
- GPG key- http://ymettier.free.fr/gpg.txt -
- C en action- http://ymettier.free.fr/livres/C_en_action_ed2.html -
- Guide Survie C - http://www.pearson.fr/livre/?GCOI=27440100673730-



Re: [collectd] long sleeps when using collectd and ntpd

2013-09-24 Thread Dave Cottlehuber
On 24. September 2013 at 12:12:15, danta (merte...@axsguard.net) wrote:
>
>Hi Dave,
>
>We see the problem on virtual machines, where the host machine has its
>clock set to the local time. I realize this is a wrong setup but
>unfortunately we can't change the settings on the host machine. Also the
>host clock seems rather unstable, leading to large drifts if the ntpd
>server can't be reached for large periods of time.
>
>The long sleep problem also occurs if you correct your date to a date in
>the past.
>
>For the moment we patched the collectd code to exit if it detects a
>large time drift (a watchdog will then restart it), but I think this
>isn't a clean solution.
>
>Best regards,
>Lode

Makes sense, OTOH I would imagine this would cause problems with lots of other 
things too.

From Twitter, serendipitously:

http://infiniteundo.com/post/25326999628/falsehoods-programmers-believe-about-time

A+
Dave





Re: [collectd] long sleeps when using collectd and ntpd

2013-09-24 Thread danta

Hi Dave,

We see the problem on virtual machines, where the host machine has its
clock set to the local time. I realize this is a wrong setup but 
unfortunately we can't change the settings on the host machine. Also the 
host clock seems rather unstable, leading to large drifts if the ntpd 
server can't be reached for large periods of time.


The long sleep problem also occurs if you correct your date to a date in 
the past.


For the moment we patched the collectd code to exit if it detects a 
large time drift (a watchdog will then restart it), but I think this 
isn't a clean solution.
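
A minimal sketch of how such a drift check could look, written in Python for
brevity. This is not the patch mentioned above: clock_step, stepped and the
5-second threshold are assumptions for illustration. The idea is to sample
both the wall clock and a monotonic clock and compare how far each advanced.

```python
import time


def sample():
    """Return one (wall clock, monotonic clock) reading in seconds."""
    return time.time(), time.monotonic()


def clock_step(prev_wall, prev_mono, wall, mono):
    """Return how far the wall clock moved relative to the monotonic clock.

    A negative result means the wall clock was set back, a positive result
    means it was set forward.
    """
    return (wall - prev_wall) - (mono - prev_mono)


def stepped(prev_wall, prev_mono, wall, mono, threshold=5.0):
    """Return True if the wall clock jumped by more than threshold seconds."""
    return abs(clock_step(prev_wall, prev_mono, wall, mono)) > threshold


# 10 seconds of real time pass, but the wall clock goes from 10:00:00
# to 09:45:10 (values in seconds since midnight).
jump = clock_step(36000.0, 100.0, 35110.0, 110.0)
```

A loop would call sample() once per iteration, pass the previous and current
readings to stepped(), and exit or reschedule when it returns True.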


Best regards,
Lode








Re: [collectd] long sleeps when using collectd and ntpd

2013-09-24 Thread Dave Cottlehuber
On 24. September 2013 at 10:56:07, danta (merte...@axsguard.net) wrote:
>
>We have some problems using collectd and ntpd.
>
>Whenever ntpd sets the clock in the past (because of a clock drift),
>collectd sleeps until it is back at its 'normal' time.
>To give an example:
>- Suppose it's 10:00am,
>- ntpd sees a large clock drift and sets the clock to 09:45am
>- collectd will sleep till 10:00am
>
>When trying to debug the problem, we found that the do_loop function in
>collectd.c determines the amount of time to sleep based on times
>obtained from the "gettimeofday" function. Was this a design choice?
>Wouldn't it be better to use a monotonic clock?
>
>Greetz,
>Lode

Hi Lode

I can't answer for collectd but this doesn't sound like the correct usage for 
ntpd.

Is the large drift just an example, and the real timescale a delay of a few seconds?

Your ntpd should use a driftfile which over time will keep things in line most 
of the time, even if your ntpd servers are temporarily unavailable.

Are these VMs that are being hibernated or similar? If so, use host clock sync 
and not ntpd. Obviously the hosts will use ntpd though!

A+
Dave





[collectd] long sleeps when using collectd and ntpd

2013-09-24 Thread danta

We have some problems using collectd and ntpd.

Whenever ntpd sets the clock in the past (because of a clock drift), 
collectd sleeps until it is back at its 'normal' time.

To give an example:
- Suppose it's 10:00am,
- ntpd sees a large clock drift and sets the clock to 09:45am
- collectd will sleep till 10:00am

When trying to debug the problem, we found that the do_loop function in
collectd.c determines the amount of time to sleep based on times
obtained from the "gettimeofday" function. Was this a design choice?
Wouldn't it be better to use a monotonic clock?


Greetz,
Lode
