[ntp:questions] NTP.log interpretation

2014-04-18 Thread GregL
I'm trying to determine what this section of an ntp.log is telling me.

This is from a default xntpd instance on AIX 5.3.

ntp.conf has two servers listed, 34 and 97, with the '34' server preferred.

The first couple of lines are the last in a *long* list of hourly logged
offset messages.

It appears to me that *something* caused the system to sync to the
non-preferred server.  My guess is a lack of response from the '34' server?

Q#1)  Is that a safe assumption?  How many times, and how long would it
wait to get a response from '34'?

It turns out that due to a configuration issue, the '34' server was
actually out of time sync, and the second, non preferred server,  '97', was
actually the correct time. They were seconds off.   ('34' ended up not
being configured to an outside time server pool... only to sync with
'97'... but clearly there was a problem with that 'sync' happening.).


After this first reset, the system went into a cycle of resetting back and
forth.  I'm trying to understand the logic here.  Is it telling me that the
response from the preferred server was sporadic enough that it would
regularly flip to sync with the secondary server, hence the flip-flopping
back and forth in time?  This continued for hours until the configuration
issue with the two time servers was fixed, so they were both "in sync" with
the correct time... then back to regular old boring offset log entries.

Q#2)  Does "synchronisation lost" mean a lack of response from the preferred
server, or does it mean a dramatic difference in time, such that the clock
needs to be reset (as another log message indicates)?

Thanks for any help in learning how to read this.  I have no indications of
any network issues, but given that NTP uses UDP, they could be hard to track.
It seems suspicious that once the two time servers were in sync... the issues
went away, i.e. a *real* network issue did not exist... perhaps just a small
hiccup that caused the flip-flopping to occur and continue until fixed.

Right now, my plan is to add the "-x" option to the xntpd startup;
hopefully that would avoid setting the clock backwards.   Additionally,
actions are being taken to make sure the two time servers never get that far
out of sync without throwing out some alerts.

Any advice/counsel concerning this scenario would be greatly appreciated.
I would like a better understanding of what the log is telling me, which is
why I am here ;-)

Thanks!



10 Apr 00:57:22 xntpd[245888]: offset -0.000638 freq 1.908 poll 6
10 Apr 01:57:22 xntpd[245888]: offset -0.000978 freq 2.094 poll 6
10 Apr 02:19:40 xntpd[245888]: synchronized to 172.16.32.34, stratum=3
10 Apr 02:21:05 xntpd[245888]: synchronisation lost
10 Apr 02:21:48 xntpd[245888]: synchronized to 172.16.56.97, stratum=1
10 Apr 02:27:48 xntpd[245888]: synchronized to 172.16.32.34, stratum=2
10 Apr 02:35:18 xntpd[245888]: time reset (step) -2.580356 s
10 Apr 02:35:18 xntpd[245888]: synchronized to 172.16.56.97, stratum=3
10 Apr 02:35:18 xntpd[245888]: synchronisation lost
10 Apr 02:35:18 xntpd[245888]: system event 'event_clock_reset' (0x05) status 'sync_alarm, sync_unspec, 15 events, event_peer/strat_chg' (0xc0f4)
10 Apr 02:35:18 xntpd[245888]: system event 'event_sync_chg' (0x03) status 'sync_alarm, sync_unspec, 15 events, event_clock_reset' (0xc0f5)
10 Apr 02:35:18 xntpd[245888]: system event 'event_peer/strat_chg' (0x04) status 'sync_alarm, sync_unspec, 15 events, event_sync_chg' (0xc0f3)
10 Apr 02:35:50 xntpd[245888]: peer 172.16.56.97 event 'event_reach' (0x84) status 'reach, conf, 15 events, event_reach' (0x90f4)
10 Apr 02:36:22 xntpd[245888]: peer 172.16.32.34 event 'event_reach' (0x84) status 'reach, conf, 15 events, event_reach' (0x90f4)
10 Apr 02:40:06 xntpd[245888]: synchronized to 172.16.56.97, stratum=3
10 Apr 02:40:09 xntpd[245888]: time reset (step) 2.569861 s
10 Apr 02:40:09 xntpd[245888]: synchronisation lost
10 Apr 02:40:09 xntpd[245888]: system event 'event_clock_reset' (0x05) status 'sync_alarm, sync_unspec, 15 events, event_peer/strat_chg' (0xc0f4)
10 Apr 02:40:41 xntpd[245888]: peer 172.16.32.34 event 'event_reach' (0x84) status 'reach, conf, 15 events, event_reach' (0x90f4)
10 Apr 02:41:13 xntpd[245888]: peer 172.16.56.97 event 'event_reach' (0x84) status 'reach, conf, 15 events, event_reach' (0x90f4)
10 Apr 02:44:57 xntpd[245888]: system event 'event_peer/strat_chg' (0x04) status 'sync_alarm, sync_ntp, 15 events, event_clock_reset' (0xc6f5)
10 Apr 02:44:57 xntpd[245888]: synchronized to 172.16.32.34, stratum=2
10 Apr 02:44:54 xntpd[245888]: time reset (step) -2.877219 s
10 Apr 02:44:54 xntpd[245888]: synchronisation lost
10 Apr 02:44:54 xntpd[245888]: system event 'event_clock_reset' (0x05) status 'sync_alarm, sync_unspec, 15 events, event_peer/strat_chg' (0xc0f4)
10 Apr 02:45:26 xntpd[245888]: peer 172.16.56.97 event 'event_reach' (0x84) status 'reach, conf, 15 events, event_reach' (0x90f4)
10 Apr 02:45:58 xntpd[245888]: peer 17

Re: [ntp:questions] NTP.log interpretation

2014-04-18 Thread mike cook

A classic example of the adage "A man with two clocks doesn't know what the
time is". Neither does NTP.
It will hop between the two until the two agree. This is a bad configuration.

On 18 Apr 2014, at 05:53, GregL wrote:

<snip>

> Right now, my plan is to add the "-x" option to the xntpd startup;
> hopefully that would avoid setting the clock backwards.   Additionally,
> actions are taken to make sure the two time servers never get that far out
> of sync without throwing out some alerts.
> 

  What you should do is to add more servers to the config.

___
questions mailing list
questions@lists.ntp.org
http://lists.ntp.org/listinfo/questions


Re: [ntp:questions] NTP.log interpretation

2014-04-18 Thread GregL
>
>
> A classic example of the adage " A man with two clocks doesn't know what
> the time is" . So neither can NTP.
> It will hop between the two until the two agree. This is a bad
> configuration.
>
>
That is certainly the way it feels! ;-)


>
>   What you should do is to add more servers to the config.
> 


What about the idea of going to only one entry, with that entry served by
a DNS load balancer that chooses one of two internal time servers to check?
Each of those is configured to point at a pool of time servers (4 each).


Re: [ntp:questions] NTP.log interpretation

2014-04-18 Thread Miroslav Lichvar
On Fri, Apr 18, 2014 at 09:01:09AM -0500, GregL wrote:
> >   What you should do is to add more servers to the config.
> 
> What about the idea of going to only one entry, but that entry is served by
> a DNS load balancer to choose one of two internal time servers to check.
>  Each of those, is configured to point at a pool of time servers (4 each).

Well, that will prevent the client from detecting that it's getting the
wrong time. Is that what you want?

From the log it seems that at least one server is completely wrong;
the offset between the two servers is around 3 seconds! I'd suggest
fixing that first.
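
The size of the disagreement can be read off the "time reset (step)" lines
in the log. A minimal Python sketch, assuming the log format shown earlier in
this thread (the regular expression and the function name are illustrative,
not part of any NTP tool):

```python
import re

# "time reset" lines copied from the log posted earlier in this thread.
LOG = """\
10 Apr 02:35:18 xntpd[245888]: time reset (step) -2.580356 s
10 Apr 02:40:09 xntpd[245888]: time reset (step) 2.569861 s
10 Apr 02:44:54 xntpd[245888]: time reset (step) -2.877219 s
"""

# Assumed line format: "time reset (step) <signed seconds> s".
STEP_RE = re.compile(r"time reset \(step\) (-?\d+\.\d+) s")


def step_sizes(log_text):
    """Return the logged step sizes in seconds, in log order."""
    return [float(m.group(1)) for m in STEP_RE.finditer(log_text)]


steps = step_sizes(LOG)
for s in steps:
    direction = "backwards" if s < 0 else "forwards"
    print(f"{abs(s):.3f} s {direction}")

# Steps of similar size with alternating sign suggest the client is
# bouncing between two sources that disagree by about that amount.
alternating = all(a * b < 0 for a, b in zip(steps, steps[1:]))
print("alternating:", alternating)
```

Steps of about 2.6 to 2.9 seconds in alternating directions are what one
would expect from a client hopping between two servers that far apart.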

-- 
Miroslav Lichvar


Re: [ntp:questions] NTP.log interpretation

2014-04-18 Thread GregL
> On Fri, Apr 18, 2014 at 09:01:09AM -0500, GregL wrote:
> > >   What you should do is to add more servers to the config.
> >
> > What about the idea of going to only one entry, but that entry is served
> > by a DNS load balancer to choose one of two internal time servers to
> > check.  Each of those, is configured to point at a pool of time servers
> > (4 each).
>
> Well, that will prevent the client from detecting it's getting wrong
> time. Is that what you want?
>
>
I'm wrestling with that very question.  With 100+ systems, we have a far
greater problem if some systems are *off* and others are not.

> From the log it seems that at least one server is completely wrong,
> the offset between the two servers is around 3 seconds! I'd suggest to
> fix that first.
>
>
Yes, clearly the root of the most recent problem was a faulty configuration
that allowed our internal time servers to grow to nearly 50 seconds apart
at some point... and that wreaked havoc in many, many areas.

That is fixed, and our two internal time servers *should* be correct.

Now, I'm just planning on making changes to the ntp.conf, like adding the
"-x" parameter.  I'm hoping that that will prevent huge time resets
backwards in time...should that ever be even possible again.

But was the "synchronisation lost" message logged *because* ntp saw such a
large time difference between the peer servers... and chose one to sync
to... resulting in the time reset message?


Re: [ntp:questions] NTP.log interpretation

2014-04-18 Thread Miroslav Lichvar
On Fri, Apr 18, 2014 at 10:38:10AM -0500, GregL wrote:
> But, was the "sychronization lost" message *because* ntp saw the time
> difference so great on peer servers...and chose one to synch to...resulting
> in the time reset message?

It seems so. Not sure how close this is to the version you are
running, but in xntp3-5.93e (dated 1998) it seems the system peer is
unselected (and the message logged) on every clock step.

-- 
Miroslav Lichvar


Re: [ntp:questions] NTP.log interpretation

2014-04-18 Thread William Unruh
On 2014-04-18, GregL  wrote:
>> On Fri, Apr 18, 2014 at 09:01:09AM -0500, GregL wrote:
>> > >   What you should do is to add more servers to the config.
>> >
>> > What about the idea of going to only one entry, but that entry is served
>> > by a DNS load balancer to choose one of two internal time servers to
>> > check.  Each of those, is configured to point at a pool of time servers
>> > (4 each).
>>
>> Well, that will prevent the client from detecting it's getting wrong
>> time. Is that what you want?
>>
>>
> I'm wrestling with that very question.  With 100+ systems, we have a far
> greater problem if some systems are *off* and others are not.
>
>> From the log it seems that at least one server is completely wrong,
>> the offset between the two servers is around 3 seconds! I'd suggest to
>> fix that first.
>>
>>
> Yes, clearly the root of the most recent problem was a faulty configuration
> that allowed our internal time servers to grow to nearly 50 seconds apart
> at some point... and that wreaked havoc in many many areas.

What was causing that? Clearly one, or both, are not getting their time
from proper servers themselves. In your post there seemed to be a hint
that one of your servers was getting its time from the other. That is
a bad idea. It is no better than having just one server.

>
> That is fixed, and our two internal time servers *should* be correct.

>
> Now, I'm just planning on making changes to the ntp.conf, like adding the
> "-x" parameter.  I'm hoping that that will prevent huge time resets
> backwards in time...should that ever be even possible again.

ntpd will reset the time if it is off by more than 128 ms. Those highly
non-linear jumps are one of the "features" of ntpd. If you do not want
them, run chrony, for example. It will change the time smoothly. It will,
however, at times slew the time much faster than 500PPM to get the
time back on track.
>
> But, was the "sychronization lost" message *because* ntp saw the time
> difference so great on peer servers...and chose one to synch to...resulting
> in the time reset message?

And since there are only two, it had no idea which one to choose so it
chose randomly. 



Re: [ntp:questions] NTP.log interpretation

2014-04-18 Thread GregL
> > Yes, clearly the root of the most recent problem was a faulty
> > configuration that allowed our internal time servers to grow to nearly
> > 50 seconds apart at some point... and that wreaked havoc in many many
> > areas.
>
> What was causing that. Clearly one, or both, are not getting their time
> from proper servers themselves. In you post there seemed to be a hint
> that one of your servers was getting its time from the other. That is
> bad idea. It is no better than having just one server.
>
>
Yes.  From what I understand, one of the servers that serves as a time
server was rebuilt in January and the ntpd configuration was not put
back on.  It was an oversight.  Because of other services that run there,
that server *should* have kept in sync with the other server, but that sync
didn't appear to happen either.

Clearly a bad situation.  That is corrected now, with both internal time
servers independently configured to go to an external pool of NTP servers.
That is more of the "correct the problem" fix;  as a matter of looking at
the big picture, we are just trying to determine any other changes we
should make.   Building more dedicated time servers that aren't rebooted
weekly is one thing I will lobby for ;-)

I'm certainly learning more ;-)


--Greg


Re: [ntp:questions] NTP.log interpretation

2014-04-18 Thread Steve Kostecke
On 2014-04-18, William Unruh  wrote:

> On 2014-04-18, GregL  wrote:
>
>> Now, I'm just planning on making changes to the ntp.conf, like adding
>> the "-x" parameter. I'm hoping that that will prevent huge time
>> resets backwards in time...should that ever be even possible again.
>
> ntpd will reset the time if it is off by more than 128 ms.

The default step threshold is 128ms. This threshold is user
configurable.
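
As a rough illustration of what the threshold means for the offsets in the
original log, here is a simplified Python sketch. It models only the
threshold comparison; the real daemon applies further conditions before
stepping, and the function name is hypothetical.

```python
STEP_THRESHOLD_S = 0.128  # default step threshold mentioned above, in seconds


def correction_mode(offset_s, step_threshold_s=STEP_THRESHOLD_S):
    """Classify an offset in this simplified model: 'step' or 'slew'."""
    if abs(offset_s) > step_threshold_s:
        return "step"
    return "slew"


# Offsets taken from the log at the start of this thread.
for offset in (-0.000638, -2.580356, 2.569861):
    print(f"{offset:+.6f} s -> {correction_mode(offset)}")
```

The hourly sub-millisecond offsets fall well under the threshold, while the
2.5 second disagreements are far above it, which matches the "time reset
(step)" entries in the log.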

As for the '-x' option: using it could lead to having a clock so far off
from the correct time that ntpd will never be able to correct the offset
via slewing.

> Those higly non-linear jumps are one of the "features" of ntpd. If you
> do not want them, run for example chrony. It will smoothly change the
> time. It will however also at times slew the time much faster than
> 500PPM to get the time back on track.

500PPM per day is 43 seconds per day. One could argue that a clock which
requires more than 43 seconds per day of correction is fundamentally
broken and requires repair rather than calibration.
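
The 43 seconds figure follows directly from the rate. A minimal Python check
(the function name is illustrative):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def drift_per_day(rate_ppm):
    """Seconds gained or lost per day by a clock that is off by rate_ppm."""
    return rate_ppm * 1e-6 * SECONDS_PER_DAY


print(f"{drift_per_day(500):.1f} s/day")  # prints "43.2 s/day"
```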

-- 
Steve Kostecke 
NTP Public Services Project - http://support.ntp.org/



Re: [ntp:questions] NTP.log interpretation

2014-04-18 Thread William Unruh
On 2014-04-18, Steve Kostecke  wrote:
> On 2014-04-18, William Unruh  wrote:
>
>> On 2014-04-18, GregL  wrote:
>>
>>> Now, I'm just planning on making changes to the ntp.conf, like adding
>>> the "-x" parameter. I'm hoping that that will prevent huge time
>>> resets backwards in time...should that ever be even possible again.
>>
>> ntpd will reset the time if it is off by more than 128 ms.
>
> The default step threshold is 128ms. This threshold is user
> configurable.
>
> As for the '-x' option. Using it could lead to having a clock so far off
> from the correct time that ntpd will never be able to correct the offset
> via slewing. 
>
>> Those higly non-linear jumps are one of the "features" of ntpd. If you
>> do not want them, run for example chrony. It will smoothly change the
>> time. It will however also at times slew the time much faster than
>> 500PPM to get the time back on track.
>
> 500PPM per day is 43 seconds per day. One could argue that a clock which
> requires more than 43 seconds per day of correction is fundamentally
> broken and requires repair rather than calibration.

If the rate error were off by that much, that would be true. However, if
the clock is off by an hour, say, and you do not want it ever to jump
backwards, then at 43 sec per day it would take about 83 days to correct
that offset error (assuming that none of that 43 sec per day were taken up
by rate error). At the max linux slew rate of 100000 PPM (10%), it would
take about 10 hours to correct. Yes, your rate might be out by 10% but it
may be that never jumping is worth that to you.
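
The arithmetic behind these estimates can be checked with a short Python
sketch. It assumes the whole slew budget goes to removing the offset, and it
takes the faster case to be a 10% slew (100000 PPM), which is the rate
consistent with the ten-hour figure and the 10% remark. The function name is
illustrative.

```python
HOUR_S = 3600.0
DAY_S = 86400.0


def slew_time_s(offset_s, slew_ppm):
    """Seconds needed to remove offset_s when slewing at slew_ppm,
    assuming none of the slew is used up by rate error."""
    return abs(offset_s) / (slew_ppm * 1e-6)


# One hour of offset at 500 PPM: roughly 83 days.
print(f"500 PPM:    {slew_time_s(HOUR_S, 500) / DAY_S:.1f} days")
# One hour of offset at 100000 PPM (10%): roughly 10 hours.
print(f"100000 PPM: {slew_time_s(HOUR_S, 100_000) / HOUR_S:.1f} hours")
```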

Also, some clocks are just out by over 500PPM. That could be the case for
Linux with its clock calibration routine for a while (very rare but
possible). Since almost none of us are capable of rewriting the kernel,
"fixing the problem" was not an option. (On another bootup, the rate error
could be very different.)




Re: [ntp:questions] NTP.log interpretation

2014-04-18 Thread Jochen Bern
On 18.04.2014 20:45, questions-requ...@lists.ntp.org digested:
> From: GregL 
> 
> > > What about the idea of going to only one entry, but that entry is
> > > served by a DNS load balancer to choose one of two internal time
> > > servers to check.
> >
> > Well, that will [...]
> 
> I'm wrestling with that very question.  With 100+ systems, we have a far
> greater problem if some systems are *off* and others are not.

Am I missing something, or will the setup described above (and assuming
that the two servers disagree again) *force* your clients to do what you
just called "the far greater problem"? Namely, being randomly split
50/50 between the two servers, not even *knowing* of the other one?

(FWIW, ntpd does the DNS resolution *once* when loading its config and
works with the one IP obtained from then on, plans of implementing
automatic rotation/selection of "pool" servers in future versions
notwithstanding. And having potentially disagreeing NTP servers put
behind a V*IP* load balancer is discouraged as well.)
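
To make the 50/50 split concrete, here is a toy Python simulation. Everything
in it is an assumption for illustration: the server names are hypothetical,
the 3 second error is borrowed from the incident in this thread, and the DNS
load balancer is modelled as a uniform random choice made once per client.

```python
import random
from collections import Counter

# Hypothetical servers behind one DNS name, disagreeing by 3 seconds.
SERVER_ERROR_S = {"server-a": 0.0, "server-b": 3.0}


def assign_clients(n_clients, seed=0):
    """Model each client resolving the shared name once and keeping that
    single address, with no second source to compare against."""
    rng = random.Random(seed)
    names = sorted(SERVER_ERROR_S)
    return [rng.choice(names) for _ in range(n_clients)]


assignment = assign_clients(100)
counts = Counter(assignment)
for name, n in sorted(counts.items()):
    print(f"{n:3d} clients follow {name} (error {SERVER_ERROR_S[name]:+.1f} s)")
```

Each group of clients tracks its one server faithfully, so the fleet ends up
in two camps 3 seconds apart and no client logs anything unusual.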

Regards,
J. Bern


Re: [ntp:questions] NTP.log interpretation

2014-04-18 Thread William Unruh
On 2014-04-18, GregL  wrote:
>> > Yes, clearly the root of the most recent problem was a faulty
>> > configuration that allowed our internal time servers to grow to nearly
>> > 50 seconds apart at some point... and that wreaked havoc in many many
>> > areas.
>>
>> What was causing that. Clearly one, or both, are not getting their time
>> from proper servers themselves. In you post there seemed to be a hint
>> that one of your servers was getting its time from the other. That is
>> bad idea. It is no better than having just one server.
>>
>>
> Yes.  From what I understand, one of the servers that serves as a time
> server as was rebuilt in January and the ntpd configuration was not put
> back on.  It was an oversight.  Because of other services that run there,
> that server *should* have kept in sync with the other server, but that sync
> didn't appear to happen either.

Having two servers, one of which gets its time from the other, is pretty
useless. It is equivalent at the best of times to having only one
server, and at the worst to having none (as you discovered).

You should always try to make sure that your sources of time really are
independent. That is a problem with the pool: you can get two or three
servers, all of which get their time from the same stratum 1 server.

If you can do it, a better solution would be to have, say, one server with
a GPS PPS clock source, and the other(s) fed from the outside ntp pool.

>
> Clearly a bad situation.  That is corrected now, with both internal time
> servers independently configured to go to a external pool of NTP servers.
> That is more of the "correct the problem" fix;  as a matter of looking at
> the big picture, we are just trying to determine any other changes we
> should make.   Building more dedicated time servers that aren't rebooted
> weekly is one thing I will lobby for ;-)
>
> I'm certainly learning more ;-)
>
>
> --Greg



Re: [ntp:questions] NTP.log interpretation

2014-04-18 Thread GregL
On Fri, Apr 18, 2014 at 3:15 PM, Jochen Bern wrote:

> Am I missing something, or will the setup described above (and assuming
> that the two servers disagree again) *force* your clients to do what you
> just called "the far greater problem"? Namely, being randomly split
> 50/50 between the two servers, not even *knowing* of the other one?
>
>
I think that is part of the reason I'm sanity checking.  I think if the
servers stay in sync... it's probably a non-issue.  But that is the issue:
I've seen the havoc when one out of two servers is bad... and with the load
balancer, there's no guarantee I'm any better off...

> (FWIW, ntpd does the DNS resolution *once* when loading its config and
> works with the one IP obtained from then on, plans of implementing
> automatic rotation/selection of "pool" servers in future versions
> notwithstanding. And having potentially disagreeing NTP servers put
> behind a V*IP* load balancer is discouraged as well.)
>

**Especially** considering that statement!   Hmmm... if that's the way ntpd
works, then I think the load balancer is worse than useless for ntp
clients... it could be disastrous, correct?

Thanks again for the feedback/advice... I think that re-examining the
configuration is even more important now.

I like the idea of one time server from a pool and the other from a gps
based source.

-greg


Re: [ntp:questions] NTP.log interpretation

2014-04-18 Thread Jason Rabel
Greg,

As others have suggested, any client running NTP should point to *at least*
3 time sources (usually ~5 is preferred)... The reason is that if one server
goes wacko, but the other two agree, then the client knows to X out the bad
one and keep the two others. With only two you are essentially just
flipping a coin...
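
The coin-flip point can be illustrated with a toy majority vote in Python.
This is not ntpd's actual selection algorithm, only a sketch of the idea; the
offsets, the tolerance, and the source named "third" are made up for the
example.

```python
def largest_agreeing_group(offsets, tolerance_s=0.1):
    """Return the largest group of sources whose offsets all lie within
    tolerance_s of each other (first such group wins ties)."""
    names = sorted(offsets, key=offsets.get)
    best = []
    for i, low in enumerate(names):
        # Sources are sorted by offset, so comparing against the lowest
        # member bounds every pairwise difference in the group.
        group = [n for n in names[i:] if offsets[n] - offsets[low] <= tolerance_s]
        if len(group) > len(best):
            best = group
    return best


def trusted_sources(offsets, tolerance_s=0.1):
    """Sources backed by a strict majority, or [] when there is none."""
    group = largest_agreeing_group(offsets, tolerance_s)
    return group if len(group) * 2 > len(offsets) else []


two = {"34": -2.6, "97": 0.0}
three = {"34": -2.6, "97": 0.0, "third": 0.01}
print(trusted_sources(two))    # no majority with two disagreeing sources
print(trusted_sources(three))  # "97" and "third" outvote "34"
```

With two sources that disagree there is no majority to fall back on; adding
a third lets the odd one out be identified.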

I do not know where you are located, but if you are serving time to 100+
clients, you should probably consider the "pool" servers as backup sources
and look more into finding local public stratum 1 & 2 servers:

http://support.ntp.org/bin/view/Servers/StratumOneTimeServers

http://support.ntp.org/bin/view/Servers/StratumTwoTimeServers

NTP uses very, very little bandwidth: it's one small UDP packet (less than
128 bytes) that (assuming default configuration) works its way up to once
every 17 minutes... There's no reason to be stingy with selecting a handful
of external internet time servers (unless company policy prohibits it).
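
For a sense of scale, a back-of-the-envelope Python estimate using the
figures above as upper bounds. It assumes a steady 1024 second poll interval
(about 17 minutes) and counts one request and one reply per poll; the
function name is illustrative.

```python
PACKET_BYTES = 128      # upper bound on packet size quoted above
POLL_INTERVAL_S = 1024  # about 17 minutes (assumed steady-state interval)


def daily_bytes(n_servers, packet_bytes=PACKET_BYTES, poll_s=POLL_INTERVAL_S):
    """Upper bound on bytes per day: one request plus one reply per poll."""
    polls_per_day = 86400 / poll_s
    return n_servers * polls_per_day * 2 * packet_bytes


print(f"{daily_bytes(5) / 1024:.1f} KiB/day for 5 servers")
```

Even with five servers configured, this stays at roughly a hundred KiB per
client per day.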

If your company has the funds and you have the ability to mount a GPS
antenna, then going with a commercial GPS based NTP server might be the way
to go. You can choose from various oscillator options so that they will
flywheel if they lose GPS lock but still keep decent time for long hold-over
periods. Likewise those same companies also offer CDMA based time servers if
you have no sky-view access.

If you want the DIY route, any old PC running Linux or FreeBSD that has a
serial port + a GPS module that will output a PPS will yield you far better
results than you could sync with over a network.

Finally, it would also be worthwhile to have a layer of your time servers
"peer" with each other. For instance, I have several Stratum-1 servers that
get time via GPS. Then I have three Stratum-2 servers that use the "server"
line for the S1 servers, but in addition they use the "peer" line with each
other S2 server. When you combine that with "orphan" mode, if all my S1
servers went down, the S2's would work with each other to figure out their
best guess at the right time. Finally all my clients point to the S2
servers... Because it's only my local LAN, I do not have any external NTP
servers listed, but if I did then those would end up being used as fallback
sources for the S2 servers.

My S2 servers are also not dedicated time servers, but they are servers that
don't go down, get rebooted, or even get tinkered with often. For instance
one is a NAS that is the primary network storage for all clients running.
Another is a database server.

A dedicated NTP server doesn't have to be a huge powerful machine. If you
open up many commercial products, you would be surprised to see 486-class
PC104 SBCs. The extra cost comes in their proprietary hardware that usually
will discipline a TCXO, OCXO, or Rubidium oscillator to GPS or CDMA (giving
the flywheel ability)...

I have built probably half a dozen GPS based Stratum-1 NTP servers using
Soekris SBCs and Motorola Oncore GPS receivers, all off eBay... Maybe
spending $50 total in hardware and an hour or less mounting everything in
the little chassis and soldering wires. The end result is a nice time server
that consumes maybe 5-10 watts... I also purchased a handful of old
commercial time servers that also pop up on eBay from time to time at decent
"hobbyist" prices... But to be honest they provide no better time than my
homemade ones, and most are running outdated OSes and NTP distros that I
would not trust in a commercial environment because of the potential for
exploits (which is probably why they ended up on eBay). Not to mention most
are using old GPS receivers from the 90's (some aren't even timing
receivers).



