I am doing post-mortem analysis on an NTP-related problem in which one host 
running ntp-4.1.2 got into a state where it seemed to be making large step 
corrections to its local clock.

When I look at the NTP stats file, I can see that something was terribly wrong 
with one or more of the NTP servers this host was using.  Sometime around 18 
August, the clocks of NTP servers 192.168.0.1 and 192.168.0.2 began to 
gradually diverge, reaching a difference of over 800 seconds by 8 September.  
Compounding this problem, the peerstats file also shows one of the NTP servers 
periodically (with a period of ~900s) being detected as unreachable over the 
whole duration.  The other NTP server had a few sporadic instances of being 
unreachable.

I have captured all of the ntp configuration and the stats files.  Also, I 
prepared a graph (http://dingo.dogpad.net/ntpProblem/reachableScatter.png) 
showing the offset of each peer as a function of time.  All the stats and 
config (and the graph) can be found at http://dingo.dogpad.net/ntpProblem.
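
For anyone who wants to look at the per-peer offsets without the graph, here 
is a minimal sketch of the kind of parsing involved.  It assumes the usual 
peerstats line layout (MJD day, seconds past midnight, peer address, status 
word, offset in seconds, then delay, dispersion and jitter).  The function 
names are hypothetical and only for illustration; this is not the exact 
script behind the graph.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Modified Julian Date day 0 is 17 November 1858 (UTC).
MJD_EPOCH = datetime(1858, 11, 17, tzinfo=timezone.utc)


def parse_peerstats_line(line):
    """Return (timestamp, peer, offset_seconds), or None for short lines.

    Assumed field order: MJD, seconds past midnight, peer address,
    status word (hex), offset, delay, dispersion, jitter.
    """
    fields = line.split()
    if len(fields) < 5:
        return None
    mjd, secs, peer, _status, offset = fields[:5]
    when = MJD_EPOCH + timedelta(days=int(mjd), seconds=float(secs))
    return when, peer, float(offset)


def offsets_by_peer(lines):
    """Group (timestamp, offset) samples by peer address."""
    samples = defaultdict(list)
    for line in lines:
        record = parse_peerstats_line(line)
        if record is not None:
            when, peer, offset = record
            samples[peer].append((when, offset))
    return samples


def max_abs_offset(series):
    """Largest offset magnitude (seconds) in one peer's series."""
    return max(abs(offset) for _, offset in series)
```

Feeding the open peerstats file to offsets_by_peer() and calling 
max_abs_offset() on each peer's series gives a quick view of how far each 
server appeared to be from the local clock.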

I am a little bit interested in understanding what could have happened with the 
NTP servers on 18 August.  I know that on 8 September, someone changed the 
configuration of one of the NTP servers (Note: the servers are probably not 
ntp.org's implementation), which apparently fixed the problem.

I am more interested, however, in how my node handled this problem.  Before I 
started digging into the problem, I was under the impression that ntp.org's 
ntpd never stepped the clock, but only slewed it to correct it.  Now I see this 
is not the default behavior, but I can achieve this using tinker step 0.  
However, I read a thread on this newsgroup from Feb 2005 in which David Mills 
suggested this could produce large offsets and other unpredictable errors.

How can I avoid the large clock stepping in this scenario?  Is it related to the 
"prefer" keyword used for 192.168.0.1?
Can I safely use "tinker step 0" along with "disable kernel" to prevent step 
corrections altogether?
Can anyone tell me what they think happened to cause the two NTP servers to 
diverge so quickly?
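
To make the question concrete, the ntp.conf fragment I have in mind would 
look roughly like the sketch below.  The server lines are abbreviated for 
illustration (the actual config is at the URL above), and I am assuming that 
turning off the kernel discipline is written as "disable kernel".  I have not 
tested this on the affected host.

```
# Sketch only; server lines abbreviated for illustration.
server 192.168.0.1 prefer
server 192.168.0.2

# Step threshold of 0: as I understand the docs, ntpd then never
# steps the clock and corrects by slewing only.
tinker step 0

# Turn off the kernel time discipline.
disable kernel
```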


---
Joe Harvell
