Richard B. Gilbert wrote:
Joe Harvell wrote:

David L. Mills wrote:
<snip>
<snip>
I concede that only having 2 NTP servers for our host made this problem more likely to occur. But considering the mayhem caused by jerking the clock back and forth every 15 minutes for 22 days, I think it is worth investigating whether to eliminate stepping altogether.


Why didn't anyone notice the problem for 22 days? If, indeed, it caused mayhem, why was it allowed to continue for so long?

I see your point.  I don't know for sure if it really caused problems.  I 
suspect I will begin to see a large number of bug reports coming out of this 
test lab once they start filtering back to the design team.  But it is quite 
possible there weren't any big problems or they went unnoticed.  It really 
depends on the type of testing they were performing.  This application is a 
call processing application, implementing call signaling protocols, and a host 
of other proprietary protocols for OAM (Operations, Administration, 
Maintenance) of the software itself.  The big problems I would expect to have 
occurred fall into two categories:  1) problems stemming from protocol timers 
expiring both early and late; and 2) accounting records for the calls 
themselves showing inaccurate (including negative) duration.  The software that 
did notice the problem was the software responsible for journaling application 
state from one process to another, as part of a 1+1 fault tolerance system.  This 
software was measuring round-trip latencies between it and its mate by 
bouncing a measurement from its own clock off of its mate and then re-sampling 
its own clock to compute the RTT.  These RTT measurements only take place during 
failure recovery scenarios, which is what was being tested at the time.
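The failure mode described above is easy to reproduce. Below is a minimal sketch in Python (not the actual journaling code, which is hypothetical here) showing why interval measurements taken from the wall clock can come out negative when ntpd steps the clock backward between the two samples, and why a monotonic clock avoids the problem:

```python
import time

# time.time() follows the system wall clock, which ntpd can step
# backward; time.monotonic() is guaranteed never to go backwards.

# Simulate a backward step of 0.5 s landing between the two samples
# by subtracting it from the second reading.
step = 0.5
start = time.time()
end = time.time() - step   # as if ntpd stepped the clock back mid-RTT
rtt = end - start
print(rtt < 0)             # True: the measured "RTT" is negative

# The same measurement against the monotonic clock cannot go negative,
# no matter what ntpd does to the wall clock.
m_start = time.monotonic()
m_end = time.monotonic()
print(m_end - m_start >= 0)  # True
```

The same reasoning applies to call-duration accounting: any duration computed as the difference of two wall-clock samples can go negative across a backward step, whereas a monotonic source only ever advances.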

Since our customers are telecommunications service providers, I expect they 
would notice negative durations for their billing records.  I am trying to 
prevent this from ever occurring.  However, based on the response I've received 
from Dr. Mills in this thread, it seems like the daemon feedback loop is 
unstable as a result of OS developers implementing variable slew rates into 
adjtime.  So it looks like, if we continue with NTP, the better choice is to use 
the kernel time discipline for stable time.  We will have to engineer the 
network so that multiple failures would be required before stepping becomes 
necessary in the first place.
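For concreteness, the engineering change amounts to a configuration like the sketch below (hostnames are placeholders): give ntpd enough servers that a single failure, or even two, still leaves a clock-selection majority, so the offset never drifts past the default 128 ms step threshold and the kernel discipline stays engaged.

```
# Hypothetical ntp.conf sketch -- four independent servers instead of two,
# leaving the default step threshold and kernel discipline untouched.
server ntp1.example.com iburst
server ntp2.example.com iburst
server ntp3.example.com iburst
server ntp4.example.com iburst
driftfile /var/lib/ntp/ntp.drift
```

With only two servers, ntpd cannot tell which one is a falseticker when they disagree, which is part of what made the original incident more likely.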

I wonder if it would be good to add a description of this to the NTP FAQ.  
The key points to include, I think, are why the kernel time discipline is 
disabled when the step threshold is changed, and also some indication that the 
daemon feedback loop is broken to begin with.  I am not the first person to go 
down this path.

Thanks again for your responses.

---
Joe Harvell

_______________________________________________
questions mailing list
[email protected]
https://lists.ntp.isc.org/mailman/listinfo/questions