Richard B. Gilbert wrote:
> Joe Harvell wrote:
>> David L. Mills wrote:
>>> <snip>
>> <snip>
>> I concede that having only two NTP servers for our host made this
>> problem more likely to occur. But considering the mayhem caused by
>> jerking the clock back and forth every 15 minutes for 22 days, I think
>> it is worth investigating whether to eliminate stepping altogether.
> Why didn't anyone notice the problem for 22 days? If, indeed, it caused
> mayhem, why was it allowed to continue for so long?
I see your point. I don't know for sure if it really caused problems. I
suspect I will begin to see a large number of bug reports coming out of this
test lab once they start filtering back to the design team. But it is quite
possible there weren't any big problems, or they went unnoticed. It really
depends on the type of testing they were performing.

This application is a call processing application, implementing call
signaling protocols and a host of other proprietary protocols for OAM
(Operations, Administration, Maintenance) of the software itself. The big
problems I would expect to have occurred fall into two categories: 1)
problems stemming from protocol timers expiring both early and late; and
2) accounting records for the calls themselves showing inaccurate
(including negative) durations.

The software that did notice the problem was the software responsible for
journaling application state from one process to another, as part of a 1+1
fault tolerance system. This software was measuring round-trip latencies
between it and its mate by timestamping a message with its own clock,
bouncing it off the mate, and then re-sampling its own clock on return to
compute the RTT. These RTT measurements only take place during failure
recovery scenarios, which is what was being tested at the time.
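
For what it's worth, here is a minimal sketch of the failure mode, assuming
POSIX clock_gettime(); bounce_off_mate() is a made-up stand-in, not our
actual code:

    #include <stdio.h>
    #include <time.h>

    /* Placeholder for "send a probe to the mate and wait for the echo". */
    static void bounce_off_mate(void) { /* ... */ }

    static double elapsed_sec(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void)
    {
        struct timespec wall0, wall1, mono0, mono1;

        clock_gettime(CLOCK_REALTIME,  &wall0);  /* subject to ntpd steps   */
        clock_gettime(CLOCK_MONOTONIC, &mono0);  /* never stepped backwards */

        bounce_off_mate();

        clock_gettime(CLOCK_REALTIME,  &wall1);
        clock_gettime(CLOCK_MONOTONIC, &mono1);

        /* If ntpd steps the clock backward during the bounce, this can go
           negative -- the same failure mode as a negative call duration. */
        printf("wall-clock RTT: %f s\n", elapsed_sec(wall0, wall1));

        /* CLOCK_MONOTONIC is still slewed by the kernel discipline but is
           never stepped, so this interval is always >= 0. */
        printf("monotonic RTT:  %f s\n", elapsed_sec(mono0, mono1));
        return 0;
    }

Switching the journaling code to CLOCK_MONOTONIC would sidestep steps
entirely, though it obviously doesn't help the billing records, which have
to carry wall-clock times.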
Since our customers are telecommunications service providers, I expect they
would notice negative durations in their billing records. I am trying to
prevent this from ever occurring. However, based on the response I've
received from Dr. Mills in this thread, it seems the daemon feedback loop is
unstable as a result of OS developers implementing variable slew rates in
adjtime(). So it looks like, if we continue with NTP, the better choice is
to keep the kernel time discipline for stable time, and to engineer the
network so that multiple failures would be required before stepping becomes
necessary in the first place.
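
As a sanity check on that plan, the kernel's view of the discipline can be
read back directly; a minimal sketch, assuming the Linux adjtimex() call
(ntp_adjtime() on most other Unixes):

    #include <stdio.h>
    #include <sys/timex.h>

    int main(void)
    {
        struct timex tx = { 0 };    /* tx.modes == 0: read-only query */

        if (adjtimex(&tx) == -1) {
            perror("adjtimex");
            return 1;
        }
        printf("status word:    0x%04x\n", tx.status);
        printf("kernel PLL:     %s\n",
               (tx.status & STA_PLL)    ? "enabled" : "disabled");
        printf("unsynchronized: %s\n",
               (tx.status & STA_UNSYNC) ? "yes" : "no");
        return 0;
    }

(The ntptime utility shipped with the NTP distribution reports the same
information.)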
I wonder if it would be good to add a description of this to the NTP FAQ.
The key points to include, I think, should be why the kernel time discipline
is disabled when the step threshold is changed, and also some indication
that the daemon feedback loop is broken to begin with. I am not the first
person to go down this path.
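
Something along these lines might anchor such a FAQ entry; this is my
reading of the ntp.conf documentation and of this thread, not authoritative
text:

    # Default behavior: step threshold 128 ms, kernel discipline engaged
    # (no "tinker" line needed).

    # Either of the following changes the step threshold from its default,
    # which disables the kernel discipline and falls back to the daemon
    # feedback loop -- the unstable path discussed above:
    #
    #   tinker step 0        # disable stepping entirely; slew-only
    #   tinker step 0.5      # non-default threshold of 500 ms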
Thanks again for your responses.
---
Joe Harvell