Hi,

Thank you all for your replies. Sorry I have not got back with comments 
before; I got involved in other things at work.

Having got my hands on another 8 units from production I set them all up in 
a rack.

Of the 8 units, 4 locked up when first synchronised to NTP, with the 
drift file stuck at -500.

These 4 units were then left for 48 hours, and they were still in the same 
condition, which was:

-- The drift file read between -495 and -500; there seemed to be small 
changes over time, but it was mostly -500.

-- Using ntpq -p to monitor the lock, you could see the time initially being 
within 15ms of the server, then over a few minutes making its way up to an 
offset of 500ms, at which point a step change was made and the whole process 
started all over again.

-- The good units were locked to the same time server and were within +/- 
15ms or better. The server is local but locked to PPS NTP servers over the 
internet, which is not perfect as the delay and jitter change through the 
day as internet usage changes. I have seen them locked to within 2ms when 
used on a customer's network with a local GPS-controlled server.

After 48 hours of being in this stuck condition I logged in and stopped the 
ntp process, cleared the drift file back to 0, used ntpdate to set the 
system time to that on the server and re-started the ntp process.

(The drift file will always start at zero on these systems as they operate 
from read only flash, the /var directory is constructed in /dev/shm at boot)

After 3 hours or so I checked them, and all 8 of the systems were locked to 
the server, with drift files ranging from about -20 to -75, so none of the 
processor clocks seems to be far off.
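
For a sense of scale, a drift file value in PPM can be converted to the 
clock error it would produce per day if left uncorrected. This is only a 
rough sketch of the arithmetic; the function name is made up for 
illustration.

```python
SECONDS_PER_DAY = 86400


def ppm_to_seconds_per_day(ppm):
    """Convert a frequency error in parts per million to seconds per day."""
    return ppm * 1e-6 * SECONDS_PER_DAY


# The values seen after the restart, and the stuck value
for ppm in (-20, -75, -500):
    print("%d PPM is about %.2f s/day" % (ppm, ppm_to_seconds_per_day(ppm)))
```

So the good values correspond to a couple of seconds a day at most, whereas 
-500 would correspond to more than 43 seconds a day.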

Now from this I can only conclude it's a problem with the control algorithm 
in ntp, which under some start conditions gets stuck applying the maximum 
compensation. If it were a problem with the hardware I would have expected 
it to still be there after stopping and restarting the ntp process. And if 
it were a transient hardware problem at power-up, why would the units still 
be stuck after 48 hours?

Looking more closely at the differences between the systems, I found that 
one of those that had initially failed to lock had a completely dead CMOS 
battery. Apparently production had received a batch of 100 dead batteries, 
and this was one that slipped through the net.

Looking at all the others turned up one other dead battery, but that was 
from one that locked OK.

All of the real-time clocks in the units were set to some time in December 
2008, having never been set. The BIOS was written around then, so it seems 
that's the time they start with when the battery is first fitted.

This is partly down to the way the units are used: with a read-only file 
system, they are designed to just be turned off rather than shut down, so 
the system time never gets written to the RTC (as I understand it, this is 
normally done during shutdown). In normal use these units will be on all the 
time.

I have since tried to reproduce the lockup with these units, returning their 
flash drives to the production image and starting all over again, with the 
RTC again set to different times in the past and future. None of these 
efforts has so far reproduced the right conditions; they always seem to lock 
with no problems at all.

I have seen one other unit lock up recently. It was being used by one of the 
software development engineers, who had been on holiday for two weeks; when 
he returned and powered his unit up, it failed to lock. Unfortunately I did 
not get to examine it before he had shut it down and rebooted, so I don't 
know the state of the RTC when it was booted; the shutdown had set it to the 
server time.

So there may be some merit in testing a few more systems that have been off 
for a while, to see what happens?

I think my next step is to get the software, or maybe just a cron script, to:

a) Set the RTC to the system time once locked to NTP

b) If the drift file is -500, stop ntp, set time to the server time, restart 
ntp with zeroed drift file.
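
Steps (a) and (b) could be sketched roughly as below. This is only an 
outline under assumptions, not tested on the units: the init script path 
/etc/init.d/ntp, the 5 PPM margin (chosen because the stuck units read 
between -495 and -500) and all function names are made up for illustration, 
and the drift file path and server name would have to be supplied by the 
caller. Deciding when the unit counts as "locked" for step (a) is left out.

```python
import subprocess

MAX_DRIFT_PPM = 500.0  # ntpd will not correct the frequency by more than this
MARGIN_PPM = 5.0       # assumption: the stuck units read between -495 and -500


def drift_is_pegged(drift_text, limit=MAX_DRIFT_PPM, margin=MARGIN_PPM):
    """Return True if the drift file contents sit at (or near) the limit."""
    try:
        ppm = float(drift_text.split()[0])
    except (IndexError, ValueError):
        return False  # empty or unreadable file: do not treat as stuck
    return abs(ppm) >= limit - margin


def save_rtc(run=subprocess.check_call):
    """Step (a): copy the system time to the RTC once NTP has locked."""
    run(["hwclock", "--systohc"])


def recover(server, drift_path, run=subprocess.check_call):
    """Step (b): stop ntp, zero the drift file, set the time, restart ntp."""
    run(["/etc/init.d/ntp", "stop"])   # assumed init script path
    with open(drift_path, "w") as f:
        f.write("0.000\n")
    run(["ntpdate", server])
    run(["/etc/init.d/ntp", "start"])  # assumed init script path


def check(server, drift_path, run=subprocess.check_call):
    """One cron pass: recover if the drift file is pegged, else do nothing."""
    with open(drift_path) as f:
        text = f.read()
    if drift_is_pegged(text):
        recover(server, drift_path, run)
        return True
    return False
```

The run argument is there so the sequence can be tried with a stand-in 
(for example a list's append method) before letting it touch a real unit.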

PS systems are running Linux version 2.6.22.18-0.2-default (ge...@buildhost) 
(gcc version 4.2.1 (SUSE Linux))

Dave




_______________________________________________
questions mailing list
https://lists.ntp.org/mailman/listinfo/questions
