"David Hawkins" <[email protected]> writes:

>Hi,

>Thank you all for your replies. Sorry I have not come back with comments 
>before now; I got involved in other things at work.

>Having got my hands on another 8 units from production I set them all up in 
>a rack.

>Of the 8 units 4 of them locked up when first synchronised to ntp, with the 
>drift file being stuck at -500.

>These 4 units were then left for 48 hours, and they were still in the same 
>condition, which was:

>-- Drift file read between -495 and -500; there seemed to be small changes 
>over time, but mostly -500

>-- Using ntpq -p to monitor the lock, you could see the time initially being 
>within 15ms of the server, then over a few minutes make its way up to an 
>offset of 500ms, when a step change was made and the whole process started 
>all over again.

I do not know what a "few minutes" means -- 10 min? 1000 min? If it is 10 min,
this would indicate a drift rate of about 1000PPM, which ntp certainly cannot
fix.
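
To make that arithmetic concrete, here is a minimal sketch in Python. The
function name and the sample durations are mine, purely for illustration; it
simply divides the 500ms offset by the time taken to accumulate it. For 10
minutes it gives roughly 833PPM, i.e. the "about 1000PPM" order of magnitude,
and well beyond the 500PPM limit of the drift file.

```python
def drift_ppm(offset_ms, minutes):
    """Average frequency error (PPM) needed to gain offset_ms in the given time."""
    seconds = minutes * 60.0
    return (offset_ms / 1000.0) / seconds * 1e6


if __name__ == "__main__":
    # Hypothetical readings of "a few minutes" for the 500ms ramp.
    for minutes in (5, 10, 1000):
        print("%5d min -> %.0f PPM" % (minutes, drift_ppm(500, minutes)))
```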


>-- The good units were locked to the same time server and were within +/- 
>15ms or better. The server is local but locked to PPS ntp servers over the 
>internet ... not perfect, as the delay and jitter change through the day as 
>internet usage changes. I have seen them locked to within 2ms when used on a 
>customer's network with a local GPS controlled server.

>After 48 hours of being in this stuck condition I logged in and stopped the 
>ntp process, cleared the drift file back to 0, used ntpdate to set the 
>system time to that on the server and re-started the ntp process.

>(The drift file will always start at zero on these systems as they operate 
>from read only flash, the /var directory is constructed in /dev/shm at boot)

>After 3 hours or so I checked them, and all 8 of the systems were locked to 
>the server, with drift files ranging from about -20 to -75, so all the 
>processor clocks don't seem to be too far off.

>Now from this I can only conclude it's a problem with the control algorithm 
>in ntp, which under some start conditions gets stuck applying the maximum 
>compensation. If it were a problem with the hardware I would have expected 
>it to still be there after stopping and re-starting the ntp process. If it 
>were a transient hardware problem at power up, why would it still be stuck 
>after 48 hours?

>Looking at the differences between the systems more I found one of them that 
>had initially failed to lock had a completely dead CMOS battery. Apparently 
>production had received a batch of 100 dead batteries, and this was one that 
>slipped through the net.

>Looking at all the others turned up one other dead battery, but that was 
>from one that locked OK.

>All of the real time clocks in the units were set to some time in December 
>2008, having never been set. The BIOS was written around then, so it seems 
>that's the time they start with when the battery is first fitted.

>This is in part to do with the way the units are used with a read only file 
>system, they are designed to just be turned off rather than shut down, so 
>the system time never gets written to the RTC, as I understand this is 
>normally done during shutdown. (In normal use these units will be on all the 
>time)

>I have since tried to reproduce the lockup with these units, returning their 
>flash drives to the production image and starting all over again, with the 
>RTC again set to different times in the past and future. None of these 
>efforts have so far reproduced the right conditions; they always seem to lock 
>with no problems at all.

>I have seen one other unit recently that locked up. That was a unit being 
>used by one of the software development engineers, who had been on holiday 
>for two weeks, and when he returned and powered his unit up, it failed to 
>lock. Unfortunately I did not get to examine it before he had shut it down 
>and re-booted, so I don't know the status of the RTC when it was booted; the 
>shutdown had set it to the server time.

>So there may be some merit in testing a few more systems that have been off 
>for a while to see what happens?

>I think my next step is to get the software, or maybe just a cron script, to

>a) Set the RTC to the system time once locked to NTP

>b) If the drift file is -500, stop ntp, set time to the server time, restart 
>ntp with zeroed drift file.
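
Step (b) could be sketched roughly as below, in Python rather than shell. This
is only an outline under several assumptions of mine: the drift file path, the
init script path, the server name and the 495PPM threshold are all
placeholders to adapt, not values taken from these units. It runs as a dry run
unless --apply is passed. Step (a) is left out because deciding that ntpd is
genuinely locked needs more than the drift file; once that is known,
hwclock --systohc is the usual way to copy the system time to the RTC.

```python
import subprocess
import sys

# All of the following are assumptions for illustration; adjust to the system.
DRIFT_FILE = "/var/lib/ntp/drift/ntp.drift"   # assumed drift file location
NTP_INIT = "/etc/init.d/ntp"                  # assumed init script
NTP_SERVER = "ntp.example.com"                # placeholder server name
PINNED_PPM = 495.0                            # "stuck" threshold, near the 500 limit


def drift_is_pinned(text, threshold=PINNED_PPM):
    """Return True if the drift file content sits at or near +/-500 PPM."""
    try:
        return abs(float(text.split()[0])) >= threshold
    except (IndexError, ValueError):
        return False  # empty or unparsable file: treat as not pinned


def main(apply=False):
    try:
        with open(DRIFT_FILE) as fh:
            text = fh.read()
    except OSError:
        return  # no drift file yet, e.g. shortly after boot
    if not drift_is_pinned(text):
        return
    if not apply:
        print("drift pinned; would stop ntp, zero drift file, ntpdate, restart")
        return
    subprocess.call([NTP_INIT, "stop"])
    with open(DRIFT_FILE, "w") as fh:
        fh.write("0.000\n")               # start again from zero drift
    subprocess.call(["ntpdate", NTP_SERVER])
    subprocess.call([NTP_INIT, "start"])


if __name__ == "__main__":
    main(apply="--apply" in sys.argv)
```

Run from cron, say once an hour, it only acts when the drift file is pinned.
Note that it treats the symptom; it does not explain why the loop got stuck.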

>PS systems are running Linux version 2.6.22.18-0.2-default (ge...@buildhost) 
>(gcc version 4.2.1 (SUSE Linux))

Linux has had some problems with its calibration of the system clock, and it
is possible that this resulted in a clock which was running way out of spec
(i.e. say near 500PPM out). This would of course make it impossible for ntp to
sync the clock. You could try to manually set the rate (adjtimex with the
--tick adjustment) to compensate for the anomalous drift rate.
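
As a rough guide to picking the numbers, here is a small Python sketch. It
assumes USER_HZ=100, where the nominal tick is 10000 microseconds, one tick
unit is worth about 100PPM, and 65536 frequency units are worth about 1PPM.
The function name and the sign convention (positive error means the clock
runs fast) are mine. Check the adjtimex man page on your system before
trusting the values.

```python
NOMINAL_TICK_US = 10000      # assumption: USER_HZ=100
PPM_PER_TICK = 100           # one tick unit is about 100 PPM at USER_HZ=100
FREQ_UNITS_PER_PPM = 65536   # adjtimex frequency value is scaled by 2**16


def adjtimex_args(clock_error_ppm):
    """Return (tick, frequency) that cancel a measured clock error in PPM.

    Positive clock_error_ppm means the clock runs fast (my convention).
    """
    correction = -clock_error_ppm
    # Coarse part goes into the tick value, in 100 PPM steps.
    tick_delta = round(correction / PPM_PER_TICK)
    # Whatever is left over goes into the fine frequency offset.
    residual_ppm = correction - tick_delta * PPM_PER_TICK
    return NOMINAL_TICK_US + tick_delta, round(residual_ppm * FREQ_UNITS_PER_PPM)


if __name__ == "__main__":
    tick, freq = adjtimex_args(833)  # hypothetical clock running 833 PPM fast
    print("adjtimex --tick %d --frequency %d" % (tick, freq))
```

Once ntpd is running it takes over the frequency term itself, so the tick
value is the part that matters here: it brings the remaining error back
inside the 500PPM range that ntp can handle.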



>Dave




_______________________________________________
questions mailing list
[email protected]
https://lists.ntp.org/mailman/listinfo/questions
