When running recent versions of ntpd and with a recent Solaris patch, I see 
large delay and jitter in ntpq based output, and in the stats files. I am 
running in a high performance configuration where I'm trying to minimize offset.

I think it is a Solaris problem in asynchronous i/o processing, where select() 
is erroneously returning zero return values under certain conditions.

My ‘high performance’ configuration is running ntpd bound to processor 0 on an 
SMP, and at highest priority. My poll rate is 16 sec. And I’m running ntpq 
locally so ntpq is communicating with ntpd over the loopback interface. Most of 
these conditions are required to create the problem. So running time share, or 
running ntpq remotely, or not binding to processor 0 and then there is no 
problem. I haven’t tried other poll rates but its not likely that matters.

I did a lot of runtime tracing on ntpd using the built-in tracing printouts as 
well as printouts I added myself. I see that after ntpd services the ntpq 
requests over the loopback interface, and then services its next timer alarm 
interrupt, it sends a time synch packet out to its server as expected. (The 
server is on the same LAN in the same room, not on the internet.)

A SIGPOLL interrupt happens soon after as expected, as the return packet comes 
in. What then catches my eye is the select() call in the signal handler returns 
with return value and file descriptors set to zero. These should be nonzero. 
This is why I think there is a problem in Solaris. So with these zero return 
values from select(), the signal handler does not read out the packet from 
socket, and returns. The packet stays in the socket and gets stale.

Next there are more timer expiration events, and being at a poll rate of 16, 
not too much later I see another outbound packet get sent to the server in the 
room. Then I see the SIGPOLL interrupt and the interrupt handler run. The 
handler reads two packets out of the socket, the stale one and then the new one 
that came in behind it. My trace shows a printout of 'flash header 002', and 
'2' means 'bogus' so I think ntpd has detected something is bad about this 
stale packet that was stuck in the socket for a long time. However, I think 
this stale packet is what may be driving up the delay and jitter values I see 
in ntpq and the stats files. In rawstats, I see the t4 timestamp being really 
late for what I think is this event.

The signal handler makes the select() call with a timeout of 0, so the call 
returns immediately. As an experiement, I tried putting in non-zero timeouts, 
first 100 microsec and then 1 microsec. With non-zero timeouts, select() 
returns look good, non-zero return values.  I looked at a handful of events, 
and left the timeout at 1 microsec. My colleague let it run for a while and he 
reported back it was better but eventually he started to see problems. Anyway, 
this is a hint about what may be happening in Solaris.

Some things I tried fiddling with in Solaris that did not matter whether they 
were enabled or disabled were TCP Fusion and hi_res tick. 

I passed this to a couple Oracle reps about a week ago and haven’t heard 
anything back. I’ll try to post this to an Oracle Solaris support group, and 
try to submit an official trouble report to Oracle on it as well.

Some important details on the runtime machine and versions:

Hardware: Oracle SPARC T3 server, SMP multiprocessor, 32 Gig RAM
Operating System: Solaris 10, 8/11
Solaris patch where we started to see a problem: 144500-19

ntpd versions: I saw the problem and did my tracing on v4.2.6. Then I went back 
to version 4.2.5 which was recently delivered with Solaris and see the problem 
there as well. My colleague went back even further and he reports that 4.2.2 
has the problem, but then 4.2.0 and back does not.

I’ll let you know if I hear back from Oracle. If you have ideas please let me 
know. We are not stuck. We can run older versions, or not bind to processor 0 
to avoid the problem and still stay within the spec we are trying to meet.

Thanks,

Fran Horan
JHU/APL


_______________________________________________
questions mailing list
questions@lists.ntp.org
http://lists.ntp.org/listinfo/questions

Reply via email to