On 2012-06-08, Rick Jones <rick.jon...@hp.com> wrote:
> unruh <un...@invalid.ca> wrote:
>> I am sure that it started when we switched from 100Mb technology to
>> Gb technology, yes. Other places to look for the problem would be
>> appreciated.
>
> I would suggest then trying disabling of the interrupt coalescing via
> ethtool on the 1GbE NIC of your server and a few select clients and
> see what that does. If things start to look cleaner then you know it
> is an implementation-specific detail of one or more GbE NICs.

It looks to me as though interrupt coalescing is not enabled, according to
ethtool. It seems that the receipt of the packets is the problem. I.e., if
I plot the round trip time vs the offset, the two are strongly correlated:
the longer the round trip, the more the offset indicates that the local
clock is behind time (by 1/2 the round trip excess). I.e., it is a one-way
delay, and the effect is much worse for the Gigabit than for the 100 (the
variation in round trip is about 4 times as large for Gigabit as for the
100).

>
> If it is possible to connect a client "back-to-back" to your server at
> the same time (via a second port) - still with interrupt coalescing
> disabled at both ends that would be an excellent addition. That will
> help evaluate the switch.
>
> I trust there were no OS changes when going from 100BT to GbE? Though
> even if not, there is still the prospect of the drivers for the 100BT
> cards not doing what linux calls "napi" and the drivers for the GbE
> cards doing it, which may introduce some timing changes.

What is napi?

>
>> So yes, I think it is the Gb technology that is causing trouble.
>
> I split what may seem a hair between Gb technology being the IEEE
> specification and Gb implementation being what specific NIC vendors
> do. So, to me, interrupt coalescing is implementation not technology.

For me, I do not care which it is, it is all Gb.

Note that on one of the clients, there are two separate clusters of
round trip delays, one from .15 to about .4 ms, and the other from about
1.3 to 1.6 ms. The slope within each cluster is as above, but the slope
between the clusters is the opposite. I.e., within a cluster, the client to
server packet is being delayed, while the separation between the clusters
is due to a huge delay in the server to client direction (if I have the
signs right).

In http://www.theory.physics.ubc.ca/scatter/scatter.html I have the scatter
plots (offset vs return time) for two clients to two different servers.
One of the servers is a Gb server, while the other is a 100Mb server. Both
servers are disciplined by a GPS PPS device. The offset fluctuations on
both servers are about 4 us, so none of the offset fluctuations come from
the server clocks themselves.
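
The sign argument about which direction is delayed can be checked with a small sketch of the standard NTP on-wire calculation. It assumes the usual convention that a positive offset means the server is ahead of the client (the local clock is behind); the plots may use the opposite sign. The helper names and the 100 us and 200 us delay values are hypothetical, chosen only for illustration.

```python
def ntp_offset_delay(t1, t2, t3, t4):
    """Standard NTP calculation from the four exchange timestamps.

    t1: client transmit, t2: server receive,
    t3: server transmit, t4: client receive.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2.0
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


def exchange(out_delay, back_delay, true_offset=0.0, t1=0.0):
    """Simulate one exchange (hypothetical helper).

    true_offset is server clock minus client clock; the server is
    assumed to reply instantly.
    """
    t2 = t1 + out_delay + true_offset   # stamped by the server clock
    t3 = t2                             # instant reply
    t4 = t1 + out_delay + back_delay    # stamped by the client clock
    return ntp_offset_delay(t1, t2, t3, t4)


base = 100e-6    # assumed symmetric one-way delay
extra = 200e-6   # assumed excess delay in one direction only

symmetric = exchange(base, base)
out_late = exchange(base + extra, base)    # client -> server delayed
back_late = exchange(base, base + extra)   # server -> client delayed

for name, (off, dly) in [("symmetric", symmetric),
                         ("client->server late", out_late),
                         ("server->client late", back_late)]:
    print(f"{name:22s} offset {off * 1e6:+7.1f} us  delay {dly * 1e6:6.1f} us")
```

With perfectly synchronized clocks, an excess delay on the client to server leg raises the round trip by the full excess and pushes the computed offset positive by half of it, while the same excess on the return leg pushes the offset negative by half of it. Under the stated convention, that matches a positive slope of 1/2 within a cluster and the opposite slope between clusters.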