On 1 Apr, 2014, at 12:50 , David Laight <da...@l8s.co.uk> wrote:

> On Fri, Mar 28, 2014 at 06:16:23PM -0400, Dennis Ferguson wrote:
>> I would like to rework the clock support in the kernel a bit to correct
>> some deficiencies which exist now, and to provide new functionality.  The
>> issues I would like to try to address include:
>
> A few comments, I've deleted the body so they aren't hidden!
Thanks very much for looking at it.  I know that reading about clocks is,
for most people, a good way to put oneself to sleep at night.

> One problem I do see is knowing which counter to trust most.
> You are trying to cross synchronise values and it might be that
> the clock with the best long term accuracy is a very slow one
> with a lot of jitter (NTP over dialup anyone?).
> Whereas the fastest clock is likely to have the least jitter, but
> may not have the long term stability.

This is true, but when considering the quality of non-special-purpose
computer clock hardware running on its own, either on the CPU board or on
an ethernet card, what you'll effectively end up trying to determine by
this is whether the clock is just crappy, or is crappier than that.  The
stability of cheap, uncompensated free-running crystals is always poor;
you shouldn't trust any of these unless you have no choice, and life is
too short to worry about trying to measure degrees of crappiness.

Since all the clocks in your system are likely to be crappy if left
running free, the "best" clock in the system will always be the one which
is making the most accurate measurements of the most accurate external
time source you have available and steering itself to that.  The only
important "quality" of a clock is how well it is measuring its time
source and how good that time source is.  The measurement clocks are only
useful if you have an application which is interested in taking and
processing those measurements, and if that application is not broken it
will come to some opinion, based on those measurements, about which of
those clocks is the best one.  That will be the clock the time comes
from; the polling is the mechanism for getting it to the others.  The
kernel itself will see the polling and see adjustments being made to
clocks, but it will be the application which knows why that is being done
and which way the time is moving.  If there are no external time sources,
however, you'll probably just live with whatever your chosen system clock
does and not worry about the measurement clocks.

> There are places where you are only interested in the difference
> between timestamps - rather than needing them converted to absolute
> times.

I'm not quite sure how to read that, but I'll guess.  I over-simplified
the description of what is being maintained a bit.  I'm fond of, and the
system call interface I like makes use of, the two timescales the kernel
maintains now, i.e.

    time = uptime + boottime;

where `time' has a UTC-aligned epoch, `uptime's epoch is around the time
the machine was booted, and boottime is a mostly-constant value which
expresses uptime's epoch in terms of time's epoch.  uptime is maintained
to advance at the same rate as time but to be phase continuous, which
means that uptime will advance at as close to the rate of the SI second
as we can determine it (since it advances at the same rate as time, which
advances at the rate of UTC, which advances at the rate of the SI second)
but is unaffected by the step changes made to time to bring it into phase
alignment with UTC (boottime changes instead).  uptime hence tracks UTC's
frequency but not its phase.

If you want to measure the interval between timestamps, then, I think you
would take your timestamps in terms of uptime and then compute

    interval = uptime[1] - uptime[0];

which should reliably give you the system's best estimate of the elapsed
number of SI seconds between the times the two stamps were acquired.
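To make that concrete, here's a minimal userland sketch of the interval
computation.  Nothing here is part of any existing interface: it uses
POSIX CLOCK_MONOTONIC as the closest current analogue of the
phase-continuous uptime timescale, and the systime_t type and its 32.32
binary-seconds format are assumptions made for illustration:

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    /* Assumed representation: a 64-bit count of 2^-32 second units,
     * i.e. 32 bits of seconds and 32 bits of binary fraction. */
    typedef int64_t systime_t;

    static systime_t
    get_uptime(void)
    {
        struct timespec ts;

        /* CLOCK_MONOTONIC stands in for uptime here: it advances at
         * the same rate as the UTC-aligned clock but is never stepped
         * by clock_settime(). */
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ((systime_t)ts.tv_sec << 32) +
            (systime_t)(((uint64_t)ts.tv_nsec << 32) / 1000000000);
    }

    int
    main(void)
    {
        systime_t t0, t1;

        t0 = get_uptime();
        /* ... the events being timed happen here ... */
        t1 = get_uptime();

        /* Safe even if clock_settime() runs in between: only the
         * UTC-aligned timescale gets stepped, not this one. */
        printf("interval = %lld (2^-32 s units)\n",
            (long long)(t1 - t0));
        return 0;
    }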
I like to record event timestamps in terms of uptime as well since it
makes it unambiguous when the events occurred even if someone calls
clock_settime() in between.

Also, the tuple describing a conversion from a tickcount_t tc to a
systime_t, which I over-simplified, actually maintains the pair of
timescales by maintaining two `c' values, so that

    time   = (tc << s) * r + c_time;
    uptime = (tc << s) * r + c_uptime;

and

    boottime = c_time - c_uptime;

So if "absolute time" means UTC, in the form of UTC-aligned `time', then
I agree.  You can't reliably compute time intervals from two UTC
timestamps since, almost unavoidably, some day the system's estimate of
UTC will be wrong and will require a step change to fix, and you'll
compute a bogus time interval if your timestamps straddle that.  On the
other hand, if avoiding "needing them converted to absolute times" means
hanging on to the raw tickstamp/tickcount for an extended period then I
don't see the point.  The conversion isn't very expensive, and a pair of
uptime timestamps taken from the system clock will reliably allow you to
compute intervals in SI seconds (or the system's best estimate of SI
seconds), which is probably what you'd like to know.

> This may mean that you can (effectively) count the ticks on all your
> clocks since 'boot' and then scale the frequency of each to give the
> same 'time since boot' - even though that will slightly change the
> relationship between old timestamps taken on different clocks.
> Possibly you do need a small offset for each clock to avoid
> discrepancies in the 'current time' when you recalculate the clocks
> frequency.

The rate of advance which clock synchronization software sets the clock
to is actually a prediction of clock performance in the immediate future,
based on measurements of as little of the clock's recent past behaviour
as the software thinks might be useful.  The problem with crappy clocks
is that the number changes a lot, and while longer averages help if you
have a constant signal in zero-mean noise they make things worse if the
signal itself is a moving target, which is what the clock's frequency is
like.  The current frequency hence won't tell you very much about the
very distant past (or distant future).  Note, however, that the frequency
changes are being made explicitly to make the ratio of "uptime
seconds"/"SI seconds" as close to unity as possible at all times, and the
longer the interval the closer to unity it will generally be.  If you
keep uptime timestamps you can compute intervals in SI seconds with
considerable precision; uptime itself is our best estimate of the number
of SI seconds since boot.

Also, I would expect most applications to exclusively take their
timestamps from the system clock (the point of making measurements is to
make the system clock as accurate as possible) and, while the hardware
source of the system clock might (rarely) be changed, it will make this
change in a way which keeps uptime as continuous as it can even if the
raw tickstamps look very different.

> If the 128bit divides are being done to generate corrected frequencies,
> it might be that you can use the error term to adjust the current value
> - and remove the need for the divide at all (after the initial setup).

The user interface expresses rate changes as a sysrate_t, which allows
the new value of the `r' rate constant to be computed with a 128-bit
multiply.
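In case the shape of that arithmetic is easier to read as code, here is a
sketch.  The field names, the fixed-point representations chosen for `r'
and sysrate_t, and the gcc/clang unsigned __int128 extension used for the
wide arithmetic are all assumptions made for illustration (and wrap
handling of the raw count is elided); only the form of the conversion
comes from the description above:

    #include <stdint.h>

    typedef uint64_t tickcount_t;
    typedef int64_t systime_t;   /* 2^-32 second units, as before */
    typedef uint64_t sysrate_t;  /* assumed: ratio in 2.62 fixed point */

    struct tc_conv {
        unsigned int s;     /* pre-shift applied to the raw count */
        uint64_t r;         /* rate constant, assumed 32.32 fixed point:
                             * 2^-32 second units per shifted tick */
        systime_t c_time;   /* offset giving UTC-aligned time */
        systime_t c_uptime; /* offset giving uptime */
    };

    static systime_t
    tc_scale(const struct tc_conv *cv, tickcount_t tc)
    {
        /* (tc << s) * r, keeping the full-width product before
         * discarding the 32 fraction bits of r */
        return (systime_t)
            (((unsigned __int128)(tc << cv->s) * cv->r) >> 32);
    }

    static systime_t
    tc_to_time(const struct tc_conv *cv, tickcount_t tc)
    {
        return tc_scale(cv, tc) + cv->c_time;
    }

    static systime_t
    tc_to_uptime(const struct tc_conv *cv, tickcount_t tc)
    {
        return tc_scale(cv, tc) + cv->c_uptime;
    }

    static systime_t
    tc_boottime(const struct tc_conv *cv)
    {
        /* boottime is implied by the two offsets */
        return cv->c_time - cv->c_uptime;
    }

    /* A rate change: with sysrate_t assumed to be a dimensionless
     * ratio near 1.0 in 2.62 fixed point, the new `r' is a 128-bit
     * multiply, no divide. */
    static uint64_t
    r_adjust(uint64_t r, sysrate_t rate)
    {
        return (uint64_t)(((unsigned __int128)r * rate) >> 62);
    }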
The current code I have uses the divide in the kernel in 3 spots:

- It needs the 128-bit divide once to compute the nominal value of `r'
  from each clock's counter frequency, which is done when a clock is
  initialized.

- It needs it to compute the tickcount_t time of a change that needs to
  be scheduled for a future moment, like the end of a slew a la
  adjtime() or a leap second.

- The system call interface doesn't promise to do exactly the adjustment
  you tell it to, but does promise to tell you exactly what it ended up
  doing.  For rate changes it currently does a divide to compute, in
  sysrate_t form, the rate it actually set so it can return that to the
  caller, since the different precision of the `r' rate constant can
  change what you asked for by a couple of low-order bits (i.e. 10^-18).

Clearly the last could go away if I get over being anal about precision,
though I'm not sure it has to.  The worst case machine I've looked at so
far is arm, which has no hardware divide instructions at all and
relatively slow multiplies, yet my 1 GHz Cortex A8 can do the 128-bit
divide with shifts, adds and multiplies in about 250 ns in most cases.
That's the same as maybe two or three cache misses.  The only other
things I've measured were on a 2.4 GHz amd64 machine, which did the
divide in about 22 ns with the 128-bit divide instruction it has, or
28 ns with the C function that every other machine uses, and the same
machine running i386 code, which I remember doing it in under 40 ns.
And the benefit that having a fine rate-of-advance adjustment pays for
is that it should allow the clock to be maintained as accurately as it
can be with a minimal rate of adjustment, so ideally the divides won't
need to be done very often.

> One thought I've sometimes had is that, instead of trying to synchronise
> the TSC counters in an SMP system, move them as far from each other
> as possible!
> Then, when you read the TSC, you can tell from the value which cpu
> it must have come from!

I need to get a machine with more than one CPU socket at some point.  My
current approach has been to bail and use some other clock at the first
sign of trouble...

Dennis Ferguson
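P.S. For concreteness, the first of those divides looks roughly like this
under the same illustrative assumptions as the sketch earlier in this
message (`r' as 2^-32 second units per tick in 32.32 fixed point, the
pre-shift s ignored, and unsigned __int128 standing in for the portable
shift-add-multiply routine):

    #include <stdint.h>

    /* Nominal `r' for a counter running at hz ticks per second
     * (hz > 1 assumed).  With the assumed representation this is
     * r = 2^64 / hz; the dividend doesn't fit in 64 bits, hence the
     * 128-bit divide.  Done once, when the clock is initialized. */
    static uint64_t
    nominal_r(uint64_t hz)
    {
        return (uint64_t)(((unsigned __int128)1 << 64) / hz);
    }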