While I was reading the docs I came across these parameters:
maxsamples [samples]
The maxsamples directive sets the default maximum number of samples that
chronyd should keep for each source. This setting can be overridden for
individual sources in the server and refclock directives. The default value
is 0, which disables the configurable limit. The useful range is 4 to 64.
As a special case, setting maxsamples to 1 disables frequency tracking
in order to make the sources immediately selectable with only one sample.
This can be useful when chronyd is started with the -q or -Q option.
minsamples [samples]
The minsamples directive sets the default minimum number of samples that
chronyd should keep for each source. This setting can be overridden for
individual sources in the server and refclock directives. The default value
is 6. The useful range is 4 to 64.
Forcing chronyd to keep more samples than it would normally keep reduces
noise in the estimated frequency and offset, but slows down the response to
changes in the frequency and offset of the clock. The offsets in the
tracking and sourcestats reports (and the tracking.log and statistics.log
files) may be smaller than the actual offsets.
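For concreteness, here is how I understand the per-source override mentioned in the docs would look in chrony.conf (server name and values are just illustrative):

```
# Defaults for all sources
minsamples 16
maxsamples 32

# Per-source override on the server directive
server ntp.example.com iburst minsamples 8 maxsamples 16
```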
Maybe I am way off here, but the descriptions suggest that these retained
samples are fitted with a linear (or other) regression, and the fitted
estimate is then what chrony uses. Is that correct?
The offset data is obviously noisy. In addition I have observed on my own
machines that there can be occasional outliers that are on the order of 10x
larger than usual. So the data also has outliers.
A linear regression is not the best way to process this kind of data.
Instead, a robust analysis method is better suited. There is a simple and
effective one for obtaining the "best fit" slope of a dataset called the
Theil-Sen estimator. There is a great Wikipedia entry on it if you are not
familiar with the technique (I am not sure whether links are allowed, so I
did not include one).
In a nutshell, the slope for all pairs of points in the dataset is computed
and the median value is selected as the estimate of the slope. It is
straightforward to use this to obtain a good estimate of the true offset
for any time within the time interval of the dataset, and to make a
prediction into the future. Because it can reject outliers and fits noisy
data well, it seems like it would be a perfect candidate for a more robust
offset estimator in chrony.
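To make the idea concrete, here is a minimal standalone sketch of the technique (not chrony code; function and variable names are my own, and the intercept is taken as the median of the residuals, a common robust choice):

```python
# Theil-Sen estimator over (time, offset) samples: the slope is the
# median of all pairwise slopes, which makes it robust to outliers.
from statistics import median

def theil_sen(times, offsets):
    """Return (slope, intercept) via the Theil-Sen estimator."""
    # Slope of every pair of points (assumes distinct times).
    slopes = [(offsets[j] - offsets[i]) / (times[j] - times[i])
              for i in range(len(times))
              for j in range(i + 1, len(times))]
    slope = median(slopes)
    # Robust intercept: median of the per-point residuals.
    intercept = median(o - slope * t for t, o in zip(times, offsets))
    return slope, intercept

# Noisy data following offset ~ 2*t + 1, with one gross outlier at t=4.
times = [0, 1, 2, 3, 4, 5]
offsets = [1.0, 3.1, 4.9, 7.0, 50.0, 11.1]
slope, intercept = theil_sen(times, offsets)
# The outlier barely moves the estimate, where ordinary least squares
# would be pulled far off.
```

Predicting the offset at any time t is then just slope * t + intercept.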
Normally this is an O(N^2) problem, because a slope must be calculated for
every pair of points in the dataset. But to implement this in
chrony it seems to me you only need to compute N pairs as each new offset is
obtained. This is because the previous pairwise slope values will not
change, and it is only the pairwise slope between the single new offset
value and the existing, retained values that needs to be calculated. So the
overhead would not be large, especially since the number of retained samples
is small (at most 64).
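The incremental scheme could be sketched like this (again just an illustration with made-up names, not chrony internals; note the median itself is still taken over all O(N^2) retained slopes, which is cheap for N <= 64):

```python
# Incremental Theil-Sen: retain the pairwise slopes and, for each new
# sample, add only the N new slopes against the retained points.
from statistics import median

class IncrementalTheilSen:
    def __init__(self, max_samples=64):
        self.max_samples = max_samples
        self.times = []
        self.offsets = []
        self.slopes = []  # pairwise slopes among all retained samples

    def add(self, t, offset):
        # O(N) work per new sample: one new slope per retained point
        # (assumes sample times are strictly increasing).
        self.slopes.extend((offset - o) / (t - u)
                           for u, o in zip(self.times, self.offsets))
        self.times.append(t)
        self.offsets.append(offset)
        if len(self.times) > self.max_samples:
            # Dropping the oldest sample invalidates its pairwise
            # slopes; the simplest correct fix is to rebuild the list.
            self.times.pop(0)
            self.offsets.pop(0)
            self.slopes = [(self.offsets[j] - self.offsets[i]) /
                           (self.times[j] - self.times[i])
                           for i in range(len(self.times))
                           for j in range(i + 1, len(self.times))]

    def predict(self, t):
        """Robust offset estimate at time t (past or future)."""
        slope = median(self.slopes)
        intercept = median(o - slope * u
                           for u, o in zip(self.times, self.offsets))
        return slope * t + intercept
```

The rebuild on eviction costs O(N^2), but with at most 64 samples that is on the order of 2000 divisions per eviction, which is negligible.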
Would it be worth looking into implementing this estimation method in chrony
for predicting the current and future offsets?
-Charlie