On Mon, Mar 14, 2011 at 12:34 PM, Luc Maisonobe <luc.maison...@free.fr>wrote:
> Le 14/03/2011 15:33, Benson Margulies a écrit : > > Please excuse the following ignorant question. > > > > I want to maintain summary statistics of a rate. At each 'event', I > > know the number of characters and the time it took to process them, > > and I want to maintain summary statistics for the rate of > > chars/second. I imagine that I'm missing something basic, but I don't > > see how to do this. > > You should define some windows width, either in terms of a time span > (all events in the last n seconds) or in terms of number of events (last > n events). > > In [math], we do not provide (yet) anything for maintaining such a data > structure, you'll have to maintain the events in this slot by yourself, > with something similar to a FIFO. > The problem for Benson's case is that the FIFO is a global structure. Computing the global moving average is a bit trickier. > Another thing we discussed some months ago (but did not implement yet) > is a way to compute an approximation of percentiles in a flow of data > without storing them. There is an interesting algorithm for it that was > developed for the needs of telecommunication companies, I think it may > be of interest to you. This would provide results like : currently 95% > of the characters are processed in n milliseconds. would you be > interested in us implementing this feature ? > I would recommend stealing it instead from the OnlineSummarizer in Mahout which computes rank statistics as well as average and standard deviation in a completely on-line way. The only rank statistics computed with now are the quartiles, but this can easily be changed to compute 1, 5, 95 and 99%-iles. This is done in a completely on-line way (O(1) memory and time per sample) and is roughly as accurate for computing quantiles of a distribution as sorting all the samples and observing the empirical quantiles. The current version simplifies the published algorithm slightly. As a result, some extremely skewed distributions can have quartiles that are in error. For instance, if I remember accurately, the unit gamma distribution with shape = 0.001 has a minimum of 0 and a median of 10^-50 while the Mahout code only gives an estimate of the median of 10^-20. Relative to the distance between the min to the median, this is an unfortunately large error. Relative to the distance from the mean to the min or relative to the inter-quartile distance, this error is trivial. I know of no practical use cases for such an extreme distributions except as very conservative scale preserving priors in some Bayesian computations. Even in these cases, estimation of the quantiles is not useful since only the posterior is of interest and it invariably is less extreme if any data is considered.