On 3 January 2014 22:58, Rainer Jung <rainer.j...@kippdata.de> wrote:
> On 03.01.2014 13:57, bugzi...@apache.org wrote:
>> https://issues.apache.org/bugzilla/show_bug.cgi?id=55932
>>
>> --- Comment #6 from Sebb <s...@apache.org> ---
>> I have been having a look at the implementation.
>>
>> I don't really see that it needs Commons Math; we already have StatCalculator
>> which handles percentiles and more.
>>
>> Likewise, does it really need Commons Pool?
>> It seems wrong to have to have 2 separate pools of SocketOutputStream
>> instances.
>> How many of these would there be?
>>
>> Also, DescriptiveStatistics is not thread-safe (nor is StatCalculator).
>>
>> If we do implement something like this, I think the data processing needs
>> either to be carefully synchronised, or the raw data should be sent to a
>> separate singleton background thread.
>
> FWIW: I always get a bit nervous when percentiles are calculated.
> Percentiles are expensive to calculate if one needs exact results with
> given percentage numbers (50%, 99%, 99.9% etc.). In that case one needs
> to keep all values as an ordered list to calculate the percentiles. For
> a long running test that would be expensive in terms of memory but also
> in terms of CPU (sorting). There's no way of exactly merging percentiles
> from interim statistical data.
>
> Sometimes approximations are enough. By approximation I don't mean
> estimated data, but percentages which are not exactly the ones you are
> keen for. E.g. you would get a 48% value instead of a 50% value, or a
> 99.02% value instead of a 99% value.
>
> Suppose you would know (configure) that only very few samples will take
> longer than 1000ms, then one could create fixed bins for e.g. 10ms,
> 15ms, 20ms, 25ms, 30ms, 40ms, 50ms, 75ms, 100ms, 150ms, 200ms, 250ms,
> 300ms, 400ms, 500ms, 750ms and 1000ms. Now whenever a sample finishes
> you count the sample in the bin it belongs to and do not save the data
> (of course you can still log it). At any time you can now look at the
> not need to keep all sample values around and sort them, but one does
> also not get equidistant percentiles (10%, 11%, 12%, ...).

StatCalculator already takes a similar approach, counting values rather
than storing them. We already use it for the GUI listeners.

There are other approaches; Commons Math DescriptiveStatistics uses an
array of doubles (with a sliding window).

And there is also the following:

http://search-lucene.com/jd/mahout/math/org/apache/mahout/math/stats/OnlineSummarizer.html

However, AFAICT it does not support arbitrary percentiles, only quartiles.
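
For illustration, the fixed-bin counting described in the quoted message could be sketched roughly as follows. This is only a sketch under assumptions: FixedBinHistogram and its methods are hypothetical names, not existing JMeter or Commons Math classes, and the use of AtomicLongArray is just one possible way to make the counting safe across sampler threads.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical helper (not part of JMeter): counts samples in fixed bins. */
final class FixedBinHistogram {
    private final long[] upperBoundsMs;   // ascending, inclusive upper bounds
    private final AtomicLongArray counts; // last slot is the overflow bin

    FixedBinHistogram(long[] upperBoundsMs) {
        this.upperBoundsMs = upperBoundsMs.clone();
        this.counts = new AtomicLongArray(upperBoundsMs.length + 1);
    }

    /** Count the sample in the first bin whose bound is >= elapsedMs. */
    void record(long elapsedMs) {
        int i = 0;
        while (i < upperBoundsMs.length && elapsedMs > upperBoundsMs[i]) {
            i++;
        }
        counts.incrementAndGet(i); // the raw value itself is not stored
    }

    long total() {
        long sum = 0;
        for (int i = 0; i < counts.length(); i++) {
            sum += counts.get(i);
        }
        return sum;
    }

    /** Percentage of samples with elapsed time <= upperBoundsMs[binIndex]. */
    double cumulativePercent(int binIndex) {
        long total = total();
        if (total == 0) {
            return 0.0;
        }
        long cumulative = 0;
        for (int i = 0; i <= binIndex; i++) {
            cumulative += counts.get(i);
        }
        return 100.0 * cumulative / total;
    }

    /**
     * Smallest bin bound at or below which at least pct percent of the
     * samples fall; Long.MAX_VALUE if that point lies in the overflow bin
     * (or if nothing has been recorded yet).
     */
    long boundForPercent(double pct) {
        long total = total();
        long cumulative = 0;
        for (int i = 0; i < upperBoundsMs.length && total > 0; i++) {
            cumulative += counts.get(i);
            if (100.0 * cumulative / total >= pct) {
                return upperBoundsMs[i];
            }
        }
        return Long.MAX_VALUE;
    }
}
```

With bins of 10, 20 and 50 ms and samples of 5, 10, 15, 30 and 100 ms, cumulativePercent gives 40%, 60% and 80% at the three bounds, so one reads off the percentages the bins happen to produce rather than exact 50% or 90% values. Reads taken while other threads are still recording are not a consistent snapshot across bins, so the figures are approximate in that respect too.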