Ouch.

You didn't mention a target accuracy.  I will assume the usual 2-3%
accuracy or better, and you can correct me if necessary.

I could meet all but one or two of those requirements several different
ways.

For instance, very high or low quantiles can be handled with stacked
max-sets or min-sets.  The idea is that you keep the highest k values of
the full data, the highest k values of a 10x downsampled copy of the data,
and so on.  This works pretty well down to about the 90th %-ile (or, with
min-sets, up to about the 10th %-ile).  This structure merges without loss
of accuracy.
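
To make the stacked max-set idea concrete, here is a rough Python sketch.
It is only my reading of the description above, not code from Mahout or
stream-lib.  The class name, the parameters (k, levels, seed) and the use
of random sampling for the 10x downsampling are all assumptions made for
illustration.

```python
import heapq
import random


class StackedMaxSets:
    """Hypothetical sketch: keep the k largest values at several sampling levels.

    Level 0 sees every value; level j sees each value with probability 10**-j.
    """

    def __init__(self, k=100, levels=4, seed=0):
        self.k = k
        self.n = 0                                 # total values observed
        self.heaps = [[] for _ in range(levels)]   # min-heaps holding the top k
        self.rng = random.Random(seed)             # assumption: random downsampling

    def _offer(self, level, x):
        heap = self.heaps[level]
        if len(heap) < self.k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)             # drop the smallest retained value

    def add(self, x):
        self.n += 1
        for level in range(len(self.heaps)):
            # A value reaches level j only if it survived levels 1..j-1,
            # so each level is a 10x downsample of the one before it.
            if level > 0 and self.rng.random() >= 0.1:
                break
            self._offer(level, x)

    def merge(self, other):
        # Assumes both structures use the same k and number of levels.
        self.n += other.n
        for level, heap in enumerate(other.heaps):
            for x in heap:
                self._offer(level, x)

    def quantile(self, q):
        rank = (1.0 - q) * self.n                  # values expected above the answer
        for level, heap in enumerate(self.heaps):
            scaled = rank / 10 ** level            # same rank within the downsample
            if scaled < len(heap):
                return sorted(heap, reverse=True)[int(scaled)]
        raise ValueError("quantile is too far from the maximum for this structure")
```

Merging just re-offers the other side's retained values level by level, so
level 0 of the merged structure holds exactly the top k of the combined
data.  A min-set version for low quantiles would be the mirror image.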

For well-defined quantiles like 25-50-75, the Mahout OnlineSummarizer
is excellent.  You have to choose your quantiles ahead of time, and you
can sometimes merge (but perverse data can kill you).

And then there is the QDigest.  It is, by definition, as big as a QDigest,
but it is mergeable and allows any quantile.  Also cool is the fact that
you can pick the quantile late in the process.
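
For readers who have not seen one, here is a simplified q-digest sketch in
Python that shows the two properties above: merging, and choosing the
quantile only at query time.  It is not the stream-lib implementation.  The
class name, the heap-style node numbering, the depth and k parameters, and
the point at which compression is triggered are all assumptions of mine.

```python
class QDigestSketch:
    """Hypothetical simplified q-digest over integers in [0, 2**depth)."""

    def __init__(self, depth=16, k=64):
        self.depth = depth
        self.k = k          # compression factor: larger k means more nodes, less error
        self.n = 0
        self.counts = {}    # node id -> count; root is 1, children of i are 2i and 2i+1

    def add(self, value):
        leaf = (1 << self.depth) + value
        self.counts[leaf] = self.counts.get(leaf, 0) + 1
        self.n += 1
        if len(self.counts) > 6 * self.k:   # assumption: compress lazily
            self.compress()

    def compress(self):
        limit = self.n // self.k
        # Walk from the leaves toward the root, folding light sibling
        # pairs into their parent.
        for level in range(self.depth, 0, -1):
            lo, hi = 1 << level, 1 << (level + 1)
            for node in [i for i in self.counts if lo <= i < hi]:
                if node not in self.counts:
                    continue                # already folded along with its sibling
                parent, sibling = node >> 1, node ^ 1
                total = (self.counts[node] + self.counts.get(sibling, 0)
                         + self.counts.get(parent, 0))
                if total <= limit:
                    self.counts[parent] = total
                    del self.counts[node]
                    self.counts.pop(sibling, None)

    def merge(self, other):
        # Assumes both digests were built with the same depth and k.
        for node, c in other.counts.items():
            self.counts[node] = self.counts.get(node, 0) + c
        self.n += other.n
        self.compress()

    def _range_max(self, node):
        # Largest value covered by this node's subtree.
        shift = self.depth - (node.bit_length() - 1)
        return ((node + 1) << shift) - (1 << self.depth) - 1

    def quantile(self, q):
        target, seen = q * self.n, 0
        # Visit nodes by the upper end of their range, smaller ranges first.
        for node in sorted(self.counts, key=lambda i: (self._range_max(i), -i)):
            seen += self.counts[node]
            if seen >= target:
                return self._range_max(node)
        return self._range_max(1)
```

In this sketch, memory is governed by k and the rank error grows roughly
like depth / k, so shrinking the structure means trading away accuracy.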

Maybe the answer is to make the QDigest structure smaller.  How well is the
stream-lib implementation cranked down?  Is it really tight on memory?




On Wed, Aug 7, 2013 at 3:04 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:

> Hi Ted,
>
> I need percentiles.  Ideally not pre-defined ones, because one person may
> want e.g. 70th pctile, while somebody else might want 75th pctile for the
> same metric.
>
> Deal breakers:
> High memory footprint. ("high" means "higher than QDigest from stream-lib"
> for us.... and we could test and compare with QDigest relatively easily
> with live data)
> Algos that create data structures that cannot be merged
> Loss of accuracy that is not predictably small or configurable
>
> Thank you,
> Otis
> ----
>
> Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
> http://sematext.com/spm
>
>
>
>
> >________________________________
> > From: Ted Dunning <ted.dunn...@gmail.com>
> >To: "user@mahout.apache.org" <user@mahout.apache.org>; Otis Gospodnetic <
> otis_gospodne...@yahoo.com>
> >Sent: Wednesday, August 7, 2013 11:48 PM
> >Subject: Re: Is OnlineSummarizer mergeable?
> >
> >
> >
> >Otis,
> >
> >
> >What statistics do you need?
> >
> >
> >What guarantees?
> >
> >
> >
> >
> >
> >On Wed, Aug 7, 2013 at 1:26 PM, Otis Gospodnetic <
> otis_gospodne...@yahoo.com> wrote:
> >
> >Hi Ted,
> >>
> >>I'm actually trying to find an alternative to QDigest (the stream-lib
> impl specifically) because even though it seems good, we have to deal with
> crazy volumes of data in SPM (performance monitoring service, see
> signature)... I'm hoping we can find something that has both a lower memory
> footprint than QDigest AND that is mergeable a la QDigest.  Utopia?
> >>
> >>Thanks,
> >>Otis
> >>----
> >>Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
> http://sematext.com/spm
> >>
> >>
> >>
> >>
> >>>________________________________
> >>> From: Ted Dunning <ted.dunn...@gmail.com>
> >>>To: "user@mahout.apache.org" <user@mahout.apache.org>
> >>>Sent: Wednesday, August 7, 2013 4:51 PM
> >>>Subject: Re: Is OnlineSummarizer mergeable?
> >>>
> >>>
> >>>It isn't as mergeable as I would like.  If you have randomized record
> >>>selection, it should be possible, but perverse ordering can cause serious
> >>>errors.
> >>>
> >>>It would be better to use something like a Q-digest.
> >>>
> >>>http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf
> >>>
> >>>
> >>>
> >>>
> >>>On Wed, Aug 7, 2013 at 4:21 AM, Otis Gospodnetic <
> otis.gospodne...@gmail.com
> >>>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Is OnlineSummarizer algo "mergeable"?
> >>>>
> >>>> Say that we compute a percentile for some metric for time 12:00-12:01
> >>>> and store that somewhere, then we compute it for 12:01-12:02 and store
> >>>> that separately, and so on.
> >>>>
> >>>> Can we then later merge these computed and previously stored
> >>>> percentile "instances" and get an accurate value?
> >>>>
> >>>> Thanks,
> >>>> Otis
> >>>> --
> >>>> Performance Monitoring -- http://sematext.com/spm
> >>>> Solr & ElasticSearch Support -- http://sematext.com/
> >>>>
> >>>
> >>>
> >>>
> >
> >
> >
