Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark

Reynold Xin Mon, 06 Apr 2015 01:01:12 -0700

I think those are great to have. I would put them in the DataFrame API
though, since this applies to structured data. Many of the advanced
functions in PairRDDFunctions should really go into the DataFrame API
now that we have it.


One thing that would be great to understand is what state-of-the-art
alternatives are out there. I did a quick Google Scholar search using the
keyword "approximate quantile" and found some older papers. Here are the
first few I found:

http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf  by Bell Labs

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf
 by Bruce Lindsay, IBM

http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf





On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret <[email protected]> wrote:

> Hi!
>
> I'd like to get the community's opinion on implementing a generic quantile
> approximation algorithm for Spark that is O(n) and requires limited memory.
> I would find it useful and I haven't found any existing implementation. The
> plan was basically to wrap t-digest <https://github.com/tdunning/t-digest>,
> implement the serialization/deserialization boilerplate and provide
>
> def cdf(x: Double): Double
> def quantile(q: Double): Double
>
>
> on RDD[Double] and RDD[(K, Double)].
>
> Let me know what you think. Any other ideas/suggestions also welcome!
>
> Best,
> Grega
> --
> *Grega Kešpret*
> Senior Software Engineer, Analytics
>
> Skype: gregakespret
> celtra.com <http://www.celtra.com/> | @celtramobile
> <http://www.twitter.com/celtramobile>
>
>
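
For illustration only, here is a rough sketch of the API shape proposed in the
quoted message: a summary that can be built per partition, merged, and then
queried with cdf and quantile. ExactSummary, fromValues and summarize are
hypothetical names, not part of Spark or t-digest. The sketch keeps every value
in a sorted vector, so it is exact and NOT memory-bounded; a real implementation
would presumably swap that for a bounded sketch such as t-digest.

```scala
// Hypothetical stand-in for an approximate quantile summary.
// It stores all values, so it is exact and not memory-bounded.
final case class ExactSummary(sorted: Vector[Double]) {

  // Combine two partial summaries, e.g. one per partition.
  def merge(other: ExactSummary): ExactSummary =
    ExactSummary((sorted ++ other.sorted).sorted)

  // Fraction of stored values less than or equal to x.
  def cdf(x: Double): Double =
    if (sorted.isEmpty) Double.NaN
    else sorted.count(_ <= x).toDouble / sorted.size

  // Smallest stored value v with cdf(v) >= q, for q in [0, 1].
  def quantile(q: Double): Double = {
    require(q >= 0.0 && q <= 1.0, "q must be in [0, 1]")
    if (sorted.isEmpty) Double.NaN
    else {
      val rank = math.ceil(q * sorted.size).toInt
      sorted(math.max(rank - 1, 0))
    }
  }
}

object ExactSummary {
  val empty: ExactSummary = ExactSummary(Vector.empty)

  // Build a summary from the values of a single partition.
  def fromValues(values: Seq[Double]): ExactSummary =
    ExactSummary(values.toVector.sorted)

  // Simulate a distributed run: summarize each partition, then merge.
  def summarize(partitions: Seq[Seq[Double]]): ExactSummary =
    partitions.map(fromValues).foldLeft(empty)(_ merge _)
}
```

The build-then-merge shape is what would let this run through RDD.aggregate on
RDD[Double], or aggregateByKey on RDD[(K, Double)], assuming the summary type is
serializable. For example, ExactSummary.summarize(Seq(Seq(1.0, 5.0, 3.0),
Seq(2.0, 4.0))).quantile(0.5) gives the median of the five values.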
