Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark

Grega Kešpret Wed, 10 Jun 2015 14:54:45 -0700

I have some time to work on this now. What's a good way to continue the
discussion before coding it: this mailing list, JIRA, or something else?

On Mon, Apr 6, 2015 at 12:59 AM, Reynold Xin <r...@databricks.com> wrote:

> I think those are great to have. I would put them in the DataFrame API
> though, since this applies to structured data. Many of the advanced
> functions on PairRDDFunctions should really go into the DataFrame API
> now that we have it.
>
> One thing that would be great to understand is what state-of-the-art
> alternatives are out there. I did a quick Google Scholar search using the
> keyword "approximate quantile" and found some older papers. Here are the
> first few I found:
>
> http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf (Bell Labs)
>
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf
> (Bruce Lindsay, IBM)
>
> http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf
>
> On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret <gr...@celtra.com> wrote:
>
>> Hi!
>>
>> I'd like to get the community's opinion on implementing a generic quantile
>> approximation algorithm for Spark that is O(n) and requires limited memory.
>> I would find it useful and I haven't found any existing implementation. The
>> plan is basically to wrap t-digest
>> <https://github.com/tdunning/t-digest>, implement the
>> serialization/deserialization boilerplate, and provide
>>
>> def cdf(x: Double): Double
>> def quantile(q: Double): Double
>>
>>
>> on RDD[Double] and RDD[(K, Double)].
>>
>> Let me know what you think. Any other ideas/suggestions are also welcome!
>>
>> Best,
>> Grega
>> --
>> *Grega Kešpret*
>> Senior Software Engineer, Analytics
>>
>> Skype: gregakespret
>> celtra.com <http://www.celtra.com/> | @celtramobile
>> <http://www.twitter.com/celtramobile>
>>
>>
>
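
To make the proposed cdf/quantile interface a bit more concrete, here is a
minimal sketch of the general pattern: build a small mergeable summary per
partition, merge the summaries, then query. Note that this is NOT t-digest.
It uses a plain equal-width histogram over a known range as a stand-in, and
every name in it (HistogramSketch, add, merge, empty) is hypothetical and
chosen only for illustration.

```scala
// Hypothetical stand-in for a t-digest wrapper: an equal-width histogram
// over a known range [lo, hi]. Accuracy is limited by the bin width.
final case class HistogramSketch(lo: Double, hi: Double, counts: Vector[Long]) {
  require(hi > lo && counts.nonEmpty)
  private val width = (hi - lo) / counts.length
  val total: Long = counts.sum

  // Add one value; out-of-range values are clamped into the edge bins.
  def add(x: Double): HistogramSketch = {
    val i = math.min(counts.length - 1, math.max(0, ((x - lo) / width).toInt))
    copy(counts = counts.updated(i, counts(i) + 1))
  }

  // Merge two sketches that share the same range and bin count.
  def merge(other: HistogramSketch): HistogramSketch = {
    require(lo == other.lo && hi == other.hi && counts.length == other.counts.length)
    copy(counts = counts.zip(other.counts).map { case (a, b) => a + b })
  }

  // Approximate fraction of values <= x, interpolating linearly within a bin.
  def cdf(x: Double): Double =
    if (total == 0) Double.NaN
    else if (x <= lo) 0.0
    else if (x >= hi) 1.0
    else {
      val pos = (x - lo) / width
      val i = math.min(counts.length - 1, pos.toInt)
      val below = counts.take(i).sum
      (below + counts(i) * (pos - i)) / total.toDouble
    }

  // Approximate value at quantile q in [0, 1], interpolating within a bin.
  def quantile(q: Double): Double = {
    require(q >= 0.0 && q <= 1.0)
    if (total == 0) Double.NaN
    else {
      val target = q * total
      var cum = 0L
      var i = 0
      // Advance to the bin that contains the target rank.
      while (i < counts.length - 1 && cum + counts(i) < target) {
        cum += counts(i)
        i += 1
      }
      val frac = if (counts(i) == 0) 0.0 else (target - cum) / counts(i)
      lo + (i + math.min(1.0, math.max(0.0, frac))) * width
    }
  }
}

object HistogramSketch {
  def empty(lo: Double, hi: Double, bins: Int): HistogramSketch =
    HistogramSketch(lo, hi, Vector.fill(bins)(0L))
}

// Local simulation of two partitions: one sketch per partition, then merge.
val partitions = Seq(0 until 50, 50 until 100).map(_.map(_.toDouble))
val sketch = partitions
  .map(_.foldLeft(HistogramSketch.empty(0.0, 100.0, 10))(_ add _))
  .reduce(_ merge _)
val median = sketch.quantile(0.5)
val rankOf50 = sketch.cdf(50.0)
```

On an RDD[Double] the same two functions could presumably be passed to
aggregate, with add as the per-partition step and merge as the combiner. A
real implementation would replace the histogram with t-digest, which does not
need the value range up front.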
