I think those are great to have. I would put them in the DataFrame API though, since this applies to structured data. Many of the advanced functions on PairRDDFunctions should really go into the DataFrame API now that we have it.
One thing that would be great to understand is what state-of-the-art alternatives are out there. I did a quick Google Scholar search using the keyword "approximate quantile" and found some older papers. Just the first few I found:

http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf by Bell Labs

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf by Bruce Lindsay, IBM

http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf

On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret <gr...@celtra.com> wrote:

> Hi!
>
> I'd like to get community's opinion on implementing a generic quantile
> approximation algorithm for Spark that is O(n) and requires limited memory.
> I would find it useful and I haven't found any existing implementation. The
> plan was basically to wrap t-digest <https://github.com/tdunning/t-digest>,
> implement the serialization/deserialization boilerplate and provide
>
> def cdf(x: Double): Double
> def quantile(q: Double): Double
>
> on RDD[Double] and RDD[(K, Double)].
>
> Let me know what you think. Any other ideas/suggestions also welcome!
>
> Best,
> Grega
> --
> Grega Kešpret
> Senior Software Engineer, Analytics
>
> Skype: gregakespret
> celtra.com <http://www.celtra.com/> | @celtramobile <http://www.twitter.com/celtramobile>
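
To make the shape of the proposal concrete, here is a rough sketch of a bounded-memory, mergeable summary exposing the two proposed methods, cdf and quantile. It is only an illustration under stated assumptions: the summary is a plain fixed-width histogram over a known range, used as a stand-in and not t-digest, and the names HistSummary and Demo are hypothetical. Plain Scala collections stand in for RDD partitions so the snippet has no Spark dependency.

```scala
// Stand-in summary: a fixed-width histogram over a known range [lo, hi].
// This is NOT t-digest; it only shows the "bounded memory, mergeable" shape.
final case class HistSummary(lo: Double, hi: Double, counts: Vector[Long]) {
  private val width = (hi - lo) / counts.length
  val total: Long = counts.sum

  // Add one value; values outside [lo, hi] are clamped into the edge bins.
  def add(x: Double): HistSummary = {
    val raw = ((x - lo) / width).toInt
    val i = math.max(0, math.min(counts.length - 1, raw))
    copy(counts = counts.updated(i, counts(i) + 1))
  }

  // Merge two summaries built with the same range and bin count.
  def merge(other: HistSummary): HistSummary = {
    require(lo == other.lo && hi == other.hi && counts.length == other.counts.length)
    copy(counts = counts.zip(other.counts).map { case (a, b) => a + b })
  }

  // Approximate fraction of values <= x, interpolating linearly inside a bin.
  def cdf(x: Double): Double = {
    if (total == 0) Double.NaN
    else if (x <= lo) 0.0
    else if (x >= hi) 1.0
    else {
      val pos = (x - lo) / width
      val i = math.min(pos.toInt, counts.length - 1)
      val below = counts.take(i).sum
      (below + counts(i) * (pos - i)) / total
    }
  }

  // Approximate value at quantile q in [0, 1].
  def quantile(q: Double): Double = {
    require(q >= 0.0 && q <= 1.0)
    if (total == 0) Double.NaN
    else {
      val target = q * total
      var i = 0
      var seen = 0L
      // Walk bins until the cumulative count reaches the target rank.
      while (i < counts.length - 1 && seen + counts(i) < target) {
        seen += counts(i)
        i += 1
      }
      val frac = if (counts(i) == 0) 0.0 else (target - seen) / counts(i)
      lo + (i + math.min(frac, 1.0)) * width
    }
  }
}

object Demo {
  // Assumed range and bin count, chosen only for this example.
  val zero = HistSummary(0.0, 1000.0, Vector.fill(100)(0L))
  val data = (0 until 1000).map(_.toDouble)

  // One pass over all the data.
  val single = data.foldLeft(zero)((s, x) => s.add(x))

  // Per-"partition" summaries combined with merge.
  val merged = data.grouped(250)
    .map(part => part.foldLeft(zero)((s, x) => s.add(x)))
    .reduce((a, b) => a.merge(b))

  def main(args: Array[String]): Unit = {
    println(merged.quantile(0.5))
    println(merged.cdf(500.0))
  }
}
```

On an actual RDD[Double], the same add and merge functions would be the seqOp and combOp arguments to rdd.aggregate(zero)(seqOp, combOp). Swapping the histogram for t-digest would mostly be a matter of the serialization boilerplate mentioned in the quoted message.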