I have some time to work on it now. What's a good way to continue the discussions before coding it?
This e-mail list, JIRA or something else?

On Mon, Apr 6, 2015 at 12:59 AM, Reynold Xin <r...@databricks.com> wrote:

> I think those are great to have. I would put them in the DataFrame API
> though, since this is applying to structured data. Many of the advanced
> functions on the PairRDDFunctions should really go into the DataFrame API
> now we have it.
>
> One thing that would be great to understand is what state-of-the-art
> alternatives are out there. I did a quick google scholar search using the
> keyword "approximate quantile" and found some older papers. Just the
> first few I found:
>
> http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf by bell labs
>
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf
> by Bruce Lindsay, IBM
>
> http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf
>
> On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret <gr...@celtra.com> wrote:
>
>> Hi!
>>
>> I'd like to get community's opinion on implementing a generic quantile
>> approximation algorithm for Spark that is O(n) and requires limited memory.
>> I would find it useful and I haven't found any existing implementation. The
>> plan was basically to wrap t-digest
>> <https://github.com/tdunning/t-digest>, implement the
>> serialization/deserialization boilerplate and provide
>>
>> def cdf(x: Double): Double
>> def quantile(q: Double): Double
>>
>> on RDD[Double] and RDD[(K, Double)].
>>
>> Let me know what you think. Any other ideas/suggestions also welcome!
>>
>> Best,
>> Grega
>> --
>> *Grega Kešpret*
>> Senior Software Engineer, Analytics
>>
>> Skype: gregakespret
>> celtra.com <http://www.celtra.com/> | @celtramobile
>> <http://www.twitter.com/celtramobile>