RE: Percentile example

2015-02-17 Thread SiMaYunRui
Thanks, Imran, for the very detailed explanations and options. I think for now T-Digest is what I want. From: iras...@cloudera.com Date: Tue, 17 Feb 2015 08:39:48 -0600 Subject: Re: Percentile example To: myl...@hotmail.com CC: user@spark.apache.org (trying to repost to the list w/out URLs

RE: Percentile example

2015-02-17 Thread SiMaYunRui
Thanks, Kohler, that's a very interesting approach. I have never used Spark SQL and am not sure whether my cluster is configured for it, but I will definitely give it a try. From: c.koh...@elsevier.com To: myl...@hotmail.com; user@spark.apache.org Subject: Re: Percentile example Date: Tue, 17 Feb 2015

Re: Percentile example

2015-02-17 Thread Imran Rashid
(trying to repost to the list w/out URLs -- rejected as spam earlier) Hi, Using take() is not a good idea; as you have noted, it will pull a lot of data down to the driver, so it's not scalable. Here are some more scalable alternatives: 1. Approximate solutions 1a. Sample the data. Just sample
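
[Editor's sketch, not part of the original message] The snippet above is cut off by the archive, so the following is only an illustration of the general "sample the data" idea, not Imran's actual code. It uses plain Python rather than Spark; the function names `percentile_of_sorted` and `approx_percentile` are hypothetical, and the nearest-rank definition of a percentile is an assumption. In Spark, the sampling step would roughly correspond to `RDD.sample(withReplacement, fraction, seed)` followed by collecting the much smaller sample.

```python
import math
import random


def percentile_of_sorted(sorted_values, p):
    """Nearest-rank percentile (0 < p <= 100) of a sorted, non-empty list."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def approx_percentile(values, p, fraction, seed=0):
    """Estimate the p-th percentile from a random sample of the data.

    Each element is kept independently with probability `fraction`,
    so only the sample (not the full data set) has to be sorted.
    """
    rng = random.Random(seed)
    sample = [v for v in values if rng.random() < fraction]
    if not sample:
        raise ValueError("sample is empty; increase `fraction`")
    sample.sort()
    return percentile_of_sorted(sample, p)


data = list(range(1, 100001))
exact = percentile_of_sorted(sorted(data), 95)
estimate = approx_percentile(data, 95, fraction=0.01)
print(exact, estimate)
```

The estimate's accuracy depends on the sample size, so a larger `fraction` trades more data movement for a tighter approximation.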

Re: Percentile example

2015-02-17 Thread Kohler, Curt E (ELS-STL)
The best approach I've found to calculate percentiles in Spark is to leverage SparkSQL. If you use the Hive Query Language support, you can use the UDAFs for percentiles (as of Spark 1.2). Something like this (Note: syntax not guaranteed to run but should give you the gist of what you need
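
[Editor's sketch, not part of the original message] The code from this message is cut off by the archive and is not reproduced here. As a rough illustration of the approach described, the hypothetical helper below builds a HiveQL query around Hive's `percentile` UDAF (exact, integral columns only) or `percentile_approx` UDAF (approximate, works on floating-point columns). The helper name, the table name `events`, and the column name `latency_ms` are assumptions made for the example; running the query against a cluster is shown only in comments.

```python
def percentile_query(table, column, p, approximate=False):
    """Build a HiveQL query that computes the p-th quantile (0 <= p <= 1)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    # Hive's percentile() needs an integral column; percentile_approx()
    # also accepts floating-point columns.
    udaf = "percentile_approx" if approximate else "percentile"
    return "SELECT {0}({1}, {2}) FROM {3}".format(udaf, column, p, table)


print(percentile_query("events", "latency_ms", 0.95))

# Assumed usage on a Hive-enabled Spark 1.2 setup (not executed here):
#   from pyspark.sql import HiveContext
#   hive_ctx = HiveContext(sc)
#   rows = hive_ctx.sql(percentile_query("events", "latency_ms", 0.95)).collect()
```

Because the aggregation runs inside the query engine, only the final percentile value comes back to the driver.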