Thanks, Imran, for the very detailed explanations and options. I think for now
T-Digest is what I want.
From: iras...@cloudera.com
Date: Tue, 17 Feb 2015 08:39:48 -0600
Subject: Re: Percentile example
To: myl...@hotmail.com
CC: user@spark.apache.org
(trying to repost to the list w/out URLs)
Thanks, Kohler, that's a very interesting approach. I have never used Spark SQL
and I'm not sure whether my cluster is configured well for it, but I will
definitely give it a try.
From: c.koh...@elsevier.com
To: myl...@hotmail.com; user@spark.apache.org
Subject: Re: Percentile example
Date: Tue, 17 Feb 2015
(trying to repost to the list w/out URLs -- rejected as spam earlier)
Hi,
Using take() is not a good idea; as you have noted, it will pull a lot of
data down to the driver, so it's not scalable. Here are some more scalable
alternatives:
1. Approximate solutions
1a. Sample the data. Just sample
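The sampling idea above can be sketched in plain Python (outside Spark, with hypothetical helper names): take a random sample of the data and compute the exact percentile on the sample, which is cheap because the sample fits in memory.

```python
import random

def percentile_of(values, p):
    """Exact p-th percentile of a small list (nearest-rank method)."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, int(round(p * (len(s) - 1)))))
    return s[k]

def approx_percentile_by_sampling(values, p, sample_size, seed=42):
    """Estimate the p-th percentile from a uniform random sample.

    In Spark the sampling step would be done distributed (e.g. RDD.sample)
    and only the sample collected to the driver.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    return percentile_of(sample, p)

data = list(range(10_000))
estimate = approx_percentile_by_sampling(data, 0.95, sample_size=1000)
```

With a sample of 1,000 out of 10,000 uniformly spaced values, the estimate lands close to the true 95th percentile (9,500); the error shrinks as the sample grows.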
The best approach I've found to calculate percentiles in Spark is to leverage
Spark SQL. If you use the Hive Query Language support, you can use the UDAFs
for percentiles (as of Spark 1.2).
Something like this (note: syntax not guaranteed to run, but it should give you
the gist of what you need):
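A minimal sketch of such a query, assuming a hypothetical table `people` with an integer column `age` registered with the HiveContext (`percentile` is the Hive UDAF for integral columns; `percentile_approx` is the variant for doubles):

```sql
-- Hypothetical table/column names; run via a HiveContext in Spark 1.2+
SELECT percentile(age, 0.95) AS p95,
       percentile(age, array(0.25, 0.5, 0.75)) AS quartiles
FROM people
```

The second form shows that the UDAF also accepts an array of percentile values and returns them in one pass.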