Thanks, Imran, for the very detailed explanations and options. I think for now
T-Digest is what I want.
From: iras...@cloudera.com
Date: Tue, 17 Feb 2015 08:39:48 -0600
Subject: Re: Percentile example
To: myl...@hotmail.com
CC: user@spark.apache.org
(trying to repost to the list w/out URLs)
Thanks, Kohler, that's a very interesting approach. I've never used Spark SQL and
I'm not sure whether my cluster is configured well for it, but I will definitely
give it a try.
From: c.koh...@elsevier.com
To: myl...@hotmail.com; user@spark.apache.org
Subject: Re: Percentile example
Date: Tue, 17 Feb 2015
(trying to repost to the list w/out URLs -- rejected as spam earlier)
Hi,
Using take() is not a good idea; as you have noted, it will pull a lot of
data down to the driver, so it's not scalable. Here are some more scalable
alternatives:
1. Approximate solutions
1a. Sample the data. Just sample
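The sampling idea in 1a can be sketched in plain Python (a minimal, hypothetical example with toy data; in Spark the sample would instead come from something like RDD.sample, and the exact helper names here are made up): draw a random sample, sort it, and read the percentile off the sorted sample with a nearest-rank lookup.

```python
import random

def approx_percentile(data, p, sample_size=1000, seed=42):
    """Approximate the p-th percentile (0 <= p <= 100) from a random sample.

    Accuracy depends on sample_size: the full data set is never sorted,
    only the sample, which is what makes this cheap at scale.
    """
    rng = random.Random(seed)
    sample = rng.sample(data, min(sample_size, len(data)))
    sample.sort()
    # Nearest-rank index into the sorted sample.
    idx = min(len(sample) - 1, int(round(p / 100.0 * (len(sample) - 1))))
    return sample[idx]

# Toy data; in practice this would be a sample collected from an RDD.
data = list(range(100_000))
print(approx_percentile(data, 95))
```

For uniformly spread data like the toy range above, the result lands close to the true 95th percentile (about 95,000), with error shrinking as the sample grows.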
To: user@spark.apache.org
Subject: Percentile example
hello,
I am a newbie to Spark and trying to figure out how to compute a percentile over a
big data set. I googled this topic but didn't find any useful code examples or
explanations. It seems that I can use the sortByKey transformation to get my
data set in order, but I'm not sure how I can
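The sort-then-index idea in the question can be sketched outside Spark in plain Python (a hypothetical illustration, not Spark code): fully sort the data, then look up the value at the target rank. In Spark the analogous steps would be sorting the RDD, attaching indices with zipWithIndex, and filtering for the target rank, which is exact but costs a full sort.

```python
def exact_percentile(data, p):
    """Exact p-th percentile (0 <= p <= 100) by full sort and
    nearest-rank lookup. In Spark, the sort maps to sortByKey/sortBy
    and the rank lookup to zipWithIndex plus a filter on the index.
    """
    ordered = sorted(data)
    rank = min(len(ordered) - 1, int(round(p / 100.0 * (len(ordered) - 1))))
    return ordered[rank]

print(exact_percentile([5, 1, 9, 3, 7], 50))  # → 5 (the median of the five values)
```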