The best approach I've found for calculating percentiles in Spark is to leverage 
Spark SQL. If you use the Hive query language support, you can use Hive's UDAFs 
for percentiles (as of Spark 1.2).

Something like this (note: the syntax isn't guaranteed to run, but it should give 
you the gist of what you need to do):


import java.util.List;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.api.java.JavaSchemaRDD;
import org.apache.spark.sql.api.java.Row;
import org.apache.spark.sql.hive.api.java.JavaHiveContext;

JavaSparkContext sc = new JavaSparkContext(sparkConf);

JavaHiveContext hsc = new JavaHiveContext(sc);

// Get your data into a SchemaRDD and register the table
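// For example (hypothetical: a Record bean with fields FIELD1, FIELD2, FIELD3,
// and recordsRDD as a JavaRDD<Record>), something like:
//   JavaSchemaRDD table = hsc.applySchema(recordsRDD, Record.class);
//   table.registerTempTable("TABLE_NAME");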


// Query it

String hql = "SELECT FIELD1, FIELD2, percentile(FIELD3, 0.05) AS ptile5 "
           + "FROM TABLE_NAME GROUP BY FIELD1, FIELD2";

JavaSchemaRDD result = hsc.hql(hql);

List<Row> grp = result.collect();


for (Row row : grp) {
  // Column 2 holds the ptile5 value for each group.
  double ptile5 = row.getDouble(2);
  // Do something with the results
}
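One caveat: if I remember right, Hive's percentile() UDAF only accepts integral 
columns. If FIELD3 is a DOUBLE, percentile_approx() is the usual substitute; same 
call shape, approximate answer:

String hqlApprox = "SELECT FIELD1, FIELD2, percentile_approx(FIELD3, 0.05) AS ptile5 "
                 + "FROM TABLE_NAME GROUP BY FIELD1, FIELD2";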

Curt


From: SiMaYunRui <myl...@hotmail.com>
Date: Sunday, February 15, 2015 at 10:37 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Percentile example

hello,

I am a newbie to Spark and am trying to figure out how to compute a percentile 
over a big data set. I googled the topic but did not find any very useful code 
examples or explanations. It seems that I can use the sortByKey transformation to 
get my data set in order, but I am not sure how to then get the value of, for 
example, the 66th percentile.

Should I use take() to pick out the value at the 66th percentile? I don't believe 
any machine can load my entire data set into memory, so there must be more 
efficient approaches.
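To make it concrete, here is the rough shape of what I have in mind (untested 
sketch; "pairs" stands in for my data as a JavaPairRDD<Double, String>, keyed by 
the value I want the percentile of):

import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

JavaPairRDD<Double, String> sorted = pairs.sortByKey(true);

long n = sorted.count();
final long rank = (long) (0.66 * (n - 1));  // row index of the 66th percentile

// zipWithIndex() keeps the data distributed; the filter ships back a single row.
List<Tuple2<Tuple2<Double, String>, Long>> hit =
    sorted.zipWithIndex()
          .filter(new Function<Tuple2<Tuple2<Double, String>, Long>, Boolean>() {
            public Boolean call(Tuple2<Tuple2<Double, String>, Long> t) {
              return t._2() == rank;
            }
          })
          .collect();

double p66 = hit.get(0)._1()._1();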

Can anyone shed some light on this problem?

Regards
