The best approach I've found to calculate percentiles in Spark is to leverage Spark SQL. If you use the Hive query language support, you can use the Hive UDAFs for percentiles (available as of Spark 1.2).
Something like this (note: syntax not guaranteed to run, but it should give you the gist of what you need to do):

    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    JavaHiveContext hsc = new JavaHiveContext(sc);

    // Get your data into a SchemaRDD and register the table, then query it
    String hql = "SELECT FIELD1, FIELD2, percentile(FIELD3, 0.05) AS ptile5 "
               + "FROM TABLE_NAME GROUP BY FIELD1, FIELD2";
    JavaSchemaRDD result = hsc.hql(hql);
    List<Row> grp = result.collect();
    for (Row row : grp) {
        // Columns 0 and 1 are the group keys; column 2 is the percentile
        // Do something with the results
    }

Curt

From: SiMaYunRui <myl...@hotmail.com>
Date: Sunday, February 15, 2015 at 10:37 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Percentile example

hello,

I am a newbie to Spark and am trying to figure out how to get a percentile over a big data set. I googled this topic but didn't find any useful code examples or explanations. It seems I can use the sortByKey transformation to get my data set in order, but I'm not sure how I can then get the value of, for example, the 66th percentile. Should I use take() to pick up that value? I don't believe any machine can load my whole data set in memory, so there must be a more efficient approach.

Can anyone shed some light on this problem?

Regards
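On the sortByKey route from the original question: rather than take() or collect(), one common pattern (my own sketch, not from the Spark docs) is to zipWithIndex() the sorted RDD and pull back only the element at the target rank, so a single value returns to the driver. The nearest-rank index arithmetic itself is plain Java:

```java
public class PercentileIndex {
    // Nearest-rank method: 0-based index of the p-th percentile
    // in a sorted sequence of n elements.
    static long percentileIndex(long n, double p) {
        return (long) Math.ceil(p / 100.0 * n) - 1;
    }

    public static void main(String[] args) {
        double[] sorted = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
        long idx = percentileIndex(sorted.length, 66.0);
        System.out.println(idx);               // 6
        System.out.println(sorted[(int) idx]); // 70.0
    }
}
```

On an RDD, the analogous (untested) sketch would be something like sortByKey().zipWithIndex(), then filter for the pair whose index equals percentileIndex(rdd.count(), 66), so only that one element is shipped back instead of the whole sorted set.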