Hyukjin Kwon resolved SPARK-25125.
----------------------------------
    Resolution: Incomplete

> Spark SQL percentile_approx takes longer than Hive version for large datasets
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-25125
>                 URL: https://issues.apache.org/jira/browse/SPARK-25125
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Mir Ali
>            Priority: Major
>              Labels: bulk-closed
>
> The percentile_approx function in Spark SQL takes much longer than the previous Hive-based implementation for large data sets (7B rows grouped into 200k buckets, with a percentile computed per bucket). Tested with Spark 2.3.1 vs Spark 2.1.0.
>
> The code below finishes in around 24 minutes on Spark 2.1.0; on Spark 2.3.1 it does not finish at all within 2 hours. I also tried different accuracy values (5000, 1000, 500): with the new version the timing does improve at smaller accuracy values, but the speedup is insignificant.
>
> Infrastructure used:
> AWS EMR -> Spark 2.1.0
> vs
> AWS EMR -> Spark 2.3.1
>
> spark-shell --conf spark.driver.memory=12g --conf spark.executor.memory=10g --conf spark.sql.shuffle.partitions=2000 --conf spark.default.parallelism=2000 --num-executors=75 --executor-cores=2
>
> {code:java}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types._
>
> // 7B rows, randomly assigned to ~200k grouping buckets
> val df = spark.range(7000000000L)
>   .withColumn("some_grouping_id", round(rand() * 200000L).cast(LongType))
> df.createOrReplaceTempView("tab")
>
> val percentile_query = """
>   select some_grouping_id,
>          percentile_approx(id, array(0, 0.25, 0.5, 0.75, 1))
>   from tab
>   group by some_grouping_id
> """
> spark.sql(percentile_query).collect()
> {code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
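
[Editor's note: a minimal sketch, not part of the original report, showing how the accuracy values the reporter mentions (5000, 1000, 500) would be passed. Spark SQL's percentile_approx accepts an optional third argument controlling approximation accuracy (default 10000); lower values trade precision for speed and memory per group. Runs in the same spark-shell session against the "tab" view defined above.]

{code:java}
// Same query as in the report, but with an explicit accuracy argument.
// Smaller accuracy => coarser quantile sketch per bucket, less work per group.
val tuned_query = """
  select some_grouping_id,
         percentile_approx(id, array(0, 0.25, 0.5, 0.75, 1), 1000)
  from tab
  group by some_grouping_id
"""
spark.sql(tuned_query).collect()
{code}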