[ https://issues.apache.org/jira/browse/SPARK-25125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16582383#comment-16582383 ]

Marco Gaido edited comment on SPARK-25125 at 8/16/18 1:07 PM:
--------------------------------------------------------------

I think this may be a duplicate of SPARK-24013. [~myali] could you please check
whether current master still has the issue?
If so, we can close this as a duplicate.
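
As a minimal sketch for such a re-test (assuming the same synthetic dataset as in
the description below; spark.time simply reports the wall-clock time of the
enclosed action, which makes runs on different Spark versions easy to compare):

{code:java}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Same shape as the reported workload: 7B rows, ~200k groups.
val df = spark.range(7000000000L)
  .withColumn("some_grouping_id", round(rand() * 200000L).cast(LongType))
df.createOrReplaceTempView("tab")

// spark.time prints how long the enclosed action took.
spark.time {
  spark.sql("""
    select some_grouping_id,
           percentile_approx(id, array(0, 0.25, 0.5, 0.75, 1))
    from tab
    group by some_grouping_id
  """).collect()
}
{code}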


was (Author: mgaido):
I think this may be a duplicate of SPARK-25125. [~myali] could you please check
whether current master still has the issue?
If so, we can close this as a duplicate.

> Spark SQL percentile_approx takes longer than Hive version for large datasets
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-25125
>                 URL: https://issues.apache.org/jira/browse/SPARK-25125
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Mir Ali
>            Priority: Major
>
> The percentile_approx function in Spark SQL takes much longer than the
> previous Hive implementation for large data sets (7B rows grouped into 200k
> buckets, with the percentile computed per bucket). Tested with Spark 2.3.1 vs
> Spark 2.1.0.
> The code below finishes in around 24 minutes on Spark 2.1.0; on Spark 2.3.1
> it does not finish at all, even after more than 2 hours. This was also tried
> with different accuracy values (5000, 1000, 500): on the new version the
> timing does improve with smaller datasets, but the speed difference remains
> insignificant.
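> As an illustration only (these exact calls are not in the original report),
> the accuracy values correspond to the optional third argument of
> percentile_approx:
> {code:java}
> // Sketch: a lower accuracy setting trades precision for speed.
> spark.sql("""
>   select some_grouping_id,
>          percentile_approx(id, array(0, 0.25, 0.5, 0.75, 1), 1000)
>   from tab
>   group by some_grouping_id
> """).collect()
> {code}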
>  
> Infrastructure used:
> AWS EMR -> Spark 2.1.0
> vs
> AWS EMR -> Spark 2.3.1
>  
> spark-shell --conf spark.driver.memory=12g --conf spark.executor.memory=10g 
> --conf spark.sql.shuffle.partitions=2000 --conf 
> spark.default.parallelism=2000 --num-executors=75 --executor-cores=2
> {code:java}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types._
>
> // 7B rows spread across ~200k random group ids
> val df = spark.range(7000000000L)
>   .withColumn("some_grouping_id", round(rand() * 200000L).cast(LongType))
> df.createOrReplaceTempView("tab")
>
> val percentile_query = """
>   select some_grouping_id,
>          percentile_approx(id, array(0, 0.25, 0.5, 0.75, 1))
>   from tab
>   group by some_grouping_id
> """
> spark.sql(percentile_query).collect()
> {code}
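> For comparison, a roughly equivalent DataFrame-API formulation (a sketch, not
> from the original report; Spark 2.x exposes percentile_approx only through
> SQL, so it goes through expr here):
> {code:java}
> import org.apache.spark.sql.functions.expr
>
> // The same aggregation without the temp view and SQL string.
> df.groupBy("some_grouping_id")
>   .agg(expr("percentile_approx(id, array(0, 0.25, 0.5, 0.75, 1))"))
>   .collect()
> {code}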


