[ 
https://issues.apache.org/jira/browse/SPARK-25125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mir Ali updated SPARK-25125:
----------------------------
    Description: 
The percentile_approx function in Spark SQL takes much longer than the previous 
Hive implementation for large data sets (7B rows grouped into 200k buckets, 
percentile is on each bucket). Tested with Spark 2.3.1 vs Spark 2.2.0.

The below code finishes in around 24 minutes on spark 2.2.0, on spark 2.3.1, 
this does not finish at all in more than 2 hours.

 

Infrastructure used:

AWS EMR 5.10.0 -> Spark 2.2.0

vs

AWS EMR 5.16.0 -> Spark 2.3.1

 

 
{code:java}
import org.apache.spark.sql.functions._ 
import org.apache.spark.sql.types._ 

val df=spark.range(7000000000L).withColumn("some_grouping_id", 
round(rand()*200000L).cast(LongType)) 
df.createOrReplaceTempView("tab")   
val percentile_query = """ select some_grouping_id, percentile_approx(id, 
array(0,0.25,0.5,0.75,1)) from tab group by some_grouping_id """ 

spark.sql(percentile_query).collect()
{code}
 

 

 

  was:
The percentile_approx function in Spark SQL takes much longer than the previous 
Hive implementation for large data sets (7B rows grouped into 200k buckets, 
percentile is on each bucket). Tested with Spark 2.3.1 vs Spark 2.2.0.

The below code finishes in around 24 minutes on spark 2.2.0, on spark 2.3.1, 
this does not finish at all in more than 2 hours.

 

Infrastructure used:

AWS EMR 5.10.0 -> Spark 2.2.0

vs

AWS EMR 5.16.0 -> Spark 2.3.1

 

 
{code:java}
import org.apache.spark.sql.functions._ 
import org.apache.spark.sql.types._ 

val df=spark.range(7000000000L).withColumn("some_grouping_id", 
round(rand()*195000L).cast(LongType)) df.createOrReplaceTempView("tab")   
val percentile_query = """ select some_grouping_id, percentile_approx(id, 
array(0,0.25,0.5,0.75,1)) from tab group by some_grouping_id """ 

spark.sql(percentile_query).collect()
{code}
 

 

 


> Spark SQL percentile_approx takes longer than Hive version for large datasets
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-25125
>                 URL: https://issues.apache.org/jira/browse/SPARK-25125
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Mir Ali
>            Priority: Major
>
> The percentile_approx function in Spark SQL takes much longer than the 
> previous Hive implementation for large data sets (7B rows grouped into 200k 
> buckets, percentile is on each bucket). Tested with Spark 2.3.1 vs Spark 
> 2.2.0.
> The below code finishes in around 24 minutes on spark 2.2.0, on spark 2.3.1, 
> this does not finish at all in more than 2 hours.
>  
> Infrastructure used:
> AWS EMR 5.10.0 -> Spark 2.2.0
> vs
> AWS EMR 5.16.0 -> Spark 2.3.1
>  
>  
> {code:java}
> import org.apache.spark.sql.functions._ 
> import org.apache.spark.sql.types._ 
> val df=spark.range(7000000000L).withColumn("some_grouping_id", 
> round(rand()*200000L).cast(LongType)) 
> df.createOrReplaceTempView("tab")   
> val percentile_query = """ select some_grouping_id, percentile_approx(id, 
> array(0,0.25,0.5,0.75,1)) from tab group by some_grouping_id """ 
> spark.sql(percentile_query).collect()
> {code}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to