[ 
https://issues.apache.org/jira/browse/SPARK-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15160256#comment-15160256
 ] 

Xiao Li commented on SPARK-13307:
---------------------------------

First, I am not sure if usage of broadcastjoin makes sense in this query, 
especially when your table size is huge. 

Second, are your queries written in SQL? or DataFrame APIs? Spark SQL does not 
provide broadcast hint for SQL users. If using DataFrame API, you can do 
something like 
{code}
df1.join(broadcast(df2), $"key" === $"key2", "leftsemi")
{code}

Third, I think performance regression is expected for avoiding OOM. 
SortMergeJoin is consuming less memory than ShuffleHashJoin. Thus, it might 
make more sense to choose SortMergeJoin as a default join type. 

> TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1
> ---------------------------------------------------------
>
>                 Key: SPARK-13307
>                 URL: https://issues.apache.org/jira/browse/SPARK-13307
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: JESSE CHEN
>
> Majority of the TPCDS queries ran faster in 1.6.0 than in 1.4.1, average 
> about 9% faster. There are a few degraded, and one that is definitely not 
> within error margin is query 66.
> Query 66 in 1.4.1: 699 seconds
> Query 66 in 1.6.0: 918 seconds
> 30% worse.
> Collected the physical plans from both versions - drastic difference maybe 
> partially from using Tungsten in 1.6, but anything else at play here?
> Please see plans here:
> https://ibm.box.com/spark-sql-q66-debug-160plan
> https://ibm.box.com/spark-sql-q66-debug-141plan



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to