[ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15401413#comment-15401413
 ] 

Sital Kedia commented on SPARK-16827:
-------------------------------------

That is not the case. There is no broadcast join involved; both 1.6 and 2.0 
use a shuffle (sort-merge) join.
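
One way to double-check this (a hedged PySpark sketch; assumes an existing 
SparkSession bound to {{spark}} and reuses the table names from the query in 
this issue) is to disable the broadcast threshold and print the physical plan:

```python
# Hedged sketch: rule out broadcast joins and inspect the physical plan.
# Assumes an existing SparkSession bound to `spark`; table and column
# names are the ones from this issue's query.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  # never broadcast

spark.sql("""
    SELECT userid
    FROM table1 a
    JOIN table2 b
      ON a.ds = '2016-07-15'
     AND b.ds = '2016-07-15'
     AND a.source_id = b.id
""").explain()
```

With the threshold disabled, the plan should show SortMergeJoin rather than 
BroadcastHashJoin, matching the plans pasted here.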

Plan in 1.6 -

{code}
Project [target_id#136L AS userid#115L]
+- SortMergeJoin [source_id#135L], [id#140L]
   :- Sort [source_id#135L ASC], false, 0
   :  +- TungstenExchange hashpartitioning(source_id#135L,800), None
   :     +- ConvertToUnsafe
   :        +- HiveTableScan [target_id#136L,source_id#135L], MetastoreRelation foo, table1, Some(a), [(ds#134 = 2016-07-15)]
   +- Sort [id#140L ASC], false, 0
      +- TungstenExchange hashpartitioning(id#140L,800), None
         +- ConvertToUnsafe
            +- HiveTableScan [id#140L], MetastoreRelation foo, table2, Some(b), [(ds#139 = 2016-07-15)]
{code}


Plan in 2.0 -

{code}
*Project [target_id#151L AS userid#111L]
+- *SortMergeJoin [source_id#150L], [id#155L], Inner
   :- *Sort [source_id#150L ASC], false, 0
   :  +- Exchange hashpartitioning(source_id#150L, 800)
   :     +- *Filter isnotnull(source_id#150L)
   :        +- HiveTableScan [source_id#150L, target_id#151L], MetastoreRelation foo, table1, a, [isnotnull(ds#149), (ds#149 = 2016-07-15)]
   +- *Sort [id#155L ASC], false, 0
      +- Exchange hashpartitioning(id#155L, 800)
         +- *Filter isnotnull(id#155L)
            +- HiveTableScan [id#155L], MetastoreRelation foo, table2, b, [isnotnull(ds#154), (ds#154 = 2016-07-15)]
{code}
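
Both plans shuffle each side of the join into 800 hash partitions (the 
Exchange/TungstenExchange hashpartitioning(..., 800) operators). As a rough 
plain-Python illustration of that routing contract (Spark actually uses a 
Murmur3-based hash, not Python's built-in hash()), equal join keys always 
land in the same shuffle partition on both sides:

```python
# Illustrative sketch only: Spark's hashpartitioning uses a Murmur3-based
# hash, not Python's built-in hash(). The point is the routing contract:
# equal keys map to the same shuffle partition on both sides of the join.

NUM_PARTITIONS = 800  # matches the 800 in both plans above

def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Route a join key to one of num_partitions shuffle partitions."""
    return hash(key) % num_partitions

# Equal keys route identically, which is what lets each reducer
# sort-merge its slice of a.source_id against its slice of b.id:
assert partition_for(42) == partition_for(42)
assert 0 <= partition_for(12345) < NUM_PARTITIONS
```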

> Query with Join produces excessive amount of shuffle data
> ---------------------------------------------------------
>
>                 Key: SPARK-16827
>                 URL: https://issues.apache.org/jira/browse/SPARK-16827
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 2.0.0
>            Reporter: Sital Kedia
>              Labels: performance
>
> One of our Hive jobs, which looks like this -
> {code}
>  SELECT userid
>  FROM table1 a
>  JOIN table2 b
>    ON a.ds = '2016-07-15'
>   AND b.ds = '2016-07-15'
>   AND a.source_id = b.id
> {code}
> After upgrading to Spark 2.0, the job is significantly slower. Digging a 
> little into it, we found that one of the stages produces an excessive amount 
> of shuffle data. Please note that this is a regression from Spark 1.6: Stage 2 
> of the job, which used to produce 32 KB of shuffle data with 1.6, now produces 
> more than 400 GB with Spark 2.0. We also tried turning off whole-stage code 
> generation, but that did not help.
> PS - Even though the intermediate shuffle data is huge, the job still 
> produces correct output.
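
For reference, the whole-stage code generation toggle mentioned in the issue 
can be flipped like this (a hedged sketch; assumes a Spark 2.0 SparkSession 
bound to {{spark}}, using the config key Spark 2.0 exposes):

```python
# Disable whole-stage code generation (Spark 2.0 config key).
# Assumes an existing SparkSession bound to `spark`.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
```

Per the report above, disabling it did not reduce the shuffle volume.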



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
