[ https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15401413#comment-15401413 ]
Sital Kedia edited comment on SPARK-16827 at 8/1/16 12:52 AM:
--------------------------------------------------------------
That is not the case. There is no broadcast join involved; it's using a shuffle join in both 1.6 and 2.0.

Plan in 1.6:
{code}
Project [target_id#136L AS userid#115L]
+- SortMergeJoin [source_id#135L], [id#140L]
   :- Sort [source_id#135L ASC], false, 0
   :  +- TungstenExchange hashpartitioning(source_id#135L,800), None
   :     +- ConvertToUnsafe
   :        +- HiveTableScan [target_id#136L,source_id#135L], MetastoreRelation foo, table1, Some(a), [(ds#134 = 2016-07-15)]
   +- Sort [id#140L ASC], false, 0
      +- TungstenExchange hashpartitioning(id#140L,800), None
         +- ConvertToUnsafe
            +- HiveTableScan [id#140L], MetastoreRelation foo, table2, Some(b), [(ds#139 = 2016-07-15)]
{code}

Plan in 2.0:
{code}
*Project [target_id#151L AS userid#111L]
+- *SortMergeJoin [source_id#150L], [id#155L], Inner
   :- *Sort [source_id#150L ASC], false, 0
   :  +- Exchange hashpartitioning(source_id#150L, 800)
   :     +- *Filter isnotnull(source_id#150L)
   :        +- HiveTableScan [source_id#150L, target_id#151L], MetastoreRelation foo, table1, a, [isnotnull(ds#149), (ds#149 = 2016-07-15)]
   +- *Sort [id#155L ASC], false, 0
      +- Exchange hashpartitioning(id#155L, 800)
         +- *Filter isnotnull(id#155L)
            +- HiveTableScan [id#155L], MetastoreRelation foo, table2, b, [isnotnull(ds#154), (ds#154 = 2016-07-15)]
{code}

was (Author: sitalke...@gmail.com): That is not the case. There is no broadcast join involved; it's using a shuffle join in both 1.6 and 2.0.

> Query with Join produces excessive amount of shuffle data
> ---------------------------------------------------------
>
>                 Key: SPARK-16827
>                 URL: https://issues.apache.org/jira/browse/SPARK-16827
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 2.0.0
>            Reporter: Sital Kedia
>              Labels: performance
>
> One of our Hive jobs, which looks like this:
> {code}
> SELECT userid
> FROM table1 a
> JOIN table2 b
> ON a.ds = '2016-07-15'
> AND b.ds = '2016-07-15'
> AND a.source_id = b.id
> {code}
> is significantly slower after the upgrade to Spark 2.0. Digging a little into it, we found
> that one of the stages produces an excessive amount of shuffle data. Please note that this
> is a regression from Spark 1.6: Stage 2 of the job, which used to produce 32KB of shuffle
> data with 1.6, now produces more than 400GB with Spark 2.0.
> We also tried turning off whole-stage code generation, but that did not help.
> PS - Even though the intermediate shuffle data size is huge, the job still produces accurate output.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
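For anyone trying to reproduce the comparison above, a diagnostic sketch along these lines might help (my own suggestion, not taken from the report; both configuration keys are standard Spark SQL settings in 2.0):

{code}
-- Disable broadcast joins so a shuffle (sort-merge) join is used,
-- matching the plans shown in the comment.
SET spark.sql.autoBroadcastJoinThreshold=-1;

-- The reporter mentions trying this; it turns off whole-stage code
-- generation (the operators marked with '*' in the 2.0 plan).
SET spark.sql.codegen.wholeStage=false;

-- Print the physical plan for the reported query without running it.
EXPLAIN
SELECT userid
FROM table1 a
JOIN table2 b
ON a.ds = '2016-07-15'
AND b.ds = '2016-07-15'
AND a.source_id = b.id;
{code}

Comparing the `EXPLAIN` output under each setting, together with the per-stage shuffle write metrics in the Spark UI, would show whether the 400GB of shuffle data is tied to codegen or to the join strategy itself.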