[ 
https://issues.apache.org/jira/browse/SPARK-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15389751#comment-15389751
 ] 

Khaled Hammouda commented on SPARK-2183:
----------------------------------------

[~etang] For us caching the dataframe before performing the join actually 
prevented the data from being loaded twice. However, it still would show as 
being loaded twice in the DAG, but the loading stage would be skipped the 
second time. Are you sure it's actually loading twice and not just what the DAG 
is telling you?

> Avoid loading/shuffling data twice in self-join query
> -----------------------------------------------------
>
>                 Key: SPARK-2183
>                 URL: https://issues.apache.org/jira/browse/SPARK-2183
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Reynold Xin
>            Priority: Minor
>
> {code}
> scala> hql("select * from src a join src b on (a.key=b.key)")
> res2: org.apache.spark.sql.SchemaRDD = 
> SchemaRDD[3] at RDD at SchemaRDD.scala:100
> == Query Plan ==
> Project [key#3:0,value#4:1,key#5:2,value#6:3]
>  HashJoin [key#3], [key#5], BuildRight
>   Exchange (HashPartitioning [key#3:0], 200)
>    HiveTableScan [key#3,value#4], (MetastoreRelation default, src, Some(a)), 
> None
>   Exchange (HashPartitioning [key#5:0], 200)
>    HiveTableScan [key#5,value#6], (MetastoreRelation default, src, Some(b)), 
> None
> {code}
> The optimal execution strategy for the above example is to load data only 
> once and repartition once. 
> If we want to hyper optimize it, we can also have a self join operator that 
> builds the hashmap and then simply traverses the hashmap ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to