[jira] [Commented] (SPARK-2183) Avoid loading/shuffling data twice in self-join query

Emma Tang (JIRA) Sat, 23 Jul 2016 00:02:03 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15390586#comment-15390586
 ]


Emma Tang commented on SPARK-2183:
----------------------------------

[~khussein] You're right, it does not load from source (s3) the second time. 
However, it is still trying to load the same table twice into memory at the 
same time. My job is structured like the following:

tiny_table2 = big_table JOIN tiny_table;
result = tiny_table2 JOIN big_table;

Spark is loading the big_table into memory twice at the same time to prepare 
for the joins. It tries to hold 2 big_tables in memory at the same time. If it 
joins with the tiny_table first, the size would be dramatically reduced, then 
joining on the big_table would be efficient. My big table is around 20x the 
size of the tiny table. 

Have you managed to avoid this?



> Avoid loading/shuffling data twice in self-join query
> -----------------------------------------------------
>
>                 Key: SPARK-2183
>                 URL: https://issues.apache.org/jira/browse/SPARK-2183
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Reynold Xin
>            Priority: Minor
>
> {code}
> scala> hql("select * from src a join src b on (a.key=b.key)")
> res2: org.apache.spark.sql.SchemaRDD = 
> SchemaRDD[3] at RDD at SchemaRDD.scala:100
> == Query Plan ==
> Project [key#3:0,value#4:1,key#5:2,value#6:3]
>  HashJoin [key#3], [key#5], BuildRight
>   Exchange (HashPartitioning [key#3:0], 200)
>    HiveTableScan [key#3,value#4], (MetastoreRelation default, src, Some(a)), 
> None
>   Exchange (HashPartitioning [key#5:0], 200)
>    HiveTableScan [key#5,value#6], (MetastoreRelation default, src, Some(b)), 
> None
> {code}
> The optimal execution strategy for the above example is to load data only 
> once and repartition once. 
> If we want to hyper optimize it, we can also have a self join operator that 
> builds the hashmap and then simply traverses the hashmap ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-2183) Avoid loading/shuffling data twice in self-join query

Reply via email to