[ https://issues.apache.org/jira/browse/SPARK-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176032#comment-15176032 ]
Bartosz Owczarek edited comment on SPARK-2183 at 3/2/16 5:32 PM:
-----------------------------------------------------------------

I can confirm that this still exists in Spark 1.5.2 :( We also encountered the same issue on Azure, where 1.5.2 is the available version; I would need to test it on 1.6 in standalone mode. It looks like the query optimizer (Catalyst) has no optimization to detect a self-join and always assumes that the data on the left (upstream) and right (downstream) sides comes from different sources. If you run such a query on a huge data set, it simply takes roughly twice as long just on the I/O, which might be 95% of the whole query execution time. Based on that, I wouldn't call it minor ;)

> Avoid loading/shuffling data twice in self-join query
> -----------------------------------------------------
>
>                 Key: SPARK-2183
>                 URL: https://issues.apache.org/jira/browse/SPARK-2183
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Reynold Xin
>            Priority: Minor
>
> {code}
> scala> hql("select * from src a join src b on (a.key=b.key)")
> res2: org.apache.spark.sql.SchemaRDD =
> SchemaRDD[3] at RDD at SchemaRDD.scala:100
> == Query Plan ==
> Project [key#3:0,value#4:1,key#5:2,value#6:3]
>  HashJoin [key#3], [key#5], BuildRight
>   Exchange (HashPartitioning [key#3:0], 200)
>    HiveTableScan [key#3,value#4], (MetastoreRelation default, src, Some(a)), None
>   Exchange (HashPartitioning [key#5:0], 200)
>    HiveTableScan [key#5,value#6], (MetastoreRelation default, src, Some(b)), None
> {code}
>
> The optimal execution strategy for the above example is to load the data only once and repartition once.
>
> If we want to hyper-optimize it, we can also have a self-join operator that builds the hashmap and then simply traverses the hashmap ...
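For reference, a minimal workaround sketch (not part of the original issue; it assumes a Spark 1.x HiveContext bound to {{sqlContext}}, and the temp-table name {{src_cached}} is hypothetical): materializing the table in memory once before the self-join should make both sides read the cached data instead of scanning the Hive table twice. The duplicate Exchange still happens, but the table I/O is paid only once.

{code}
// Hedged workaround sketch, assuming a Spark 1.x HiveContext named sqlContext.
// Cache the relation and force materialization once, so both sides of the
// self-join read from the in-memory columnar cache rather than re-scanning
// the Hive table.
val src = sqlContext.table("src").cache()
src.count()                          // force materialization of the cache
src.registerTempTable("src_cached")  // hypothetical temp-table name
sqlContext.sql(
  "select * from src_cached a join src_cached b on (a.key = b.key)")
{code}

This only works around the extra scan; the optimizer change requested in the issue (detecting the self-join and loading/repartitioning once) would make it unnecessary.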