[ https://issues.apache.org/jira/browse/SPARK-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176032#comment-15176032 ]
Bartosz Owczarek edited comment on SPARK-2183 at 3/2/16 5:32 PM:
-----------------------------------------------------------------

I can confirm that this still exists in Spark 1.5.2 :( We also encountered the same issue on Azure, where 1.5.2 is the available version; I would need to test it on 1.6 in standalone mode. It looks like the query optimizer (Catalyst) has no optimization to detect a self-join and always assumes that the data on the left (upstream) and right (downstream) sides comes from different sources. If you run such a query on a huge data set, it simply takes roughly twice as long just on the I/O, which might be 95% of the whole query execution time. Based on that, I wouldn't call it minor ;)

> Avoid loading/shuffling data twice in self-join query
> -----------------------------------------------------
>
>                 Key: SPARK-2183
>                 URL: https://issues.apache.org/jira/browse/SPARK-2183
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Reynold Xin
>            Priority: Minor
>
> {code}
> scala> hql("select * from src a join src b on (a.key=b.key)")
> res2: org.apache.spark.sql.SchemaRDD =
> SchemaRDD[3] at RDD at SchemaRDD.scala:100
> == Query Plan ==
> Project [key#3:0,value#4:1,key#5:2,value#6:3]
>  HashJoin [key#3], [key#5], BuildRight
>   Exchange (HashPartitioning [key#3:0], 200)
>    HiveTableScan [key#3,value#4], (MetastoreRelation default, src, Some(a)), None
>   Exchange (HashPartitioning [key#5:0], 200)
>    HiveTableScan [key#5,value#6], (MetastoreRelation default, src, Some(b)), None
> {code}
>
> The optimal execution strategy for the above example is to load the data only once and repartition once.
>
> If we want to hyper-optimize it, we can also have a self-join operator that builds the hashmap and then simply traverses the hashmap ...
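For reference, a minimal workaround sketch (not part of the original issue; it assumes a Spark 1.x HiveContext bound to {{sqlContext}}, and the temp-table name {{src_cached}} is hypothetical): materializing the table in memory once before the self-join should make both sides read the cached data instead of scanning the Hive table twice. The duplicate Exchange still happens, but the table I/O is paid only once.

{code}
// Hedged workaround sketch, assuming a Spark 1.x HiveContext named sqlContext.
// Cache the relation and force materialization once, so both sides of the
// self-join read from the in-memory columnar cache rather than re-scanning
// the Hive table.
val src = sqlContext.table("src").cache()
src.count()                          // force materialization of the cache
src.registerTempTable("src_cached")  // hypothetical temp-table name
sqlContext.sql(
  "select * from src_cached a join src_cached b on (a.key = b.key)")
{code}

This only works around the extra scan; the optimizer change requested in the issue (detecting the self-join and loading/repartitioning once) would make it unnecessary.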