[ https://issues.apache.org/jira/browse/SPARK-11471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15836547#comment-15836547 ]

Andrew Ash commented on SPARK-11471:
------------------------------------

[~yhuai] I'm interested in helping make progress on this -- it's causing pretty 
significant slowness on a SQL query I have.  Have you had any more thoughts on 
this since you first filed?

For my slow query, running the same SQL on another system (Teradata) completes 
in a few minutes whereas in Spark SQL it eventually fails after 20h+ of 
computation.  By breaking up the query into subqueries (forcing joins to happen 
in a different order) we can get the Spark SQL execution down to about 15min.

The query joins 8 tables together with a number of where clauses, and the join 
tree ends up doing a large number of seemingly "extra" shuffles.  At the bottom 
of the join tree, tables A and B are both shuffled to the same distribution and 
then joined.  Then when table C (the next table) comes in, it's shuffled to a 
new distribution, and the result of (A+B) is reshuffled to match that new 
distribution.  Potentially, table C could instead be shuffled to match the 
existing distribution of (A+B), so that (A+B) doesn't have to be reshuffled.
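To make the pattern concrete, here is a toy Python model of the "required
distribution" idea -- a join side is shuffled only if its current partitioning
doesn't already match the join keys.  This is purely illustrative (the keys
k1/k2 and the helper names are made up, and it is not Spark code), but it shows
why joining the next table on a fresh distribution forces a reshuffle of the
intermediate result, while reusing the existing distribution does not:

```python
def table(name):
    # an unpartitioned source relation
    return {"name": name, "partitioning": None}

def join(left, right, keys, plan):
    """Join two relations, recording a shuffle for each side whose
    current partitioning does not match the join keys."""
    for side in (left, right):
        if side["partitioning"] != set(keys):
            plan.append(f"shuffle {side['name']} by {sorted(keys)}")
    return {"name": f"({left['name']}+{right['name']})",
            "partitioning": set(keys)}

# Naive plan: the join with C picks a new distribution (k2),
# so the intermediate (A+B) must be reshuffled as well.
naive = []
ab = join(table("A"), table("B"), ["k1"], naive)
join(ab, table("C"), ["k2"], naive)

# Reuse plan: C is shuffled to match (A+B)'s existing distribution (k1),
# so (A+B) is never reshuffled.
reuse = []
ab2 = join(table("A"), table("B"), ["k1"], reuse)
join(ab2, table("C"), ["k1"], reuse)

print(naive)   # 4 shuffles, including one of the intermediate (A+B)
print(reuse)   # 3 shuffles; (A+B) stays where it is
```
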

The general pattern here seems to be that an N-way join requires N + (N-1) = 
2N-1 shuffles.  Potentially some of those shuffles could be eliminated with 
more intelligent join ordering and/or distribution selection.
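Under that accounting (one shuffle per input table plus one reshuffle of the
intermediate result per join), the arithmetic for the 8-table query works out
as below.  This is just illustrative back-of-the-envelope math, not a model of
Spark's actual planner:

```python
def naive_shuffles(n_tables):
    # N input shuffles plus (N-1) reshuffles of intermediate results,
    # per the 2N-1 estimate above
    return n_tables + (n_tables - 1)

def reuse_shuffles(n_tables):
    # if each new table is shuffled to match the running result's
    # distribution, only the N input shuffles remain
    return n_tables

for n in (3, 8):
    print(n, naive_shuffles(n), reuse_shuffles(n))
# the 8-table query: 15 shuffles naively vs. 8 with distribution reuse
```
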

What do you think?

> Improve the way that we plan shuffled join
> ------------------------------------------
>
>                 Key: SPARK-11471
>                 URL: https://issues.apache.org/jira/browse/SPARK-11471
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Yin Huai
>
> Right now, when adaptive query execution is enabled, in most cases we will 
> shuffle input tables for every join. However, once we finish our work on 
> https://issues.apache.org/jira/browse/SPARK-10665, we will be able to have a 
> global view of the input datasets of a stage. Then, we should be able to add 
> exchange coordinators after we get the entire physical plan (after the phase 
> in which we add Exchanges).
> I will try to fill in more information later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
