[ https://issues.apache.org/jira/browse/SPARK-11471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15836547#comment-15836547 ]
Andrew Ash commented on SPARK-11471: ------------------------------------ [~yhuai] I'm interested in helping make progress on this -- it's causing pretty significant slowness on a SQL query I have. Have you had any more thoughts on this since you first filed? For my slow query, running the same SQL on another system (Teradata) completes in a few minutes whereas in Spark SQL it eventually fails after 20h+ of computation. By breaking up the query into subqueries (forcing joins to happen in a different order) we can get the Spark SQL execution down to about 15min. The query joins 8 tables together with a number of where clauses, and the join tree ends up doing a large number of seemingly "extra" shuffles. At the bottom of the join tree, tables A and B are both shuffled to the same distribution and then joined. Then when table C comes in (the next table) it's shuffled to a new distribution and then the result of (A+B) is reshuffled to match the new table C distribution. Potentially instead table C would be shuffled to match the distribution of (A+B) so that (A+B) doesn't have to be reshuffled. The general pattern here seems to be that N-Way joins require N + N-1 = 2N-1 shuffles. Potentially some of those shuffles could be eliminated with more intelligent join ordering and/or distribution selection. What do you think? > Improve the way that we plan shuffled join > ------------------------------------------ > > Key: SPARK-11471 > URL: https://issues.apache.org/jira/browse/SPARK-11471 > Project: Spark > Issue Type: Sub-task > Components: SQL > Reporter: Yin Huai > > Right now, when adaptive query execution is enabled, in most of cases, we > will shuffle input tables for every join. However, once we finish our work of > https://issues.apache.org/jira/browse/SPARK-10665, we will be able to have a > global on the input datasets of a stage. Then, we should be able to add > exchange coordinators after we get the entire physical plan (after the phase > that we add Exchanges). > I will try to fill in more information later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org