[ https://issues.apache.org/jira/browse/SPARK-30528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan resolved SPARK-30528. --------------------------------- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27551 [https://github.com/apache/spark/pull/27551] > Potential performance regression with DPP subquery duplication > -------------------------------------------------------------- > > Key: SPARK-30528 > URL: https://issues.apache.org/jira/browse/SPARK-30528 > Project: Spark > Issue Type: Bug > Components: Optimizer > Affects Versions: 3.0.0 > Reporter: Mayur Bhosale > Assignee: Wei Xue > Priority: Major > Labels: performance > Fix For: 3.0.0 > > Attachments: cases.png, dup_subquery.png, plan.png > > > In DPP, heuristics to decide if DPP is going to benefit relies on the sizes > of the tables in the right subtree of the join. This might not be a correct > estimate especially when the detailed column level stats are not available. > {code:java} > // the pruning overhead is the total size in bytes of all scan relations > val overhead = > otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat > filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat > {code} > Also, DPP executes the entire right side of the join as a subquery because of > which multiple scans happen for the tables in the right subtree of the join. > This can cause issues when join is non-Broadcast Hash Join (BHJ) and reuse of > the subquery result does not happen. Also, I couldn’t figure out, why do the > results from the subquery get re-used only for BHJ? > > Consider a query, > {code:java} > SELECT * > FROM store_sales_partitioned > JOIN (SELECT * > FROM store_returns_partitioned, > date_dim > WHERE sr_returned_date_sk = d_date_sk) ret_date > ON ss_sold_date_sk = d_date_sk > WHERE d_fy_quarter_seq > 0 > {code} > DPP will kick-in for both the join. (Please check the image plan.png attached > below for the plan) > Some of the observations - > * Based on heuristics, DPP would go ahead with pruning if the cost of > scanning the tables in the right sub-tree of the join is less than the > benefit due to pruning. This is due to the reason that multiple scans will be > needed for an SMJ. But heuristics simply checks if the benefits offset the > cost of multiple scans and do not take into consideration other operations > like Join, etc in the right subtree which can be quite expensive. This issue > will be particularly prominent when detailed column level stats are not > available. In the example above, a decision that pruningHasBenefit was made > on the basis of sizes of the tables store_returns_partitioned and date_dim > but did not take into consideration the join between them before the join > happens with the store_sales_partitioned table. > * Multiple scans are needed when the join is SMJ as the reuse of the > exchanges does not happen. This is because Aggregate gets added on top of the > right subtree to be executed as a subquery in order to prune only required > columns. Here, scanning all the columns as the right subtree of the join > would, and reusing the same exchange might be more helpful as it avoids > duplicate scans. > This was just a representative example, but in-general for cases such as in > the image cases.png below, DPP can cause performance issues. > > Also, for the cases when there are multiple DPP compatible join conditions in > the same join, the entire right subtree of the join would be executed as a > subquery that many times. Consider an example, > {code:java} > SELECT * > FROM partitionedtable > JOIN nonpartitionedtable > ON partcol1 = col1 > AND partcol2 = col2 > WHERE nonpartitionedtable.id > 0 > {code} > Here the right subtree of the join (scan of table nonpartitionedtable) would > be executed twice as a subquery, once each for the every join condition. > These two subqueries should be aggregated and executed only once as they are > almost the same apart from the columns that they prune. Check the image > dup_subquery.png attached below for the details. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org