[jira] [Resolved] (SPARK-30528) Potential performance regression with DPP subquery duplication

Wenchen Fan (Jira) Thu, 13 Feb 2020 03:45:11 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-30528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Wenchen Fan resolved SPARK-30528.
---------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 27551
[https://github.com/apache/spark/pull/27551]

> Potential performance regression with DPP subquery duplication
> --------------------------------------------------------------
>
>                 Key: SPARK-30528
>                 URL: https://issues.apache.org/jira/browse/SPARK-30528
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer
>    Affects Versions: 3.0.0
>            Reporter: Mayur Bhosale
>            Assignee: Wei Xue
>            Priority: Major
>              Labels: performance
>             Fix For: 3.0.0
>
>         Attachments: cases.png, dup_subquery.png, plan.png
>
>
> In DPP, heuristics to decide if DPP is going to benefit relies on the sizes 
> of the tables in the right subtree of the join. This might not be a correct 
> estimate especially when the detailed column level stats are not available.
> {code:java}
>     // the pruning overhead is the total size in bytes of all scan relations
>     val overhead = 
> otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
>     filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
> {code}
> Also, DPP executes the entire right side of the join as a subquery because of 
> which multiple scans happen for the tables in the right subtree of the join. 
> This can cause issues when join is non-Broadcast Hash Join (BHJ) and reuse of 
> the subquery result does not happen. Also, I couldn’t figure out, why do the 
> results from the subquery get re-used only for BHJ?
>  
> Consider a query,
> {code:java}
> SELECT * 
> FROM   store_sales_partitioned 
>        JOIN (SELECT * 
>              FROM   store_returns_partitioned, 
>                     date_dim 
>              WHERE  sr_returned_date_sk = d_date_sk) ret_date 
>          ON ss_sold_date_sk = d_date_sk 
> WHERE  d_fy_quarter_seq > 0 
> {code}
> DPP will kick-in for both the join. (Please check the image plan.png attached 
> below for the plan)
> Some of the observations -
>  * Based on heuristics, DPP would go ahead with pruning if the cost of 
> scanning the tables in the right sub-tree of the join is less than the 
> benefit due to pruning. This is due to the reason that multiple scans will be 
> needed for an SMJ. But heuristics simply checks if the benefits offset the 
> cost of multiple scans and do not take into consideration other operations 
> like Join, etc in the right subtree which can be quite expensive. This issue 
> will be particularly prominent when detailed column level stats are not 
> available. In the example above, a decision that pruningHasBenefit was made 
> on the basis of sizes of the tables store_returns_partitioned and date_dim 
> but did not take into consideration the join between them before the join 
> happens with the store_sales_partitioned table.
>  * Multiple scans are needed when the join is SMJ as the reuse of the 
> exchanges does not happen. This is because Aggregate gets added on top of the 
> right subtree to be executed as a subquery in order to prune only required 
> columns. Here, scanning all the columns as the right subtree of the join 
> would, and reusing the same exchange might be more helpful as it avoids 
> duplicate scans.
> This was just a representative example, but in-general for cases such as in 
> the image cases.png below, DPP can cause performance issues.
>  
> Also, for the cases when there are multiple DPP compatible join conditions in 
> the same join, the entire right subtree of the join would be executed as a 
> subquery that many times. Consider an example,
> {code:java}
> SELECT * 
> FROM   partitionedtable 
>        JOIN nonpartitionedtable 
>          ON partcol1 = col1 
>             AND partcol2 = col2 
> WHERE  nonpartitionedtable.id > 0 
> {code}
> Here the right subtree of the join (scan of table nonpartitionedtable) would 
> be executed twice as a subquery, once each for the every join condition. 
> These two subqueries should be aggregated and executed only once as they are 
> almost the same apart from the columns that they prune. Check the image 
> dup_subquery.png attached below for the details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-30528) Potential performance regression with DPP subquery duplication

Reply via email to