[ 
https://issues.apache.org/jira/browse/SPARK-30528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayur Bhosale updated SPARK-30528:
----------------------------------
    Description: 
In DPP, heuristics to decide if DPP is going to benefit relies on the sizes of 
the tables in the right subtree of the join. This might not be a correct 
estimate especially when the detailed column level stats are not available.
{code:java}
    // the pruning overhead is the total size in bytes of all scan relations
    val overhead = 
otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
    filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
{code}
Also, DPP executes the entire right side of the join as a subquery because of 
which multiple scans happen for the tables in the right subtree of the join. 
This can cause issues when join is non-Broadcast Hash Join (BHJ) and reuse of 
the subquery result does not happen. Also, I couldn’t figure out, why do the 
results from the subquery get re-used only for BHJ?

 

Consider a query,
{code:java}
SELECT * 
FROM   store_sales_partitioned 
       JOIN (SELECT * 
             FROM   store_returns_partitioned, 
                    date_dim 
             WHERE  sr_returned_date_sk = d_date_sk) ret_date 
         ON ss_sold_date_sk = d_date_sk 
WHERE  d_fy_quarter_seq > 0 
{code}
DPP will kick-in for both the join. (Please check the image attached below for 
the plan)

Some of the observations -
 * Based on heuristics, DPP would go ahead with pruning if the cost of scanning 
the tables in the right sub-tree of the join is less than the benefit due to 
pruning. This is due to the reason that multiple scans will be needed for an 
SMJ. But heuristics simply checks if the benefits offset the cost of multiple 
scans and do not take into consideration other operations like Join, etc in the 
right subtree which can be quite expensive. This issue will be particularly 
prominent when detailed column level stats are not available. In the example 
above, a decision that pruningHasBenefit was made on the basis of sizes of the 
tables store_returns_partitioned and date_dim but did not take into 
consideration the join between them before the join happens with the 
store_sales_partitioned table.

 * Multiple scans are needed when the join is SMJ as the reuse of the exchanges 
does not happen. This is because Aggregate gets added on top of the right 
subtree to be executed as a subquery in order to prune only required columns. 
Here, scanning all the columns as the right subtree of the join would, and 
reusing the same exchange might be more helpful as it avoids duplicate scans.

This was just a representative example, but in-general for cases such as in the 
image below, DPP can cause performance issues.

 

Also, for the cases when there are multiple DPP compatible join conditions in 
the same join, the entire right subtree would be executed as a subquery that 
many times. Consider an example,

  was:
In DPP, heuristics to decide if DPP is going to benefit relies on the sizes of 
the tables in the right subtree of the join. This might not be a correct 
estimate especially when the detailed column level stats are not available.
{code:java}
    // the pruning overhead is the total size in bytes of all scan relations
    val overhead = 
otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
    filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
{code}
Also, DPP executes the entire right side of the join as a subquery because of 
which multiple scans happen for the tables in the right subtree of the join. 
This can cause issues when join is non-Broadcast Hash Join (BHJ) and reuse of 
the subquery result does not happen. Also, I couldn’t figure out, why do the 
results from the subquery get re-used only for BHJ?

 

Consider a query,
{code:java}
SELECT * 
FROM   store_sales_partitioned 
       JOIN (SELECT * 
             FROM   store_returns_partitioned, 
                    date_dim 
             WHERE  sr_returned_date_sk = d_date_sk) ret_date 
         ON ss_sold_date_sk = d_date_sk 
WHERE  d_fy_quarter_seq > 0 
{code}
DPP will kick-in for both the join. (Please check the image attached below for 
the plan)

Some of the observations -
 * Based on heuristics, DPP would go ahead with pruning if the cost of scanning 
the tables in the right sub-tree of the join is less than the benefit due to 
pruning. This is due to the reason that multiple scans will be needed for an 
SMJ. But heuristics simply checks if the benefits offset the cost of multiple 
scans and do not take into consideration other operations like Join, etc in the 
right subtree which can be quite expensive. This issue will be particularly 
prominent when detailed column level stats are not available. In the example 
above, a decision that pruningHasBenefit was made on the basis of sizes of the 
tables store_returns_partitioned and date_dim but did not take into 
consideration the join between them before the join happens with the 
store_sales_partitioned table.

 * Multiple scans are needed when the join is SMJ as the reuse of the exchanges 
does not happen. This is because Aggregate gets added on top of the right 
subtree to be executed as a subquery in order to prune only required columns. 
Here, scanning all the columns as the right subtree of the join would, and 
reusing the same exchange might be more helpful as it avoids duplicate scans.

This was just a representative example, but in-general for cases such as in the 
image below, DPP can cause performance issues.

 


> DPP issues
> ----------
>
>                 Key: SPARK-30528
>                 URL: https://issues.apache.org/jira/browse/SPARK-30528
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer
>    Affects Versions: 3.0.0
>            Reporter: Mayur Bhosale
>            Priority: Major
>              Labels: performance
>         Attachments: cases.png, plan.png
>
>
> In DPP, heuristics to decide if DPP is going to benefit relies on the sizes 
> of the tables in the right subtree of the join. This might not be a correct 
> estimate especially when the detailed column level stats are not available.
> {code:java}
>     // the pruning overhead is the total size in bytes of all scan relations
>     val overhead = 
> otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
>     filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
> {code}
> Also, DPP executes the entire right side of the join as a subquery because of 
> which multiple scans happen for the tables in the right subtree of the join. 
> This can cause issues when join is non-Broadcast Hash Join (BHJ) and reuse of 
> the subquery result does not happen. Also, I couldn’t figure out, why do the 
> results from the subquery get re-used only for BHJ?
>  
> Consider a query,
> {code:java}
> SELECT * 
> FROM   store_sales_partitioned 
>        JOIN (SELECT * 
>              FROM   store_returns_partitioned, 
>                     date_dim 
>              WHERE  sr_returned_date_sk = d_date_sk) ret_date 
>          ON ss_sold_date_sk = d_date_sk 
> WHERE  d_fy_quarter_seq > 0 
> {code}
> DPP will kick-in for both the join. (Please check the image attached below 
> for the plan)
> Some of the observations -
>  * Based on heuristics, DPP would go ahead with pruning if the cost of 
> scanning the tables in the right sub-tree of the join is less than the 
> benefit due to pruning. This is due to the reason that multiple scans will be 
> needed for an SMJ. But heuristics simply checks if the benefits offset the 
> cost of multiple scans and do not take into consideration other operations 
> like Join, etc in the right subtree which can be quite expensive. This issue 
> will be particularly prominent when detailed column level stats are not 
> available. In the example above, a decision that pruningHasBenefit was made 
> on the basis of sizes of the tables store_returns_partitioned and date_dim 
> but did not take into consideration the join between them before the join 
> happens with the store_sales_partitioned table.
>  * Multiple scans are needed when the join is SMJ as the reuse of the 
> exchanges does not happen. This is because Aggregate gets added on top of the 
> right subtree to be executed as a subquery in order to prune only required 
> columns. Here, scanning all the columns as the right subtree of the join 
> would, and reusing the same exchange might be more helpful as it avoids 
> duplicate scans.
> This was just a representative example, but in-general for cases such as in 
> the image below, DPP can cause performance issues.
>  
> Also, for the cases when there are multiple DPP compatible join conditions in 
> the same join, the entire right subtree would be executed as a subquery that 
> many times. Consider an example,



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to