[ 
https://issues.apache.org/jira/browse/SPARK-23839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628080#comment-16628080
 ] 

Xiao Li commented on SPARK-23839:
---------------------------------

To implement CBO in planner, we need a major change in our planner. The 
stats-based JoinReorder rule is just the current workaround before we doing the 
actual cost-based optimizer. 

> consider bucket join in cost-based JoinReorder rule
> ---------------------------------------------------
>
>                 Key: SPARK-23839
>                 URL: https://issues.apache.org/jira/browse/SPARK-23839
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Xiaoju Wu
>            Priority: Minor
>
> Since spark 2.2, the cost-based JoinReorder rule is implemented and in Spark 
> 2.3 released, it is improved with histogram. While it doesn't take the cost 
> of the different join implementations. For example:
> TableA JOIN TableB JOIN TableC
> TableA  will output 10,000 rows after filter and projection. 
> TableB  will output 10,000 rows after filter and projection. 
> TableC  will output 8,000 rows after filter and projection. 
> The current JoinReorder rule will possibly optimize the plan to join TableC 
> with TableA firstly and then TableB. But if the TableA and TableB are bucket 
> tables and can be applied with BucketJoin, it could be a different story. 
>  
> Also, to support bucket join of more than 2 tables when table bucket number 
> is multiple of another (SPARK-17570), whether bucket join can take effect 
> depends on the result of JoinReorder. For example of "A join B join C" which 
> has bucket number like 8, 4, 12, JoinReorder rule should optimize the order 
> to "A join B join C“ to make the bucket join take effect instead of "C join A 
> join B". 
>  
> Based on current CBO JoinReorder, there are possibly 2 part to be changed:
>  # CostBasedJoinReorder rule is applied in optimizer phase while we do Join 
> selection in planner phase and bucket join optimization in EnsureRequirements 
> which is in preparation phase. Both are after optimizer. 
>  # Current statistics and join cost formula are based data selectivity and 
> cardinality, we need to add statistics for present the join method cost like 
> shuffle, sort, hash and etc. Also we need to add the statistics into the 
> formula to estimate the join cost. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to