[ https://issues.apache.org/jira/browse/SPARK-23839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16422026#comment-16422026 ]
Takeshi Yamamuro commented on SPARK-23839: ------------------------------------------ Is it not enough to add a simple optimizer rule for joining bucketing tables first and then reordering the other tables? I think the proposed idea needs big changes (and this is a design issue) in the optimizer, the planner, and the preparation phase. So, I feel the simpler, the better. > consider bucket join in cost-based JoinReorder rule > --------------------------------------------------- > > Key: SPARK-23839 > URL: https://issues.apache.org/jira/browse/SPARK-23839 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.3.0 > Reporter: Xiaoju Wu > Priority: Minor > > Since spark 2.2, the cost-based JoinReorder rule is implemented and in Spark > 2.3 released, it is improved with histogram. While it doesn't take the cost > of the different join implementations. For example: > TableA JOIN TableB JOIN TableC > TableA will output 10,000 rows after filter and projection. > TableB will output 10,000 rows after filter and projection. > TableC will output 8,000 rows after filter and projection. > The current JoinReorder rule will possibly optimize the plan to join TableC > with TableA firstly and then TableB. But if the TableA and TableB are bucket > tables and can be applied with BucketJoin, it could be a different story. > > Also, to support bucket join of more than 2 tables when table bucket number > is multiple of another (SPARK-17570), whether bucket join can take effect > depends on the result of JoinReorder. For example of "a join b join c" which > has bucket number like 8, 12, 4, JoinReorder rule should optimize the order > to "c join a join b“ to make the bucket join take effect. > > Based on current CBO JoinReorder, there are possibly 2 part to be changed: > # CostBasedJoinReorder rule is applied in optimizer phase while we do Join > selection in planner phase and bucket join optimization in EnsureRequirements > which is in preparation phase. Both are after optimizer. > # Current statistics and join cost formula are based data selectivity and > cardinality, we need to add statistics for present the join method cost like > shuffle, sort, hash and etc. Also we need to add the statistics into the > formula to estimate the join cost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org