GitHub user wzhfy opened a pull request: https://github.com/apache/spark/pull/17138
[SPARK-17080] [SQL] join reorder

## What changes were proposed in this pull request?

Reorder the joins using a dynamic programming algorithm (Selinger paper):
First we put all items (basic joined nodes) into level 1, then we build all two-way joins at level 2 from plans at level 1 (single items), then build all 3-way joins from plans at previous levels (two-way joins and single items), then 4-way joins, and so on, until we build all n-way joins and pick the best plan among them.

When building m-way joins, we keep only the best plan (the one with the lowest cost) for each set of m items. E.g., for 3-way joins, we keep only the best plan for items {A, B, C} among plans (A J B) J C, (A J C) J B and (B J C) J A.

Thus, the plans maintained for each level when reordering four items A, B, C, D are as follows:
```
level 1: p({A}), p({B}), p({C}), p({D})
level 2: p({A, B}), p({A, C}), p({A, D}), p({B, C}), p({B, D}), p({C, D})
level 3: p({A, B, C}), p({A, B, D}), p({A, C, D}), p({B, C, D})
level 4: p({A, B, C, D})
```
where p({A, B, C, D}) is the final output plan.

For cost evaluation, since physical costs for operators are not available currently, we use cardinalities and sizes to compute costs.

## How was this patch tested?

add test cases

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wzhfy/spark joinReorder

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17138.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17138

----
commit 4682da4e20327bcf78f979061b9e4366dda25363
Author: wangzhenhua <wangzhen...@huawei.com>
Date:   2017-03-01T08:45:13Z

    join reorder

commit f8b19a81a6a5451150afa618488307c057bde861
Author: wangzhenhua <wangzhen...@huawei.com>
Date:   2017-03-02T14:17:43Z

    add test cases

----
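
For readers who want to see the level-by-level enumeration from the pull request description in a concrete form, here is a minimal Python sketch. It is not the code in this patch (which is part of Spark's Scala optimizer). The names `Plan`, `join`, and `reorder`, the example cardinalities and selectivities, and the cost model (sum of estimated intermediate row counts) are all assumptions made for illustration; the description only says that costs are derived from cardinalities and sizes.

```python
from itertools import product
from typing import Dict, FrozenSet, List, NamedTuple


class Plan(NamedTuple):
    items: FrozenSet[str]  # set of base items covered by this plan
    tree: object           # nested tuples describing the join order
    rows: float            # estimated output cardinality
    cost: float            # accumulated cost (hypothetical model)


def join(left: Plan, right: Plan,
         selectivity: Dict[FrozenSet[str], float]) -> Plan:
    """Join two disjoint plans under an assumed, simplified cost model."""
    rows = left.rows * right.rows
    # Apply the selectivity of every predicate that spans both sides;
    # a missing entry means no predicate (cross join, selectivity 1.0).
    for a in left.items:
        for b in right.items:
            rows *= selectivity.get(frozenset((a, b)), 1.0)
    # Assumption: cost is the children's cost plus the rows produced here.
    cost = left.cost + right.cost + rows
    return Plan(left.items | right.items, (left.tree, right.tree), rows, cost)


def reorder(cardinality: Dict[str, float],
            selectivity: Dict[FrozenSet[str], float]
            ) -> List[Dict[FrozenSet[str], Plan]]:
    """Return the plans kept at each level; levels[k - 1] holds k-way joins."""
    # Level 1: one plan per single item, with zero accumulated cost.
    levels = [{
        frozenset([name]): Plan(frozenset([name]), name, rows, 0.0)
        for name, rows in cardinality.items()
    }]
    n = len(cardinality)
    for k in range(2, n + 1):
        best: Dict[FrozenSet[str], Plan] = {}
        # Combine an i-way plan with a (k - i)-way plan from earlier levels.
        for i in range(1, k // 2 + 1):
            pairs = product(levels[i - 1].values(), levels[k - i - 1].values())
            for left, right in pairs:
                if left.items & right.items:
                    continue  # the two sides must not share items
                candidate = join(left, right, selectivity)
                current = best.get(candidate.items)
                # Keep only the cheapest plan for each set of k items.
                if current is None or candidate.cost < current.cost:
                    best[candidate.items] = candidate
        levels.append(best)
    return levels


# Hypothetical statistics for four items A, B, C, D.
cardinality = {"A": 1000.0, "B": 100.0, "C": 10.0, "D": 5.0}
selectivity = {
    frozenset("AB"): 0.01,
    frozenset("BC"): 0.1,
    frozenset("CD"): 0.2,
}

levels = reorder(cardinality, selectivity)
for k, level in enumerate(levels, start=1):
    print(f"level {k}: {len(level)} plan(s)")

final = levels[-1][frozenset(cardinality)]
print("chosen join order:", final.tree, "cost:", final.cost)
```

With four items the sketch keeps 4, 6, 4 and 1 plans at levels 1 through 4, matching the listing in the description. This sketch does not exclude cross joins or restrict the shape of the plan tree; whether the patch does either is not stated in the description.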