2010YOUY01 opened a new issue, #17432: URL: https://github.com/apache/datafusion/issues/17432
### Is your feature request related to a problem or challenge? DataFusion has several join implementations like Nested Loop Join, Hash Join, etc. For a given join SQL query, there might be multiple applicable physical join types, and the DF planner/optimizer will pick a optimal one to execute. DF currently offers limited configurability. Allowing users to prefer or disable specific join types would improve the user experience, especially as more specialized join operators are on to the roadmap. (thanks to @jonathanc-n ) ### Example ``` > select * from t1 join t2 on t1.v1 = t2.v1; -- DF choose HJ by default > set datafusion.optimizer.prefer_nested_loop_join = true; > select * from t1 join t2 on t1.v1 = t2.v1; -- Now it's using NLJ ``` ### Describe the solution you'd like ### Reference solution from DuckDB and PostgreSQL pg is using configurations like `enable_hashjoin`, `enable_nestloop`. You can turn certain types on and off, I guess the planner will pick the optimal one from only enabled join types. reference: https://www.postgresql.org/docs/7.1/runtime-config.html DuckDB only has several configurations to guide the optimizer's decision, but it doesn't provide configurations to turn off certain join types. examples: `prefer_range_joins`, `merge_join_threshold` reference: https://duckdb.org/docs/stable/configuration/overview.html ### Idea to implement DataFusion now has one configuration option to toggle HJ and SMJ, when they're both available ``` datafusion.optimizer.prefer_hash_join | true | When set to true, the physical plan optimizer will prefer HashJoin over SortMergeJoin. HashJoin can work more efficiently than SortMergeJoin but consumes more memory ``` To make the configuration easier for all join types, two sets of options can be added - `enable_join_type`: If `true`, this join type will be considered by the optimizer, otherwise it will be always disabled. If there are no enabled join types to finish the query, an error will be return. - `prefer_join_type`: If a certain join type is both enabled, and possible to use for a certain join query, the preferred join type will be prioritized. If there are multiple preferred join types, the optimizer will pick the best one. #### Example Available options (the implementation shouldn't use acronym, it's for simplicity here): - `enable_nlj` - `enable_smj` - `enable_hj` - ...and other available join types - `prefer_nlj` - `prefer_smj` - `prefer_hj` - ...and other available join types Query: `select * from t1 join t2 on t1.v1 = t2.v1;` ---- #### Config 1 -- Enable all, no preference ``` enable_nlj = true enable_smj = true enable_hj = true prefer_nlj = false prefer_smj = false prefer_hj = false ``` (all enabled; no preference; optimizer uses default heuristic) -> HJ --- #### Config 2 -- Force NLJ ``` enable_nlj = true enable_smj = false enable_hj = false prefer_nlj = false prefer_smj = false prefer_hj = false ``` (only NLJ enabled) -> NLJ --- #### Config 3 -- Prefer SMJ, others allowed ``` enable_nlj = true enable_smj = true enable_hj = true prefer_nlj = false prefer_smj = true prefer_hj = false ``` (SMJ preferred if feasible; else fallback by cost) -> SMJ --- #### Config 4 -- Prefer SMJ and HJ equally ``` enable_nlj = true enable_smj = true enable_hj = true prefer_nlj = false prefer_smj = true prefer_hj = true ``` (SMJ & HJ both preferred; the optimizer by default do HJ > SMJ) -> HJ --- #### Config 5 -- Prefer disabled type (ignored) ``` enable_nlj = false enable_smj = true enable_hj = true prefer_nlj = true prefer_smj = false prefer_hj = false ``` (NLJ preference ignored since disabled; choose among enabled) -> HJ --- #### Config 6 -- No join type enabled ``` enable_nlj = false enable_smj = false enable_hj = false ``` (none enabled; cannot plan) -> ERROR ---- ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
