2010YOUY01 opened a new issue, #17432:
URL: https://github.com/apache/datafusion/issues/17432

   ### Is your feature request related to a problem or challenge?
   
   DataFusion has several join implementations like Nested Loop Join, Hash 
Join, etc. For a given join SQL query, there might be multiple applicable 
physical join types, and the DF planner/optimizer will pick a optimal one to 
execute.
   
   DF currently offers limited configurability. Allowing users to prefer or 
disable specific join types would improve the user experience, especially as 
more specialized join operators are on to the roadmap. (thanks to @jonathanc-n )
   
   ### Example
   ```
   > select * from t1 join t2 on t1.v1 = t2.v1;
   
   -- DF choose HJ by default
   
   > set datafusion.optimizer.prefer_nested_loop_join = true;
   > select * from t1 join t2 on t1.v1 = t2.v1;
   
   -- Now it's using NLJ
   ```
   
   ### Describe the solution you'd like
   
   ### Reference solution from DuckDB and PostgreSQL
   pg is using configurations like `enable_hashjoin`, `enable_nestloop`. You 
can turn certain types on and off, I guess the planner will pick the optimal 
one from only enabled join types.
   reference: https://www.postgresql.org/docs/7.1/runtime-config.html
   
   DuckDB only has several configurations to guide the optimizer's decision, 
but it doesn't provide configurations to turn off certain join types. examples: 
`prefer_range_joins`, `merge_join_threshold`
   reference: https://duckdb.org/docs/stable/configuration/overview.html
   
   ### Idea to implement
   
   DataFusion now has one configuration option to toggle HJ and SMJ, when 
they're both available
   ```
   datafusion.optimizer.prefer_hash_join | true | When set to true, the 
physical plan optimizer will prefer HashJoin over SortMergeJoin. HashJoin can 
work more efficiently than SortMergeJoin but consumes more memory
   ```
   
   To make the configuration easier for all join types, two sets of options can 
be added
   - `enable_join_type`: If `true`, this join type will be considered by the 
optimizer, otherwise it will be always disabled. If there are no enabled join 
types to finish the query, an error will be return.
   - `prefer_join_type`: If a certain join type is both enabled, and possible 
to use for a certain join query, the preferred join type will be prioritized. 
If there are multiple preferred join types, the optimizer will pick the best 
one.
   
   #### Example
   Available options (the implementation shouldn't use acronym, it's for 
simplicity here):
   - `enable_nlj`
   - `enable_smj`
   - `enable_hj`
   - ...and other available join types
   
   - `prefer_nlj`
   - `prefer_smj`
   - `prefer_hj`
   - ...and other available join types
   
   Query: `select * from t1 join t2 on t1.v1 = t2.v1;`
   
   ----
   #### Config 1 -- Enable all, no preference
   ```
   enable_nlj = true
   enable_smj = true
   enable_hj  = true
   
   prefer_nlj = false
   prefer_smj = false
   prefer_hj  = false
   ```
   (all enabled; no preference; optimizer uses default heuristic) -> HJ
   
   ---
   
   #### Config 2 -- Force NLJ
   ```
   enable_nlj = true
   enable_smj = false
   enable_hj  = false
   
   prefer_nlj = false
   prefer_smj = false
   prefer_hj  = false
   ```
   (only NLJ enabled) -> NLJ
   
   ---
   
   #### Config 3 -- Prefer SMJ, others allowed
   ```
   enable_nlj = true
   enable_smj = true
   enable_hj  = true
   
   prefer_nlj = false
   prefer_smj = true
   prefer_hj  = false
   ```
   (SMJ preferred if feasible; else fallback by cost) -> SMJ
   
   ---
   
   #### Config 4 -- Prefer SMJ and HJ equally
   ```
   enable_nlj = true
   enable_smj = true
   enable_hj  = true
   
   prefer_nlj = false
   prefer_smj = true
   prefer_hj  = true
   ```
   (SMJ & HJ both preferred; the optimizer by default do HJ > SMJ) -> HJ
   
   ---
   
   #### Config 5 -- Prefer disabled type (ignored)
   ```
   enable_nlj = false
   enable_smj = true
   enable_hj  = true
   
   prefer_nlj = true
   prefer_smj = false
   prefer_hj  = false
   ```
   (NLJ preference ignored since disabled; choose among enabled) -> HJ
   
   ---
   
   #### Config 6 -- No join type enabled
   ```
   enable_nlj = false
   enable_smj = false
   enable_hj  = false
   ```
   (none enabled; cannot plan) -> ERROR
   
   ----
   
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to