alamb opened a new issue, #9056:
URL: https://github.com/apache/arrow-datafusion/issues/9056

   ### Is your feature request related to a problem or challenge?
   
   
   TLDR is that it would be nice to more easily use DataFusion as an execution 
engine for engines like Spark (e.g. the comet project 
https://github.com/apache/arrow-datafusion-comet/pull/1), where operators 
directly take general expressions as join keys. 
   
   As @viirya said on https://github.com/apache/arrow-datafusion/pull/8991:
   
   
   Currently the join keys of join operators like `SortMergeJoin` are 
restricted to be `Column`. But it is common to use expressions (e.g., 
`l_col + 1 = r_col + 2`) rather than plain columns as join keys. From the 
query plan, DataFusion seems to add an additional `Project` under the join 
operator, which projects the expressions into columns, so the above join 
operators take join keys as columns.
   
   However, in other query engines, e.g., Spark, the query plan doesn't have 
the additional projection; the join operators directly take general 
expressions as join keys. (Note that adding an additional projection before 
the join in Spark means more data to be shuffled/sorted, which can be bad for 
performance.)
   
   That means we cannot delegate such join operators to DataFusion physical 
join operators, which require join keys to be columns.
   
   This patch tries to relax this join-key constraint of the physical join 
operators, so that we can construct a DataFusion physical join operator using 
general expressions as join keys.
   
   This patch doesn't change how DataFusion plans the join operators. I.e., 
DataFusion still plans a join operation using non-column join keys into 
projection + join operator with columns. (We probably can remove this 
additional projection later if it also adds additional cost to DataFusion. 
Currently I'm not sure if/how DataFusion plans partitioning for the join 
operators.)
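
   To make the distinction in the quoted description concrete, here is a 
minimal sketch in plain Python (not DataFusion code) of a hash join whose keys 
are arbitrary expressions evaluated per row, so no projection is needed 
beneath it. The function `hash_join`, the row-as-dict representation, and the 
sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

# Hypothetical toy model: a row is a dict, a key expression is a function of a row.
Row = Dict[str, Any]
KeyExpr = Callable[[Row], Any]


def hash_join(left: List[Row], right: List[Row],
              left_key: KeyExpr, right_key: KeyExpr) -> List[Row]:
    """Inner equi-join where each side's key is an arbitrary expression."""
    # Build phase: evaluate the left key expression once per left row.
    table: Dict[Any, List[Row]] = defaultdict(list)
    for l_row in left:
        table[left_key(l_row)].append(l_row)

    # Probe phase: evaluate the right key expression and look up matches.
    out: List[Row] = []
    for r_row in right:
        for l_row in table.get(right_key(r_row), []):
            out.append({**l_row, **r_row})
    return out


left = [{"l_col": 1}, {"l_col": 2}, {"l_col": 3}]
right = [{"r_col": 0}, {"r_col": 1}, {"r_col": 5}]

# Join condition: l_col + 1 = r_col + 2
joined = hash_join(left, right,
                   lambda row: row["l_col"] + 1,
                   lambda row: row["r_col"] + 2)
```

   In this toy version the key values exist only inside the operator; the 
input rows are not widened with extra computed columns before the join.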
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   We could potentially still require joins to take only columns, and require 
other parts of the system to insert `ProjectionExec`s.
   
   In fact, it seems like Substrait represents equi joins as columns (not 
expressions): 
https://substrait.io/relations/physical_relations/#hash-equijoin-properties
   
   It might be possible for engines like `comet` to insert `ProjectionExec` in 
the appropriate places in the plan using an optimizer pass.
   
   For example,
   ```
   HashJoinExec(exprs=(l_col + 1, r_col + 2))
   ```
   
   could be rewritten to
   ```
   HashJoinExec(exprs=(x, y))
     ProjectionExec(exprs=[l_col + 1 as "x", r_col + 2 as "y"])
   ```
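
   The rewrite above could be sketched as a small plan-rewriting function. The 
following is a toy model in plain Python, not DataFusion's API: the classes 
`Column`, `Literal`, `BinaryExpr`, `Scan`, `Projection`, and `HashJoin`, and 
the function `rewrite_join`, are hypothetical names chosen for illustration. 
It also assumes one projection per join input, and it omits passing through 
the input's other columns because the toy plan carries no schema.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class BinaryExpr:
    left: "Expr"
    op: str
    right: "Expr"


Expr = Union[Column, Literal, BinaryExpr]


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Projection:
    exprs: Tuple[Tuple[Expr, str], ...]  # (expression, alias) pairs
    child: "Plan"


@dataclass(frozen=True)
class HashJoin:
    left: "Plan"
    right: "Plan"
    on: Tuple[Expr, Expr]  # (left key, right key)


Plan = Union[Scan, Projection, HashJoin]


def _project_key(child: Plan, key: Expr, alias: str) -> Tuple[Plan, Expr]:
    """Wrap `child` in a Projection if `key` is not already a plain column."""
    if isinstance(key, Column):
        return child, key
    # A real pass would also pass through the child's existing columns.
    return Projection(((key, alias),), child), Column(alias)


def rewrite_join(join: HashJoin) -> HashJoin:
    """Rewrite expression join keys into projected columns named x and y."""
    left, left_key = _project_key(join.left, join.on[0], "x")
    right, right_key = _project_key(join.right, join.on[1], "y")
    return HashJoin(left, right, (left_key, right_key))


plan = HashJoin(
    Scan("l"),
    Scan("r"),
    (BinaryExpr(Column("l_col"), "+", Literal(1)),
     BinaryExpr(Column("r_col"), "+", Literal(2))),
)
rewritten = rewrite_join(plan)
```

   After the rewrite, the join in this sketch refers only to the columns `x` 
and `y`, and a join whose keys are already plain columns is returned unchanged.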
   
   ### Additional context
   
   _No response_

