adragomir opened a new issue, #10978:
URL: https://github.com/apache/datafusion/issues/10978

   ### Describe the bug
   
   We ran into problems with projections inside HashJoin. 
   
   Each schema in the join (left / right) has:
   
   * a single struct column 
   * and the join column (reference to a get_field inside the first column)
   
   The projection is `[0, 2]`: the struct column from the left and the struct 
column from the right.
   
   The join column is not part of the output. When the optimizer swaps the 
join order, the projection is rewritten as `[2, 0]`, but there is no column 
with index 2 in the output, because the output contains only the two struct 
columns.
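
To make the two index spaces concrete, here is a minimal sketch in plain Rust that does not use DataFusion. Schemas are modelled as lists of column names, and the names `l_struct`, `l_key`, `r_struct`, `r_key` as well as the helpers `join_schema` and `project` are hypothetical, introduced only for illustration.

```rust
/// Un-projected join schema: the left fields followed by the right fields.
fn join_schema<'a>(left: &[&'a str], right: &[&'a str]) -> Vec<&'a str> {
    left.iter().chain(right.iter()).copied().collect()
}

/// Apply a projection whose indices refer to `schema`.
/// Returns `None` if any index is out of range for that schema.
fn project<'a>(schema: &[&'a str], projection: &[usize]) -> Option<Vec<&'a str>> {
    projection.iter().map(|&i| schema.get(i).copied()).collect()
}
```

With `left = ["l_struct", "l_key"]` and `right = ["r_struct", "r_key"]`, `project(&join_schema(&left, &right), &[0, 2])` yields the two struct columns, while applying `[2, 0]` to that two-column result returns `None`, which mirrors the out-of-range index described above.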
   
   ### To Reproduce
   
   * Create two schemas with a single struct column `(key, value)`
   * Join on `key`
   * Request the two `value` fields
   
   ### Expected behavior
   
   The hash join optimization should work even when the join order is swapped 
(and the join is wrapped in a `ProjectionExec`).
   
   
   ### Additional context
   
   The [comment for 
`HashJoinExec::projection`](https://github.com/apache/datafusion/blob/ac161bba336d098eab46f666af4664de7e8cd29f/datafusion/physical-plan/src/joins/hash_join.rs#L318)
 says `The projection indices of the columns in the output schema of join`. 
However:
   
   * inside `try_new` it seems to be [checked against the join 
schema](https://github.com/apache/datafusion/blob/ac161bba336d098eab46f666af4664de7e8cd29f/datafusion/physical-plan/src/joins/hash_join.rs#L363)
   * inside `with_projection` it seems to be [checked against the 
output 
schema](https://github.com/apache/datafusion/blob/ac161bba336d098eab46f666af4664de7e8cd29f/datafusion/physical-plan/src/joins/hash_join.rs#L453)
   * it also seems to be treated as relative to the join schema [inside the 
`swap_join_projection` 
function](https://github.com/apache/datafusion/blob/ac161bba336d098eab46f666af4664de7e8cd29f/datafusion/core/src/physical_optimizer/join_selection.rs#L140),
 since that function uses the left and right schemas
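
The ambiguity in the list above can be illustrated with a small sketch of index remapping across a join swap. This is not the DataFusion implementation: `swap_projection` and `in_bounds` are hypothetical helpers, and the sketch assumes the indices refer to the join schema (left fields followed by right fields).

```rust
/// Remap indices that refer to `left ++ right` so that they refer to
/// `right ++ left` instead (the join schema after swapping the inputs).
fn swap_projection(left_len: usize, right_len: usize, projection: &[usize]) -> Vec<usize> {
    projection
        .iter()
        .map(|&i| {
            if i < left_len {
                // Left column: it now sits after all of the right columns.
                i + right_len
            } else {
                // Right column: it now sits at the front.
                i - left_len
            }
        })
        .collect()
}

/// True if every index is valid for a schema with `width` columns.
fn in_bounds(projection: &[usize], width: usize) -> bool {
    projection.iter().all(|&i| i < width)
}
```

For two inputs of two columns each, `swap_projection(2, 2, &[0, 2])` gives `[2, 0]`. That result is in bounds for the four-column join schema, but not for the two-column projected output, so the result depends on which schema a later consumer assumes.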
   
   I tried taking a stab at it, but it is unclear which schema the indices 
passed in `projection` are meant to be relative to. 
   For now, I am fixing it surgically when swapping the order: I rewrite 
the projection to be relative to the output schema when [wrapping the join 
with a 
`ProjectionExec`](https://github.com/apache/datafusion/blob/ac161bba336d098eab46f666af4664de7e8cd29f/datafusion/core/src/physical_optimizer/join_selection.rs#L196).
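
One way to express indices relative to the output schema is to look each wanted column up in the join's already-projected output. The sketch below is only an illustration of that idea in plain Rust, not the actual patch: `output_relative_indices` is a hypothetical helper and the column names are made up.

```rust
/// For each wanted column, find its position in the join's projected output.
/// Returns `None` if a wanted column is not present in that output.
fn output_relative_indices(projected_output: &[&str], wanted: &[&str]) -> Option<Vec<usize>> {
    wanted
        .iter()
        .map(|w| projected_output.iter().position(|c| c == w))
        .collect()
}
```

If the swapped join outputs `["l_struct", "r_struct"]`, asking for those same two columns yields `[0, 1]`, whereas asking for a join key that was projected away yields `None` instead of an out-of-range index.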


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

