yjshen opened a new issue, #2509:
URL: https://github.com/apache/arrow-datafusion/issues/2509

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   It would be necessary to support filters in the join operator, instead of a 
join operator followed by a filter, the necessity comes from two folds:
   
   1. semi-join that columns from one table would be removed after the join, 
which would make a filter that references both sides impossible. 
   2. filter out records earlier to reduce the need for constructing more 
records(batches).
   
   **Describe the solution you'd like**
   1. `pub type FilterOn = (Vec<Column>, Vec<Column>, Arc<dyn PhysicalExpr>)`; 
and `Option<FilterOn>` for JoinExec.
   2. generate record batch using arrays in the filter expr, but rebind the 
expr to point to the right columns of the newly generated batch.
   
   **Describe alternatives you've considered**
   1. `pub type FilterOn = Vec<(Column, Column, datafusion_expr::Operator)>;`
   2. Normalize each filter into two sides of a binary op.  like: `t1.a + t2.b 
> 100` to `t1.a > 100 - t2.b`. evaluates `a` , `100-b` separately as two 
columns and apply binary expr calculation logic.
   
   But the approach would be quite limited since it greatly limits the 
expressions that could be used in a join filter.
   
   **Additional context**
   
   Consider Part of TPC-DS query-95's SparkSQL plan as an example:
   ```
   +- SortMergeJoin [ws_order_number#251], [ws_order_number#285], Inner, NOT 
(ws_warehouse_sk#249 = ws_warehouse_sk#283)
      :- Sort [ws_order_number#251 ASC NULLS FIRST], false, 0
      :  +- CustomShuffleReader coalesced
      :     +- ShuffleQueryStage 4
      :        +- ReusedExchange [ws_warehouse_sk#249, ws_order_number#251], 
ArrowShuffleExchange hashpartitioning(ws_order_number#125, 200), true, [id=#226]
      +- Sort [ws_order_number#285 ASC NULLS FIRST], false, 0
         +- CustomShuffleReader coalesced
            +- ShuffleQueryStage 5
               +- ReusedExchange [ws_warehouse_sk#283, ws_order_number#285], 
ArrowShuffleExchange hashpartitioning(ws_order_number#125, 200), true, [id=#226]
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to