yjshen opened a new issue, #2509: URL: https://github.com/apache/arrow-datafusion/issues/2509
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.** It would be necessary to support filters in the join operator, instead of a join operator followed by a filter, the necessity comes from two folds: 1. semi-join that columns from one table would be removed after the join, which would make a filter that references both sides impossible. 2. filter out records earlier to reduce the need for constructing more records(batches). **Describe the solution you'd like** 1. `pub type FilterOn = (Vec<Column>, Vec<Column>, Arc<dyn PhysicalExpr>)`; and `Option<FilterOn>` for JoinExec. 2. generate record batch using arrays in the filter expr, but rebind the expr to point to the right columns of the newly generated batch. **Describe alternatives you've considered** 1. `pub type FilterOn = Vec<(Column, Column, datafusion_expr::Operator)>;` 2. Normalize each filter into two sides of a binary op. like: `t1.a + t2.b > 100` to `t1.a > 100 - t2.b`. evaluates `a` , `100-b` separately as two columns and apply binary expr calculation logic. But the approach would be quite limited since it greatly limits the expressions that could be used in a join filter. **Additional context** Consider Part of TPC-DS query-95's SparkSQL plan as an example: ``` +- SortMergeJoin [ws_order_number#251], [ws_order_number#285], Inner, NOT (ws_warehouse_sk#249 = ws_warehouse_sk#283) :- Sort [ws_order_number#251 ASC NULLS FIRST], false, 0 : +- CustomShuffleReader coalesced : +- ShuffleQueryStage 4 : +- ReusedExchange [ws_warehouse_sk#249, ws_order_number#251], ArrowShuffleExchange hashpartitioning(ws_order_number#125, 200), true, [id=#226] +- Sort [ws_order_number#285 ASC NULLS FIRST], false, 0 +- CustomShuffleReader coalesced +- ShuffleQueryStage 5 +- ReusedExchange [ws_warehouse_sk#283, ws_order_number#285], ArrowShuffleExchange hashpartitioning(ws_order_number#125, 200), true, [id=#226] ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org