UBarney commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2173634499
##########
datafusion/physical-plan/src/joins/utils.rs:
##########
@@ -843,24 +844,56 @@ pub(crate) fn apply_join_filter_to_indices(
     probe_indices: UInt32Array,
     filter: &JoinFilter,
     build_side: JoinSide,
+    max_intermediate_size: Option<usize>,
 ) -> Result<(UInt64Array, UInt32Array)> {
     if build_indices.is_empty() && probe_indices.is_empty() {
         return Ok((build_indices, probe_indices));
     };
-    let intermediate_batch = build_batch_from_indices(
-        filter.schema(),
-        build_input_buffer,
-        probe_batch,
-        &build_indices,
-        &probe_indices,
-        filter.column_indices(),
-        build_side,
-    )?;
-    let filter_result = filter
-        .expression()
-        .evaluate(&intermediate_batch)?
-        .into_array(intermediate_batch.num_rows())?;
+    let filter_result = if let Some(max_size) = max_intermediate_size {

Review Comment:
   > Can we enforce it before filtering, while calculating build/probe_indices args for this function (in NestedLoopJoinExec::build_join_indices)?

   I'll do that in the next PR. https://github.com/apache/datafusion/issues/16364#issuecomment-2975520489

   > Why batch_size enforcement should take place during filtering

   1. Although the "Process the Cartesian Product Incrementally" step is designed to limit the input size for `apply_join_filter_to_indices`, a single batch can still be very large (up to `left_table.num_rows() * N`). When the left table itself is large, this can lead to the creation of a large `record_batch`.
   2. Benchmarks indicate that joins execute faster with this enforcement in place. https://github.com/apache/datafusion/pull/16443#issuecomment-2993893069
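The idea behind enforcing the limit during filtering can be sketched in plain Rust. This is only an illustration under stated assumptions, not the code in the PR: `filter_indices_chunked` and its `predicate` closure are hypothetical names, plain slices stand in for `UInt64Array`/`UInt32Array`, and the closure stands in for building an intermediate batch and evaluating the join filter on it. The point it shows is that no more than `max_intermediate_size` candidate pairs are materialized per evaluation.

```rust
/// Hypothetical sketch: evaluate a join filter over candidate index pairs in
/// bounded chunks, so that no single intermediate "batch" exceeds the limit.
/// `predicate` stands in for "build the intermediate batch and evaluate the
/// filter expression"; it returns one boolean per input pair.
fn filter_indices_chunked<F>(
    build_indices: &[u64],
    probe_indices: &[u32],
    max_intermediate_size: Option<usize>,
    mut predicate: F,
) -> (Vec<u64>, Vec<u32>)
where
    F: FnMut(&[u64], &[u32]) -> Vec<bool>,
{
    assert_eq!(build_indices.len(), probe_indices.len());
    let total = build_indices.len();

    // `None` means "no limit": evaluate all pairs in a single pass.
    // A limit of zero is treated as one to guarantee progress.
    let chunk = max_intermediate_size.unwrap_or(total).max(1);

    let mut kept_build = Vec::new();
    let mut kept_probe = Vec::new();
    let mut start = 0;
    while start < total {
        let end = (start + chunk).min(total);
        // Only `end - start` pairs are materialized for this evaluation.
        let mask = predicate(&build_indices[start..end], &probe_indices[start..end]);
        for (offset, keep) in mask.into_iter().enumerate() {
            if keep {
                kept_build.push(build_indices[start + offset]);
                kept_probe.push(probe_indices[start + offset]);
            }
        }
        start = end;
    }
    (kept_build, kept_probe)
}
```

With five candidate pairs and a limit of `Some(2)`, the closure is called three times and never sees more than two pairs at once; passing `None` reproduces the single-pass behavior.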
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org