UBarney commented on code in PR #16443:
URL: https://github.com/apache/datafusion/pull/16443#discussion_r2173634499


##########
datafusion/physical-plan/src/joins/utils.rs:
##########
@@ -843,24 +844,56 @@ pub(crate) fn apply_join_filter_to_indices(
     probe_indices: UInt32Array,
     filter: &JoinFilter,
     build_side: JoinSide,
+    max_intermediate_size: Option<usize>,
 ) -> Result<(UInt64Array, UInt32Array)> {
     if build_indices.is_empty() && probe_indices.is_empty() {
         return Ok((build_indices, probe_indices));
     };
 
-    let intermediate_batch = build_batch_from_indices(
-        filter.schema(),
-        build_input_buffer,
-        probe_batch,
-        &build_indices,
-        &probe_indices,
-        filter.column_indices(),
-        build_side,
-    )?;
-    let filter_result = filter
-        .expression()
-        .evaluate(&intermediate_batch)?
-        .into_array(intermediate_batch.num_rows())?;
+    let filter_result = if let Some(max_size) = max_intermediate_size {

Review Comment:
   >Can we enforce it before filtering, while calculating build/probe_indices 
args for this function (in NestedLoopJoinExec::build_join_indices)?
   
   I'll do that in the next PR: 
https://github.com/apache/datafusion/issues/16364#issuecomment-2975520489
   
   > Why batch_size enforcement should take place during filtering
   
   1. Although the "Process the Cartesian Product Incrementally" step is 
designed to limit the input size for `apply_join_filter_to_indices`, a single 
batch can still be very large (up to `left_table.num_rows() * N` rows). 
When the left table itself is large, this can produce a very large 
intermediate `RecordBatch`.
   2. Benchmarks indicate that joins run faster with this enforcement in 
place: 
https://github.com/apache/datafusion/pull/16443#issuecomment-2993893069
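
   To illustrate point 1, here is a minimal sketch of the idea, not the code in this PR. It uses plain slices instead of Arrow arrays, and `filter_indices_chunked` and its `predicate` closure are hypothetical stand-ins for the intermediate-batch construction and filter evaluation. It only shows that capping the chunk size bounds how many rows are considered at once without changing the result:

```rust
/// Hypothetical stand-in for the filter step: keeps the (build, probe) index
/// pairs for which `predicate` holds, looking at no more than `max_chunk`
/// pairs at a time. `None` means "no limit".
fn filter_indices_chunked<F>(
    build_indices: &[u64],
    probe_indices: &[u32],
    max_chunk: Option<usize>,
    predicate: F,
) -> (Vec<u64>, Vec<u32>)
where
    F: Fn(u64, u32) -> bool,
{
    assert_eq!(build_indices.len(), probe_indices.len());
    let total = build_indices.len();
    // Without a limit, treat the whole input as one chunk. The `max(1)`
    // guards against a zero chunk size, which `chunks` would reject.
    let chunk = max_chunk.unwrap_or(total).max(1);

    let mut kept_build = Vec::new();
    let mut kept_probe = Vec::new();
    for (b_chunk, p_chunk) in build_indices
        .chunks(chunk)
        .zip(probe_indices.chunks(chunk))
    {
        // In a real join, this is where an intermediate batch of at most
        // `chunk` rows would be built and the filter expression evaluated.
        for (&b, &p) in b_chunk.iter().zip(p_chunk.iter()) {
            if predicate(b, p) {
                kept_build.push(b);
                kept_probe.push(p);
            }
        }
    }
    (kept_build, kept_probe)
}
```

   Calling it with `Some(4)` and with `None` on the same index pairs returns the same surviving pairs; only the peak size of the intermediate data differs.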



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

