kosiew commented on code in PR #22979:
URL: https://github.com/apache/datafusion/pull/22979#discussion_r3466404016


##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -302,41 +302,50 @@ impl PhysicalExpr for BinaryExpr {
                 let rhs = self.right.evaluate(batch)?;
                 return Ok(rhs);
             }
-            ShortCircuitStrategy::PreSelection(selection) => {
-                // The function `evaluate_selection` was not called for 
filtering and calculation,
-                // as it takes into account cases where the selection contains 
null values.
-                let batch = filter_record_batch(batch, selection)?;
-                let right_ret = self.right.evaluate(&batch)?;
+            ShortCircuitStrategy::PreSelection { mask, fill_value } => {
+                // `mask` selects the rows whose result depends on the RHS; the
+                // unselected rows are all `fill_value` (see 
`ShortCircuitStrategy`).
+                //
+                // Use `filter_record_batch` directly because 
`evaluate_selection`
+                // scatters the RHS back to the original batch length.
+                let selection_batch = filter_record_batch(batch, &mask)?;
+                let right_ret = self.right.evaluate(&selection_batch)?;
 
                 match &right_ret {
                     ColumnarValue::Array(array) => {
-                        // When the array on the right is all true or all 
false, skip the scatter process
                         let boolean_array = array.as_boolean();
-                        if boolean_array.null_count() == 0 && 
!boolean_array.has_false() {
-                            return Ok(lhs);
-                        } else if boolean_array.null_count() == 0
-                            && !boolean_array.has_true()
-                        {
-                            // If the right-hand array is returned at this 
point,the lengths will be inconsistent;
-                            // returning a scalar can avoid this issue
-                            return 
Ok(ColumnarValue::Scalar(ScalarValue::Boolean(
-                                Some(false),
-                            )));
+                        // If the RHS is uniform on the selected rows, the 
whole
+                        // expression collapses and no scatter is needed.
+                        if boolean_array.null_count() == 0 {
+                            let rhs_value = if !boolean_array.has_false() {

Review Comment:
   Small readability thought: the uniform RHS detection could be a bit flatter 
by encoding the two valid uniform cases directly.
   
   ```rust
   let rhs_value = match (boolean_array.has_true(), boolean_array.has_false()) {
       (true, false) => Some(true),
       (false, true) => Some(false),
       _ => None,
   };
   ```
   
   I think this makes the mixed case easier to spot at a glance and avoids the 
nested `if` / `else if`.



##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -961,62 +985,54 @@ fn check_short_circuit<'a>(
     ShortCircuitStrategy::None
 }
 
-/// Creates a new boolean array based on the evaluation of the right 
expression,
-/// but only for positions where the left_result is true.
+/// Collapses a pre-selected expression whose RHS is uniformly `rhs_value` 
across
+/// every selected row, avoiding a scatter:
+/// - when it equals `fill_value`, every row is `fill_value` (a scalar);
+/// - otherwise the selected rows already equal the RHS, which matches the LHS
+///   there, and the unselected rows are the LHS value too, so the result is 
`lhs`.
+fn uniform_pre_selection_result(
+    rhs_value: bool,
+    fill_value: bool,
+    lhs: ColumnarValue,
+) -> ColumnarValue {
+    if rhs_value == fill_value {
+        ColumnarValue::Scalar(ScalarValue::Boolean(Some(fill_value)))
+    } else {
+        lhs
+    }
+}
+
+/// Creates a boolean array by scattering compact RHS results into the 
positions
+/// selected by `mask`.
 ///
-/// This function is used for short-circuit evaluation optimization of logical 
AND operations:
-/// - When left_result has few true values, we only evaluate the right 
expression for those positions
-/// - Values are copied from right_array where left_result is true
-/// - All other positions are filled with false values
+/// This function is used for short-circuit evaluation optimization of logical 
AND/OR operations:
+/// - Only selected rows are evaluated on the RHS
+/// - Values are copied from `right_result` where `mask` is true
+/// - All other positions are filled with `fill_value` (`false` for AND, 
`true` for OR)
 ///
 /// # Parameters
-/// - `left_result` Boolean array with selection mask (typically from left 
side of AND)
+/// - `mask` Boolean array with the rows whose result depends on the RHS
 /// - `right_result` Result of evaluating right side of expression (only for 
selected positions)
+/// - `fill_value` The value for the unselected positions (`false` for AND, 
`true` for OR)
 ///
 /// # Returns
-/// A combined ColumnarValue with values from right_result where left_result 
is true
-///
-/// # Example
-///  Initial Data: { 1, 2, 3, 4, 5 }
-///  Left Evaluation
-///     (Condition: Equal to 2 or 3)
-///          ↓
-///  Filtered Data: {2, 3}
-///    Left Bitmap: { 0, 1, 1, 0, 0 }
-///          ↓
-///   Right Evaluation
-///     (Condition: Even numbers)
-///          ↓
-///  Right Data: { 2 }
-///    Right Bitmap: { 1, 0 }
-///          ↓
-///   Combine Results
-///  Final Bitmap: { 0, 1, 0, 0, 0 }
-///
-/// # Note
-/// Perhaps it would be better to modify `left_result` directly without 
creating a copy?
-/// In practice, `left_result` should have only one owner, so making changes 
should be safe.
-/// However, this is difficult to achieve under the immutable constraints of 
[`Arc`] and [`BooleanArray`].
+/// A combined `ColumnarValue` with the same length as `mask`.
 fn pre_selection_scatter(
-    left_result: &BooleanArray,
+    mask: &BooleanArray,
     right_result: Option<&BooleanArray>,
+    fill_value: bool,
 ) -> Result<ColumnarValue> {
-    let result_len = left_result.len();
+    let result_len = mask.len();
 
     let mut result_array_builder = BooleanArray::builder(result_len);
 
-    // keep track of current position we have in right boolean array
     let mut right_array_pos = 0;

Review Comment:
   `pre_selection_scatter` now has two very similar `SlicesIterator` loops, 
with the main difference being how selected rows are filled. It might be worth 
folding this into one loop and choosing the selected-row behavior inside it.
   
   That would keep the invariant in one place: gaps get `fill_value`, while 
selected rows get RHS values or nulls. It may also help avoid future drift 
between the AND and OR paths.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to