(datafusion) branch main updated: Apply pre-selection and computation skipping to short-circuit optimization (#15694)

alamb Thu, 17 Apr 2025 10:06:25 -0700

This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion.git



The following commit(s) were added to refs/heads/main by this push:
     new 4818966fa0 Apply pre-selection and computation skipping to 
short-circuit optimization (#15694)
4818966fa0 is described below

commit 4818966fa0950a6a471d1ea9e4b8c5808878207e
Author: LB7666 <[email protected]>
AuthorDate: Fri Apr 18 01:06:13 2025 +0800

    Apply pre-selection and computation skipping to short-circuit optimization 
(#15694)
    
    * Enhance short-circuit evaluation for binary expressions
    
    - Delay evaluation of the right-hand side (RHS) unless necessary.
    - Optimize short-circuiting for `Operator::And` and `Operator::Or` by 
checking LHS alone first.
    - Introduce `get_short_circuit_result` function to determine short-circuit 
conditions based on LHS and RHS.
    - Update tests to cover various short-circuit scenarios for both `AND` and 
`OR` operations.
    
    * refactor: rename test_check_short_circuit to 
test_get_short_circuit_result and update assertions
    
    - Renamed the test function for clarity.
    - Updated assertions to use get_short_circuit_result instead of 
check_short_circuit.
    - Added additional test cases for AND and OR operations with expected 
results.
    
    * fix: enhance short-circuit evaluation logic in get_short_circuit_result 
function for null
    
    - Updated AND and OR short-circuit conditions to only trigger when all 
values are either false or true, respectively, and there are no nulls in the 
array.
    - Adjusted test case to reflect the change in expected output.
    
    * feat: add debug logging for binary expression evaluation and 
short-circuit checks
    
    * fix: improve short-circuit evaluation logic in BinaryExpr to ensure RHS 
is only evaluated when necessary
    
    * fix: restrict short-circuit evaluation to logical operators in 
get_short_circuit_result function
    
    * add more println!("==> ");
    
    * fix: remove duplicate data type checks for left and right operands in 
BinaryExpr evaluation
    
    * feat: add debug prints for dictionary values and keys in binary 
expression tests
    
    * Tests pass
    
    * fix: remove redundant short-circuit evaluation check in BinaryExpr and 
enhance documentation for get_short_circuit_result
    
    * refactor: remove unnecessary debug prints and streamline short-circuit 
evaluation in BinaryExpr
    
    * test: enhance short-circuit evaluation tests for nullable and scalar 
values in BinaryExpr
    
    * add benchmark
    
    * refactor: improve short-circuit logic in BinaryExpr for logical operators
    
    - Renamed `arg` to `lhs` for clarity in the `get_short_circuit_result` 
function.
    - Updated handling of Boolean data types to return `None` for null values.
    - Simplified short-circuit checks for AND/OR operations by consolidating 
logic.
    - Enhanced readability and maintainability of the code by restructuring 
match statements.
    
    * refactor: enhance short-circuit evaluation strategy in BinaryExpr to 
optimize logical operations
    
    * Revert "refactor: enhance short-circuit evaluation strategy in BinaryExpr 
to optimize logical operations"
    
    This reverts commit a62df471e10ea4e53c7438c768ed273d1d344d22.
    
    * bench: add benchmark for OR operation with all false values in 
short-circuit evaluation
    
    * refactor: add ShortCircuitStrategy enum to optimize short-circuit 
evaluation in BinaryExpr
    
    - Replaced the lazy evaluation of the right-hand side (RHS) with immediate 
evaluation based on short-circuiting logic.
    - Introduced a new function `check_short_circuit` to determine if 
short-circuiting can be applied for logical operators.
    - Updated the logic to return early for `Operator::And` and `Operator::Or` 
based on the evaluation of the left-hand side (LHS) and the conditions of the 
RHS.
    - Improved clarity and efficiency of the short-circuit evaluation process 
by eliminating unnecessary evaluations.
    
    * refactor: simplify short-circuit evaluation logic in check_short_circuit 
function
    
    * datafusion_expr::lit as expr_lit
    
    * refactor: optimize short-circuit evaluation in check_short_circuit 
function
    - Simplified logic for AND/OR operations by prioritizing false/true counts 
to enhance performance.
    - Updated documentation to reflect changes in array handling techniques.
    
    * refactor: add count_boolean_values helper function and optimize 
check_short_circuit logic
    
    - Introduced a new helper function `count_boolean_values` to count true and 
false values in a BooleanArray, improving readability and performance.
    - Updated `check_short_circuit` to utilize the new helper function for 
counting, reducing redundant operations and enhancing clarity in the evaluation 
logic for AND/OR operations.
    - Adjusted comments for better understanding of the short-circuiting 
conditions based on the new counting mechanism.
    
    * Revert "refactor: add count_boolean_values helper function and optimize 
check_short_circuit logic"
    
    This reverts commit e2b9f777222759a530bd539ab0453c100b2699b4.
    
    * optimise evaluate
    
    * optimise evaluate 2
    
    * refactor  op:AND, lhs all false op:OR, lhs all true to be faster
    
    * fix clippy warning
    
    * refactor: optimize short-circuit evaluation logic in check_short_circuit 
function
    
    * fix clippy warning
    
    * add pre selection
    
    * add some comments
    
    * [WIP] fix pre-selection result
    
    * fix: Error in calculating the ratio
    
    * fix: Correct typo in pre-selection threshold constant and improve 
pre-selection scatter function documentation
    
    * fix doctest error
    
    * fix cargo doc
    
    * fix cargo doc
    
    * test: Add unit tests for pre_selection_scatter function
    
    ---------
    
    Co-authored-by: Siew Kam Onn <[email protected]>
---
 datafusion/physical-expr/benches/binary_op.rs      |  17 +-
 datafusion/physical-expr/src/expressions/binary.rs | 509 ++++++++++++++++++---
 2 files changed, 458 insertions(+), 68 deletions(-)

diff --git a/datafusion/physical-expr/benches/binary_op.rs 
b/datafusion/physical-expr/benches/binary_op.rs
index 59a602df05..216d8a520e 100644
--- a/datafusion/physical-expr/benches/binary_op.rs
+++ b/datafusion/physical-expr/benches/binary_op.rs
@@ -126,14 +126,25 @@ fn generate_boolean_cases<const TEST_ALL_FALSE: bool>(
         ));
     }
 
+    // Scenario 7: Test all true or all false in AND/OR
+    // This situation won't cause a short circuit, but it can skip the bool 
calculation
+    if TEST_ALL_FALSE {
+        let all_true = vec![true; len];
+        cases.push(("all_true_in_and".to_string(), 
BooleanArray::from(all_true)));
+    } else {
+        let all_false = vec![false; len];
+        cases.push(("all_false_in_or".to_string(), 
BooleanArray::from(all_false)));
+    }
+
     cases
 }
 
 /// Benchmarks AND/OR operator short-circuiting by evaluating complex regex 
conditions.
 ///
-/// Creates 6 test scenarios per operator:
+/// Creates 7 test scenarios per operator:
 /// 1. All values enable short-circuit (all_true/all_false)
 /// 2. 2-6 Single true/false value at different positions to measure early exit
+/// 3. Test all true or all false in AND/OR
 ///
 /// You can run this benchmark with:
 /// ```sh
@@ -203,7 +214,7 @@ fn benchmark_binary_op_in_short_circuit(c: &mut Criterion) {
 
     // Each scenario when the test operator is `and`
     {
-        for (name, batch) in batches_and {
+        for (name, batch) in batches_and.into_iter() {
             c.bench_function(&format!("short_circuit/and/{}", name), |b| {
                 b.iter(|| expr_and.evaluate(black_box(&batch)).unwrap())
             });
@@ -211,7 +222,7 @@ fn benchmark_binary_op_in_short_circuit(c: &mut Criterion) {
     }
     // Each scenario when the test operator is `or`
     {
-        for (name, batch) in batches_or {
+        for (name, batch) in batches_or.into_iter() {
             c.bench_function(&format!("short_circuit/or/{}", name), |b| {
                 b.iter(|| expr_or.evaluate(black_box(&batch)).unwrap())
             });
diff --git a/datafusion/physical-expr/src/expressions/binary.rs 
b/datafusion/physical-expr/src/expressions/binary.rs
index 84374f4a29..6c68d11e2c 100644
--- a/datafusion/physical-expr/src/expressions/binary.rs
+++ b/datafusion/physical-expr/src/expressions/binary.rs
@@ -29,7 +29,9 @@ use arrow::compute::kernels::boolean::{and_kleene, not, 
or_kleene};
 use arrow::compute::kernels::cmp::*;
 use arrow::compute::kernels::comparison::{regexp_is_match, 
regexp_is_match_scalar};
 use arrow::compute::kernels::concat_elements::concat_elements_utf8;
-use arrow::compute::{cast, ilike, like, nilike, nlike};
+use arrow::compute::{
+    cast, filter_record_batch, ilike, like, nilike, nlike, SlicesIterator,
+};
 use arrow::datatypes::*;
 use arrow::error::ArrowError;
 use datafusion_common::cast::as_boolean_array;
@@ -358,11 +360,24 @@ impl PhysicalExpr for BinaryExpr {
     fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
         use arrow::compute::kernels::numeric::*;
 
+        // Evaluate left-hand side expression.
         let lhs = self.left.evaluate(batch)?;
 
-        // Optimize for short-circuiting `Operator::And` or `Operator::Or` 
operations and return early.
-        if check_short_circuit(&lhs, &self.op) {
-            return Ok(lhs);
+        // Check if we can apply short-circuit evaluation.
+        match check_short_circuit(&lhs, &self.op) {
+            ShortCircuitStrategy::None => {}
+            ShortCircuitStrategy::ReturnLeft => return Ok(lhs),
+            ShortCircuitStrategy::ReturnRight => {
+                let rhs = self.right.evaluate(batch)?;
+                return Ok(rhs);
+            }
+            ShortCircuitStrategy::PreSelection(selection) => {
+                // The function `evaluate_selection` was not called for 
filtering and calculation,
+                // as it takes into account cases where the selection contains 
null values.
+                let batch = filter_record_batch(batch, selection)?;
+                let right_ret = self.right.evaluate(&batch)?;
+                return pre_selection_scatter(selection, right_ret);
+            }
         }
 
         let rhs = self.right.evaluate(batch)?;
@@ -405,23 +420,19 @@ impl PhysicalExpr for BinaryExpr {
 
         let result_type = self.data_type(input_schema)?;
 
-        // Attempt to use special kernels if one input is scalar and the other 
is an array
-        let scalar_result = match (&lhs, &rhs) {
-            (ColumnarValue::Array(array), ColumnarValue::Scalar(scalar)) => {
-                // if left is array and right is literal(not NULL) - use 
scalar operations
-                if scalar.is_null() {
-                    None
-                } else {
-                    self.evaluate_array_scalar(array, scalar.clone())?.map(|r| 
{
-                        r.and_then(|a| to_result_type_array(&self.op, a, 
&result_type))
-                    })
+        // If the left-hand side is an array and the right-hand side is a 
non-null scalar, try the optimized kernel.
+        if let (ColumnarValue::Array(array), ColumnarValue::Scalar(ref 
scalar)) =
+            (&lhs, &rhs)
+        {
+            if !scalar.is_null() {
+                if let Some(result_array) =
+                    self.evaluate_array_scalar(array, scalar.clone())?
+                {
+                    let final_array = result_array
+                        .and_then(|a| to_result_type_array(&self.op, a, 
&result_type));
+                    return final_array.map(ColumnarValue::Array);
                 }
             }
-            (_, _) => None, // default to array implementation
-        };
-
-        if let Some(result) = scalar_result {
-            return result.map(ColumnarValue::Array);
         }
 
         // if both arrays or both literals - extract arrays and continue 
execution
@@ -811,58 +822,199 @@ impl BinaryExpr {
     }
 }
 
+enum ShortCircuitStrategy<'a> {
+    None,
+    ReturnLeft,
+    ReturnRight,
+    PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit 
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is 
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTION_THRESHOLD: f32 = 0.2;
+
 /// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation 
based on the left-hand side (lhs) result.
 ///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes 
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be 
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be 
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///    - if LHS is all false => short-circuit → return LHS
+///    - if LHS is all true  => short-circuit → return RHS
+///    - if LHS is mixed and true_count/sum_count <= 
[`PRE_SELECTION_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///    - if LHS is all true  => short-circuit → return LHS
+///    - if LHS is all false => short-circuit → return RHS
 /// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
 /// * `op` - The logical operator (`AND` or `OR`)
 ///
 /// # Implementation Notes
 /// 1. Only works with Boolean-typed arguments (other types automatically 
return `false`)
 /// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from 
arrow-rs.
-///    `bool_or`/`bool_and` maybe a better choice too，for detailed 
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-    let data_type = arg.data_type();
-    match (data_type, op) {
-        (DataType::Boolean, Operator::And) => {
-            match arg {
-                ColumnarValue::Array(array) => {
-                    if let Ok(array) = as_boolean_array(&array) {
-                        return array.false_count() == array.len();
-                    }
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+    lhs: &'a ColumnarValue,
+    op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+    // Quick reject for non-logical operators,and quick judgment when op is and
+    let is_and = match op {
+        Operator::And => true,
+        Operator::Or => false,
+        _ => return ShortCircuitStrategy::None,
+    };
+
+    // Non-boolean types can't be short-circuited
+    if lhs.data_type() != DataType::Boolean {
+        return ShortCircuitStrategy::None;
+    }
+
+    match lhs {
+        ColumnarValue::Array(array) => {
+            // Fast path for arrays - try to downcast to boolean array
+            if let Ok(bool_array) = as_boolean_array(array) {
+                // Arrays with nulls can't be short-circuited
+                if bool_array.null_count() > 0 {
+                    return ShortCircuitStrategy::None;
                 }
-                ColumnarValue::Scalar(scalar) => {
-                    if let ScalarValue::Boolean(Some(value)) = scalar {
-                        return !value;
+
+                let len = bool_array.len();
+                if len == 0 {
+                    return ShortCircuitStrategy::None;
+                }
+
+                let true_count = bool_array.values().count_set_bits();
+                if is_and {
+                    // For AND, prioritize checking for all-false (short 
circuit case)
+                    // Uses optimized false_count() method provided by Arrow
+
+                    // Short circuit if all values are false
+                    if true_count == 0 {
+                        return ShortCircuitStrategy::ReturnLeft;
+                    }
+
+                    // If no false values, then all must be true
+                    if true_count == len {
+                        return ShortCircuitStrategy::ReturnRight;
+                    }
+
+                    // determine if we can pre-selection
+                    if true_count as f32 / len as f32 <= 
PRE_SELECTION_THRESHOLD {
+                        return ShortCircuitStrategy::PreSelection(bool_array);
+                    }
+                } else {
+                    // For OR, prioritize checking for all-true (short circuit 
case)
+                    // Uses optimized true_count() method provided by Arrow
+
+                    // Short circuit if all values are true
+                    if true_count == len {
+                        return ShortCircuitStrategy::ReturnLeft;
+                    }
+
+                    // If no true values, then all must be false
+                    if true_count == 0 {
+                        return ShortCircuitStrategy::ReturnRight;
                     }
                 }
             }
-            false
         }
-        (DataType::Boolean, Operator::Or) => {
-            match arg {
-                ColumnarValue::Array(array) => {
-                    if let Ok(array) = as_boolean_array(&array) {
-                        return array.true_count() == array.len();
-                    }
-                }
-                ColumnarValue::Scalar(scalar) => {
-                    if let ScalarValue::Boolean(Some(value)) = scalar {
-                        return *value;
-                    }
+        ColumnarValue::Scalar(scalar) => {
+            // Fast path for scalar values
+            if let ScalarValue::Boolean(Some(is_true)) = scalar {
+                // Return Left for:
+                // - AND with false value
+                // - OR with true value
+                if (is_and && !is_true) || (!is_and && *is_true) {
+                    return ShortCircuitStrategy::ReturnLeft;
+                } else {
+                    return ShortCircuitStrategy::ReturnRight;
                 }
             }
-            false
         }
-        _ => false,
     }
+
+    // If we can't short-circuit, indicate that normal evaluation should 
continue
+    ShortCircuitStrategy::None
+}
+
+/// Creates a new boolean array based on the evaluation of the right 
expression,
+/// but only for positions where the left_result is true.
+///
+/// This function is used for short-circuit evaluation optimization of logical 
AND operations:
+/// - When left_result has few true values, we only evaluate the right 
expression for those positions
+/// - Values are copied from right_array where left_result is true
+/// - All other positions are filled with false values
+///
+/// # Parameters
+/// - `left_result` Boolean array with selection mask (typically from left 
side of AND)
+/// - `right_result` Result of evaluating right side of expression (only for 
selected positions)
+///
+/// # Returns
+/// A combined ColumnarValue with values from right_result where left_result 
is true
+///
+/// # Example
+///  Initial Data: { 1, 2, 3, 4, 5 }
+///  Left Evaluation
+///     (Condition: Equal to 2 or 3)
+///          ↓
+///  Filtered Data: {2, 3}
+///    Left Bitmap: { 0, 1, 1, 0, 0 }
+///          ↓
+///   Right Evaluation
+///     (Condition: Even numbers)
+///          ↓
+///  Right Data: { 2 }
+///    Right Bitmap: { 1, 0 }
+///          ↓
+///   Combine Results
+///  Final Bitmap: { 0, 1, 0, 0, 0 }
+///
+/// # Note
+/// Perhaps it would be better to modify `left_result` directly without 
creating a copy?
+/// In practice, `left_result` should have only one owner, so making changes 
should be safe.
+/// However, this is difficult to achieve under the immutable constraints of 
[`Arc`] and [`BooleanArray`].
+fn pre_selection_scatter(
+    left_result: &BooleanArray,
+    right_result: ColumnarValue,
+) -> Result<ColumnarValue> {
+    let right_boolean_array = match &right_result {
+        ColumnarValue::Array(array) => array.as_boolean(),
+        ColumnarValue::Scalar(_) => return Ok(right_result),
+    };
+
+    let result_len = left_result.len();
+
+    let mut result_array_builder = BooleanArray::builder(result_len);
+
+    // keep track of current position we have in right boolean array
+    let mut right_array_pos = 0;
+
+    // keep track of how much is filled
+    let mut last_end = 0;
+    SlicesIterator::new(left_result).for_each(|(start, end)| {
+        // the gap needs to be filled with false
+        if start > last_end {
+            result_array_builder.append_n(start - last_end, false);
+        }
+
+        // copy values from right array for this slice
+        let len = end - start;
+        right_boolean_array
+            .slice(right_array_pos, len)
+            .iter()
+            .for_each(|v| result_array_builder.append_option(v));
+
+        right_array_pos += len;
+        last_end = end;
+    });
+
+    // Fill any remaining positions with false
+    if last_end < result_len {
+        result_array_builder.append_n(result_len - last_end, false);
+    }
+    let boolean_result = result_array_builder.finish();
+
+    Ok(ColumnarValue::Array(Arc::new(boolean_result)))
 }
 
 fn concat_elements(left: Arc<dyn Array>, right: Arc<dyn Array>) -> 
Result<ArrayRef> {
@@ -919,10 +1071,14 @@ pub fn similar_to(
 mod tests {
     use super::*;
     use crate::expressions::{col, lit, try_cast, Column, Literal};
+    use datafusion_expr::lit as expr_lit;
 
     use datafusion_common::plan_datafusion_err;
     use datafusion_physical_expr_common::physical_expr::fmt_sql;
 
+    use crate::planner::logical2physical;
+    use arrow::array::BooleanArray;
+    use datafusion_expr::col as logical_col;
     /// Performs a binary operation, applying any type coercion necessary
     fn binary_op(
         left: Arc<dyn PhysicalExpr>,
@@ -4895,9 +5051,7 @@ mod tests {
 
     #[test]
     fn test_check_short_circuit() {
-        use crate::planner::logical2physical;
-        use datafusion_expr::col as logical_col;
-        use datafusion_expr::lit;
+        // Test with non-nullable arrays
         let schema = Arc::new(Schema::new(vec![
             Field::new("a", DataType::Int32, false),
             Field::new("b", DataType::Int32, false),
@@ -4911,20 +5065,245 @@ mod tests {
         .unwrap();
 
         // op: AND left: all false
-        let left_expr = logical2physical(&logical_col("a").eq(lit(2)), 
&schema);
+        let left_expr = logical2physical(&logical_col("a").eq(expr_lit(2)), 
&schema);
         let left_value = left_expr.evaluate(&batch).unwrap();
-        assert!(check_short_circuit(&left_value, &Operator::And));
+        assert!(matches!(
+            check_short_circuit(&left_value, &Operator::And),
+            ShortCircuitStrategy::ReturnLeft
+        ));
+
         // op: AND left: not all false
-        let left_expr = logical2physical(&logical_col("a").eq(lit(3)), 
&schema);
+        let left_expr = logical2physical(&logical_col("a").eq(expr_lit(3)), 
&schema);
         let left_value = left_expr.evaluate(&batch).unwrap();
-        assert!(!check_short_circuit(&left_value, &Operator::And));
+        let ColumnarValue::Array(array) = &left_value else {
+            panic!("Expected ColumnarValue::Array");
+        };
+        let ShortCircuitStrategy::PreSelection(value) =
+            check_short_circuit(&left_value, &Operator::And)
+        else {
+            panic!("Expected ShortCircuitStrategy::PreSelection");
+        };
+        let expected_boolean_arr: Vec<_> =
+            as_boolean_array(array).unwrap().iter().collect();
+        let boolean_arr: Vec<_> = value.iter().collect();
+        assert_eq!(expected_boolean_arr, boolean_arr);
+
         // op: OR left: all true
-        let left_expr = logical2physical(&logical_col("a").gt(lit(0)), 
&schema);
+        let left_expr = logical2physical(&logical_col("a").gt(expr_lit(0)), 
&schema);
         let left_value = left_expr.evaluate(&batch).unwrap();
-        assert!(check_short_circuit(&left_value, &Operator::Or));
+        assert!(matches!(
+            check_short_circuit(&left_value, &Operator::Or),
+            ShortCircuitStrategy::ReturnLeft
+        ));
+
         // op: OR left: not all true
-        let left_expr = logical2physical(&logical_col("a").gt(lit(2)), 
&schema);
+        let left_expr: Arc<dyn PhysicalExpr> =
+            logical2physical(&logical_col("a").gt(expr_lit(2)), &schema);
         let left_value = left_expr.evaluate(&batch).unwrap();
-        assert!(!check_short_circuit(&left_value, &Operator::Or));
+        assert!(matches!(
+            check_short_circuit(&left_value, &Operator::Or),
+            ShortCircuitStrategy::None
+        ));
+
+        // Test with nullable arrays and null values
+        let schema_nullable = Arc::new(Schema::new(vec![
+            Field::new("c", DataType::Boolean, true),
+            Field::new("d", DataType::Boolean, true),
+        ]));
+
+        // Create arrays with null values
+        let c_array = Arc::new(BooleanArray::from(vec![
+            Some(true),
+            Some(false),
+            None,
+            Some(true),
+            None,
+        ])) as ArrayRef;
+        let d_array = Arc::new(BooleanArray::from(vec![
+            Some(false),
+            Some(true),
+            Some(false),
+            None,
+            Some(true),
+        ])) as ArrayRef;
+
+        let batch_nullable = RecordBatch::try_new(
+            Arc::clone(&schema_nullable),
+            vec![Arc::clone(&c_array), Arc::clone(&d_array)],
+        )
+        .unwrap();
+
+        // Case: Mixed values with nulls - shouldn't short-circuit for AND
+        let mixed_nulls = logical2physical(&logical_col("c"), 
&schema_nullable);
+        let mixed_nulls_value = mixed_nulls.evaluate(&batch_nullable).unwrap();
+        assert!(matches!(
+            check_short_circuit(&mixed_nulls_value, &Operator::And),
+            ShortCircuitStrategy::None
+        ));
+
+        // Case: Mixed values with nulls - shouldn't short-circuit for OR
+        assert!(matches!(
+            check_short_circuit(&mixed_nulls_value, &Operator::Or),
+            ShortCircuitStrategy::None
+        ));
+
+        // Test with all nulls
+        let all_nulls = Arc::new(BooleanArray::from(vec![None, None, None])) 
as ArrayRef;
+        let null_batch = RecordBatch::try_new(
+            Arc::new(Schema::new(vec![Field::new("e", DataType::Boolean, 
true)])),
+            vec![all_nulls],
+        )
+        .unwrap();
+
+        let null_expr = logical2physical(&logical_col("e"), 
&null_batch.schema());
+        let null_value = null_expr.evaluate(&null_batch).unwrap();
+
+        // All nulls shouldn't short-circuit for AND or OR
+        assert!(matches!(
+            check_short_circuit(&null_value, &Operator::And),
+            ShortCircuitStrategy::None
+        ));
+        assert!(matches!(
+            check_short_circuit(&null_value, &Operator::Or),
+            ShortCircuitStrategy::None
+        ));
+
+        // Test with scalar values
+        // Scalar true
+        let scalar_true = 
ColumnarValue::Scalar(ScalarValue::Boolean(Some(true)));
+        assert!(matches!(
+            check_short_circuit(&scalar_true, &Operator::Or),
+            ShortCircuitStrategy::ReturnLeft
+        )); // Should short-circuit OR
+        assert!(matches!(
+            check_short_circuit(&scalar_true, &Operator::And),
+            ShortCircuitStrategy::ReturnRight
+        )); // Should return the RHS for AND
+
+        // Scalar false
+        let scalar_false = 
ColumnarValue::Scalar(ScalarValue::Boolean(Some(false)));
+        assert!(matches!(
+            check_short_circuit(&scalar_false, &Operator::And),
+            ShortCircuitStrategy::ReturnLeft
+        )); // Should short-circuit AND
+        assert!(matches!(
+            check_short_circuit(&scalar_false, &Operator::Or),
+            ShortCircuitStrategy::ReturnRight
+        )); // Should return the RHS for OR
+
+        // Scalar null
+        let scalar_null = ColumnarValue::Scalar(ScalarValue::Boolean(None));
+        assert!(matches!(
+            check_short_circuit(&scalar_null, &Operator::And),
+            ShortCircuitStrategy::None
+        ));
+        assert!(matches!(
+            check_short_circuit(&scalar_null, &Operator::Or),
+            ShortCircuitStrategy::None
+        ));
+    }
+
+    /// Test for [pre_selection_scatter]
+    /// Since [check_short_circuit] ensures that the left side does not 
contain null and is neither all_true nor all_false, as well as not being empty,
+    /// the following tests have been designed:
+    /// 1. Test sparse left with interleaved true/false
+    /// 2. Test multiple consecutive true blocks
+    /// 3. Test multiple consecutive true blocks
+    /// 4. Test single true at first position
+    /// 5. Test single true at last position
+    /// 6. Test nulls in right array
+    /// 7. Test scalar right handling
+    #[test]
+    fn test_pre_selection_scatter() {
+        fn create_bool_array(bools: Vec<bool>) -> BooleanArray {
+            BooleanArray::from(bools.into_iter().map(Some).collect::<Vec<_>>())
+        }
+        // Test sparse left with interleaved true/false
+        {
+            // Left: [T, F, T, F, T]
+            // Right: [F, T, F] (values for 3 true positions)
+            let left = create_bool_array(vec![true, false, true, false, true]);
+            let right = ColumnarValue::Array(Arc::new(create_bool_array(vec![
+                false, true, false,
+            ])));
+
+            let result = pre_selection_scatter(&left, right).unwrap();
+            let result_arr = result.into_array(left.len()).unwrap();
+
+            let expected = create_bool_array(vec![false, false, true, false, 
false]);
+            assert_eq!(&expected, result_arr.as_boolean());
+        }
+        // Test multiple consecutive true blocks
+        {
+            // Left: [F, T, T, F, T, T, T]
+            // Right: [T, F, F, T, F]
+            let left =
+                create_bool_array(vec![false, true, true, false, true, true, 
true]);
+            let right = ColumnarValue::Array(Arc::new(create_bool_array(vec![
+                true, false, false, true, false,
+            ])));
+
+            let result = pre_selection_scatter(&left, right).unwrap();
+            let result_arr = result.into_array(left.len()).unwrap();
+
+            let expected =
+                create_bool_array(vec![false, true, false, false, false, true, 
false]);
+            assert_eq!(&expected, result_arr.as_boolean());
+        }
+        // Test single true at first position
+        {
+            // Left: [T, F, F]
+            // Right: [F]
+            let left = create_bool_array(vec![true, false, false]);
+            let right = 
ColumnarValue::Array(Arc::new(create_bool_array(vec![false])));
+
+            let result = pre_selection_scatter(&left, right).unwrap();
+            let result_arr = result.into_array(left.len()).unwrap();
+
+            let expected = create_bool_array(vec![false, false, false]);
+            assert_eq!(&expected, result_arr.as_boolean());
+        }
+        // Test single true at last position
+        {
+            // Left: [F, F, T]
+            // Right: [F]
+            let left = create_bool_array(vec![false, false, true]);
+            let right = 
ColumnarValue::Array(Arc::new(create_bool_array(vec![false])));
+
+            let result = pre_selection_scatter(&left, right).unwrap();
+            let result_arr = result.into_array(left.len()).unwrap();
+
+            let expected = create_bool_array(vec![false, false, false]);
+            assert_eq!(&expected, result_arr.as_boolean());
+        }
+        // Test nulls in right array
+        {
+            // Left: [F, T, F, T]
+            // Right: [None, Some(false)] (with null at first position)
+            let left = create_bool_array(vec![false, true, false, true]);
+            let right_arr = BooleanArray::from(vec![None, Some(false)]);
+            let right = ColumnarValue::Array(Arc::new(right_arr));
+
+            let result = pre_selection_scatter(&left, right).unwrap();
+            let result_arr = result.into_array(left.len()).unwrap();
+
+            let expected = BooleanArray::from(vec![
+                Some(false),
+                None, // null from right
+                Some(false),
+                Some(false),
+            ]);
+            assert_eq!(&expected, result_arr.as_boolean());
+        }
+        // Test scalar right handling
+        {
+            // Left: [T, F, T]
+            // Right: Scalar true
+            let left = create_bool_array(vec![true, false, true]);
+            let right = 
ColumnarValue::Scalar(ScalarValue::Boolean(Some(true)));
+
+            let result = pre_selection_scatter(&left, right).unwrap();
+            assert!(matches!(result, ColumnarValue::Scalar(_)));
+        }
     }
 }


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

(datafusion) branch main updated: Apply pre-selection and computation skipping to short-circuit optimization (#15694)

Reply via email to