Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-07-01 Thread via GitHub


acking-you commented on PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#issuecomment-3026216633

   > Maybe we can reuse the table from @acking-you above: [#15694 
(comment)](https://github.com/apache/datafusion/pull/15694#issuecomment-2800087698)
   
   By the way, https://github.com/apache/datafusion/pull/15462 can use Q6 from 
this table: https://github.com/apache/datafusion/pull/15462#issue-2954143137


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-07-01 Thread via GitHub


alamb commented on PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#issuecomment-3025430123

   > Is there a benchmark showcasing this change (and #15462) against previous 
behaviour? I'd like to showcase this a bit in the [DF 49 blog 
post](https://github.com/apache/datafusion/issues/16347).
   
   Maybe we can reuse the table from @acking-you  above: 
https://github.com/apache/datafusion/pull/15694#issuecomment-2800087698


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-07-01 Thread via GitHub


Omega359 commented on PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#issuecomment-3024965984

   Is there a benchmark showcasing this change (and #15462) against previous 
behaviour? I'd like to showcase this a bit in the [DF 49 blog 
post](https://github.com/apache/datafusion/issues/16347).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-17 Thread via GitHub


alamb commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2049350694


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -4911,20 +5065,245 @@ mod tests {
 .unwrap();
 
 // op: AND left: all false
-let left_expr = logical2physical(&logical_col("a").eq(lit(2)), 
&schema);
+let left_expr = logical2physical(&logical_col("a").eq(expr_lit(2)), 
&schema);
 let left_value = left_expr.evaluate(&batch).unwrap();
-assert!(check_short_circuit(&left_value, &Operator::And));
+assert!(matches!(
+check_short_circuit(&left_value, &Operator::And),
+ShortCircuitStrategy::ReturnLeft
+));
+
 // op: AND left: not all false
-let left_expr = logical2physical(&logical_col("a").eq(lit(3)), 
&schema);
+let left_expr = logical2physical(&logical_col("a").eq(expr_lit(3)), 
&schema);
 let left_value = left_expr.evaluate(&batch).unwrap();
-assert!(!check_short_circuit(&left_value, &Operator::And));
+let ColumnarValue::Array(array) = &left_value else {
+panic!("Expected ColumnarValue::Array");
+};
+let ShortCircuitStrategy::PreSelection(value) =
+check_short_circuit(&left_value, &Operator::And)
+else {
+panic!("Expected ShortCircuitStrategy::PreSelection");
+};
+let expected_boolean_arr: Vec<_> =
+as_boolean_array(array).unwrap().iter().collect();
+let boolean_arr: Vec<_> = value.iter().collect();
+assert_eq!(expected_boolean_arr, boolean_arr);
+
 // op: OR left: all true
-let left_expr = logical2physical(&logical_col("a").gt(lit(0)), 
&schema);
+let left_expr = logical2physical(&logical_col("a").gt(expr_lit(0)), 
&schema);
 let left_value = left_expr.evaluate(&batch).unwrap();
-assert!(check_short_circuit(&left_value, &Operator::Or));
+assert!(matches!(
+check_short_circuit(&left_value, &Operator::Or),
+ShortCircuitStrategy::ReturnLeft
+));
+
 // op: OR left: not all true
-let left_expr = logical2physical(&logical_col("a").gt(lit(2)), 
&schema);
+let left_expr: Arc =
+logical2physical(&logical_col("a").gt(expr_lit(2)), &schema);
 let left_value = left_expr.evaluate(&batch).unwrap();
-assert!(!check_short_circuit(&left_value, &Operator::Or));
+assert!(matches!(
+check_short_circuit(&left_value, &Operator::Or),
+ShortCircuitStrategy::None
+));
+
+// Test with nullable arrays and null values
+let schema_nullable = Arc::new(Schema::new(vec![
+Field::new("c", DataType::Boolean, true),
+Field::new("d", DataType::Boolean, true),
+]));
+
+// Create arrays with null values
+let c_array = Arc::new(BooleanArray::from(vec![
+Some(true),
+Some(false),
+None,
+Some(true),
+None,
+])) as ArrayRef;
+let d_array = Arc::new(BooleanArray::from(vec![
+Some(false),
+Some(true),
+Some(false),
+None,
+Some(true),
+])) as ArrayRef;
+
+let batch_nullable = RecordBatch::try_new(
+Arc::clone(&schema_nullable),
+vec![Arc::clone(&c_array), Arc::clone(&d_array)],
+)
+.unwrap();
+
+// Case: Mixed values with nulls - shouldn't short-circuit for AND
+let mixed_nulls = logical2physical(&logical_col("c"), 
&schema_nullable);
+let mixed_nulls_value = mixed_nulls.evaluate(&batch_nullable).unwrap();
+assert!(matches!(
+check_short_circuit(&mixed_nulls_value, &Operator::And),
+ShortCircuitStrategy::None
+));
+
+// Case: Mixed values with nulls - shouldn't short-circuit for OR
+assert!(matches!(
+check_short_circuit(&mixed_nulls_value, &Operator::Or),
+ShortCircuitStrategy::None
+));
+
+// Test with all nulls
+let all_nulls = Arc::new(BooleanArray::from(vec![None, None, None])) 
as ArrayRef;
+let null_batch = RecordBatch::try_new(
+Arc::new(Schema::new(vec![Field::new("e", DataType::Boolean, 
true)])),
+vec![all_nulls],
+)
+.unwrap();
+
+let null_expr = logical2physical(&logical_col("e"), 
&null_batch.schema());
+let null_value = null_expr.evaluate(&null_batch).unwrap();
+
+// All nulls shouldn't short-circuit for AND or OR
+assert!(matches!(
+check_short_circuit(&null_value, &Operator::And),
+ShortCircuitStrategy::None
+));
+assert!(matches!(
+check_short_circuit(&null_value, &Operator::Or),
+ShortCircuitStrategy::None
+));
+
+// Test with scalar values
+//

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-17 Thread via GitHub


alamb merged PR #15694:
URL: https://github.com/apache/datafusion/pull/15694


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-16 Thread via GitHub


acking-you commented on PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2810163625

   > The only thing I think is needed in this PR is a few more tests for the 
`pre_selection_scatter` function and then it will be ready to go
   
   done
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-16 Thread via GitHub


acking-you commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2047346420


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,199 @@ impl BinaryExpr {
 }
 }
 
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit 
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is 
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTION_THRESHOLD: f32 = 0.2;
+
 /// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation 
based on the left-hand side (lhs) result.
 ///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes 
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be 
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be 
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true  => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <= 
[`PRE_SELECTION_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true  => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
 /// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
 /// * `op` - The logical operator (`AND` or `OR`)
 ///
 /// # Implementation Notes
 /// 1. Only works with Boolean-typed arguments (other types automatically 
return `false`)
 /// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from 
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed 
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
 }
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short 
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <= 
PRE_SELECTION_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true 

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-15 Thread via GitHub


kosiew commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2045951368


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
 }
 }
 
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit 
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is 
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
 /// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation 
based on the left-hand side (lhs) result.
 ///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes 
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be 
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be 
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true  => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <= 
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true  => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
 /// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
 /// * `op` - The logical operator (`AND` or `OR`)
 ///
 /// # Implementation Notes
 /// 1. Only works with Boolean-typed arguments (other types automatically 
return `false`)
 /// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from 
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed 
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
 }
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short 
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <= 
PRE_SELECTIO_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true (short 

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-15 Thread via GitHub


acking-you commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2045899326


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,199 @@ impl BinaryExpr {
 }
 }
 
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit 
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is 
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTION_THRESHOLD: f32 = 0.2;
+
 /// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation 
based on the left-hand side (lhs) result.
 ///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes 
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be 
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be 
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true  => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <= 
[`PRE_SELECTION_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true  => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
 /// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
 /// * `op` - The logical operator (`AND` or `OR`)
 ///
 /// # Implementation Notes
 /// 1. Only works with Boolean-typed arguments (other types automatically 
return `false`)
 /// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from 
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed 
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
 }
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short 
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <= 
PRE_SELECTION_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true 

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-15 Thread via GitHub


alamb commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2045649141


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,199 @@ impl BinaryExpr {
 }
 }
 
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit 
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is 
an `and`,
+/// the `RecordBatch` will be filtered in advance.

Review Comment:
   this seems reasonable to me



##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,199 @@ impl BinaryExpr {
 }
 }
 
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit 
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is 
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTION_THRESHOLD: f32 = 0.2;
+
 /// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation 
based on the left-hand side (lhs) result.
 ///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes 
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be 
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be 
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true  => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <= 
[`PRE_SELECTION_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true  => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
 /// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
 /// * `op` - The logical operator (`AND` or `OR`)
 ///
 /// # Implementation Notes
 /// 1. Only works with Boolean-typed arguments (other types automatically 
return `false`)
 /// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from 
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed 
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
 }
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short 
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+ 

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-15 Thread via GitHub


acking-you commented on PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2807123847

   > 🤖: Benchmark completed
   > 
   > Details
   > 
   > 
   > 
   > ```
   > Comparing HEAD and short-and-optmize
   > 
   > Benchmark clickbench_extended.json
   > 
   > ┏━━┳┳━━━┳━━━┓
   > ┃ Query┃   HEAD ┃ short-and-optmize ┃Change ┃
   > ┡━━╇╇━━━╇━━━┩
   > │ QQuery 0 │  1918.00ms │ 1941.59ms │ no change │
   > │ QQuery 1 │   711.53ms │  702.35ms │ no change │
   > │ QQuery 2 │  1429.36ms │ 1446.13ms │ no change │
   > │ QQuery 3 │   704.72ms │  727.05ms │ no change │
   > │ QQuery 4 │  1516.84ms │ 1514.23ms │ no change │
   > │ QQuery 5 │ 17398.39ms │17271.95ms │ no change │
   > │ QQuery 6 │  2026.72ms │ 2124.80ms │ no change │
   > └──┴┴───┴───┘
   > ┏━━┳┓
   > ┃ Benchmark Summary┃┃
   > ┡━━╇┩
   > │ Total Time (HEAD)│ 25705.56ms │
   > │ Total Time (short-and-optmize)   │ 25728.10ms │
   > │ Average Time (HEAD)  │  3672.22ms │
   > │ Average Time (short-and-optmize) │  3675.44ms │
   > │ Queries Faster   │  0 │
   > │ Queries Slower   │  0 │
   > │ Queries with No Change   │  7 │
   > └──┴┘
   > 
   > Benchmark clickbench_partitioned.json
   > 
   > ┏━━┳┳━━━┳━━━┓
   > ┃ Query┃   HEAD ┃ short-and-optmize ┃Change ┃
   > ┡━━╇╇━━━╇━━━┩
   > │ QQuery 0 │ 2.74ms │2.25ms │ +1.21x faster │
   > │ QQuery 1 │36.47ms │   36.63ms │ no change │
   > │ QQuery 2 │90.59ms │   88.44ms │ no change │
   > │ QQuery 3 │   101.66ms │   95.74ms │ +1.06x faster │
   > │ QQuery 4 │   802.73ms │  840.27ms │ no change │
   > │ QQuery 5 │   891.98ms │  914.36ms │ no change │
   > │ QQuery 6 │ 2.10ms │2.14ms │ no change │
   > │ QQuery 7 │41.09ms │   42.92ms │ no change │
   > │ QQuery 8 │   948.80ms │  961.56ms │ no change │
   > │ QQuery 9 │  1269.90ms │ 1292.89ms │ no change │
   > │ QQuery 10│   264.01ms │  274.69ms │ no change │
   > │ QQuery 11│   301.65ms │  316.08ms │ no change │
   > │ QQuery 12│   943.62ms │  966.90ms │ no change │
   > │ QQuery 13│  1405.86ms │ 1426.41ms │ no change │
   > │ QQuery 14│   900.03ms │  898.57ms │ no change │
   > │ QQuery 15│  1079.97ms │ 1069.93ms │ no change │
   > │ QQuery 16│  1778.50ms │ 1782.02ms │ no change │
   > │ QQuery 17│  1665.53ms │ 1666.25ms │ no change │
   > │ QQuery 18│  3160.53ms │ 3186.64ms │ no change │
   > │ QQuery 19│86.25ms │   92.80ms │  1.08x slower │
   > │ QQuery 20│  1170.29ms │ 1183.35ms │ no change │
   > │ QQuery 21│  1355.56ms │ 1374.25ms │ no change │
   > │ QQuery 22│  2374.52ms │ 2322.80ms │ no change │
   > │ QQuery 23│  8507.41ms │ 8569.51ms │ no change │
   > │ QQuery 24│   470.15ms │  480.90ms │ no change │
   > │ QQuery 25│   394.74ms │  408.10ms │ no change │
   > │ QQuery 26│   539.09ms │  561.82ms │ no change │
   > │ QQuery 27│  1706.49ms │ 1747.01ms │ no change │
   > │ QQuery 28│ 12891.10ms │12837.09ms │ no change │
   > │ QQuery 29│   536.68ms │  522.79ms │ no change │
   > │ QQuery 30│   826.10ms │  849.08ms │ no change │
   > │ QQuery 31│   879.52ms │  918.11ms │ no change │
   > │ QQuery 32│  2784.99ms │ 2694.41ms │ no change │
   > │ QQuery 33│  3398.18ms │ 3423.14ms │ no change │
   > │ QQuery 34│  3435.31ms │ 3521.83ms │ no change │
   > │ QQuery 35│  1320.36ms │ 1303.37ms │ no change │
   > │ QQuery 36│   126.83ms │  132.86ms │ no change │
   > │ QQuery 37│57.78ms │   59.40ms │ no change │
   > │ QQuery 38│   126.83ms │  125.35ms │ no change │
   > │ QQuery 39│   201.53ms │  208.57ms │ no change │
   > │ QQuery 40│51.48ms │   47.72ms │ +1.08x faster │
   > │ QQuery 41│47.15ms │   46.73ms │ no change │
   > │ QQuery 42│40.77ms │  

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-15 Thread via GitHub


alamb commented on PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2807039992

   🤖: Benchmark completed
   
   Details
   
   
   
   ```
   Comparing HEAD and short-and-optmize
   
   Benchmark clickbench_extended.json
   
   ┏━━┳┳━━━┳━━━┓
   ┃ Query┃   HEAD ┃ short-and-optmize ┃Change ┃
   ┡━━╇╇━━━╇━━━┩
   │ QQuery 0 │  1918.00ms │ 1941.59ms │ no change │
   │ QQuery 1 │   711.53ms │  702.35ms │ no change │
   │ QQuery 2 │  1429.36ms │ 1446.13ms │ no change │
   │ QQuery 3 │   704.72ms │  727.05ms │ no change │
   │ QQuery 4 │  1516.84ms │ 1514.23ms │ no change │
   │ QQuery 5 │ 17398.39ms │17271.95ms │ no change │
   │ QQuery 6 │  2026.72ms │ 2124.80ms │ no change │
   └──┴┴───┴───┘
   ┏━━┳┓
   ┃ Benchmark Summary┃┃
   ┡━━╇┩
   │ Total Time (HEAD)│ 25705.56ms │
   │ Total Time (short-and-optmize)   │ 25728.10ms │
   │ Average Time (HEAD)  │  3672.22ms │
   │ Average Time (short-and-optmize) │  3675.44ms │
   │ Queries Faster   │  0 │
   │ Queries Slower   │  0 │
   │ Queries with No Change   │  7 │
   └──┴┘
   
   Benchmark clickbench_partitioned.json
   
   ┏━━┳┳━━━┳━━━┓
   ┃ Query┃   HEAD ┃ short-and-optmize ┃Change ┃
   ┡━━╇╇━━━╇━━━┩
   │ QQuery 0 │ 2.74ms │2.25ms │ +1.21x faster │
   │ QQuery 1 │36.47ms │   36.63ms │ no change │
   │ QQuery 2 │90.59ms │   88.44ms │ no change │
   │ QQuery 3 │   101.66ms │   95.74ms │ +1.06x faster │
   │ QQuery 4 │   802.73ms │  840.27ms │ no change │
   │ QQuery 5 │   891.98ms │  914.36ms │ no change │
   │ QQuery 6 │ 2.10ms │2.14ms │ no change │
   │ QQuery 7 │41.09ms │   42.92ms │ no change │
   │ QQuery 8 │   948.80ms │  961.56ms │ no change │
   │ QQuery 9 │  1269.90ms │ 1292.89ms │ no change │
   │ QQuery 10│   264.01ms │  274.69ms │ no change │
   │ QQuery 11│   301.65ms │  316.08ms │ no change │
   │ QQuery 12│   943.62ms │  966.90ms │ no change │
   │ QQuery 13│  1405.86ms │ 1426.41ms │ no change │
   │ QQuery 14│   900.03ms │  898.57ms │ no change │
   │ QQuery 15│  1079.97ms │ 1069.93ms │ no change │
   │ QQuery 16│  1778.50ms │ 1782.02ms │ no change │
   │ QQuery 17│  1665.53ms │ 1666.25ms │ no change │
   │ QQuery 18│  3160.53ms │ 3186.64ms │ no change │
   │ QQuery 19│86.25ms │   92.80ms │  1.08x slower │
   │ QQuery 20│  1170.29ms │ 1183.35ms │ no change │
   │ QQuery 21│  1355.56ms │ 1374.25ms │ no change │
   │ QQuery 22│  2374.52ms │ 2322.80ms │ no change │
   │ QQuery 23│  8507.41ms │ 8569.51ms │ no change │
   │ QQuery 24│   470.15ms │  480.90ms │ no change │
   │ QQuery 25│   394.74ms │  408.10ms │ no change │
   │ QQuery 26│   539.09ms │  561.82ms │ no change │
   │ QQuery 27│  1706.49ms │ 1747.01ms │ no change │
   │ QQuery 28│ 12891.10ms │12837.09ms │ no change │
   │ QQuery 29│   536.68ms │  522.79ms │ no change │
   │ QQuery 30│   826.10ms │  849.08ms │ no change │
   │ QQuery 31│   879.52ms │  918.11ms │ no change │
   │ QQuery 32│  2784.99ms │ 2694.41ms │ no change │
   │ QQuery 33│  3398.18ms │ 3423.14ms │ no change │
   │ QQuery 34│  3435.31ms │ 3521.83ms │ no change │
   │ QQuery 35│  1320.36ms │ 1303.37ms │ no change │
   │ QQuery 36│   126.83ms │  132.86ms │ no change │
   │ QQuery 37│57.78ms │   59.40ms │ no change │
   │ QQuery 38│   126.83ms │  125.35ms │ no change │
   │ QQuery 39│   201.53ms │  208.57ms │ no change │
   │ QQuery 40│51.48ms │   47.72ms │ +1.08x faster │
   │ QQuery 41│47.15ms │   46.73ms │ no change │
   │ QQuery 42│40.77ms │   39.91ms │ no change │
   └──┴┴───┴───┘
   ┏━━┳┓
   ┃ Benchmark Summa

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-15 Thread via GitHub


Dandandan commented on PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2800863670

   Very cool.
   It would be nice to run some e2e benchmarks (TPC-H, clickbench) with this to 
see the impact here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-15 Thread via GitHub


acking-you commented on PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2801015314

   > | However, one point needs to be confirmed: 
[filter_record_batch](https://docs.rs/arrow-select/54.2.1/src/arrow_select/filter.rs.html#202-205)
 will retain rows that are null.
   > 
   > `filter_record_batch` removes rows that are null.
   
   If that's the case, we may need to do some hacky things to extend it to 
nulls.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-15 Thread via GitHub


alamb commented on PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2806923704

   🤖 `./gh_compare_branch.sh` [Benchmark 
Script](https://github.com/alamb/datafusion-benchmarking) Running
   Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct  4 22:16:29 UTC 2024 
x86_64 x86_64 x86_64 GNU/Linux
   Comparing short-and-optmize (d104e3bba2aeb139c94f98b3a7b837d30d63e65c) to 
0b01fdf7f02f9097c319204058576f420b9790b4 
[diff](https://github.com/apache/datafusion/compare/0b01fdf7f02f9097c319204058576f420b9790b4..d104e3bba2aeb139c94f98b3a7b837d30d63e65c)
   Benchmarks: tpch_mem clickbench_partitioned clickbench_extended
   Results will be posted here when complete
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-15 Thread via GitHub


acking-you commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2043975638


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
 }
 }
 
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit 
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is 
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
 /// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation 
based on the left-hand side (lhs) result.
 ///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes 
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be 
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be 
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true  => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <= 
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true  => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
 /// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
 /// * `op` - The logical operator (`AND` or `OR`)
 ///
 /// # Implementation Notes
 /// 1. Only works with Boolean-typed arguments (other types automatically 
return `false`)
 /// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from 
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed 
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
 }
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short 
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <= 
PRE_SELECTIO_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true (sh

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-14 Thread via GitHub


Dandandan commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2042898541


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
 }
 }
 
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit 
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is 
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
 /// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation 
based on the left-hand side (lhs) result.
 ///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes 
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be 
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be 
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true  => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <= 
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true  => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
 /// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
 /// * `op` - The logical operator (`AND` or `OR`)
 ///
 /// # Implementation Notes
 /// 1. Only works with Boolean-typed arguments (other types automatically 
return `false`)
 /// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from 
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed 
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
 }
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short 
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <= 
PRE_SELECTIO_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true (sho

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-14 Thread via GitHub


acking-you commented on PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2801131412

   > Very cool. It would be nice to run some e2e benchmarks (TPC-H, clickbench) 
with this to see the impact here.
   
   I tried running clickbench, and there wasn't a significant improvement. The 
optimization is still relatively selective about the cost of expression 
evaluation. For example, if the computation on the right involves only simple 
integer comparisons or short string comparisons, skipping a large amount of 
computation may not yield much benefit.
   
   ## New ideas
   One potential area for significant improvement that I thought of is: 
heuristically replacing the computation process of 
[`filter_and_project`](https://github.com/apache/datafusion/blob/61e8a5d0d65859427c5882644e1cc97c4def0bde/datafusion/physical-plan/src/filter.rs#L514-L545).
   
   The current state of this function is:
   1. Pass in the batch to compute the predicate and obtain a boolean array.
   2. Perform the actual filtering process on the batch based on the boolean 
array and return the result.
   
   ## Pros
   Since the filtering process involves copying to produce a new batch, 
calculating a boolean array based on the predicate only requires a single copy. 
If filtering is performed during the predicate computation, multiple copies 
will occur.
   
   ## Cons
   If the predicate is an AND expression, after the left side is computed, it 
may already have filtered out most of the batch and computed a new batch. 
However, the final batch generation is still based on the previous batch and 
the boolean array, which introduces unnecessary copy overhead.
   
   
   ## Conclusion
   Considering both aspects, I believe we can strike a balance. For example, if 
it is discovered during the AND computation process that most of the batch can 
be filtered out, then early filtering can be performed as the final result.
   However, this optimization scenario is very limited: it can only be applied 
when the predicate is entirely an AND computation, such as `a AND b AND c ...`. 
Once an OR appears in the middle, we cannot perform early filtering to produce 
the result.
   
   The changes required to implement this optimization might be significant, 
and it would only apply to predicates in the form of `a AND b AND c...`. It 
might not be worth the effort.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-14 Thread via GitHub


Dandandan commented on PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2800856291

   | However, one point needs to be confirmed: 
[filter_record_batch](https://docs.rs/arrow-select/54.2.1/src/arrow_select/filter.rs.html#202-205)
 will retain rows that are null.
   
   `filter_record_batch` removes rows that are null.  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-14 Thread via GitHub


acking-you commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2041910661


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
 }
 }
 
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit 
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is 
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
 /// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation 
based on the left-hand side (lhs) result.
 ///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes 
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be 
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be 
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true  => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <= 
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true  => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
 /// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
 /// * `op` - The logical operator (`AND` or `OR`)
 ///
 /// # Implementation Notes
 /// 1. Only works with Boolean-typed arguments (other types automatically 
return `false`)
 /// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from 
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed 
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
 }
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short 
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <= 
PRE_SELECTIO_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true (sh

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-14 Thread via GitHub


acking-you commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2041670623


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
 }
 }
 
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit 
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is 
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
 /// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation 
based on the left-hand side (lhs) result.
 ///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes 
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be 
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be 
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true  => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <= 
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true  => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
 /// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
 /// * `op` - The logical operator (`AND` or `OR`)
 ///
 /// # Implementation Notes
 /// 1. Only works with Boolean-typed arguments (other types automatically 
return `false`)
 /// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from 
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed 
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
 }
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short 
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <= 
PRE_SELECTIO_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true (sh

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-14 Thread via GitHub


Dandandan commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2041741105


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
 }
 }
 
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit 
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is 
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
 /// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation 
based on the left-hand side (lhs) result.
 ///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes 
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be 
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be 
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true  => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <= 
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true  => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
 /// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
 /// * `op` - The logical operator (`AND` or `OR`)
 ///
 /// # Implementation Notes
 /// 1. Only works with Boolean-typed arguments (other types automatically 
return `false`)
 /// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from 
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed 
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
 }
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short 
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <= 
PRE_SELECTIO_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true (sho

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-14 Thread via GitHub


Dandandan commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2041741105


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
 }
 }
 
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit 
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is 
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
 /// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation 
based on the left-hand side (lhs) result.
 ///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes 
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be 
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be 
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true  => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <= 
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true  => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
 /// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
 /// * `op` - The logical operator (`AND` or `OR`)
 ///
 /// # Implementation Notes
 /// 1. Only works with Boolean-typed arguments (other types automatically 
return `false`)
 /// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from 
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed 
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
 }
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short 
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <= 
PRE_SELECTIO_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true (sho

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-14 Thread via GitHub


acking-you commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2041635786


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
 }
 }
 
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit 
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is 
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
 /// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation 
based on the left-hand side (lhs) result.
 ///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes 
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be 
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be 
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true  => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <= 
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true  => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
 /// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
 /// * `op` - The logical operator (`AND` or `OR`)
 ///
 /// # Implementation Notes
 /// 1. Only works with Boolean-typed arguments (other types automatically 
return `false`)
 /// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from 
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed 
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
 }
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short 
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <= 
PRE_SELECTIO_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true (sh

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-14 Thread via GitHub


acking-you commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2041635786


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
 }
 }
 
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit 
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is 
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
 /// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation 
based on the left-hand side (lhs) result.
 ///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes 
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be 
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be 
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true  => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <= 
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true  => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
 /// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
 /// * `op` - The logical operator (`AND` or `OR`)
 ///
 /// # Implementation Notes
 /// 1. Only works with Boolean-typed arguments (other types automatically 
return `false`)
 /// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from 
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed 
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
 }
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short 
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <= 
PRE_SELECTIO_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true (sh

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-14 Thread via GitHub


acking-you commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2041635786


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
 }
 }
 
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit 
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is 
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
 /// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation 
based on the left-hand side (lhs) result.
 ///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes 
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be 
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be 
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true  => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <= 
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true  => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
 /// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
 /// * `op` - The logical operator (`AND` or `OR`)
 ///
 /// # Implementation Notes
 /// 1. Only works with Boolean-typed arguments (other types automatically 
return `false`)
 /// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from 
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed 
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
 }
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short 
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <= 
PRE_SELECTIO_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true (sh

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-13 Thread via GitHub


kosiew commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2041391176


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
 }
 }
 
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit 
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is 
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
 /// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation 
based on the left-hand side (lhs) result.
 ///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes 
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be 
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be 
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true  => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <= 
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true  => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
 /// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
 /// * `op` - The logical operator (`AND` or `OR`)
 ///
 /// # Implementation Notes
 /// 1. Only works with Boolean-typed arguments (other types automatically 
return `false`)
 /// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from 
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed 
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
 }
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short 
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <= 
PRE_SELECTIO_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true (short 

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-13 Thread via GitHub


kosiew commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2041391176


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
 }
 }
 
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit 
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is 
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
 /// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation 
based on the left-hand side (lhs) result.
 ///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes 
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be 
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be 
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true  => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <= 
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true  => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
 /// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
 /// * `op` - The logical operator (`AND` or `OR`)
 ///
 /// # Implementation Notes
 /// 1. Only works with Boolean-typed arguments (other types automatically 
return `false`)
 /// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from 
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed 
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
 }
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short 
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <= 
PRE_SELECTIO_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true (short 

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-13 Thread via GitHub


kosiew commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2041328132


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
 }
 }
 
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit 
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is 
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;

Review Comment:
   ```suggestion
   const PRE_SELECTION_THRESHOLD: f32 = 0.2;
   ```



##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
 }
 }
 
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit 
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is 
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
 /// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation 
based on the left-hand side (lhs) result.
 ///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes 
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be 
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be 
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true  => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <= 
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true  => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
 /// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
 /// * `op` - The logical operator (`AND` or `OR`)
 ///
 /// # Implementation Notes
 /// 1. Only works with Boolean-typed arguments (other types automatically 
return `false`)
 /// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from 
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed 
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
 }
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short 
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+r

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-13 Thread via GitHub


acking-you commented on PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2800087698

   The relevant bug fixes have been completed, and corresponding performance 
tests have been conducted. The results show that pre-selection has achieved 
significant gains! @Dandandan  @alamb 
   
   Compare the current optimization with the main branch using `cargo bench 
--bench binary_op`. The results are as follows, where fluctuations within ±5% 
are considered as no changes.
   
   ## Performance Comparison of the AND Logic Group
   
   | Test Case | main  | short-and-optimize | Ratio   | 
Change |
   | - | - | -- | --- | 
-- |
   | all_false | 62.623 ns | 65.923 ns  | 0.95x   | no 
changes |
   | one_true_first| 448.69 µs | 195.60 µs  | ​**2.29x**​ | ↑ ✅ 
   |
   | one_true_last | 452.00 µs | 171.91 µs  | ​**2.63x**​ | ↑ ✅ 
   |
   | one_true_middle   | 453.12 µs | 173.94 µs  | ​**2.60x**​ | ↑ ✅ 
   |
   | one_true_middle_left  | 453.06 µs | 165.70 µs  | ​**2.73x**​ | ↑ ✅ 
   |
   | one_true_middle_right | 459.61 µs | 171.53 µs  | ​**2.68x**​ | ↑ ✅ 
   |
   | all_true_in_and   | 450.03 µs | 445.76 µs  | 1.01x   | no 
changes |
   
   ## Performance Comparison of the OR Logic Group
   
   | Test Case  | main  | short-and-optimize | Ratio | Change   
  |
   | -- | - | -- | - | 
-- |
   | all_true   | 61.162 ns | 64.430 ns  | 0.95x | no 
changes |
   | one_false_first| 448.51 µs | 439.92 µs  | 1.02x | no 
changes |
   | one_false_last | 447.38 µs | 453.64 µs  | 0.99x | no 
changes |
   | one_false_middle   | 457.79 µs | 447.15 µs  | 1.02x | no 
changes |
   | one_false_middle_left  | 452.78 µs | 447.75 µs  | 1.01x | no 
changes |
   | one_false_middle_right | 451.21 µs | 444.23 µs  | 1.02x | no 
changes |
   | all_false_in_or| 449.90 µs | 442.36 µs  | 1.02x | no 
changes |
   
   
   ## Possible next step(extend to nulls)
   ### Short-circuit optimization cannot be extended to nulls
   The current short-circuit optimization is only applicable to cases without 
null values. However, based on the calculation principles of "and" and "or", if 
the left-hand side (lhs) evaluates to null, then the final result can only be 
determined by continuing to calculate the right-hand side (rhs). Therefore, 
optimization for this scenario is not feasible. Below is an example of a 
calculation where lhs is null:
   ```sql
   ❯ select null and true;
   ++
   | NULL AND Boolean(true) |
   ++
   ||
   ++
   1 row in set. Query took 0.000 seconds.
   
   ❯ select null and false;
   +-+
   | NULL AND Boolean(false) |
   +-+
   | false   |
   +-+
   1 row in set. Query took 0.000 seconds.
   
   ❯ select null or false;
   ++
   | NULL OR Boolean(false) |
   ++
   ||
   ++
   1 row in set. Query took 0.000 seconds.
   
   ❯ select null or true;
   +---+
   | NULL OR Boolean(true) |
   +---+
   | true  |
   +---+
   1 row in set. Query took 0.000 seconds.
   ```
   
   ### Pre-selection can be extended to include nulls
   
   As I explained earlier: 
https://github.com/apache/datafusion/pull/15694#issuecomment-2799010340, 
pre-selection can actually be extended to cover cases involving null values. 
However, one point needs to be confirmed: 
[filter_record_batch](https://docs.rs/arrow-select/54.2.1/src/arrow_select/filter.rs.html#202-205)
 will retain rows that are null.
   
   > 4. Combine the left-hand and right-hand boolean arrays to produce the 
correct boolean array (modify the positions in the left-hand array marked as 
true based on the values from the right-hand array).
   
   Afterward, we only need to modify the fourth step of the pre-selection 
process mentioned earlier to complete the extension that supports nulls.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-12 Thread via GitHub


acking-you commented on PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2799010340

   The error in `cargo test` is caused by an incorrect calculation of the 
pre-selection. The correct steps for calculating the pre-selection are as 
follows:
   
   1. Compute the boolean array on the left-hand side.
   2. Filter to obtain a new record batch based on the left-hand boolean array.
   3. Compute the boolean array on the right-hand side using the new record 
batch (note that this boolean array will have incorrect positions since it 
corresponds to the new record batch).
   4. Combine the left-hand and right-hand boolean arrays to produce the 
correct boolean array (modify the positions in the left-hand array marked as 
`true` based on the values from the right-hand array).
   
   To illustrate this, I’ve drawn a diagram (it’s quite rough since I used a 
mouse, so I hope you don’t mind 😂):
   
   
![image](https://github.com/user-attachments/assets/83197db1-fed1-47da-90e4-07acb4d7de69)
   
   
   The current code only handles up to the second step, which is why the error 
occurs. If you use `evaluate_selection`, the issue persists because its 
internal call to `scatter` (which corresponds to completing the fourth step) 
fills in the missing parts of the right-hand boolean array with null values, 
resulting in an incorrect outcome.
   
   I’m currently working on implementing the fourth step to fix the issue, but 
it may take some time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]