Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
acking-you commented on PR #15694: URL: https://github.com/apache/datafusion/pull/15694#issuecomment-3026216633 > Maybe we can reuse the table from @acking-you above: [#15694 (comment)](https://github.com/apache/datafusion/pull/15694#issuecomment-2800087698) By the way, https://github.com/apache/datafusion/pull/15462 can use Q6 from this table: https://github.com/apache/datafusion/pull/15462#issue-2954143137 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
alamb commented on PR #15694: URL: https://github.com/apache/datafusion/pull/15694#issuecomment-3025430123 > Is there a benchmark showcasing this change (and #15462) against previous behaviour? I'd like to showcase this a bit in the [DF 49 blog post](https://github.com/apache/datafusion/issues/16347). Maybe we can reuse the table from @acking-you above: https://github.com/apache/datafusion/pull/15694#issuecomment-2800087698 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
Omega359 commented on PR #15694: URL: https://github.com/apache/datafusion/pull/15694#issuecomment-3024965984 Is there a benchmark showcasing this change (and #15462) against previous behaviour? I'd like to showcase this a bit in the [DF 49 blog post](https://github.com/apache/datafusion/issues/16347). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
alamb commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2049350694
##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -4911,20 +5065,245 @@ mod tests {
.unwrap();
// op: AND left: all false
-let left_expr = logical2physical(&logical_col("a").eq(lit(2)),
&schema);
+let left_expr = logical2physical(&logical_col("a").eq(expr_lit(2)),
&schema);
let left_value = left_expr.evaluate(&batch).unwrap();
-assert!(check_short_circuit(&left_value, &Operator::And));
+assert!(matches!(
+check_short_circuit(&left_value, &Operator::And),
+ShortCircuitStrategy::ReturnLeft
+));
+
// op: AND left: not all false
-let left_expr = logical2physical(&logical_col("a").eq(lit(3)),
&schema);
+let left_expr = logical2physical(&logical_col("a").eq(expr_lit(3)),
&schema);
let left_value = left_expr.evaluate(&batch).unwrap();
-assert!(!check_short_circuit(&left_value, &Operator::And));
+let ColumnarValue::Array(array) = &left_value else {
+panic!("Expected ColumnarValue::Array");
+};
+let ShortCircuitStrategy::PreSelection(value) =
+check_short_circuit(&left_value, &Operator::And)
+else {
+panic!("Expected ShortCircuitStrategy::PreSelection");
+};
+let expected_boolean_arr: Vec<_> =
+as_boolean_array(array).unwrap().iter().collect();
+let boolean_arr: Vec<_> = value.iter().collect();
+assert_eq!(expected_boolean_arr, boolean_arr);
+
// op: OR left: all true
-let left_expr = logical2physical(&logical_col("a").gt(lit(0)),
&schema);
+let left_expr = logical2physical(&logical_col("a").gt(expr_lit(0)),
&schema);
let left_value = left_expr.evaluate(&batch).unwrap();
-assert!(check_short_circuit(&left_value, &Operator::Or));
+assert!(matches!(
+check_short_circuit(&left_value, &Operator::Or),
+ShortCircuitStrategy::ReturnLeft
+));
+
// op: OR left: not all true
-let left_expr = logical2physical(&logical_col("a").gt(lit(2)),
&schema);
+let left_expr: Arc =
+logical2physical(&logical_col("a").gt(expr_lit(2)), &schema);
let left_value = left_expr.evaluate(&batch).unwrap();
-assert!(!check_short_circuit(&left_value, &Operator::Or));
+assert!(matches!(
+check_short_circuit(&left_value, &Operator::Or),
+ShortCircuitStrategy::None
+));
+
+// Test with nullable arrays and null values
+let schema_nullable = Arc::new(Schema::new(vec![
+Field::new("c", DataType::Boolean, true),
+Field::new("d", DataType::Boolean, true),
+]));
+
+// Create arrays with null values
+let c_array = Arc::new(BooleanArray::from(vec![
+Some(true),
+Some(false),
+None,
+Some(true),
+None,
+])) as ArrayRef;
+let d_array = Arc::new(BooleanArray::from(vec![
+Some(false),
+Some(true),
+Some(false),
+None,
+Some(true),
+])) as ArrayRef;
+
+let batch_nullable = RecordBatch::try_new(
+Arc::clone(&schema_nullable),
+vec![Arc::clone(&c_array), Arc::clone(&d_array)],
+)
+.unwrap();
+
+// Case: Mixed values with nulls - shouldn't short-circuit for AND
+let mixed_nulls = logical2physical(&logical_col("c"),
&schema_nullable);
+let mixed_nulls_value = mixed_nulls.evaluate(&batch_nullable).unwrap();
+assert!(matches!(
+check_short_circuit(&mixed_nulls_value, &Operator::And),
+ShortCircuitStrategy::None
+));
+
+// Case: Mixed values with nulls - shouldn't short-circuit for OR
+assert!(matches!(
+check_short_circuit(&mixed_nulls_value, &Operator::Or),
+ShortCircuitStrategy::None
+));
+
+// Test with all nulls
+let all_nulls = Arc::new(BooleanArray::from(vec![None, None, None]))
as ArrayRef;
+let null_batch = RecordBatch::try_new(
+Arc::new(Schema::new(vec![Field::new("e", DataType::Boolean,
true)])),
+vec![all_nulls],
+)
+.unwrap();
+
+let null_expr = logical2physical(&logical_col("e"),
&null_batch.schema());
+let null_value = null_expr.evaluate(&null_batch).unwrap();
+
+// All nulls shouldn't short-circuit for AND or OR
+assert!(matches!(
+check_short_circuit(&null_value, &Operator::And),
+ShortCircuitStrategy::None
+));
+assert!(matches!(
+check_short_circuit(&null_value, &Operator::Or),
+ShortCircuitStrategy::None
+));
+
+// Test with scalar values
+//
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
alamb merged PR #15694: URL: https://github.com/apache/datafusion/pull/15694 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
acking-you commented on PR #15694: URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2810163625 > The only thing I think is needed in this PR is a few more tests for the `pre_selection_scatter` function and then it will be ready to go done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
acking-you commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2047346420
##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,199 @@ impl BinaryExpr {
}
}
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTION_THRESHOLD: f32 = 0.2;
+
/// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation
based on the left-hand side (lhs) result.
///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <=
[`PRE_SELECTION_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
/// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
/// * `op` - The logical operator (`AND` or `OR`)
///
/// # Implementation Notes
/// 1. Only works with Boolean-typed arguments (other types automatically
return `false`)
/// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
}
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <=
PRE_SELECTION_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
kosiew commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2045951368
##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
}
}
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
/// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation
based on the left-hand side (lhs) result.
///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <=
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
/// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
/// * `op` - The logical operator (`AND` or `OR`)
///
/// # Implementation Notes
/// 1. Only works with Boolean-typed arguments (other types automatically
return `false`)
/// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
}
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <=
PRE_SELECTIO_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true (short
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
acking-you commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2045899326
##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,199 @@ impl BinaryExpr {
}
}
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTION_THRESHOLD: f32 = 0.2;
+
/// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation
based on the left-hand side (lhs) result.
///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <=
[`PRE_SELECTION_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
/// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
/// * `op` - The logical operator (`AND` or `OR`)
///
/// # Implementation Notes
/// 1. Only works with Boolean-typed arguments (other types automatically
return `false`)
/// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
}
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <=
PRE_SELECTION_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
alamb commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2045649141
##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,199 @@ impl BinaryExpr {
}
}
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is
an `and`,
+/// the `RecordBatch` will be filtered in advance.
Review Comment:
this seems reasonable to me
##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,199 @@ impl BinaryExpr {
}
}
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTION_THRESHOLD: f32 = 0.2;
+
/// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation
based on the left-hand side (lhs) result.
///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <=
[`PRE_SELECTION_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
/// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
/// * `op` - The logical operator (`AND` or `OR`)
///
/// # Implementation Notes
/// 1. Only works with Boolean-typed arguments (other types automatically
return `false`)
/// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
}
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
acking-you commented on PR #15694: URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2807123847 > 🤖: Benchmark completed > > Details > > > > ``` > Comparing HEAD and short-and-optmize > > Benchmark clickbench_extended.json > > ┏━━┳┳━━━┳━━━┓ > ┃ Query┃ HEAD ┃ short-and-optmize ┃Change ┃ > ┡━━╇╇━━━╇━━━┩ > │ QQuery 0 │ 1918.00ms │ 1941.59ms │ no change │ > │ QQuery 1 │ 711.53ms │ 702.35ms │ no change │ > │ QQuery 2 │ 1429.36ms │ 1446.13ms │ no change │ > │ QQuery 3 │ 704.72ms │ 727.05ms │ no change │ > │ QQuery 4 │ 1516.84ms │ 1514.23ms │ no change │ > │ QQuery 5 │ 17398.39ms │17271.95ms │ no change │ > │ QQuery 6 │ 2026.72ms │ 2124.80ms │ no change │ > └──┴┴───┴───┘ > ┏━━┳┓ > ┃ Benchmark Summary┃┃ > ┡━━╇┩ > │ Total Time (HEAD)│ 25705.56ms │ > │ Total Time (short-and-optmize) │ 25728.10ms │ > │ Average Time (HEAD) │ 3672.22ms │ > │ Average Time (short-and-optmize) │ 3675.44ms │ > │ Queries Faster │ 0 │ > │ Queries Slower │ 0 │ > │ Queries with No Change │ 7 │ > └──┴┘ > > Benchmark clickbench_partitioned.json > > ┏━━┳┳━━━┳━━━┓ > ┃ Query┃ HEAD ┃ short-and-optmize ┃Change ┃ > ┡━━╇╇━━━╇━━━┩ > │ QQuery 0 │ 2.74ms │2.25ms │ +1.21x faster │ > │ QQuery 1 │36.47ms │ 36.63ms │ no change │ > │ QQuery 2 │90.59ms │ 88.44ms │ no change │ > │ QQuery 3 │ 101.66ms │ 95.74ms │ +1.06x faster │ > │ QQuery 4 │ 802.73ms │ 840.27ms │ no change │ > │ QQuery 5 │ 891.98ms │ 914.36ms │ no change │ > │ QQuery 6 │ 2.10ms │2.14ms │ no change │ > │ QQuery 7 │41.09ms │ 42.92ms │ no change │ > │ QQuery 8 │ 948.80ms │ 961.56ms │ no change │ > │ QQuery 9 │ 1269.90ms │ 1292.89ms │ no change │ > │ QQuery 10│ 264.01ms │ 274.69ms │ no change │ > │ QQuery 11│ 301.65ms │ 316.08ms │ no change │ > │ QQuery 12│ 943.62ms │ 966.90ms │ no change │ > │ QQuery 13│ 1405.86ms │ 1426.41ms │ no change │ > │ QQuery 14│ 900.03ms │ 898.57ms │ no change │ > │ QQuery 15│ 1079.97ms │ 1069.93ms │ no change │ > │ QQuery 16│ 1778.50ms │ 1782.02ms │ no change │ > │ QQuery 17│ 1665.53ms │ 1666.25ms │ no change │ > │ QQuery 18│ 3160.53ms │ 3186.64ms │ no change │ > │ QQuery 19│86.25ms │ 92.80ms │ 1.08x slower │ > │ QQuery 20│ 1170.29ms │ 1183.35ms │ no change │ > │ QQuery 21│ 1355.56ms │ 1374.25ms │ no change │ > │ QQuery 22│ 2374.52ms │ 2322.80ms │ no change │ > │ QQuery 23│ 8507.41ms │ 8569.51ms │ no change │ > │ QQuery 24│ 470.15ms │ 480.90ms │ no change │ > │ QQuery 25│ 394.74ms │ 408.10ms │ no change │ > │ QQuery 26│ 539.09ms │ 561.82ms │ no change │ > │ QQuery 27│ 1706.49ms │ 1747.01ms │ no change │ > │ QQuery 28│ 12891.10ms │12837.09ms │ no change │ > │ QQuery 29│ 536.68ms │ 522.79ms │ no change │ > │ QQuery 30│ 826.10ms │ 849.08ms │ no change │ > │ QQuery 31│ 879.52ms │ 918.11ms │ no change │ > │ QQuery 32│ 2784.99ms │ 2694.41ms │ no change │ > │ QQuery 33│ 3398.18ms │ 3423.14ms │ no change │ > │ QQuery 34│ 3435.31ms │ 3521.83ms │ no change │ > │ QQuery 35│ 1320.36ms │ 1303.37ms │ no change │ > │ QQuery 36│ 126.83ms │ 132.86ms │ no change │ > │ QQuery 37│57.78ms │ 59.40ms │ no change │ > │ QQuery 38│ 126.83ms │ 125.35ms │ no change │ > │ QQuery 39│ 201.53ms │ 208.57ms │ no change │ > │ QQuery 40│51.48ms │ 47.72ms │ +1.08x faster │ > │ QQuery 41│47.15ms │ 46.73ms │ no change │ > │ QQuery 42│40.77ms │
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
alamb commented on PR #15694: URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2807039992 🤖: Benchmark completed Details ``` Comparing HEAD and short-and-optmize Benchmark clickbench_extended.json ┏━━┳┳━━━┳━━━┓ ┃ Query┃ HEAD ┃ short-and-optmize ┃Change ┃ ┡━━╇╇━━━╇━━━┩ │ QQuery 0 │ 1918.00ms │ 1941.59ms │ no change │ │ QQuery 1 │ 711.53ms │ 702.35ms │ no change │ │ QQuery 2 │ 1429.36ms │ 1446.13ms │ no change │ │ QQuery 3 │ 704.72ms │ 727.05ms │ no change │ │ QQuery 4 │ 1516.84ms │ 1514.23ms │ no change │ │ QQuery 5 │ 17398.39ms │17271.95ms │ no change │ │ QQuery 6 │ 2026.72ms │ 2124.80ms │ no change │ └──┴┴───┴───┘ ┏━━┳┓ ┃ Benchmark Summary┃┃ ┡━━╇┩ │ Total Time (HEAD)│ 25705.56ms │ │ Total Time (short-and-optmize) │ 25728.10ms │ │ Average Time (HEAD) │ 3672.22ms │ │ Average Time (short-and-optmize) │ 3675.44ms │ │ Queries Faster │ 0 │ │ Queries Slower │ 0 │ │ Queries with No Change │ 7 │ └──┴┘ Benchmark clickbench_partitioned.json ┏━━┳┳━━━┳━━━┓ ┃ Query┃ HEAD ┃ short-and-optmize ┃Change ┃ ┡━━╇╇━━━╇━━━┩ │ QQuery 0 │ 2.74ms │2.25ms │ +1.21x faster │ │ QQuery 1 │36.47ms │ 36.63ms │ no change │ │ QQuery 2 │90.59ms │ 88.44ms │ no change │ │ QQuery 3 │ 101.66ms │ 95.74ms │ +1.06x faster │ │ QQuery 4 │ 802.73ms │ 840.27ms │ no change │ │ QQuery 5 │ 891.98ms │ 914.36ms │ no change │ │ QQuery 6 │ 2.10ms │2.14ms │ no change │ │ QQuery 7 │41.09ms │ 42.92ms │ no change │ │ QQuery 8 │ 948.80ms │ 961.56ms │ no change │ │ QQuery 9 │ 1269.90ms │ 1292.89ms │ no change │ │ QQuery 10│ 264.01ms │ 274.69ms │ no change │ │ QQuery 11│ 301.65ms │ 316.08ms │ no change │ │ QQuery 12│ 943.62ms │ 966.90ms │ no change │ │ QQuery 13│ 1405.86ms │ 1426.41ms │ no change │ │ QQuery 14│ 900.03ms │ 898.57ms │ no change │ │ QQuery 15│ 1079.97ms │ 1069.93ms │ no change │ │ QQuery 16│ 1778.50ms │ 1782.02ms │ no change │ │ QQuery 17│ 1665.53ms │ 1666.25ms │ no change │ │ QQuery 18│ 3160.53ms │ 3186.64ms │ no change │ │ QQuery 19│86.25ms │ 92.80ms │ 1.08x slower │ │ QQuery 20│ 1170.29ms │ 1183.35ms │ no change │ │ QQuery 21│ 1355.56ms │ 1374.25ms │ no change │ │ QQuery 22│ 2374.52ms │ 2322.80ms │ no change │ │ QQuery 23│ 8507.41ms │ 8569.51ms │ no change │ │ QQuery 24│ 470.15ms │ 480.90ms │ no change │ │ QQuery 25│ 394.74ms │ 408.10ms │ no change │ │ QQuery 26│ 539.09ms │ 561.82ms │ no change │ │ QQuery 27│ 1706.49ms │ 1747.01ms │ no change │ │ QQuery 28│ 12891.10ms │12837.09ms │ no change │ │ QQuery 29│ 536.68ms │ 522.79ms │ no change │ │ QQuery 30│ 826.10ms │ 849.08ms │ no change │ │ QQuery 31│ 879.52ms │ 918.11ms │ no change │ │ QQuery 32│ 2784.99ms │ 2694.41ms │ no change │ │ QQuery 33│ 3398.18ms │ 3423.14ms │ no change │ │ QQuery 34│ 3435.31ms │ 3521.83ms │ no change │ │ QQuery 35│ 1320.36ms │ 1303.37ms │ no change │ │ QQuery 36│ 126.83ms │ 132.86ms │ no change │ │ QQuery 37│57.78ms │ 59.40ms │ no change │ │ QQuery 38│ 126.83ms │ 125.35ms │ no change │ │ QQuery 39│ 201.53ms │ 208.57ms │ no change │ │ QQuery 40│51.48ms │ 47.72ms │ +1.08x faster │ │ QQuery 41│47.15ms │ 46.73ms │ no change │ │ QQuery 42│40.77ms │ 39.91ms │ no change │ └──┴┴───┴───┘ ┏━━┳┓ ┃ Benchmark Summa
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
Dandandan commented on PR #15694: URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2800863670 Very cool. It would be nice to run some e2e benchmarks (TPC-H, clickbench) with this to see the impact here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
acking-you commented on PR #15694: URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2801015314 > | However, one point needs to be confirmed: [filter_record_batch](https://docs.rs/arrow-select/54.2.1/src/arrow_select/filter.rs.html#202-205) will retain rows that are null. > > `filter_record_batch` removes rows that are null. If that's the case, we may need to do some hacky things to extend it to nulls. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
alamb commented on PR #15694: URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2806923704 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking) Running Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux Comparing short-and-optmize (d104e3bba2aeb139c94f98b3a7b837d30d63e65c) to 0b01fdf7f02f9097c319204058576f420b9790b4 [diff](https://github.com/apache/datafusion/compare/0b01fdf7f02f9097c319204058576f420b9790b4..d104e3bba2aeb139c94f98b3a7b837d30d63e65c) Benchmarks: tpch_mem clickbench_partitioned clickbench_extended Results will be posted here when complete -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
acking-you commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2043975638
##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
}
}
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
/// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation
based on the left-hand side (lhs) result.
///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <=
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
/// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
/// * `op` - The logical operator (`AND` or `OR`)
///
/// # Implementation Notes
/// 1. Only works with Boolean-typed arguments (other types automatically
return `false`)
/// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
}
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <=
PRE_SELECTIO_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true (sh
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
Dandandan commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2042898541
##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
}
}
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
/// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation
based on the left-hand side (lhs) result.
///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <=
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
/// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
/// * `op` - The logical operator (`AND` or `OR`)
///
/// # Implementation Notes
/// 1. Only works with Boolean-typed arguments (other types automatically
return `false`)
/// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
}
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <=
PRE_SELECTIO_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true (sho
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
acking-you commented on PR #15694: URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2801131412 > Very cool. It would be nice to run some e2e benchmarks (TPC-H, clickbench) with this to see the impact here. I tried running clickbench, and there wasn't a significant improvement. The optimization is still relatively selective about the cost of expression evaluation. For example, if the computation on the right involves only simple integer comparisons or short string comparisons, skipping a large amount of computation may not yield much benefit. ## New ideas One potential area for significant improvement that I thought of is: heuristically replacing the computation process of [`filter_and_project`](https://github.com/apache/datafusion/blob/61e8a5d0d65859427c5882644e1cc97c4def0bde/datafusion/physical-plan/src/filter.rs#L514-L545). The current state of this function is: 1. Pass in the batch to compute the predicate and obtain a boolean array. 2. Perform the actual filtering process on the batch based on the boolean array and return the result. ## Pros Since the filtering process involves copying to produce a new batch, calculating a boolean array based on the predicate only requires a single copy. If filtering is performed during the predicate computation, multiple copies will occur. ## Cons If the predicate is an AND expression, after the left side is computed, it may already have filtered out most of the batch and computed a new batch. However, the final batch generation is still based on the previous batch and the boolean array, which introduces unnecessary copy overhead. ## Conclusion Considering both aspects, I believe we can strike a balance. For example, if it is discovered during the AND computation process that most of the batch can be filtered out, then early filtering can be performed as the final result. However, this optimization scenario is very limited: it can only be applied when the predicate is entirely an AND computation, such as `a AND b AND c ...`. Once an OR appears in the middle, we cannot perform early filtering to produce the result. The changes required to implement this optimization might be significant, and it would only apply to predicates in the form of `a AND b AND c...`. It might not be worth the effort. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
Dandandan commented on PR #15694: URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2800856291 | However, one point needs to be confirmed: [filter_record_batch](https://docs.rs/arrow-select/54.2.1/src/arrow_select/filter.rs.html#202-205) will retain rows that are null. `filter_record_batch` removes rows that are null. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
acking-you commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2041910661
##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
}
}
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
/// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation
based on the left-hand side (lhs) result.
///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <=
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
/// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
/// * `op` - The logical operator (`AND` or `OR`)
///
/// # Implementation Notes
/// 1. Only works with Boolean-typed arguments (other types automatically
return `false`)
/// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
}
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <=
PRE_SELECTIO_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true (sh
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
acking-you commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2041670623
##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
}
}
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
/// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation
based on the left-hand side (lhs) result.
///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <=
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
/// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
/// * `op` - The logical operator (`AND` or `OR`)
///
/// # Implementation Notes
/// 1. Only works with Boolean-typed arguments (other types automatically
return `false`)
/// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
}
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <=
PRE_SELECTIO_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true (sh
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
Dandandan commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2041741105
##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
}
}
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
/// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation
based on the left-hand side (lhs) result.
///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <=
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
/// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
/// * `op` - The logical operator (`AND` or `OR`)
///
/// # Implementation Notes
/// 1. Only works with Boolean-typed arguments (other types automatically
return `false`)
/// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
}
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <=
PRE_SELECTIO_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true (sho
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
Dandandan commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2041741105
##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
}
}
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
/// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation
based on the left-hand side (lhs) result.
///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <=
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
/// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
/// * `op` - The logical operator (`AND` or `OR`)
///
/// # Implementation Notes
/// 1. Only works with Boolean-typed arguments (other types automatically
return `false`)
/// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
}
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <=
PRE_SELECTIO_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true (sho
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
acking-you commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2041635786
##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
}
}
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
/// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation
based on the left-hand side (lhs) result.
///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <=
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
/// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
/// * `op` - The logical operator (`AND` or `OR`)
///
/// # Implementation Notes
/// 1. Only works with Boolean-typed arguments (other types automatically
return `false`)
/// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
}
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <=
PRE_SELECTIO_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true (sh
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
acking-you commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2041635786
##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
}
}
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
/// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation
based on the left-hand side (lhs) result.
///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <=
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
/// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
/// * `op` - The logical operator (`AND` or `OR`)
///
/// # Implementation Notes
/// 1. Only works with Boolean-typed arguments (other types automatically
return `false`)
/// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
}
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <=
PRE_SELECTIO_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true (sh
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
acking-you commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2041635786
##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
}
}
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
/// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation
based on the left-hand side (lhs) result.
///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <=
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
/// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
/// * `op` - The logical operator (`AND` or `OR`)
///
/// # Implementation Notes
/// 1. Only works with Boolean-typed arguments (other types automatically
return `false`)
/// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
}
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <=
PRE_SELECTIO_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true (sh
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
kosiew commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2041391176
##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
}
}
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
/// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation
based on the left-hand side (lhs) result.
///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <=
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
/// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
/// * `op` - The logical operator (`AND` or `OR`)
///
/// # Implementation Notes
/// 1. Only works with Boolean-typed arguments (other types automatically
return `false`)
/// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
}
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <=
PRE_SELECTIO_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true (short
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
kosiew commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2041391176
##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
}
}
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
/// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation
based on the left-hand side (lhs) result.
///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <=
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
/// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
/// * `op` - The logical operator (`AND` or `OR`)
///
/// # Implementation Notes
/// 1. Only works with Boolean-typed arguments (other types automatically
return `false`)
/// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
}
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+return ShortCircuitStrategy::ReturnLeft;
+}
+
+// If no false values, then all must be true
+if true_count == len {
+return ShortCircuitStrategy::ReturnRight;
+}
+
+// determine if we can pre-selection
+if true_count as f32 / len as f32 <=
PRE_SELECTIO_THRESHOLD {
+return ShortCircuitStrategy::PreSelection(bool_array);
+}
+} else {
+// For OR, prioritize checking for all-true (short
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
kosiew commented on code in PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#discussion_r2041328132
##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
}
}
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
Review Comment:
```suggestion
const PRE_SELECTION_THRESHOLD: f32 = 0.2;
```
##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -811,58 +822,164 @@ impl BinaryExpr {
}
}
+enum ShortCircuitStrategy<'a> {
+None,
+ReturnLeft,
+ReturnRight,
+PreSelection(&'a BooleanArray),
+}
+
+/// Based on the results calculated from the left side of the short-circuit
operation,
+/// if the proportion of `true` is less than 0.2 and the current operation is
an `and`,
+/// the `RecordBatch` will be filtered in advance.
+const PRE_SELECTIO_THRESHOLD: f32 = 0.2;
+
/// Checks if a logical operator (`AND`/`OR`) can short-circuit evaluation
based on the left-hand side (lhs) result.
///
-/// Short-circuiting occurs when evaluating the right-hand side (rhs) becomes
unnecessary:
-/// - For `AND`: if ALL values in `lhs` are `false`, the expression must be
`false` regardless of rhs.
-/// - For `OR`: if ALL values in `lhs` are `true`, the expression must be
`true` regardless of rhs.
-///
-/// Returns `true` if short-circuiting is possible, `false` otherwise.
-///
+/// Short-circuiting occurs under these circumstances:
+/// - For `AND`:
+///- if LHS is all false => short-circuit → return LHS
+///- if LHS is all true => short-circuit → return RHS
+///- if LHS is mixed and true_count/sum_count <=
[`PRE_SELECTIO_THRESHOLD`] -> pre-selection
+/// - For `OR`:
+///- if LHS is all true => short-circuit → return LHS
+///- if LHS is all false => short-circuit → return RHS
/// # Arguments
-/// * `arg` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
+/// * `lhs` - The left-hand side (lhs) columnar value (array or scalar)
/// * `op` - The logical operator (`AND` or `OR`)
///
/// # Implementation Notes
/// 1. Only works with Boolean-typed arguments (other types automatically
return `false`)
/// 2. Handles both scalar values and array values
-/// 3. For arrays, uses optimized `true_count()`/`false_count()` methods from
arrow-rs.
-///`bool_or`/`bool_and` maybe a better choice too,for detailed
discussion,see:[link](https://github.com/apache/datafusion/pull/15462#discussion_r2020558418)
-fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
-let data_type = arg.data_type();
-match (data_type, op) {
-(DataType::Boolean, Operator::And) => {
-match arg {
-ColumnarValue::Array(array) => {
-if let Ok(array) = as_boolean_array(&array) {
-return array.false_count() == array.len();
-}
+/// 3. For arrays, uses optimized bit counting techniques for boolean arrays
+fn check_short_circuit<'a>(
+lhs: &'a ColumnarValue,
+op: &Operator,
+) -> ShortCircuitStrategy<'a> {
+// Quick reject for non-logical operators,and quick judgment when op is and
+let is_and = match op {
+Operator::And => true,
+Operator::Or => false,
+_ => return ShortCircuitStrategy::None,
+};
+
+// Non-boolean types can't be short-circuited
+if lhs.data_type() != DataType::Boolean {
+return ShortCircuitStrategy::None;
+}
+
+match lhs {
+ColumnarValue::Array(array) => {
+// Fast path for arrays - try to downcast to boolean array
+if let Ok(bool_array) = as_boolean_array(array) {
+// Arrays with nulls can't be short-circuited
+if bool_array.null_count() > 0 {
+return ShortCircuitStrategy::None;
}
-ColumnarValue::Scalar(scalar) => {
-if let ScalarValue::Boolean(Some(value)) = scalar {
-return !value;
+
+let len = bool_array.len();
+if len == 0 {
+return ShortCircuitStrategy::None;
+}
+
+let true_count = bool_array.values().count_set_bits();
+if is_and {
+// For AND, prioritize checking for all-false (short
circuit case)
+// Uses optimized false_count() method provided by Arrow
+
+// Short circuit if all values are false
+if true_count == 0 {
+r
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
acking-you commented on PR #15694: URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2800087698 The relevant bug fixes have been completed, and corresponding performance tests have been conducted. The results show that pre-selection has achieved significant gains! @Dandandan @alamb Compare the current optimization with the main branch using `cargo bench --bench binary_op`. The results are as follows, where fluctuations within ±5% are considered as no changes. ## Performance Comparison of the AND Logic Group | Test Case | main | short-and-optimize | Ratio | Change | | - | - | -- | --- | -- | | all_false | 62.623 ns | 65.923 ns | 0.95x | no changes | | one_true_first| 448.69 µs | 195.60 µs | **2.29x** | ↑ ✅ | | one_true_last | 452.00 µs | 171.91 µs | **2.63x** | ↑ ✅ | | one_true_middle | 453.12 µs | 173.94 µs | **2.60x** | ↑ ✅ | | one_true_middle_left | 453.06 µs | 165.70 µs | **2.73x** | ↑ ✅ | | one_true_middle_right | 459.61 µs | 171.53 µs | **2.68x** | ↑ ✅ | | all_true_in_and | 450.03 µs | 445.76 µs | 1.01x | no changes | ## Performance Comparison of the OR Logic Group | Test Case | main | short-and-optimize | Ratio | Change | | -- | - | -- | - | -- | | all_true | 61.162 ns | 64.430 ns | 0.95x | no changes | | one_false_first| 448.51 µs | 439.92 µs | 1.02x | no changes | | one_false_last | 447.38 µs | 453.64 µs | 0.99x | no changes | | one_false_middle | 457.79 µs | 447.15 µs | 1.02x | no changes | | one_false_middle_left | 452.78 µs | 447.75 µs | 1.01x | no changes | | one_false_middle_right | 451.21 µs | 444.23 µs | 1.02x | no changes | | all_false_in_or| 449.90 µs | 442.36 µs | 1.02x | no changes | ## Possible next step(extend to nulls) ### Short-circuit optimization cannot be extended to nulls The current short-circuit optimization is only applicable to cases without null values. However, based on the calculation principles of "and" and "or", if the left-hand side (lhs) evaluates to null, then the final result can only be determined by continuing to calculate the right-hand side (rhs). Therefore, optimization for this scenario is not feasible. Below is an example of a calculation where lhs is null: ```sql ❯ select null and true; ++ | NULL AND Boolean(true) | ++ || ++ 1 row in set. Query took 0.000 seconds. ❯ select null and false; +-+ | NULL AND Boolean(false) | +-+ | false | +-+ 1 row in set. Query took 0.000 seconds. ❯ select null or false; ++ | NULL OR Boolean(false) | ++ || ++ 1 row in set. Query took 0.000 seconds. ❯ select null or true; +---+ | NULL OR Boolean(true) | +---+ | true | +---+ 1 row in set. Query took 0.000 seconds. ``` ### Pre-selection can be extended to include nulls As I explained earlier: https://github.com/apache/datafusion/pull/15694#issuecomment-2799010340, pre-selection can actually be extended to cover cases involving null values. However, one point needs to be confirmed: [filter_record_batch](https://docs.rs/arrow-select/54.2.1/src/arrow_select/filter.rs.html#202-205) will retain rows that are null. > 4. Combine the left-hand and right-hand boolean arrays to produce the correct boolean array (modify the positions in the left-hand array marked as true based on the values from the right-hand array). Afterward, we only need to modify the fourth step of the pre-selection process mentioned earlier to complete the extension that supports nulls. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
acking-you commented on PR #15694: URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2799010340 The error in `cargo test` is caused by an incorrect calculation of the pre-selection. The correct steps for calculating the pre-selection are as follows: 1. Compute the boolean array on the left-hand side. 2. Filter to obtain a new record batch based on the left-hand boolean array. 3. Compute the boolean array on the right-hand side using the new record batch (note that this boolean array will have incorrect positions since it corresponds to the new record batch). 4. Combine the left-hand and right-hand boolean arrays to produce the correct boolean array (modify the positions in the left-hand array marked as `true` based on the values from the right-hand array). To illustrate this, I’ve drawn a diagram (it’s quite rough since I used a mouse, so I hope you don’t mind 😂):  The current code only handles up to the second step, which is why the error occurs. If you use `evaluate_selection`, the issue persists because its internal call to `scatter` (which corresponds to completing the fourth step) fills in the missing parts of the right-hand boolean array with null values, resulting in an incorrect outcome. I’m currently working on implementing the fourth step to fix the issue, but it may take some time. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
