Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]
acking-you commented on issue #15631: URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2798871646 > > [@acking-you](https://github.com/acking-you) the code needs to be extended to support nulls (you can take a look at the true_count implementation in arrow-rs to do this efficiently). > > I have an idea for an early exit losing performance, and I'm trying it out. If it works, I'll post the code @Dandandan You're absolutely right. At this point, we should no longer focus on optimizing the early return for the `true_count` calculation. Instead, we should consider extending support to new optimization scenarios, such as handling `nulls` and other cases. Here’s why: ## My Attempt I experimented with optimizing the `true_count` calculation using SIMD instructions, performing 256-bit comparisons and computations while incorporating early returns. This significantly reduced the overhead of checking boolean arrays in non-short-circuit scenarios (with a performance boost of up to 6x in extreme cases). However, in short-circuit scenarios like `all false` or `all true`, it still performed about 20% slower than the existing `false_count/true_count` implementation. When I applied this optimization to the execution of `BinaryExpr` and benchmarked the entire `BinaryExpr`, I observed a performance regression. The reasons are as follows: 1. The computation of `true_count` for boolean arrays is only on the order of 60ns, and even with early exit optimizations, it remains at the 10ns level (in extreme cases). 2. When `BinaryExpr` does not trigger short-circuiting, its execution time is on the order of 400us (depending on the complexity of the expression). Optimizing the `true_count` calculation for boolean arrays becomes negligible in this context. 3. When short-circuiting occurs, the time complexity of `BinaryExpr` is on par with the computation of the boolean array. In this case, calling `true_count` remains the optimal solution. ## Other Optimization Scenarios - Extending short-circuit optimizations to handle `nulls`. - The early execution of selection that I previously mentioned: [https://github.com/apache/datafusion/issues/15631#issuecomment-2788923437](https://github.com/apache/datafusion/issues/15631#issuecomment-2788923437). I realized that my earlier conclusions about the benefits of early selection execution were incorrect. After testing with the latest code changes (on a machine with an AMD 9950x CPU), the results are as follows: all scenarios that benefited from this optimization saw a reduction in execution time from `450us` to `170us`. This is a significant improvement, and it did not slow down any other cases! ```sh short_circuit/and/one_true_first time: [188.36 µs 188.96 µs 189.55 µs] change: [-58.870% -58.652% -58.420%] (p = 0.00 < 0.05) Performance has improved. short_circuit/and/one_true_last time: [170.10 µs 171.04 µs 171.97 µs] change: [-62.788% -62.520% -62.258%] (p = 0.00 < 0.05) Performance has improved. short_circuit/and/one_true_middle time: [165.38 µs 165.97 µs 166.57 µs] change: [-64.176% -63.968% -63.784%] (p = 0.00 < 0.05) Performance has improved. short_circuit/and/one_true_middle_left time: [165.02 µs 165.48 µs 165.96 µs] change: [-63.671% -63.458% -63.204%] (p = 0.00 < 0.05) Performance has improved. short_circuit/and/one_true_middle_right time: [169.09 µs 169.61 µs 170.13 µs] change: [-63.298% -63.089% -62.868%] (p = 0.00 < 0.05) Performance has improved. short_circuit/and/one_true_middle_right #2 time: [166.43 µs 167.12 µs 167.85 µs] change: [-64.092% -63.916% -63.751%] (p = 0.00 < 0.05) Performance has improved. ``` I further optimized the work done by @kosiew. Relevant PR: https://github.com/apache/datafusion/pull/15694 (currently, there are still some issues with `cargo test`, and I'm working on fixing them). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]
acking-you commented on issue #15631: URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2798742956 > @acking-you the code needs to be extended to support nulls (you can take a look at the true_count implementation in arrow-rs to do this efficiently). I have an idea for an early exit losing performance, and I'm trying it out. If it works, I'll post the code -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]
kosiew commented on issue #15631:
URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2798494451
hi @Dandandan
I am getting failed tests with
```rust
#[test]
fn test_all_one() -> Result<()> {
// Helper function to run tests and report failures
let run_test = |array: &BooleanArray, name: &str, expected: bool| ->
Result<()> {
let result = all_one(array);
if result != expected {
println!(
"Test case '{}' failed: expected {}, got {}",
name, expected, result
);
println!("Array contents: {:?}", array);
}
assert_eq!(result, expected, "all_one failed for test case
'{}'", name);
Ok(())
};
// Basic cases - uniform arrays
let all_one_array = BooleanArray::from(vec![true, true, true, true]);
run_test(&all_one_array, "all true array", true)?;
let all_zero_array = BooleanArray::from(vec![false, false, false,
false]);
run_test(&all_zero_array, "all false array", false)?;
// Mixed values
let mixed_array = BooleanArray::from(vec![true, true, false, true]);
run_test(&mixed_array, "mixed array with one false", false)?;
// Edge cases
let empty_array = BooleanArray::from(vec![] as Vec);
run_test(&empty_array, "empty array", true)?;
let single_true = BooleanArray::from(vec![true]);
run_test(&single_true, "single true", true)?;
let single_false = BooleanArray::from(vec![false]);
run_test(&single_false, "single false", false)?;
// Arrays with nulls
let array_with_nulls: BooleanArray =
vec![Some(true), None, Some(true)].into_iter().collect();
run_test(&array_with_nulls, "nulls with true", true)?;
let nulls_with_false: BooleanArray =
vec![Some(true), Some(false), None].into_iter().collect();
run_test(&nulls_with_false, "nulls with false", false)?;
// Large arrays
let large_all_one = BooleanArray::from(vec![true; 128]);
run_test(&large_all_one, "large all true (128)", true)?;
// Test with a single false at different positions
let mut values = vec![true; 128];
values[63] = false; // Last bit in first chunk
let large_with_boundary_false = BooleanArray::from(values);
run_test(
&large_with_boundary_false,
"large with false at bit 63",
false,
)?;
let mut values = vec![true; 128];
values[64] = false; // First bit in second chunk
let large_with_second_chunk_false = BooleanArray::from(values);
run_test(
&large_with_second_chunk_false,
"large with false at bit 64",
false,
)?;
// Specific sizes that might trigger edge cases
let exact_chunk_size = BooleanArray::from(vec![true; 64]);
run_test(&exact_chunk_size, "exact chunk size (64)", true)?;
let just_under_chunk = BooleanArray::from(vec![true; 63]);
run_test(&just_under_chunk, "just under chunk size (63)", true)?;
let just_over_chunk = BooleanArray::from(vec![true; 65]);
run_test(&just_over_chunk, "just over chunk size (65)", true)?;
Ok(())
}
#[test]
fn test_all_zero() -> Result<()> {
// Helper function to run tests and report failures
let run_test = |array: &BooleanArray, name: &str, expected: bool| ->
Result<()> {
let result = all_zero(array);
if result != expected {
println!(
"Test case '{}' failed: expected {}, got {}",
name, expected, result
);
println!("Array contents: {:?}", array);
}
assert_eq!(result, expected, "all_zero failed for test case
'{}'", name);
Ok(())
};
// Basic cases - uniform arrays
let all_zero_array = BooleanArray::from(vec![false, false, false,
false]);
run_test(&all_zero_array, "all false array", true)?;
let all_one_array = BooleanArray::from(vec![true, true, true, true]);
run_test(&all_one_array, "all true array", false)?;
// Mixed values
let mixed_array = BooleanArray::from(vec![false, false, true,
false]);
run_test(&mixed_array, "mixed array with one true", false)?;
// Edge cases
let empty_array = BooleanArray::from(vec![] as Vec);
run_test(&empty_array, "empty array", true)?;
let sin
Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]
Dandandan commented on issue #15631:
URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2796844672
Btw as a simple concept, I tested this yesterday to reduce execution time of
short circuiting all false / all true cases by -25% compared to `true_count` /
`false_count`:
```rust
fn all_zero(array: &BooleanArray) -> bool {
// TODO: nulls
// match array.nulls() {
// Some(nulls) => {
// }
// None => {}
// }
// platform specific
let bit_chunks = array.values().bit_chunks();
let mut all_zero = true;
for i in bit_chunks {
all_zero &= i == 0;
}
all_zero
}
fn all_one(array: &BooleanArray) -> bool {
// platform specific
let bit_chunks = array.values().bit_chunks();
let mut all_one = true;
for i in bit_chunks {
all_one &= i == 0x;
}
all_one
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]
alamb commented on issue #15631: URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2792363126 `ShortCircuitStrategy` is a pretty neat idea In my opinion, as long as the code is easy to understand, makes realistic benchmarks faster, and doesn't regress existing performance it is good thing to try My biggest concern with this type of optimization is that it will cause some queries to go faster and some to go slower -- this tradeoff is not great in my mind as then we have to judge if the tradeoff is worth it. Given how many people use DataFusion I think it likely that not all users have the same opinion -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]
Dandandan commented on issue #15631: URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2792442938 > I don't know if you think it's a good idea? @alamb @Dandandan I think it is a pretty good idea given that evaluation is so important. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]
acking-you commented on issue #15631: URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2788923437 I have an idea that might improve the effectiveness of short-circuit optimization, and it seems necessary to use `false_count` for evaluation counting. The current issue with DataFusion's execution of `BinaryExpr`: After computing the `left` side, the result is not immediately used to filter the batch to reduce the input size of the `right` batch. Example: ``` Current: batch:[1,2,3,4] -> execute left -> bool array: [true,false,true,false] batch:[1,2,3,4] -> execute right -> bool array: [true,true,false,false] Might be better: batch:[1,2,3,4] -> execute left -> bool array: [true,false,true,false] -> batch:[1,3] batch:[1,3] -> execute right -> bool array: [true,false] -> batch:[1] ``` I tried implementing this process using [evaluate_selection](https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.evaluate_selection), but the performance regressed in many cases because its internal implementation requires copying to create a new `RecordBatch`. However, perhaps we could heuristically decide whether to pre-filter the `RecordBatch` based on `false_count`, for example, when `false_count / array_len > 0.8`. By the way, I recently looked into ClickHouse's execution logic for `BinaryOp`. It immediately uses the result of each expression to filter and then proceeds to execute the next expression. Similarly, it involves copying, but it accelerates the checking and copying process using SIMD instructions. I also noticed that arrow-rs, which DataFusion uses, has a very efficient approach for this process: [IterationStrategy](https://github.com/apache/arrow-rs/blob/9322547590ab32efeff8c0486e4a3a2cb5887a26/arrow-select/src/filter.rs#L296-L318). I don't know if you think it's a good idea? @alamb @Dandandan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]
Dandandan commented on issue #15631:
URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2786711757
Would be good to compare it with a boolean version of this as well, like
this, to see if it vectorizes better:
```
pub fn all_zero(&self) -> bool {
// platform specific
const LANES: usize = 8;
let mut chunks = self.chunks.chunks_exact(LANES);
let mut all_false = true;
chunks.borrow_mut().for_each(|chunk| {
let chunk: [u64; LANES] = chunk.try_into().unwrap();
for i in 0..LANES {
all_false &= chunk[i] == 0;
}
});
let remainder = chunks.remainder();
for chunk in remainder {
all_false &= *chunk == 0;
}
all_false
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]
alamb commented on issue #15631: URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2786424000 > sum += chunk[i].count_ones() as usize; Maybe simply manually unrolling the loop to check 1024 bits at a time would let llvm make the best code Something like checking windows of 8 `u64s` at a time against 0x0 -- and exiting early if there is a non zero 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]
Dandandan commented on issue #15631:
URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2786341929
Interesting!
I think we probably can take some inspiration from arrow-rs aggregate code,
e.g. doing something like (?):
```rust
/// Counts the number of ones
pub fn count_ones(&self) -> usize {
// platform specific
const LANES: usize = 8;
let mut chunks = self.chunks.chunks_exact(LANES);
let mut sum: usize = 0;
chunks.borrow_mut().for_each(|chunk| {
let chunk: [u64; LANES] = chunk.try_into().unwrap();
for i in 0..LANES {
sum += chunk[i].count_ones() as usize;
}
});
let remainder = chunks.remainder();
for chunk in remainder {
sum += chunk.count_ones() as usize;
}
sum
}
```
(Didn't test this to be faster).
We could also define a faster version for this use case that doesn't count
but returns the same as max / min boolean.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]
acking-you commented on issue #15631: URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2786166445 This might require manual SIMD for optimization, but that would increase the porting difficulty([As duckdb says](https://duckdb.org/faq.html#does-duckdb-use-simd)). However, perhaps an alternative approach could be tried to make it easier for the compiler to optimize. If feasible, it also seems capable of improving the performance of related calls in the arrow-rs library. ## Some exploration In ClickHouse's filter implementation, there is a classic manual SIMD implementation approach: [code](https://github.com/ClickHouse/ClickHouse/blob/master/src/Columns/ColumnsCommon.cpp#L237-L275) The function involves loading multiple boolean values at once using SIMD instructions to increase the loop step. The best-case scenarios are: - The filter does not match, skipping to the next iteration. - The filter fully matches, copying multiple rows at once. For other cases, the performance degrades to a handling method similar to when SIMD is not used (the additional overhead being the preparation of SIMD variables). If this approach is applied to check whether a bit is 1 or 0, it should incur almost no overhead (only requiring a comparison with `0` or ``). At the same time, could DataFusion's filter process also be optimized using this method? Alternatively, could we find another form of vectorization that does not involve manual unrolling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
