Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]

2025-04-12 Thread via GitHub


acking-you commented on issue #15631:
URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2798871646

   > > [@acking-you](https://github.com/acking-you) the code needs to be 
extended to support nulls (you can take a look at the true_count implementation 
in arrow-rs to do this efficiently).
   > 
   > I have an idea for an early exit losing performance, and I'm trying it 
out. If it works, I'll post the code
   
   @Dandandan You're absolutely right. At this point, we should no longer focus 
on optimizing the early return for the `true_count` calculation. Instead, we 
should consider extending support to new optimization scenarios, such as 
handling `nulls` and other cases. Here’s why:
   
   ## My Attempt
   I experimented with optimizing the `true_count` calculation using SIMD 
instructions, performing 256-bit comparisons and computations while 
incorporating early returns. This significantly reduced the overhead of 
checking boolean arrays in non-short-circuit scenarios (with a performance 
boost of up to 6x in extreme cases). However, in short-circuit scenarios like 
`all false` or `all true`, it still performed about 20% slower than the 
existing `false_count/true_count` implementation.
   
   When I applied this optimization to the execution of `BinaryExpr` and 
benchmarked the entire `BinaryExpr`, I observed a performance regression. The 
reasons are as follows:
   1. The computation of `true_count` for boolean arrays is only on the order 
of 60ns, and even with early exit optimizations, it remains at the 10ns level 
(in extreme cases).
   2. When `BinaryExpr` does not trigger short-circuiting, its execution time 
is on the order of 400us (depending on the complexity of the expression). 
Optimizing the `true_count` calculation for boolean arrays becomes negligible 
in this context.
   3. When short-circuiting occurs, the time complexity of `BinaryExpr` is on 
par with the computation of the boolean array. In this case, calling 
`true_count` remains the optimal solution.
   
   ## Other Optimization Scenarios
   - Extending short-circuit optimizations to handle `nulls`.
   - The early execution of selection that I previously mentioned: 
[https://github.com/apache/datafusion/issues/15631#issuecomment-2788923437](https://github.com/apache/datafusion/issues/15631#issuecomment-2788923437).
   
   I realized that my earlier conclusions about the benefits of early selection 
execution were incorrect. After testing with the latest code changes (on a 
machine with an AMD 9950x CPU), the results are as follows: all scenarios that 
benefited from this optimization saw a reduction in execution time from `450us` 
to `170us`. This is a significant improvement, and it did not slow down any 
other cases!
   
   ```sh
   short_circuit/and/one_true_first
   time:   [188.36 µs 188.96 µs 189.55 µs]
   change: [-58.870% -58.652% -58.420%] (p = 0.00 < 
0.05)
   Performance has improved.
   
   short_circuit/and/one_true_last
   time:   [170.10 µs 171.04 µs 171.97 µs]
   change: [-62.788% -62.520% -62.258%] (p = 0.00 < 
0.05)
   Performance has improved.
   
   short_circuit/and/one_true_middle
   time:   [165.38 µs 165.97 µs 166.57 µs]
   change: [-64.176% -63.968% -63.784%] (p = 0.00 < 
0.05)
   Performance has improved.
   
   short_circuit/and/one_true_middle_left
   time:   [165.02 µs 165.48 µs 165.96 µs]
   change: [-63.671% -63.458% -63.204%] (p = 0.00 < 
0.05)
   Performance has improved.
   
   short_circuit/and/one_true_middle_right
   time:   [169.09 µs 169.61 µs 170.13 µs]
   change: [-63.298% -63.089% -62.868%] (p = 0.00 < 
0.05)
   Performance has improved.
   
   short_circuit/and/one_true_middle_right #2
   time:   [166.43 µs 167.12 µs 167.85 µs]
   change: [-64.092% -63.916% -63.751%] (p = 0.00 < 
0.05)
   Performance has improved.
   ```
   
   I further optimized the work done by @kosiew. Relevant PR: 
https://github.com/apache/datafusion/pull/15694 (currently, there are still 
some issues with `cargo test`, and I'm working on fixing them).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]

2025-04-12 Thread via GitHub


acking-you commented on issue #15631:
URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2798742956

   > @acking-you the code needs to be extended to support nulls (you can take a 
look at the true_count implementation in arrow-rs to do this efficiently).
   
   I have an idea for an early exit losing performance, and I'm trying it out. 
If it works, I'll post the code


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]

2025-04-11 Thread via GitHub


kosiew commented on issue #15631:
URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2798494451

   hi @Dandandan 
   
   I am getting failed tests with 
   
   ```rust
   
   #[test]
   fn test_all_one() -> Result<()> {
   // Helper function to run tests and report failures
   let run_test = |array: &BooleanArray, name: &str, expected: bool| -> 
Result<()> {
   let result = all_one(array);
   if result != expected {
   println!(
   "Test case '{}' failed: expected {}, got {}",
   name, expected, result
   );
   println!("Array contents: {:?}", array);
   }
   assert_eq!(result, expected, "all_one failed for test case 
'{}'", name);
   Ok(())
   };
   
   // Basic cases - uniform arrays
   let all_one_array = BooleanArray::from(vec![true, true, true, true]);
   run_test(&all_one_array, "all true array", true)?;
   
   let all_zero_array = BooleanArray::from(vec![false, false, false, 
false]);
   run_test(&all_zero_array, "all false array", false)?;
   
   // Mixed values
   let mixed_array = BooleanArray::from(vec![true, true, false, true]);
   run_test(&mixed_array, "mixed array with one false", false)?;
   
   // Edge cases
   let empty_array = BooleanArray::from(vec![] as Vec);
   run_test(&empty_array, "empty array", true)?;
   
   let single_true = BooleanArray::from(vec![true]);
   run_test(&single_true, "single true", true)?;
   
   let single_false = BooleanArray::from(vec![false]);
   run_test(&single_false, "single false", false)?;
   
   // Arrays with nulls
   let array_with_nulls: BooleanArray =
   vec![Some(true), None, Some(true)].into_iter().collect();
   run_test(&array_with_nulls, "nulls with true", true)?;
   
   let nulls_with_false: BooleanArray =
   vec![Some(true), Some(false), None].into_iter().collect();
   run_test(&nulls_with_false, "nulls with false", false)?;
   
   // Large arrays
   let large_all_one = BooleanArray::from(vec![true; 128]);
   run_test(&large_all_one, "large all true (128)", true)?;
   
   // Test with a single false at different positions
   let mut values = vec![true; 128];
   values[63] = false; // Last bit in first chunk
   let large_with_boundary_false = BooleanArray::from(values);
   run_test(
   &large_with_boundary_false,
   "large with false at bit 63",
   false,
   )?;
   
   let mut values = vec![true; 128];
   values[64] = false; // First bit in second chunk
   let large_with_second_chunk_false = BooleanArray::from(values);
   run_test(
   &large_with_second_chunk_false,
   "large with false at bit 64",
   false,
   )?;
   
   // Specific sizes that might trigger edge cases
   let exact_chunk_size = BooleanArray::from(vec![true; 64]);
   run_test(&exact_chunk_size, "exact chunk size (64)", true)?;
   
   let just_under_chunk = BooleanArray::from(vec![true; 63]);
   run_test(&just_under_chunk, "just under chunk size (63)", true)?;
   
   let just_over_chunk = BooleanArray::from(vec![true; 65]);
   run_test(&just_over_chunk, "just over chunk size (65)", true)?;
   
   Ok(())
   }
   
   #[test]
   fn test_all_zero() -> Result<()> {
   // Helper function to run tests and report failures
   let run_test = |array: &BooleanArray, name: &str, expected: bool| -> 
Result<()> {
   let result = all_zero(array);
   if result != expected {
   println!(
   "Test case '{}' failed: expected {}, got {}",
   name, expected, result
   );
   println!("Array contents: {:?}", array);
   }
   assert_eq!(result, expected, "all_zero failed for test case 
'{}'", name);
   Ok(())
   };
   
   // Basic cases - uniform arrays
   let all_zero_array = BooleanArray::from(vec![false, false, false, 
false]);
   run_test(&all_zero_array, "all false array", true)?;
   
   let all_one_array = BooleanArray::from(vec![true, true, true, true]);
   run_test(&all_one_array, "all true array", false)?;
   
   // Mixed values
   let mixed_array = BooleanArray::from(vec![false, false, true, 
false]);
   run_test(&mixed_array, "mixed array with one true", false)?;
   
   // Edge cases
   let empty_array = BooleanArray::from(vec![] as Vec);
   run_test(&empty_array, "empty array", true)?;
   
   let sin

Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]

2025-04-11 Thread via GitHub


Dandandan commented on issue #15631:
URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2796844672

   Btw as a simple concept, I tested this yesterday to reduce execution time of 
short circuiting all false / all true cases by -25% compared to `true_count` / 
`false_count`:
   
   ```rust
   fn all_zero(array: &BooleanArray) -> bool {
   // TODO: nulls
   // match array.nulls() {
   // Some(nulls) => {
   // }
   // None => {}
   // }
   // platform specific
   let bit_chunks = array.values().bit_chunks();
   let mut all_zero = true;
   for i in bit_chunks {
   all_zero &= i == 0;
   }
   
   all_zero
   }
   
   fn all_one(array: &BooleanArray) -> bool {
   // platform specific
   let bit_chunks = array.values().bit_chunks();
   let mut all_one = true;
   for i in bit_chunks {
   all_one &= i == 0x;
   }
   
   all_one
   }
   
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]

2025-04-10 Thread via GitHub


alamb commented on issue #15631:
URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2792363126

   `ShortCircuitStrategy` is a pretty neat idea
   
In my opinion, as long as the code is easy to understand, makes realistic 
benchmarks faster, and doesn't regress existing performance it is good thing to 
try
   
   My biggest concern with this type of optimization is that it will cause some 
queries to go faster and some to go slower -- this tradeoff is not great in my 
mind as then we have to judge if the tradeoff is worth it. Given how many 
people use DataFusion I think it likely that not all users have the same opinion


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]

2025-04-10 Thread via GitHub


Dandandan commented on issue #15631:
URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2792442938

   > I don't know if you think it's a good idea? @alamb @Dandandan
   
   I think it is a pretty good idea given that evaluation is so important. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]

2025-04-09 Thread via GitHub


acking-you commented on issue #15631:
URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2788923437

   I have an idea that might improve the effectiveness of short-circuit 
optimization, and it seems necessary to use `false_count` for evaluation 
counting.
   
   The current issue with DataFusion's execution of `BinaryExpr`:
   After computing the `left` side, the result is not immediately used to 
filter the batch to reduce the input size of the `right` batch.
   Example:
   ```
   Current:
   batch:[1,2,3,4] -> execute left -> bool array: [true,false,true,false]
   batch:[1,2,3,4] -> execute right -> bool array: [true,true,false,false]
   
   Might be better:
   batch:[1,2,3,4] -> execute left -> bool array: [true,false,true,false] -> 
batch:[1,3]
   batch:[1,3] -> execute right -> bool array: [true,false] -> batch:[1]
   ```
   I tried implementing this process using 
[evaluate_selection](https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.evaluate_selection),
 but the performance regressed in many cases because its internal 
implementation requires copying to create a new `RecordBatch`.
   
   However, perhaps we could heuristically decide whether to pre-filter the 
`RecordBatch` based on `false_count`, for example, when `false_count / 
array_len > 0.8`.
   
   By the way, I recently looked into ClickHouse's execution logic for 
`BinaryOp`. It immediately uses the result of each expression to filter and 
then proceeds to execute the next expression. Similarly, it involves copying, 
but it accelerates the checking and copying process using SIMD instructions. I 
also noticed that arrow-rs, which DataFusion uses, has a very efficient 
approach for this process: 
[IterationStrategy](https://github.com/apache/arrow-rs/blob/9322547590ab32efeff8c0486e4a3a2cb5887a26/arrow-select/src/filter.rs#L296-L318).
   
   I don't know if you think it's a good idea? @alamb @Dandandan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]

2025-04-08 Thread via GitHub


Dandandan commented on issue #15631:
URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2786711757

   Would be good to compare it with a boolean version of this as well, like 
this, to see if it vectorizes better:
   ```
   pub fn all_zero(&self) -> bool {
   // platform specific
   const LANES: usize = 8;
   let mut chunks = self.chunks.chunks_exact(LANES);
   let mut all_false = true;
   chunks.borrow_mut().for_each(|chunk| {
   let chunk: [u64; LANES] = chunk.try_into().unwrap();
   for i in 0..LANES {
   all_false &= chunk[i] == 0;
   }
   });
   
   let remainder = chunks.remainder();
   for chunk in remainder {
   all_false &= *chunk == 0;
   }
   
   all_false
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]

2025-04-08 Thread via GitHub


alamb commented on issue #15631:
URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2786424000

   > sum += chunk[i].count_ones() as usize;
   
   Maybe simply manually unrolling the loop to check 1024 bits at a time would 
let llvm make the best code
   Something like checking windows of 8 `u64s` at a time against 0x0 -- and 
exiting early if there is a non zero
   
   🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]

2025-04-08 Thread via GitHub


Dandandan commented on issue #15631:
URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2786341929

   Interesting!
   
   I think we probably can take some inspiration from arrow-rs aggregate code, 
e.g. doing something like (?):
   
   ```rust
   /// Counts the number of ones
   pub fn count_ones(&self) -> usize {
   // platform specific
   const LANES: usize = 8;
   let mut chunks = self.chunks.chunks_exact(LANES);
   let mut sum: usize = 0;
   chunks.borrow_mut().for_each(|chunk| {
   let chunk: [u64; LANES] = chunk.try_into().unwrap();
   for i in 0..LANES {
   sum += chunk[i].count_ones() as usize;
   }
   });
   
   let remainder = chunks.remainder();
   for chunk in remainder {
   sum += chunk.count_ones() as usize;
   }
   
   sum
   }
   ```
   
   (Didn't test this to be faster).
   We could also define a faster version for this use case that doesn't count 
but returns the same as max / min boolean.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]

2025-04-08 Thread via GitHub


acking-you commented on issue #15631:
URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2786166445

   This might require manual SIMD for optimization, but that would increase the 
porting difficulty([As duckdb 
says](https://duckdb.org/faq.html#does-duckdb-use-simd)). However, perhaps an 
alternative approach could be tried to make it easier for the compiler to 
optimize. If feasible, it also seems capable of improving the performance of 
related calls in the arrow-rs library.
   
   ## Some exploration
   In ClickHouse's filter implementation, there is a classic manual SIMD 
implementation approach: 
[code](https://github.com/ClickHouse/ClickHouse/blob/master/src/Columns/ColumnsCommon.cpp#L237-L275)
   
   The function involves loading multiple boolean values at once using SIMD 
instructions to increase the loop step.  
   The best-case scenarios are:  
   - The filter does not match, skipping to the next iteration.  
   - The filter fully matches, copying multiple rows at once.  
   
   For other cases, the performance degrades to a handling method similar to 
when SIMD is not used (the additional overhead being the preparation of SIMD 
variables).  
   
   If this approach is applied to check whether a bit is 1 or 0, it should 
incur almost no overhead (only requiring a comparison with `0` or ``).  
   
   At the same time, could DataFusion's filter process also be optimized using 
this method?  
   
   Alternatively, could we find another form of vectorization that does not 
involve manual unrolling?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]