Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-10 Thread via GitHub


acking-you commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2031035548


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -805,6 +811,47 @@ impl BinaryExpr {
 }
 }
 
+/// Check if it meets the short-circuit condition
+/// 1. For the `AND` operator, if the `lhs` result all are `false`
+/// 2. For the `OR` operator, if the `lhs` result all are `true`
+/// 3. Otherwise, it does not meet the short-circuit condition
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {

Review Comment:
   > I might be missing some obvious points, but
   > 
   > 1. applying the same check for rhs is redundant? (does it approximately 
requires the same computation if we continue the execution as is)
   > 2. why aren't we also covering the cases when **lhs is all true and op is 
AND** / **lhs is all false and op is OR**
   
   I've optimized the comments, so it may look a bit clearer.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-10 Thread via GitHub


acking-you commented on PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2783526836

   > The only thing I think is needed for this PR is to either
   > 
   > 1. remove the bechmark
   > 2. update the benchmark ti actually test `BinaryExpr` evaluation time
   
   done


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-08 Thread via GitHub


berkaysynnada commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2034007815


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -805,6 +811,47 @@ impl BinaryExpr {
 }
 }
 
+/// Check if it meets the short-circuit condition
+/// 1. For the `AND` operator, if the `lhs` result all are `false`
+/// 2. For the `OR` operator, if the `lhs` result all are `true`
+/// 3. Otherwise, it does not meet the short-circuit condition
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {

Review Comment:
   > > Doesn't this brings a clear gain by short-circuiting?
   > 
   > I understand these now. Sorry, I didn't notice it just now and thought it 
was only a positional difference.
   > 
   > This is a great idea. I think we can open a new issue and add this 
optimization.
   
   what about these?
   
   > [xxx] AND [false,false,false...false] => return rhs, which is 
[false,false,false...false]
   > [xxx] OR [true, true, true... true] => return rhs, which is [true, true, 
true...true]



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-08 Thread via GitHub


berkaysynnada commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2034007019


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -805,6 +811,47 @@ impl BinaryExpr {
 }
 }
 
+/// Check if it meets the short-circuit condition
+/// 1. For the `AND` operator, if the `lhs` result all are `false`
+/// 2. For the `OR` operator, if the `lhs` result all are `true`
+/// 3. Otherwise, it does not meet the short-circuit condition
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {

Review Comment:
   > I don't think you can rewrite expressions like this without changing the 
short-circuit semantics (if lHS is false then don't run the RHS in `a AND b`)
   
   We don't need a rewrite. evaluate() needs to return a ColumnarValue, so we 
can just return the short-circuited versions in my examples



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-08 Thread via GitHub


alamb commented on PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2787055441

   the extended tests appear to be failing after this PR:
   - https://github.com/apache/datafusion/issues/15641


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-08 Thread via GitHub


alamb commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2033602903


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -805,6 +811,47 @@ impl BinaryExpr {
 }
 }
 
+/// Check if it meets the short-circuit condition
+/// 1. For the `AND` operator, if the `lhs` result all are `false`
+/// 2. For the `OR` operator, if the `lhs` result all are `true`
+/// 3. Otherwise, it does not meet the short-circuit condition
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {

Review Comment:
   I don't think you can rewrite expressions like this without changing the 
short-circuit semantics (if lHS is false then don't run the RHS in `a AND b`) 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-08 Thread via GitHub


acking-you commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2033205911


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -805,6 +811,47 @@ impl BinaryExpr {
 }
 }
 
+/// Check if it meets the short-circuit condition
+/// 1. For the `AND` operator, if the `lhs` result all are `false`
+/// 2. For the `OR` operator, if the `lhs` result all are `true`
+/// 3. Otherwise, it does not meet the short-circuit condition
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {

Review Comment:
   > Doesn't this brings a clear gain by short-circuiting?
   
   I understand these now. Sorry, I didn't notice it just now and thought it 
was only a positional difference.
   
   This is a great idea. I think we can open a new issue and add this 
optimization.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-08 Thread via GitHub


berkaysynnada commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2033185234


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -805,6 +811,47 @@ impl BinaryExpr {
 }
 }
 
+/// Check if it meets the short-circuit condition
+/// 1. For the `AND` operator, if the `lhs` result all are `false`
+/// 2. For the `OR` operator, if the `lhs` result all are `true`
+/// 3. Otherwise, it does not meet the short-circuit condition
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {

Review Comment:
   > but wouldn't it be better to do it this way: the optimizer rewrites 
simpler expressions to the left as much as possible? 
   Is there a planned work to implement this? If it is so, trying first lhs 
makes sense. But even if that's the case, checking the rhs values as well, 
before doing the all binary computation, is a preferable option? 
   
   Let's imagine this case also (I adapt my second example such that homogenous 
side is lhs) 
   ```
   [true, true, true... true] AND [xxx] => return rhs, which is [xxx]
   [false, false, false... false] OR [xxx] => return rhs, which is [xxx]
   ```
   Doesn't this brings a clear gain by short-circuiting?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-08 Thread via GitHub


acking-you commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2033160515


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -805,6 +811,47 @@ impl BinaryExpr {
 }
 }
 
+/// Check if it meets the short-circuit condition
+/// 1. For the `AND` operator, if the `lhs` result all are `false`
+/// 2. For the `OR` operator, if the `lhs` result all are `true`
+/// 3. Otherwise, it does not meet the short-circuit condition
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {

Review Comment:
   Okay, I understand now. However, the current execution order of BinaryExpr 
is fixed to evaluate the left side first and then the right side. The situation 
you mentioned I had also considered before, but wouldn't it be better to do it 
this way: the optimizer rewrites simpler expressions to the left as much as 
possible?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-08 Thread via GitHub


berkaysynnada commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2033101460


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -805,6 +811,47 @@ impl BinaryExpr {
 }
 }
 
+/// Check if it meets the short-circuit condition
+/// 1. For the `AND` operator, if the `lhs` result all are `false`
+/// 2. For the `OR` operator, if the `lhs` result all are `true`
+/// 3. Otherwise, it does not meet the short-circuit condition
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {

Review Comment:
   I think I couldn't be clear in my last comment. 
   
   This PR short-circuits this case:
   ```
   [false,false,false...false] AND [xxx] => return lhs, which is 
[false,false,false...false]
   [true, true, true... true] OR [xxx] => return lhs, which is [true, true, 
true...true]
   ```
   I have 2 further ideas to discuss:
   1) 
   ```
   [xxx] AND [false,false,false...false] => return rhs, which is 
[false,false,false...false]
   [xxx] OR [true, true, true... true] => return rhs, which is [true, true, 
true...true]
   ```
   Why don't we check this case as well?
   
   2) 
   ```
   [xxx] AND [true, true, true... true] => return lhs, which is [xxx]
   [xxx] OR [false, false, false... false] => return lhs, which is [xxx]
   ```
   isn't this case optimizable as well? Are we handling those cases in another 
place?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-08 Thread via GitHub


alamb commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2032964400


##
datafusion/physical-expr/benches/binary_op.rs:
##
@@ -0,0 +1,373 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::{
+array::BooleanArray,
+compute::{bool_and, bool_or},
+datatypes::{DataType, Field, Schema},
+};
+use arrow::{array::StringArray, record_batch::RecordBatch};
+use criterion::{black_box, criterion_group, criterion_main, Criterion};
+use datafusion_expr::{and, binary_expr, col, lit, or, Operator};
+use datafusion_physical_expr::{
+expressions::{BinaryExpr, Column},
+planner::logical2physical,
+PhysicalExpr,
+};
+use std::sync::{Arc, LazyLock};
+
+/// Generates BooleanArrays with different true/false distributions for 
benchmarking.
+///
+/// Returns a vector of tuples containing scenario name and corresponding 
BooleanArray.
+///
+/// # Arguments
+/// - `TEST_ALL_FALSE` - Used to generate what kind of test data
+/// - `len` - Length of the BooleanArray to generate
+fn generate_boolean_cases(
+len: usize,
+) -> Vec<(String, BooleanArray)> {
+let mut cases = Vec::with_capacity(6);
+
+// Scenario 1: All elements false or all elements true
+if TEST_ALL_FALSE {
+let all_false = BooleanArray::from(vec![false; len]);
+cases.push(("all_false".to_string(), all_false));
+} else {
+let all_true = BooleanArray::from(vec![true; len]);
+cases.push(("all_true".to_string(), all_true));
+}
+
+// Scenario 2: Single true at first position or single false at first 
position
+if TEST_ALL_FALSE {
+let mut first_true = vec![false; len];
+first_true[0] = true;
+cases.push(("one_true_first".to_string(), 
BooleanArray::from(first_true)));
+} else {
+let mut first_false = vec![true; len];
+first_false[0] = false;
+cases.push((
+"one_false_first".to_string(),
+BooleanArray::from(first_false),
+));
+}
+
+// Scenario 3: Single true at last position or single false at last 
position
+if TEST_ALL_FALSE {
+let mut last_true = vec![false; len];
+last_true[len - 1] = true;
+cases.push(("one_true_last".to_string(), 
BooleanArray::from(last_true)));
+} else {
+let mut last_false = vec![true; len];
+last_false[len - 1] = false;
+cases.push(("one_false_last".to_string(), 
BooleanArray::from(last_false)));
+}
+
+// Scenario 4: Single true at exact middle or single false at exact middle
+let mid = len / 2;
+if TEST_ALL_FALSE {
+let mut mid_true = vec![false; len];
+mid_true[mid] = true;
+cases.push(("one_true_middle".to_string(), 
BooleanArray::from(mid_true)));
+} else {
+let mut mid_false = vec![true; len];
+mid_false[mid] = false;
+cases.push((
+"one_false_middle".to_string(),
+BooleanArray::from(mid_false),
+));
+}
+
+// Scenario 5: Single true at 25% position or single false at 25% position
+let mid_left = len / 4;
+if TEST_ALL_FALSE {
+let mut mid_left_true = vec![false; len];
+mid_left_true[mid_left] = true;
+cases.push((
+"one_true_middle_left".to_string(),
+BooleanArray::from(mid_left_true),
+));
+} else {
+let mut mid_left_false = vec![true; len];
+mid_left_false[mid_left] = false;
+cases.push((
+"one_false_middle_left".to_string(),
+BooleanArray::from(mid_left_false),
+));
+}
+
+// Scenario 6: Single true at 75% position or single false at 75% position
+let mid_right = (3 * len) / 4;
+if TEST_ALL_FALSE {
+let mut mid_right_true = vec![false; len];
+mid_right_true[mid_right] = true;
+cases.push((
+"one_true_middle_right".to_string(),
+BooleanArray::from(mid_right_true),
+));
+} else {
+let mut mid_right_false = vec![true; len];
+mid_right_false[mid_right] = false;
+cases.push((
+"one_false_middle_right".to_string(),
+BooleanArray::from(mid_r

Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-08 Thread via GitHub


alamb commented on PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2786029908

   Filed https://github.com/apache/datafusion/issues/15631


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-08 Thread via GitHub


alamb merged PR #15462:
URL: https://github.com/apache/datafusion/pull/15462


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-08 Thread via GitHub


alamb commented on PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2786018330

   Thanks again for the wonderful work @acking-you  --
   I ran the benchmarks locally and indeed they show a non trivial amount of 
time being spent in count_ones. 
   
   ![Screenshot 2025-04-08 at 6 40 33 
AM](https://github.com/user-attachments/assets/13d85f00-0fad-49bc-9f95-ce136830a212)
   
   I also confirmed that without the short circuit optimization the benchmarks 
get much much slower
   
   ```
   Gnuplot not found, using plotters backend
   bench_all_false_and time:   [503.28 µs 505.29 µs 507.36 µs]
   change: [+1711392% +1718962% +1727169%] (p = 0.00 < 
0.05)
   Performance has regressed.
   Found 7 outliers among 100 measurements (7.00%)
 5 (5.00%) high mild
 2 (2.00%) high severe
   
   bench_all_true_or   time:   [504.57 µs 506.27 µs 508.10 µs]
   change: [+1704732% +1713204% +1721665%] (p = 0.00 < 
0.05)
   Performance has regressed.
   Found 7 outliers among 100 measurements (7.00%)
 6 (6.00%) high mild
   ```
   
   Let's merge this PR in and I will file a follow on ticket to track 
potentially improving performance even more
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-07 Thread via GitHub


acking-you commented on PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2784190060

   > I think the likely part of it getting slower is short-circuiting whenever 
a `true` is observed (having a branch for each item to check).
   > 
   > For this case it might be interesting to compare it with 
`array.values().bit_chunks().iter_padded().fold(true, |acc, i: u64| i != 0 && 
acc)` (e.g. fully evaluate the entire array) and see if it is any faster than 
the `true_count` implementation.
   > 
   > For `max_boolean` it might be interesting to see if we can make it faster 
by doing both, early exiting and generating more efficient code. Maybe it could 
evaluate it per `n` values instead of per-value to benefit from faster code / 
branch prediction while still exiting early enough.
   
   I’m certain that the semantics of the `if` statement, which preemptively 
exits based on a condition, caused the compiler to abandon SIMD optimization. 
To test this, I conducted the following experiment:  
   I wrote two simple functions to simulate the two scenarios and then examined 
the optimized assembly code generated by the compiler to compare the 
differences.  
   You can view the specific assembly code generated from my simulated 
experiment here: 
[https://godbolt.org/z/63bEvYP4d](https://godbolt.org/z/63bEvYP4d).
   
   ## Records
   Rust code as follows:
   ```rust
   /// Scene 1: Counting all set bits (the number of 1s)
   #[no_mangle]
   #[inline(never)]
   pub fn count_ones(ba: &[u8]) -> usize {
   ba.iter().map(|x| x.count_ones() as usize).sum()
   }
   
   /// Scene 2: Check if a bit is set
   #[no_mangle]
   #[inline(never)]
   pub fn find_any(ba: &[u8]) -> bool {
   ba.iter().any(|&x| x != 0)
   }
   ```
   Assembly code as follows:
   ```asm
   count_ones:
   testrsi, rsi
   je  .LBB0_1
   cmp rsi, 4
   jae .LBB0_5
   xor ecx, ecx
   xor eax, eax
   jmp .LBB0_8
   .LBB0_1:
   xor eax, eax
   ret
   .LBB0_5:
   mov rcx, rsi
   and rcx, -4
   pxorxmm0, xmm0
   xor eax, eax
   movdqa  xmm2, xmmword ptr [rip + .LCPI0_0]
   movdqa  xmm3, xmmword ptr [rip + .LCPI0_1]
   movdqa  xmm5, xmmword ptr [rip + .LCPI0_2]
   pxorxmm4, xmm4
   pxorxmm1, xmm1
   .LBB0_6:
   movzx   edx, word ptr [rdi + rax]
   movdxmm7, edx
   movzx   edx, word ptr [rdi + rax + 2]
   movdxmm6, edx
   punpcklbw   xmm7, xmm0
   punpcklwd   xmm7, xmm0
   punpckldq   xmm7, xmm0
   movdqa  xmm8, xmm7
   psrlw   xmm8, 1
   pandxmm8, xmm2
   psubb   xmm7, xmm8
   movdqa  xmm8, xmm7
   pandxmm8, xmm3
   psrlw   xmm7, 2
   pandxmm7, xmm3
   paddb   xmm7, xmm8
   movdqa  xmm8, xmm7
   psrlw   xmm8, 4
   paddb   xmm8, xmm7
   pandxmm8, xmm5
   psadbw  xmm8, xmm0
   paddq   xmm4, xmm8
   punpcklbw   xmm6, xmm0
   punpcklwd   xmm6, xmm0
   punpckldq   xmm6, xmm0
   movdqa  xmm7, xmm6
   psrlw   xmm7, 1
   pandxmm7, xmm2
   psubb   xmm6, xmm7
   movdqa  xmm7, xmm6
   pandxmm7, xmm3
   psrlw   xmm6, 2
   pandxmm6, xmm3
   paddb   xmm6, xmm7
   movdqa  xmm7, xmm6
   psrlw   xmm7, 4
   paddb   xmm7, xmm6
   pandxmm7, xmm5
   psadbw  xmm7, xmm0
   paddq   xmm1, xmm7
   add rax, 4
   cmp rcx, rax
   jne .LBB0_6
   paddq   xmm1, xmm4
   pshufd  xmm0, xmm1, 238
   paddq   xmm0, xmm1
   movqrax, xmm0
   cmp rcx, rsi
   je  .LBB0_2
   .LBB0_8:
   movzx   edx, byte ptr [rdi + rcx]
   imuledx, edx, 134480385
   shr edx, 3
   and edx, 286331153
   imuledx, edx, 286331153
   shr edx, 28
   add rax, rdx
   inc rcx
   cmp rsi, rcx
   jne .LBB0_8
   .LBB0_2:
   ret
   
   find_any:
   xor ecx, ecx
   .LBB1_1:
   mov rax, rcx
   cmp rsi, rcx
   je  .LBB1_3
   lea rcx, [rax + 1]
   cmp byte ptr [rdi + rax], 0
   je  .LBB1_1
   .LBB1_3:
   cmp rsi, rax
   setne   al
   ret
   ```
   
   ## Conclusion
   
   1. **Vectorization Optimization Differences**:
  - The assembly code for `count_ones` utilizes SIMD instructions (such as 
`movdqa`, `psrlw`, and `paddb`), achieving automatic vectorization. The 
compiler packs four bytes (32 bits) of data into an XMM register for parallel 
processing, efficiently calculating the number of set bits 

Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-07 Thread via GitHub


alamb commented on PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2783230355

   > > Security audit failure is not related to this PR
   > 
   > How can this be resolved?
   
   - it is tracked by https://github.com/apache/datafusion/issues/15571
   
   There are ideas on there of how to resolve the isseu
   
   To be clear I don't think this is blocking this PR from merging
   
   The only thing I think is needed for this PR is to either
   1. remove the bechmark 
   2. update the benchmark ti actually test `BinaryExpr` evaluation time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-07 Thread via GitHub


Dandandan commented on PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2783121877

   > > The related results are as follows. I’m not sure what you think about 
these results — should we continue using true_count/false_count, or look for a 
better solution? @alamb @Dandandan
   > 
   > Thank you for your diligence @acking-you
   > 
   > I think we should merge this PR in (with the count bits and the 
benchmarks) and file a follow on ticket to potentially improve the performance 
in some other way.
   > 
   > I don't fully understand your performance results given that the two 
functions seem very similar -- maybe it has to do with option hanlding messing 
auto vectorization or something
   > 
   > 
https://docs.rs/arrow-array/54.3.1/src/arrow_array/array/boolean_array.rs.html#160
 https://docs.rs/arrow-arith/54.3.1/src/arrow_arith/aggregate.rs.html#751
   
   I think the likely part of it getting slower is short-circuiting whenever a 
`true` is observed.
   
   For this case it might be interesting to compare it with 
`array.values().bit_chunks().iter_padded().fold(true, |acc, i: u64| i != 0 && 
acc)` and see if it is any faster than the `true_count` implementation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-07 Thread via GitHub


acking-you commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2031069949


##
datafusion/physical-expr/benches/boolean_op.rs:
##
@@ -0,0 +1,187 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::{
+array::BooleanArray,
+compute::{bool_and, bool_or},
+};
+use criterion::{black_box, criterion_group, criterion_main, Criterion};
+use std::sync::{Arc, LazyLock};
+
+/// Generates BooleanArrays with different true/false distributions for 
benchmarking.
+///
+/// Returns a vector of tuples containing scenario name and corresponding 
BooleanArray.
+///
+/// # Arguments
+/// - `TEST_ALL_FALSE` - Used to generate what kind of test data
+/// - `len` - Length of the BooleanArray to generate
+fn generate_boolean_cases(
+len: usize,
+) -> Vec<(String, BooleanArray)> {
+let mut cases = Vec::with_capacity(6);
+
+// Scenario 1: All elements false or all elements true
+if TEST_ALL_FALSE {
+let all_false = BooleanArray::from(vec![false; len]);
+cases.push(("all_false".to_string(), all_false));
+} else {
+let all_true = BooleanArray::from(vec![true; len]);
+cases.push(("all_true".to_string(), all_true));
+}
+
+// Scenario 2: Single true at first position or single false at first 
position
+if TEST_ALL_FALSE {
+let mut first_true = vec![false; len];
+first_true[0] = true;
+cases.push(("one_true_first".to_string(), 
BooleanArray::from(first_true)));
+} else {
+let mut first_false = vec![true; len];
+first_false[0] = false;
+cases.push((
+"one_false_first".to_string(),
+BooleanArray::from(first_false),
+));
+}
+
+// Scenario 3: Single true at last position or single false at last 
position
+if TEST_ALL_FALSE {
+let mut last_true = vec![false; len];
+last_true[len - 1] = true;
+cases.push(("one_true_last".to_string(), 
BooleanArray::from(last_true)));
+} else {
+let mut last_false = vec![true; len];
+last_false[len - 1] = false;
+cases.push(("one_false_last".to_string(), 
BooleanArray::from(last_false)));
+}
+
+// Scenario 4: Single true at exact middle or single false at exact middle
+let mid = len / 2;
+if TEST_ALL_FALSE {
+let mut mid_true = vec![false; len];
+mid_true[mid] = true;
+cases.push(("one_true_middle".to_string(), 
BooleanArray::from(mid_true)));
+} else {
+let mut mid_false = vec![true; len];
+mid_false[mid] = false;
+cases.push((
+"one_false_middle".to_string(),
+BooleanArray::from(mid_false),
+));
+}
+
+// Scenario 5: Single true at 25% position or single false at 25% position
+let mid_left = len / 4;
+if TEST_ALL_FALSE {
+let mut mid_left_true = vec![false; len];
+mid_left_true[mid_left] = true;
+cases.push((
+"one_true_middle_left".to_string(),
+BooleanArray::from(mid_left_true),
+));
+} else {
+let mut mid_left_false = vec![true; len];
+mid_left_false[mid_left] = false;
+cases.push((
+"one_false_middle_left".to_string(),
+BooleanArray::from(mid_left_false),
+));
+}
+
+// Scenario 6: Single true at 75% position or single false at 75% position
+let mid_right = (3 * len) / 4;
+if TEST_ALL_FALSE {
+let mut mid_right_true = vec![false; len];
+mid_right_true[mid_right] = true;
+cases.push((
+"one_true_middle_right".to_string(),
+BooleanArray::from(mid_right_true),
+));
+} else {
+let mut mid_right_false = vec![true; len];
+mid_right_false[mid_right] = false;
+cases.push((
+"one_false_middle_right".to_string(),
+BooleanArray::from(mid_right_false),
+));
+}
+
+cases
+}
+
+fn benchmark_boolean_ops(c: &mut Criterion) {

Review Comment:
   > in other words, make micro benchmarks for expressions like A AND B AND C 
AND D similar to what you added to the clickbench extended suite that shows the 
performance of ex

Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-07 Thread via GitHub


acking-you commented on PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2783033518

   > I don't fully understand your performance results given that the two 
functions seem very similar -- maybe it has to do with option hanlding messing 
auto vectorization or something
   
   Yes, I was also quite surprised. The only difference in the implementation 
of these two functions might be whether or not `find` was used.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-07 Thread via GitHub


alamb commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2031057246


##
datafusion/physical-expr/benches/boolean_op.rs:
##
@@ -0,0 +1,187 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::{
+array::BooleanArray,
+compute::{bool_and, bool_or},
+};
+use criterion::{black_box, criterion_group, criterion_main, Criterion};
+use std::sync::{Arc, LazyLock};
+
+/// Generates BooleanArrays with different true/false distributions for 
benchmarking.
+///
+/// Returns a vector of tuples containing scenario name and corresponding 
BooleanArray.
+///
+/// # Arguments
+/// - `TEST_ALL_FALSE` - Used to generate what kind of test data
+/// - `len` - Length of the BooleanArray to generate
+fn generate_boolean_cases(
+len: usize,
+) -> Vec<(String, BooleanArray)> {
+let mut cases = Vec::with_capacity(6);
+
+// Scenario 1: All elements false or all elements true
+if TEST_ALL_FALSE {
+let all_false = BooleanArray::from(vec![false; len]);
+cases.push(("all_false".to_string(), all_false));
+} else {
+let all_true = BooleanArray::from(vec![true; len]);
+cases.push(("all_true".to_string(), all_true));
+}
+
+// Scenario 2: Single true at first position or single false at first 
position
+if TEST_ALL_FALSE {
+let mut first_true = vec![false; len];
+first_true[0] = true;
+cases.push(("one_true_first".to_string(), 
BooleanArray::from(first_true)));
+} else {
+let mut first_false = vec![true; len];
+first_false[0] = false;
+cases.push((
+"one_false_first".to_string(),
+BooleanArray::from(first_false),
+));
+}
+
+// Scenario 3: Single true at last position or single false at last 
position
+if TEST_ALL_FALSE {
+let mut last_true = vec![false; len];
+last_true[len - 1] = true;
+cases.push(("one_true_last".to_string(), 
BooleanArray::from(last_true)));
+} else {
+let mut last_false = vec![true; len];
+last_false[len - 1] = false;
+cases.push(("one_false_last".to_string(), 
BooleanArray::from(last_false)));
+}
+
+// Scenario 4: Single true at exact middle or single false at exact middle
+let mid = len / 2;
+if TEST_ALL_FALSE {
+let mut mid_true = vec![false; len];
+mid_true[mid] = true;
+cases.push(("one_true_middle".to_string(), 
BooleanArray::from(mid_true)));
+} else {
+let mut mid_false = vec![true; len];
+mid_false[mid] = false;
+cases.push((
+"one_false_middle".to_string(),
+BooleanArray::from(mid_false),
+));
+}
+
+// Scenario 5: Single true at 25% position or single false at 25% position
+let mid_left = len / 4;
+if TEST_ALL_FALSE {
+let mut mid_left_true = vec![false; len];
+mid_left_true[mid_left] = true;
+cases.push((
+"one_true_middle_left".to_string(),
+BooleanArray::from(mid_left_true),
+));
+} else {
+let mut mid_left_false = vec![true; len];
+mid_left_false[mid_left] = false;
+cases.push((
+"one_false_middle_left".to_string(),
+BooleanArray::from(mid_left_false),
+));
+}
+
+// Scenario 6: Single true at 75% position or single false at 75% position
+let mid_right = (3 * len) / 4;
+if TEST_ALL_FALSE {
+let mut mid_right_true = vec![false; len];
+mid_right_true[mid_right] = true;
+cases.push((
+"one_true_middle_right".to_string(),
+BooleanArray::from(mid_right_true),
+));
+} else {
+let mut mid_right_false = vec![true; len];
+mid_right_false[mid_right] = false;
+cases.push((
+"one_false_middle_right".to_string(),
+BooleanArray::from(mid_right_false),
+));
+}
+
+cases
+}
+
+fn benchmark_boolean_ops(c: &mut Criterion) {

Review Comment:
   Can you please update these benchmarks for the performance of boolean 
expression evaluation? As of now, these are benchmarks for `bool_and` and 
`bool_or`?
   
   In other words,

Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-07 Thread via GitHub


acking-you commented on PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2783035350

   > Security audit failure is not related to this PR
   
   How can this be resolved?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-07 Thread via GitHub


acking-you commented on PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2782983563

   I sincerely apologize for the delay in updating this PR. I have now designed 
a detailed comparative test for the `bool_or/bool_and` issue described by 
@Dandandan. The related results are as follows. I’m not sure what you think 
about these results — should we continue using `true_count/false_count`, or 
look for a better solution? @alamb @Dandandan
   
   
   1. I designed six distribution 
scenarios([code](https://github.com/apache/datafusion/pull/15462/files#diff-8710f6b44dd74240d19e6fcdfbf971c034f59cd48c022e51b07ed876a8cc7c5eR37-R120))
 for the `true_count/false_count` and `bool_or/bool_and` of the boolean array 
to conduct performance testing comparisons. The final results are as follows:
   
   | Test Case  | true_count/false_count | bool_or/bool_and 
   | Change   | Performance Impact |
   | -- | -- | 
--- |  | -- |
   | all_false  | 1.6593 µs - 1.6695 µs  | 4.1013 µs - 4.1774 µs
   | ​150.91%​​   | 🚫 Regressed   |
   | one_true_first | 1.6726 µs - 1.6771 µs  | ​**1.6885 ns - 1.7430 
ns**​ | ​**​-99.898%​**​ | ✅ Improved |
   | one_true_last  | 1.6663 µs - 1.6714 µs  | 4.1096 µs - 4.1554 µs
   | +147.14% | 🚫 Regressed   |
   | one_true_middle| 1.6723 µs - 1.6819 µs  | 2.1505 µs - 2.2117 µs
   | +30.180% | 🚫 Regressed   |
   | one_true_middle_left   | 1.6672 µs - 1.6727 µs  | 1.1088 µs - 1.1483 µs
   | -32.995% | ✅ Improved |
   | one_true_middle_right  | 1.6689 µs - 1.6741 µs  | 3.1562 µs - 3.2521 µs
   | +93.762% | 🚫 Regressed   |
   | all_true   | 1.6711 µs - 1.6747 µs  | 4.2779 µs - 4.4291 µs
   | +155.35% | 🚫 Regressed   |
   | one_false_first| 1.6722 µs - 1.6782 µs  | ​**1.6278 ns - 1.6360 
ns**​ | ​**​-99.903%​**​ | ✅ Improved |
   | one_false_last | 1.6818 µs - 1.7512 µs  | 4.2175 µs - 4.2930 µs
   | +153.50% | 🚫 Regressed   |
   | one_false_middle   | 1.8437 µs - 1.9665 µs  | 2.0575 µs - 2.0871 µs
   | +11.931% | 🚫 Regressed   |
   | one_false_middle_left  | 2.0004 µs - 2.3194 µs  | 1.0243 µs - 1.0398 µs
   | -57.059% | ✅ Improved |
   | one_false_middle_right | 2.0770 µs - 2.2721 µs  | 3.0275 µs - 3.0582 µs
   | +47.668% | 🚫 Regressed   |
   
   It can be seen that when `false/true` is located slightly to the left of the 
middle, `bool_or/bool_and` has a significant advantage, with up to 10^3 times 
the performance lead when in the first position.  
   
   However, in other cases, using `true_count/false_count` performs better, 
showing relatively stable behavior across various scenarios.
   
   2. The test can be reproduced with the following command:
   ```sh
   # test true_count/false_count 
   TEST_BOOL_COUNT=1 cargo bench --bench boolean_op
   # test bool_or/bool_and 
   cargo bench --bench boolean_op
   ```
   detail benchmark code: 
https://github.com/apache/datafusion/pull/15462/files#diff-8710f6b44dd74240d19e6fcdfbf971c034f59cd48c022e51b07ed876a8cc7c5e


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-07 Thread via GitHub


alamb commented on PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2783021714

   > The related results are as follows. I’m not sure what you think about 
these results — should we continue using true_count/false_count, or look for a 
better solution? @alamb @Dandandan
   
   Thank you for your diligence @acking-you 
   
   I think we should merge this PR in (with the count bits and the benchmarks) 
and file a follow on ticket to potentially improve the performance in some 
other way. 
   
   I don't fully understand your performance results given that the two 
functions seem very similar -- maybe it has to do with option hanlding messing 
auto vectorization or something
   
   
https://docs.rs/arrow-array/54.3.1/src/arrow_array/array/boolean_array.rs.html#160
   https://docs.rs/arrow-arith/54.3.1/src/arrow_arith/aggregate.rs.html#751


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-07 Thread via GitHub


alamb commented on PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2783022496

   Security audit is not related to this RP


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-06 Thread via GitHub


acking-you commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2030375948


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -805,6 +811,47 @@ impl BinaryExpr {
 }
 }
 
+/// Check if it meets the short-circuit condition
+/// 1. For the `AND` operator, if the `lhs` result all are `false`
+/// 2. For the `OR` operator, if the `lhs` result all are `true`
+/// 3. Otherwise, it does not meet the short-circuit condition
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {

Review Comment:
   Thank you very much for your reply, your opinion is correct. I actually did 
the same. I'm very sorry. Maybe my comment is a little confusing. I will fix it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-06 Thread via GitHub


acking-you commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2030375948


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -805,6 +811,47 @@ impl BinaryExpr {
 }
 }
 
+/// Check if it meets the short-circuit condition
+/// 1. For the `AND` operator, if the `lhs` result all are `false`
+/// 2. For the `OR` operator, if the `lhs` result all are `true`
+/// 3. Otherwise, it does not meet the short-circuit condition
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {

Review Comment:
   Thank you very much for your CR, your opinion is correct. I actually did the 
same. I'm very sorry. Maybe my comment is a little confusing. I will fix it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-06 Thread via GitHub


berkaysynnada commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2030260718


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -805,6 +811,47 @@ impl BinaryExpr {
 }
 }
 
+/// Check if it meets the short-circuit condition
+/// 1. For the `AND` operator, if the `lhs` result all are `false`
+/// 2. For the `OR` operator, if the `lhs` result all are `true`
+/// 3. Otherwise, it does not meet the short-circuit condition
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {

Review Comment:
   I might be missing some obvious points, but 
   1) applying the same check for rhs is redundant? (does it approximately 
requires the same computation if we continue the execution as is)
   2) why aren't we also covering the cases when **lhs is all true and op is 
AND** / **lhs is all false and op is OR**



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-05 Thread via GitHub


acking-you commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2020615756


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -805,6 +811,47 @@ impl BinaryExpr {
 }
 }
 
+/// Check if it meets the short-circuit condition
+/// 1. For the `AND` operator, if the `lhs` result all are `false`
+/// 2. For the `OR` operator, if the `lhs` result all are `true`
+/// 3. Otherwise, it does not meet the short-circuit condition
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
+let data_type = arg.data_type();
+match (data_type, op) {
+(DataType::Boolean, Operator::And) => {
+match arg {
+ColumnarValue::Array(array) => {
+if let Ok(array) = as_boolean_array(&array) {
+return array.false_count() == array.len();

Review Comment:
   > I wonder if it helps to short optimize this expression (e.g. match until 
we get a chunk of the bitmap != 0)
   
   I think the overhead added here should be very small (the compiler 
optimization should work well), and the test results we discussed before were 
sometimes fast and sometimes slow (maybe noise).
   
   Your suggestion of making an early judgment and returning false seems like a 
good idea, but I'm not sure if it will be effective.  
   The concern I have with this approach is that it requires adding an `if` 
condition inside the `for` loop, which will most likely disable the compiler's 
SIMD instruction optimization (I've encountered a similar situation before, and 
I had to manually unroll the SIMD...).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-05 Thread via GitHub


Dandandan commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2020558418


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -805,6 +811,47 @@ impl BinaryExpr {
 }
 }
 
+/// Check if it meets the short-circuit condition
+/// 1. For the `AND` operator, if the `lhs` result all are `false`
+/// 2. For the `OR` operator, if the `lhs` result all are `true`
+/// 3. Otherwise, it does not meet the short-circuit condition
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
+let data_type = arg.data_type();
+match (data_type, op) {
+(DataType::Boolean, Operator::And) => {
+match arg {
+ColumnarValue::Array(array) => {
+if let Ok(array) = as_boolean_array(&array) {
+return array.false_count() == array.len();

Review Comment:
   I remember this had some overhead (for calculating the counts) from a 
previous try.
   I wonder if it helps to short optimize this expression (e.g. match until we 
get a chunk of the bitmap `!= 0`)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-03 Thread via GitHub


alamb commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2026912500


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -805,6 +811,47 @@ impl BinaryExpr {
 }
 }
 
+/// Check if it meets the short-circuit condition
+/// 1. For the `AND` operator, if the `lhs` result all are `false`
+/// 2. For the `OR` operator, if the `lhs` result all are `true`
+/// 3. Otherwise, it does not meet the short-circuit condition
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
+let data_type = arg.data_type();
+match (data_type, op) {
+(DataType::Boolean, Operator::And) => {
+match arg {
+ColumnarValue::Array(array) => {
+if let Ok(array) = as_boolean_array(&array) {
+return array.false_count() == array.len();

Review Comment:
   I looked more carefully at bool_or and I do think it would be faster than 
this implementation on the case where there are some true values (as it stops 
as soon as it finds a single non zero): 
https://docs.rs/arrow/latest/arrow/compute/fn.bool_or.html
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-03 Thread via GitHub


alamb commented on PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2775475696

   > > I looked for a binary expression and/or micro benchmark and could not 
find one: 
https://github.com/apache/datafusion/tree/c929a1cd133590e4944bc2c7900611220450335a/datafusion/physical-expr/benches
   > 
   > Thank you for your suggestions and help. I also noticed this part of the 
code yesterday. I've had a cold and fever in the past few days, so I haven't 
tried the method mentioned by @Dandandan for benchmarking yet. I should be able 
to work on it today.
   
   I hope you feel better soon
   
   I have been running some benchmark numbers (below) and I have gotten some 
strange results. I definitely see this PR improves performance on the newly 
added clickbench extended benchmark (almost 3x faster!)
   
   However `clickbench_1` seems to show a slowdown compared to 
`clickbench_partitioned` which I don't understand. I will try and reproduce 
this locally
   
   Details
   
   
   ```
   
   Benchmark clickbench_1.json
   
   ┏━━┳┳━━━┳━━┓
   ┃ Query┃  main_base ┃ add_short_circuit ┃   Change ┃
   ┡━━╇╇━━━╇━━┩
   │ QQuery 0 │ 0.58ms │0.59ms │no change │
   │ QQuery 1 │70.93ms │   72.44ms │no change │
   │ QQuery 2 │   118.33ms │  120.96ms │no change │
   │ QQuery 3 │   124.52ms │  130.17ms │no change │
   │ QQuery 4 │   788.26ms │  835.71ms │ 1.06x slower │
   │ QQuery 5 │   891.18ms │ 1033.71ms │ 1.16x slower │
   │ QQuery 6 │68.02ms │   66.47ms │no change │
   │ QQuery 7 │79.51ms │   80.38ms │no change │
   │ QQuery 8 │   951.66ms │  962.46ms │no change │
   │ QQuery 9 │  1242.83ms │ 1277.46ms │no change │
   │ QQuery 10│   305.01ms │  312.25ms │no change │
   │ QQuery 11│   334.19ms │  341.58ms │no change │
   │ QQuery 12│   941.89ms │ 1079.60ms │ 1.15x slower │
   │ QQuery 13│  1341.24ms │ 1528.55ms │ 1.14x slower │
   │ QQuery 14│   874.85ms │ 1057.59ms │ 1.21x slower │
   │ QQuery 15│  1118.77ms │ 1134.54ms │no change │
   │ QQuery 16│  1803.30ms │ 1920.39ms │ 1.06x slower │
   │ QQuery 17│  1669.97ms │ 1780.42ms │ 1.07x slower │
   │ QQuery 18│  3269.95ms │ 3297.16ms │no change │
   │ QQuery 19│   120.62ms │  116.02ms │no change │
   │ QQuery 20│  1222.20ms │ 1302.52ms │ 1.07x slower │
   │ QQuery 21│  1471.42ms │ 1638.85ms │ 1.11x slower │
   │ QQuery 22│  2740.23ms │ 4508.20ms │ 1.65x slower │
   │ QQuery 23│  8733.62ms │10619.68ms │ 1.22x slower │
   │ QQuery 24│   515.74ms │  695.11ms │ 1.35x slower │
   │ QQuery 25│   431.26ms │  608.92ms │ 1.41x slower │
   │ QQuery 26│   568.41ms │  743.79ms │ 1.31x slower │
   │ QQuery 27│  1865.82ms │ 1966.63ms │ 1.05x slower │
   │ QQuery 28│ 13400.82ms │13320.07ms │no change │
   │ QQuery 29│   574.24ms │  590.87ms │no change │
   │ QQuery 30│   863.90ms │ 1059.33ms │ 1.23x slower │
   │ QQuery 31│   940.63ms │ 1115.46ms │ 1.19x slower │
   │ QQuery 32│  2790.62ms │ 2933.31ms │ 1.05x slower │
   │ QQuery 33│  3534.40ms │ 3583.06ms │no change │
   │ QQuery 34│  3515.54ms │ 3592.40ms │no change │
   │ QQuery 35│  1327.76ms │ 1384.98ms │no change │
   │ QQuery 36│   257.33ms │  269.35ms │no change │
   │ QQuery 37│   124.48ms │  187.43ms │ 1.51x slower │
   │ QQuery 38│   167.20ms │  182.73ms │ 1.09x slower │
   │ QQuery 39│   454.63ms │  474.86ms │no change │
   │ QQuery 40│82.41ms │   83.28ms │no change │
   │ QQuery 41│82.20ms │   80.21ms │no change │
   │ QQuery 42│78.13ms │   87.76ms │ 1.12x slower │
   └──┴┴───┴──┘
   ┏━━┳┓
   ┃ Benchmark Summary┃┃
   ┡━━╇┩
   │ Total Time (main_base)   │ 61858.61ms │
   │ Total Time (add_short_circuit)   │ 68177.23ms │
   │ Average Time (main_base) │  1438.57ms │
   │ Average Time (add_short_circuit) │  1585.52ms │
   │ Queries Faster   │  0 │
   │ Queries Slower   │ 21 │
   │ Queries with No Change   │ 22 │
   └──┴┘
   
   Benchmark clickbench_extended.json
   
   ┏━━┳┳━

Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-02 Thread via GitHub


acking-you commented on PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2774487207

   > I looked for a binary expression and/or micro benchmark and could not find 
one: 
https://github.com/apache/datafusion/tree/c929a1cd133590e4944bc2c7900611220450335a/datafusion/physical-expr/benches
   
   Thank you for your suggestions and help. I also noticed this part of the 
code yesterday. I've had a cold and fever in the past few days, so I haven't 
tried the method mentioned by @Dandandan for benchmarking yet. I should be able 
to work on it today.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-01 Thread via GitHub


acking-you commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2023116930


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -805,6 +811,47 @@ impl BinaryExpr {
 }
 }
 
+/// Check if it meets the short-circuit condition
+/// 1. For the `AND` operator, if the `lhs` result all are `false`
+/// 2. For the `OR` operator, if the `lhs` result all are `true`
+/// 3. Otherwise, it does not meet the short-circuit condition
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
+let data_type = arg.data_type();
+match (data_type, op) {
+(DataType::Boolean, Operator::And) => {
+match arg {
+ColumnarValue::Array(array) => {
+if let Ok(array) = as_boolean_array(&array) {
+return array.false_count() == array.len();

Review Comment:
   > Might be overkill, but one _could_ try a sampling approach: Run the loop 
with the early exit for the first few chunks, and then switch over to the 
unconditional loop.
   
   Thank you for your suggestion, but if we're only applying conditional checks 
to the first few blocks, then I feel this optimization might not be meaningful. 
If nearly all blocks can be filtered out by the preceding filter, the 
optimization will no longer be effective.
   
   >If we find that this slows down some other performance we could also add 
some sort of heuristic check to calling false_count / true_count -- like for 
example if the rhs arg is "complex" (not a Column for example)
   
   I tend to agree with @alamb's point that if the overhead of verification is 
somewhat unacceptable, adopting some heuristic approaches would be better.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-01 Thread via GitHub


ctsk commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2022894121


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -805,6 +811,47 @@ impl BinaryExpr {
 }
 }
 
+/// Check if it meets the short-circuit condition
+/// 1. For the `AND` operator, if the `lhs` result all are `false`
+/// 2. For the `OR` operator, if the `lhs` result all are `true`
+/// 3. Otherwise, it does not meet the short-circuit condition
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
+let data_type = arg.data_type();
+match (data_type, op) {
+(DataType::Boolean, Operator::And) => {
+match arg {
+ColumnarValue::Array(array) => {
+if let Ok(array) = as_boolean_array(&array) {
+return array.false_count() == array.len();

Review Comment:
   Might be overkill, but one *could* try a sampling approach: Run the loop 
with the early exit for the first few chunks, and then switch over to the 
unconditional loop.
   
   Almost seems like something the compiler could automagically do...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-03-31 Thread via GitHub


acking-you commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2020615756


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -805,6 +811,47 @@ impl BinaryExpr {
 }
 }
 
+/// Check if it meets the short-circuit condition
+/// 1. For the `AND` operator, if the `lhs` result all are `false`
+/// 2. For the `OR` operator, if the `lhs` result all are `true`
+/// 3. Otherwise, it does not meet the short-circuit condition
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
+let data_type = arg.data_type();
+match (data_type, op) {
+(DataType::Boolean, Operator::And) => {
+match arg {
+ColumnarValue::Array(array) => {
+if let Ok(array) = as_boolean_array(&array) {
+return array.false_count() == array.len();

Review Comment:
   > I wonder if it helps to short optimize this expression (e.g. match until 
we get a chunk of the bitmap != 0)
   
   I think the overhead added here should be very small (the compiler 
optimization should work well), and the test results we discussed before were 
sometimes fast and sometimes slow (maybe noise).
   
   Your suggestion of making an early judgment and returning false seems like a 
good idea, but I'm not sure if it will be effective.  
   The concern I have with this approach is that it requires adding an `if` 
condition inside the `for` loop, which will most likely disable the compiler's 
SIMD instruction optimization (I've encountered a similar situation before when 
writing code, and I had to manually unroll the SIMD...).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-03-31 Thread via GitHub


acking-you commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2020726805


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -805,6 +811,47 @@ impl BinaryExpr {
 }
 }
 
+/// Check if it meets the short-circuit condition
+/// 1. For the `AND` operator, if the `lhs` result all are `false`
+/// 2. For the `OR` operator, if the `lhs` result all are `true`
+/// 3. Otherwise, it does not meet the short-circuit condition
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
+let data_type = arg.data_type();
+match (data_type, op) {
+(DataType::Boolean, Operator::And) => {
+match arg {
+ColumnarValue::Array(array) => {
+if let Ok(array) = as_boolean_array(&array) {
+return array.false_count() == array.len();

Review Comment:
   > Either way, we can use `bool_and` 
(https://docs.rs/arrow/latest/arrow/compute/fn.bool_and.html) and `bool_or` 
which operates on `u64` values to test performance changes.
   
   Thank you for your suggestion. I will try it later.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-03-31 Thread via GitHub


Dandandan commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2020695065


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -805,6 +811,47 @@ impl BinaryExpr {
 }
 }
 
+/// Check if it meets the short-circuit condition
+/// 1. For the `AND` operator, if the `lhs` result all are `false`
+/// 2. For the `OR` operator, if the `lhs` result all are `true`
+/// 3. Otherwise, it does not meet the short-circuit condition
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
+let data_type = arg.data_type();
+match (data_type, op) {
+(DataType::Boolean, Operator::And) => {
+match arg {
+ColumnarValue::Array(array) => {
+if let Ok(array) = as_boolean_array(&array) {
+return array.false_count() == array.len();

Review Comment:
   Either way, we can use `bool_and`  
(https://docs.rs/arrow/latest/arrow/compute/fn.bool_and.html) and `bool_or` 
which operates on `u64` values to test performance changes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-03-31 Thread via GitHub


Dandandan commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2020695065


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -805,6 +811,47 @@ impl BinaryExpr {
 }
 }
 
+/// Check if it meets the short-circuit condition
+/// 1. For the `AND` operator, if the `lhs` result all are `false`
+/// 2. For the `OR` operator, if the `lhs` result all are `true`
+/// 3. Otherwise, it does not meet the short-circuit condition
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
+let data_type = arg.data_type();
+match (data_type, op) {
+(DataType::Boolean, Operator::And) => {
+match arg {
+ColumnarValue::Array(array) => {
+if let Ok(array) = as_boolean_array(&array) {
+return array.false_count() == array.len();

Review Comment:
   Either way, we can use `max_boolean`  
(https://docs.rs/arrow/latest/arrow/compute/fn.max_boolean.html) and 
`min_boolean` which operates on `u64` values to test performance changes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-03-30 Thread via GitHub


acking-you commented on PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2764998480

   > Also, could you please add the new Q6 benchmark in a separate PR so I can 
more easily run my benchmark scripts before/after your code change?
   
   I have successfully split Q6 into: 
https://github.com/apache/datafusion/pull/15500  
   Thank you very much for your CR. @alamb 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-03-30 Thread via GitHub


acking-you commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2020360286


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -358,7 +358,50 @@ impl PhysicalExpr for BinaryExpr {
 fn evaluate(&self, batch: &RecordBatch) -> Result {
 use arrow::compute::kernels::numeric::*;
 
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {

Review Comment:
   I've moved the function outside and added some comments.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-03-29 Thread via GitHub


acking-you commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2019944906


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -358,7 +358,50 @@ impl PhysicalExpr for BinaryExpr {
 fn evaluate(&self, batch: &RecordBatch) -> Result {
 use arrow::compute::kernels::numeric::*;
 
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {

Review Comment:
   > Is there any reason to have this function defined in the evaluate method? 
   
   There was no particular reason. Maybe I couldn't find a suitable place to 
write it at the time, haha. Where do you think this function should be placed?
   
   > If we find that this slows down some other performance we could also add 
some sort of heuristic check to calling false_count / true_count -- like for 
example if the rhs arg is "complex" (not a Column for example)
   
   I also agree that



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-03-29 Thread via GitHub


acking-you commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2019945905


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -358,7 +358,50 @@ impl PhysicalExpr for BinaryExpr {
 fn evaluate(&self, batch: &RecordBatch) -> Result {
 use arrow::compute::kernels::numeric::*;
 
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
+let data_type = arg.data_type();
+match (data_type, op) {
+(DataType::Boolean, Operator::And) => {
+match arg {
+ColumnarValue::Array(array) => {
+if let Ok(array) = as_boolean_array(&array) {
+return array.false_count() == array.len();
+}
+}
+ColumnarValue::Scalar(scalar) => {
+if let ScalarValue::Boolean(Some(value)) = scalar {
+return !value;
+}
+}
+}
+false
+}
+(DataType::Boolean, Operator::Or) => {
+match arg {
+ColumnarValue::Array(array) => {
+if let Ok(array) = as_boolean_array(&array) {
+return array.true_count() == array.len();
+}
+}
+ColumnarValue::Scalar(scalar) => {
+if let ScalarValue::Boolean(Some(value)) = scalar {
+return *value;
+}
+}
+}
+false
+}
+_ => false,
+}
+}

Review Comment:
   I'll add these tests right away.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-03-29 Thread via GitHub


acking-you commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2019944906


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -358,7 +358,50 @@ impl PhysicalExpr for BinaryExpr {
 fn evaluate(&self, batch: &RecordBatch) -> Result {
 use arrow::compute::kernels::numeric::*;
 
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {

Review Comment:
   > Is there any reason to have this function defined in the evaluate method? 
   There was no particular reason. Maybe I couldn't find a suitable place to 
write it at the time, haha.
   
   > If we find that this slows down some other performance we could also add 
some sort of heuristic check to calling false_count / true_count -- like for 
example if the rhs arg is "complex" (not a Column for example)
   
   I also agree that



##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -358,7 +358,50 @@ impl PhysicalExpr for BinaryExpr {
 fn evaluate(&self, batch: &RecordBatch) -> Result {
 use arrow::compute::kernels::numeric::*;
 
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {

Review Comment:
   > Is there any reason to have this function defined in the evaluate method? 
   
   There was no particular reason. Maybe I couldn't find a suitable place to 
write it at the time, haha.
   
   > If we find that this slows down some other performance we could also add 
some sort of heuristic check to calling false_count / true_count -- like for 
example if the rhs arg is "complex" (not a Column for example)
   
   I also agree that



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-03-29 Thread via GitHub


acking-you commented on PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2764093707

   > Also, could you please add the new Q6 benchmark in a separate PR so I can 
more easily run my benchmark scripts before/after your code change?
   
   Okey,I got it.Do you mean that Q6 and its related description in the current 
branch need to be completed with a separate PR? But this way, it seems that you 
would still need to cherry-pick Q6 to the corresponding branch when testing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org