Re: [PR] Optimize performance of `string::ascii` function [datafusion]

2025-05-21 Thread via GitHub


alamb merged PR #16087:
URL: https://github.com/apache/datafusion/pull/16087


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Optimize performance of `string::ascii` function [datafusion]

2025-05-21 Thread via GitHub


alamb commented on PR #16087:
URL: https://github.com/apache/datafusion/pull/16087#issuecomment-2898904176

   This is better than what is on main and we can do always optimize further.
   
   Thanks a log @tlm365 @findepi and @Dandandan  -- always up and to the right


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Optimize performance of `string::ascii` function [datafusion]

2025-05-19 Thread via GitHub


tlm365 commented on code in PR #16087:
URL: https://github.com/apache/datafusion/pull/16087#discussion_r2096902385


##
datafusion/functions/benches/ascii.rs:
##
@@ -0,0 +1,116 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+extern crate criterion;
+mod helper;
+
+use arrow::datatypes::{DataType, Field};
+use criterion::{black_box, criterion_group, criterion_main, Criterion};
+use datafusion_expr::ScalarFunctionArgs;
+use helper::gen_string_array;
+
+fn criterion_benchmark(c: &mut Criterion) {
+let ascii = datafusion_functions::string::ascii();
+
+// All benches are single batch run with 8192 rows
+const N_ROWS: usize = 8192;
+const STR_LEN: usize = 16;
+const NULL_DENSITY: f32 = 0.2;

Review Comment:
   > I think it makes sense to add a variant of the tests with `NULL_DENSITY` 
of 0.0. This way we can optimize for the case with only non-null values.
   
   I just added benchmark for this case. Seems that there is not much change in 
performance.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Optimize performance of `string::ascii` function [datafusion]

2025-05-19 Thread via GitHub


alamb commented on code in PR #16087:
URL: https://github.com/apache/datafusion/pull/16087#discussion_r2096357959


##
datafusion/functions/src/string/ascii.rs:
##
@@ -103,19 +103,22 @@ impl ScalarUDFImpl for AsciiFunc {
 
 fn calculate_ascii<'a, V>(array: V) -> Result
 where
-V: ArrayAccessor,
+V: StringArrayType<'a, Item = &'a str>,
 {
-let iter = ArrayIter::new(array);
-let result = iter
-.map(|string| {
-string.map(|s| {
-let mut chars = s.chars();
-chars.next().map_or(0, |v| v as i32)
-})
+let values: Vec<_> = (0..array.len())
+.map(|i| {
+if array.is_null(i) {
+0

Review Comment:
   That is my understanding as well -- it should be safe (in the rust sense) to 
look up an entry that is set to null



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Optimize performance of `string::ascii` function [datafusion]

2025-05-19 Thread via GitHub


Dandandan commented on code in PR #16087:
URL: https://github.com/apache/datafusion/pull/16087#discussion_r2096162977


##
datafusion/functions/src/string/ascii.rs:
##
@@ -103,19 +106,29 @@ impl ScalarUDFImpl for AsciiFunc {
 
 fn calculate_ascii<'a, V>(array: V) -> Result
 where
-V: ArrayAccessor,
+V: StringArrayType<'a, Item = &'a str>,
 {
-let iter = ArrayIter::new(array);
-let result = iter
-.map(|string| {
-string.map(|s| {
-let mut chars = s.chars();
-chars.next().map_or(0, |v| v as i32)
-})
-})
-.collect::();
-
-Ok(Arc::new(result) as ArrayRef)
+let mut values = Vec::with_capacity(array.len());

Review Comment:
   The benchmarks don't test this case I believe? Can we add variants of the 
benchmarks with this case tested?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Optimize performance of `string::ascii` function [datafusion]

2025-05-19 Thread via GitHub


Dandandan commented on code in PR #16087:
URL: https://github.com/apache/datafusion/pull/16087#discussion_r2096168025


##
datafusion/functions/src/string/ascii.rs:
##
@@ -103,19 +103,22 @@ impl ScalarUDFImpl for AsciiFunc {
 
 fn calculate_ascii<'a, V>(array: V) -> Result
 where
-V: ArrayAccessor,
+V: StringArrayType<'a, Item = &'a str>,
 {
-let iter = ArrayIter::new(array);
-let result = iter
-.map(|string| {
-string.map(|s| {
-let mut chars = s.chars();
-chars.next().map_or(0, |v| v as i32)
-})
+let values: Vec<_> = (0..array.len())
+.map(|i| {
+if array.is_null(i) {
+0

Review Comment:
   I realize now the spec doesn't say anything about the offset pointing to 
valid utf8 or not. 
   But `value(i)` being not unsafe, I guess `arrow-rs` should provide that 
guarantee.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Optimize performance of `string::ascii` function [datafusion]

2025-05-19 Thread via GitHub


Dandandan commented on code in PR #16087:
URL: https://github.com/apache/datafusion/pull/16087#discussion_r2096161845


##
datafusion/functions/src/string/ascii.rs:
##
@@ -103,19 +103,22 @@ impl ScalarUDFImpl for AsciiFunc {
 
 fn calculate_ascii<'a, V>(array: V) -> Result
 where
-V: ArrayAccessor,
+V: StringArrayType<'a, Item = &'a str>,
 {
-let iter = ArrayIter::new(array);
-let result = iter
-.map(|string| {
-string.map(|s| {
-let mut chars = s.chars();
-chars.next().map_or(0, |v| v as i32)
-})
+let values: Vec<_> = (0..array.len())
+.map(|i| {
+if array.is_null(i) {
+0

Review Comment:
   I had to look it up
   
   > Offsets must be monotonically increasing, that is offsets[j+1] >= 
offsets[j] for 0 <= j < length, even for null slots. This property ensures the 
location for all values is valid and well defined.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Optimize performance of `string::ascii` function [datafusion]

2025-05-19 Thread via GitHub


Dandandan commented on code in PR #16087:
URL: https://github.com/apache/datafusion/pull/16087#discussion_r2096157208


##
datafusion/functions/benches/ascii.rs:
##
@@ -0,0 +1,116 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+extern crate criterion;
+mod helper;
+
+use arrow::datatypes::{DataType, Field};
+use criterion::{black_box, criterion_group, criterion_main, Criterion};
+use datafusion_expr::ScalarFunctionArgs;
+use helper::gen_string_array;
+
+fn criterion_benchmark(c: &mut Criterion) {
+let ascii = datafusion_functions::string::ascii();
+
+// All benches are single batch run with 8192 rows
+const N_ROWS: usize = 8192;
+const STR_LEN: usize = 16;
+const NULL_DENSITY: f32 = 0.2;

Review Comment:
   I think it makes sense to add a variant of the tests with `NULL_DENSITY` of 
0.0.
   This way we can optimize for the case with only nulls.



##
datafusion/functions/benches/ascii.rs:
##
@@ -0,0 +1,116 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+extern crate criterion;
+mod helper;
+
+use arrow::datatypes::{DataType, Field};
+use criterion::{black_box, criterion_group, criterion_main, Criterion};
+use datafusion_expr::ScalarFunctionArgs;
+use helper::gen_string_array;
+
+fn criterion_benchmark(c: &mut Criterion) {
+let ascii = datafusion_functions::string::ascii();
+
+// All benches are single batch run with 8192 rows
+const N_ROWS: usize = 8192;
+const STR_LEN: usize = 16;
+const NULL_DENSITY: f32 = 0.2;

Review Comment:
   I think it makes sense to add a variant of the tests with `NULL_DENSITY` of 
0.0.
   This way we can optimize for the case with only non-null values.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Optimize performance of `string::ascii` function [datafusion]

2025-05-19 Thread via GitHub


alamb commented on code in PR #16087:
URL: https://github.com/apache/datafusion/pull/16087#discussion_r2096115219


##
datafusion/functions/src/string/ascii.rs:
##
@@ -103,19 +103,22 @@ impl ScalarUDFImpl for AsciiFunc {
 
 fn calculate_ascii<'a, V>(array: V) -> Result
 where
-V: ArrayAccessor,
+V: StringArrayType<'a, Item = &'a str>,
 {
-let iter = ArrayIter::new(array);
-let result = iter
-.map(|string| {
-string.map(|s| {
-let mut chars = s.chars();
-chars.next().map_or(0, |v| v as i32)
-})
+let values: Vec<_> = (0..array.len())
+.map(|i| {
+if array.is_null(i) {
+0

Review Comment:
   It would be legal. I am not sure it would be faster, though we could easily 
check with a benchmark once this PR is merged
   
   We could at least add a check on the outside of the loop
   
   ```rust
   // If the array has no nulls, skip checking each value
   if array.null_count() == 0 {
 let values: Vec<_> = (0..array.len())
   }
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Optimize performance of `string::ascii` function [datafusion]

2025-05-19 Thread via GitHub


findepi commented on code in PR #16087:
URL: https://github.com/apache/datafusion/pull/16087#discussion_r2095349825


##
datafusion/functions/src/string/ascii.rs:
##
@@ -103,19 +103,22 @@ impl ScalarUDFImpl for AsciiFunc {
 
 fn calculate_ascii<'a, V>(array: V) -> Result
 where
-V: ArrayAccessor,
+V: StringArrayType<'a, Item = &'a str>,
 {
-let iter = ArrayIter::new(array);
-let result = iter
-.map(|string| {
-string.map(|s| {
-let mut chars = s.chars();
-chars.next().map_or(0, |v| v as i32)
-})
+let values: Vec<_> = (0..array.len())
+.map(|i| {
+if array.is_null(i) {
+0

Review Comment:
   Would it be legal to omit this `if` (and the jump it implies)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Optimize performance of `string::ascii` function [datafusion]

2025-05-19 Thread via GitHub


tlm365 commented on code in PR #16087:
URL: https://github.com/apache/datafusion/pull/16087#discussion_r2095176682


##
datafusion/functions/src/string/ascii.rs:
##
@@ -103,19 +106,29 @@ impl ScalarUDFImpl for AsciiFunc {
 
 fn calculate_ascii<'a, V>(array: V) -> Result
 where
-V: ArrayAccessor,
+V: StringArrayType<'a, Item = &'a str>,
 {
-let iter = ArrayIter::new(array);
-let result = iter
-.map(|string| {
-string.map(|s| {
-let mut chars = s.chars();
-chars.next().map_or(0, |v| v as i32)
-})
-})
-.collect::();
-
-Ok(Arc::new(result) as ArrayRef)
+let mut values = Vec::with_capacity(array.len());

Review Comment:
   I updated the code and benchmark. Already tested this version:
   ```rust
   fn calculate_ascii<'a, V>(array: V) -> Result
   where
   V: StringArrayType<'a, Item = &'a str>,
   {
   let nulls = array.nulls();
   
   let values: Vec = if nulls.map_or(false, |n| n.null_count() > 0) {
   // Nulls present: handle each element with null-check
   (0..array.len())
   .map(|i| {
   if array.is_null(i) {
   0
   } else {
   let s = array.value(i);
   s.chars().next().map_or(0, |c| c as i32)
   }
   })
   .collect()
   } else {
   // Fast path: no null check needed
   (0..array.len())
   .map(|i| {
   let s = unsafe { array.value_unchecked(i) };
   s.chars().next().map_or(0, |c| c as i32)
   })
   .collect()
   };
   
   let array = Int32Array::new(values.into(), nulls.cloned());
   Ok(Arc::new(array))
   }
   ```
   but the performance difference is not significant, so I chose the current 
version to keep the code simple.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Optimize performance of `string::ascii` function [datafusion]

2025-05-19 Thread via GitHub


Dandandan commented on code in PR #16087:
URL: https://github.com/apache/datafusion/pull/16087#discussion_r2095061480


##
datafusion/functions/src/string/ascii.rs:
##
@@ -103,19 +106,29 @@ impl ScalarUDFImpl for AsciiFunc {
 
 fn calculate_ascii<'a, V>(array: V) -> Result
 where
-V: ArrayAccessor,
+V: StringArrayType<'a, Item = &'a str>,
 {
-let iter = ArrayIter::new(array);
-let result = iter
-.map(|string| {
-string.map(|s| {
-let mut chars = s.chars();
-chars.next().map_or(0, |v| v as i32)
-})
-})
-.collect::();
-
-Ok(Arc::new(result) as ArrayRef)
+let mut values = Vec::with_capacity(array.len());

Review Comment:
   Even faster should be (using `collect` rather than `push` and avoiding copy 
by using `into`):
   
   (Didn't compile the code, but it should look like this)
   
   ```
   let values: Vec<_> = (0..array.len()).map(|i| {
 if array.is_null(i) {
   0
   } else {
   let s = array.value(i);
   s.chars().next().map_or(0, |c| c as i32)
   }
   }).collect();
   
   let array = Int32Array::new(values.into(), array.nulls().cloned());
   Ok(Arc::new(array))
   ```
   
   Futhermore, you can specialize for non-null arrays:
   
   ```
   let values: Vec<_> = 
   match array.nulls().filter(|n| n.null_count > 0) {
Some(nulls) {
// existing code
},
None => {
// skip null check
(0..array.len()).map(|i| {
let s = array.value(i);
s.chars().next().map_or(0, |c| c as i32)
}
   }).collect()
 }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Optimize performance of `string::ascii` function [datafusion]

2025-05-19 Thread via GitHub


tlm365 commented on code in PR #16087:
URL: https://github.com/apache/datafusion/pull/16087#discussion_r2095013248


##
datafusion/functions/src/string/ascii.rs:
##
@@ -103,19 +103,21 @@ impl ScalarUDFImpl for AsciiFunc {
 
 fn calculate_ascii<'a, V>(array: V) -> Result
 where
-V: ArrayAccessor,
+V: StringArrayType<'a, Item = &'a str>,
 {
-let iter = ArrayIter::new(array);
-let result = iter
-.map(|string| {
-string.map(|s| {
-let mut chars = s.chars();
-chars.next().map_or(0, |v| v as i32)
-})
-})
-.collect::();
-
-Ok(Arc::new(result) as ArrayRef)
+let mut builder = Int32Builder::with_capacity(array.len());

Review Comment:
   @Dandandan Thanks for reviewing,
   
   > even faster is to avoid the builder, by collecting into a `Vec` and 
clone the incoming bitmap.
   
   Nice, I checked again and that's true



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Optimize performance of `string::ascii` function [datafusion]

2025-05-19 Thread via GitHub


Dandandan commented on code in PR #16087:
URL: https://github.com/apache/datafusion/pull/16087#discussion_r2094992836


##
datafusion/functions/src/string/ascii.rs:
##
@@ -103,19 +103,21 @@ impl ScalarUDFImpl for AsciiFunc {
 
 fn calculate_ascii<'a, V>(array: V) -> Result
 where
-V: ArrayAccessor,
+V: StringArrayType<'a, Item = &'a str>,
 {
-let iter = ArrayIter::new(array);
-let result = iter
-.map(|string| {
-string.map(|s| {
-let mut chars = s.chars();
-chars.next().map_or(0, |v| v as i32)
-})
-})
-.collect::();
-
-Ok(Arc::new(result) as ArrayRef)
+let mut builder = Int32Builder::with_capacity(array.len());

Review Comment:
   even faster is to avoid the builder, by collecting into a `Vec` and 
clone the incoming bitmap.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org