2010YOUY01 commented on code in PR #12444:
URL: https://github.com/apache/datafusion/pull/12444#discussion_r1759687997
##########
datafusion/functions/src/unicode/substr.rs:
##########
@@ -197,6 +213,29 @@ fn string_view_substr(
let start_array = as_int64_array(&args[0])?;
+ // Notes for ASCII-only optimization:
+ //
+ // String characters are variable length encoded in UTF-8, `substr()`
function's
+ // arguments are character-based, converting them into byte-based indices
+ // requires expensive decoding.
+ // However, checking if a string is ASCII-only is relatively cheap.
+ // If strings are ASCII only, use byte-based indices instead.
+ //
+ // A common pattern to call `substr()` is taking a small prefix of a long
+ // string, such as `substr(long_str_with_1k_chars, 1, 32)`.
+ // In such case the overhead of ASCII-validation may not be worth it, so
+ // skip the validation for long strings for now.
Review Comment:
> Why not check only the requested string prefix for being ascii? could
`string_view_array.is_ascii` variant validate string prefixes of given length
why still being vectorized?
I think it's a good idea for the current situation
However in the long term we might use an alternative approach: do validation
when reading arrays from storage to memory, and cache this `is_ascii` property
within the arrow array (as suggested by @alamb
https://github.com/apache/datafusion/pull/12444#pullrequestreview-2304056515)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]