neilconway opened a new pull request, #21442:
URL: https://github.com/apache/datafusion/pull/21442
## Which issue does this PR close?
- Closes #21441.
## Rationale for this change
This PR makes two distinct optimizations to the `left` and `right` built-in
UDFs:
1. The `left` and `right` built-in UDFs have a zero-copy path for `Utf8View`
input, but they always copy for `Utf8` and `LargeUtf8` inputs. If we make these
functions always return `Utf8View`, we can add a zero-copy path for `Utf8` and
`LargeUtf8` inputs as well. We can't take this path when the largest
offset in the input string array is > 4GB, but that is rare. This follows the
recent optimization for `substr` (#21366).
2. For `Utf8View` input, we were constructing the return value via
`StringViewArray::try_new`, which does some fairly expensive validation. We
know the return value is correct by construction, so we can use
`StringViewArray::new_unchecked` instead.
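To illustrate why returning `Utf8View` allows a zero-copy path, here is a minimal, std-only sketch of how a view over an existing values buffer could be built for `left`. It is not the code in this PR: the names `left_byte_len`, `make_view`, and `left_views` are hypothetical, only non-negative `n` is modeled, and the 16-byte layout (length, then either up to 12 inline bytes or prefix, buffer index, and offset, all little-endian 32-bit fields) is written out by hand rather than taken from the `arrow` crate.

```rust
/// Byte length of the first `n` characters of `s`, or of the whole string
/// if it has fewer than `n` characters. Hypothetical helper; the negative
/// `n` case that `left` also supports is not modeled here.
fn left_byte_len(s: &str, n: usize) -> usize {
    s.char_indices().nth(n).map_or(s.len(), |(idx, _)| idx)
}

/// Encode a 16-byte string view for `data[offset..offset + len]`.
/// Returns `None` when the length or offset does not fit in 32 bits, which
/// is the case where a caller would have to fall back to a copying path.
fn make_view(data: &[u8], buffer_index: u32, offset: usize, len: usize) -> Option<u128> {
    let len32 = u32::try_from(len).ok()?;
    let bytes = &data[offset..offset + len];
    let mut raw = [0u8; 16];
    raw[0..4].copy_from_slice(&len32.to_le_bytes());
    if len <= 12 {
        // Short strings are stored inline in the view, zero-padded.
        raw[4..4 + len].copy_from_slice(bytes);
    } else {
        // Long strings keep a 4-byte prefix and point into the shared
        // buffer, so no string data is copied.
        let offset32 = u32::try_from(offset).ok()?;
        raw[4..8].copy_from_slice(&bytes[..4]);
        raw[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        raw[12..16].copy_from_slice(&offset32.to_le_bytes());
    }
    Some(u128::from_le_bytes(raw))
}

/// Build views for `left(s, n)` over a `Utf8`-style array, given as a
/// values buffer plus row offsets. The values buffer itself is reused as
/// data buffer 0 of the result. UTF-8 is re-validated per row only to keep
/// the sketch safe; a real implementation would not need to do that.
fn left_views(values: &[u8], offsets: &[usize], n: usize) -> Option<Vec<u128>> {
    offsets
        .windows(2)
        .map(|w| {
            let s = std::str::from_utf8(&values[w[0]..w[1]]).ok()?;
            make_view(values, 0, w[0], left_byte_len(s, n))
        })
        .collect()
}
```

In this sketch, a `None` result corresponds to the rare oversized-offset case described in item 1, where the zero-copy path cannot be used.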
Benchmarks (ARM64):
```
- left/string short_result: 179.6µs → 127.1µs (-29.2%)
- left/string long_result: 324.3µs → 262.2µs (-19.1%)
- left/string_view short_result: 220.9µs → 122.5µs (-44.5%)
- left/string_view long_result: 383.1µs → 212.0µs (-44.7%)
- right/string short_result: 180.4µs → 126.0µs (-30.2%)
- right/string long_result: 392.0µs → 343.9µs (-12.3%)
- right/string_view short_result: 228.7µs → 125.3µs (-45.2%)
- right/string_view long_result: 393.6µs → 238.0µs (-39.5%)
```
## What changes are included in this PR?
* Update benchmarks to measure both inline and out-of-line string results
* Change `left` and `right` return types to be `Utf8View`
* Optimize `left` and `right` string array path to do zero-copy when possible
* Optimize `left` and `right` string view path, and refactor it to be more
similar to the array path
* Add more SLT tests to cover modified code paths
* Update various test expectations to reflect the new return type
## Are these changes tested?
Yes; benchmarked and new tests added.
## Are there any user-facing changes?
The return type of these functions has changed to `Utf8View`. This shouldn't typically
break any user logic, although it might result in the planner inserting or
removing casts for downstream operators, and the performance of downstream
operators might either be better or worse, depending on whether the downstream
code is better suited for `Utf8` or `Utf8View` string representations.