neilconway opened a new pull request, #22029:
URL: https://github.com/apache/datafusion/pull/22029
## Which issue does this PR close?
- Closes #21997 (potentially).
## Rationale for this change
This PR adds two new APIs to `GenericStringArrayBuilder` and
`StringViewArrayBuilder`:
1. `append_with` appends a row whose bytes are produced by a closure that
receives a `StringWriter`.
2. `append_byte_map` appends a row whose bytes are produced by applying a
byte-to-byte map closure to each byte of the input.
For `StringViewArrayBuilder`, `StringWriter` is an append-only string writer
that automatically switches between writing to a new inline view (for short
strings) and writing to the in-progress data block. For
`GenericStringArrayBuilder`, `StringWriter` simply appends to the value buffer
directly.
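To make the inline-versus-data-block switch concrete, here is a minimal sketch of one way such a writer could behave. It is not the code in this PR: `ToyViewBuilder`, `ViewWriter`, `View`, and `write_str` are hypothetical names, and the layout is simplified (a real Arrow view for a long string also stores a 4-byte prefix and a buffer index). The only fact taken from Arrow is that strings of at most 12 bytes fit inline in a view.

```rust
/// Arrow string views store values of at most 12 bytes inline.
const MAX_INLINE: usize = 12;

/// Simplified stand-in for a 16-byte Arrow view (hypothetical type).
#[derive(Debug, PartialEq)]
enum View {
    Inline { len: u8, bytes: [u8; MAX_INLINE] },
    Block { offset: usize, len: usize },
}

/// Append-only writer: stays inline while the row is short, then spills
/// to the shared data block the first time the row outgrows 12 bytes.
struct ViewWriter<'a> {
    block: &'a mut Vec<u8>,
    inline: [u8; MAX_INLINE],
    len: usize,
    spilled: bool,
}

impl ViewWriter<'_> {
    fn write_str(&mut self, s: &str) {
        let bytes = s.as_bytes();
        if !self.spilled && self.len + bytes.len() <= MAX_INLINE {
            // Still short enough: keep accumulating in the inline scratch.
            self.inline[self.len..self.len + bytes.len()].copy_from_slice(bytes);
        } else {
            if !self.spilled {
                // First overflow: move what was written so far into the block.
                self.block.extend_from_slice(&self.inline[..self.len]);
                self.spilled = true;
            }
            self.block.extend_from_slice(bytes);
        }
        self.len += bytes.len();
    }
}

#[derive(Default)]
struct ToyViewBuilder {
    views: Vec<View>,
    block: Vec<u8>,
}

impl ToyViewBuilder {
    /// Appends one row whose bytes are produced by `f` writing into the writer.
    fn append_with<F: FnOnce(&mut ViewWriter<'_>)>(&mut self, f: F) {
        let offset = self.block.len();
        let mut w = ViewWriter {
            block: &mut self.block,
            inline: [0; MAX_INLINE],
            len: 0,
            spilled: false,
        };
        f(&mut w);
        let view = if w.spilled {
            View::Block { offset, len: w.len }
        } else {
            View::Inline { len: w.len as u8, bytes: w.inline }
        };
        self.views.push(view);
    }

    fn value(&self, i: usize) -> &str {
        let bytes = match &self.views[i] {
            View::Inline { len, bytes } => &bytes[..*len as usize],
            View::Block { offset, len } => &self.block[*offset..*offset + *len],
        };
        // Concatenating `&str` pieces always yields valid UTF-8.
        std::str::from_utf8(bytes).expect("writer only accepts &str")
    }
}
```

With this sketch, a row written as `"foo"` then `"bar"` never touches the data block, while a row written as `"hello, "` then `"wide world"` starts inline and is moved to the block on the second write.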
(We need two new APIs because `append_byte_map` vectorizes much better than
`append_with`, so callers that fit the byte-to-byte map pattern should prefer
it.)
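The shape of a byte-map append can be sketched as follows. This is an illustration on a toy offsets-plus-values builder, not the PR's implementation: `ToyStringBuilder` and its method bodies are assumptions, and only the method name `append_byte_map` comes from the description above. The point is that the per-byte loop has a fixed output length and no per-write bookkeeping, which is the kind of loop compilers tend to auto-vectorize.

```rust
/// Toy analogue of an offsets + values string builder (hypothetical type).
struct ToyStringBuilder {
    offsets: Vec<i32>,
    values: Vec<u8>,
}

impl ToyStringBuilder {
    fn new() -> Self {
        // Offsets always start with 0; row i spans offsets[i]..offsets[i + 1].
        Self { offsets: vec![0], values: Vec::new() }
    }

    /// Appends one row by mapping every input byte through `map`.
    /// The caller must make sure the mapped bytes are still valid UTF-8,
    /// for example by only mapping ASCII bytes to ASCII bytes.
    fn append_byte_map<F: Fn(u8) -> u8>(&mut self, input: &[u8], map: F) {
        // Output length equals input length, so this is a single tight loop.
        self.values.extend(input.iter().map(|&b| map(b)));
        self.offsets.push(self.values.len() as i32);
    }

    fn value(&self, i: usize) -> &str {
        let start = self.offsets[i] as usize;
        let end = self.offsets[i + 1] as usize;
        std::str::from_utf8(&self.values[start..end])
            .expect("map must preserve UTF-8 validity")
    }
}
```

For example, `builder.append_byte_map(b"banana", |b| if b == b'a' { b'b' } else { b })` appends the row `bbnbnb` without building a temporary `String` first.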
Both of these new APIs allow string UDFs to avoid creating an intermediate
data copy in many cases. To illustrate this, this PR adopts the new APIs in
`replace`.
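As a rough illustration of how a `replace` kernel can avoid the intermediate copy, the sketch below writes each output row straight into a destination buffer and picks between a byte-map path and a general path. `replace_into` is a hypothetical helper, the `Vec<u8>` stands in for the builder's value buffer, and the semantics simply mirror Rust's `str::replace`; the actual dispatch logic in the PR may differ.

```rust
/// Writes `s` with every occurrence of `from` replaced by `to` directly into
/// `out`, without materializing an intermediate `String` (hypothetical helper).
fn replace_into(out: &mut Vec<u8>, s: &str, from: &str, to: &str) {
    match (from.as_bytes(), to.as_bytes()) {
        // Byte-map path: `from` and `to` are each one byte. A one-byte `&str`
        // is always ASCII, and ASCII bytes never occur inside a multi-byte
        // UTF-8 sequence, so a bytewise substitution keeps the row valid.
        (&[f], &[t]) => {
            out.extend(s.bytes().map(|b| if b == f { t } else { b }));
        }
        // General path: copy the unmatched segments and the replacement
        // pieces one after another into the destination.
        _ => {
            let mut last_end = 0;
            for (start, part) in s.match_indices(from) {
                out.extend_from_slice(s[last_end..start].as_bytes());
                out.extend_from_slice(to.as_bytes());
                last_end = start + part.len();
            }
            out.extend_from_slice(s[last_end..].as_bytes());
        }
    }
}
```

In builder terms, the first arm corresponds to the `append_byte_map` pattern and the second to a closure passed to `append_with` that calls the writer once per segment.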
Benchmarks (Arm64):
ASCII single-byte `from` (byte-map path)
- size=1024 str_len=32 nulls=0 : 15.2 µs -> 12.7 µs (−16.4%)
- size=1024 str_len=32 nulls=0.2 : 13.8 µs -> 12.0 µs (−13.1%)
- size=1024 str_len=128 nulls=0 : 10.8 µs -> 8.0 µs (−26.6%)
- size=1024 str_len=128 nulls=0.2 : 10.6 µs -> 7.7 µs (−27.0%)
- size=4096 str_len=32 nulls=0 : 59.7 µs -> 48.4 µs (−18.9%)
- size=4096 str_len=32 nulls=0.2 : 53.0 µs -> 46.1 µs (−13.0%)
- size=4096 str_len=128 nulls=0 : 40.7 µs -> 30.7 µs (−24.6%)
- size=4096 str_len=128 nulls=0.2 : 38.8 µs -> 28.0 µs (−27.9%)
Multi-byte `from`, `StringArray` (writer general path)
- size=1024 str_len=32 nulls=0 : 24.4 µs -> 20.9 µs (−14.5%)
- size=1024 str_len=32 nulls=0.2 : 19.0 µs -> 16.6 µs (−12.7%)
- size=1024 str_len=128 nulls=0 : 39.8 µs -> 34.5 µs (−13.4%)
- size=1024 str_len=128 nulls=0.2 : 31.2 µs -> 28.0 µs (−10.1%)
- size=4096 str_len=32 nulls=0 : 99.4 µs -> 83.6 µs (−15.9%)
- size=4096 str_len=32 nulls=0.2 : 78.2 µs -> 67.6 µs (−13.5%)
- size=4096 str_len=128 nulls=0 : 180.9 µs -> 160.3 µs (−11.4%)
- size=4096 str_len=128 nulls=0.2 : 137.4 µs -> 124.3 µs (−9.5%)
Multi-byte `from`, `StringViewArray` (writer general path)
- size=1024 str_len=32 nulls=0 : 24.7 µs -> 21.2 µs (−14.0%)
- size=1024 str_len=32 nulls=0.2 : 19.4 µs -> 17.0 µs (−12.3%)
- size=1024 str_len=128 nulls=0 : 39.6 µs -> 34.7 µs (−12.6%)
- size=1024 str_len=128 nulls=0.2 : 31.9 µs -> 28.3 µs (−11.0%)
- size=4096 str_len=32 nulls=0 : 100.1 µs -> 84.0 µs (−16.1%)
- size=4096 str_len=32 nulls=0.2 : 79.9 µs -> 69.7 µs (−12.9%)
- size=4096 str_len=128 nulls=0 : 177.5 µs -> 158.1 µs (−10.9%)
- size=4096 str_len=128 nulls=0.2 : 139.3 µs -> 127.3 µs (−8.6%)
## What changes are included in this PR?
* Add `append_byte_map` and `append_with` to both of the bulk-NULL string
builders
* Add unit tests
* Adopt the new APIs in `replace`
## Are these changes tested?
Yes; new tests added.
## Are there any user-facing changes?
No.