kosiew opened a new pull request, #22990:
URL: https://github.com/apache/datafusion/pull/22990
## Which issue does this PR close?
* Part of #22688
## Rationale for this change
`GenericStringArrayBuilder` only exposed infallible append APIs that panic
when string offsets exceed the underlying offset type limits. String functions
such as `replace`, `replace_view`, and the generic `initcap` path relied on
these APIs, meaning extreme output sizes could panic instead of returning a
recoverable `DataFusionError`.
This change introduces fallible builder APIs and migrates selected string
UDFs to use them so offset overflow is reported as an error rather than causing
a panic.
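As a rough illustration of the difference between a panicking and a fallible offset conversion, here is a minimal standalone sketch. It does not reproduce the PR's code: `OffsetOverflow` and `checked_offset` are hypothetical names, and `i32` stands in for the offset type of a `Utf8` array.

```rust
/// Hypothetical error type for this sketch; the PR itself reports a `DataFusionError`.
#[derive(Debug, PartialEq)]
struct OffsetOverflow {
    requested: usize,
}

/// Convert a byte length into an `i32` offset, returning an error
/// instead of panicking when the length does not fit.
fn checked_offset(len: usize) -> Result<i32, OffsetOverflow> {
    i32::try_from(len).map_err(|_| OffsetOverflow { requested: len })
}
```

A caller can then propagate the failure with `?` rather than relying on an `expect` or an unchecked cast.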
## What changes are included in this PR?
* Add overflow-checked helper functions to `GenericStringArrayBuilder`:
  * `try_offset`
  * `try_push_offset_for_len`
  * `try_append_bytes`
* Add fallible append APIs:
  * `try_append_value`
  * `try_append_placeholder`
  * `try_append_byte_map`
  * `try_append_with`
* Introduce a shared overflow error path that returns a `DataFusionError`
instead of panicking.
* Keep existing infallible append APIs for compatibility while documenting
that new overflow-sensitive call sites should prefer the `try_*` variants.
* Refactor `replace` and `replace_view` to share a generic `replace_arrays`
implementation.
* Change `apply_replace` to return `Result<()>` and propagate errors from
builder operations.
* Update `replace`/`replace_view` to use the new fallible builder APIs and
thread errors with `?`.
* Update the generic `Utf8`/`LargeUtf8` path in `initcap` to use
`try_append_placeholder` and `try_append_value`.
* Add rollback handling in `try_append_with` so builder state is restored if
offset validation fails.
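The rollback idea behind `try_append_with` can be sketched with a toy builder. This is an illustration under stated assumptions, not the actual `GenericStringArrayBuilder`: the struct, its fields, the closure signature, and the artificially small `limit` (used so the overflow path can be reached without allocating gigabytes) are all hypothetical.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
struct OffsetOverflow;

/// Toy stand-in for a string array builder: one contiguous byte buffer
/// plus an offsets vector that starts with 0.
struct ToyStringBuilder {
    values: Vec<u8>,
    offsets: Vec<i32>,
    /// Assumed, artificially small offset limit for demonstration.
    limit: usize,
}

impl ToyStringBuilder {
    fn new(limit: usize) -> Self {
        Self { values: Vec::new(), offsets: vec![0], limit }
    }

    /// Let the caller write bytes directly into the buffer, then validate
    /// the resulting end offset. On failure, truncate the buffer back to
    /// where it was so no partial row is left behind.
    fn try_append_with<F>(&mut self, write: F) -> Result<(), OffsetOverflow>
    where
        F: FnOnce(&mut Vec<u8>),
    {
        let start = self.values.len();
        write(&mut self.values);
        let end = self.values.len();
        match i32::try_from(end) {
            Ok(offset) if end <= self.limit => {
                self.offsets.push(offset);
                Ok(())
            }
            _ => {
                // Roll back: discard the bytes written by this call.
                self.values.truncate(start);
                Err(OffsetOverflow)
            }
        }
    }

    /// Number of completed rows.
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }
}
```

In this sketch a failed append leaves `values` and `offsets` exactly as they were before the call, so the builder remains usable for later rows.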
## Are these changes tested?
Yes.
Added tests in `datafusion/functions/src/strings.rs`:
* `generic_string_builder_try_append_success_path`
* `generic_string_builder_mixed_append_success_path`
* `generic_string_builder_try_offset_overflow`
* `generic_string_builder_try_append_bytes_overflow`
Existing `replace` and `initcap` tests remain in place and the migrated code
paths continue to be exercised by those test suites.
## Are there any user-facing changes?
Yes.
For extreme string outputs that exceed the offset limits of the underlying
string array type, affected functions now return a `DataFusionError` instead of
panicking. Normal behavior and results are otherwise unchanged.
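From a caller's point of view, the change means an oversized result surfaces as an `Err` that can be handled. The following sketch shows the general shape of threading such an error through a replace-style loop with `?`. All names here (`ToyError`, `try_append`, `replace_all`) are hypothetical, and the byte `limit` is an assumed stand-in for the real offset limit.

```rust
/// Hypothetical error type; the real functions return a `DataFusionError`.
#[derive(Debug, PartialEq)]
enum ToyError {
    OffsetOverflow { needed: usize, limit: usize },
}

/// Hypothetical fallible append: refuses to grow past `limit` total bytes.
fn try_append(
    out: &mut Vec<String>,
    total: &mut usize,
    limit: usize,
    value: String,
) -> Result<(), ToyError> {
    let needed = *total + value.len();
    if needed > limit {
        return Err(ToyError::OffsetOverflow { needed, limit });
    }
    *total = needed;
    out.push(value);
    Ok(())
}

/// Replace `from` with `to` in every input, propagating any append error.
fn replace_all(
    inputs: &[&str],
    from: &str,
    to: &str,
    limit: usize,
) -> Result<Vec<String>, ToyError> {
    let mut out = Vec::with_capacity(inputs.len());
    let mut total = 0usize;
    for s in inputs {
        // `?` returns early with the error instead of panicking.
        try_append(&mut out, &mut total, limit, s.replace(from, to))?;
    }
    Ok(out)
}
```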
## LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content
has been manually reviewed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]