kosiew opened a new pull request, #8415:
URL: https://github.com/apache/arrow-rs/pull/8415
# Which issue does this PR close?
Closes #8403.
---
# Rationale for this change
Casting from `BinaryView` to `Utf8View` currently attempts a direct
conversion using `to_string_view()`, which returns an error if any value
contains invalid UTF‑8. This behavior is inconsistent with other binary array
types in Arrow, which honor `CastOptions.safe = true` by replacing invalid
UTF‑8 sequences with `NULL` values rather than failing the entire cast
operation.
This PR makes `BinaryView`'s casting behavior consistent with other binary
types and with user expectations: when `CastOptions.safe` is `true`, invalid
UTF‑8 bytes are replaced by `NULL` in the resulting `StringViewArray`; when
`CastOptions.safe` is `false`, the cast retains the existing failure behavior.
---
# What changes are included in this PR?
* Change `cast_with_options` to delegate the `BinaryView -> Utf8View` branch
to a new helper function `cast_binary_view_to_string_view(array, cast_options)`
instead of directly calling `to_string_view()` and erroring.
* Add `extend_valid_utf8` helper to centralize the logic of mapping
`Option<&[u8]>` to `Option<&str>` (using `std::str::from_utf8(...).ok()`), and
reuse it for both `GenericStringBuilder` and `StringViewBuilder` flows.
* Implement `cast_binary_view_to_string_view` which:
  * Attempts `array.clone().to_string_view()` (fast, zero-copy path) and
    returns it when `Ok`.
  * On `Err`, checks `cast_options.safe`:
    * If `true`, builds a `StringViewArray` by filtering invalid UTF‑8 to
      `NULL` using `extend_valid_utf8` and returns that array.
    * If `false`, propagates the original error (existing behavior).
* Add a unit test `test_binary_view_to_string_view_with_invalid_utf8`
covering both `safe=false` (expect error) and `safe=true` (expect `NULL` where
invalid UTF‑8 occurred).
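The control flow described above can be sketched with the standard library alone. This is an illustrative model, not the code in this PR: plain `Vec`s stand in for `BinaryViewArray` and `StringViewArray`, `cast_binary_to_string` is a hypothetical name, and the signature shown for `extend_valid_utf8` is an assumption based only on the description above.

```rust
use std::str::Utf8Error;

/// Assumed shape of `extend_valid_utf8`: push each value into `sink`,
/// mapping null or non-UTF-8 byte slices to `None`.
/// Any `Extend<Option<&str>>` sink works, which is what would allow one
/// helper to serve more than one builder type.
fn extend_valid_utf8<'a, S, I>(sink: &mut S, values: I)
where
    S: Extend<Option<&'a str>>,
    I: IntoIterator<Item = Option<&'a [u8]>>,
{
    sink.extend(
        values
            .into_iter()
            .map(|v| v.and_then(|bytes| std::str::from_utf8(bytes).ok())),
    );
}

/// Hypothetical model of the cast. `safe` plays the role of `CastOptions.safe`.
fn cast_binary_to_string<'a>(
    values: &[Option<&'a [u8]>],
    safe: bool,
) -> Result<Vec<Option<&'a str>>, Utf8Error> {
    // Strict attempt: succeeds only if every non-null value is valid UTF-8.
    // (In the real cast this is the zero-copy `to_string_view()` path.)
    let strict: Result<Vec<Option<&'a str>>, Utf8Error> = values
        .iter()
        .copied()
        .map(|v| v.map(std::str::from_utf8).transpose())
        .collect();

    match strict {
        Ok(strings) => Ok(strings),
        // Safe mode: rebuild the output, nulling out invalid values.
        Err(_) if safe => {
            let mut out: Vec<Option<&'a str>> = Vec::with_capacity(values.len());
            extend_valid_utf8(&mut out, values.iter().copied());
            Ok(out)
        }
        // Unsafe mode: propagate the original error.
        Err(e) => Err(e),
    }
}

fn main() {
    const VALID: &[u8] = b"valid";
    const INVALID: &[u8] = &[0xff, 0xfe]; // 0xff never appears in valid UTF-8

    let values = vec![Some(VALID), Some(INVALID), None];

    // safe = true: the invalid element becomes None, the rest are kept.
    println!("{:?}", cast_binary_to_string(&values, true));
    // safe = false: the whole cast fails.
    println!("{:?}", cast_binary_to_string(&values, false).is_err());
}
```

In this model the fallback pass runs only after the strict attempt has failed, so inputs that are entirely valid UTF‑8 never pay for the rebuild.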
Files changed (high level):
* `arrow-cast/src/cast/mod.rs`: route `BinaryView -> Utf8View` case to the
new helper.
* `arrow-cast/src/cast/string.rs`: add `extend_valid_utf8` and
`cast_binary_view_to_string_view`, and use `extend_valid_utf8` from an existing
cast path.
---
# Are there any user-facing changes?
Yes. This changes the observable behavior of casting `BinaryView` to
`Utf8View`:
* With `CastOptions.safe = true` (the safe mode), invalid UTF‑8 in
`BinaryView` elements will be converted to `NULL` in the resulting `Utf8View`
array instead of causing the entire cast to fail.
* With `CastOptions.safe = false`, invalid UTF‑8 still causes the cast to
fail, as before.
This is a bug fix aligning `BinaryView` with the semantics of other binary
types and with documented expectations for `CastOptions.safe`.
No public API surface is changed beyond the fixed behavior; the new helpers
are crate-private.