kosiew opened a new pull request, #8415:
URL: https://github.com/apache/arrow-rs/pull/8415

   
   # Which issue does this PR close?
   
   Closes #8403.
   
   ---
   
   # Rationale for this change
   
   Casting from `BinaryView` to `Utf8View` currently attempts a direct 
conversion using `to_string_view()`, which returns an error if any value 
contains invalid UTF‑8. This behavior is inconsistent with the other binary 
array types in Arrow, which honor `CastOptions.safe = true` by replacing values 
that contain invalid UTF‑8 with `NULL` rather than failing the entire cast 
operation.
   
   This PR makes `BinaryView`'s casting behavior consistent with other binary 
types and with user expectations: when `CastOptions.safe` is `true`, values 
containing invalid UTF‑8 become `NULL` in the resulting `StringViewArray`; when 
`CastOptions.safe` is `false`, the cast still fails as it does today.
   
   ---
   
   # What changes are included in this PR?
   
   * Change `cast_with_options` to delegate the `BinaryView -> Utf8View` branch 
to a new helper function `cast_binary_view_to_string_view(array, cast_options)` 
rather than calling `to_string_view()` directly and returning its error.
   
   * Add `extend_valid_utf8` helper to centralize the logic of mapping 
`Option<&[u8]>` to `Option<&str>` (using `std::str::from_utf8(...).ok()`), and 
reuse it for both `GenericStringBuilder` and `StringViewBuilder` flows.
   
   * Implement `cast_binary_view_to_string_view` which:
   
     * Attempts `array.clone().to_string_view()` (fast, zero-copy path) and 
returns it when `Ok`.
     * On `Err`, checks `cast_options.safe`:
   
       * If `true`, builds a `StringViewArray` by filtering invalid UTF‑8 to 
`NULL` using `extend_valid_utf8` and returns that array.
       * If `false`, propagates the original error (existing behavior).
   
   * Add a unit test `test_binary_view_to_string_view_with_invalid_utf8` 
covering both `safe=false` (expect error) and `safe=true` (expect `NULL` where 
invalid UTF‑8 occurred).
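
The control flow described above can be sketched with the standard library alone. This is a simplified model, not the code in this PR: plain `Vec<Option<...>>` values stand in for `BinaryViewArray` and `StringViewArray`, the `CastOptions` struct here is a hypothetical stand-in that models only the `safe` flag, and `valid_utf8` is an illustrative name that plays the role of `extend_valid_utf8` by returning a `Vec` instead of extending a builder. The strict pass below validates each value; it does not reproduce the zero-copy behavior of `to_string_view()`.

```rust
use std::str::Utf8Error;

/// Hypothetical stand-in for arrow's `CastOptions`; only `safe` is modeled.
struct CastOptions {
    safe: bool,
}

/// Maps each optional byte slice to an optional `&str`, turning values
/// with invalid UTF-8 into `None` via `std::str::from_utf8(...).ok()`.
fn valid_utf8<'a>(values: &[Option<&'a [u8]>]) -> Vec<Option<&'a str>> {
    values
        .iter()
        .map(|&v| v.and_then(|bytes| std::str::from_utf8(bytes).ok()))
        .collect()
}

/// Models the described cast: try the strict conversion first, and only
/// fall back to null-filling when it fails and `safe` is set.
fn cast_binary_view_to_string_view<'a>(
    values: &[Option<&'a [u8]>],
    options: &CastOptions,
) -> Result<Vec<Option<&'a str>>, Utf8Error> {
    // Strict pass: stops at the first value that is not valid UTF-8.
    let strict: Result<Vec<Option<&'a str>>, Utf8Error> = values
        .iter()
        .map(|&v| v.map(|bytes| std::str::from_utf8(bytes)).transpose())
        .collect();

    match strict {
        // Every value was valid: return the converted values unchanged.
        Ok(strings) => Ok(strings),
        // safe = true: replace invalid values with None (NULL).
        Err(_) if options.safe => Ok(valid_utf8(values)),
        // safe = false: propagate the original error.
        Err(e) => Err(e),
    }
}
```

Under these assumptions, an input containing the bytes `0xff 0xfe` produces an `Err` when `safe` is `false`, and a `None` in that position when `safe` is `true`; existing nulls and valid values pass through unchanged in both modes.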
   
   Files changed (high level):
   
   * `arrow-cast/src/cast/mod.rs`: route `BinaryView -> Utf8View` case to the 
new helper.
   * `arrow-cast/src/cast/string.rs`: add `extend_valid_utf8` and 
`cast_binary_view_to_string_view`, and use `extend_valid_utf8` from an existing 
cast path.
   
   ---
   
   # Are there any user-facing changes?
   
   Yes, this changes the observable behavior of casting `BinaryView` to 
`Utf8View`:
   
   * With `CastOptions.safe = true` (the safe mode), `BinaryView` elements 
containing invalid UTF‑8 are converted to `NULL` in the resulting `Utf8View` 
array instead of causing the entire cast to fail.
   * With `CastOptions.safe = false`, invalid UTF‑8 still causes the cast to 
fail as before.
   
   This is a bug fix aligning `BinaryView` with the semantics of other binary 
types and with documented expectations for `CastOptions.safe`.
   
   No public API surface is changed beyond the fixed behavior; the new helpers 
are crate-private.
   
   


