ethan-tyler commented on PR #9220:
URL: https://github.com/apache/arrow-rs/pull/9220#issuecomment-3799750405

   
   > I think you can use `StringViewBuilder` to do this: https://docs.rs/arrow/latest/arrow/array/type.StringViewBuilder.html
   > 
   > The one thing that might be tricky is knowing what the pre-existing index was
   > 
   > Specifically: https://docs.rs/arrow/latest/arrow/array/type.StringViewBuilder.html#method.with_deduplicate_strings
   > 
   > I am not sure what you mean by "invariants like prefix/offset correctness"
   
   You're right that `StringViewBuilder` supports `with_deduplicate_strings`, 
but that only deduplicates the backing storage of a view array. It doesn't 
provide the value→dictionary-index lookup ("get_or_insert") that we need to 
build the keys, and the API doesn't expose which prior entry was reused.
   
   Packing directly to `Dictionary<K, Utf8View>` would therefore need a view 
dictionary builder (or a `StringViewBuilder` API that returns the existing/new 
index without appending duplicates).
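   
   To make the missing piece concrete, here is a minimal std-only sketch of the "get_or_insert" lookup. `ValueInterner` and `dictionary_encode` are hypothetical names used for illustration only; they are not arrow-rs APIs, and a real builder would store values in Arrow buffers rather than a `Vec<String>`.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not an arrow-rs type): maps each distinct value to
/// the index at which it was first inserted.
#[derive(Default)]
struct ValueInterner {
    index_of: HashMap<String, usize>,
    values: Vec<String>,
}

impl ValueInterner {
    /// Returns the index of `value` if it was seen before; otherwise appends
    /// it to `values` and returns the newly assigned index.
    fn get_or_insert(&mut self, value: &str) -> usize {
        if let Some(&idx) = self.index_of.get(value) {
            return idx;
        }
        let idx = self.values.len();
        self.values.push(value.to_owned());
        self.index_of.insert(value.to_owned(), idx);
        idx
    }
}

/// Dictionary-encodes `input`, returning (keys, distinct values).
fn dictionary_encode(input: &[&str]) -> (Vec<usize>, Vec<String>) {
    let mut interner = ValueInterner::default();
    let keys: Vec<usize> = input.iter().map(|v| interner.get_or_insert(v)).collect();
    (keys, interner.values)
}
```

   The returned index is exactly what storage-level deduplication does not surface: each repeated input needs the key of the earlier entry, not just the guarantee that its bytes were not stored twice.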
   
   For this PR I kept the two-step path: pack to `Dictionary<K, Utf8/Binary>` 
(reusing `GenericByteDictionaryBuilder`), then cast to view values. When the 
dictionary values are `Utf8/Binary` and the offsets fit, `dictionary_cast` can 
use `view_from_dict_values` (`append_block` + views), so the values buffer is 
reused zero-copy. `Large*` or oversized values fall back to the general cast path.
   
   Also, "prefix/offset correctness" was poor wording. I just meant 
constructing valid `ByteView { prefix, buffer_index, offset, length }` 
descriptors, which the existing cast path already handles.
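   
   For reference, a rough std-only sketch of what "valid descriptors" over a reused values buffer means, following the 16-byte view layout in the Arrow format (values of at most 12 bytes are inlined; longer ones store a 4-byte prefix, buffer index, and offset). `make_view` and `views_from_offsets` are hypothetical names, the single buffer index `0` and the `u32` size check are assumptions of this sketch, and none of this is the code in the PR.

```rust
/// Values of at most this many bytes are stored inline in the view.
const MAX_INLINE_LEN: usize = 12;

/// Packs one 16-byte view (little-endian):
///   bytes 0..4   length
///   inline case: bytes 4..16 hold the value, zero-padded
///   long case:   bytes 4..8 prefix, 8..12 buffer_index, 12..16 offset
/// The caller must ensure `data.len()` fits in a u32.
fn make_view(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    if data.len() <= MAX_INLINE_LEN {
        bytes[4..4 + data.len()].copy_from_slice(data);
    } else {
        bytes[4..8].copy_from_slice(&data[0..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// Builds one view per value over an existing contiguous `values` buffer
/// described by `offsets`, without copying the long values. Returns `None`
/// when the buffer is too large for 32-bit offsets (the case where a
/// general, copying path would be needed instead).
fn views_from_offsets(values: &[u8], offsets: &[usize]) -> Option<Vec<u128>> {
    u32::try_from(values.len()).ok()?;
    Some(
        offsets
            .windows(2)
            .map(|w| make_view(&values[w[0]..w[1]], 0, w[0] as u32))
            .collect(),
    )
}
```

   Long values only reference the original buffer through `(buffer_index, offset, length)`, which is why the values buffer can be handed over as a single block instead of being rewritten.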


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.