alamb opened a new issue, #5910:
URL: https://github.com/apache/arrow-rs/issues/5910

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Part of implementing `StringView` 
https://github.com/apache/arrow-rs/issues/5374
   
   @XiangpengHao  implemented `gc` which compacts all the strings in a 
StringView/BinaryView into contiguous storage in 
https://github.com/apache/arrow-rs/issues/5513
   
   However, that functionality does not deduplicate/intern the strings -- it 
just copies them over
   
   
   **Describe the solution you'd like**
   
   We should make it easy to deduplicate the strings in a StringView. 
   
   I do not think we should change `gc` to do deduplication without an explicit 
ask (as deduplication is expensive)
   
   
   **Describe alternatives you've considered**
   1. Do nothing (users can implement their own version of this code without 
any additional APIs)
   2. Add a new function (e.g. `GenericBinaryView::dedupe`) that deduplicates 
such arrays (likely not moving any strings, but just updating views)
   3. Add an argument to `GenericBinaryView::gc` that controls the behavior 
(i.e. the caller could also request deduplication when doing gc)
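
   To make option 2 concrete, here is a minimal sketch of deduplication that only rewrites views and leaves the string data where it is. It uses a simplified model, not the arrow-rs types: `View` and `dedupe_views` are hypothetical names, and a view here is just an `(offset, len)` range into one shared buffer, whereas the real view layout is a packed 128-bit value that can inline short strings and reference multiple buffers.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a string view: a byte range into a shared buffer.
/// (Hypothetical type for illustration; not the arrow-rs representation.)
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct View {
    offset: usize,
    len: usize,
}

/// Returns new views in which every repeated string points at the bytes of
/// its first occurrence. The buffer is not copied or modified.
fn dedupe_views(buffer: &[u8], views: &[View]) -> Vec<View> {
    // Maps string contents to the first view seen with those contents.
    let mut seen: HashMap<&[u8], View> = HashMap::new();
    views
        .iter()
        .map(|v| {
            let bytes = &buffer[v.offset..v.offset + v.len];
            // Reuse the earlier view if these bytes were already seen.
            *seen.entry(bytes).or_insert(*v)
        })
        .collect()
}
```

   Under these assumptions the cost is one hash lookup per element, which is the expense that argues for keeping deduplication opt-in. After such a pass, bytes that are no longer referenced could be reclaimed by a later `gc`.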
   
   **Additional context**
   @alexwilcoxson-rel asked in  
https://github.com/apache/arrow-rs/issues/5904#issuecomment-2174386654
   
   > Can/will this incorporate deduping/interning/implicitly using the gc 
function that landed recently?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.