msalib opened a new issue, #3228:
URL: https://github.com/apache/arrow-rs/issues/3228

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   When I filed https://github.com/apache/arrow-rs/issues/3123, I was surprised 
to discover that concatenating lots of `Utf8` elements is supposed to panic 
when the total size is over 2 GB, even though the individual sizes are much 
smaller. That constraint was really unexpected! It makes sense once you 
understand the storage model (a `Utf8` array indexes all of its text with 
32-bit offsets), but I didn't.
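
To make the storage model concrete, here is a minimal sketch in plain Rust that does not use the arrow crate at all. `utf8_offsets` is a hypothetical helper, not an arrow-rs API; it only mimics the idea that a `Utf8` array keeps one running 32-bit signed offset per element, so the sum of all element sizes has to fit in an `i32`.

```rust
/// Hypothetical helper (not part of arrow-rs): build the offsets buffer that a
/// `Utf8`-style array would need for elements of the given byte lengths.
/// Offsets are 32-bit signed, so `None` is returned as soon as the running
/// total of bytes no longer fits in an `i32`.
fn utf8_offsets(byte_lens: &[usize]) -> Option<Vec<i32>> {
    let mut offsets = Vec::with_capacity(byte_lens.len() + 1);
    let mut end: i32 = 0;
    offsets.push(end); // the first offset is always 0
    for &len in byte_lens {
        // Do the addition in u64 so the overflow can be detected explicitly.
        let next = (end as u64).checked_add(len as u64)?;
        if next > i32::MAX as u64 {
            return None; // total text no longer addressable with i32 offsets
        }
        end = next as i32;
        offsets.push(end);
    }
    Some(offsets)
}
```

With this sketch, two elements of 1 GiB each already fail, and so do 3,000,000 elements of 1 KiB each, even though no single element is anywhere near the limit.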
   
   **Describe the solution you'd like**
   
   I'm not sure how to surface this knowledge better. When I first skimmed the 
data type docs, I walked away thinking that `LargeUtf8` is for cases where an 
individual element is large (I wasn't even clear that large meant > 2 GB) and 
that I should use `Utf8` for everything else. But I should've understood the 
constraint as "use `LargeUtf8` everywhere except places where you can guarantee 
that you'll never have an array with more than 2 GB of text total".
   
   Maybe we just need a big statement in the
[Physical Memory Layout](https://arrow.apache.org/docs/format/Columnar.html#physical-memory-layout)
guide and the
[`DataType`](https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Utf8)
doc string explaining that you cannot ever build an array where the total text 
size is over 2 GB if you use `Utf8`.
   
   **Describe alternatives you've considered**
   
   This feels like a landmine and I wish Arrow could transparently convert 
between these types as needed. Ideally there should just be a `Utf8` type that 
internally specifies what type it uses to manage offsets.
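
As a rough illustration of the "pick the offset type internally" idea, the decision itself is tiny. Everything here is hypothetical (`OffsetWidth` and `offset_width_for` are names made up for this sketch, not arrow-rs types), and it ignores the real complication that `Utf8` and `LargeUtf8` are distinct data types in Arrow.

```rust
/// Hypothetical: which offset width a string array would need.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum OffsetWidth {
    /// 32-bit offsets, as used by `Utf8`.
    I32,
    /// 64-bit offsets, as used by `LargeUtf8`.
    I64,
}

/// Hypothetical: choose the narrowest offset width that can address
/// `total_bytes` bytes of text in a single array.
fn offset_width_for(total_bytes: u64) -> OffsetWidth {
    if total_bytes <= i32::MAX as u64 {
        OffsetWidth::I32
    } else {
        OffsetWidth::I64
    }
}
```

The point is only that the choice depends on the total text size of the array, never on the size of an individual element.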
   
   Alternatively, I wish the `concat` kernel could return a more explicit 
failure message by explicitly checking for this sort of overflow, something 
like "I've been asked to concat 2 `Utf8` arrays into an array that will be over 
2 GB and I cannot do that: these arrays need to be `LargeUtf8` instead". I 
mean, when you're doing the concatenation, you can check lengths explicitly 
ahead of time.
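
Something like the following is what I have in mind for the up-front check. It is a sketch in plain Rust, not the actual `concat` kernel: `check_utf8_concat` is a hypothetical function, its input is assumed to be the number of text bytes in each input array, and the error wording is only a suggestion.

```rust
/// Hypothetical pre-flight check (not an arrow-rs API): given the number of
/// text bytes held by each input `Utf8` array, return the combined size if the
/// result can still use 32-bit offsets, or a descriptive error otherwise.
fn check_utf8_concat(value_bytes: &[usize]) -> Result<i32, String> {
    // Sum in u64 (saturating) so the check itself can never overflow.
    let total: u64 = value_bytes
        .iter()
        .map(|&n| n as u64)
        .fold(0u64, u64::saturating_add);
    if total > i32::MAX as u64 {
        Err(format!(
            "cannot concat {} `Utf8` arrays: the result would hold {} bytes of \
             text, more than the {} bytes that 32-bit offsets can address; \
             these arrays need to be `LargeUtf8` instead",
            value_bytes.len(),
            total,
            i32::MAX
        ))
    } else {
        Ok(total as i32)
    }
}
```

Calling this before any buffers are allocated would turn the overflow into an ordinary error that tells the user what to change, instead of a panic partway through the concatenation.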


