mapleFU opened a new issue, #36139:
URL: https://github.com/apache/arrow/issues/36139

   ### Describe the enhancement requested
   
   Currently, in Parquet C++, when the column type is string / binary:
   1. While values are written, the min and max are collected.
   2. When the statistics are built, a length limit is applied, which discards the min and max if they are longer than expected.
   
   Could we truncate the min and max instead of discarding them?
   1. For BINARY
      1. For the minimum, truncating to the target length is enough, since a prefix never sorts after the full value.
      2. For the maximum:
         1. If the truncated binary would be all 0xFF bytes (0xFF 0xFF ... 0xFF), we cannot truncate it.
         2. Otherwise, use the "next" valid binary after the truncated one.
   2. For String
      1. For the minimum, truncating to **valid UTF-8** is enough.
      2. For the maximum, first truncate to valid UTF-8, then try to advance it.
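
A rough sketch of the rules above, to make the proposal concrete. All function names here (`TruncateBinaryMin`, `TruncateBinaryMax`, `TruncateUtf8Min`, `TruncateUtf8Max`) are hypothetical and not part of the Parquet C++ API. The sketch assumes unsigned byte-wise ordering, well-formed UTF-8 input, and one possible way to "advance" a value (drop what cannot be incremented, then bump the last remaining byte or code point); the referenced parquet-mr and arrow-rs changes may differ in detail.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical helpers for illustration only; std::nullopt means
// "cannot truncate, keep or drop the statistic as today".

// BINARY minimum: a prefix is always <= the full value.
std::string TruncateBinaryMin(const std::string& value, std::size_t max_len) {
  return value.substr(0, max_len);
}

// BINARY maximum: truncate, strip trailing 0xFF bytes, increment the last byte.
std::optional<std::string> TruncateBinaryMax(const std::string& value,
                                             std::size_t max_len) {
  if (value.size() <= max_len) return value;  // nothing to truncate
  std::string prefix = value.substr(0, max_len);
  while (!prefix.empty() && static_cast<unsigned char>(prefix.back()) == 0xFF) {
    prefix.pop_back();
  }
  if (prefix.empty()) return std::nullopt;  // prefix was all 0xFF
  prefix.back() = static_cast<char>(static_cast<unsigned char>(prefix.back()) + 1);
  return prefix;
}

// True for UTF-8 continuation bytes (10xxxxxx).
bool IsContinuation(char c) {
  return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// String minimum: back off to a code point boundary at or before max_len.
std::string TruncateUtf8Min(const std::string& value, std::size_t max_len) {
  if (value.size() <= max_len) return value;
  std::size_t len = max_len;
  while (len > 0 && IsContinuation(value[len])) --len;
  return value.substr(0, len);
}

// Encode one Unicode scalar value as UTF-8.
std::string EncodeUtf8(std::uint32_t cp) {
  std::string out;
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}

// String maximum: truncate to valid UTF-8, then advance the last code point.
// If that code point cannot be advanced within the same byte width, drop it
// and try the previous one.
std::optional<std::string> TruncateUtf8Max(const std::string& value,
                                           std::size_t max_len) {
  if (value.size() <= max_len) return value;
  std::string prefix = TruncateUtf8Min(value, max_len);
  while (!prefix.empty()) {
    // Locate and decode the last code point of the prefix.
    std::size_t start = prefix.size() - 1;
    while (start > 0 && IsContinuation(prefix[start])) --start;
    const std::size_t width = prefix.size() - start;
    std::uint32_t cp = static_cast<unsigned char>(prefix[start]);
    if (width > 1) cp &= 0xFFu >> (width + 1);  // keep payload bits of lead byte
    for (std::size_t i = start + 1; i < prefix.size(); ++i) {
      cp = (cp << 6) | (static_cast<unsigned char>(prefix[i]) & 0x3F);
    }
    ++cp;
    if (cp == 0xD800) cp = 0xE000;  // skip the surrogate range
    prefix.resize(start);           // remove the old code point
    if (cp <= 0x10FFFF) {
      const std::string next = EncodeUtf8(cp);
      if (next.size() == width) return prefix + next;
    }
  }
  return std::nullopt;  // no code point could be advanced
}
```

With these helpers, `TruncateBinaryMax("abcdef", 3)` yields `"abd"`, while a value whose first `max_len` bytes are all 0xFF yields no truncated maximum. Requiring the advanced code point to keep the same byte width is a simplification that guarantees the result never exceeds `max_len`.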
   
   
   References:
   1. https://github.com/apache/parquet-mr/pull/481
   2. https://github.com/apache/arrow-rs/pull/4389
   
   ### Component(s)
   
   C++, Parquet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
