friendlymatthew opened a new issue, #7902:
URL: https://github.com/apache/arrow-rs/issues/7902

   This is a follow-up to https://github.com/apache/arrow-rs/pull/7878
   
   The [variant spec](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#:~:text=The%20last%20part%20of%20the%20metadata%20is%20bytes%2C%20which%20stores%20all%20the%20string%20values%20in%20the%20dictionary.%20All%20string%20values%20must%20be%20UTF%2D8%20encoded%20strings.) states that the string values in the metadata dictionary must be UTF-8 encoded strings.
   
   
   We do this check here: 
   
https://github.com/apache/arrow-rs/blob/387490a7a97a9ea6d2fcd0105e6a1abaf819a386/parquet-variant/src/variant/metadata.rs#L250-L252
   
   
   Since we offer `simdutf8` as an optional dependency in other crates, we 
could do the same when performing the validation above. See @Dandandan's 
[comment](https://github.com/apache/arrow-rs/pull/7878#discussion_r2197556647).
   
   
   The rough idea:
   
   If the `simdutf8` feature is enabled, do: 
   ```rs
   let value_str = simdutf8::basic::from_utf8(value_buffer)?;
   ```
   
   Otherwise, fall back to the existing implementation.
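
A minimal sketch of how the two paths could sit behind a Cargo feature flag. The helper name `validate_utf8`, the error type `InvalidUtf8`, and the feature name `simdutf8` are assumptions made for illustration and are not taken from the existing code. The real implementation would presumably map the failure into the crate's own error type, which is also what the `?` in the snippet above would need.

```rust
/// Hypothetical error type for this sketch; the real code would likely
/// reuse the crate's existing error type instead.
#[derive(Debug, PartialEq)]
pub struct InvalidUtf8;

/// SIMD-accelerated path, compiled only when the (assumed) `simdutf8`
/// feature is enabled and the optional dependency is available.
#[cfg(feature = "simdutf8")]
pub fn validate_utf8(bytes: &[u8]) -> Result<&str, InvalidUtf8> {
    // `simdutf8::basic::from_utf8` reports only success or failure,
    // without the position of the invalid byte.
    simdutf8::basic::from_utf8(bytes).map_err(|_| InvalidUtf8)
}

/// Fallback path: the standard library validator.
#[cfg(not(feature = "simdutf8"))]
pub fn validate_utf8(bytes: &[u8]) -> Result<&str, InvalidUtf8> {
    std::str::from_utf8(bytes).map_err(|_| InvalidUtf8)
}
```

Both variants share one signature, so the call site in the metadata validation would not need to know which one was compiled in. One trade-off to keep in mind: the `basic` API of `simdutf8` gives no error offset, so any error message that currently reports where the invalid byte occurs would lose that detail on the SIMD path.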
   

