Dandandan commented on a change in pull request #1082:
URL: https://github.com/apache/arrow-rs/pull/1082#discussion_r785434772



##########
File path: parquet/src/arrow/array_reader/byte_array.rs
##########
@@ -192,7 +211,16 @@ impl<I: OffsetSizeTrait + ScalarValue> OffsetBuffer<I> {
         self.offsets.len() - 1
     }
 
-    fn try_push(&mut self, data: &[u8]) -> Result<()> {
+    fn try_push(&mut self, data: &[u8], validate_utf8: bool) -> Result<()> {
+        if validate_utf8 {
+            if let Err(e) = std::str::from_utf8(data) {

Review comment:
       I think we tried something similar in parquet2 but concluded that 
each string should be validated individually instead. By the way, the 
advantage of `simdutf8` is larger when checking non-ASCII strings.
   Checking the code points at the offsets seems like an interesting approach!
   Also FYI @jorgecarleitao
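
   To make the "check at the offsets" idea concrete, here is a minimal sketch using only the standard library. It is not the code from this PR: `validate_offsets_utf8` is a hypothetical helper, and taking the offsets as `usize` is an assumption made for illustration. The idea is to validate the whole values buffer once, then confirm that every offset falls on a character boundary, so that no string starts or ends in the middle of a multi-byte code point.

```rust
/// Hypothetical helper (not part of arrow-rs): validates a contiguous
/// values buffer once, then checks that each offset is a char boundary.
fn validate_offsets_utf8(values: &[u8], offsets: &[usize]) -> Result<(), String> {
    // One pass over the whole buffer instead of one call per string.
    let s = std::str::from_utf8(values)
        .map_err(|e| format!("invalid UTF-8 in values buffer: {}", e))?;

    for &offset in offsets {
        // `is_char_boundary` returns false for an offset inside a multi-byte
        // code point and for an offset past the end of the buffer.
        if !s.is_char_boundary(offset) {
            return Err(format!("offset {} is not on a UTF-8 char boundary", offset));
        }
    }
    Ok(())
}
```

   The boundary check is needed because validating the concatenated buffer alone is not enough: two individually invalid strings can join into valid UTF-8 when a multi-byte code point is split across them. For the bytes of `"aé"` (`61 C3 A9`), offsets `[0, 1, 3]` pass, while `[0, 2, 3]` fail because offset 2 lands inside `é`.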




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
