v0y4g3r opened a new issue, #3506:
URL: https://github.com/apache/arrow-rs/issues/3506

   **Describe the bug**
   
   As per [Parquet's 
spec](https://parquet.apache.org/docs/file-format/data-pages/encodings/) and 
[Java 
implementation](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/IntStatistics.java#L82),
 statistics use PLAIN encoding, which encodes INT64 values as little-endian bytes. 
   
   When decoding column statistics, arrow's official Parquet implementation 
reads the data as little endian:
   
https://github.com/apache/arrow-rs/blob/3788fd20f053ee58f08b4d09cd4dac5bb9b96c06/parquet/src/file/statistics.rs#L171-L177
   
   But when writing the min/max values of statistics, it simply converts the 
in-memory representation of the i64 values into a byte slice, which is 
platform dependent.
   
   
https://github.com/apache/arrow-rs/blob/3788fd20f053ee58f08b4d09cd4dac5bb9b96c06/parquet/src/data_type.rs#L451-L463
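
   To make the difference concrete, here is a minimal sketch using only the standard library. The helper names `native_bytes` and `plain_bytes` are hypothetical and do not appear in arrow-rs. `to_ne_bytes` stands in for reinterpreting the value's memory as a byte slice.

```rust
/// Bytes of `v` exactly as they are laid out in host memory. This is the
/// same result as reinterpreting `&i64` as `&[u8]`, so it depends on the
/// platform's endianness. Hypothetical helper, not arrow-rs code.
fn native_bytes(v: i64) -> [u8; 8] {
    v.to_ne_bytes()
}

/// Bytes of `v` in little-endian order, independent of the host.
/// Hypothetical helper, not arrow-rs code.
fn plain_bytes(v: i64) -> [u8; 8] {
    v.to_le_bytes()
}
```

   The two helpers agree only when the host itself is little endian, which is why the problem goes unnoticed on x86 and most ARM targets.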
   
   **To Reproduce**
   This should be easy to reproduce on a big-endian machine, but I don't have 
one (such as a MIPS server) at hand.
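
   Without such hardware, the expected mismatch can at least be simulated on any host. The sketch below is an assumption about what would happen, not output from a real big-endian run. `written_on_big_endian` and `decode_stat` are hypothetical names; `to_be_bytes` imitates dumping an i64's memory on a big-endian machine, and `from_le_bytes` imitates the little-endian reader.

```rust
/// Imitates a big-endian host writing the raw in-memory bytes of `v`.
/// Hypothetical helper used only for this simulation.
fn written_on_big_endian(v: i64) -> [u8; 8] {
    v.to_be_bytes()
}

/// Imitates the reader, which always interprets the bytes as little endian.
/// Hypothetical helper used only for this simulation.
fn decode_stat(bytes: [u8; 8]) -> i64 {
    i64::from_le_bytes(bytes)
}
```

   Under this simulation, a minimum of `1` would be read back as `1 << 56`, i.e. the byte-swapped value.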
   
   
   **Expected behavior**
   
   Encode the min/max values of statistics as little-endian bytes.
   
   **Additional context**
   
   When encoding stats, Parquet uses the `AsBytes` trait to convert an i64 
into a byte slice:
   
https://github.com/apache/arrow-rs/blob/3788fd20f053ee58f08b4d09cd4dac5bb9b96c06/parquet/src/data_type.rs#L428-L431
   
   Thus the lifetime of the returned slice is bound to the value itself. If we 
want to convert an i64 into a little-endian byte slice on a big-endian 
platform, we must create a temporary array to hold the converted bytes instead 
of just reinterpreting the address of the value as a byte slice. When the 
`as_bytes` method returns, that temporary array is dropped, which violates the 
trait's lifetime constraint. We may need to change `AsBytes` into something 
like:
   ```rs
   pub trait AsBytes {
       fn encode(&self, buf: &mut Vec<u8>) -> usize;
   }
   ```
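
   As a rough sketch of how the proposed signature could be implemented for i64: the trait is repeated here so the example compiles on its own, and the meaning of the return value (number of bytes appended) is an assumption, since the proposal above does not specify it.

```rust
pub trait AsBytes {
    /// Appends the encoding of `self` to `buf`.
    /// Assumption: the return value is the number of bytes appended.
    fn encode(&self, buf: &mut Vec<u8>) -> usize;
}

impl AsBytes for i64 {
    fn encode(&self, buf: &mut Vec<u8>) -> usize {
        // The temporary array lives only inside this call; its contents are
        // copied into `buf`, so no borrowed slice has to outlive it.
        let bytes = self.to_le_bytes();
        buf.extend_from_slice(&bytes);
        bytes.len()
    }
}
```

   Because the caller owns the buffer, the lifetime problem described above does not arise, at the cost of a copy on every call.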
   

