lvhuyen commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input 
format
URL: https://github.com/apache/flink/pull/6483#issuecomment-418967608
 
 
   @HuangZhenQiu 
   Here is the schema of that parquet file, printed in Zeppelin.
   > root
   >  |-- metrics_date: timestamp (nullable = true)
   >  |-- counter: long (nullable = true)
   >  |-- meter: double (nullable = true)
   >  |-- customer_id: string (nullable = true)
   I have also attached the sample file here: 
https://github.com/lvhuyen/flink/blob/parquet_input_format(7243)/flink-formats/flink-parquet/src/test/resources/test.parquet
   
   I debugged this in IntelliJ. That column is in fact stored as the primitive 
type int96 (not int64), and in Apache's 
[ColumnReaderImpl.java](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnReaderImpl.java),
 int96 is treated as a String (line 274). The conversion from a byte array 
into a String at line 393 of 
[Binary.java](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java)
 seems to be irreversible and leads to data loss. My data has metrics_date = 
2018-09-01 15:02:55.0, which was read as the byte array [0, 118, -95, -103, 
69, 49, 0, 0, -5, -126, 37, 0]. After line 393, I got a String of length 
12 that has the same character at the 3rd, 4th, 9th, and 10th positions.
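
   A minimal sketch of both observations, written in Python rather than Java for brevity. It rests on two assumptions that the comment itself does not state: that the int96 value follows the layout commonly written by Impala and Spark (8 bytes of little-endian nanoseconds within the day, followed by a 4-byte little-endian Julian day number), and that the conversion at line 393 is a UTF-8 decode that substitutes U+FFFD for malformed bytes. The names `to_bytes`, `decode_int96_timestamp`, and `utf8_round_trip` are hypothetical helpers, not part of parquet-mr or Flink.

```python
import datetime
import struct

# Signed Java byte values exactly as quoted in the comment.
RAW = [0, 118, -95, -103, 69, 49, 0, 0, -5, -126, 37, 0]

# Julian day number of the Unix epoch (1970-01-01).
JULIAN_DAY_OF_EPOCH = 2440588


def to_bytes(signed_values):
    """Convert Java-style signed bytes (-128..127) to a Python bytes object."""
    return bytes(v & 0xFF for v in signed_values)


def decode_int96_timestamp(data):
    """Decode a 12-byte int96 timestamp.

    Assumed layout: 8 bytes little-endian nanoseconds of day,
    then 4 bytes little-endian Julian day number.
    """
    if len(data) != 12:
        raise ValueError("an int96 value is exactly 12 bytes")
    nanos_of_day, julian_day = struct.unpack("<qi", data)
    # datetime only has microsecond resolution, so sub-microsecond digits are dropped.
    return datetime.datetime(1970, 1, 1) + datetime.timedelta(
        days=julian_day - JULIAN_DAY_OF_EPOCH,
        microseconds=nanos_of_day // 1000,
    )


def utf8_round_trip(data):
    """Decode bytes as UTF-8 with replacement, then encode the text again."""
    text = data.decode("utf-8", errors="replace")
    return text, text.encode("utf-8")


if __name__ == "__main__":
    data = to_bytes(RAW)

    # Reading the raw bytes directly recovers the expected timestamp.
    print(decode_int96_timestamp(data))  # 2018-09-01 15:02:55

    # Going through a string replaces the four malformed bytes with U+FFFD.
    text, back = utf8_round_trip(data)
    replaced = [i + 1 for i, ch in enumerate(text) if ch == "\ufffd"]
    print(len(text), replaced)  # 12 [3, 4, 9, 10]
    print(back == data)  # False
```

   Under these assumptions the raw bytes decode to the expected 2018-09-01 15:02:55, while the string form carries the same replacement character at the 3rd, 4th, 9th, and 10th positions and can no longer be turned back into the original 12 bytes. This suggests reading the int96 value as raw bytes rather than as a String.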

