[ 
https://issues.apache.org/jira/browse/FLINK-7243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605466#comment-16605466
 ] 

ASF GitHub Bot commented on FLINK-7243:
---------------------------------------

lvhuyen edited a comment on issue #6483: [FLINK-7243][flink-formats] Add 
parquet input format
URL: https://github.com/apache/flink/pull/6483#issuecomment-418967608
 
 
   @HuangZhenQiu 
   Here is the schema of that parquet file, printed in Zeppelin.
   > root
   >  |-- metrics_date: timestamp (nullable = true)
   >  |-- counter: long (nullable = true)
   >  |-- meter: double (nullable = true)
   >  |-- customer_id: string (nullable = true)
   I have also attached the sample file here:
   https://github.com/lvhuyen/flink/blob/parquet_input_format(7243)/flink-formats/flink-parquet/src/test/resources/test.parquet
   
   I debugged this in IntelliJ: that column is in fact stored as the primitive
type int96 (not int64), and in Apache's ColumnReaderImpl
(https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnReaderImpl.java),
int96 is treated as a Binary (line 274). In your current implementation of the
RowConverter class, a Binary is converted into a String using UTF-8, which
appears to be irreversible and leads to data loss. My data has metrics_date =
2018-09-01 15:02:55.0, which was read as a byte array of [0, 118, -95, -103,
69, 49, 0, 0, -5, -126, 37, 0] and then converted to a string of length 12
that has the same character at the 3rd, 4th, 9th, and 10th positions.
   Would it be possible to modify the method
RowConverter.RowPrimitiveConverter.addBinary() to handle String / BigInteger
differently?
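
   To illustrate the concern, here is a minimal sketch that decodes the 12 bytes quoted above directly, and also checks whether they survive a UTF-8 String round trip. It assumes the INT96 timestamp layout commonly written by Spark, Hive, and Impala (8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, both little-endian); that layout is an assumption and is not stated in this thread. The class and method names are hypothetical and are not part of Flink or parquet-mr.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Flink or parquet-mr.
class Int96TimestampSketch {

    // Julian day number that corresponds to 1970-01-01.
    static final long JULIAN_DAY_OF_EPOCH = 2_440_588L;

    // Assumed layout: bytes 0-7 = nanoseconds of day, bytes 8-11 = Julian day,
    // both little-endian. No time zone adjustment is applied.
    static LocalDateTime decode(byte[] int96) {
        if (int96.length != 12) {
            throw new IllegalArgumentException("INT96 value must be exactly 12 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong();
        int julianDay = buf.getInt();
        return LocalDate.ofEpochDay(julianDay - JULIAN_DAY_OF_EPOCH)
                .atTime(LocalTime.ofNanoOfDay(nanosOfDay));
    }

    // Returns true only if bytes -> UTF-8 String -> bytes reproduces the input.
    static boolean survivesUtf8RoundTrip(byte[] raw) {
        byte[] back = new String(raw, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(raw, back);
    }

    public static void main(String[] args) {
        // The byte array reported in this comment for metrics_date.
        byte[] raw = {0, 118, -95, -103, 69, 49, 0, 0, -5, -126, 37, 0};
        System.out.println(decode(raw));                // 2018-09-01T15:02:55
        System.out.println(survivesUtf8RoundTrip(raw)); // false
    }
}
```

   Under that assumed layout the raw bytes still carry the expected timestamp, while the String conversion replaces the four bytes that are not valid UTF-8 with the same replacement character, so the original value cannot be recovered afterwards.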



> Add ParquetInputFormat
> ----------------------
>
>                 Key: FLINK-7243
>                 URL: https://issues.apache.org/jira/browse/FLINK-7243
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Table API & SQL
>            Reporter: godfrey he
>            Assignee: Zhenqiu Huang
>            Priority: Major
>              Labels: pull-request-available
>
> Add a {{ParquetInputFormat}} to read data from an Apache Parquet file.


