Yijie Shen created PARQUET-251:
----------------------------------

             Summary: Binary column statistics error when reusing byte[] among rows
                 Key: PARQUET-251
                 URL: https://issues.apache.org/jira/browse/PARQUET-251
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
    Affects Versions: 1.6.0
            Reporter: Yijie Shen


I think it is common practice, when inserting table data as a Parquet file, to reuse the same object across rows; and if a column is a byte[] of fixed length, the byte[] is reused as well.

If I use ByteArrayBackedBinary for my byte[], the bug occurs: all of the row groups created by a single task end up with the same max & min binary value, namely the last row's binary content.

The reason is that BinaryStatistics keeps max & min as parquet.io.api.Binary references; since ByteArrayBackedBinary wraps the byte[] without copying it, the contents of max & min always point into the reused byte[], and therefore reflect the latest row's content.
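
A minimal sketch of the aliasing (the class name and sample values are mine; this assumes the parquet.* package layout of 1.6.0, where BinaryStatistics.updateStats and Binary.fromByteArray are the relevant calls):

{code:java}
import parquet.column.statistics.BinaryStatistics;
import parquet.io.api.Binary;

public class StatsAliasingRepro {
  public static void main(String[] args) {
    BinaryStatistics stats = new BinaryStatistics();
    byte[] reused = new byte[] { 'z' };   // first row's value: "z"

    stats.updateStats(Binary.fromByteArray(reused));
    reused[0] = 'a';                      // reuse the same byte[] for the next row: "a"
    stats.updateStats(Binary.fromByteArray(reused));

    // Expected: min = "a", max = "z".
    // Actual: both print "a", because the stored max is a Binary
    // reference into the reused byte[], not a copy of its contents.
    System.out.println("min = " + stats.getMin().toStringUsingUTF8());
    System.out.println("max = " + stats.getMax().toStringUsingUTF8());
  }
}
{code}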

Does Parquet declare anywhere that the user shouldn't reuse a byte[] for the Binary type? If it doesn't, I think this is a bug; it can be reproduced via [Spark SQL's RowWriteSupport|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L353-354]

The related Spark JIRA ticket: 
[SPARK-6859|https://issues.apache.org/jira/browse/SPARK-6859]


