[ https://issues.apache.org/jira/browse/PARQUET-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493364#comment-14493364 ]

Yijie Shen commented on PARQUET-251:
------------------------------------

[~jnadeau], I checked the PR. It seems you just changed
{code}
public void updateStats(Binary min_value, Binary max_value) {
  if (min.compareTo(min_value) > 0) { min = min_value; }
  if (max.compareTo(max_value) < 0) { max = max_value; }
}
{code}
to 
{code}
public void updateStats(Binary min_value, Binary max_value) {
  if (min.compareTo(min_value) > 0) { min = Binary.fromByteArray(min_value.getBytes()); }
  if (max.compareTo(max_value) < 0) { max = Binary.fromByteArray(max_value.getBytes()); }
}
{code}

As I mentioned in the description, the bug occurs when I reuse a byte[] with 
ByteArrayBackedBinary as the Binary implementation. fromByteArray and getBytes 
are not actually doing the "defensive copy" job mentioned by [~rdblue]; they 
still pass around the original byte[].
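To make the failure mode concrete, here is a minimal self-contained sketch of the reference-vs-copy distinction. The ReuseDemo class and its compare helper are hypothetical stand-ins for Parquet's Binary statistics, not the actual parquet-mr code; only the copy behavior is the point:

```java
import java.util.Arrays;

// Simplified stand-in for BinaryStatistics over a byte[]-backed Binary
// (hypothetical class for illustration; not Parquet's implementation).
public class ReuseDemo {
    static byte[] min;

    // Mimics the buggy update: stores a reference to the caller's array.
    static void updateMinByReference(byte[] value) {
        if (min == null || compare(value, min) < 0) {
            min = value; // BUG: the caller may overwrite this array later
        }
    }

    // Defensive copy: the stored min is independent of the reused buffer.
    static void updateMinByCopy(byte[] value) {
        if (min == null || compare(value, min) < 0) {
            min = Arrays.copyOf(value, value.length);
        }
    }

    // Unsigned-free lexicographic comparison, enough for this demo.
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int c = Byte.compare(a[i], b[i]);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] reused = new byte[] {5};

        // Reference semantics: min silently tracks the reused buffer.
        min = null;
        updateMinByReference(reused);
        reused[0] = 9; // simulate writing the next row into the same array
        System.out.println("by reference: " + min[0]); // prints 9, not 5

        // Copy semantics: min keeps the value it actually saw.
        min = null;
        reused[0] = 5;
        updateMinByCopy(reused);
        reused[0] = 9;
        System.out.println("by copy: " + min[0]); // prints 5
    }
}
```

This is exactly why wrapping the same backing array in a new Binary object does not help: the statistics still alias the buffer the writer keeps mutating.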

Correct me if I've misunderstood your PR :)

> Binary column statistics error when reuse byte[] among rows
> -----------------------------------------------------------
>
>                 Key: PARQUET-251
>                 URL: https://issues.apache.org/jira/browse/PARQUET-251
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.6.0
>            Reporter: Yijie Shen
>            Priority: Blocker
>
> I think it is common practice, when inserting table data as a Parquet file, 
> to reuse the same object among rows; if a column is a byte[] of fixed 
> length, the byte[] is also reused. 
> If I use ByteArrayBackedBinary for my byte[], the bug occurs: all of the row 
> groups created by a single task have the same max & min binary value, which 
> is just the last row's binary content.
> The reason is that BinaryStatistics just keeps max & min as 
> parquet.io.api.Binary references; since I use ByteArrayBackedBinary for 
> byte[], the real content of max & min always points to the reused byte[], 
> and therefore to the latest row's content.
> Does Parquet declare somewhere that the user shouldn't reuse a byte[] for 
> the Binary type? If it doesn't, I think it's a bug, and it can be reproduced by 
> [Spark SQL's RowWriteSupport 
> |https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L353-354]
> The related Spark JIRA ticket: 
> [SPARK-6859|https://issues.apache.org/jira/browse/SPARK-6859]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
