[ https://issues.apache.org/jira/browse/PARQUET-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495062#comment-14495062 ]

Jason Altekruse commented on PARQUET-251:
-----------------------------------------

From a design standpoint, I completely agree that it would be clearest to have 
the Binary objects maintain zero shared state.

My concern is that an interface meant to unify byte[] with other storage types 
is ill-suited to a method that returns a bare byte[] without an explicit 
guarantee that the result is an immutable copy of the value currently 
represented. An implementation that re-uses a buffer cannot return its byte[] 
as-is, because the buffer may be the wrong size for the current value, so it 
would need to make a copy anyway.
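
To make the size hazard concrete, here is a minimal sketch of a buffer-reusing 
implementation; the class and method names are hypothetical, not actual 
parquet-mr API:

    // Hypothetical buffer-reusing value holder. The backing buffer is
    // grown as needed but never shrunk, so it may be larger than the
    // value it currently holds.
    public class ReusedBufferBinary {
      private byte[] buffer = new byte[64];
      private int length; // number of valid bytes in buffer

      public void set(byte[] src, int len) {
        if (len > buffer.length) {
          buffer = new byte[len];
        }
        System.arraycopy(src, 0, buffer, 0, len);
        length = len;
      }

      // Returning `buffer` directly would expose stale trailing bytes
      // and alias future values, so a defensive copy is unavoidable.
      public byte[] getBytes() {
        return java.util.Arrays.copyOf(buffer, length);
      }
    }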

As far as writing to a stream is concerned, that is already covered by methods 
on Binary itself, which can write the current value to a stream in whatever way 
is appropriate to how the bytes are stored. I think adding explicit interfaces 
for the other use cases is the cleanest way to do it. For example, given how 
the statistics want to use the data, it might make the most sense for Binary to 
include an explicit clone method that creates a copy of the Binary in the most 
efficient manner possible. For Drill it would be useful to have the flexibility 
to return a new Binary backed by a direct memory buffer that we can track with 
our own allocator; another type of Binary, say one that internally compresses 
individual values, could return a more efficient copy of itself than could be 
made from the byte[] returned by this method. A rough sketch follows below.
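
As a rough illustration only (BinaryValue stands in for parquet.io.api.Binary, 
and copy() is the proposed hook, not an existing method):

    // Sketch of statistics retaining values via an explicit copy hook.
    // Each implementation would override copy() with its most efficient
    // safe-to-retain copy (heap copy, direct-memory copy, etc.).
    interface BinaryValue extends Comparable<BinaryValue> {
      byte[] getBytes();
      BinaryValue copy(); // proposed method, hypothetical here
    }

    class BinaryStatisticsSketch {
      private BinaryValue min;
      private BinaryValue max;

      void updateStats(BinaryValue value) {
        if (max == null || value.compareTo(max) > 0) {
          max = value.copy(); // retain a detached copy, not the reference
        }
        if (min == null || value.compareTo(min) < 0) {
          min = value.copy();
        }
      }
    }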

> Binary column statistics error when reuse byte[] among rows
> -----------------------------------------------------------
>
>                 Key: PARQUET-251
>                 URL: https://issues.apache.org/jira/browse/PARQUET-251
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.6.0
>            Reporter: Yijie Shen
>            Priority: Blocker
>
> I think it is a common practice when writing table data as a parquet file to 
> reuse the same object across rows, and if a column is a byte[] of fixed 
> length, the byte[] would also be reused. 
> If I use ByteArrayBackedBinary for my byte[], the bug occurs: all of the row 
> groups created by a single task end up with the same max & min binary value, 
> namely the last row's binary content.
> The reason is that BinaryStatistics keeps max & min as parquet.io.api.Binary 
> references; since I use ByteArrayBackedBinary for my byte[], the real content 
> of max & min always points to the reused byte[], and therefore to the latest 
> row's content.
> Does parquet declare anywhere that the user shouldn't reuse a byte[] for the 
> Binary type? If it doesn't, I think this is a bug; it can be reproduced by 
> [Spark SQL's RowWriteSupport 
> |https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L353-354]
> The related Spark JIRA ticket: 
> [SPARK-6859|https://issues.apache.org/jira/browse/SPARK-6859]
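
For reference, a minimal sketch of the aliasing the report describes, assuming 
the 1.6.0-era API (package and class names as quoted above; the buffer-filling 
line stands in for whatever populates the reused byte[]):

    import parquet.column.statistics.BinaryStatistics;
    import parquet.io.api.Binary;

    public class ReuseBugSketch {
      public static void main(String[] args) {
        BinaryStatistics stats = new BinaryStatistics();
        byte[] reused = new byte[4];
        for (int row = 0; row < 3; row++) {
          // stand-in for copying a fixed-length column value into the buffer
          java.util.Arrays.fill(reused, (byte) row);
          stats.updateStats(Binary.fromByteArray(reused));
        }
        // The statistics kept Binary references that still wrap `reused`,
        // so max and min both show the last row's bytes: [2, 2, 2, 2].
        System.out.println(java.util.Arrays.toString(stats.getMax().getBytes()));
        System.out.println(java.util.Arrays.toString(stats.getMin().getBytes()));
      }
    }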



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
