[ https://issues.apache.org/jira/browse/PARQUET-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495062#comment-14495062 ]
Jason Altekruse commented on PARQUET-251:
-----------------------------------------
From a design standpoint, I completely agree that it would be clearest to have
the Binary objects maintain zero shared state.
My concern is that an interface trying to unify byte[] with other types is ill
suited to a method that returns a bare byte[] without an explicit expectation
that the result be an immutable copy of the data currently represented. An
implementation that re-uses a buffer cannot return its byte[] directly if the
buffer may be allocated at the wrong size for the current value, so it would
need to make a copy anyway.
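To make that concrete, here is a minimal sketch of a buffer-re-using value
class (hypothetical name, not the actual parquet-mr code). Because the
internal buffer can be allocated larger than the current value, returning it
directly would expose stale trailing bytes, so getBytes() has to copy:
{code:java}
import java.util.Arrays;

// Hypothetical sketch of a value implementation that re-uses one buffer
// across rows; not the actual parquet-mr Binary code.
public final class ReusedBufferBinary {
    private byte[] buffer = new byte[1024]; // re-used, possibly oversized
    private int length;                     // valid bytes of the current value

    // Point this instance at the next value without a per-row allocation.
    public void set(byte[] src, int len) {
        if (len > buffer.length) {
            buffer = Arrays.copyOf(src, len);
        } else {
            System.arraycopy(src, 0, buffer, 0, len);
        }
        length = len;
    }

    // The internal buffer cannot be handed out as-is: its length does not
    // match the current value, so a copy is unavoidable here anyway.
    public byte[] getBytes() {
        return Arrays.copyOf(buffer, length);
    }
}
{code}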
As far as writing to a stream is concerned, this is already covered by methods
on Binary itself, which can write the current value to a stream in whatever
way is appropriate to how the bytes are stored. I think adding explicit
interfaces for any other use cases is the cleanest way to do it. For example,
for how the statistics want to use the data, it might make the most sense for
Binary to include an explicit clone method that creates a copy of the value in
the most efficient manner possible. For Drill it would be useful to have the
flexibility to return a new Binary backed by a direct memory buffer that we
can track with our own allocator; another kind of Binary, say one that applies
some amount of compression to individual values, could return a more efficient
copy of itself than one made from the byte[] returned by this method.
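Sketched as code, the copy hook I have in mind might look like this
(hypothetical signatures, not an existing parquet-mr API):
{code:java}
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch of the interface discussed above; not the real
// parquet.io.api.Binary class.
public abstract class Binary {

    // Already covered today: write the current value to a stream using
    // logic specific to how the bytes are stored.
    public abstract void writeTo(OutputStream out) throws IOException;

    // Proposed: create an immutable snapshot of the current value in the
    // most efficient manner for this representation. A heap-backed
    // implementation copies its byte[]; Drill could return a Binary backed
    // by direct memory tracked by its own allocator; a representation that
    // compresses individual values could duplicate its compressed form
    // instead of round-tripping through a bare byte[].
    public abstract Binary copy();
}
{code}
The statistics would then retain snapshots rather than live references, e.g.
max = value.copy() whenever a new maximum is seen.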
> Binary column statistics error when reuse byte[] among rows
> -----------------------------------------------------------
>
> Key: PARQUET-251
> URL: https://issues.apache.org/jira/browse/PARQUET-251
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.6.0
> Reporter: Yijie Shen
> Priority: Blocker
>
> I think it is a common practice, when inserting table data as a parquet
> file, to reuse the same object across rows, and if a column is a byte[] of
> fixed length, the byte[] is also reused.
> If I use ByteArrayBackedBinary for my byte[], the bug occurs: all of the row
> groups created by a single task have the same max & min binary value, namely
> the last row's binary content.
> The reason is that BinaryStatistics just keeps max & min as
> parquet.io.api.Binary references; since I use ByteArrayBackedBinary for my
> byte[], the real content of max & min always points to the reused byte[],
> and therefore to the latest row's content.
> Does parquet declare somewhere that the user shouldn't reuse a byte[] for
> the Binary type? If it doesn't, I think this is a bug; it can be reproduced
> with [Spark SQL's RowWriteSupport
> |https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L353-354]
> The related Spark JIRA ticket:
> [SPARK-6859|https://issues.apache.org/jira/browse/SPARK-6859]
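For illustration, here is a minimal standalone reproduction of the aliasing
described in the quoted report (hypothetical names; a simplified stand-in for
BinaryStatistics rather than the actual parquet-mr classes):
{code:java}
import java.nio.charset.StandardCharsets;

public class ReuseBug {
    // Mimics statistics that retain references instead of copies.
    static byte[] min, max;

    static void updateStats(byte[] value) {
        if (max == null || compare(value, max) > 0) max = value; // keeps a reference!
        if (min == null || compare(value, min) < 0) min = value; // keeps a reference!
    }

    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] reused = new byte[3]; // one buffer re-used for every row
        for (String row : new String[] {"abc", "zzz", "mmm"}) {
            System.arraycopy(row.getBytes(StandardCharsets.US_ASCII), 0, reused, 0, 3);
            updateStats(reused);
        }
        // Both lines print "mmm", the last row's content, instead of the
        // true min "abc" and max "zzz".
        System.out.println(new String(min, StandardCharsets.US_ASCII));
        System.out.println(new String(max, StandardCharsets.US_ASCII));
    }
}
{code}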
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)