[ 
https://issues.apache.org/jira/browse/PARQUET-258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14515981#comment-14515981
 ] 

Konstantin Shaposhnikov commented on PARQUET-258:
-------------------------------------------------

[~rdblue], you are right, I should have searched for an existing bug report 
first.

> Binary statistics is not updated correctly if an underlying Binary array is 
> modified in place
> ---------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-258
>                 URL: https://issues.apache.org/jira/browse/PARQUET-258
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.6.0
>            Reporter: Konstantin Shaposhnikov
>
> The following test case shows the problem:
> {code}
>     byte[] bytes = new byte[] { 49 };
>     BinaryStatistics reusableStats = new BinaryStatistics();
>     reusableStats.updateStats(Binary.fromByteArray(bytes));
>     // Mutate the same array in place, as a reader that reuses buffers would.
>     bytes[0] = 50;
>     reusableStats.updateStats(Binary.fromByteArray(bytes, 0, 1));
>
>     // Expected: min = { 49 }, max = { 50 }.
>     assertArrayEquals(new byte[] { 49 }, reusableStats.getMinBytes());
>     assertArrayEquals(new byte[] { 50 }, reusableStats.getMaxBytes());
> {code}
> I discovered the bug when converting an Avro file to a Parquet file by 
> reading GenericRecords from the file using the 
> [DataFileStream.next(D reuse)|http://javadox.com/org.apache.avro/avro/1.7.6/org/apache/avro/file/DataFileStream.html#next(D)]
> method. The problem is that the underlying byte array of the Avro Utf8 object 
> is passed to Parquet, which saves it as part of BinaryStatistics, and then the 
> same array is modified in place on the next read.
> I am not sure what the right way to fix the problem is (in BinaryStatistics 
> or AvroWriteSupport).
> If the BinaryStatistics implementation is correct (for performance reasons), 
> then this behavior should be documented and AvroWriteSupport.fromAvroString 
> should be fixed to duplicate the underlying Utf8 array.
> I am happy to create a pull request once the desired way to fix the issue has 
> been discussed.
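
For illustration only, here is a minimal, self-contained sketch of the aliasing problem described in the issue and of the "duplicate the array" option. It does not use Parquet or Avro: MinMaxTracker, its copyOnUpdate flag, and the unsigned byte comparison are hypothetical stand-ins, not the actual BinaryStatistics implementation or its ordering.

```java
import java.util.Arrays;

public class StatsAliasingSketch {

    /** Hypothetical min/max tracker; NOT Parquet's BinaryStatistics. */
    static final class MinMaxTracker {
        private final boolean copyOnUpdate;
        private byte[] min;
        private byte[] max;

        MinMaxTracker(boolean copyOnUpdate) {
            this.copyOnUpdate = copyOnUpdate;
        }

        void update(byte[] value) {
            // Without a copy, min/max keep a reference to the caller's array.
            if (min == null || compare(value, min) < 0) {
                min = copyOnUpdate ? value.clone() : value;
            }
            if (max == null || compare(value, max) > 0) {
                max = copyOnUpdate ? value.clone() : value;
            }
        }

        byte[] getMin() { return min; }
        byte[] getMax() { return max; }

        // Unsigned lexicographic comparison (an assumption made for this sketch).
        private static int compare(byte[] a, byte[] b) {
            int n = Math.min(a.length, b.length);
            for (int i = 0; i < n; i++) {
                int d = (a[i] & 0xff) - (b[i] & 0xff);
                if (d != 0) {
                    return d;
                }
            }
            return a.length - b.length;
        }
    }

    /** Replays the scenario from the test case: update, mutate in place, update. */
    static byte[][] run(boolean copyOnUpdate) {
        byte[] bytes = new byte[] { 49 };
        MinMaxTracker tracker = new MinMaxTracker(copyOnUpdate);
        tracker.update(bytes);
        bytes[0] = 50; // the caller reuses its buffer
        tracker.update(bytes);
        return new byte[][] { tracker.getMin(), tracker.getMax() };
    }

    public static void main(String[] args) {
        System.out.println("aliased min: " + Arrays.toString(run(false)[0])); // [50]
        System.out.println("copied min:  " + Arrays.toString(run(true)[0]));  // [49]
    }
}
```

In this sketch the copy is made only when a value becomes the new min or max, so the cost is not paid on every update. Whether the copy belongs in the statistics class or in the caller (the AvroWriteSupport side) is the open question raised in the issue.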



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
