Konstantin Shaposhnikov created PARQUET-258:
-----------------------------------------------

             Summary: Binary statistics is not updated correctly if an 
underlying Binary array is modified in place
                 Key: PARQUET-258
                 URL: https://issues.apache.org/jira/browse/PARQUET-258
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
    Affects Versions: 1.6.0
            Reporter: Konstantin Shaposhnikov


The following test case shows the problem:

{code}
    // Build statistics from a byte array, then mutate that same array in place.
    byte[] bytes = new byte[] { 49 };
    BinaryStatistics reusableStats = new BinaryStatistics();
    reusableStats.updateStats(Binary.fromByteArray(bytes));
    bytes[0] = 50;
    reusableStats.updateStats(Binary.fromByteArray(bytes, 0, 1));

    // The first assertion fails: getMinBytes() returns { 50 } because the
    // statistics keep a reference to the mutated array instead of a copy.
    assertArrayEquals(new byte[] { 49 }, reusableStats.getMinBytes());
    assertArrayEquals(new byte[] { 50 }, reusableStats.getMaxBytes());
{code}

I discovered the bug when converting an Avro file to a Parquet file by reading 
GenericRecords with the [DataFileStream.next(D reuse)|http://javadox.com/org.apache.avro/avro/1.7.6/org/apache/avro/file/DataFileStream.html#next(D)] 
method. The problem is that the byte array backing the Avro Utf8 object is 
passed to Parquet, which keeps a reference to it in BinaryStatistics, and the 
same array is then modified in place on the next read.
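
For reference, this is roughly the read loop I am using; the file name and the 
ParquetWriter wiring are placeholders, but the reuse of a single GenericRecord 
is the relevant part:

{code}
import java.io.FileInputStream;

import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroToParquet {
    public static void main(String[] args) throws Exception {
        try (DataFileStream<GenericRecord> stream = new DataFileStream<>(
                new FileInputStream("input.avro"),
                new GenericDatumReader<GenericRecord>())) {
            GenericRecord reuse = null;
            while (stream.hasNext()) {
                // next(reuse) refills the byte arrays behind Utf8 fields in place,
                // so a Binary created from them in a previous iteration changes too.
                reuse = stream.next(reuse);
                // writer.write(reuse);  // hypothetical ParquetWriter<GenericRecord>
            }
        }
    }
}
{code}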

I am not sure what the right way to fix the problem is (in BinaryStatistics or 
in AvroWriteSupport).

If the BinaryStatistics implementation is considered correct (it avoids copying 
for performance reasons), then this behavior should be documented and 
AvroWriteSupport.fromAvroString should be fixed to copy the underlying Utf8 
array.
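
A rough sketch of what that copy could look like; the actual method in 
AvroWriteSupport may differ, and the handling of non-Utf8 values here is an 
assumption (the Binary factory methods are from the parquet 1.6.0 API):

{code}
import java.util.Arrays;

import org.apache.avro.util.Utf8;
import parquet.io.api.Binary;

class AvroStringCopy {
    static Binary fromAvroString(Object value) {
        if (value instanceof Utf8) {
            Utf8 utf8 = (Utf8) value;
            // Copy the backing array so Avro's object reuse cannot later mutate
            // the bytes referenced by BinaryStatistics min/max.
            byte[] copy = Arrays.copyOfRange(utf8.getBytes(), 0, utf8.getByteLength());
            return Binary.fromByteArray(copy);
        }
        return Binary.fromString(value.toString());
    }
}
{code}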

I am happy to create a pull request once the preferred fix has been discussed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
