Konstantin Shaposhnikov created PARQUET-258:
-----------------------------------------------
Summary: Binary statistics are not updated correctly if the underlying Binary array is modified in place
Key: PARQUET-258
URL: https://issues.apache.org/jira/browse/PARQUET-258
Project: Parquet
Issue Type: Bug
Components: parquet-mr
Affects Versions: 1.6.0
Reporter: Konstantin Shaposhnikov
The following test case shows the problem:
{code}
byte[] bytes = new byte[] { 49 };
BinaryStatistics reusableStats = new BinaryStatistics();
reusableStats.updateStats(Binary.fromByteArray(bytes));

// modify the backing array in place and update the statistics again
bytes[0] = 50;
reusableStats.updateStats(Binary.fromByteArray(bytes, 0, 1));

// fails: the expected minimum { 49 } is lost because the statistics kept a
// reference to the array that was just mutated in place
assertArrayEquals(new byte[] { 49 }, reusableStats.getMinBytes());
assertArrayEquals(new byte[] { 50 }, reusableStats.getMaxBytes());
{code}
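As a side note, the problem disappears on the caller side if the bytes are copied before they reach the statistics; a minimal sketch (the clone() call is my addition, not part of the failing test above):
{code}
// copying the array means the statistics can never observe a later in-place change
reusableStats.updateStats(Binary.fromByteArray(bytes.clone()));
{code}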
I discovered the bug when converting an Avro file to a Parquet file by reading GenericRecords with the [DataFileStream.next(D reuse)|http://javadox.com/org.apache.avro/avro/1.7.6/org/apache/avro/file/DataFileStream.html#next(D)] method. The underlying byte array of the Avro Utf8 object is passed to Parquet, which keeps a reference to it in BinaryStatistics, and the same array is then modified in place on the next read.
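For context, the conversion loop looks roughly like this (the class name, file arguments and use of AvroParquetWriter are illustrative, not my actual code):
{code}
import java.io.FileInputStream;

import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;

import parquet.avro.AvroParquetWriter;

public class AvroToParquet {
  public static void main(String[] args) throws Exception {
    FileInputStream in = new FileInputStream(args[0]);
    DataFileStream<GenericRecord> reader =
        new DataFileStream<GenericRecord>(in, new GenericDatumReader<GenericRecord>());
    AvroParquetWriter<GenericRecord> writer =
        new AvroParquetWriter<GenericRecord>(new Path(args[1]), reader.getSchema());
    try {
      GenericRecord record = null;
      while (reader.hasNext()) {
        // next(reuse) refills the same Utf8 objects in place, so any Binary
        // that wraps their backing arrays (e.g. inside BinaryStatistics)
        // silently changes on the following iteration
        record = reader.next(record);
        writer.write(record);
      }
    } finally {
      writer.close();
      reader.close();
    }
  }
}
{code}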
I am not sure what the right way to fix the problem is (in BinaryStatistics or in AvroWriteSupport).
If the BinaryStatistics implementation is considered correct as-is (for performance reasons), then this behavior should be documented and AvroWriteSupport.fromAvroString should be fixed to duplicate the underlying Utf8 array, as sketched below.
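For that second option, the change I have in mind would look roughly like this (the standalone class is only for illustration; the real method is a private helper in AvroWriteSupport, and I am quoting its shape from memory rather than the 1.6.0 source):
{code}
import java.util.Arrays;

import org.apache.avro.util.Utf8;

import parquet.io.api.Binary;

class Strings {
  static Binary fromAvroString(Object value) {
    if (value instanceof Utf8) {
      Utf8 utf8 = (Utf8) value;
      // copy the backing array so the resulting Binary (and anything that
      // holds on to it, such as BinaryStatistics) is immune to Avro reusing
      // the Utf8 buffer on the next read
      return Binary.fromByteArray(Arrays.copyOf(utf8.getBytes(), utf8.getByteLength()));
    }
    return Binary.fromString(value.toString());
  }
}
{code}
The copy obviously adds an allocation per string value, which is why I would like to hear which layer should pay that cost.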
I am happy to create a pull request once the preferred way to fix the issue has been agreed.