[
https://issues.apache.org/jira/browse/PARQUET-152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nezih Yigitbasi updated PARQUET-152:
------------------------------------
Description:
While running some tests against the master branch, I noticed that when writing a
fixed-length byte array whose size is greater than dictionaryPageSize (512 in my
test), the encoding falls back to DELTA_BYTE_ARRAY, as seen below:
{noformat}
Dec 17, 2014 3:41:10 PM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 12,125B for [flba_field] FIXED_LEN_BYTE_ARRAY: 5,000 values, 1,710B raw, 1,710B comp, 5 pages, encodings: [DELTA_BYTE_ARRAY]
{noformat}
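For reference, here is a minimal sketch of the kind of writer setup that reproduces this. The FLBA length (1024), the output path, and the use of the example Group API are illustrative choices for the sketch; any length larger than the 512-byte dictionary page size behaves the same way:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import parquet.column.ParquetProperties.WriterVersion;
import parquet.example.data.Group;
import parquet.example.data.simple.SimpleGroupFactory;
import parquet.hadoop.ParquetWriter;
import parquet.hadoop.example.GroupWriteSupport;
import parquet.hadoop.metadata.CompressionCodecName;
import parquet.io.api.Binary;
import parquet.schema.MessageType;
import parquet.schema.MessageTypeParser;

public class FlbaWriteRepro {
  public static void main(String[] args) throws Exception {
    // The FLBA length (1024 here, an illustrative value) only needs to exceed
    // the dictionary page size (512) to trigger the fallback.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message test { required fixed_len_byte_array(1024) flba_field; }");

    Configuration conf = new Configuration();
    GroupWriteSupport.setSchema(schema, conf);

    ParquetWriter<Group> writer = new ParquetWriter<Group>(
        new Path("/tmp/flba.parquet"),
        new GroupWriteSupport(),
        CompressionCodecName.UNCOMPRESSED,
        ParquetWriter.DEFAULT_BLOCK_SIZE,
        1024,                       // page size
        512,                        // dictionaryPageSize, smaller than the FLBA length
        true,                       // enable dictionary encoding
        false,                      // no validation
        WriterVersion.PARQUET_2_0,  // v2 pages, matching the DataPageV2 frames below
        conf);

    SimpleGroupFactory factory = new SimpleGroupFactory(schema);
    byte[] value = new byte[1024];  // a constant value, so a dictionary would easily fit
    for (int i = 0; i < 5000; i++) {
      writer.write(factory.newGroup().append("flba_field", Binary.fromByteArray(value)));
    }
    writer.close();
  }
}
{code}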
But then the read fails with the following exception:
{noformat}
Caused by: parquet.io.ParquetDecodingException: Encoding DELTA_BYTE_ARRAY is only supported for type BINARY
    at parquet.column.Encoding$7.getValuesReader(Encoding.java:193)
    at parquet.column.impl.ColumnReaderImpl.initDataReader(ColumnReaderImpl.java:534)
    at parquet.column.impl.ColumnReaderImpl.readPageV2(ColumnReaderImpl.java:574)
    at parquet.column.impl.ColumnReaderImpl.access$400(ColumnReaderImpl.java:54)
    at parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:518)
    at parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:510)
    at parquet.column.page.DataPageV2.accept(DataPageV2.java:123)
    at parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:510)
    at parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:502)
    at parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:604)
    at parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:348)
    at parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
    at parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
    at parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:267)
    at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
    at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
    at parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
    at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
    at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:129)
    at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:198)
    ... 16 more
{noformat}
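The trace comes out of the normal record-reading path, so simply reading the file back is enough to hit it. A sketch using the example Group read support (the path and class name carry over from the write sketch above):

{code:java}
import org.apache.hadoop.fs.Path;

import parquet.example.data.Group;
import parquet.hadoop.ParquetReader;
import parquet.hadoop.example.GroupReadSupport;

public class FlbaReadRepro {
  public static void main(String[] args) throws Exception {
    // Reading the first page of the DELTA_BYTE_ARRAY-encoded FLBA column
    // throws the ParquetDecodingException shown in the stack trace above.
    ParquetReader<Group> reader =
        new ParquetReader<Group>(new Path("/tmp/flba.parquet"), new GroupReadSupport());
    Group g;
    while ((g = reader.read()) != null) {
      // never reached for the file written above
    }
    reader.close();
  }
}
{code}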
When the array's size is less than dictionaryPageSize, RLE_DICTIONARY encoding is
used and the read works fine:
{noformat}
Dec 17, 2014 3:39:50 PM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 50B for [flba_field] FIXED_LEN_BYTE_ARRAY: 5,000 values, 3B raw, 3B comp, 1 pages, encodings: [RLE_DICTIONARY, PLAIN], dic { 1 entries, 8B raw, 1B comp}
{noformat}
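For what it's worth, the root cause looks like a writer/reader mismatch: the v2 writer's dictionary fallback for FIXED_LEN_BYTE_ARRAY is DELTA_BYTE_ARRAY, while the reader-side check at Encoding.java:193 admits that encoding only for BINARY. A paraphrase of the guard (reconstructed from the exception message, not copied from the source):

{code:java}
import parquet.column.ColumnDescriptor;
import parquet.column.values.ValuesReader;
import parquet.column.values.deltastrings.DeltaByteArrayReader;
import parquet.io.ParquetDecodingException;
import parquet.schema.PrimitiveType.PrimitiveTypeName;

// Paraphrase of the DELTA_BYTE_ARRAY reader lookup: any FLBA column that took
// the DELTA_BYTE_ARRAY fallback on the write side becomes unreadable here.
ValuesReader getValuesReader(ColumnDescriptor descriptor) {
  if (descriptor.getType() != PrimitiveTypeName.BINARY) {
    throw new ParquetDecodingException(
        "Encoding DELTA_BYTE_ARRAY is only supported for type BINARY");
  }
  return new DeltaByteArrayReader();
}
{code}

So it seems either the reader should accept DELTA_BYTE_ARRAY for FIXED_LEN_BYTE_ARRAY, or the writer's dictionary fallback for FLBA should be an encoding the reader supports, such as PLAIN.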
> Encoding issue with fixed length byte arrays
> --------------------------------------------
>
> Key: PARQUET-152
> URL: https://issues.apache.org/jira/browse/PARQUET-152
> Project: Parquet
> Issue Type: Bug
> Reporter: Nezih Yigitbasi
> Priority: Minor
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)