Ryan Blue created PARQUET-82:
--------------------------------
Summary: ColumnChunkPageWriteStore assumes pages are smaller than Integer.MAX_VALUE
Key: PARQUET-82
URL: https://issues.apache.org/jira/browse/PARQUET-82
Project: Parquet
Issue Type: Bug
Components: parquet-mr
Reporter: Ryan Blue
Assignee: Ryan Blue
The ColumnChunkPageWriteStore casts both the compressed and uncompressed size of a page from a long to an int. If the uncompressed size of a page exceeds Integer.MAX_VALUE, the cast silently overflows to a negative value, so the write doesn't fail but it records bad metadata:
{code}
chunk1: BINARY GZIP DO:0 FPO:4 SZ:267184096/-2143335445/-8.02 VC:41 ENC:BIT_PACKED,PLAIN
{code}
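For illustration, here is a minimal standalone sketch of the narrowing cast (not the actual parquet-mr code). The size 2151631851 is recovered from the corrupted metadata above by adding 2^32 to -2143335445; any size over Integer.MAX_VALUE overflows the same way:
{code}
// Standalone sketch of the narrowing cast, not the parquet-mr code itself.
public class PageSizeCastDemo {
  public static void main(String[] args) {
    long uncompressedSize = 2151631851L;  // > Integer.MAX_VALUE (2147483647)
    int stored = (int) uncompressedSize;  // silent narrowing, no error at write time
    System.out.println(stored);           // prints -2143335445, matching SZ above
  }
}
{code}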
At read time, BytesInput tries to [allocate a byte array|https://github.com/apache/incubator-parquet-mr/blob/master/parquet-encoding/src/main/java/parquet/bytes/BytesInput.java#L200] for the uncompressed data and fails:
{code}
Caused by: parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://nameservice1/OUTPUT/part-m-00000.gz.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:177)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:95)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:66)
at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:51)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:65)
... 16 more
Caused by: java.lang.NegativeArraySizeException
at parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:183)
at parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:521)
at parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:493)
at parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:544)
at parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:339)
at parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
at parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
at parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:265)
at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:59)
at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:73)
at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:110)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
... 21 more
{code}
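The failure is straightforward to reproduce in isolation. A minimal sketch (again standalone, not the BytesInput code itself) that feeds the corrupted size into an array allocation hits the same exception:
{code}
// Standalone sketch, not BytesInput itself: allocating a byte array from the
// negative size read back from the metadata fails exactly as in the trace.
public class NegativeAllocDemo {
  public static void main(String[] args) {
    int uncompressedSize = -2143335445;         // the corrupted size from the metadata
    byte[] buffer = new byte[uncompressedSize]; // throws java.lang.NegativeArraySizeException
  }
}
{code}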