Kristoffer Sjögren created PARQUET-112:
------------------------------------------
Summary: RunLengthBitPackingHybridDecoder: Reading past RLE/BitPacking stream.
Key: PARQUET-112
URL: https://issues.apache.org/jira/browse/PARQUET-112
Project: Parquet
Issue Type: Bug
Components: parquet-mr
Environment: Java 1.7 Linux Debian
Reporter: Kristoffer Sjögren
I am using Avro and Crunch 0.11 to write data into Hadoop CDH 4.6 in Parquet
format. This works fine for a few gigabytes but blows up in the
RunLengthBitPackingHybridDecoder when reading a few thousand gigabytes.
parquet.io.ParquetDecodingException: Can not read value at 19453 in block 0 in file hdfs://nn-ix01.se-ix.delta.prod:8020/user/stoffe/parquet/dogfight/2014/09/29/part-m-00153.snappy.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:177)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
at org.apache.crunch.impl.mr.run.CrunchRecordReader.nextKeyValue(CrunchRecordReader.java:157)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: parquet.io.ParquetDecodingException: Can't read value in column [action] BINARY at value 697332 out of 872236, 96921 out of 96921 in currentPage. repetition level: 0, definition level: 1
at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:466)
at parquet.column.impl.ColumnReaderImpl.getBinary(ColumnReaderImpl.java:414)
at parquet.filter.ColumnPredicates$1.apply(ColumnPredicates.java:64)
at parquet.filter.ColumnRecordFilter.isMatch(ColumnRecordFilter.java:69)
at parquet.io.FilteredRecordReader.skipToMatch(FilteredRecordReader.java:71)
at parquet.io.FilteredRecordReader.read(FilteredRecordReader.java:57)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:173)
... 13 more
Caused by: java.lang.IllegalArgumentException: Reading past RLE/BitPacking stream.
at parquet.Preconditions.checkArgument(Preconditions.java:47)
at parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:80)
at parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:62)
at parquet.column.values.dictionary.DictionaryValuesReader.readBytes(DictionaryValuesReader.java:73)
at parquet.column.impl.ColumnReaderImpl$2$7.read(ColumnReaderImpl.java:311)
at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462)
... 19 more
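For context on the root cause: the innermost exception fires when the dictionary indices for a page request more values than the encoded RLE/bit-packed stream actually contains. A minimal sketch of that decode loop is below. This is a hypothetical simplification, not the parquet-mr code: the class `RleSketch` and its methods are invented for illustration, and only the RLE branch of the hybrid format is shown (bit-packed runs are omitted).

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of Parquet's RLE/bit-packed hybrid decoding.
// Each run starts with a ULEB128 header: if the low bit is 0 the run is
// RLE (count = header >>> 1, one fixed-width value follows); if it is 1
// the run is bit-packed. Only the RLE branch is sketched here.
public class RleSketch {

    // Decode `total` values of width `bitWidth` (assumed <= 8 here, so each
    // RLE value fits in one byte). Mirrors the failure in this report: if the
    // page claims more values than the stream holds, the precondition fails.
    public static List<Integer> decode(byte[] data, int total, int bitWidth) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        List<Integer> out = new ArrayList<>();
        while (out.size() < total) {
            if (in.available() <= 0) {
                // The check that surfaces as
                // "Reading past RLE/BitPacking stream." in the stack trace.
                throw new IllegalArgumentException(
                        "Reading past RLE/BitPacking stream.");
            }
            int header = readUnsignedVarInt(in);
            if ((header & 1) == 0) {      // RLE run
                int count = header >>> 1;
                int value = in.read();    // bitWidth <= 8: one byte per value
                for (int i = 0; i < count && out.size() < total; i++) {
                    out.add(value);
                }
            } else {
                throw new UnsupportedOperationException(
                        "bit-packed runs omitted in this sketch");
            }
        }
        return out;
    }

    // ULEB128: 7 payload bits per byte, high bit set on continuation bytes.
    private static int readUnsignedVarInt(ByteArrayInputStream in) {
        int value = 0, shift = 0, b;
        while (((b = in.read()) & 0x80) != 0) {
            value |= (b & 0x7F) << shift;
            shift += 7;
        }
        return value | (b << shift);
    }
}
```

With input {0x06, 0x05} (header 6 = RLE run of 3, value 5), decoding 3 values succeeds; asking the same two bytes for a 4th value reproduces the "Reading past RLE/BitPacking stream." failure mode seen above.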
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)