Michal Gorecki created PARQUET-511:
--------------------------------------

             Summary: Integer overflow on counting values in column
                 Key: PARQUET-511
                 URL: https://issues.apache.org/jira/browse/PARQUET-511
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
    Affects Versions: 1.8.1
            Reporter: Michal Gorecki
            Assignee: Michal Gorecki
            Priority: Critical


Parquet will ignore a column if the combined number of values in the column
exceeds the maximum value an int can hold.

The issue is that when the column reader is initialized and the repetition and
definition levels are initialized per column, the integer value counter
overflows, so these values are not set properly. Then, during the read, the
stored level does not match the reader's current level, and a null value is
returned instead of the real one. Since there is no overflow checking, no
exception is thrown, and it simply appears that the data is corrupted.
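To make the failure mode concrete, here is a minimal, self-contained sketch.
It is not the parquet-mr source; the page counts and the accumulator are
hypothetical. It shows how summing per-page value counts into an int silently
wraps once a row group holds more than Integer.MAX_VALUE values, while a long
accumulator stays correct:

{code:java}
public class ValueCountOverflowSketch {
    public static void main(String[] args) {
        // Hypothetical numbers: ~33,000 pages of 130,598 values each gives
        // roughly 4.3 billion values in one row group, as in this report.
        final long valuesPerPage = 130_598L;
        final long pages = 33_000L;

        int intTotal = 0;    // 32-bit accumulator, as in the buggy code path
        long longTotal = 0L; // 64-bit accumulator, the straightforward fix

        for (long i = 0; i < pages; i++) {
            intTotal += (int) valuesPerPage; // wraps silently past Integer.MAX_VALUE
            longTotal += valuesPerPage;
        }

        // The int total wraps around (here to a small positive number), so any
        // level/value bookkeeping keyed on it no longer matches the real count,
        // and reads resolve to null instead of failing fast.
        System.out.println("int total:  " + intTotal);  // wrapped, incorrect
        System.out.println("long total: " + longTotal); // correct count
    }
}
{code}

A fix along these lines would widen the counter to a long, or at least guard
the addition (e.g. with Math.addExact) so an overflowing count fails loudly
rather than silently producing nulls.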

This happened to us with a fairly complex schema: an array of maps whose
values contained arrays as well. One row group held over 4 billion values
across all of its column pages, which is what triggered the overflow.

Relevant stack trace:
org.apache.parquet.io.ParquetDecodingException: Can not read value at 172310 in block 0 in file <redacted>
        at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:245)
        at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
        ...
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1626)
        at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
        at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
        at org.apache.spark.scheduler.Task.run(Task.scala:70)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: <redacted> INT64 at value 95584934 out of 95530352, 130598 out of 130598 in currentPage. repetition level: 0, definition level: 2
        at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:484)
        at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:370)
        at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
        at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
        ... 18 more
Caused by: java.lang.IllegalArgumentException: Reading past RLE/BitPacking stream.
        at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
        at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:82)
        at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:64)
        at org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readLong(DictionaryValuesReader.java:121)
        at org.apache.parquet.column.impl.ColumnReaderImpl$2$4.read(ColumnReaderImpl.java:263)
        at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:464)
        ... 21 more



