[ https://issues.apache.org/jira/browse/PARQUET-511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140121#comment-15140121 ]
Ryan Blue commented on PARQUET-511:
-----------------------------------

Thanks [~goreckim]! I'll have a look soon. I know we've also considered a maximum number of records to add to a row group for cases where compression is good enough that you never hit the max. That would work for a fix as well (if that's not what you did).

> Integer overflow on counting values in column
> ---------------------------------------------
>
>                 Key: PARQUET-511
>                 URL: https://issues.apache.org/jira/browse/PARQUET-511
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.8.1
>            Reporter: Michal Gorecki
>            Assignee: Michal Gorecki
>            Priority: Critical
>
> Parquet will ignore a column if the combined number of elements in the column
> is larger than the maximum value of an int.
> The issue is that as the column reader is initialized and the rep and def
> levels are initialized per column, the integer value count overflows,
> causing these values to not be set properly. Then, during read, the level
> will not match the current level of the reader, and a null value will be
> provided. Since there is no overflow checking, no exception is thrown, and it
> appears that the data is corrupted.
> This happened to us with a fairly complex schema, with an array of maps,
> which contained arrays as well. There were over 4 billion values in all
> column pages in one row group, which is what triggered the overflow.
> Relevant stack trace
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 172310 in block 0 in file <redacted>
>       at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:245)
>       at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
>       ...
>       at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
>       at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>       at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>       at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1626)
>       at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
>       at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
>       at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
>       at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>       at org.apache.spark.scheduler.Task.run(Task.scala:70)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: <redacted> INT64 at value 95584934 out of 95530352, 130598 out of 130598 in currentPage. repetition level: 0, definition level: 2
>       at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:484)
>       at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:370)
>       at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
>       at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
>       ... 18 more
> Caused by: java.lang.IllegalArgumentException: Reading past RLE/BitPacking stream.
>       at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>       at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:82)
>       at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:64)
>       at org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readLong(DictionaryValuesReader.java:121)
>       at org.apache.parquet.column.impl.ColumnReaderImpl$2$4.read(ColumnReaderImpl.java:263)
>       at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:464)
>       ... 21 more

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
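
For illustration only, the kind of silent wraparound described in the issue can be reproduced with plain Java arithmetic. This is a minimal sketch and not parquet-mr code: the class name ValueCountOverflowDemo, its methods, and the page counts are all hypothetical, chosen so that the total exceeds 4 billion values as in the report. It contrasts an int accumulator, a long accumulator, and a checked sum using Math.addExact.

```java
public class ValueCountOverflowDemo {

    // Sums per-page value counts into an int (hypothetical stand-in for a
    // per-column value counter). Java int arithmetic wraps silently.
    static int totalAsInt(int[] pageValueCounts) {
        int total = 0;
        for (int count : pageValueCounts) {
            total += count;
        }
        return total;
    }

    // Same sum with a long accumulator, which has room for the real total.
    static long totalAsLong(int[] pageValueCounts) {
        long total = 0L;
        for (int count : pageValueCounts) {
            total += count;
        }
        return total;
    }

    // Fail-fast variant: Math.addExact throws ArithmeticException instead of
    // wrapping, so the overflow surfaces as an error rather than bad data.
    static int totalChecked(int[] pageValueCounts) {
        int total = 0;
        for (int count : pageValueCounts) {
            total = Math.addExact(total, count);
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up page sizes: 3 x 1.5 billion = 4.5 billion values.
        int[] pages = {1_500_000_000, 1_500_000_000, 1_500_000_000};

        System.out.println("int total:  " + totalAsInt(pages));   // 205032704 (wrapped)
        System.out.println("long total: " + totalAsLong(pages));  // 4500000000

        try {
            totalChecked(pages);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected");
        }
    }
}
```

The wrapped int total looks like a plausible count, which is consistent with the report that no exception is thrown and the data merely appears corrupted. Whether the actual fix widens the counter, adds a check, or caps the number of records per row group is not stated in this message.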