[ https://issues.apache.org/jira/browse/PARQUET-511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140121#comment-15140121 ]

Ryan Blue commented on PARQUET-511:
-----------------------------------

Thanks [~goreckim]! I'll have a look soon. I know we've also considered a 
maximum number of records to add to a row group, for cases where compression 
is good enough that you never hit the maximum size. That would work as a fix 
as well (if that's not what you did).
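
A minimal sketch of that record-count cap idea (names and thresholds are 
hypothetical, not the actual parquet-mr writer code):

{code:java}
// Sketch only: flush a row group when either the buffered size or a
// record-count cap is reached. The cap bounds how many records a row
// group can hold even when compression keeps the size small.
class RowGroupWriter {
    private static final long TARGET_ROW_GROUP_BYTES = 128L * 1024 * 1024;
    private static final long MAX_ROW_GROUP_RECORDS = 1L << 30; // safely below 2^31

    private long bufferedBytes = 0;
    private long recordCount = 0;

    void write(byte[] record) {
        bufferedBytes += record.length;
        recordCount++;
        // The record cap catches the case where compression is good
        // enough that the size threshold alone is never reached.
        if (bufferedBytes >= TARGET_ROW_GROUP_BYTES
                || recordCount >= MAX_ROW_GROUP_RECORDS) {
            flush();
        }
    }

    private void flush() {
        // ... write out the buffered row group, then reset the counters ...
        bufferedBytes = 0;
        recordCount = 0;
    }
}
{code}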

> Integer overflow on counting values in column
> ---------------------------------------------
>
>                 Key: PARQUET-511
>                 URL: https://issues.apache.org/jira/browse/PARQUET-511
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.8.1
>            Reporter: Michal Gorecki
>            Assignee: Michal Gorecki
>            Priority: Critical
>
> Parquet will ignore a column if the combined number of elements in the 
> column exceeds the maximum value an int can hold.
>
> The issue is that when the column reader is initialized and the repetition 
> and definition levels are initialized per column, the integer counter 
> overflows, causing these values to not be set properly. Then, during the 
> read, the stored level no longer matches the reader's current level, and a 
> null value is returned instead. Since there is no overflow check, no 
> exception is thrown, and the data simply appears to be corrupted.
>
> This happened to us with a fairly complex schema: an array of maps, which 
> contained arrays as well. There were over 4 billion values across all 
> column pages in one row group, which is what triggered the overflow.
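>
> The wrap-around itself is ordinary Java int arithmetic; a minimal, 
> self-contained illustration (generic Java, not the actual parquet-mr 
> counter):
>
> {code:java}
> public class IntOverflowDemo {
>     public static void main(String[] args) {
>         int valueCount = Integer.MAX_VALUE; // 2147483647 values counted so far
>         valueCount++;                       // one more value wraps to negative
>         System.out.println(valueCount);     // prints -2147483648
>
>         long widened = (long) Integer.MAX_VALUE + 1; // counting with a long avoids the wrap
>         System.out.println(widened);        // prints 2147483648
>     }
> }
> {code}
>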
> Relevant stack trace:
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 172310 in block 0 in file <redacted>
>         at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:245)
>         at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
>         ...
>         at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
>         at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>         at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1626)
>         at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
>         at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
>         at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
>         at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>         at org.apache.spark.scheduler.Task.run(Task.scala:70)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: <redacted> INT64 at value 95584934 out of 95530352, 130598 out of 130598 in currentPage. repetition level: 0, definition level: 2
>         at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:484)
>         at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:370)
>         at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
>         at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
>         ... 18 more
> Caused by: java.lang.IllegalArgumentException: Reading past RLE/BitPacking stream.
>         at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>         at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:82)
>         at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:64)
>         at org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readLong(DictionaryValuesReader.java:121)
>         at org.apache.parquet.column.impl.ColumnReaderImpl$2$4.read(ColumnReaderImpl.java:263)
>         at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:464)
>         ... 21 more


