[ https://issues.apache.org/jira/browse/PARQUET-980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007326#comment-16007326 ]

Cheng Lian edited comment on PARQUET-980 at 5/11/17 10:46 PM:
--------------------------------------------------------------

The current write path ensures that it never writes a single page larger than 
2GB, but the read path may read one or more column chunks, each consisting of 
multiple pages, into a single byte array (or {{ByteBuffer}}), which can be no 
larger than 2GB.
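
To make the failure mode concrete, here is a minimal sketch (not the actual {{ParquetFileReader}} code; the class name and numbers are made up) of what happens when the combined chunk length is cast to an {{int}} for a single array allocation:

{noformat}
// Minimal sketch of the failure mode, not the actual readAll() implementation:
// once the total length exceeds Integer.MAX_VALUE, the int cast used for the
// array allocation wraps around and the allocation fails.
public class TwoGBReadSketch {
  public static void main(String[] args) {
    long totalChunkBytes = 2_500_000_000L;  // a skewed column chunk of ~2.5GB
    int allocSize = (int) totalChunkBytes;  // wraps around to a negative int
    System.out.println(allocSize);          // prints -1794967296
    byte[] buffer = new byte[allocSize];    // throws NegativeArraySizeException
  }
}
{noformat}

This matches the {{NegativeArraySizeException}} thrown from {{ConsecutiveChunkList.readAll()}} in the stack trace quoted below.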

We hit this issue in production because the data distribution happened to be 
similar to the situation mentioned in the JIRA description and produced a 
skewed row group containing a column chunk larger than 2GB.

I think there are two separate issues to fix:

# On the write path, the strategy that dynamically adjusts the memory check interval needs some tweaking: the assumption that adjacent records have similar sizes is easily broken by skewed data (see the first sketch below).
# On the read path, the {{ConsecutiveChunkList.readAll()}} method should support reading data larger than 2GB, probably by splitting the read across multiple buffers (see the second sketch below).
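
For the write path, the tweak I have in mind looks roughly like the following. This is only a sketch with hypothetical names and constants, not the actual {{InternalParquetRecordWriter}} logic; the idea is to bound how far the check interval can grow, so that a run of small leading records cannot defer the next size check by thousands of records.

{noformat}
// Hypothetical sketch of a more conservative check-interval policy; field and
// constant names are made up and do not match InternalParquetRecordWriter.
class CheckIntervalSketch {
  private static final long MAX_CHECK_INTERVAL = 128; // never skip more than 128 records

  private long recordCount = 0;
  private long nextCheckRecord = 1;

  /** Called after every record; returns true when the buffered size should be re-measured. */
  boolean shouldCheckSize(long bufferedBytes, long rowGroupSizeThreshold) {
    recordCount++;
    if (recordCount < nextCheckRecord) {
      return false;
    }
    // Estimate how many more records fit based on the average size so far, but
    // cap the interval so the estimate cannot push the next check too far out.
    long avgRecordSize = Math.max(1, bufferedBytes / recordCount);
    long estimatedRemaining = (rowGroupSizeThreshold - bufferedBytes) / avgRecordSize;
    long interval = Math.min(Math.max(estimatedRemaining / 2, 1), MAX_CHECK_INTERVAL);
    nextCheckRecord = recordCount + interval;
    return true;
  }
}
{noformat}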
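
For the read path, "multiple buffers" means something along these lines. Again just a rough sketch rather than a patch against {{ConsecutiveChunkList.readAll()}}; it only shows how a long-sized range can be read into several {{ByteBuffer}}s, each staying under the 2GB array limit.

{noformat}
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not a patch against ConsecutiveChunkList.readAll():
// read totalLength bytes into several ByteBuffers, none larger than
// MAX_BUFFER_SIZE, so the total can safely exceed 2GB.
class MultiBufferReadSketch {
  private static final int MAX_BUFFER_SIZE = Integer.MAX_VALUE - 8;

  static List<ByteBuffer> readAll(InputStream in, long totalLength) throws IOException {
    List<ByteBuffer> buffers = new ArrayList<>();
    long remaining = totalLength;
    while (remaining > 0) {
      int size = (int) Math.min(remaining, MAX_BUFFER_SIZE);
      byte[] chunk = new byte[size];
      int offset = 0;
      while (offset < size) {
        int read = in.read(chunk, offset, size - offset);
        if (read < 0) {
          throw new EOFException("stream ended before " + totalLength + " bytes were read");
        }
        offset += read;
      }
      buffers.add(ByteBuffer.wrap(chunk));
      remaining -= size;
    }
    return buffers;
  }
}
{noformat}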

Another option is to ensure that no row group larger than 2GB can ever be 
written (a simple guard, sketched below). Thoughts?
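
If we went that route, the guard itself would be simple, something along the lines of the check below (hypothetical names, not actual parquet-mr code):

{noformat}
// Hypothetical hard cap on row group size; names are made up. The writer would
// force a flush before the buffered size can cross the 2GB limit that the read
// side cannot handle.
class RowGroupCapSketch {
  private static final long MAX_ROW_GROUP_BYTES = Integer.MAX_VALUE; // ~2GB

  /** True if the current row group must be flushed before writing the next record. */
  boolean mustFlushBeforeNextRecord(long bufferedBytes, long estimatedNextRecordBytes) {
    return bufferedBytes + estimatedNextRecordBytes > MAX_ROW_GROUP_BYTES;
  }
}
{noformat}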

BTW, the [parquet-python|https://github.com/jcrobak/parquet-python/] library 
can read this kind of malformed Parquet file successfully with [this 
patch|https://github.com/jcrobak/parquet-python/pull/56]. We used it to recover 
our data from the malformed Parquet file.


> Cannot read row group larger than 2GB
> -------------------------------------
>
>                 Key: PARQUET-980
>                 URL: https://issues.apache.org/jira/browse/PARQUET-980
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.8.0, 1.8.1, 1.8.2
>            Reporter: Herman van Hovell
>
> Parquet MR 1.8.2 does not support reading row groups which are larger than 2 GB.
> See: https://github.com/apache/parquet-mr/blob/parquet-1.8.x/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1064
> We are seeing this when writing skewed records. This throws off the 
> estimation of the memory check interval in the InternalParquetRecordWriter. 
> The following Spark code illustrates this:
> {noformat}
> /**
>  * Create a data frame that will make parquet write a file with a row group larger than 2 GB.
>  * Parquet only checks the size of the row group after writing a number of records. This number
>  * is based on the average row size of the already written records. This is problematic in the
>  * following scenario:
>  * - The initial (100) records in the row group are relatively small.
>  * - The InternalParquetRecordWriter checks if it needs to write to disk (it should not); it
>  *   assumes that the remaining records have a similar size, and (greatly) increases the check
>  *   interval (usually to 10000).
>  * - The remaining records are much larger than expected, making the row group larger than 2 GB
>  *   (which makes reading the row group impossible).
>  *
>  * The data frame below illustrates such a scenario. This creates a row group of approximately 4GB.
>  */
> val badDf = spark.range(0, 2200, 1, 1).mapPartitions { iterator =>
>   var i = 0
>   val random = new scala.util.Random(42)
>   val buffer = new Array[Char](750000)
>   iterator.map { id =>
>     // the first 200 records have a length of 1K and the remaining 2000 have a length of 750K.
>     val numChars = if (i < 200) 1000 else 750000
>     i += 1
>     // create a random array
>     var j = 0
>     while (j < numChars) {
>       // Generate a char (borrowed from scala.util.Random)
>       buffer(j) = (random.nextInt(0xD800 - 1) + 1).toChar
>       j += 1
>     }
>     // create a string: the string constructor will copy the buffer.
>     new String(buffer, 0, numChars)
>   }
> }
> badDf.write.parquet("somefile")
> val corruptedDf = spark.read.parquet("somefile")
> corruptedDf.select(count(lit(1)), max(length($"value"))).show()
> {noformat}
> The latter fails with the following exception:
> {noformat}
> java.lang.NegativeArraySizeException
>       at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1064)
>       at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:698)
> ...
> {noformat}
> This seems to be fixed by commit
> https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8
> in parquet 1.9.x. Is there any chance that we can fix this in 1.8.x?


