That has to be an integer wraparound... something is using a signed 32-bit int for
the file position, so once it goes above 2GB (2^31 - 1) it wraps negative, and a
seek to a negative offset is rejected.
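
For illustration, a quick sketch (not the actual parquet-mr code; the chunk size
and loop count are made up) of how an int accumulator wraps into the kind of
negative startingPos values in your listing, while a long stays correct:

public class OffsetWraparoundDemo {
    public static void main(String[] args) {
        int chunkSize = 108_154_831;   // roughly one row group from your listing, made up for the demo

        int intPos = 0;    // 32-bit accumulator: overflows past 2^31 - 1
        long longPos = 0L; // 64-bit accumulator: stays correct

        for (int i = 0; i < 20; i++) {
            intPos += chunkSize;   // silently wraps negative around the 20th row group
            longPos += chunkSize;
        }
        System.out.println("int position : " + intPos);   // negative, resembling the startingPos values above
        System.out.println("long position: " + longPos);  // a bit over 2GB, as expected
    }
}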

fix: find the variable holding that offset and change it to a long
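
To check which of your output files are affected without a full read, something
like this should dump the same row-group offsets straight from the footer
(untested sketch; the class name and main are mine):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class DumpRowGroups {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);  // path to one parquet file
        try (ParquetFileReader reader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
            // a negative startingPos here reproduces what you printed from reader.getRowGroups
            for (BlockMetaData block : reader.getRowGroups()) {
                System.out.println("startingPos=" + block.getStartingPos()
                    + ", totalBytesSize=" + block.getTotalByteSize()
                    + ", rowCount=" + block.getRowCount());
            }
        }
    }
}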



On Thu, 4 Aug 2022 at 11:09, Jozef Vilcek <[email protected]> wrote:

> I came across a case where a job writes out a data set in Parquet format
> and it cannot be read back, as it appears to be corrupted.
>
> Files fail to read back if their size goes over 2GB. If I set the job
> to produce more, smaller files from exactly the same input, all is good.
>
> The job writes Avro messages to Parquet via `parquet-avro` and `parquet-mr`.
> It happens with both v1.10.1 and v1.12.0.
>
> Read error is:
>
> Cannot seek to negative offset
> java.io.EOFException: Cannot seek to negative offset
> at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1454)
> at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
> at org.apache.parquet.hadoop.util.H2SeekableInputStream.seek(H2SeekableInputStream.java:60)
> at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1157)
> at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
>
> When digging a bit into the read path, the code materializing
> `ColumnChunkMetaData` here [1] starts to see negative values for
> `firstDataPage`. Printing some info from `reader.getRowGroups()` yields:
>
>
> startingPos=4, totalBytesSize=519551822, rowCount=2300100
> startingPos=108156606, totalBytesSize=517597985, rowCount=2300100
> ...
> startingPos=1950017569, totalBytesSize=511705703, rowCount=2300100
> startingPos=2058233752, totalBytesSize=521762439, rowCount=2300100
> startingPos=-2128348908, totalBytesSize=508570588, rowCount=2300100
> startingPos=-2020294298, totalBytesSize=518901187, rowCount=2300100
> startingPos=-1911848035, totalBytesSize=512724804, rowCount=2300100
> startingPos=-1803573306, totalBytesSize=510980877, rowCount=2300100
> startingPos=-1695543557, totalBytesSize=525871692, rowCount=2300100
> startingPos=-1587016600, totalBytesSize=519353830, rowCount=2300100
> startingPos=-1478696427, totalBytesSize=451032173, rowCount=2090372
>
>
>
> Unfortunately, I was not able to reproduce it locally by taking the Avro
> schema, generating random inputs, and writing them out to a local file.
> Every time, compressed or uncompressed, a 3GB file read back correctly.
>
> I am looking for help in finding a solution, or hints for debugging this,
> as I am out of clues to pinpoint and reproduce the problem.
>
> Thanks!
>
> [1]
>
> https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ColumnChunkMetaData.java#L127
>
