Jozef, feel free to open a Parquet JIRA with more details on the issue. Ideally the writer should recover on its own and produce a correct result, but I don't have enough context yet to know whether that's doable.
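For reference, a minimal sketch of the fail-fast guard Jozef proposes below: the writer would refuse to record a negative offset instead of silently writing a corrupt footer. This is only an illustration; `checkedPosition` is a hypothetical helper, not existing parquet-mr code:

```java
import java.io.IOException;
import org.apache.parquet.io.PositionOutputStream;

final class OffsetChecks {
  // Hypothetical guard: read the stream position and fail immediately if it is
  // negative (e.g. because a 32-bit position wrapped past 2GB), instead of
  // letting the negative value end up in the file footer.
  static long checkedPosition(PositionOutputStream out) throws IOException {
    long pos = out.getPos();
    if (pos < 0) {
      throw new IOException("PositionOutputStream reported a negative position: " + pos
          + "; refusing to continue, the resulting file would be unreadable");
    }
    return pos;
  }
}
```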
On Mon, Aug 8, 2022 at 1:53 AM Jozef Vilcek <[email protected]> wrote:
>
> Found it. The problem in my case was not with Parquet but with my
> implementation of the `OutputFile` wrapper providing `PositionOutputStream`.
>
> Would it make sense to change the writer to fail on negative offsets
> rather than continue and produce unreadable results?
>
> On Fri, Aug 5, 2022 at 8:42 PM Chao Sun <[email protected]> wrote:
> >
> > Seems the file was corrupted during write. There's a similar issue we
> > found recently: https://issues.apache.org/jira/browse/PARQUET-2164
> >
> > On Fri, Aug 5, 2022 at 3:40 AM Steve Loughran
> > <[email protected]> wrote:
> > >
> > > That has to be an integer wraparound... something is using a signed int
> > > for position, so when it goes above 2GB it goes negative, and a
> > > seek(negative value) is rejected.
> > >
> > > fix: find the variable and make it a long
> > >
> > > On Thu, 4 Aug 2022 at 11:09, Jozef Vilcek <[email protected]> wrote:
> > >
> > > > I came across a case where a job writes out a data set in Parquet
> > > > format and it cannot be read back, as it appears to be corrupted.
> > > >
> > > > Files fail to read back if their size goes over 2GB. If I set the job
> > > > to produce more, smaller files from exactly the same input, all is good.
> > > >
> > > > The job writes Avro messages to Parquet via `parquet-avro` and
> > > > `parquet-mr`. It happens with both v1.10.1 and v1.12.0.
> > > >
> > > > The read error is:
> > > >
> > > > Cannot seek to negative offset
> > > > java.io.EOFException: Cannot seek to negative offset
> > > > at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1454)
> > > > at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
> > > > at org.apache.parquet.hadoop.util.H2SeekableInputStream.seek(H2SeekableInputStream.java:60)
> > > > at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1157)
> > > > at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
> > > >
> > > > When digging into the read path, the code materializing
> > > > `ColumnChunkMetaData` here [1] starts to see negative values for
> > > > `firstDataPage`. Printing some info from `reader.getRowGroups` yields:
> > > >
> > > > startingPos=4, totalBytesSize=519551822, rowCount=2300100
> > > > startingPos=108156606, totalBytesSize=517597985, rowCount=2300100
> > > > ...
> > > > startingPos=1950017569, totalBytesSize=511705703, rowCount=2300100
> > > > startingPos=2058233752, totalBytesSize=521762439, rowCount=2300100
> > > > startingPos=-2128348908, totalBytesSize=508570588, rowCount=2300100
> > > > startingPos=-2020294298, totalBytesSize=518901187, rowCount=2300100
> > > > startingPos=-1911848035, totalBytesSize=512724804, rowCount=2300100
> > > > startingPos=-1803573306, totalBytesSize=510980877, rowCount=2300100
> > > > startingPos=-1695543557, totalBytesSize=525871692, rowCount=2300100
> > > > startingPos=-1587016600, totalBytesSize=519353830, rowCount=2300100
> > > > startingPos=-1478696427, totalBytesSize=451032173, rowCount=2090372
> > > >
> > > > Unfortunately, I was not able to reproduce it locally by taking the
> > > > Avro schema, generating random inputs, and writing them out to a local
> > > > file. Every time, compressed or uncompressed, a 3GB file read back
> > > > correctly.
> > > >
> > > > I am looking for help in finding a solution, or hints for debugging
> > > > this, as I am out of clues for pinpointing and reproducing the problem.
> > > >
> > > > Thanks!
> > > >
> > > > [1]
> > > > https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ColumnChunkMetaData.java#L127
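For anyone hitting a similar symptom, the row-group dump in Jozef's original report can be reproduced with parquet-mr's footer metadata API. A rough sketch; the file path is a placeholder taken from the command line:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class DumpRowGroups {
  public static void main(String[] args) throws Exception {
    // Placeholder: path to the suspect Parquet file, passed as the first argument.
    Path path = new Path(args[0]);
    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(path, new Configuration()))) {
      for (BlockMetaData block : reader.getRowGroups()) {
        // A negative startingPos in a >2GB file points at a 32-bit position
        // wrapping somewhere on the write path.
        System.out.println("startingPos=" + block.getStartingPos()
            + ", totalBytesSize=" + block.getTotalByteSize()
            + ", rowCount=" + block.getRowCount());
      }
    }
  }
}
```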

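For completeness, the root cause described at the top of the thread, an `OutputFile` wrapper whose `PositionOutputStream` tracked the write position in a 32-bit counter, goes away once the counter is a `long`. A minimal sketch of such a wrapper; `CountingPositionOutputStream` is a made-up name, not part of parquet-mr, and it assumes the underlying stream does not expose its own position:

```java
import java.io.IOException;
import java.io.OutputStream;
import org.apache.parquet.io.PositionOutputStream;

class CountingPositionOutputStream extends PositionOutputStream {
  private final OutputStream delegate;
  // Must be a long: an int wraps negative once more than 2^31 - 1 bytes (~2GB)
  // have been written, and the footer then records negative column chunk offsets.
  private long pos = 0;

  CountingPositionOutputStream(OutputStream delegate) {
    this.delegate = delegate;
  }

  @Override
  public long getPos() throws IOException {
    return pos;
  }

  @Override
  public void write(int b) throws IOException {
    delegate.write(b);
    pos += 1;
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    delegate.write(b, off, len);
    pos += len;
  }

  @Override
  public void flush() throws IOException {
    delegate.flush();
  }

  @Override
  public void close() throws IOException {
    delegate.close();
  }
}
```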