Hi,

we have a feed-based distributed system, and we are facing the problem that
one particular feed sometimes produces a Parquet file on which further
processing fails with the following error message:


19/10/30 16:11:22 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
19/10/30 16:11:22 INFO compress.CodecPool: Got brand-new decompressor [.gz]

parquet.io.ParquetDecodingException: Can not read value at 290784 in block 1 in file hdfs://<dir redacted>/attempt_1569270420570_240216_m_000006_0.0.parquet.gz
        at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:241)
        at parquet.hadoop.ParquetReader.read(ParquetReader.java:125)
        at parquet.tools.command.CatCommand.execute(CatCommand.java:74)
        at parquet.tools.Main.main(Main.java:223)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:141)
Caused by: parquet.io.ParquetDecodingException: Can't read value in column [<column redacted>] BINARY at value 81099 out of 81099, 81099 out of 81099 in currentPage. repetition level: 0, definition level: 3
        at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:483)
        at parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:370)
        at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
        at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:216)
        ... 9 more
Caused by: java.lang.IllegalArgumentException: Reading past RLE/BitPacking stream.
        at parquet.Preconditions.checkArgument(Preconditions.java:55)
        at parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:82)
        at parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:64)
        at parquet.column.values.dictionary.DictionaryValuesReader.readBytes(DictionaryValuesReader.java:85)
        at parquet.column.impl.ColumnReaderImpl$2$6.read(ColumnReaderImpl.java:312)
        at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:464)
        ... 12 more
Can not read value at 290784 in block 1 in file hdfs://<dir redacted>/attempt_1569270420570_240216_m_000006_0.0.parquet.gz


If I use parquet-tools to inspect the problematic file, "cat" fails with the
error above while "dump" works. However, looking at the output of "cat"
before the error, a shift is visible, as in the following example:



Id=first id
Name=first name
Url=first url

Id=second id
Name=second name
Url=second url

*Id=third id*
*Name=fourth name*
*Url=fourth url*

Id=fourth id
Name=fifth name
Url=fifth url



Looking at the output of "dump", I also see that the number of values per
column is not always the same: one column has, for example, 100 values while
another has 99, and so on. This is not the case in valid files and is most
likely related to the shift above. We are using Parquet 1.5.0, and a version
update would be a major task.
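To flag such files automatically, one could extract the per-column value
counts from the parquet-tools metadata output and check that they all agree.
The sketch below does this; note that the "VC:" field and the sample layout
are my assumption about how our parquet-tools build prints per-column lines,
so the regex may need adjusting:

```python
import re

# Matches per-column lines such as "name: BINARY GZIP DO:0 FPO:4 SZ:10/20/2.0 VC:100"
# (assumed layout of the parquet-tools metadata output).
COLUMN_LINE = re.compile(r"^(?P<col>\S+?):\s+.*\bVC:(?P<vc>\d+)", re.MULTILINE)

def value_counts(meta_output):
    """Extract {column name: value count} from parquet-tools metadata text."""
    return {m.group("col"): int(m.group("vc"))
            for m in COLUMN_LINE.finditer(meta_output)}

def is_consistent(meta_output):
    """True if every column reports the same number of values."""
    counts = value_counts(meta_output)
    return len(set(counts.values())) <= 1

# Hypothetical output for a damaged file: one column is short by one value,
# like the 100-vs-99 case described above.
sample = """\
id:   BINARY GZIP DO:0 FPO:4 SZ:10/20/2.0 VC:100
name: BINARY GZIP DO:0 FPO:30 SZ:10/20/2.0 VC:99
url:  BINARY GZIP DO:0 FPO:60 SZ:10/20/2.0 VC:100
"""
print(value_counts(sample))   # {'id': 100, 'name': 99, 'url': 100}
print(is_consistent(sample))  # False
```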

Does anybody know what could cause such a shift or how we could avoid it?
As mentioned above, it only happens in one of our feeds and only sometimes.
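Until the root cause is found, our stop-gap idea is to validate every file
right after the writer closes it, before it is published to the feed, by
reading it back and checking the exit status. A minimal sketch, assuming
parquet-tools exits non-zero on the ParquetDecodingException above (worth
verifying first); the reader command is parameterised so any reader can be
plugged in:

```python
import subprocess

def file_is_readable(path, reader=("parquet-tools", "cat")):
    """Read the whole file back with `reader` and report success.

    Returns True only if the reader command exits with status 0,
    i.e. it decoded every record without throwing.
    """
    result = subprocess.run(
        list(reader) + [path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

# Usage: reject the file before it enters the feed.
# if not file_is_readable("attempt_..._m_000006_0.0.parquet.gz"):
#     raise RuntimeError("corrupt parquet file, not publishing")
```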

The problem sounds like https://issues.apache.org/jira/browse/PARQUET-112,
but I could not find a resolution there.

It also sounds a bit like https://issues.apache.org/jira/browse/PARQUET-511,
however INT64 values are not involved here as far as I know.

Any help is appreciated.

Cheers

Jan
