Dong Jiang created PARQUET-1203:
-----------------------------------
Summary: Corrupted parquet file from Spark
Key: PARQUET-1203
URL: https://issues.apache.org/jira/browse/PARQUET-1203
Project: Parquet
Issue Type: Bug
Environment: Spark 2.2.1
Reporter: Dong Jiang
Hi,
We are running Spark 2.2.1, generating parquet files on S3 with pseudo code
like the following:
df.write.parquet(...)
We have recently noticed parquet file corruption when reading the files back
in Spark or Presto. I downloaded the corrupted file from S3 and got the
following errors in Spark:
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read
value at 40870 in block 0 in file
file:/Users/djiang/part-00122-80f4886a-75ce-42fa-b78f-4af35426f434.c000.snappy.parquet
Caused by: org.apache.parquet.io.ParquetDecodingException: could not read
page Page [bytes.size=1048594, valueCount=43663, uncompressedSize=1048594]
in col [incoming_aliases_array, list, element, key_value, value] BINARY
It appears that only one column in one of the rows is corrupt; the file has
111041 rows.
My questions are:
1) How can I identify the corrupted row?
2) What could cause the corruption? Is it a Spark issue or a Parquet issue?
Any help is greatly appreciated.
Thanks,
Dong
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)