Dong Jiang created PARQUET-1203:
-----------------------------------

             Summary: Corrupted parquet file from Spark
                 Key: PARQUET-1203
                 URL: https://issues.apache.org/jira/browse/PARQUET-1203
             Project: Parquet
          Issue Type: Bug
         Environment: Spark 2.2.1
            Reporter: Dong Jiang


Hi, 

We are running Spark 2.2.1, generating Parquet files on S3 with code along the lines of the 
following pseudo code: 
df.write.parquet(...) 
We have recently noticed Parquet file corruption when reading the files back 
in Spark or Presto. I downloaded a corrupted file from S3 and got the following 
errors when reading it in Spark: 

Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read 
value at 40870 in block 0 in file 
file:/Users/djiang/part-00122-80f4886a-75ce-42fa-b78f-4af35426f434.c000.snappy.parquet
 

Caused by: org.apache.parquet.io.ParquetDecodingException: could not read 
page Page [bytes.size=1048594, valueCount=43663, uncompressedSize=1048594] 
in col [incoming_aliases_array, list, element, key_value, value] BINARY 

It appears that only one column in a single row of the file is corrupt; the 
file has 111041 rows in total. 

My questions are: 
1) How can I identify the corrupted row? 
2) What could cause the corruption: a Spark issue or a Parquet issue? 
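For question 1, one approach I am considering is a binary search over row ranges: repeatedly try to read half of the suspect range and see whether the failure reproduces. A minimal sketch, assuming a single corrupt row and a hypothetical `can_read(start, count)` helper that attempts to materialize rows `[start, start + count)` (e.g. via a row-range read in Spark) and returns True on success:

```python
def find_corrupt_row(total_rows, can_read):
    """Binary-search for the single row whose read fails.

    can_read(start, count) is a user-supplied predicate that attempts
    to read rows [start, start + count) and returns True on success.
    Assumes exactly one corrupt row in [0, total_rows).
    """
    lo, hi = 0, total_rows  # invariant: the corrupt row lies in [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if can_read(lo, mid - lo):
            lo = mid  # lower half reads fine; corruption is in [mid, hi)
        else:
            hi = mid  # failure reproduces in the lower half [lo, mid)
    return lo
```

With this, each probe halves the search space, so even a file with 111041 rows needs only about 17 read attempts to pin down the bad row.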

Any help is greatly appreciated. 

Thanks, 

Dong 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
