DILIP KUMAR MOHAPATRO created SPARK-30650:
---------------------------------------------

             Summary: Parquet files written by Spark often have a corrupted 
footer and hence are not readable 
                 Key: SPARK-30650
                 URL: https://issues.apache.org/jira/browse/SPARK-30650
             Project: Spark
          Issue Type: Bug
          Components: Block Manager, Input/Output, Optimizer
    Affects Versions: 1.6.1
            Reporter: DILIP KUMAR MOHAPATRO


This issue is similar to an archived one:

[https://mail-archives.apache.org/mod_mbox/spark-issues/201501.mbox/%3cjira.12767358.1421214067000.78480.1421214094...@atlassian.jira%3E]

Parquet files written by Spark often end up with a corrupted footer and hence are 
not readable by Spark.

The issue shows up more consistently as the granularity of a field increases, i.e. 
when the redundancy of values in the dataset is reduced (more unique values).

Coalesce also doesn't help here. It generates a certain number of Parquet files, 
each with a size controlled by Spark internals, but a few of them are written with 
a corrupted footer. The writing job still ends with a success status.

Here are a few examples:

Spark 1.6.x generated files of 267.2 MB each, but a few of them are found with a 
corrupted footer and hence are not readable. This scenario happens more frequently 
when the input file size exceeds a certain limit; the level of redundancy of the 
data also matters. For the same file size, the lower the redundancy, the higher 
the probability of getting the footer corrupted.

Hence, in later iterations of the job, when those files need to be read for 
processing, the read ends up with
{{{color:#FF0000}*Can not read value 0 in block _n_ in file xxxx*{color}}}
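One way to confirm which output files actually have a damaged footer, before 
re-running the read job, is to open each footer directly with parquet-hadoop. This 
is only a diagnostic sketch under the assumption that the parquet-hadoop library 
bundled with Spark 1.6 is on the classpath; the output directory is hypothetical.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.parquet.hadoop.ParquetFileReader

import scala.util.{Failure, Success, Try}

object FooterCheck {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val dir  = new Path("hdfs:///data/output")          // hypothetical output path
    val fs   = FileSystem.get(dir.toUri, conf)

    // Try to read the footer of every part file; a corrupted footer throws here
    // even though the write job finished with a SUCCESS status.
    fs.listStatus(dir)
      .filter(_.getPath.getName.endsWith(".parquet"))
      .foreach { status =>
        Try(ParquetFileReader.readFooter(conf, status.getPath)) match {
          case Success(_) => println(s"OK        ${status.getPath}")
          case Failure(e) => println(s"CORRUPTED ${status.getPath}: ${e.getMessage}")
        }
      }
  }
}
{code}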


