[jira] [Commented] (SPARK-30650) The parquet file written by spark often incurs corrupted footer and hence not readable

Hyukjin Kwon (Jira) Wed, 29 Jan 2020 17:29:22 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-30650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026369#comment-17026369
 ]


Hyukjin Kwon commented on SPARK-30650:
--------------------------------------

Spark versions before 2.3 are EOL. Can you verify whether similar issues occur 
in at least Spark 2.4.x? I am leaving this resolved until that is verified.

> The parquet file written by spark often incurs corrupted footer and hence not 
> readable 
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-30650
>                 URL: https://issues.apache.org/jira/browse/SPARK-30650
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager, Input/Output, Optimizer
>    Affects Versions: 1.6.1
>            Reporter: DILIP KUMAR MOHAPATRO
>            Priority: Major
>
> This issue is similar to an archived one,
> [https://mail-archives.apache.org/mod_mbox/spark-issues/201501.mbox/%3cjira.12767358.1421214067000.78480.1421214094...@atlassian.jira%3E]
> Parquet files written by Spark often have a corrupted footer and are 
> therefore not readable by Spark.
> The issue occurs more consistently as the granularity of a field increases, 
> i.e. when the redundancy of values in the dataset is reduced (more unique 
> values).
> Coalesce does not help here either. Spark automatically generates a certain 
> number of Parquet files, each with a size determined by Spark internals, 
> and a few of them are written with a corrupted footer. The write job 
> nevertheless ends with success status. 
> Here are a few examples:
> These are files (267.2 M each) generated by Spark 1.6.x. A few of them 
> have a corrupted footer and are not readable. This happens more 
> frequently when the input file size exceeds a certain limit, and the level 
> of redundancy of the data also matters: for the same file size, the lower 
> the redundancy, the higher the probability of a corrupted footer.
> As a result, in later iterations of the job, when those files have to be 
> read for processing, the read fails with
> Can not read value 0 in block n in file xxxx
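
One way to narrow down which of the written files are affected is to inspect 
the file trailer directly. The sketch below is not from the report and is 
not part of Spark; it only relies on the Parquet file layout, in which an 
unencrypted file starts and ends with the 4-byte magic PAR1 and stores the 
footer length as a 4-byte little-endian integer just before the trailing 
magic. The function name check_parquet_footer is hypothetical, and the 
check assumes the file has been copied to a local path. It is purely 
structural: it does not parse the footer metadata or any data pages, so a 
file that passes may still fail to read.

```python
import os
import struct

MAGIC = b"PAR1"  # 4-byte marker at both ends of an unencrypted Parquet file


def check_parquet_footer(path):
    """Return (ok, reason) from a structural check of the Parquet trailer.

    Hypothetical helper for illustration; it does not decode the footer.
    """
    size = os.path.getsize(path)
    # Smallest layout: leading magic + 4-byte footer length + trailing magic.
    if size < 12:
        return False, "file too small"
    with open(path, "rb") as f:
        if f.read(4) != MAGIC:
            return False, "missing leading magic"
        f.seek(size - 8)
        tail = f.read(8)
    if tail[4:] != MAGIC:
        return False, "missing trailing magic"
    # Footer length is stored little-endian right before the trailing magic.
    (footer_len,) = struct.unpack("<I", tail[:4])
    if footer_len == 0 or footer_len > size - 12:
        return False, "implausible footer length"
    return True, "ok"
```

Running this over each part file of one output directory would separate 
files whose trailer is truncated or overwritten from files whose trailer 
looks intact but whose contents still cannot be read.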



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
