Hyukjin Kwon resolved SPARK-30650.
----------------------------------
    Resolution: Incomplete

> Parquet files written by Spark often have a corrupted footer and are therefore unreadable
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-30650
>                 URL: https://issues.apache.org/jira/browse/SPARK-30650
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager, Input/Output, Optimizer
>    Affects Versions: 1.6.1
>            Reporter: DILIP KUMAR MOHAPATRO
>            Priority: Major
>
> This issue is similar to an archived one:
> https://mail-archives.apache.org/mod_mbox/spark-issues/201501.mbox/%3cjira.12767358.1421214067000.78480.1421214094...@atlassian.jira%3E
>
> Parquet files written by Spark often have a corrupted footer and are therefore not readable by Spark.
>
> The issue occurs more consistently as the granularity of a field increases, i.e. when the redundancy of values in the dataset is reduced (more unique values).
>
> Coalesce does not help either. Spark automatically generates a certain number of Parquet files, each with a size determined by Spark internals, but a few of them are written with a corrupted footer, while the writing job still ends with a success status.
>
> For example: of the files (267.2 MB each) generated by Spark 1.6.x, a few have corrupted footers and are therefore unreadable. This happens more frequently when the input file size exceeds a certain limit; the level of redundancy in the data also matters. For the same file size, the lower the redundancy, the higher the probability of a corrupted footer.
> Hence, in later iterations of the job, when those files must be read for processing, the read fails with:
> {{{color:#FF0000}*Can not read value 0 in block _n_ in file xxxx*{color}}}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)