DILIP KUMAR MOHAPATRO created SPARK-30650:
---------------------------------------------
             Summary: The parquet file written by spark often incurs corrupted footer and hence not readable
                 Key: SPARK-30650
                 URL: https://issues.apache.org/jira/browse/SPARK-30650
             Project: Spark
          Issue Type: Bug
          Components: Block Manager, Input/Output, Optimizer
    Affects Versions: 1.6.1
            Reporter: DILIP KUMAR MOHAPATRO

This issue is similar to an archived one:
[https://mail-archives.apache.org/mod_mbox/spark-issues/201501.mbox/%3cjira.12767358.1421214067000.78480.1421214094...@atlassian.jira%3E]

The parquet file written by Spark often ends up with a corrupted footer and hence is not readable by Spark. The issue shows up more consistently as the granularity of a field increases, i.e. as the redundancy of values in the dataset decreases (more unique values). Coalesce does not help either: Spark automatically generates a certain number of parquet files, each with a size controlled by Spark internals, but a few of them are written with corrupted footers, and the writing job still ends with a success status.

Here is an example. Spark 1.6.x generated output files of 267.2 MB each, yet a few of them were found with corrupted footers and hence were not readable. The scenario happens more frequently when the input file size exceeds a certain limit, and the level of redundancy of the data also matters: for the same file size, the lower the redundancy, the higher the probability of a corrupted footer. Hence, in later iterations of the job, when those files must be read for processing, the read fails with

{{Can not read value 0 in block _n_ in file xxxx}}
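For context, a minimal sketch of the write/read pattern described above, against the Spark 1.6-era SQLContext API. The input path, output path, and partition count are hypothetical placeholders rather than values from the report; the point is that the write job completes with a success status while the later read of the same output is where the footer corruption surfaces.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch of the reported scenario on the Spark 1.6.x API.
// Paths and the partition count are hypothetical placeholders.
object ParquetFooterRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet-footer-repro"))
    val sqlContext = new SQLContext(sc)

    // Read a large input whose key field has high cardinality (low redundancy).
    val df = sqlContext.read.json("hdfs:///data/large_input")

    // Coalesce and write; per the report, the job ends with a success status
    // even when some of the emitted part files carry a corrupted footer.
    df.coalesce(32).write.parquet("hdfs:///data/output_parquet")

    // A later read of the same output is where the failure shows up:
    //   Can not read value 0 in block n in file ...
    val back = sqlContext.read.parquet("hdfs:///data/output_parquet")
    back.count()
  }
}
{code}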