[ https://issues.apache.org/jira/browse/SPARK-5968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-5968:
------------------------------
    Description: 
This may happen in the case of schema evolution, namely when appending new Parquet 
data with a different but compatible schema to existing Parquet files:
{code}
15/02/23 23:29:24 WARN ParquetOutputCommitter: could not write summary file for rankings
parquet.io.ParquetEncodingException: file:/Users/matei/workspace/apache-spark/rankings/part-r-00001.parquet invalid: all the files must be contained in the root rankings
at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
{code}
The reason is that the Spark SQL schemas stored in the Parquet key-value metadata 
of the individual files differ. Parquet doesn't know how to "merge" this opaque 
user-defined metadata, so it throws an exception and gives up writing the summary 
file. Since the Parquet data source in Spark 1.3.0 supports schema merging at read 
time, the failure is harmless, but the warning looks scary to the user. We should 
suppress it through the logger.
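
For context, a minimal sketch of how this situation can arise in a Spark 1.3 
spark-shell (hypothetical path and column names; {{sc}} and {{sqlContext}} are 
predefined by the shell):
{code}
import org.apache.spark.sql.SaveMode
import sqlContext.implicits._

// Initial write: the files carry a Spark SQL schema (id: Int) in their
// Parquet key-value metadata.
sc.parallelize(1 to 10).map(Tuple1(_)).toDF("id")
  .save("/tmp/rankings", "parquet", SaveMode.ErrorIfExists)

// Append with the different but compatible schema (id: Int, rank: String).
// The footers now carry different Spark SQL schemas, so
// ParquetOutputCommitter cannot merge them and logs the warning above.
sc.parallelize(1 to 10).map(i => (i, i.toString)).toDF("id", "rank")
  .save("/tmp/rankings", "parquet", SaveMode.Append)

// Reading back still works, because the data source merges the schemas:
sqlContext.load("/tmp/rankings", "parquet").printSchema()
{code}
Until Spark suppresses this itself, users can hide the warning from their own 
log4j configuration; a workaround sketch, assuming the WARN line above means the 
message reaches log4j under the committer's class name:
{code}
# conf/log4j.properties: silence the summary-file warning
log4j.logger.parquet.hadoop.ParquetOutputCommitter=ERROR
{code}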

  was:
{code}
15/02/23 23:29:24 WARN ParquetOutputCommitter: could not write summary file for rankings
parquet.io.ParquetEncodingException: file:/Users/matei/workspace/apache-spark/rankings/part-r-00001.parquet invalid: all the files must be contained in the root rankings
at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
{code}

It is only a warning, but it looks scary to the user. We should try to 
suppress it through the logger. 


> Parquet warning in spark-shell
> ------------------------------
>
>                 Key: SPARK-5968
>                 URL: https://issues.apache.org/jira/browse/SPARK-5968
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Michael Armbrust
>            Assignee: Cheng Lian
>            Priority: Critical
>
> This may happen in the case of schema evolution, namely when appending new 
> Parquet data with a different but compatible schema to existing Parquet files:
> {code}
> 15/02/23 23:29:24 WARN ParquetOutputCommitter: could not write summary file for rankings
> parquet.io.ParquetEncodingException: file:/Users/matei/workspace/apache-spark/rankings/part-r-00001.parquet invalid: all the files must be contained in the root rankings
> at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
> at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
> at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
> {code}
> The reason is that the Spark SQL schemas stored in the Parquet key-value 
> metadata of the individual files differ. Parquet doesn't know how to "merge" 
> this opaque user-defined metadata, so it throws an exception and gives up 
> writing the summary file. Since the Parquet data source in Spark 1.3.0 
> supports schema merging at read time, the failure is harmless, but the 
> warning looks scary to the user. We should suppress it through the logger. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
