Hi there, I have a problem with a Spark Streaming job running on Spark 1.4.1 that appends to a Parquet table.
My job receives JSON strings and creates a JSON RDD out of them. The JSON documents may come in different shapes, since most of the fields are optional, but they never have conflicting schemas. Then, for each non-empty RDD, I save it to Parquet files, appending to the existing table:

  jsonRdd.write.mode(SaveMode.Append).parquet(dataDirPath)

Unfortunately, on every append I now hit a metadata merge conflict:

  Aug 9, 2015 7:58:03 AM WARNING: parquet.hadoop.ParquetOutputCommitter: could not write summary file for hdfs://example.com:8020/tmp/parquet
  java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: [{...schema1...}, {...schema2...} ]

The schemas are very similar, some attributes may be missing compared to others, but they are definitely not conflicting. They are pretty lengthy, but I compared them with diff and made sure there are no conflicts.

Even with this WARNING the write actually succeeds and I'm able to read the data back. But on every batch, yet another schema shows up in the displayed "conflicting values" array. I would like the job to run forever, so I can't simply ignore this warning, because it will probably end in an OOM.

Do you know what might be the reason for this error/warning? How can I overcome it? Maybe it is a Spark bug/regression? I saw tickets like SPARK-6010 <https://issues.apache.org/jira/browse/SPARK-6010>, but they seem to have been fixed in 1.3.0 (I'm using 1.4.1).

Thanks for any help!
Krzysiek
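
PS. For reference, here is roughly what the relevant part of the job looks like. This is a simplified sketch, not the exact code: jsonStream and dataDirPath stand in for my actual stream and output path, and I use sqlContext.read.json for the JSON RDD step.

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{SQLContext, SaveMode}
  import org.apache.spark.streaming.dstream.DStream

  // jsonStream: DStream[String] carrying the raw JSON documents;
  // dataDirPath points at the existing Parquet table directory.
  def appendBatches(jsonStream: DStream[String],
                    sqlContext: SQLContext,
                    dataDirPath: String): Unit = {
    jsonStream.foreachRDD { rdd: RDD[String] =>
      if (!rdd.isEmpty()) {
        // Schema is inferred per micro-batch, so with optional
        // fields it can differ slightly from batch to batch.
        val df = sqlContext.read.json(rdd)
        // Append this micro-batch to the existing Parquet table.
        df.write.mode(SaveMode.Append).parquet(dataDirPath)
      }
    }
  }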