Hi there,
I have a problem with a Spark Streaming job running on Spark 1.4.1 that
appends to a Parquet table.

My job receives JSON strings and creates a JsonRdd out of them. The JSON
documents may come in different shapes, since most of the fields are
optional, but they never have conflicting schemas.
Next, I save each (non-empty) RDD to Parquet files, appending to the
existing table:

jsonRdd.write.mode(SaveMode.Append).parquet(dataDirPath)
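
For context, the relevant part of the job, simplified (the names and the
exact read call below are just illustrative), looks roughly like this:

import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.spark.streaming.dstream.DStream

// jsonStream: DStream[String] coming from the receiver
def writeBatches(jsonStream: DStream[String], dataDirPath: String): Unit = {
  jsonStream.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      // infer the schema of this batch from the JSON strings
      val jsonRdd = sqlContext.read.json(rdd)
      // append the batch to the existing Parquet table
      jsonRdd.write.mode(SaveMode.Append).parquet(dataDirPath)
    }
  }
}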

Unfortunately, I'm now hitting this metadata conflict on every append:

Aug 9, 2015 7:58:03 AM WARNING: parquet.hadoop.ParquetOutputCommitter:
could not write summary file for hdfs://example.com:8020/tmp/parquet
java.lang.RuntimeException: could not merge metadata: key
org.apache.spark.sql.parquet.row.metadata has conflicting values:
[{...schema1...}, {...schema2...} ]

The schemas are very similar; some attributes may be missing in one
compared to the other, but they are definitely not conflicting. They are
pretty lengthy, but I compared them with diff and made sure there are no
conflicts.
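
To give an idea of the kind of difference I'm talking about, the later
schema typically just has an extra optional field. A made-up example (not
the real schemas):

import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

// two batch schemas that differ only by an additional nullable field
val schema1 = StructType(Seq(
  StructField("a", StringType, nullable = true)))
val schema2 = StructType(Seq(
  StructField("a", StringType, nullable = true),
  StructField("b", LongType, nullable = true)))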

Even with this WARNING the write actually succeeds, and I'm able to read
the data back. But with every batch, yet another schema shows up in the
displayed "conflicting values" array. I would like the job to run
indefinitely, so I can't simply ignore this warning, because it will
probably end in an OOM.
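
For what it's worth, reading the appended data back works fine, e.g.
something like:

val df = sqlContext.read.parquet(dataDirPath)
df.printSchema()  // the merged schema looks correct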

Do you know what the reason for this error/warning might be? How can I
overcome it? Could it be a Spark bug/regression? I saw tickets like
SPARK-6010 <https://issues.apache.org/jira/browse/SPARK-6010>, but they
seem to have been fixed in 1.3.0 (I'm using 1.4.1).
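
In case it matters, one workaround I was wondering about is turning off
the summary-file generation entirely, something along these lines (I'm not
sure whether parquet.enable.summary-metadata is the right knob here, or
whether it only hides the symptom):

// sc is the SparkContext; this should stop ParquetOutputCommitter from
// trying to write the _metadata summary files on each append
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")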


Thanks for any help!
Krzysiek
