Re: Merge metadata error when appending to parquet table

2015-08-09 Thread Cheng Lian
The conflicting metadata values warning is a known issue:
https://issues.apache.org/jira/browse/PARQUET-194


The option "parquet.enable.summary-metadata" is a Hadoop option rather 
than a Spark option, so you need to either add it to your Hadoop 
configuration file(s) or add it via `sparkContext.hadoopConfiguration` 
before starting your job.
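
For example, something along these lines (just a sketch; the application 
name below is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("parquet-append-job"))
// This is a Hadoop-level option, so it has to go into the Hadoop
// configuration; sqlContext.setConf / SET won't reach it.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")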


Cheng

On 8/9/15 8:57 PM, Krzysztof Zarzycki wrote:
Besides finding a proper fix for this problem, I think I can work around 
at least the WARNING message by overriding the Parquet variable 
parquet.enable.summary-metadata.
According to the PARQUET-107 ticket 
(https://issues.apache.org/jira/browse/PARQUET-107), it can be used to 
disable writing the summary file, which is the issue here.

How can I set this variable? I tried:
sql.setConf("parquet.enable.summary-metadata", "false")
sql.sql("SET parquet.enable.summary-metadata=false")
as well as:
spark-submit --conf parquet.enable.summary-metadata=false

But none of them helped. Can anyone help? Of course, the original problem 
stays open.
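
In case it matters, the full spark-submit attempt looked roughly like this 
(the class and jar names are placeholders):

spark-submit \
  --conf parquet.enable.summary-metadata=false \
  --class com.example.StreamingJob \
  streaming-job.jar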

Thanks!
Krzysiek

2015-08-09 14:19 GMT+02:00 Krzysztof Zarzycki:


Hi there,
I have a problem with a Spark Streaming job running on Spark
1.4.1 that appends to a Parquet table.

My job receives JSON strings and creates a jsonRdd out of them. The
JSONs may come in different shapes, as most of the fields are
optional, but they never have conflicting schemas.
Next, I save each (non-empty) RDD to Parquet files, appending to the
existing table:

jsonRdd.write.mode(SaveMode.Append).parquet(dataDirPath)
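
The surrounding streaming loop looks roughly like this (jsonStream stands
for the DStream of JSON strings; the exact names are illustrative):

import org.apache.spark.sql.SaveMode

jsonStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // Infer the schema of this micro-batch and append it to the table.
    val jsonRdd = sqlContext.read.json(rdd)
    jsonRdd.write.mode(SaveMode.Append).parquet(dataDirPath)
  }
}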

Unfortunately, I'm now hitting a conflict issue on every append:

Aug 9, 2015 7:58:03 AM WARNING:
parquet.hadoop.ParquetOutputCommitter: could not write summary
file for hdfs://example.com:8020/tmp/parquet

java.lang.RuntimeException: could not merge metadata: key
org.apache.spark.sql.parquet.row.metadata has conflicting values:
[{...schema1...}, {...schema2...} ]

The schemas are very similar; some attributes may be missing compared
to others, but they are definitely not conflicting. They are pretty
lengthy, but I compared them with diff and made sure that there are
no conflicts.

Even with this WARNING, the write actually succeeds and I'm able to
read the data back. But with every batch, yet another schema is added
to the displayed "conflicting values" array. I would like the job to
run forever, so I can't simply ignore this warning, because it will
probably end in an OOM.
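
Reading the data back works along these lines (a sketch; Parquet schema
merging reconciles the per-file schemas and, as far as I know, is on by
default in 1.4, here requested explicitly):

// Merge the per-file schemas into one superset schema on read.
val df = sqlContext.read.option("mergeSchema", "true").parquet(dataDirPath)
df.printSchema()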

Do you know what might be the reason for this error/warning? How can
I overcome it? Maybe it is a Spark bug/regression? I saw tickets like
SPARK-6010 (https://issues.apache.org/jira/browse/SPARK-6010), but
they seem to have been fixed in 1.3.0 (I'm using 1.4.1).


Thanks for any help!
Krzysiek






