[
https://issues.apache.org/jira/browse/PARQUET-171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jeffrey Olchovy updated PARQUET-171:
------------------------------------
Description:
Given multiple "different-yet-compatible" Avro-backed Parquet files, a runtime
exception will be encountered when trying to merge the metadata values across
the files if they are used as input sources for a MapReduce job.
A contrived example of this problem is provided, along with a derived version
of {{AvroReadSupport}} that can correctly handle valid schema
resolution/evolution scenarios.
*Illustration of Problem*
A simple Avro schema exists, which contains a single record type that consists
of a required String member.
{noformat}
{"type":"record","name":"Foo","namespace":"com.tapad.avro","fields":[{"name":"my_field","type":{"type":"string","avro.java.string":"String"}}]}
{noformat}
When stored as Parquet-Avro the resulting schema is:
{noformat}
message com.tapad.avro.Foo {
  required binary my_field (UTF8);
}
{noformat}
Data is written to a Parquet-Avro file with the following contents:
{noformat}
my_field = aaa
my_field = bbb
{noformat}
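For context, the following is a minimal sketch of how such a file might be written with {{AvroParquetWriter}} and Avro's generic API. The class name and file name are illustrative, and the schema string is abbreviated to a plain {{string}} type (the {{avro.java.string}} property is omitted; it does not affect the example):
{code:java}
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import parquet.avro.AvroParquetWriter;

public class WriteFooV1 {
  // Abbreviated form of the original Foo schema.
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Foo\",\"namespace\":\"com.tapad.avro\","
      + "\"fields\":[{\"name\":\"my_field\",\"type\":\"string\"}]}");

  public static void main(String[] args) throws Exception {
    AvroParquetWriter<GenericRecord> writer =
        new AvroParquetWriter<GenericRecord>(new Path("foo-v1.parquet"), SCHEMA);
    try {
      for (String value : new String[] { "aaa", "bbb" }) {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("my_field", value);
        writer.write(record);
      }
    } finally {
      writer.close();
    }
  }
}
{code}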
The schema for the Foo record is then changed so that its String member is made
optional, with a default value of null.
{noformat}
{"type":"record","name":"Foo","namespace":"com.tapad.avro","fields":[{"name":"my_field","type":["null",{"type":"string","avro.java.string":"String"}],"default":null}]}
{noformat}
When stored as Parquet-Avro the resulting schema is now:
{noformat}
message com.tapad.avro.Foo {
  optional binary my_field (UTF8);
}
{noformat}
This change adheres to the Avro Schema Resolution rules described at
http://avro.apache.org/docs/current/spec.html#Schema+Resolution.
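As a sanity check, Avro alone resolves data written with the old schema against the new one. A minimal, self-contained sketch using Avro's generic encoder/decoder API (schema strings abbreviated as above):
{code:java}
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class ResolutionDemo {
  public static void main(String[] args) throws Exception {
    Schema writerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Foo\",\"namespace\":\"com.tapad.avro\","
        + "\"fields\":[{\"name\":\"my_field\",\"type\":\"string\"}]}");
    Schema readerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Foo\",\"namespace\":\"com.tapad.avro\","
        + "\"fields\":[{\"name\":\"my_field\",\"type\":[\"null\",\"string\"],"
        + "\"default\":null}]}");

    // Encode a record with the old (writer) schema.
    GenericRecord old = new GenericData.Record(writerSchema);
    old.put("my_field", "aaa");
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writerSchema).write(old, encoder);
    encoder.flush();

    // Decode it with the new (reader) schema; Avro resolves the change
    // from "string" to ["null", "string"] automatically.
    GenericDatumReader<GenericRecord> reader =
        new GenericDatumReader<GenericRecord>(writerSchema, readerSchema);
    GenericRecord resolved = reader.read(null,
        DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
    System.out.println(resolved); // {"my_field": "aaa"}
  }
}
{code}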
Data is then written to a new Parquet-Avro file.
{noformat}
my_field = ccc
{noformat}
When both Parquet-Avro files are used as input to a MapReduce job, where the
schemas in the data files are treated as the "writer" schemas and the schema
on the job's classpath (in this case, the updated schema) is used as the
"reader" schema, the following {{RuntimeException}} is encountered:
{noformat}
Caused by: java.lang.RuntimeException: could not merge metadata: key avro.schema has conflicting values:
[{"type":"record","name":"Foo","namespace":"com.tapad.avro","fields":[{"name":"my_field","type":{"type":"string","avro.java.string":"String"}}]},
{"type":"record","name":"Foo","namespace":"com.tapad.avro","fields":[{"name":"my_field","type":["null",{"type":"string","avro.java.string":"String"}],"default":null}]}]
	at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67)
	at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84)
	at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:263)
	...
{noformat}
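For reference, the failure can be reproduced with a job configuration along the following lines. This is a sketch assuming Hadoop 2's {{Job}} API and {{AvroParquetInputFormat.setAvroReadSchema}} as available in parquet-mr of this era; file names and the class name are illustrative. The exception is thrown from {{getSplits()}}, before any map task runs, so mapper/reducer setup is elided:
{code:java}
import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import parquet.avro.AvroParquetInputFormat;

public class MergeFooJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "merge-foo");
    job.setJarByClass(MergeFooJob.class);
    job.setInputFormatClass(AvroParquetInputFormat.class);

    // The updated schema from the job's classpath is the "reader" schema.
    Schema readerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Foo\",\"namespace\":\"com.tapad.avro\","
        + "\"fields\":[{\"name\":\"my_field\",\"type\":[\"null\",\"string\"],"
        + "\"default\":null}]}");
    AvroParquetInputFormat.setAvroReadSchema(job, readerSchema);

    // Both files as input: getSplits() attempts to merge their avro.schema
    // metadata values and throws the RuntimeException shown above.
    FileInputFormat.addInputPath(job, new Path("foo-v1.parquet"));
    FileInputFormat.addInputPath(job, new Path("foo-v2.parquet"));
    FileOutputFormat.setOutputPath(job, new Path("out"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
{code}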
*Solution*
Each schema found in the data files (the "writer" schemas) should be checked
for compatibility with the "reader" schema. If all "writer" schemas are
compatible with the "reader" schema, then all records in all data files can be
migrated to the "reader" schema.
The Apache Avro library provides utilities for performing compatibility checks
between schemas, and the linked pull request contains a derived version of
{{AvroReadSupport}} that uses these utilities to successfully process the
records in the aforementioned data files when both are used as input to a
MapReduce job.
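The check itself might look like the following sketch, built on Avro's {{SchemaCompatibility}} utility; the class and method names here are illustrative, not the actual patch (see the pull request for the authoritative change):
{code:java}
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;
import org.apache.avro.SchemaCompatibility.SchemaPairCompatibility;

public class CompatibilityCheck {
  /**
   * Returns true only if every "writer" schema found in the data files
   * can be read with the given "reader" schema.
   */
  public static boolean allCompatible(Schema reader, List<Schema> writers) {
    for (Schema writer : writers) {
      SchemaPairCompatibility result =
          SchemaCompatibility.checkReaderWriterCompatibility(reader, writer);
      if (result.getType() != SchemaCompatibilityType.COMPATIBLE) {
        return false;
      }
    }
    return true;
  }
}
{code}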
-_NOTE: Solution will be provided as a hyperlink to a GitHub Pull Request_-
https://github.com/apache/incubator-parquet-mr/pull/107
> AvroReadSupport does not support Avro schema resolution
> -------------------------------------------------------
>
> Key: PARQUET-171
> URL: https://issues.apache.org/jira/browse/PARQUET-171
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Reporter: Jeffrey Olchovy
>