[
https://issues.apache.org/jira/browse/PARQUET-171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jeffrey Olchovy updated PARQUET-171:
------------------------------------
Description:
Given multiple "different-yet-compatible" Avro-backed Parquet files, a runtime
exception will be encountered when trying to merge the metadata values across
the files if they are used as input sources for a MapReduce job.
A contrived example of this problem is provided, along with a derived version
of {{AvroReadSupport}} that can correctly handle valid schema
resolution/evolution scenarios.
*Illustration of Problem*
A simple Avro schema exists, which contains a single record type that consists
of a required String member.
{noformat}
{"type":"record","name":"Foo","namespace":"com.tapad.avro","fields":[{"name":"my_field","type":{"type":"string","avro.java.string":"String"}}]}
{noformat}
When stored as Parquet-Avro the resulting schema is:
{noformat}
message com.tapad.avro.Foo {
  required binary my_field (UTF8);
}
{noformat}
Data is written to a Parquet-Avro file with the following contents:
{noformat}
my_field = aaa
my_field = bbb
{noformat}
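For context, the following is a minimal sketch of how such a file might be written with {{AvroParquetWriter}} and Avro's generic API. The class name and file name are illustrative, and the schema string is abbreviated to a plain {{string}} type (the {{avro.java.string}} property is omitted; it does not affect the example):
{code:java}
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import parquet.avro.AvroParquetWriter;

public class WriteFooV1 {
  // Abbreviated form of the original Foo schema.
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Foo\",\"namespace\":\"com.tapad.avro\","
      + "\"fields\":[{\"name\":\"my_field\",\"type\":\"string\"}]}");

  public static void main(String[] args) throws Exception {
    AvroParquetWriter<GenericRecord> writer =
        new AvroParquetWriter<GenericRecord>(new Path("foo-v1.parquet"), SCHEMA);
    try {
      for (String value : new String[] { "aaa", "bbb" }) {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("my_field", value);
        writer.write(record);
      }
    } finally {
      writer.close();
    }
  }
}
{code}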
The schema for the Foo record is then changed so that its String member is made
optional, with a default value of null.
{noformat}
{"type":"record","name":"Foo","namespace":"com.tapad.avro","fields":[{"name":"my_field","type":["null",{"type":"string","avro.java.string":"String"}],"default":null}]}
{noformat}
When stored as Parquet-Avro the resulting schema is now:
{noformat}
message com.tapad.avro.Foo {
  optional binary my_field (UTF8);
}
{noformat}
This change adheres to the Avro Schema Resolution rules described at
http://avro.apache.org/docs/current/spec.html#Schema+Resolution.
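As a sanity check, Avro alone resolves data written with the old schema against the new one. A minimal, self-contained sketch using Avro's generic encoder/decoder API (schema strings abbreviated as above):
{code:java}
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class ResolutionDemo {
  public static void main(String[] args) throws Exception {
    Schema writerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Foo\",\"namespace\":\"com.tapad.avro\","
        + "\"fields\":[{\"name\":\"my_field\",\"type\":\"string\"}]}");
    Schema readerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Foo\",\"namespace\":\"com.tapad.avro\","
        + "\"fields\":[{\"name\":\"my_field\",\"type\":[\"null\",\"string\"],"
        + "\"default\":null}]}");

    // Encode a record with the old (writer) schema.
    GenericRecord old = new GenericData.Record(writerSchema);
    old.put("my_field", "aaa");
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writerSchema).write(old, encoder);
    encoder.flush();

    // Decode it with the new (reader) schema; Avro resolves the change
    // from "string" to ["null", "string"] automatically.
    GenericDatumReader<GenericRecord> reader =
        new GenericDatumReader<GenericRecord>(writerSchema, readerSchema);
    GenericRecord resolved = reader.read(null,
        DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
    System.out.println(resolved); // {"my_field": "aaa"}
  }
}
{code}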
Data is then written to a new Parquet-Avro file.
{noformat}
my_field = ccc
{noformat}
When both Parquet-Avro files are used as input to a MapReduce job, where the
schemas in the data files are treated as the "writer" schemas and the schema
on the job's classpath (in this case, the updated schema) is used as the
"reader" schema, the following {{RuntimeException}} is encountered:
{noformat}
Caused by: java.lang.RuntimeException: could not merge metadata: key avro.schema has conflicting values:
[{"type":"record","name":"Foo","namespace":"com.tapad.avro","fields":[{"name":"my_field","type":{"type":"string","avro.java.string":"String"}}]},
{"type":"record","name":"Foo","namespace":"com.tapad.avro","fields":[{"name":"my_field","type":["null",{"type":"string","avro.java.string":"String"}],"default":null}]}]
	at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67)
	at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84)
	at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:263)
	...
{noformat}
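For reference, the failure can be reproduced with a job configuration along the following lines. This is a sketch assuming Hadoop 2's {{Job}} API and {{AvroParquetInputFormat.setAvroReadSchema}} as available in parquet-mr of this era; file names and the class name are illustrative. The exception is thrown from {{getSplits()}}, before any map task runs, so mapper/reducer setup is elided:
{code:java}
import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import parquet.avro.AvroParquetInputFormat;

public class MergeFooJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "merge-foo");
    job.setJarByClass(MergeFooJob.class);
    job.setInputFormatClass(AvroParquetInputFormat.class);

    // The updated schema from the job's classpath is the "reader" schema.
    Schema readerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Foo\",\"namespace\":\"com.tapad.avro\","
        + "\"fields\":[{\"name\":\"my_field\",\"type\":[\"null\",\"string\"],"
        + "\"default\":null}]}");
    AvroParquetInputFormat.setAvroReadSchema(job, readerSchema);

    // Both files as input: getSplits() attempts to merge their avro.schema
    // metadata values and throws the RuntimeException shown above.
    FileInputFormat.addInputPath(job, new Path("foo-v1.parquet"));
    FileInputFormat.addInputPath(job, new Path("foo-v2.parquet"));
    FileOutputFormat.setOutputPath(job, new Path("out"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
{code}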
*Solution*
Each schema found in the data files (the "writer" schemas) should be checked
for compatibility with the "reader" schema. If all "writer" schemas are
compatible with the "reader" schema, then all records in all data files can be
migrated to the "reader" schema.
The Apache Avro library provides utilities for performing compatibility checks
between schemas, and the linked pull request contains a derived version of
{{AvroReadSupport}} that uses these utilities to successfully process the
records in the aforementioned data files when both are used as input to a
MapReduce job.
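The check itself might look like the following sketch, built on Avro's {{SchemaCompatibility}} utility; the class and method names here are illustrative, not the actual patch (see the pull request for the authoritative change):
{code:java}
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;
import org.apache.avro.SchemaCompatibility.SchemaPairCompatibility;

public class CompatibilityCheck {
  /**
   * Returns true only if every "writer" schema found in the data files
   * can be read with the given "reader" schema.
   */
  public static boolean allCompatible(Schema reader, List<Schema> writers) {
    for (Schema writer : writers) {
      SchemaPairCompatibility result =
          SchemaCompatibility.checkReaderWriterCompatibility(reader, writer);
      if (result.getType() != SchemaCompatibilityType.COMPATIBLE) {
        return false;
      }
    }
    return true;
  }
}
{code}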
-_NOTE: Solution will be provided as a hyperlink to a GitHub Pull Request_-
https://github.com/apache/incubator-parquet-mr/pull/107
> AvroReadSupport does not support Avro schema resolution
> -------------------------------------------------------
>
> Key: PARQUET-171
> URL: https://issues.apache.org/jira/browse/PARQUET-171
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Reporter: Jeffrey Olchovy
>