[ https://issues.apache.org/jira/browse/PIG-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cheolsoo Park updated PIG-2579:
-------------------------------
Attachment: PIG-2579.patch
PIG-2579-avro_test_files.tar.gz
I updated Stan's original patch, rebasing it onto trunk. The core logic is
unchanged, but I made the following modifications:
# Removed the glob-pattern-related code, since that is resolved in PIG-2492.
# Added an option 'multiple_schema' to AvroStorage. By default, AvroStorage
assumes that all the input files have the same schema, but if 'multiple_schema'
is passed to the load function, it tries to merge all the input schemas.
# Allowed multiple schemas with the same name. I use paths instead of names to
identify schemas.
# Refactored code.
# Added unit tests.
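To illustrate why paths rather than names are used as schema identifiers, here is a rough Python sketch. It is not the Java code in the patch: the helper collect_schemas, its multiple_schema keyword, and the file paths are all made up for this example.

```python
def collect_schemas(files, multiple_schema=False):
    """Hypothetical helper: files is a list of (path, schema) pairs.

    Returns the schemas that would be fed to the merge step, keyed by path.
    """
    if not files:
        raise ValueError("no input files")
    if not multiple_schema:
        # Default behaviour described above: assume every input file
        # shares the schema of the first one.
        path, schema = files[0]
        return {path: schema}
    # With multiple_schema, every input schema is kept. Keying by path
    # means two different schemas that share a name do not collide.
    return {path: schema for path, schema in files}


# Two files whose record schemas have the same name but different fields.
files = [
    ("/data/part1.avro",
     {"type": "record", "name": "rec",
      "fields": [{"name": "id", "type": "int"}]}),
    ("/data/part2.avro",
     {"type": "record", "name": "rec",
      "fields": [{"name": "label", "type": "string"}]}),
]

by_path = collect_schemas(files, multiple_schema=True)
# Keying by name instead would silently drop one of the two schemas.
by_name = {schema["name"]: schema for _, schema in files}
```

Here by_path keeps both entries, while by_name keeps only the last schema named "rec".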
I think the most debatable part is how to merge two different schemas into
one. In short, the rules are as follows:
# Different primitive types can be merged if certain conditions are met. Please
see AvroStorageUtils.mergeType() for more details.
# Only the same kind of complex types can be merged. e.g. record + record =>
ok, but record + array => error.
# For records, the union of fields is returned.
# For arrays/maps, their element types/value types are merged.
# For unions, the union of unions is returned.
# For fixed types, only those of the same size can be merged.
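The rules above can be sketched roughly as follows. This is a Python illustration over schemas written in Avro's JSON form, not the Java code in the patch. Two parts are assumptions made only for this sketch: the primitive rule here is plain numeric widening (int < long < float < double), whereas the real conditions are in AvroStorageUtils.mergeType(); and fields that appear in both records are merged recursively.

```python
NUMERIC_ORDER = ["int", "long", "float", "double"]  # assumed widening order
PRIMITIVES = {"null", "boolean", "int", "long",
              "float", "double", "bytes", "string"}


def kind(schema):
    """Return the Avro type name of a schema given in JSON form."""
    if isinstance(schema, str):
        return schema
    if isinstance(schema, list):
        return "union"
    return schema["type"]


def merge_primitive(a, b):
    # Assumption: identical types merge trivially, numeric types widen,
    # and everything else is rejected.
    if a == b:
        return a
    if a in NUMERIC_ORDER and b in NUMERIC_ORDER:
        return max(a, b, key=NUMERIC_ORDER.index)
    raise ValueError(f"cannot merge {a} with {b}")


def merge(a, b):
    ka, kb = kind(a), kind(b)
    if ka in PRIMITIVES and kb in PRIMITIVES:
        return merge_primitive(ka, kb)
    if ka != kb:
        # Only the same kind of complex type can be merged.
        raise ValueError(f"cannot merge {ka} with {kb}")
    if ka == "record":
        # Union of fields; shared fields are merged recursively (assumption).
        fields = {f["name"]: f["type"] for f in a["fields"]}
        for f in b["fields"]:
            name = f["name"]
            if name in fields:
                fields[name] = merge(fields[name], f["type"])
            else:
                fields[name] = f["type"]
        return {"type": "record", "name": a["name"],
                "fields": [{"name": n, "type": t} for n, t in fields.items()]}
    if ka == "array":
        return {"type": "array", "items": merge(a["items"], b["items"])}
    if ka == "map":
        return {"type": "map", "values": merge(a["values"], b["values"])}
    if ka == "union":
        # Union of unions: keep each distinct branch once.
        merged = list(a)
        for branch in b:
            if branch not in merged:
                merged.append(branch)
        return merged
    if ka == "fixed":
        if a["size"] != b["size"]:
            raise ValueError("fixed types must have the same size")
        return a
    raise ValueError(f"unsupported type {ka}")
```

For example, merging a record with fields (id: int, a: string) and a record with fields (id: long, b: double) yields, under these assumptions, a record with fields id: long, a: string, b: double, while merging a record with an array raises an error.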
The unit test (TestAvroStorageUtils) shows what is expected when two schemas
are merged.
Please let me know if you have any questions/concerns.
Thanks!
> Support for multiple input schemas in AvroStorage
> -------------------------------------------------
>
> Key: PIG-2579
> URL: https://issues.apache.org/jira/browse/PIG-2579
> Project: Pig
> Issue Type: New Feature
> Components: piggybank
> Affects Versions: 0.9.2, 0.11
> Reporter: Stan Rosenberg
> Assignee: Cheolsoo Park
> Priority: Minor
> Attachments: avro_storage_union_schema.patch,
> avro_storage_union_schema_test.tar.gz, PIG-2579-avro_test_files.tar.gz,
> PIG-2579.patch
>
>
> This is a barebones patch for AvroStorage which enables support of multiple
> input schemas. The assumption is that the input consists of avro files
> having different schemas that can be unioned, e.g., flat records.
> A simple illustrative example is attached
> (avro_storage_union_schema_test.tar.gz): run create_avro1.pig, followed by
> create_avro2.pig, followed by read_avro.pig.