[ 
https://issues.apache.org/jira/browse/PIG-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-2579:
-------------------------------

    Attachment: PIG-2579.patch
                PIG-2579-avro_test_files.tar.gz

I updated Stan's original patch, rebasing it onto trunk. While I kept the 
core logic unchanged, I made the following modifications:
# Removed glob pattern related code as it's resolved in PIG-2492.
# Added an option 'multiple_schema' to AvroStorage. By default, AvroStorage 
assumes that all input files have the same schema, but if 'multiple_schema' 
is passed to the load function, it tries to merge all the input schemas.
# Allowed multiple schemas with the same name. I use paths to identify schemas 
instead of their names.
# Refactored code.
# Added unit tests.

I think the most debatable part is how to merge two different schemas into 
one. In short, the rules are as follows:
# Different primitive types can be merged if certain conditions are met. Please 
see AvroStorageUtils.mergeType() for more details.
# Only the same kind of complex types can be merged. e.g. record + record => 
ok, but record + array => error.
# For records, the union of fields is returned.
# For arrays/maps, their element types/value types are merged.
# For unions, the union of unions is returned.
# For fixed types, only those of the same size can be merged.

It's easy to see in a unit test (TestAvroStorageUtils) what's expected when two 
schemas are merged.
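
As a rough illustration of the merge rules listed above, here is a minimal Python sketch that works on Avro schemas in their parsed-JSON form. It is not the code from PIG-2579.patch. The names merge_schema, kind, COMPLEX_KINDS and NUMERIC_ORDER are hypothetical, the numeric widening order is only an assumption standing in for the conditions in AvroStorageUtils.mergeType(), and merging the types of same-named record fields recursively is likewise an assumption.

```python
# Illustrative sketch only; NOT the implementation in PIG-2579.patch.
# Schemas are Avro schemas in parsed-JSON form: str, dict, or list (union).

# Assumed widening order for numeric primitives. The real conditions are in
# AvroStorageUtils.mergeType(); this table is a hypothetical stand-in.
NUMERIC_ORDER = ["int", "long", "float", "double"]

COMPLEX_KINDS = {"record", "array", "map", "union", "fixed"}


def kind(schema):
    """Return the Avro type name of a parsed-JSON schema."""
    if isinstance(schema, list):
        return "union"
    if isinstance(schema, dict):
        return schema["type"]
    return schema


def merge_schema(a, b):
    """Merge two schemas following the rules summarized in the message."""
    ka, kb = kind(a), kind(b)

    if ka in COMPLEX_KINDS or kb in COMPLEX_KINDS:
        # Only the same kind of complex types can be merged.
        if ka != kb:
            raise ValueError(f"cannot merge {ka} with {kb}")
        if ka == "record":
            # Union of fields; same-named fields are merged recursively
            # (an assumption of this sketch).
            fields = {f["name"]: f["type"] for f in a["fields"]}
            for f in b["fields"]:
                name = f["name"]
                if name in fields:
                    fields[name] = merge_schema(fields[name], f["type"])
                else:
                    fields[name] = f["type"]
            return {
                "type": "record",
                "name": a["name"],
                "fields": [{"name": n, "type": t} for n, t in fields.items()],
            }
        if ka == "array":
            return {"type": "array",
                    "items": merge_schema(a["items"], b["items"])}
        if ka == "map":
            return {"type": "map",
                    "values": merge_schema(a["values"], b["values"])}
        if ka == "union":
            # Union of unions: keep a's branches, append b's new ones.
            return a + [branch for branch in b if branch not in a]
        # fixed: only equal sizes can be merged.
        if a["size"] != b["size"]:
            raise ValueError("cannot merge fixed types of different sizes")
        return a

    # Primitives (and anything else not covered by the rules above).
    if ka == kb:
        return a
    if ka in NUMERIC_ORDER and kb in NUMERIC_ORDER:
        return NUMERIC_ORDER[max(NUMERIC_ORDER.index(ka),
                                 NUMERIC_ORDER.index(kb))]
    raise ValueError(f"cannot merge {ka} with {kb}")


# Usage: two flat records with overlapping fields.
r1 = {"type": "record", "name": "r",
      "fields": [{"name": "id", "type": "int"},
                 {"name": "name", "type": "string"}]}
r2 = {"type": "record", "name": "r",
      "fields": [{"name": "id", "type": "long"},
                 {"name": "age", "type": "int"}]}
merged = merge_schema(r1, r2)
```

Under these assumptions, merged contains the fields id, name and age, in that order, with id widened to long.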

Please let me know if you have any questions/concerns.

Thanks!
                
> Support for multiple input schemas in AvroStorage
> -------------------------------------------------
>
>                 Key: PIG-2579
>                 URL: https://issues.apache.org/jira/browse/PIG-2579
>             Project: Pig
>          Issue Type: New Feature
>          Components: piggybank
>    Affects Versions: 0.9.2, 0.11
>            Reporter: Stan Rosenberg
>            Assignee: Cheolsoo Park
>            Priority: Minor
>         Attachments: avro_storage_union_schema.patch, 
> avro_storage_union_schema_test.tar.gz, PIG-2579-avro_test_files.tar.gz, 
> PIG-2579.patch
>
>
> This is a barebones patch for AvroStorage which enables support of multiple 
> input schemas.  The assumption is that the input consists of avro files 
> having different schemas that can be unioned, e.g., flat records.  
> A simple illustrative example is attached 
> (avro_storage_union_schema_test.tar.gz): run create_avro1.pig, followed by 
> create_avro2.pig, followed by read_avro.pig.

