[ 
https://issues.apache.org/jira/browse/PIG-5231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-5231:
------------------------------
    Attachment: pig-5231-v01.patch

I understand that PigStorage is not designed to handle multiple schemas, but 
it would be nice if we could still handle the limited case where a user simply 
added more fields, leaving the rest of the schema untouched.

In general, we ask our users to use AvroStorage or HCatLoader when they expect 
schema evolution (merging).

I can think of a few approaches:
(1) Scan all schemas and fail if any are different
or
(2) Scan all schemas and pick the schema with the most fields
or 
(3) Drop any fields that are not part of the chosen schema

Attaching a patch for approach (3).
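
To make the three options concrete, here is a rough sketch in Python. It is 
not the attached patch and not PigStorage code: the function names 
(all_schemas_match, widest_schema, conform) and the list-of-strings 
representation of schemas and rows are assumptions made only for illustration.

```python
from typing import List, Optional, Sequence

Schema = List[str]  # assumed representation: just the field names


def all_schemas_match(schemas: Sequence[Schema]) -> bool:
    """Approach (1): True only if every schema equals the first one."""
    return all(s == schemas[0] for s in schemas)


def widest_schema(schemas: Sequence[Schema]) -> Schema:
    """Approach (2): pick the schema with the most fields."""
    return max(schemas, key=len)


def conform(fields: Sequence[str], width: int) -> List[Optional[str]]:
    """Approach (3): return exactly `width` fields.

    Extra trailing fields are dropped; missing ones are padded with None,
    mirroring the null-filling behavior mentioned for PIG-3100.
    """
    row: List[Optional[str]] = list(fields[:width])
    row.extend([None] * (width - len(row)))
    return row


schemas = [["f1", "f2"], ["f1", "f2", "f3"]]
rows = [["a", "1"], ["b", "2", "3"]]

# With the first schema (f1, f2) chosen, every row is cut to two fields.
conformed = [conform(r, len(schemas[0])) for r in rows]
```

With approach (3), the row from the wider file loses its third field, so every 
tuple has the same width as the schema the script was compiled against.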

> PigStorage with -schema may produce inconsistent outputs with more fields
> -------------------------------------------------------------------------
>
>                 Key: PIG-5231
>                 URL: https://issues.apache.org/jira/browse/PIG-5231
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>            Priority: Minor
>         Attachments: pig-5231-v01.patch
>
>
> When multiple directories are passed to PigStorage(',','-schema'), Pig 
> behaves as follows:
> {quote}
> No attempt to merge conflicting schemas is made during loading. The first 
> schema encountered during a file system scan is used.
> {quote}
> For an input of two directories with schemas
> file1: (f1:chararray, f2:int) and 
> file2: (f1:chararray, f2:int, f3:int),
> Pig will pick the first schema, from file1, and only allow access to f1 and 
> f2. However, the output will still contain 3 fields for tuples from file2. 
> This later leads to completely corrupt output, because the shifted fields 
> result in incorrect references. 
> (This may also happen when the input itself contains the delimiter.)
> If file2's schema is picked instead, this is already handled by filling the 
> missing fields with null (PIG-3100).
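
One way to picture the shifted-field problem is the small Python sketch below. 
The downstream step (append_field) and the sample values are hypothetical; the 
sketch only assumes that a later operation appends a field while believing 
every tuple has exactly the two fields of the chosen schema.

```python
schema = ["f1", "f2"]            # schema picked from file1
file1_rows = [["a", "1"]]        # matches the schema
file2_rows = [["b", "2", "30"]]  # carries an extra, undeclared f3


def append_field(row, value):
    """Hypothetical downstream step: append one field after the known ones."""
    return row + [value]


out = [append_field(r, "new") for r in file1_rows + file2_rows]

# The field expected at position 2 is "new", but for the file2 tuple that
# position still holds the undeclared f3 value, and "new" is shifted right.
column_2 = [r[2] for r in out]
```

Here column_2 mixes the appended value with the leftover f3 value, which is the 
kind of incorrect reference described above.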



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
