[ https://issues.apache.org/jira/browse/PIG-2537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669256#comment-13669256 ]

Peter Connolly commented on PIG-2537:
-------------------------------------

As a workaround, I'm able to move the FLATTEN operator to the rightmost column 
and then run a second generate on all of the fields to fix this problem.  I'm 
only dealing with two columns in the tuple, so I'm not sure it will work with 
more columns.

Using the example above, it might look something like this:
grunt> A = load 'file' as ( a : tuple( x, y, z ), b, c );
-- B will have a variable number of null columns on the right side,
-- but columns b and c will be correct
grunt> B = foreach A generate b, c, flatten( $0 ) AS (x,y,z);
-- Running another foreach inserts null values for the extra columns
grunt> C = foreach B generate b,c,x,y,z;
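
To make the position shift easier to see, here is a rough Python sketch of the workaround. It is only a toy model, not Pig's implementation. The helper names (flatten_reported, project, flatten_first, flatten_last) are made up for illustration. It assumes, as described in this issue, that flattening a null tuple emits a single null field, and that the second foreach reads any missing trailing position as null.

```python
def flatten_reported(tup):
    # Behaviour reported in this issue: a null tuple becomes ONE null field,
    # regardless of how many fields the schema declares.
    return [None] if tup is None else list(tup)

def project(row, n):
    # Assumed model of the second foreach: positions past the end read as null.
    return [row[i] if i < len(row) else None for i in range(n)]

def flatten_first(a, b, c):
    # Models: foreach A generate flatten($0), b, c
    return flatten_reported(a) + [b, c]

def flatten_last(a, b, c):
    # Models: foreach A generate b, c, flatten($0)
    return [b, c] + flatten_reported(a)

# Flatten in the leftmost position: b and c shift into the x and y slots.
print(flatten_first(None, "b1", "c1"))             # [None, 'b1', 'c1']

# Flatten in the rightmost position, then re-project all five fields:
# b and c stay put and the short row is padded on the right.
print(project(flatten_last(None, "b1", "c1"), 5))  # ['b1', 'c1', None, None, None]
```

In this model the short row is only safe to pad when the flattened fields come last, which is why the workaround moves FLATTEN to the rightmost column.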


                
> Output from flatten with a null tuple input generating data inconsistent with 
> the schema
> ----------------------------------------------------------------------------------------
>
>                 Key: PIG-2537
>                 URL: https://issues.apache.org/jira/browse/PIG-2537
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.8.0, 0.9.0
>            Reporter: Xuefu Zhang
>            Assignee: Daniel Dai
>             Fix For: 0.12
>
>         Attachments: PIG-2537-1.patch, PIG-2537-2.patch, PIG-2537-3.patch
>
>
> For the following pig script,
> grunt> A = load 'file' as ( a : tuple( x, y, z ), b, c );
> grunt> B = foreach A generate flatten( $0 ), b, c;
> grunt> describe B;
> B: {a::x: bytearray,a::y: bytearray,a::z: bytearray,b: bytearray,c: bytearray}
> Alias B has a clear schema.
> However, on the backend, if $0 happens to be null for a row, then the output 
> tuple becomes something like 
> (null, b_value, c_value), which is obviously inconsistent with the schema. 
> The behaviour is confirmed by pig code inspection. 
> This inconsistency corrupts data because of position shifts. Expected output 
> row should be something like
> (null, null, null, b_value, c_value).

