[ 
https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831749#action_12831749
 ] 

Ashutosh Chauhan commented on PIG-1188:
---------------------------------------

I have a different take on this. Referring to the original description of this 
Jira, I would expect Pig's behavior to be the one given in "Current result" and 
not the one given in "Desired result". Pig should not do anything to the data 
behind the scenes, which is what "Desired result" proposes. When columns are 
not consistent, there are two scenarios: with or without a schema. If the user 
did supply a schema, then I would consider that the user is telling Pig that 
the data is consistent with the schema he is providing, and if that's not the 
case, it's perfectly fine to throw an exception at runtime. The tricky case is 
when a schema is not provided and the user tries to access a non-existent 
field. I think even in such cases it's valid to throw an exception at runtime 
instead of returning null. 
First, if the user is trying to access a non-existent field, that's an error 
condition in any case. Second, it can't be assumed that the user wants those 
non-existent fields to be treated as null. If he wants it that way, he should 
implement the LoadFunc interface so that it treats them as null. Third, doing 
further operations on these columns down the pipeline may produce 
unpredictable results in other operators. Fourth, returning null will obscure 
bugs in Pig where Pig itself (rather than the user) tries to access 
non-existent fields to construct new tuples at runtime, e.g. to do joins (see 
PIG-1131). 

In short, I am suggesting that Pig should keep the behavior it has today. That 
is, it can load a variable number of columns in a tuple. But if the user 
accesses a non-existent field, Pig should throw an exception and let the user 
deal with that scenario himself by implementing his own LoadFunc. 
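
To make the two behaviors concrete, here is a minimal Java sketch. It does not 
use Pig's LoadFunc API or any Pig class; the class name FieldAccessSketch and 
the methods parseLine, getStrict and conformToSchema are hypothetical and exist 
only for illustration. It assumes tab-delimited input, as in the example from 
the description. getStrict shows the behavior I am arguing for, and 
conformToSchema shows the kind of padding a user could opt into inside his own 
loader. 

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names are part of Pig's API.
final class FieldAccessSketch {

    // Split one tab-delimited line into fields.
    // The tuple size follows the data, not a schema.
    static List<String> parseLine(String line) {
        return new ArrayList<>(Arrays.asList(line.split("\t", -1)));
    }

    // Strict policy: accessing a field that does not exist is an error.
    static String getStrict(List<String> tuple, int index) {
        if (index < 0 || index >= tuple.size()) {
            throw new IndexOutOfBoundsException(
                "Field " + index + " does not exist in tuple of size " + tuple.size());
        }
        return tuple.get(index);
    }

    // Opt-in policy for a user-written loader: pad short tuples with nulls
    // and drop extra fields so every tuple has exactly schemaWidth fields.
    static List<String> conformToSchema(List<String> tuple, int schemaWidth) {
        List<String> out = new ArrayList<>(schemaWidth);
        for (int i = 0; i < schemaWidth; i++) {
            out.add(i < tuple.size() ? tuple.get(i) : null);
        }
        return out;
    }
}
```

With the third input line from the description ("1"), getStrict(tuple, 1) 
throws, while conformToSchema(tuple, 2) returns a two-field tuple whose second 
field is null. The point of keeping them separate is that padding becomes an 
explicit choice by the user rather than something done silently for everyone. 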

Thoughts?

> Padding nulls to the input tuple according to input schema
> ----------------------------------------------------------
>
>                 Key: PIG-1188
>                 URL: https://issues.apache.org/jira/browse/PIG-1188
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>            Reporter: Daniel Dai
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>
>
> Currently, the number of fields in the input tuple is determined by the data. 
> When we have a schema, we should generate input data according to the schema, 
> padding with nulls if necessary. Here is one example:
> Pig script:
> {code}
> a = load '1.txt' as (a0, a1);
> dump a;
> {code}
> Input file:
> {code}
> 1       2
> 1       2       3
> 1
> {code}
> Current result:
> {code}
> (1,2)
> (1,2,3)
> (1)
> {code}
> Desired result:
> {code}
> (1,2)
> (1,2)
> (1, null)
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
