[
https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831749#action_12831749
]
Ashutosh Chauhan commented on PIG-1188:
---------------------------------------
I have a different take on this. Referring to the original description of this
Jira, I would expect Pig's behavior to be the one given in "Current result", not
the one given in "Desired result". Pig should not silently alter the data behind
the scenes, which is what "Desired result" proposes. When the number of columns
is not consistent, there are two scenarios: with a schema and without one. If
the user did supply a schema, I would take that as the user telling Pig that the
data is consistent with the schema he is providing, and if that's not the case,
it's perfectly fine to throw an exception at runtime. The tricky case is when no
schema is provided and the user tries to access a non-existent field. I think
even in such cases it's valid to throw an exception at runtime instead of
returning null.
First, if the user is trying to access a non-existent field, that's an error
condition in any case. Second, it can't be assumed that the user wants those
non-existent fields to be treated as null. If he wants it that way, he should
write a LoadFunc implementation that treats them that way. Third, performing
further operations on these columns down the pipeline may produce unpredictable
results in other operators. Fourth, returning null will obscure bugs in Pig
where Pig itself (rather than the user) tries to access non-existent fields
while constructing new tuples at runtime, e.g. for joins (see PIG-1131).
In short, I am suggesting that Pig should keep the behavior it has today. That
is, it can load a variable number of columns into a tuple. But if the user
accesses a non-existent field, Pig should throw an exception and let the user
deal with that scenario himself by implementing his own LoadFunc.
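
A minimal sketch of the two policies discussed above, in plain Java. It does not
use Pig's actual Tuple or LoadFunc classes; the class FieldAccessSketch and the
methods parse, strictGet and padToWidth are hypothetical names chosen only for
illustration. The input lines are the ones from the issue description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: plain Java, not Pig's Tuple or LoadFunc API.
final class FieldAccessSketch {

    // Split one input line into fields; the field count is whatever the data has.
    static List<String> parse(String line) {
        return new ArrayList<>(Arrays.asList(line.trim().split("\\s+")));
    }

    // Policy argued for in this comment: accessing a missing field is an error.
    static String strictGet(List<String> tuple, int index) {
        if (index < 0 || index >= tuple.size()) {
            throw new IndexOutOfBoundsException(
                "field " + index + " does not exist in a tuple of size " + tuple.size());
        }
        return tuple.get(index);
    }

    // Opt-in policy a user-written loader could apply: pad with nulls or
    // truncate so that every tuple has exactly `width` fields.
    static List<String> padToWidth(List<String> tuple, int width) {
        List<String> out = new ArrayList<>(width);
        for (int i = 0; i < width; i++) {
            out.add(i < tuple.size() ? tuple.get(i) : null);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : new String[] {"1 2", "1 2 3", "1"}) {
            List<String> tuple = parse(line);
            System.out.println("loaded " + tuple + ", padded " + padToWidth(tuple, 2));
            try {
                System.out.println("  field 1 = " + strictGet(tuple, 1));
            } catch (IndexOutOfBoundsException e) {
                System.out.println("  error: " + e.getMessage());
            }
        }
    }
}
```

In this sketch the padding step is something the caller asks for explicitly,
while the default accessor fails loudly, which mirrors the split between the
default behavior and a user-supplied loader suggested above.
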
Thoughts?
> Padding nulls to the input tuple according to input schema
> ----------------------------------------------------------
>
> Key: PIG-1188
> URL: https://issues.apache.org/jira/browse/PIG-1188
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.6.0
> Reporter: Daniel Dai
> Assignee: Richard Ding
> Fix For: 0.7.0
>
>
> Currently, the number of fields in the input tuple is determined by the data.
> When we have a schema, we should generate input data according to the schema,
> padding with nulls if necessary. Here is one example:
> Pig script:
> {code}
> a = load '1.txt' as (a0, a1);
> dump a;
> {code}
> Input file:
> {code}
> 1 2
> 1 2 3
> 1
> {code}
> Current result:
> {code}
> (1,2)
> (1,2,3)
> (1)
> {code}
> Desired result:
> {code}
> (1,2)
> (1,2)
> (1, null)
> {code}