[ https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831749#action_12831749 ]

Ashutosh Chauhan commented on PIG-1188:
---------------------------------------

I have a different take on this. Referring to the original description of the Jira, I would expect Pig's behavior to be the one given in "Current result", not the one given in "Desired result". Pig should not do anything to the data behind the scenes, which is what "Desired result" proposes.

When columns are not consistent, there are two scenarios: with or without a schema. If the user did supply a schema, then I would consider that the user is telling Pig that the data is consistent with the schema he is providing, and if that's not the case, it's perfectly fine to throw an exception at runtime.

The tricky case is when a schema is not provided and the user tries to access a non-existent field. I think even in such cases it's valid to throw an exception at runtime instead of returning null. First, if the user is trying to access a non-existent field, that's an error condition in any case. Second, it can't be assumed that the user wants those non-existent fields to be treated as null. If he wants it that way, he should implement the LoadFunc interface in a way that treats them so. Third, doing further operations on these columns down the pipeline may produce unpredictable results in other operators. Fourth, returning null will obscure bugs in Pig where Pig (rather than the user) tries to access non-existent fields to construct new tuples at runtime, e.g. to do joins (see PIG-1131).

In short, I am suggesting that Pig should keep the behavior it has today. That is, it can load a variable number of columns into a tuple. But if the user accesses a non-existent field, throw the exception and let the user deal with that scenario himself by implementing his own LoadFunc.

Thoughts?
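
To make the two positions concrete, here is a minimal sketch in plain Python (not Pig code) that models them on the input from the issue description. The function names `load_as_is`, `load_padded` and `get_field_strict` are hypothetical and exist only for illustration, and the space delimiter is an assumption taken from the input file as it appears in the description. It is not how Pig or any LoadFunc is implemented.

```python
from typing import List, Optional, Sequence

Row = List[Optional[str]]


def load_as_is(line: str, delimiter: str = " ") -> Row:
    """Models the current behavior: the data decides the number of fields."""
    return list(line.split(delimiter))


def load_padded(line: str, schema: Sequence[str], delimiter: str = " ") -> Row:
    """Models the proposed behavior: force every tuple to the schema's arity."""
    fields: Row = list(line.split(delimiter))
    fields = fields[: len(schema)]                    # drop extra columns
    fields += [None] * (len(schema) - len(fields))    # pad missing columns
    return fields


def get_field_strict(row: Row, index: int) -> Optional[str]:
    """Models strict access: a missing field is an error, not a null."""
    if not 0 <= index < len(row):
        raise IndexError(
            f"field ${index} does not exist in a tuple of {len(row)} field(s)"
        )
    return row[index]


# Input and schema from the issue description (hypothetical in-memory copy).
lines = ["1 2", "1 2 3", "1"]
schema = ("a0", "a1")

as_is = [load_as_is(line) for line in lines]
padded = [load_padded(line, schema) for line in lines]
```

With `load_as_is`, asking for the second field of the third tuple raises, so the malformed row is visible immediately. With `load_padded`, the same access returns `None` and the third column of the second row is dropped without any signal, which is the silent rewriting this comment argues against.
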
> Padding nulls to the input tuple according to input schema
> ----------------------------------------------------------
>
>                 Key: PIG-1188
>                 URL: https://issues.apache.org/jira/browse/PIG-1188
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>            Reporter: Daniel Dai
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>
>
> Currently, the number of fields in the input tuple is determined by the data.
> When we have schema, we should generate input data according to the schema,
> and padding nulls if necessary. Here is one example:
> Pig script:
> {code}
> a = load '1.txt' as (a0, a1);
> dump a;
> {code}
> Input file:
> {code}
> 1 2
> 1 2 3
> 1
> {code}
> Current result:
> {code}
> (1,2)
> (1,2,3)
> (1)
> {code}
> Desired result:
> {code}
> (1,2)
> (1,2)
> (1, null)
> {code}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.