[jira] Commented: (PIG-1188) Padding nulls to the input tuple according to input schema

Alan Gates (JIRA) Tue, 09 Feb 2010 16:27:53 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831776#action_12831776
 ]


Alan Gates commented on PIG-1188:
---------------------------------

A few thoughts:

In a job that is going to process a billion rows and run for 3 hours, a single 
bad row should not cause the whole job to fail.

This invalid access should certainly cause a warning.  Users can look at the 
warnings at the end of the query and decide they do not want to keep the output 
because of the warnings.  But failure should not be the default case (see 
previous point).  Perhaps we should have a warnings = error option like 
compilers do so users who are very worried about the warnings can make sure 
they fail.  But that's a different proposal for a different JIRA.

bq. Third, doing further operations on these columns down the pipeline may 
result in non-predictable results in other operators.

I don't follow.  Nulls in the pipeline shouldn't cause a problem.  UDFs and 
operators need to be able to handle null values whether they come from 
processing or from the data itself.

bq. Second, it can't be assumed that user wants those non-existent field to be 
treated as null. If he wants it that way, he should implement LoadFunc 
interface which treats them that way.

One could argue that it can't be assumed the user wants his query to fail when 
a field is missing.  We have to assume one way or another.  Null is a better 
assumption than failure, since it is possible for a user who doesn't want that 
behavior to detect it and deal with it.  As it is now, the user has to modify 
his data or write a new load function to deal with padding his data.

I agree with you that in the schema case, it would be ideal if not having a 
field was an error.  However, given the architecture this is difficult.  And 
requiring load functions to test every record to ensure it matches the 
schema is too much of a performance penalty.  But for the non-schema case I 
don't agree.  Pig's philosophy of "Pigs eat anything" doesn't mean much if Pig 
gags as soon as it gets a record that doesn't match its expectation.




> Padding nulls to the input tuple according to input schema
> ----------------------------------------------------------
>
>                 Key: PIG-1188
>                 URL: https://issues.apache.org/jira/browse/PIG-1188
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>            Reporter: Daniel Dai
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>
>
> Currently, the number of fields in the input tuple is determined by the data. 
> When we have schema, we should generate input data according to the schema, 
> and pad with nulls if necessary. Here is one example:
> Pig script:
> {code}
> a = load '1.txt' as (a0, a1);
> dump a;
> {code}
> Input file:
> {code}
> 1       2
> 1       2       3
> 1
> {code}
> Current result:
> {code}
> (1,2)
> (1,2,3)
> (1)
> {code}
> Desired result:
> {code}
> (1,2)
> (1,2)
> (1, null)
> {code}
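
For illustration only, the desired result above can be sketched outside of Pig. 
The Python below is a minimal sketch, not Pig's loader code: the names 
conform and load are hypothetical, the input is assumed to be tab-delimited, 
and counting mismatched rows as warnings is an assumption that follows the 
warn-rather-than-fail position in the comment. Rows longer than the schema 
are truncated and shorter rows are padded with None (standing in for null).

```python
from typing import Iterable, List, Optional, Tuple

Row = Tuple[Optional[str], ...]


def conform(fields: List[str], width: int) -> Tuple[Row, bool]:
    """Fit one record to a schema of `width` fields.

    Extra fields are dropped and missing fields become None.
    The second return value is True when the record did not match the schema.
    """
    mismatched = len(fields) != width
    # [None] * n yields an empty list when n is zero or negative,
    # so over-long records are only truncated, never padded.
    padded = list(fields[:width]) + [None] * (width - len(fields))
    return tuple(padded), mismatched


def load(lines: Iterable[str], width: int, delimiter: str = "\t") -> Tuple[List[Row], int]:
    """Hypothetical loader: returns the conformed rows and a warning count."""
    rows: List[Row] = []
    warnings = 0
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        row, mismatched = conform(fields, width)
        warnings += int(mismatched)  # warn, do not fail, on a bad record
        rows.append(row)
    return rows, warnings


if __name__ == "__main__":
    # Same three records as the input file in the description, schema (a0, a1).
    rows, warnings = load(["1\t2", "1\t2\t3", "1"], width=2)
    for row in rows:
        print(row)
    print("warnings:", warnings)
```

The job keeps running on a short or long record and the caller can inspect the 
warning count at the end to decide whether to keep the output.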

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
