[
https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835944#action_12835944
]
Richard Ding commented on PIG-1188:
-----------------------------------
To summarize where we are:
Right now, Pig's project operator pads with null if the value to be projected
doesn't exist. As a consequence, the desired result is achieved if PigStorage is
used and a schema with data types is specified, since in this case Pig inserts a
project+cast operator for each field in the schema.
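For illustration, here is a minimal sketch of the typed-schema case, reusing the
'1.txt' input from the issue description. The int types are an assumption chosen
for the example. According to the behavior described above, the project+cast
operators inserted for a0 and a1 should pad the short row with null.
{code}
-- typed schema: Pig inserts a project+cast per field, per the summary above
a = load '1.txt' using PigStorage() as (a0:int, a1:int);
dump a;
{code}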
In the case where no schema is specified in the load statement, Pig does a
good job of adhering to its philosophy and lets the program run without
throwing a runtime exception.
That leaves the case where a schema is specified without data types. There are
several options:
* Pig automatically inserts a project operator for each field in the schema
to ensure the input data matches the schema. The trade-off is a performance
penalty. Is it worthwhile if most user data is well-behaved?
* Users can explicitly add a foreach statement after the load statement
that projects all the fields in the schema. This is similar to the common
practice of running a map job first to clean up the data.
* Pig can also delegate the padding work to the loaders. The problem is that
the schema is not currently passed to the loaders.
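As a sketch of the second option, the script from the issue description could be
extended with an explicit projection. The alias b is hypothetical, and this
assumes the foreach is kept in the plan as written, so that the project operator
gets a chance to pad missing fields with null and drop extra ones.
{code}
a = load '1.txt' as (a0, a1);
-- explicit projection of every schema field, as in the second option above
b = foreach a generate a0, a1;
dump b;
{code}
This leaves the cost of the extra projection to scripts that opt in, rather than
imposing it on every load.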
> Padding nulls to the input tuple according to input schema
> ----------------------------------------------------------
>
> Key: PIG-1188
> URL: https://issues.apache.org/jira/browse/PIG-1188
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.6.0
> Reporter: Daniel Dai
> Assignee: Richard Ding
> Fix For: 0.7.0
>
>
> Currently, the number of fields in the input tuple is determined by the data.
> When we have schema, we should generate input data according to the schema,
> and padding nulls if necessary. Here is one example:
> Pig script:
> {code}
> a = load '1.txt' as (a0, a1);
> dump a;
> {code}
> Input file:
> {code}
> 1 2
> 1 2 3
> 1
> {code}
> Current result:
> {code}
> (1,2)
> (1,2,3)
> (1)
> {code}
> Desired result:
> {code}
> (1,2)
> (1,2)
> (1, null)
> {code}