[
https://issues.apache.org/jira/browse/PIG-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150557#comment-13150557
]
Gianmarco De Francisci Morales commented on PIG-2359:
-----------------------------------------------------
bq. Be careful with the assumption that schema is going to be same for all the
rows in a data. Currently, Pig doesn't make this assumption and is thus able to
work with tuples of varying schema in data. See, PIG-1131 where a related
optimization was attempted (and also PIG-1188).
Yes, Ashutosh, you are right.
But when a user specifies a schema, Pig enforces it on the data, so all the
tuples have the same schema anyway.
{code}
grunt> sh cat file.txt
1 2
1 2 3
1
2
grunt> a = load 'file.txt' AS (x1:int, x2:int);
grunt> dump a
(1,2)
(1,2)
(1,)
(,2)
{code}
If we don't have a schema, then we don't use this new kind of tuple.
Instead, I see a more general problem: how to handle the serialization of
null fields with these new Tuple implementations.
The schema I proposed needs to be augmented, either with a NULL_TYPE, which
makes us lose track of the original type in the tuple, or by reserving 1 bit
of each type byte to mark null fields.
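
As a rough illustration of the second option, here is a minimal Java sketch
of a type byte whose high bit marks a null field while the low 7 bits keep
the original type. The class name, the type codes, and the choice of the high
bit are all assumptions made for illustration; none of this is taken from the
attached patch or from Pig's actual DataType constants.

```java
// Sketch only: hypothetical names and type codes, not Pig's real API.
class TypeByteNullFlag {
    // Hypothetical type codes that fit in the low 7 bits of a byte.
    static final byte INT = 10;
    static final byte LONG = 15;

    // Assumption: the high bit of the type byte is reserved as the null flag.
    static final int NULL_BIT = 0x80;

    // Mark a field as null without losing its declared type.
    static byte withNull(byte type) {
        return (byte) (type | NULL_BIT);
    }

    // True if the null flag is set on this type byte.
    static boolean isNull(byte typeByte) {
        return (typeByte & NULL_BIT) != 0;
    }

    // Recover the declared type by masking off the null flag.
    static byte typeOf(byte typeByte) {
        return (byte) (typeByte & 0x7F);
    }

    public static void main(String[] args) {
        // Type bytes for the row (1,) under the schema (x1:int, x2:int):
        // the second field is null but is still known to be an int.
        byte[] row = { INT, withNull(INT) };
        for (int i = 0; i < row.length; i++) {
            System.out.println("field " + i
                    + " type=" + typeOf(row[i])
                    + " null=" + isNull(row[i]));
        }
    }
}
```

With a separate NULL_TYPE code, by contrast, the second type byte would simply
be replaced by that code, and the fact that the field was declared as an int
would no longer be recoverable from the serialized form.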
> Support more efficient Tuples when schemas are known
> ----------------------------------------------------
>
> Key: PIG-2359
> URL: https://issues.apache.org/jira/browse/PIG-2359
> Project: Pig
> Issue Type: New Feature
> Reporter: Dmitriy V. Ryaboy
> Assignee: Dmitriy V. Ryaboy
> Attachments: PIG-2359.1.patch
>
>
> Pig Tuples have significant overhead due to the fact that all the fields are
> Objects.
> When a Tuple only contains primitive fields (ints, longs, etc), it's possible
> to avoid this overhead, which would result in significant memory savings.