[jira] [Commented] (PIG-2359) Support more efficient Tuples when schemas are known

Gianmarco De Francisci Morales (Commented) (JIRA) Tue, 15 Nov 2011 07:35:14 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150557#comment-13150557
 ]


Gianmarco De Francisci Morales commented on PIG-2359:
-----------------------------------------------------

bq. Be careful with the assumption that schema is going to be same for all the 
rows in a data. Currently, Pig doesn't make this assumption and is thus able to 
work with tuples of varying schema in data. See, PIG-1131 where a related 
optimization was attempted (and also PIG-1188).

Yes, Ashutosh, you are right.
But when a user specifies a schema, Pig enforces it on the data, so all the 
tuples end up with the same schema anyway.

{code}
grunt> sh cat file.txt
1       2
1       2       3
1
        2

grunt> a = load 'file.txt' AS (x1:int, x2:int);
grunt> dump a
(1,2)
(1,2)
(1,)
(,2)

{code}
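To make the enforcement in the session above concrete, here is a minimal Java sketch that reproduces the same four results. It is not Pig's loader code: the class SchemaConform and the method conform are hypothetical names, and turning an unparsable value into null is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Pig: forces a tab-separated row to a fixed
// number of int fields, the way the AS (x1:int, x2:int) clause does above.
final class SchemaConform {

    static List<Integer> conform(String line, int arity) {
        // A limit of -1 keeps empty fields, so "\t2" splits into ["", "2"].
        String[] raw = line.split("\t", -1);
        List<Integer> out = new ArrayList<>(arity);
        for (int i = 0; i < arity; i++) {
            // Fields beyond the declared arity are dropped; missing ones become null.
            String field = i < raw.length ? raw[i].trim() : "";
            out.add(parseOrNull(field));
        }
        return out;
    }

    // Assumption of this sketch: an empty or unparsable value becomes null.
    private static Integer parseOrNull(String field) {
        if (field.isEmpty()) {
            return null;
        }
        try {
            return Integer.valueOf(field);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String[] lines = {"1\t2", "1\t2\t3", "1", "\t2"};
        for (String line : lines) {
            System.out.println(conform(line, 2));
        }
    }
}
```

Every row comes out with exactly two fields, so a fixed-layout tuple is safe as long as it can also represent a null in any position.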

If there is no schema, we simply don't use this new kind of tuple.
What I do see is a more general problem: how to handle the serialization of 
null fields with these new Tuple implementations.
The schema I proposed needs either to be augmented with a NULL_TYPE, which 
makes us lose track of the original type in the tuple, or to be modified to 
use 1 bit of each type byte as a null flag.
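
A minimal Java sketch of the two options, to show the trade-off. Nothing here is taken from the attached patch: the class TypeByte, the constant values, and the choice of the high bit as the flag are all assumptions made for illustration, and the flag approach only works if type codes fit in the remaining 7 bits.

```java
// Hypothetical encoding for a per-field type byte; not from PIG-2359.1.patch.
final class TypeByte {

    // Illustrative type codes (assumed values, must stay below 0x80).
    static final byte NULL_TYPE = 1;
    static final byte INT = 10;
    static final byte LONG = 15;

    // Assumption: the high bit is free and can be used as the null flag.
    static final int NULL_FLAG = 0x80;
    static final int TYPE_MASK = 0x7F;

    // Option 1: write NULL_TYPE in place of the real type. The declared
    // type is simply not stored, so it cannot be recovered from the byte.
    static byte encodeWithNullType(byte type, boolean isNull) {
        return isNull ? NULL_TYPE : type;
    }

    // Option 2: keep the type in the low 7 bits and set the flag bit for null.
    static byte encodeWithFlag(byte type, boolean isNull) {
        return (byte) (isNull ? (type | NULL_FLAG) : type);
    }

    static boolean isNull(byte encoded) {
        return (encoded & NULL_FLAG) != 0;
    }

    static byte typeOf(byte encoded) {
        return (byte) (encoded & TYPE_MASK);
    }

    public static void main(String[] args) {
        byte lost = encodeWithNullType(LONG, true);
        byte kept = encodeWithFlag(LONG, true);
        System.out.println("NULL_TYPE option stores: " + lost);
        System.out.println("flag option: null=" + isNull(kept) + ", type=" + typeOf(kept));
    }
}
```

With the flag bit, a reader can still tell that a null field was declared as a long, at the cost of halving the number of usable type codes.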
                
> Support more efficient Tuples when schemas are known
> ----------------------------------------------------
>
>                 Key: PIG-2359
>                 URL: https://issues.apache.org/jira/browse/PIG-2359
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>         Attachments: PIG-2359.1.patch
>
>
> Pig Tuples have significant overhead due to the fact that all the fields are 
> Objects.
> When a Tuple only contains primitive fields (ints, longs, etc), it's possible 
> to avoid this overhead, which would result in significant memory savings.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
