[ https://issues.apache.org/jira/browse/PIG-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150557#comment-13150557 ]
Gianmarco De Francisci Morales commented on PIG-2359:
-----------------------------------------------------

bq. Be careful with the assumption that schema is going to be same for all the rows in a data. Currently, Pig doesn't make this assumption and is thus able to work with tuples of varying schema in data. See, PIG-1131 where a related optimization was attempted (and also PIG-1188).

Yes Ashutosh, you are right. But when a user specifies a schema, Pig enforces it on the data, so all the tuples have the same schema anyway.

{code}
grunt> sh cat file.txt
1	2
1	2	3
1
	2
grunt> a = load 'file.txt' AS (x1:int, x2:int);
grunt> dump a;
(1,2)
(1,2)
(1,)
(,2)
{code}

If we don't have a schema, then we don't use this new kind of tuple.

Instead, I see a more general problem: how to handle the serialization of null fields with these new Tuple implementations. The schema I proposed needs either to be augmented with a NULL_TYPE, which makes us lose track of the original type in the tuple, or to be modified to use 1 bit of each type byte as a null flag.

> Support more efficient Tuples when schemas are known
> ----------------------------------------------------
>
>                 Key: PIG-2359
>                 URL: https://issues.apache.org/jira/browse/PIG-2359
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>         Attachments: PIG-2359.1.patch
>
>
> Pig Tuples have significant overhead due to the fact that all the fields are Objects.
> When a Tuple only contains primitive fields (ints, longs, etc), it's possible to avoid this overhead, which would result in significant memory savings.
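
To make the two null-handling options from the comment above concrete, here is a minimal sketch in Java. It is not taken from the PIG-2359 patch: the class name, the type codes, and the choice of the high bit as the null flag are all assumptions made for illustration, and they do not correspond to Pig's actual DataType constants.

```java
public class NullEncodingSketch {
    // Hypothetical type codes for illustration only (not Pig's DataType values).
    public static final byte NULL_TYPE = 0;
    public static final byte INT_TYPE = 1;
    public static final byte LONG_TYPE = 2;

    // Assumption: the high bit of the type byte is the one reserved as a null flag.
    public static final int NULL_BIT = 0x80;

    // Option 1: a null field is written as NULL_TYPE, so its declared type is lost.
    public static byte encodeWithNullType(byte type, boolean isNull) {
        return isNull ? NULL_TYPE : type;
    }

    // Option 2: set the flag bit and keep the declared type in the low 7 bits.
    public static byte encodeWithNullBit(byte type, boolean isNull) {
        return (byte) (isNull ? (type | NULL_BIT) : type);
    }

    // Reads the null flag back from a byte produced by encodeWithNullBit.
    public static boolean isNull(byte encoded) {
        return (encoded & NULL_BIT) != 0;
    }

    // Recovers the declared type from a byte produced by encodeWithNullBit.
    public static byte typeOf(byte encoded) {
        return (byte) (encoded & 0x7F);
    }

    public static void main(String[] args) {
        byte lossy = encodeWithNullType(INT_TYPE, true);
        byte flagged = encodeWithNullBit(INT_TYPE, true);
        System.out.println("NULL_TYPE encoding keeps type? " + (lossy == INT_TYPE));
        System.out.println("null-bit encoding: null=" + isNull(flagged)
                + ", type=" + typeOf(flagged));
    }
}
```

With the flag-bit variant a reader can still tell that a null field was declared as an int, at the cost of halving the number of type codes that fit in one byte.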