[ 
https://issues.apache.org/jira/browse/PIG-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150351#comment-13150351
 ] 

Gianmarco De Francisci Morales commented on PIG-2359:
-----------------------------------------------------

Yes, we already do that.
But that's because a byte is negligible when compared to the size of the rest 
of the tuple.
I think that if we really want a more efficient tuple implementation when 
schemas are known, we need to strip the schema from the data. What's the point 
of repeating the schema in each tuple, apart from ease of implementation?
This modification might be done in a different Jira, while we can keep this one 
for the bytearray implementation.

For the PrimitiveFieldTuple implementation, should we create a different type 
byte for each tuple class?
This way we can save on the size and schema bytes and make it really compact.
Otherwise we could use a byte to indicate PRIMITIVE_FIELD_TUPLE and then a 
second one to indicate the schema (Double, Float, etc.).

For the PrimitiveTuple, we would use PRIMITIVE_TUPLE, then the size as a byte 
(I assume we don't really use schemas with more than 255 primitives?), the 
schema (1 byte per type) and finally the data in the bytearray.
Actually, given that we are changing the serialization format, we don't need 
the schema to be 1 byte per type, but we could multiplex several fields in the 
same byte. We have 8 primitive types in Pig (by the way, should we also 
implement PByteTuple, PBooleanTuple?), so 3 bits will suffice. We can use 4 for 
alignment and expandability. This cuts the schema overhead by 50% (4 bits per 
field instead of 8).
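
To make the multiplexing idea concrete, here is a rough sketch of packing two 
4-bit type codes into each schema byte. The class name NibbleSchema and the 
type-code values are made up for illustration; they are not Pig's DataType 
constants, and this is not what the attached patch does.

```java
// Hypothetical sketch: packs one 4-bit type code per field, two fields per byte.
final class NibbleSchema {
    // Illustrative 4-bit type codes (assumed values, not Pig's DataType bytes).
    static final byte INT = 1, LONG = 2, FLOAT = 3, DOUBLE = 4;

    /** Packs type codes two per byte; even-indexed fields use the high nibble. */
    static byte[] pack(byte[] types) {
        byte[] out = new byte[(types.length + 1) / 2];
        for (int i = 0; i < types.length; i++) {
            if (types[i] < 0 || types[i] > 0x0F) {
                throw new IllegalArgumentException("type code does not fit in 4 bits");
            }
            int shift = (i % 2 == 0) ? 4 : 0;
            out[i / 2] |= (byte) (types[i] << shift);
        }
        return out;
    }

    /** Recovers the per-field type codes; fieldCount comes from the size byte. */
    static byte[] unpack(byte[] packed, int fieldCount) {
        byte[] types = new byte[fieldCount];
        for (int i = 0; i < fieldCount; i++) {
            int shift = (i % 2 == 0) ? 4 : 0;
            // Mask after shifting so sign extension of the byte is discarded.
            types[i] = (byte) ((packed[i / 2] >> shift) & 0x0F);
        }
        return types;
    }
}
```

With this layout a tuple of n primitive fields needs (n + 1) / 2 schema bytes 
(integer division) instead of n, and the size byte tells the reader whether the 
last low nibble is padding.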

Thoughts?
                
> Support more efficient Tuples when schemas are known
> ----------------------------------------------------
>
>                 Key: PIG-2359
>                 URL: https://issues.apache.org/jira/browse/PIG-2359
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>         Attachments: PIG-2359.1.patch
>
>
> Pig Tuples have significant overhead due to the fact that all the fields are 
> Objects.
> When a Tuple only contains primitive fields (ints, longs, etc), it's possible 
> to avoid this overhead, which would result in significant memory savings.
