[jira] [Updated] (PIG-2359) Support more efficient Tuples when schemas are known

Dmitriy V. Ryaboy (Updated) (JIRA) Sun, 13 Nov 2011 15:14:16 -0800

     [ https://issues.apache.org/jira/browse/PIG-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Dmitriy V. Ryaboy updated PIG-2359:
-----------------------------------

    Attachment: PIG-2359.1.patch

The attached patch is a first cut at adding this support.

Note that it changes the TupleFactory interface by adding a couple of new 
methods for creating optimized tuples.

Two flavors of optimized tuples are provided:

1) For single-field tuples, we provide a PrimitiveFieldTuple, which simply 
wraps a primitive value (or a string).

2) For multi-field tuples, we provide an implementation that uses a single 
ByteBuffer to hold the data in memory, and deserializes the appropriate field 
on read. This incurs a bit of a read-time penalty, but I believe it's a good 
trade-off, since (a) most of the time we only read once, and the allocation 
costs are much lower than for regular tuples, and (b) the memory overhead is 
several times lower than for regular tuples, so we'll save on GC.
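
To make the second flavor concrete, here is a minimal sketch of the idea. It is 
not the code from the attached patch: the class name PackedTuple, its Type enum, 
and the set/get methods are hypothetical names chosen for illustration, and it 
assumes a schema of fixed-width, non-null numeric fields only (no strings, no 
nulls). All fields live in one ByteBuffer, and a boxed Object is only created 
when a field is read.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; not the classes from PIG-2359.1.patch.
final class PackedTuple {
    // Illustrative subset of field types, each with a fixed byte width.
    enum Type {
        INT(Integer.BYTES), LONG(Long.BYTES), DOUBLE(Double.BYTES);
        final int width;
        Type(int width) { this.width = width; }
    }

    private final Type[] schema;
    private final int[] offsets;   // byte offset of each field in the buffer
    private final ByteBuffer data; // a single allocation holds every field

    PackedTuple(Type... schema) {
        this.schema = schema.clone();
        this.offsets = new int[schema.length];
        int size = 0;
        for (int i = 0; i < schema.length; i++) {
            offsets[i] = size;
            size += schema[i].width;
        }
        this.data = ByteBuffer.allocate(size);
    }

    // Writes the value in its primitive form at the field's offset.
    void set(int i, Number value) {
        switch (schema[i]) {
            case INT:    data.putInt(offsets[i], value.intValue()); break;
            case LONG:   data.putLong(offsets[i], value.longValue()); break;
            case DOUBLE: data.putDouble(offsets[i], value.doubleValue()); break;
        }
    }

    // Deserializes (and boxes) only the requested field, on read.
    Object get(int i) {
        switch (schema[i]) {
            case INT:    return data.getInt(offsets[i]);
            case LONG:   return data.getLong(offsets[i]);
            case DOUBLE: return data.getDouble(offsets[i]);
            default:     throw new IllegalStateException("unknown type");
        }
    }

    int size() { return schema.length; }
}
```

With this layout, a tuple declared as (INT, LONG, DOUBLE) occupies 20 bytes of 
payload in one buffer rather than three separately allocated boxed objects, at 
the cost of re-boxing on each get() call.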

Microbenchmark results can be found in the javadoc for PrimitiveTuple.

Note that so far I haven't changed any behavior in existing Pig code, other 
than changing one interface. The next step would be to start using these Tuples 
when possible.

One complication is that since we don't push much metadata around with tuples, 
we can only deserialize them into standard tuples; so all savings are lost once 
we hit an MR boundary. Changing this would require a pretty significant 
refactor; I'd love to hear ideas from folks who worked on BinInterSedes on how 
to do this.

So far, I've played with using these in some UDFs that generate large bags of 
tuples, and the difference in both speed and memory use is fairly dramatic.
                
> Support more efficient Tuples when schemas are known
> ----------------------------------------------------
>
>                 Key: PIG-2359
>                 URL: https://issues.apache.org/jira/browse/PIG-2359
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>         Attachments: PIG-2359.1.patch
>
>
> Pig Tuples have significant overhead due to the fact that all the fields are 
> Objects.
> When a Tuple only contains primitive fields (ints, longs, etc), it's possible 
> to avoid this overhead, which would result in significant memory savings.
