[
https://issues.apache.org/jira/browse/PIG-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitriy V. Ryaboy updated PIG-2359:
-----------------------------------
Attachment: PIG-2359.1.patch
The attached patch is a first cut at adding this support.
Note that it changes the TupleFactory interface by adding a couple of new
methods for creating optimized tuples.
Two flavors of optimized tuples are provided:
1) For single-field tuples, we provide a PrimitiveFieldTuple, which simply wraps
a primitive value (or a string).
2) For multi-field tuples, we provide an implementation that uses a single
ByteBuffer to hold the data in memory, and deserializes the appropriate field
on read. This incurs a bit of a read-time penalty, but I believe it's a good
trade-off, since (a) most of the time we only read once, and the allocation
costs are much lower than for regular tuples, and (b) the memory overhead is
several times lower than for regular tuples, so we'll save on GC.
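To make the two flavors concrete, here is a rough sketch of the idea. It is not
code from the patch: the class names IntFieldTuple and PackedTuple, the type
codes, and the constructor shapes are all made up for illustration, and it
ignores nulls, strings, and the real Tuple interface.

```java
import java.nio.ByteBuffer;

/** Hypothetical single-field flavor: wraps one primitive, no Object[] or List. */
final class IntFieldTuple {
    private final int value;

    IntFieldTuple(int value) { this.value = value; }

    int size() { return 1; }

    Object get(int i) {
        if (i != 0) throw new IndexOutOfBoundsException("field " + i);
        return value; // boxed only when somebody actually reads it
    }
}

/** Hypothetical multi-field flavor: all fields packed into one ByteBuffer. */
final class PackedTuple {
    // Illustrative type codes; these are NOT Pig's DataType constants.
    static final byte INT = 0, LONG = 1, DOUBLE = 2;

    private final byte[] types;   // one type code per field (the known schema)
    private final int[] offsets;  // byte offset of each field in the buffer
    private final ByteBuffer data;

    PackedTuple(byte[] types, Object[] values) {
        this.types = types.clone();
        this.offsets = new int[types.length];
        int size = 0;
        for (int i = 0; i < types.length; i++) {
            offsets[i] = size;
            size += width(types[i]); // also rejects unknown type codes
        }
        data = ByteBuffer.allocate(size);
        for (int i = 0; i < types.length; i++) {
            Number n = (Number) values[i]; // sketch assumes non-null numbers
            switch (types[i]) {
                case INT:    data.putInt(offsets[i], n.intValue()); break;
                case LONG:   data.putLong(offsets[i], n.longValue()); break;
                default:     data.putDouble(offsets[i], n.doubleValue()); break;
            }
        }
    }

    private static int width(byte type) {
        switch (type) {
            case INT:    return 4;
            case LONG:
            case DOUBLE: return 8;
            default: throw new IllegalArgumentException("unknown type " + type);
        }
    }

    int size() { return types.length; }

    /** Deserializes only the requested field; this is the read-time cost. */
    Object get(int i) {
        switch (types[i]) {
            case INT:  return data.getInt(offsets[i]);
            case LONG: return data.getLong(offsets[i]);
            default:   return data.getDouble(offsets[i]);
        }
    }
}
```

In this sketch a three-field (int, long, double) tuple holds 20 bytes of payload
in one buffer instead of three separately allocated boxed objects, and each
get() pays for a small decode, which is the trade-off described above.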
Microbenchmark results can be found in the javadoc for PrimitiveTuple.
Note that so far I haven't changed any behavior in existing Pig code, other
than changing one interface. The next step would be to start using these Tuples
when possible.
One complication is that since we don't push much metadata around with tuples,
we can only deserialize them into standard tuples; so all savings are lost once
we hit an MR boundary. Changing this would require a pretty significant
refactor; I'd love to hear ideas from folks who worked on BinInterSedes on how
to do this.
So far, I've played with using these in some UDFs that generate large bags of
tuples, and the difference in both speed and memory use is fairly dramatic.
> Support more efficient Tuples when schemas are known
> ----------------------------------------------------
>
> Key: PIG-2359
> URL: https://issues.apache.org/jira/browse/PIG-2359
> Project: Pig
> Issue Type: New Feature
> Reporter: Dmitriy V. Ryaboy
> Assignee: Dmitriy V. Ryaboy
> Attachments: PIG-2359.1.patch
>
>
> Pig Tuples have significant overhead due to the fact that all the fields are
> Objects.
> When a Tuple only contains primitive fields (ints, longs, etc), it's possible
> to avoid this overhead, which would result in significant memory savings.