[ https://issues.apache.org/jira/browse/PIG-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Gates updated PIG-1875:
----------------------------
Attachment: mrtuple.patch
Here's a first pass at what MToRTuple might look like. I've done some basic
testing to ensure that it works, but nothing comprehensive.
In test runs where I serialized 100k tuples, wrote them to disk, and read them
back, I got the following results:
DefaultTuple:
  time to write to disk:        81.93 sec
  size on disk:                 98M
  time to read from disk:       12.62 sec
  size in memory (after read):  238M
MToRTuple:
  time to write to disk:        10.49 sec
  size on disk:                 58M
  time to read from disk:       1.10 sec
  size in memory (after read):  57M
So roughly 1/4 the memory consumption, with disk writes about 8x faster and
disk reads about 11x faster.
> Keep tuples serialized to limit spilling and speed it when it happens
> ---------------------------------------------------------------------
>
> Key: PIG-1875
> URL: https://issues.apache.org/jira/browse/PIG-1875
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Reporter: Alan Gates
> Priority: Minor
> Attachments: mrtuple.patch
>
>
> Currently Pig reads records off of the reduce iterator and immediately
> deserializes them into Java objects. This takes up much more memory than
> serialized versions, thus Pig spills sooner than if it stored them in
> serialized form. Also, if it does have to spill, it has to serialize them
> again, and then again deserialize them after reading from the spill file.
> We should explore storing them in memory serialized when they are read off of
> the reduce iterator.
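
To make the idea in the description concrete, here is a minimal Java sketch of a
tuple that stays serialized in memory and decodes fields only on access. It is
not the code in mrtuple.patch: the class name SerializedTupleSketch, the
string-only fields, and the wire format (a field count followed by writeUTF
strings) are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.List;

/** Hypothetical sketch: a tuple of strings held as bytes until a field is read. */
final class SerializedTupleSketch {
    // Serialized form, standing in for the bytes read off the reduce iterator.
    private final byte[] bytes;

    SerializedTupleSketch(byte[] bytes) {
        this.bytes = bytes;
    }

    /** Builds a tuple using an assumed format: int field count, then each field via writeUTF. */
    static SerializedTupleSketch of(List<String> values) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(values.size());
            for (String v : values) {
                out.writeUTF(v);
            }
        }
        return new SerializedTupleSketch(buf.toByteArray());
    }

    /** Decodes up to the requested field on every call; nothing is cached as Java objects. */
    String get(int index) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int count = in.readInt();
            if (index < 0 || index >= count) {
                throw new IndexOutOfBoundsException("field " + index + " of " + count);
            }
            String value = in.readUTF();
            for (int i = 0; i < index; i++) {
                value = in.readUTF();
            }
            return value;
        }
    }

    /** Spilling copies the bytes as they are; no re-serialization step is needed. */
    void spill(DataOutputStream out) throws IOException {
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    /** Reading a spilled tuple back is a bulk byte read with no field decoding. */
    static SerializedTupleSketch readSpilled(DataInputStream in) throws IOException {
        byte[] b = new byte[in.readInt()];
        in.readFully(b);
        return new SerializedTupleSketch(b);
    }
}
```

In this sketch the spill path never touches individual fields, which is where the
avoided serialize/deserialize round trip would come from. The cost is that get()
re-decodes on each call, so a real implementation would have to weigh access
frequency against the memory saved.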