[ https://issues.apache.org/jira/browse/PIG-5083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822909#comment-15822909 ]
Rohini Palaniswamy commented on PIG-5083:
-----------------------------------------
If you look at
https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/Packager.java#L113-L129
you will see that in the mapreduce case, where readOnce is true, records are
read from the PeekedBag (which extends ReadOnceBag) and put into an
InternalCachedBag before being handed off to CombinerPackager.getNext(), which
then creates different bags for the rest of the plan to work with. Since the
bag is only iterated once, there is no need to materialize it into an
InternalCachedBag. Iteration can be done directly on the ReadOnceBag.
What this patch does is pass the PeekedBag directly to the CombinerPackager in
the mapreduce case and pass a TezReadOnceBag with Tez. Tez previously always
constructed an InternalCachedBag and had no concept of a ReadOnceBag
(readOnce was always false). The change saves one copy and a lot of memory and
GC overhead.
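
To illustrate the difference, here is a minimal standalone sketch. It is not
Pig code: OneShotBag, sumMaterialized and sumStreaming are hypothetical names
made up for this example, and OneShotBag only mimics the idea of a read-once
bag, not the actual ReadOnceBag API. The first method copies the input before
iterating, roughly as the old path did with InternalCachedBag. The second
iterates the read-once input directly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Pig.
final class ReadOnceSketch {

    /** Wraps an iterator so that it can be traversed exactly once. */
    static final class OneShotBag<T> implements Iterable<T> {
        private final Iterator<T> source;
        private boolean consumed = false;

        OneShotBag(Iterator<T> source) {
            this.source = source;
        }

        @Override
        public Iterator<T> iterator() {
            if (consumed) {
                throw new IllegalStateException("bag can only be iterated once");
            }
            consumed = true;
            return source;
        }
    }

    /** Copies every record into a list first, then iterates the copy. */
    static long sumMaterialized(Iterable<Long> bag) {
        List<Long> copy = new ArrayList<>();
        for (Long v : bag) {
            copy.add(v); // extra copy: all records are held in memory at once
        }
        long sum = 0;
        for (Long v : copy) {
            sum += v;
        }
        return sum;
    }

    /** Iterates the input directly; no intermediate collection is built. */
    static long sumStreaming(Iterable<Long> bag) {
        long sum = 0;
        for (Long v : bag) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Long> records = Arrays.asList(1L, 2L, 3L);
        System.out.println(sumMaterialized(new OneShotBag<>(records.iterator())));
        System.out.println(sumStreaming(new OneShotBag<>(records.iterator())));
    }
}
```

Both methods return the same result when the consumer makes a single pass, so
the intermediate copy only adds memory pressure. A consumer that needs more
than one pass would still have to materialize the input.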
> CombinerPackager and LitePackager should not materialize bags
> -------------------------------------------------------------
>
> Key: PIG-5083
> URL: https://issues.apache.org/jira/browse/PIG-5083
> Project: Pig
> Issue Type: Bug
> Reporter: Rohini Palaniswamy
> Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-5083-1.patch
>
>
> Before PIG-3591 and the creation of CombinerPackager, POCombinerPackage read
> directly from the combiner/reducer input instead of materializing the bag.
> https://github.com/apache/pig/blob/branch-0.12/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POCombinerPackage.java#L140-L161
> The unnecessary materialization leads to a lot of spills and OOMs in some cases.