[ https://issues.apache.org/jira/browse/DATAFU-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185409#comment-15185409 ]

Eyal Allweil commented on DATAFU-116:
-------------------------------------

As far as I can tell, when the Accumulator interface is used, Pig passes the 
UDF batches of _pig.accumulative.batchsize_ tuples from each bag until all the 
tuples are exhausted. I think an implementation that iterates over the bags and 
keeps only some of the tuples between batches is possible. Hopefully it would 
keep very few, but the worst case is all of them, which is no worse than the 
current implementation.
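
To make the idea concrete, here is a rough sketch of the carry-over logic for 
an intersection of two bags. It is plain Java with no Pig dependency, and it is 
not DataFu's code: the class name BatchedSortedIntersect, its methods, and the 
use of Integer in place of Pig tuples are all hypothetical. It assumes each bag 
is sorted ascending with no duplicates and that every call delivers the next 
batch from each bag.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration only; names and types are not from Pig or DataFu.
final class BatchedSortedIntersect {
    // Elements seen so far that might still match a later batch of the other bag.
    private final Deque<Integer> pendingA = new ArrayDeque<>();
    private final Deque<Integer> pendingB = new ArrayDeque<>();
    private final List<Integer> output = new ArrayList<>();

    // Called once per batch; assumes both bags are sorted ascending and distinct.
    void accumulate(List<Integer> batchA, List<Integer> batchB) {
        pendingA.addAll(batchA);
        pendingB.addAll(batchB);
        // Standard sorted merge: advance whichever side has the smaller head.
        while (!pendingA.isEmpty() && !pendingB.isEmpty()) {
            int cmp = pendingA.peekFirst().compareTo(pendingB.peekFirst());
            if (cmp == 0) {
                output.add(pendingA.pollFirst());
                pendingB.pollFirst();
            } else if (cmp < 0) {
                pendingA.pollFirst(); // smaller than everything left in B
            } else {
                pendingB.pollFirst(); // smaller than everything left in A
            }
        }
        // One side is now empty; the other side's leftovers are carried over.
    }

    // Number of elements currently held between batches.
    int retained() {
        return pendingA.size() + pendingB.size();
    }

    List<Integer> getValue() {
        return output;
    }
}
```

When the two bags cover similar ranges, only the tail of one batch is carried 
over. When one bag runs far ahead of the other, everything seen from that bag 
so far is retained, which is the worst case mentioned above.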

I'm assuming Pig passes batches this way based on the code in 
[POPackage|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackage.java]
and on all the documentation I could find on accumulators. If I'm wrong, an 
accumulator implementation isn't worthwhile.

> Make SetIntersect and SetDifference implement Accumulator
> ---------------------------------------------------------
>
>                 Key: DATAFU-116
>                 URL: https://issues.apache.org/jira/browse/DATAFU-116
>             Project: DataFu
>          Issue Type: Improvement
>    Affects Versions: 1.3.0
>            Reporter: Eyal Allweil
>
> SetIntersect and SetDifference accept only sorted bags, and the output is 
> always smaller than the inputs. Therefore an accumulator implementation 
> should be possible and it will improve memory usage (somewhat) and allow Pig 
> to optimize loops with these operations better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
