[jira] [Commented] (DATAFU-116) Make SetIntersect and SetDifference implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185046#comment-15185046 ]

Matthew Hayes commented on DATAFU-116:
--------------------------------------

I don't think an efficient accumulator implementation is possible for these UDFs. We have no control over how the data from each bag is fed into the accumulate method. You'd be forced to hold values from the bags in memory, which makes memory usage worse.

> Make SetIntersect and SetDifference implement Accumulator
> ---------------------------------------------------------
>
>                 Key: DATAFU-116
>                 URL: https://issues.apache.org/jira/browse/DATAFU-116
>             Project: DataFu
>          Issue Type: Improvement
>    Affects Versions: 1.3.0
>            Reporter: Eyal Allweil
>
> SetIntersect and SetDifference accept only sorted bags, and the output is
> always smaller than the inputs. Therefore an accumulator implementation
> should be possible and it will improve memory usage (somewhat) and allow Pig
> to optimize loops with these operations better.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (DATAFU-116) Make SetIntersect and SetDifference implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185409#comment-15185409 ]

Eyal Allweil commented on DATAFU-116:
-------------------------------------

As far as I can tell, when the accumulator is used, Pig passes _pig.accumulative.batchsize_ tuples from each bag until all the tuples are exhausted. I think an implementation that iterates over the bags and only keeps some of the tuples in between batches is possible - hopefully very few, but the worst case is all of them, which is no worse than the current implementation.

I'm assuming Pig passes batches in this way based on the code in [POPackage|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackage.java] and from looking through all the documentation I could find on accumulators. If I'm wrong it does mean that an accumulator implementation isn't worthwhile.
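The batch-wise approach described in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not DataFu code: it uses plain `java.util` lists of distinct, sorted integers in place of Pig bags and tuples, it handles exactly two inputs, and the class `BatchedIntersect` and its methods `accumulate`, `buffered` and `getValue` are hypothetical names chosen to mirror the shape of an accumulator.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for an accumulating set intersection.
// Assumes both inputs arrive as sorted batches of distinct values.
class BatchedIntersect {
    // Values received but not yet matched or discarded.
    private final Deque<Integer> pendingA = new ArrayDeque<>();
    private final Deque<Integer> pendingB = new ArrayDeque<>();
    private final List<Integer> output = new ArrayList<>();

    // Called once per batch; each list holds the next sorted slice of its input.
    void accumulate(List<Integer> batchA, List<Integer> batchB) {
        pendingA.addAll(batchA);
        pendingB.addAll(batchB);
        // Merge step: advance whichever side has the smaller head.
        while (!pendingA.isEmpty() && !pendingB.isEmpty()) {
            int cmp = pendingA.peekFirst().compareTo(pendingB.peekFirst());
            if (cmp == 0) {
                output.add(pendingA.pollFirst());
                pendingB.pollFirst();
            } else if (cmp < 0) {
                pendingA.pollFirst();
            } else {
                pendingB.pollFirst();
            }
        }
        // Whatever remains on one side must be kept until the other side catches up.
    }

    // Number of values currently held between batches.
    int buffered() {
        return pendingA.size() + pendingB.size();
    }

    List<Integer> getValue() {
        return output;
    }

    public static void main(String[] args) {
        BatchedIntersect acc = new BatchedIntersect();
        acc.accumulate(Arrays.asList(1, 3, 5), Arrays.asList(2, 3, 4));
        System.out.println(acc.getValue() + " buffered=" + acc.buffered());
        acc.accumulate(Arrays.asList(7, 9), Arrays.asList(5, 9));
        System.out.println(acc.getValue() + " buffered=" + acc.buffered());
    }
}
```

In this sketch only the unmatched tail of one input survives between calls. If one input runs far ahead of the other (for example, its values are all larger than anything the other input has delivered so far), that tail keeps growing, which is the "worst case is all of them" scenario mentioned above.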
[jira] [Commented] (DATAFU-116) Make SetIntersect and SetDifference implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187523#comment-15187523 ]

Matthew Hayes commented on DATAFU-116:
--------------------------------------

bq. but the worst case is all of them, which is no worse than the current implementation.

I think it would be worse than the current implementation actually. Pig does not keep the entire input bags in memory. I'm not an expert on Pig internals, but I believe as you iterate through the members of a DataBag it loads the data in chunks from disk. Without doing this it wouldn't be possible to operate on bags larger than what can fit in memory.
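For contrast, here is a minimal sketch of an iterator-driven intersection, the access pattern this comment has in mind. It assumes the inputs are sorted and contain distinct values, and it uses `java.util.Iterator` as a stand-in for iterating a bag; `StreamingIntersect` and `intersect` are hypothetical names, and this is not the actual DataFu implementation. If the underlying iterator really does page data in from disk as suggested above, a loop like this holds only one current element per input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical merge-style intersection over two sorted iterators.
class StreamingIntersect {
    static List<Integer> intersect(Iterator<Integer> a, Iterator<Integer> b) {
        List<Integer> out = new ArrayList<>();
        if (!a.hasNext() || !b.hasNext()) {
            return out; // an empty input means an empty intersection
        }
        Integer x = a.next();
        Integer y = b.next();
        while (true) {
            int cmp = x.compareTo(y);
            if (cmp == 0) {
                // Match: emit it and advance both sides.
                out.add(x);
                if (!a.hasNext() || !b.hasNext()) break;
                x = a.next();
                y = b.next();
            } else if (cmp < 0) {
                // x is too small to match anything left in b.
                if (!a.hasNext()) break;
                x = a.next();
            } else {
                // y is too small to match anything left in a.
                if (!b.hasNext()) break;
                y = b.next();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> left = Arrays.asList(1, 3, 5, 7);
        List<Integer> right = Arrays.asList(3, 4, 7, 8);
        System.out.println(intersect(left.iterator(), right.iterator()));
    }
}
```

Because the caller decides which iterator to advance, nothing has to be buffered beyond the two current values. An accumulator does not get that choice, since batches arrive from every input at once.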
[jira] [Commented] (DATAFU-116) Make SetIntersect and SetDifference implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189158#comment-15189158 ]

Eyal Allweil commented on DATAFU-116:
-------------------------------------

As far as I know, the behavior you're describing is how Pig deals with UDFs that implement the Accumulator interface. If the UDF doesn't (if it only extends EvalFunc), the parameters (including bags) are passed in memory in their entirety. I'm basing this on [this quote from Programming Pig|http://stackoverflow.com/a/15813789/150992]. That's why I'm suggesting this change.