[jira] [Commented] (DATAFU-116) Make SetIntersect and SetDifference implement Accumulator

2016-03-08 Thread Matthew Hayes (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185046#comment-15185046
 ] 

Matthew Hayes commented on DATAFU-116:
--

I don't think an efficient accumulator implementation is possible for these 
UDFs. We have no control over how the data from each bag is fed into the 
accumulate method. You'd be forced to hold values from the bags in memory, 
which makes memory usage worse.

> Make SetIntersect and SetDifference implement Accumulator
> -
>
> Key: DATAFU-116
> URL: https://issues.apache.org/jira/browse/DATAFU-116
> Project: DataFu
>  Issue Type: Improvement
>Affects Versions: 1.3.0
>Reporter: Eyal Allweil
>
> SetIntersect and SetDifference accept only sorted bags, and the output is 
> always smaller than the inputs. Therefore an accumulator implementation 
> should be possible and it will improve memory usage (somewhat) and allow Pig 
> to optimize loops with these operations better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DATAFU-116) Make SetIntersect and SetDifference implement Accumulator

2016-03-08 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185409#comment-15185409
 ] 

Eyal Allweil commented on DATAFU-116:
-

As far as I can tell, when the accumulator is used, Pig passes 
_pig.accumulative.batchsize_ tuples from each bag until all the tuples are 
exhausted. I think an implementation that iterates over the bags and only keeps 
some of the tuples in between batches is possible - hopefully very few, but the 
worst case is all of them, which is no worse than the current implementation.
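
To make the carry-over between batches concrete, here is a minimal sketch in plain Python rather than a real Pig UDF. It assumes both bags are ascending and duplicate-free, and that each call delivers one batch from each bag (a simplification: Pig's Accumulator receives a single Tuple per call). The names BatchedIntersect, batches, run and peak_held are hypothetical and exist only for this illustration.

```python
from collections import deque


class BatchedIntersect:
    """Intersects two ascending, duplicate-free streams that arrive in batches."""

    def __init__(self):
        self.pending_a = deque()  # values from bag A not yet matched or ruled out
        self.pending_b = deque()  # same for bag B
        self.result = []
        self.peak_held = 0  # most values carried over after any single call

    def accumulate(self, batch_a, batch_b):
        # Hypothetical signature: one batch per bag per call.
        self.pending_a.extend(batch_a)
        self.pending_b.extend(batch_b)
        a, b = self.pending_a, self.pending_b
        # Standard sorted merge; stops as soon as one side runs dry.
        while a and b:
            if a[0] == b[0]:
                self.result.append(a.popleft())
                b.popleft()
            elif a[0] < b[0]:
                a.popleft()
            else:
                b.popleft()
        # Whatever is left must be kept: a later batch may still match it.
        self.peak_held = max(self.peak_held, len(a) + len(b))


def batches(values, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def run(bag_a, bag_b, batch_size):
    """Feed both bags batch by batch; return (intersection, peak carry-over)."""
    acc = BatchedIntersect()
    chunks_a = batches(bag_a, batch_size)
    chunks_b = batches(bag_b, batch_size)
    for i in range(max(len(chunks_a), len(chunks_b))):
        acc.accumulate(chunks_a[i] if i < len(chunks_a) else [],
                       chunks_b[i] if i < len(chunks_b) else [])
    return acc.result, acc.peak_held
```

With interleaved inputs such as run([1, 2, 3, 4, 5, 6], [2, 4, 6, 8], 2), only a couple of values are carried between calls. With skewed inputs such as run(list(range(100, 110)), list(range(10)), 2), every value of the first bag ends up held in memory, which is the worst case described above.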

I'm assuming Pig passes batches in this way based on the code in 
[POPackage|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackage.java]
 and on all the documentation I could find on accumulators. 
If I'm wrong, an accumulator implementation isn't worthwhile.



[jira] [Commented] (DATAFU-116) Make SetIntersect and SetDifference implement Accumulator

2016-03-09 Thread Matthew Hayes (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187523#comment-15187523
 ] 

Matthew Hayes commented on DATAFU-116:
--

bq. but the worst case is all of them, which is no worse than the current 
implementation.

I think it would actually be worse than the current implementation. Pig does 
not keep the entire input bags in memory. I'm not an expert on Pig internals, 
but I believe that as you iterate through the members of a DataBag, it loads 
the data in chunks from disk. Without this, it wouldn't be possible to operate 
on bags larger than what can fit in memory.
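
As a rough illustration of why pull-based iteration keeps memory low, the sketch below intersects two ascending, duplicate-free inputs while holding only one current value per input. It is plain Python, not DataFu's actual code, and chunked_reader is a hypothetical stand-in for a bag that loads its contents chunk by chunk; whether DataBag behaves exactly this way is the belief stated above, not something this sketch confirms.

```python
def intersect_sorted(iter_a, iter_b):
    """Yield values present in both ascending, duplicate-free iterables."""
    it_a, it_b = iter(iter_a), iter(iter_b)
    sentinel = object()  # marks an exhausted input
    a, b = next(it_a, sentinel), next(it_b, sentinel)
    while a is not sentinel and b is not sentinel:
        if a == b:
            yield a
            a, b = next(it_a, sentinel), next(it_b, sentinel)
        elif a < b:
            a = next(it_a, sentinel)  # advance only the side that is behind
        else:
            b = next(it_b, sentinel)


def chunked_reader(values, chunk_size, loads):
    """Stand-in for a spillable bag: hands out values one chunk at a time."""
    for start in range(0, len(values), chunk_size):
        loads.append(start)  # record each simulated load from disk
        yield from values[start:start + chunk_size]
```

For example, list(intersect_sorted(chunked_reader(a, 4, loads_a), chunked_reader(b, 4, loads_b))) pulls a new chunk from either input only when the merge needs it. Because the caller decides which side to advance, nothing has to be buffered, whereas a batch-fed accumulate method cannot make that choice.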



[jira] [Commented] (DATAFU-116) Make SetIntersect and SetDifference implement Accumulator

2016-03-10 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189158#comment-15189158
 ] 

Eyal Allweil commented on DATAFU-116:
-

As far as I know, the behavior you're describing is how Pig deals with UDFs 
that implement the Accumulator interface. If the UDF doesn't (that is, if it 
only extends EvalFunc), the parameters (including bags) are passed in memory in 
their entirety. I'm basing this on [this quote from Programming 
Pig|http://stackoverflow.com/a/15813789/150992]. That's why I'm suggesting this 
change.


