Accumulator-like interface should be used with Pig operators after (co)group in
certain cases
---------------------------------------------------------------------------------------------
Key: PIG-1836
URL: https://issues.apache.org/jira/browse/PIG-1836
Project: Pig
Issue Type: Improvement
Reporter: Alan Gates
There are a number of cases where people (co)group their data and then pass it
to an operator other than foreach with a UDF, but where an accumulator-like
interface would still make sense. A few examples:
{code}
C = group B by $0;
D = foreach C generate flatten(B);
...
C = group B by $0;
D = stream C through 'script.py';
...
C = group B by $0;
store C into 'output';
{code}
In all of these cases, the operator that follows the (co)group does not require
all the data to be held in memory at once. There may be other such cases beyond
these three. Changing this part of the pipeline would greatly speed up these
types of queries and make them less likely to die with out-of-memory errors.
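
To illustrate the idea, here is a minimal Python sketch (not Pig code) of how
the flatten case could consume a group in bounded batches instead of building
the whole bag first. It assumes the records arrive already sorted by the
grouping key, as they would on the reduce side. The names grouped_batches,
flatten_group, and batch_size are hypothetical and chosen only for this
example; they are not part of Pig.

```python
from itertools import groupby, islice
from operator import itemgetter


def grouped_batches(sorted_records, key_index=0, batch_size=2):
    """Yield (key, batch) pairs; each batch holds at most batch_size records.

    Assumption: sorted_records is already ordered by the grouping key.
    """
    for key, group in groupby(sorted_records, key=itemgetter(key_index)):
        while True:
            # Pull the next slice of this group without materializing the rest.
            batch = list(islice(group, batch_size))
            if not batch:
                break
            yield key, batch


def flatten_group(sorted_records, key_index=0, batch_size=2):
    """Emulate 'group by $0' followed by flatten, one batch at a time."""
    for _key, batch in grouped_batches(sorted_records, key_index, batch_size):
        for record in batch:
            yield record


records = [("a", 1), ("a", 2), ("a", 3), ("b", 4)]
flattened = list(flatten_group(records))
batch_sizes = [len(batch) for _key, batch in grouped_batches(records)]
```

In this sketch, at most batch_size records of a group are held at any time,
regardless of how large the group is. The stream and store cases could follow
the same pattern, with each batch handed to the script or the output writer
instead of being yielded.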