[ 
https://issues.apache.org/jira/browse/PIG-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238205#comment-13238205
 ] 

Daniel Dai commented on PIG-2610:
---------------------------------

Yes, we shall open a Jira for the new rule. For now, you can try to optimize 
the script manually by moving the filter before the group and projecting only 
the necessary columns before the group. The GC exception comes not from the bag 
but from POProject; my suspicion is that the Hadoop shuffle/sort uses too much 
memory, leaving Pig no memory to work with.
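
One possible reading of that manual rewrite, as a rough sketch rather than a 
tested fix: the per-type filter conditions are evaluated before the GROUP as 
extra projected columns, so the nested FOREACH no longer needs FILTER. The 
alias "projected", the column names type1Clicks and type2Clicks, and the 
(long) cast on clickCount are assumptions for illustration; the original 
script does not state the field types.

{code}
raw = LOAD 'input' USING MyCustomLoader();
searches = FOREACH raw GENERATE
               day, searchType,
               FLATTEN(impBag) AS (adType, clickCount);
-- Keep only the columns the aggregation needs and evaluate the
-- adType conditions per row, before the group.
-- Assumption: clickCount is numeric and can be cast to long.
projected = FOREACH searches GENERATE
               day, searchType,
               (long)clickCount AS clickCount,
               ((adType == 'type1') ? (long)clickCount : 0L) AS type1Clicks,
               ((adType == 'type2') ? (long)clickCount : 0L) AS type2Clicks;
groupedSearches = GROUP projected BY (day, searchType) PARALLEL 50;
-- No nested FILTER is left; only COUNT and SUM over the grouped bag.
counts = FOREACH groupedSearches GENERATE
               FLATTEN(group) AS (day, searchType),
               COUNT(projected) AS numSearches,
               SUM(projected.clickCount) AS clickCountPerSearchType,
               SUM(projected.type1Clicks) AS type1ClickCount,
               SUM(projected.type2Clicks) AS type2ClickCount;
{code}

One difference to be aware of: for a group with no rows of a given adType, 
this version yields 0 where SUM over an empty filtered bag would yield null. 
With only COUNT and SUM left in the FOREACH, Pig may also be able to use the 
combiner, which should reduce the amount of data going through the shuffle.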
                
> GC errors on using FILTER within nested FOREACH
> -----------------------------------------------
>
>                 Key: PIG-2610
>                 URL: https://issues.apache.org/jira/browse/PIG-2610
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1
>            Reporter: Prashant Kommireddi
>
> A user has reported GC overhead errors when using FILTER within a nested 
> FOREACH and aggregating the filtered field. The following sample Pig Latin 
> script, provided by the user, triggers the issue. 
> {code}
> raw = LOAD 'input' using MyCustomLoader();
> searches = FOREACH raw GENERATE
>                day, searchType,
>                FLATTEN(impBag) AS (adType, clickCount)
>            ;
> groupedSearches = GROUP searches BY (day, searchType) PARALLEL 50;
> counts = FOREACH groupedSearches{
>                type1 = FILTER searches BY adType == 'type1';
>                type2 = FILTER searches BY adType == 'type2';
>                GENERATE
>                    FLATTEN(group) AS (day, searchType),
>                    COUNT(searches) AS numSearches,
>                    SUM(searches.clickCount) AS clickCountPerSearchType,
>                    SUM(type1.clickCount) AS type1ClickCount,
>                    SUM(type2.clickCount) AS type2ClickCount;
>        };
> {code}
> Pig should be able to handle this case.
