[ 
https://issues.apache.org/jira/browse/PIG-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078646#comment-13078646
 ] 

Aniket Mokashi commented on PIG-1631:
-------------------------------------

With 2-level nested foreach, Pig supports iterating over an inner bag of an 
alias.
For example,
{code}
c = foreach b { c1 = foreach a generate a1; generate c1; }
{code}
Supporting multi-level nested foreach would mean that Pig could iterate over 
the inner bags of bags of an alias.
For example, Pig would be able to support something similar to:
{code}
f = foreach e { f1 = foreach b { e1 = a.a1; generate e1;}; generate f1; }
{code}
(Note: This expression has 3 levels of nesting: 2 from foreach + 1 from 
projection.)

Although this is desirable, there are several complications with respect to 
the current state of the code.
1. The current parser nodes are not designed to support multi-level recursive 
calls. We currently use a lot of global state to make decisions in the parsing 
logic. To support multi-level foreach, we would need to rewrite a lot of 
parser code to support recursion.
For example, col_ref, which reduces to alias_col_ref, infers inOp from 
$statement::inputAlias, which is assumed to be set by the rel operator. But once 
we add nesting, we have to consider a stack traceback in order to infer the 
next-higher inOp.
2. Pig currently supports 6 nested operations, which would be the leaves of the 
tree once we support multi-level nested foreach. These operations would need 
to be revisited in order to complete the support.
3. We would need to add a lot of validations for negative cases. This is 
complicated by the fact that Pig supports scalars.
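
To illustrate point 1, here is a minimal Python sketch of how a stack of 
scopes could stand in for a single global input alias. It is not Pig's parser 
code: ScopeStack and its methods are hypothetical names, and only the idea of 
tracing back to the next-higher inOp is taken from the description above.

```python
class ScopeStack:
    """Hypothetical stand-in for parser state; not part of Pig."""

    def __init__(self):
        self._inputs = []  # the innermost foreach input is the last element

    def push(self, input_alias):
        # Entering a (nested) foreach: its input becomes the current inOp.
        self._inputs.append(input_alias)

    def pop(self):
        # Leaving a foreach block restores the enclosing input.
        return self._inputs.pop()

    def current_input(self):
        return self._inputs[-1]

    def enclosing_input(self, levels_up=1):
        # Trace back up the stack to a higher-level input.
        return self._inputs[-1 - levels_up]


scopes = ScopeStack()
scopes.push("e")                       # f = foreach e { ...
scopes.push("b")                       #   f1 = foreach b { ...
inner = scopes.current_input()         # a.a1 would resolve against b here
outer = scopes.enclosing_input()       # the next-higher input
scopes.pop()                           #   ... }
after_inner = scopes.current_input()   # back in the scope of e
```

With a single global alias, the value seen after leaving the inner block would 
still be the inner input; the stack makes the restore step explicit.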

Earlier, my intuition was that multi-level nested foreach would be supported by 
the Pig "backend" much like 2-level nested foreach, as the dependencies in the 
plan would take care of streaming the bags inside inner foreachs. To test this 
hypothesis, I tried to develop a patch (attached) that supports the 
multi-level foreach mentioned above with the following script. However, the 
code changes break other related code paths.
{code}
a = load '1.txt' as (a0:int, a1:int, a2:int);
b = group a by a0;
e = group b all;
f = foreach e { f1 = foreach b { e1 = a.a1; generate e1;}; generate f1; }
{code}
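
For readers who want to see the nesting concretely, the following Python 
sketch mimics the data shape that the script above would presumably produce. 
It models bags as lists and tuples as Python tuples; the sample rows are made 
up, and bag ordering (undefined in Pig) is fixed here only by sorting.

```python
from itertools import groupby

# Hypothetical contents of '1.txt' as (a0, a1, a2) tuples.
a = [(1, 10, 100), (1, 11, 101), (2, 20, 200)]

# b = group a by a0;  -> rows of (group, bag of a tuples)
rows = sorted(a, key=lambda t: t[0])
b = [(key, list(bag)) for key, bag in groupby(rows, key=lambda t: t[0])]

# e = group b all;  -> a single row ('all', bag of b tuples)
e = [("all", b)]

# f = foreach e { f1 = foreach b { e1 = a.a1; generate e1; }; generate f1; }
f = []
for _group, b_bag in e:
    f1 = []
    for _key, a_bag in b_bag:
        e1 = [(t[1],) for t in a_bag]  # a.a1: project a1 from the inner bag
        f1.append((e1,))               # generate e1
    f.append((f1,))                    # generate f1
```

The result is one row holding a bag of tuples, each of which holds a bag of 
a1 values: three levels of bags, matching the note above.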

Thoughts? Would it be a good idea to support multi-level foreach with only 
foreach generate nesting?

> Support to 2 level nested foreach
> ---------------------------------
>
>                 Key: PIG-1631
>                 URL: https://issues.apache.org/jira/browse/PIG-1631
>             Project: Pig
>          Issue Type: New Feature
>    Affects Versions: 0.7.0
>            Reporter: Viraj Bhat
>            Assignee: Aniket Mokashi
>              Labels: gsoc2011
>             Fix For: 0.10
>
>         Attachments: NestedForeachPatch3.txt, PIG-1631_3.patch
>
>
> What I would like to do is generate certain metrics for every listing 
> impression in the context of a page, such as clicks on the page. So, I first 
> group to get clicks and impressions together. Now, I want to iterate 
> through the mini-table (one per serve-id) and compute metrics. Since nested 
> foreach within foreach is not supported, I ended up writing a UDF that took 
> both bags and computed the metric. It would have been more elegant to keep 
> the logic of iterating over the records outside, in the Pig script. 
> Here is some pseudocode of how I would have liked to write it:
> {code}
> -- Let us say in our page context there was a click on rank 2, for which
> -- there were 3 ads.
> A1 = LOAD '...' AS (page_id, rank); -- clicks
> A2 = LOAD '...' AS (page_id, rank); -- impressions
> B = COGROUP A1 BY (page_id), A2 BY (page_id);
> -- Let us say B has the following schema:
> -- (group, {(A1...)}, {(A2...)})
> -- Each record in B would be:
> -- page_id_1, {(page_id_1, 2)}, {(page_id_1, 1), (page_id_1, 2), (page_id_1, 3)}
> C = FOREACH B GENERATE {
>                 -- This won't work in current Pig either. Basically, I would
>                 -- like a mini-table which represents an entire serve.
>                 D = FLATTEN(A1), FLATTEN(A2);
>                 FOREACH D GENERATE
>                         page_id_1,
>                         A2::rank,
>                         -- This UDF returns a value (like v1, v2, v3)
>                         -- depending on A1::rank and A2::rank.
>                         SOMEUDF(A1::rank, A2::rank);
> };
> -- output
> -- page_id, 1, v1
> -- page_id, 2, v2
> -- page_id, 3, v3
> DUMP C;
> {code}
> P.S.: I understand that I could alternatively have flattened the fields of B, 
> then done a GROUP on page_id, and then iterated through the records 
> calling 'SOMEUDF' appropriately, but that would be 2 map-reduce operations 
> AFAIK. 
> This is a candidate project for Google summer of code 2011. More information 
> about the program can be found at http://wiki.apache.org/pig/GSoc2011
