Philip (flip) Kromer created PIG-3919:
-----------------------------------------

             Summary: Inner FOREACH of nested FOREACH should have access to main-body aliases
                 Key: PIG-3919
                 URL: https://issues.apache.org/jira/browse/PIG-3919
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.12.0, 0.13.0
            Reporter: Philip (flip) Kromer



Pig should allow values calculated in the main body of a FOREACH to be accessed 
by an inner nested FOREACH.

{code}
top_queries = LOAD './test/data/pigunit/top_queries_input_data.txt' AS 
(site:chararray, hits:int);
-- yahoo        10
-- twitter      7
-- ...

top_queries_g = GROUP top_queries BY site;

-- BREAKS: Invalid field projection. Projected field [top_queries] does not
-- exist in schema: site:chararray,hits:int. - org.apache.pig.tools.grunt.Grunt
cant_use_values_in_inner_foreach = FOREACH top_queries_g {
  n_sites    = COUNT_STAR(top_queries);
  hits_x     = FOREACH top_queries GENERATE hits / n_sites;
  GENERATE group AS site, n_sites, hits_x;
  };
DUMP cant_use_values_in_inner_foreach;
{code}

{code}
-- This works, because n_sites behaves the same regardless of scope
can_use_const_val = FOREACH top_queries_g {
  n_sites    = 3;
  hits_x     = FOREACH top_queries GENERATE hits / n_sites;
  GENERATE group AS site, n_sites, hits_x;
  };
DUMP can_use_const_val;
{code}
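
A possible workaround for the first case, sketched here and not part of the 
original report or tested against 0.12.0/0.13.0: compute the per-group value in 
the outer GENERATE and FLATTEN the bag, so the value sits next to every row, 
then do the arithmetic in a second FOREACH. The aliases with_counts, 
hits_ratio and hits_x_flat are made up for illustration.

{code}
-- Workaround sketch (assumption, untested): avoid the nested FOREACH entirely.
-- FLATTEN repeats site and n_sites once per row of the grouped bag.
with_counts = FOREACH top_queries_g GENERATE
  group                     AS site,
  COUNT_STAR(top_queries)   AS n_sites,
  FLATTEN(top_queries.hits) AS hits;

-- Same expression as in the failing example; hits is int and n_sites is
-- long, so no fractional part is kept.
hits_x_flat = FOREACH with_counts GENERATE site, n_sites, hits / n_sites AS hits_ratio;
DUMP hits_x_flat;
{code}

This yields one row per input row rather than one bag per site; a second 
GROUP ... BY site would be needed to get the bag shape back.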

Pig handles the schema for the inner FOREACH in a very confusing way.

It should not allow statements in the main FOREACH body that reference fields 
outside the main-body scope:

{code}
works_but_is_confusing = FOREACH top_queries_g {
  namelen_g  = SIZE(group);
  namelen_s  = SIZE(site); -- this should not work
  
  -- but it does, because namelen_s gains right scope when evaluated
  hits_x     = FOREACH top_queries GENERATE namelen_s * hits;
  -- instead, this should work, only evaluating namelen_g once
  -- hits_x  = FOREACH top_queries GENERATE namelen_g * hits;

  -- if I used 'namelen_s' in this line, it would break.
  GENERATE group AS site, namelen_g, hits_x;
  };
DUMP works_but_is_confusing;
{code}

Here, the inner foreach precedes the declaration of 'site' in the main body:

{code}
-- declaring main-body site _after_ the inner foreach doesn't interfere
alias_means_two_things = FOREACH top_queries_g {
  hits_x     = FOREACH top_queries GENERATE SIZE(site)*hits; -- works
  site       = CONCAT(group, group);
  namelen_s  = SIZE(site);
  GENERATE site, namelen_s, hits_x;
  };
DUMP alias_means_two_things;
{code}

Simply switching the order of the lines causes an error: the main-body 
declaration of site hides the inner-bag alias. Also, the error is reported on 
the main-body line, which is very confusing.

{code}
-- BREAKS
main_body_hides_alias = FOREACH top_queries_g {
  -- Projected field [group] does not exist in schema: site:chararray,hits:int
  site       = CONCAT(group, group);
  namelen_s  = SIZE(site);
  hits_x     = FOREACH top_queries GENERATE SIZE(site)*hits;
  GENERATE site, namelen_s, hits_x;
  };
DUMP main_body_hides_alias;
{code}
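
If the reading above is right, giving the main-body value a name that does not 
collide with a field of the inner bag should sidestep the clash. The following 
is an untested sketch inferred from the behavior described in this report; the 
names site_twice and renamed_main_body_alias are hypothetical.

{code}
-- Sketch (assumption, untested): same statements as main_body_hides_alias,
-- but the main-body value is called site_twice, so 'site' inside the inner
-- FOREACH can only refer to the field of the top_queries bag.
renamed_main_body_alias = FOREACH top_queries_g {
  site_twice = CONCAT(group, group);
  namelen_s  = SIZE(site_twice);
  hits_x     = FOREACH top_queries GENERATE SIZE(site)*hits;
  GENERATE site_twice AS site, namelen_s, hits_x;
  };
DUMP renamed_main_body_alias;
{code}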



--
This message was sent by Atlassian JIRA
(v6.2#6252)
