[ 
https://issues.apache.org/jira/browse/PIG-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip (flip) Kromer updated PIG-3919:
--------------------------------------

    Description: 
Pig should allow values calculated in the main body of a FOREACH to be accessed 
by an inner nested FOREACH.

{code}
top_queries = LOAD './test/data/pigunit/top_queries_input_data.txt' AS 
(site:chararray, hits:int);
-- yahoo        10
-- twitter      7
-- ...

top_queries_g = GROUP top_queries BY site;

-- BREAKS: Invalid field projection. Projected field [top_queries] does not 
-- exist in schema: site:chararray,hits:int. - org.apache.pig.tools.grunt.Grunt
cant_use_values_in_inner_foreach = FOREACH top_queries_g {
  n_sites    = COUNT_STAR(top_queries);
  hits_x     = FOREACH top_queries GENERATE hits / n_sites;
  GENERATE group AS site, n_sites, hits_x;
  };
DUMP cant_use_values_in_inner_foreach;

-- This works, because a constant n_sites behaves the same regardless of scope
can_use_const_val = FOREACH top_queries_g {
  n_sites    = 3;
  hits_x     = FOREACH top_queries GENERATE hits / n_sites;
  GENERATE group AS site, n_sites, hits_x;
  };
DUMP can_use_const_val;
{code}

Pig resolves aliases for the inner FOREACH in a confusing way. 

It should reject statements in the main FOREACH body that reference fields 
outside the main-body scope:

{code}
works_but_is_confusing = FOREACH top_queries_g {
  namelen_g  = SIZE(group);
  namelen_s  = SIZE(site); -- this should not work
  
  -- but it does, because namelen_s gains right scope when evaluated
  hits_x     = FOREACH top_queries GENERATE namelen_s * hits;
  -- instead, this should work, only evaluating namelen_g once
  -- hits_x  = FOREACH top_queries GENERATE namelen_g * hits;

  -- using 'namelen_s' in the GENERATE below would break.
  GENERATE group AS site, namelen_g, hits_x;
  };
DUMP works_but_is_confusing;
{code}

Here, the inner foreach precedes the declaration of 'site' in the main body:

{code}
-- declaring main-body site _after_ the inner foreach doesn't interfere
alias_means_two_things = FOREACH top_queries_g {
  hits_x     = FOREACH top_queries GENERATE SIZE(site)*hits; -- works
  site       = CONCAT(group, group);
  namelen_s  = SIZE(site);
  GENERATE site, namelen_s, hits_x;
  };
DUMP alias_means_two_things;
{code}

Simply switching the order of the lines causes an error -- the main-body 
declaration of site hides the inner-bag alias. The error is also reported 
against the main-body line rather than the inner FOREACH, which is confusing.

{code}
-- BREAKS
main_body_hides_alias = FOREACH top_queries_g {
  -- next line fails: Projected field [group] does not exist in schema: 
  -- site:chararray,hits:int
  site       = CONCAT(group, group);
  namelen_s  = SIZE(site);
  hits_x     = FOREACH top_queries GENERATE SIZE(site)*hits;
  GENERATE site, namelen_s, hits_x;
  };
DUMP main_body_hides_alias;
{code}
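
A possible workaround for the first example, until main-body aliases are 
visible to the inner FOREACH. This is only a sketch and not part of the 
requested fix: it computes the per-group count in a plain FOREACH and FLATTENs 
the bag so that each row carries n_sites next to its hits. The aliases 
with_n_sites and hits_x_flat are illustrative names, and the result is one row 
per input record rather than one bag per site.

{code}
-- assumes top_queries_g from the first example above
with_n_sites = FOREACH top_queries_g GENERATE
  group                   AS site,
  COUNT_STAR(top_queries) AS n_sites,
  FLATTEN(top_queries.hits) AS hits;

-- n_sites is now an ordinary field, so it can be used next to hits
hits_x_flat = FOREACH with_n_sites GENERATE site, n_sites, hits / n_sites AS hits_x;
DUMP hits_x_flat;
{code}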

  was:

Pig should allow values calculated in the main body of a FOREACH to be accessed 
by an inner nested FOREACH.

{code}
top_queries = LOAD './test/data/pigunit/top_queries_input_data.txt' AS 
(site:chararray, hits:int);
-- yahoo        10
-- twitter      7
-- ...

top_queries_g = GROUP top_queries BY site;

-- BREAKS: Invalid field projection. Projected field [top_queries] does not 
-- exist in schema: site:chararray,hits:int. - org.apache.pig.tools.grunt.Grunt
cant_use_values_in_inner_foreach = FOREACH top_queries_g {
  n_sites    = COUNT_STAR(top_queries);
  hits_x     = FOREACH top_queries GENERATE hits / n_sites;
  GENERATE group AS site, n_sites, hits_x;
  };
DUMP cant_use_values_in_inner_foreach;
{code}

-- This works, because a constant n_sites behaves the same regardless of scope
can_use_const_val = FOREACH top_queries_g {
  n_sites    = 3;
  hits_x     = FOREACH top_queries GENERATE hits / n_sites;
  GENERATE group AS site, n_sites, hits_x;
  };
DUMP can_use_const_val;
{code}

Pig handles the schema for the inner foreach in a very confusing way. 

It should not allow statements in the main foreach body that aren't in the 
main-body scope:

{code}
works_but_is_confusing = FOREACH top_queries_g {
  namelen_g  = SIZE(group);
  namelen_s  = SIZE(site); -- this should not work
  
  -- but it does, because namelen_s gains right scope when evaluated
  hits_x     = FOREACH top_queries GENERATE namelen_s * hits;
  -- instead, this should work, only evaluating namelen_g once
  -- hits_x  = FOREACH top_queries GENERATE namelen_g * hits;

  -- if I used 'namelen_s' in this line, it would break.
  GENERATE group AS site, namelen_g, hits_x;
  };
DUMP works_but_is_confusing;
{code}

Here, the inner foreach precedes the declaration of 'site' in the main body:

{code}
-- declaring main-body site _after_ the inner foreach doesn't interfere
alias_means_two_things = FOREACH top_queries_g {
  hits_x     = FOREACH top_queries GENERATE SIZE(site)*hits; -- works
  site       = CONCAT(group, group);
  namelen_s  = SIZE(site);
  GENERATE site, namelen_s, hits_x;
  };
DUMP alias_means_two_things;
{code}

Simply switching the order of the lines causes an error -- the main body 
declaration of site hides the inner-bag alias. Also, the error shows up on the 
line in the main-body, which is very confusing.

{code}
-- BREAKS
main_body_hides_alias = FOREACH top_queries_g {
  -- next line fails: Projected field [group] does not exist in schema: 
  -- site:chararray,hits:int
  site       = CONCAT(group, group);
  namelen_s  = SIZE(site);
  hits_x     = FOREACH top_queries GENERATE SIZE(site)*hits;
  GENERATE site, namelen_s, hits_x;
  };
DUMP main_body_hides_alias;
{code}


> Inner FOREACH of nested FOREACH should have access to main-body aliases
> -----------------------------------------------------------------------
>
>                 Key: PIG-3919
>                 URL: https://issues.apache.org/jira/browse/PIG-3919
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.12.0, 0.13.0
>            Reporter: Philip (flip) Kromer
>              Labels: alias, foreach, innerbag, nested, schema
>



--
This message was sent by Atlassian JIRA
(v6.2#6252)
