[ https://issues.apache.org/jira/browse/PIG-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Philip (flip) Kromer updated PIG-3919:
--------------------------------------
    Description:
Pig should allow values calculated in the main body of a FOREACH to be accessed by an inner nested FOREACH.

{code}
top_queries = LOAD './test/data/pigunit/top_queries_input_data.txt' AS (site:chararray, hits:int);
-- yahoo 10
-- twitter 7
-- ...
top_queries_g = GROUP top_queries BY site;

-- BREAKS: Invalid field projection. Projected field [top_queries] does not exist in schema: site:chararray,hits:int. - org.apache.pig.tools.grunt.Grunt
cant_use_values_in_inner_foreach = FOREACH top_queries_g {
  n_sites = COUNT_STAR(top_queries);
  hits_x = FOREACH top_queries GENERATE hits / n_sites;
  GENERATE group AS site, n_sites, hits_x;
};
DUMP cant_use_values_in_inner_foreach;

-- This works, because n_sites behaves the same regardless of scope
can_use_const_val = FOREACH top_queries_g {
  n_sites = 3;
  hits_x = FOREACH top_queries GENERATE hits / n_sites;
  GENERATE group AS site, n_sites, hits_x;
};
DUMP can_use_const_val;
{code}

Pig handles the schema for the inner foreach in a very confusing way. It should not allow statements in the main foreach body that aren't in the main-body scope:

{code}
works_but_is_confusing = FOREACH top_queries_g {
  namelen_g = SIZE(group);
  namelen_s = SIZE(site); -- this should not work

  -- but it does, because namelen_s gains the right scope when evaluated
  hits_x = FOREACH top_queries GENERATE namelen_s * hits;
  -- instead, this should work, only evaluating namelen_g once
  -- hits_x = FOREACH top_queries GENERATE namelen_g * hits;
  -- if I used 'namelen_s' in this line, it would break.
  GENERATE group AS site, namelen_g, hits_x;
};
DUMP works_but_is_confusing;
{code}

Here, the inner foreach precedes the declaration of 'site' in the main body:

{code}
-- declaring main-body site _after_ the inner foreach doesn't interfere
alias_means_two_things = FOREACH top_queries_g {
  hits_x = FOREACH top_queries GENERATE SIZE(site)*hits; -- works
  site = CONCAT(group, group);
  namelen_s = SIZE(site);
  GENERATE site, namelen_s, hits_x;
};
DUMP alias_means_two_things;
{code}

Simply switching the order of the lines causes an error: the main-body declaration of 'site' hides the inner-bag alias. Also, the error is reported on the main-body line, which is very confusing.

{code}
-- BREAKS
main_body_hides_alias = FOREACH top_queries_g {
  site = CONCAT(group, group); -- Projected field [group] does not exist in schema: site:chararray,hits:int
  namelen_s = SIZE(site);
  hits_x = FOREACH top_queries GENERATE SIZE(site)*hits;
  GENERATE site, namelen_s, hits_x;
};
DUMP main_body_hides_alias;
{code}

> Inner FOREACH of nested FOREACH should have access to main-body aliases
> -----------------------------------------------------------------------
>
>                 Key: PIG-3919
>                 URL: https://issues.apache.org/jira/browse/PIG-3919
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.12.0, 0.13.0
>            Reporter: Philip (flip) Kromer
>              Labels: alias, foreach, innerbag, nested, schema

--
This message was sent by Atlassian JIRA
(v6.2#6252)