[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366248#comment-16366248 ]

Koji Noguchi commented on PIG-4608:
-----------------------------------

bq. To me, UPDATE $1 with r+$2 means update the first field, regardless of name, with r+second field.

You probably meant update the second field with r + third field. (Pig counts positions from 0.) In any case, I get your point. [~daijy], [~rohini], [~kpriceyahoo], any preferences?

bq. UPDATE $1 means n=$1 and updating the _n_th field accordingly.

My interpretation of $1 should probably be disallowed anyway, since it takes away the optimization opportunity (we would not know at compile time which fields are being updated or dropped).

> FOREACH ... UPDATE
> ------------------
>
>                 Key: PIG-4608
>                 URL: https://issues.apache.org/jira/browse/PIG-4608
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Haley Thrapp
>            Priority: Major
>
> I would like to propose a new command in Pig, FOREACH...UPDATE.
> Syntactically, it would look much like FOREACH … GENERATE.
> Example:
> Input data:
> (1,2,3)
> (2,3,4)
> (3,4,5)
> -- Load the data
> three_numbers = LOAD 'input_data'
> USING PigStorage()
> AS (f1:int, f2:int, f3:int);
> -- Sum up the row
> updated = FOREACH three_numbers UPDATE
> 5 as f1,
> f1+f2 as new_sum
> ;
> Dump updated;
> (5,2,3,3)
> (5,3,4,5)
> (5,4,5,7)
> Fields to update must be specified by alias. Any fields in the UPDATE that do
> not match an existing field will be appended to the end of the tuple.
> This command is particularly desirable in scripts that deal with a large
> number of fields (in the 20-200 range). Often, we need to make
> modifications to only a few fields. The FOREACH ... UPDATE statement allows the
> developer to focus on the actual logical changes instead of having to list
> all of the fields that are also being passed through.
> My team has prototyped this with changes to FOREACH ... GENERATE. We believe
> this can be done with changes to the parser and the creation of a new
> LOUpdate. No physical plan changes should be needed because we will leverage
> what LOGenerate does.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
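The behaviour described in the issue (update by alias, append anything that does not match an existing field) can be modelled outside Pig. The following is only a rough Python sketch of that description, not the prototype's code: `foreach_update` and its argument shapes are hypothetical names chosen for illustration, and it assumes every expression is evaluated against the input row, which is what the `(5,2,3,3)` output in the example implies.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Expr = Callable[[Dict[str, Any]], Any]


def foreach_update(
    schema: Sequence[str],
    rows: Sequence[Tuple[Any, ...]],
    updates: Sequence[Tuple[str, Expr]],
) -> Tuple[List[str], List[Tuple[Any, ...]]]:
    """Hypothetical model of FOREACH ... UPDATE as described in the issue."""
    # Aliases that match an existing field keep their position;
    # unmatched aliases are appended to the end of the tuple.
    out_schema = list(schema)
    for alias, _ in updates:
        if alias not in out_schema:
            out_schema.append(alias)

    out_rows = []
    for row in rows:
        src = dict(zip(schema, row))  # expressions only see input values (assumption)
        dst = dict(src)
        for alias, expr in updates:
            dst[alias] = expr(src)
        out_rows.append(tuple(dst[name] for name in out_schema))
    return out_schema, out_rows


schema = ["f1", "f2", "f3"]
rows = [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
updates = [
    ("f1", lambda r: 5),                     # 5 as f1
    ("new_sum", lambda r: r["f1"] + r["f2"]),  # f1+f2 as new_sum
]
out_schema, out_rows = foreach_update(schema, rows, updates)
print(out_schema)
print(out_rows)
```

Under these assumptions the sketch reproduces the three output tuples listed in the issue description.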
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366223#comment-16366223 ]

Will Lauer commented on PIG-4608:
---------------------------------

To me, {{UPDATE $1 with r+$2}} means update the first field, regardless of name, with r+second field.

I think {{UPDATE 1 with r+$2}} means that the user is trying to update a field named "1". The fact that this is an illegal field name (not an identifier) should generate an error.
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366197#comment-16366197 ]

Koji Noguchi commented on PIG-4608:
-----------------------------------

{quote}The idea is to make ".." syntax more flexible,
{quote}
I think one of the goals here is to let users manipulate records without using ".." at all.

For the initial version, let's just focus on the basics. We can add more later, but of course changing things afterwards is always tough. I don't want this jira to go stale after having such a great contribution from Will.

I feel having UPDATE and DROP with simple column (field) updates is a good start. The only thing I'm not clear on is:
{code:java}
/* simple update using positional arguments */
a = FOREACH b UPDATE $1 with r+$2;
{code}
Should this be {{UPDATE 1 with r+$2}}? To me, {{UPDATE $1}} means {{n=$1}} and updating the _n_th field accordingly.
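The two readings of `UPDATE $1 with r+$2` discussed here can be contrasted with a small model. This is a hedged Python sketch, not Pig behaviour: it assumes a three-field row in which `r` happens to sit at position 1, and the function names are invented for illustration.

```python
R = 1  # assumed position of field "r" in the illustrative schema (q, r, s)


def update_static(row):
    """$1 is a positional reference: always overwrite the field at position 1."""
    out = list(row)
    out[1] = row[R] + row[2]
    return tuple(out)


def update_dynamic(row):
    """n = value of $1: the target position is read from the data itself."""
    out = list(row)
    n = row[1]
    out[n] = row[R] + row[2]
    return tuple(out)


rows = [(7, 0, 10), (7, 2, 10)]
static_result = [update_static(r) for r in rows]
dynamic_result = [update_dynamic(r) for r in rows]
print(static_result)   # the same column changes in every row
print(dynamic_result)  # the changed column differs from row to row
```

In the dynamic reading the modified column depends on each row's contents, so it cannot be determined from the script alone.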
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340555#comment-16340555 ]

Koji Noguchi commented on PIG-4608:
-----------------------------------

{quote}I didn't see UPDATE/DROP in a single statement in the example, are we not going to support both in the same statement? I actually prefer those in the same statement, as I feel users usually think about adjusting all columns in the same time.
{quote}
This could be because of a request I made in one of my previous comments: "For now, can we just require separate statements for update and delete?"

I just wanted to keep it simple and leave the combining part for later, when we have more use cases. Also, I'm afraid of confusion when indexes and fields overlap.

Say, {{A:(f0:int, f1:int, f2:int, f3:int)}}
{code:java}
B = FOREACH A drop f1, update 2 with $1;
{code}
Is the code updating {{f2}} with the value of {{f1}}? Or updating {{f3}} with the value of {{f2}}? Or something else?
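The ambiguity in `drop f1, update 2 with $1` can be made concrete by modelling both readings. The Python below is an illustrative sketch only; the two function names and the sample row are assumptions, and neither reading is claimed to be what the patch implements.

```python
def positions_from_input(row):
    """Positions refer to the input schema (f0, f1, f2, f3)."""
    out = list(row)
    out[2] = row[1]  # f2 := f1
    del out[1]       # then drop f1
    return tuple(out)


def positions_from_output(row):
    """Positions refer to the schema left after the drop: (f0, f2, f3)."""
    out = [value for i, value in enumerate(row) if i != 1]  # drop f1 first
    out[2] = out[1]  # position 2 is now f3, $1 is now f2
    return tuple(out)


a_row = (10, 11, 12, 13)  # f0, f1, f2, f3
by_input = positions_from_input(a_row)
by_output = positions_from_output(a_row)
print(by_input)
print(by_output)
```

The same statement yields two different tuples depending on which schema the positions are resolved against, which is the confusion being described.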
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336593#comment-16336593 ]

Daniel Dai commented on PIG-4608:
---------------------------------

bq. a = FOREACH b UPDATE q AS q:int -- This should be illegal, right? If the type is changed, an explicit modify of the value should occur

This should be valid; the AS clause has the capacity to change types. The UPDATE clause is evaluated before the AS clause, so

a = FOREACH b UPDATE q WITH (int)q AS q:chararray;

will result in a chararray q.

bq. flattening a tuple into existing fields - does this make sense

This makes sense; it is symmetric with the AS clause.

I didn't see UPDATE/DROP in a single statement in the example. Are we not going to support both in the same statement? I actually prefer having them in the same statement, as I feel users usually think about adjusting all columns at the same time.

How about APPEND? Actually, when I think about DROP/APPEND, I feel we have to have INSERT as well to close the loop. But if we add INSERT, another syntax might be more appropriate, such as:

a = FOREACH b generate .., UPDATE a10 WITH 1 as new_a10, ..a20, 2 as a_20_plus_half, ..a30, a32.., UPDATE a40 WITH 2 as new_a40, 1 as a41;

Here:
Update: a10, a40 using UPDATE clause
Insert: a_20_plus_half
Drop: a31
Append: a41

The original use case can be written as:

intermediate = foreach i generate .., 3 as f3, .., 6 as f6, .. 48 as f48, ..;

The idea is to make the ".." syntax more flexible and to skip the prefix/suffix when it can be inferred. It is probably more natural to add support for INSERT this way, which makes the syntax complete. How does that sound?
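The stated evaluation order (UPDATE expression first, then the AS clause) can be pictured as two conversions applied in sequence. This is a loose Python analogy rather than Pig's cast rules: `int` and `str` stand in for `(int)` and `chararray`, and `update_then_as` is a hypothetical helper.

```python
def update_then_as(value, with_cast, as_cast):
    """Apply the WITH expression first, then the type named in the AS clause."""
    return as_cast(with_cast(value))


q = 42.9
# UPDATE q WITH (int)q AS q:chararray  ->  value is truncated, then stringified
via_with_and_as = update_then_as(q, int, str)
# For comparison: only the AS clause changes the type, the value is untouched
via_as_only = str(q)
print(repr(via_with_and_as), repr(via_as_only))
```

In this analogy the final type is always the one named in AS, while the WITH expression decides which value gets converted.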
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331344#comment-16331344 ]

Rohini Palaniswamy commented on PIG-4608:
-----------------------------------------

bq. a = FOREACH b UPDATE q AS q:int -- This should be illegal, right? If the type is changed, an explicit modify of the value should occur

That should be supported after PIG-2315 (not pulled into our internal Y releases). [~knoguchi] can confirm.

bq. This should be illegal, right? No types should be present in a DROP statement

Yes.

bq. flattening a tuple into existing fields - does this make sense

Not sure if there is a use case, but I don't see a problem with adding support for it. What happens if $5 has more than 3 fields? I am assuming it will be something like

a = FOREACH b UPDATE q with $5.f1 , r WITH $5.f2 , s with $5.f3 as t;

bq. flattening a bag into existing fields, exploding rows in the process

You will have to add support for maps as well.
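The open question about a tuple with more fields than targets can be explored with a small model. The Python below is a speculative sketch: `update_with_flatten` and its `strict` switch are invented for illustration, with `strict=False` mirroring the field-by-field reading above (take the first three fields) and `strict=True` showing the alternative of rejecting an arity mismatch.

```python
def update_with_flatten(row, schema, targets, source, strict=True):
    """Assign the fields of the tuple column `source` to `targets`, in order."""
    values = row[schema.index(source)]
    if len(values) < len(targets) or (strict and len(values) != len(targets)):
        raise ValueError(
            f"cannot flatten {len(values)} fields into {len(targets)} targets"
        )
    out = list(row)
    for alias, value in zip(targets, values):  # extra source fields are ignored
        out[schema.index(alias)] = value
    return tuple(out)


schema = ["q", "r", "s", "t"]
row = (1, 2, 3, (7, 8, 9, 10))  # t plays the role of $5 and has four fields
lenient = update_with_flatten(row, schema, ["q", "r", "s"], "t", strict=False)
print(lenient)
```

Whether the extra field should be dropped silently or reported as an error is a design decision the thread leaves open.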
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331309#comment-16331309 ]

Will Lauer commented on PIG-4608:
---------------------------------

Ok, just to close the loop, here are several examples given the new proposed syntax. I want to make sure I understand which are correct and what the behavior is in each case.

```
/* simple projection, specifying resulting schema, using both explicit column names and positions */
a = FOREACH b GENERATE 1+s as x:long, $2+$3 as y:chararray, q-1 as z;
a = FOREACH b GENERATE FLATTEN(s) as (x:int, y:long, z:chararray); -- flattening tuples into individual columns
a = FOREACH b GENERATE FLATTEN(s) as x:int, 1 as y; -- flattening bags into multiple rows

/* complex projection, specifying resulting schema, using both explicit column names and positions */
a = FOREACH b {
    q = COUNT(s);
    r = someUdf($1,$2);
    GENERATE q as x:long, r as y;
}

/* simple update */
a = FOREACH b UPDATE q with r+s;

/* complex update */
a = FOREACH b {
    q = COUNT(s);
    r = someUdf($1, $2);
    UPDATE qprime WITH q, rprime WITH r;
}

/* simple update using positional arguments */
a = FOREACH b UPDATE $1 with r+$2;

/* simple renaming of a column */
a = FOREACH b UPDATE q as r;

/* simple schema type change */
a = FOREACH b UPDATE q WITH (int)q AS q:int; -- change q from something to int
a = FOREACH b UPDATE q AS q:int; -- This should be illegal, right? If the type is changed, an explicit modify of the value should occur

/* rename, type, and value change together */
a = FOREACH b UPDATE q WITH computeR(q) as r:long;

/* simple column drop */
a = FOREACH b DROP q,r,$5; -- drops columns q, r, and whatever column is at position 5 (positions are zero-based)
a = FOREACH b DROP q:int; -- This should be illegal, right? No types should be present in a DROP statement

/* updating an individual field within a tuple - not implemented in the initial version */
a = FOREACH b UPDATE q.$1.fieldN WITH r+s;

/* renaming an individual field within a tuple - not implemented in the initial version */
a = FOREACH b UPDATE q.$1.fieldN AS newFieldN; -- has the result of renaming the field within q.$1, not renaming q or $1

/* flattening a tuple into existing fields - does this make sense? */
a = FOREACH b UPDATE (q,r,s) WITH FLATTEN($5);
a = FOREACH b UPDATE (q,r,s) WITH FLATTEN($5) AS (q, r, t); -- renaming one column during flattening assignment
a = FOREACH b UPDATE (q,r,s) WITH FLATTEN($5) AS (q:int, r:chararray, s:long); -- re-typing arguments as part of flattening

/* flattening a bag into existing fields, exploding rows in the process -- does this make sense? */
a = FOREACH b UPDATE f1 WITH FLATTEN(bagCol);
a = FOREACH b UPDATE f1 WITH FLATTEN(bagCol) as f2:int; -- rename field and possibly retype as part of the flatten
```

While I admit the WITH/AS syntax is useful, it still feels a bit weird to me as a pig script writer. I'd love to have [~kpriceyahoo] weigh in on the proposal to ensure it still makes sense to heavy pig script writers.
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329539#comment-16329539 ]

Daniel Dai commented on PIG-4608:
---------------------------------

"add" (or append?)/"update xxx as"/"drop" syntax sounds good to me. We also want to make sure it works with positional references ($0, $1, etc.). You might take a look at PIG-3122 for keyword conflicts, if applicable.
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329258#comment-16329258 ]

Will Lauer commented on PIG-4608:
---------------------------------

{quote}This is Yahoo-centric, but would it be possible to grep our logs for existing pig jobs and see how many of them have keyword conflicts with 'update', 'delete', 'drop', etc? I'm indifferent on 'delete' versus 'drop', but it'd be interesting to know which one would impact fewer existing scripts.{quote}

Keyword conflicts shouldn't really be an issue given the syntax we are talking about. Given the way the syntax works, it will always be obvious to the parser whether "drop"/"delete" is referring to a column or is a keyword.
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329135#comment-16329135 ]

Will Lauer commented on PIG-4608:
---------------------------------

I like drop. If everyone else agrees, I'll change my patch to use that instead of delete.

On a similar note, if we ever want to include adding columns in addition to update and delete, I'd suggest "append" since it correctly implies that the columns are added at the end of the schema.
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327955#comment-16327955 ]

Rohini Palaniswamy commented on PIG-4608:
-----------------------------------------

bq. As for the 'update val AS col' versus 'update col BY val', I think the former looks less confusing to a current pig user.

Agree and prefer usage of AS. Also BY does not make sense grammatically. It should either be AS (update val AS col) or WITH (update col WITH val).

bq. As for delete, I somehow prefer using drop.

I prefer drop as well as that is similar to the alter table drop column of SQL. Delete is generally used for rows.
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327736#comment-16327736 ]

Kevin J. Price commented on PIG-4608:
-------------------------------------

This is Yahoo-centric, but would it be possible to grep our logs for existing pig jobs and see how many of them have keyword conflicts with 'update', 'delete', 'drop', etc? I'm indifferent on 'delete' versus 'drop', but it'd be interesting to know which one would impact fewer existing scripts.

As for the 'update val AS col' versus 'update col BY val', I think the former looks less confusing to a current pig user. 'BY' only gets used currently for key ordering in groups and joins, whereas 'AS' is already used for value assignment. I agree that there's a difference between 'GENERATE val AS col' and 'UPDATE val AS col', but it's a fairly philosophical difference from the user's perspective. In both cases, they want col to have the value val after the statement, so having the same syntax makes sense.
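The log search suggested here could start from something like the following. It is a rough Python sketch under stated assumptions: it treats any whole-word, case-insensitive occurrence of a candidate keyword outside comments and single-quoted strings as a potential conflict, the helper name `count_keyword_identifiers` is hypothetical, and the comment and string stripping is approximate rather than a real Pig tokenizer.

```python
import re
from collections import Counter

KEYWORDS = ("update", "delete", "drop")
WORD = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\b", re.IGNORECASE)
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
LINE_COMMENT = re.compile(r"--.*$", re.MULTILINE)
STRING = re.compile(r"'(?:\\.|[^'\\])*'")


def count_keyword_identifiers(script_text):
    """Count candidate-keyword words that appear in the code part of a script."""
    # Approximate cleanup: a "--" inside a string literal would be misread.
    text = BLOCK_COMMENT.sub(" ", script_text)
    text = LINE_COMMENT.sub(" ", text)
    text = STRING.sub("''", text)
    return Counter(match.group(1).lower() for match in WORD.finditer(text))


sample = """
-- drop old rows (comment, ignored)
a = LOAD 'update.txt' AS (id:int, update:chararray, delete:int);
b = FOREACH a GENERATE id, update, Drop;
"""
counts = count_keyword_identifiers(sample)
print(counts)
```

Summing these counters over a corpus of scripts would give a first estimate of how many existing scripts use each word as an alias or field name.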
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327686#comment-16327686 ]

Koji Noguchi commented on PIG-4608:
-----------------------------------

For now, can we just require separate statements for update and delete?

Also, I'm not too excited about the use of {{AS}} for specifying which field to update. So far, "as" has been used only for naming fields. I'm wondering if we can require the field name up front. Instead of
* {{c = foreach a update "prefix"+x as x;}}

can we write
* {{c = foreach a update x by "prefix"+x;}}

As for delete, I somehow prefer using {{drop}}.

[~daijy], I'd love to hear your thoughts on this.
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327602#comment-16327602 ]

Will Lauer commented on PIG-4608:
---------------------------------

While up in the middle of the night dealing with a sick child, I realized there was a way to make the parsing sane if updates, adds and deletes were all to be included in a single statement. How does this syntax look?

{code}
a = load 'input' using mock.Storage() as (x:chararray, y:chararray, z:long);
b = foreach a generate x+y as q, y, z:long;
c = foreach a update "prefix"+x as x, (chararray)(z+1) as z:chararray;
d = foreach a delete x, z;
e = foreach a {
    nextInt = z+1;
    update nextInt as z:int;
}
f = foreach a
    add { 1+oldCol as new:long, somethingElse as new2 }
    delete { colToRemove, otherColToRemove }
    update { 1+oldCol2 as updatedCol, "1"+oldCol2 as updatedTypeCol:chararray };
g = foreach a {
    nextInt = z+1;
    add { 1+oldCol as new:long, somethingElse as new2 }
    delete { colToRemove, otherColToRemove }
    update { 1+oldCol2 as updatedCol, "1"+oldCol2 as updatedTypeCol:chararray };
}
{code}

Add, delete, or update could each be used alone without the extra curly braces, but the surrounding curly braces would be required when combining multiple clauses in a single FOREACH.
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16325889#comment-16325889 ]

Will Lauer commented on PIG-4608:
---------------------------------

OK, a review request has been posted for my initial pass at the code: https://reviews.apache.org/r/65159/
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16325888#comment-16325888 ]

Will Lauer commented on PIG-4608:
---------------------------------

I took a pass at allowing add, update, and delete in the same command, with a syntax like
{noformat}
b = foreach a add new1 as x, new2 as y, update old1+new1 as z, delete old2;
{noformat}
While logically it makes sense, the parser code starts to get a bit brittle, as it becomes hard to tell the difference between the add/update/delete keywords and column names in expressions or schemas.
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324986#comment-16324986 ]

Will Lauer commented on PIG-4608:
---------------------------------

I've gone ahead and made a patch to implement a version of this functionality as a starting point for discussion. Once I figure out how to upload it to reviewboard, everyone can take a look at it.

There are several requirements that we have here:
# Need to modify values of arbitrary fields
#* without having to specify every field
#* without the field order changing unexpectedly
#* without having to know the current index of the field
# Need to remove fields
#* without having to know the index of the field
#* without reordering the rest of the fields
# Need the ability to change the type of a field
# Need the ability to reference a field without specifying its disambiguating join prefix when the field is unambiguous
# Update must support the FOREACH nested block syntax

Additionally, I agree with Rohini that a "strict" mode is required to prevent typos from causing scripts to run with the unexpected behavior of adding a new column instead of modifying an existing one. While nice to have, being able to specify adds, deletes, and updates all in the same statement isn't a strict requirement, as that can be done simply with multiple successive FOREACH statements.

The syntax that I've made work is
{code}
a = load 'input' using mock.Storage() as (x:chararray, y:chararray, z:long);
b = foreach a generate x+y as q, y, z:long;
c = foreach a update "prefix"+x as x, (chararray)(z+1) as z:chararray;
d = foreach a delete x, z;
e = foreach a {
    nextInt = z+1;
    update nextInt as z:int
}
{code}

To me, the ... syntax seems weird, so I've gone with separate UPDATE and DELETE commands. For clarity, only a single command can exist per statement (no foreach update a, delete b). Similarly, there is no support for appending columns, as that is easily accomplished already with
{code}
b = foreach a generate *, a+5 as newCol:chararray;
{code}
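The requirements above (update and delete by name, no index knowledge, no reordering of the remaining fields) can be illustrated with a small Python model. This is only a sketch of the intended semantics, not the code in the patch: the names `foreach_update` and `foreach_delete` are hypothetical, an insertion-ordered dict stands in for a Pig tuple plus its schema, and rejecting unknown field names is an assumption based on the "strict" mode mentioned above.

```python
from typing import Any, Callable, Dict, Iterable, List

# An insertion-ordered dict stands in for a Pig tuple together with its schema.
Row = Dict[str, Any]


def foreach_update(rows: Iterable[Row],
                   updates: Dict[str, Callable[[Row], Any]]) -> List[Row]:
    """Replace the named fields; all other fields pass through in their original order."""
    out = []
    for row in rows:
        # Assumption: unknown names are an error (the "strict" behavior).
        unknown = [name for name in updates if name not in row]
        if unknown:
            raise KeyError(f"cannot update missing field(s): {unknown}")
        new_row = dict(row)  # copy keeps the field order
        for name, expr in updates.items():
            # Every expression is evaluated against the ORIGINAL row.
            new_row[name] = expr(row)
        out.append(new_row)
    return out


def foreach_delete(rows: Iterable[Row], names: Iterable[str]) -> List[Row]:
    """Drop the named fields without reordering the rest."""
    drop = set(names)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


# Mirrors relations a, c, and d from the comment, with one sample row.
a = [{"x": "a", "y": "b", "z": 1}]
c = foreach_update(a, {"x": lambda r: "prefix" + r["x"],
                       "z": lambda r: str(r["z"] + 1)})  # z changes type
d = foreach_delete(a, ["x", "z"])
```

Assigning to an existing key keeps its position in the dict, which is what models "without the field order changing unexpectedly".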
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596409#comment-14596409 ]

Rohini Palaniswamy commented on PIG-4608:
-----------------------------------------

Sounds good. Can we just add ... before the fields to be appended, to make the appending clear? i.e.

updated = FOREACH three_numbers GENERATE 3 as f3, 6 as f6, 9 as f9 ... f1+f2 as new_sum;
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596462#comment-14596462 ]

Kevin J. Price commented on PIG-4608:
-------------------------------------

Several of us actually discussed this at some length, and didn't think it was worth differentiating between modified columns and appended columns in the command. Two ideas we had:
# A token, like you have, indicating that the remaining fields are being added. We were considering using an 'ADD' keyword, as in:
{code}
updated = FOREACH three_numbers UPDATE 3 AS f3, 6 AS f6 ADD f1+f2 AS new_sum;
{code}
# Separate statements for 'strict' versus 'non-strict' mode. E.g., for updating without appending you would use
{code}
updated = FOREACH three_numbers UPDATE_STRICT 3 AS f3, 6 AS f6;
{code}
and for updating with appending, you could use
{code}
updated = FOREACH three_numbers UPDATE 3 AS f3, 6 AS f6, f1+f2 AS new_sum;
{code}

However, our overall view from writing pig scripts is that very few people would ever want to use the strict mode, nor did we see much value in having the extra token (ADD or ...) separating out appended columns. From a programming viewpoint, it just makes more logical sense to us to view it as an implicit "update or add" construct.
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596469#comment-14596469 ]

Rohini Palaniswamy commented on PIG-4608:
-----------------------------------------

The problem with the implicit add is that user typos could turn an update into an add. For example, if the user specified

updated = FOREACH three_numbers UPDATE 3 AS f3, 6 AS f7;

but actually meant 6 AS f6, then the script will run fine and will require more debugging to find out why the output is not as expected. So I would prefer having ... at the end to make any additions explicit. That way one can throw errors for updates of columns that do not exist.
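The difference between the implicit "update or add" behavior and a strict mode can be sketched in Python as follows. This is an illustrative model only: `apply_update` and its `strict` flag are hypothetical names, the six-field row is made-up sample data, and the error type is an arbitrary choice rather than anything Pig defines.

```python
from typing import Any, Dict


def apply_update(row: Dict[str, Any], assignments: Dict[str, Any],
                 strict: bool = False) -> Dict[str, Any]:
    """Model of UPDATE (strict=False) versus UPDATE_STRICT (strict=True)."""
    unknown = [name for name in assignments if name not in row]
    if strict and unknown:
        # Strict mode: a misspelled field name fails fast.
        raise ValueError(f"no such field(s) to update: {unknown}")
    result = dict(row)  # copy keeps the existing field order
    for name, value in assignments.items():
        # Existing names are overwritten in place; new names land at the end.
        result[name] = value
    return result


# Hypothetical row with fields f1..f6.
row = {"f1": 10, "f2": 20, "f3": 30, "f4": 40, "f5": 50, "f6": 60}

# The typo from the comment: f7 was written where f6 was meant.
loose = apply_update(row, {"f3": 3, "f7": 6})

try:
    apply_update(row, {"f3": 3, "f7": 6}, strict=True)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)
```

In the non-strict call, `f6` silently keeps its old value and a seventh field appears; in the strict call, the typo surfaces as an error before any output is produced.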
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594020#comment-14594020 ]

Rohini Palaniswamy commented on PIG-4608:
-----------------------------------------

Actually, why not just make it

updated = FOREACH three_numbers GENERATE ... 5 as f1 ... f1+f2 as new_sum;

That should gel well with the current project-range syntax - http://pig.apache.org/docs/r0.14.0/basic.html#prexp.
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594401#comment-14594401 ]

Jacob Tolar commented on PIG-4608:
----------------------------------

Hi Rohini, are you suggesting this?
{code}
updated = FOREACH three_numbers GENERATE ..., 5 as f1, ..., f1+f2 as new_sum;
{code}

Here's an exaggerated example of why we think something like foreach .. update would work better.

Original pig script:
{code}
-- assume we are using the schema load option ( http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html )
-- with fields named f1, f2, ..., f50
i = load '/path/to/data' USING PigStorage();
intermediate = foreach i generate
    f1, f2, 3 as f3,
    f4, f5, 6 as f6,
    -- ... you get the idea, we're updating every 3rd field for some reason
    48 as f48, f49, f50;
store intermediate into '/path/to/output' USING PigStorage(',');
{code}

Here it is with the project-range notation that already exists in pig. In this particularly nasty case we are still mentioning every single field, even though we're using project-range:
{code}
i = load '/path/to/data' USING PigStorage();
intermediate = foreach i generate
    f1..f2, 3 as f3,
    f4..f5, 6 as f6,
    -- etc
    48 as f48, f49..f50;
store intermediate into '/path/to/output' USING PigStorage(',');
{code}

I think this is what you're suggesting. It's a little better than the project-range version but still not great (lots of extra dots):
{code}
i = load '/path/to/data' USING PigStorage();
intermediate = foreach i generate
    ..., 3 as f3,
    ..., 6 as f6,
    ..., 9 as f9,
    -- etc
    48 as f48, ...;
store intermediate into '/path/to/output' USING PigStorage(',');
{code}

With foreach ... update, we only need to list the fields that are changing.
{code}
i = load '/path/to/data' USING PigStorage();
intermediate = foreach i update
    3 as f3,
    6 as f6,
    9 as f9,
    -- etc
    48 as f48;
store intermediate into '/path/to/output' USING PigStorage(',');
{code}

The last one is much clearer (if 'foreach update' has clearly defined semantics) and is also the shortest because it has the least syntactic overhead: you only need to type exactly what you want, nothing more. That makes it easier to write, easier to read later, and (we believe...but we can't use it yet :)) less prone to error.
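One way to picture the relationship between the last script and the first is as a mechanical expansion of an update list into a full GENERATE list. The Python sketch below does that as plain string manipulation; it is not how the prototype or any Pig parser works, `expand_update_to_generate` is a hypothetical helper, and appending unmatched fields at the end follows the behavior stated in the issue description.

```python
from typing import Dict, List


def expand_update_to_generate(schema: List[str], updates: Dict[str, str]) -> str:
    """Build the GENERATE expression list equivalent to an UPDATE over `schema`.

    `updates` maps a field name to the Pig expression (as text) that replaces it.
    """
    items = []
    for field in schema:
        if field in updates:
            items.append(f"{updates[field]} as {field}")  # updated in place
        else:
            items.append(field)  # passed through unchanged
    # Names that match no existing field are appended at the end.
    for field, expr in updates.items():
        if field not in schema:
            items.append(f"{expr} as {field}")
    return ", ".join(items)


# The example from the issue description.
small = expand_update_to_generate(["f1", "f2", "f3"],
                                  {"f1": "5", "new_sum": "f1+f2"})

# The fifty-field example above: every third field up to f48 is replaced.
schema = [f"f{i}" for i in range(1, 51)]
updates = {f"f{i}": str(i) for i in range(3, 49, 3)}
wide = expand_update_to_generate(schema, updates)
```

Here `small` is `5 as f1, f2, f3, f1+f2 as new_sum`, and `wide` spells out all fifty entries that the update form lets the script author skip.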