[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324986#comment-16324986 ]
Will Lauer commented on PIG-4608:
---------------------------------

I've gone ahead and made a patch to implement a version of this functionality as a starting point for discussion. Once I figure out how to upload it to reviewboard, everyone can take a look at it.

There are several requirements that we have here:
# Need to modify values of arbitrary fields
#* without having to specify every field
#* without the field order changing unexpectedly
#* without having to know the current index of the field
# Need to remove fields
#* without having to know the index of the field
#* without reordering the rest of the fields
# Need ability to change the type of a field
# Ability to reference a field without specifying its disambiguating join prefix when the field is unambiguous
# Update must support the FOREACH nested block syntax

Additionally, I agree with Rohini that a "strict" mode is required to prevent typos from causing scripts to run with unexpected behavior (adding a new column instead of modifying an existing one). While nice to have, being able to specify adds, deletes, and updates all in the same statement isn't a strict requirement, as that can be done simply with multiple successive FOREACH statements.

The syntax that I've made work is
{code}
a = load 'input' using mock.Storage() as (x:chararray, y:chararray, z:long);
b = foreach a generate x+y as q, y, z:long;
c = foreach a update 'prefix'+x as x, (chararray)(z+1) as z:chararray;
d = foreach a delete x, z;
e = foreach a {
    nextInt = z+1;
    update nextInt as z:int
}
{code}

To me, the ... syntax seems weird, so I've gone with separate UPDATE and DELETE commands. For clarity, only a single command can exist per statement (no foreach update a, delete b). Similarly, there is no support for appending columns, as that is easily accomplished already with
{code}
b = foreach a generate *, a+5 as newCol:chararray;
{code}

> FOREACH ... UPDATE
> ------------------
>
>                 Key: PIG-4608
>                 URL: https://issues.apache.org/jira/browse/PIG-4608
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Haley Thrapp
>
> I would like to propose a new command in Pig, FOREACH...UPDATE.
> Syntactically, it would look much like FOREACH … GENERATE.
> Example:
> Input data:
> (1,2,3)
> (2,3,4)
> (3,4,5)
> -- Load the data
> three_numbers = LOAD 'input_data'
> USING PigStorage()
> AS (f1:int, f2:int, f3:int);
> -- Sum up the row
> updated = FOREACH three_numbers UPDATE
> 5 as f1,
> f1+f2 as new_sum
> ;
> Dump updated;
> (5,2,3,3)
> (5,3,4,5)
> (5,4,5,7)
> Fields to update must be specified by alias. Any fields in the UPDATE that do
> not match an existing field will be appended to the end of the tuple.
> This command is particularly desirable in scripts that deal with a large
> number of fields (in the 20-200 range). Often, we need to only make
> modifications to a few fields. The FOREACH ... UPDATE statement allows the
> developer to focus on the actual logical changes instead of having to list
> all of the fields that are also being passed through.
> My team has prototyped this with changes to FOREACH ... GENERATE. We believe
> this can be done with changes to the parser and the creation of a new
> LOUpdate. No physical plan changes should be needed because we will leverage
> what LOGenerate does.
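
For readers who want to experiment with the semantics discussed in this thread, here is a minimal Python model of them. It is not Pig code and not the patch itself. The function names `foreach_update` and `foreach_delete` and the `strict` flag are hypothetical, chosen only for illustration. The sketch assumes that every UPDATE expression is evaluated against the original row, which is what the example output in the issue description implies (`new_sum` is computed from the old `f1`). It also assumes that a non-strict UPDATE appends unknown aliases to the end of the tuple, as the original proposal describes.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple[object, ...]
Expr = Callable[[Dict[str, object]], object]


def foreach_update(schema: Sequence[str], rows: Sequence[Row],
                   updates: Dict[str, Expr], strict: bool = True):
    """Model of FOREACH ... UPDATE: replace fields by alias, keep field order."""
    unknown = [name for name in updates if name not in schema]
    if strict and unknown:
        # Strict mode: a typo in an alias is an error, not a new column.
        raise KeyError(f"unknown field(s) in UPDATE: {unknown}")
    # Non-strict mode (assumption): unknown aliases are appended at the end.
    new_schema = list(schema) + unknown
    out: List[Row] = []
    for row in rows:
        current = dict(zip(schema, row))
        # All expressions see the original row, not partially updated values.
        computed = {name: expr(current) for name, expr in updates.items()}
        out.append(tuple(computed[n] if n in computed else current[n]
                         for n in new_schema))
    return new_schema, out


def foreach_delete(schema: Sequence[str], rows: Sequence[Row],
                   names: Sequence[str]):
    """Model of FOREACH ... DELETE: drop fields by alias, keep the rest in order."""
    missing = [n for n in names if n not in schema]
    if missing:
        raise KeyError(f"unknown field(s) in DELETE: {missing}")
    keep = [i for i, n in enumerate(schema) if n not in names]
    return [schema[i] for i in keep], [tuple(r[i] for i in keep) for r in rows]


if __name__ == "__main__":
    schema = ["f1", "f2", "f3"]
    rows = [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
    updates = {"f1": lambda r: 5, "new_sum": lambda r: r["f1"] + r["f2"]}
    print(foreach_update(schema, rows, updates, strict=False))
    print(foreach_delete(schema, rows, ["f1", "f3"]))
```

With `strict=False` the model reproduces the rows shown in the issue description. With `strict=True` the same call raises an error because `new_sum` is not an existing field, which is the typo protection requested in the comment.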