[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324986#comment-16324986
 ] 

Will Lauer commented on PIG-4608:
---------------------------------

I've gone ahead and made a patch to implement a version of this functionality 
as a starting point for discussion. Once I figure out how to upload it to 
reviewboard, everyone can take a look at it.

There are several requirements that we have here:
# Need to modify values of arbitrary fields 
#* without having to specify every field
#* without the field order changing unexpectedly
#* without having to know the current index to the field
# Need to remove fields
#* without having to know the index of the field
#* without reordering the rest of the fields
# Need ability to change the type of a field
# Need ability to reference a field without specifying its disambiguating join 
prefix when the field is unambiguous
# Update must support the FOREACH nested block syntax 

Additionally, I agree with Rohini that a "strict" mode is required to prevent 
typos from causing scripts to run with the unexpected behavior of adding a 
new column instead of modifying an existing one.
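
The requirements above can be illustrated with a rough model of the intended semantics. This is a sketch in Python, not code from the patch: the function names `update` and `delete`, the `strict` flag, and the choice to reject unknown aliases in `delete` are all assumptions made for illustration. It shows that updated fields keep their position, that deleted fields do not reorder the rest, and that strict mode turns a mistyped alias into an error instead of a silently appended column.

```python
def update(schema, rows, exprs, strict=True):
    """Hypothetical model of FOREACH ... UPDATE.

    schema: ordered list of field aliases
    rows:   list of tuples, one value per alias
    exprs:  dict mapping alias -> function of the original row (as a dict)
    """
    unknown = [name for name in exprs if name not in schema]
    if strict and unknown:
        # Assumed strict-mode behavior: a typo is an error, not a new column.
        raise KeyError("unknown field(s) in UPDATE: %s" % unknown)
    # Without strict mode, unknown aliases are appended to the end of the tuple.
    new_schema = list(schema) + unknown
    out = []
    for row in rows:
        original = dict(zip(schema, row))  # expressions see pre-update values
        out.append(tuple(
            exprs[name](original) if name in exprs else original[name]
            for name in new_schema
        ))
    return new_schema, out


def delete(schema, rows, names):
    """Hypothetical model of FOREACH ... DELETE: drop fields by alias."""
    missing = [name for name in names if name not in schema]
    if missing:
        # Assumption: deleting an unknown alias is treated as an error.
        raise KeyError("unknown field(s) in DELETE: %s" % missing)
    keep = [i for i, name in enumerate(schema) if name not in names]
    return [schema[i] for i in keep], [tuple(r[i] for i in keep) for r in rows]


schema = ["x", "y", "z"]
rows = [("a", "b", 1), ("c", "d", 2)]

# Change the value of x and the type of z without listing y or any index.
s1, r1 = update(schema, rows, {
    "x": lambda r: "prefix" + r["x"],
    "z": lambda r: str(r["z"] + 1),
})
# Remove x and z by alias; y stays where it was.
s2, r2 = delete(schema, rows, ["x", "z"])
```

Here `s1` is still `["x", "y", "z"]`, so the field order is unchanged even though only two of the three fields were named.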

While nice to have, being able to specify adds, deletes, and updates all in the 
same statement isn't a strict requirement, as that can be done simply with 
multiple successive FOREACH statements.

The syntax that I've made work is
{code}
a = load 'input' using mock.Storage() as (x:chararray, y:chararray, z:long);
b = foreach a generate x+y as q, y, z:long;
c = foreach a update 'prefix'+x as x, (chararray)(z+1) as z:chararray;
d = foreach a delete x, z;
e = foreach a {
    nextInt = z+1;
    update nextInt as z:int;
};
{code}

To me, the ... syntax seems weird, so I've gone with separate UPDATE and DELETE 
commands. For clarity, only a single command can exist per statement (no 
foreach update a, delete b). Similarly, there is no support for appending 
columns, as that is easily accomplished already with 
{code}
b = foreach a generate *, (chararray)(z+5) as newCol:chararray;
{code}

> FOREACH ... UPDATE
> ------------------
>
>                 Key: PIG-4608
>                 URL: https://issues.apache.org/jira/browse/PIG-4608
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Haley Thrapp
>
> I would like to propose a new command in Pig, FOREACH...UPDATE.
> Syntactically, it would look much like FOREACH … GENERATE.
> Example:
> Input data:
> (1,2,3)
> (2,3,4)
> (3,4,5)
> -- Load the data
> three_numbers = LOAD 'input_data'
> USING PigStorage()
> AS (f1:int, f2:int, f3:int);
> -- Sum up the row
> updated = FOREACH three_numbers UPDATE
> 5 as f1,
> f1+f2 as new_sum
> ;
> Dump updated;
> (5,2,3,3)
> (5,3,4,5)
> (5,4,5,7)
> Fields to update must be specified by alias. Any fields in the UPDATE that do 
> not match an existing field will be appended to the end of the tuple.
> This command is particularly desirable in scripts that deal with a large 
> number of fields (in the 20-200 range). Often, we only need to make 
> modifications to a few fields. The FOREACH ... UPDATE statement allows the 
> developer to focus on the actual logical changes instead of having to list 
> all of the fields that are also being passed through.
> My team has prototyped this with changes to FOREACH ... GENERATE. We believe 
> this can be done with changes to the parser and the creation of a new 
> LOUpdate. No physical plan changes should be needed because we will leverage 
> what LOGenerate does.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
