[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331309#comment-16331309 ]
Will Lauer commented on PIG-4608:
---------------------------------
Ok, just to close the loop, here are several examples given the new proposed
syntax. I want to make sure I understand which are correct and what the
behavior is in each case.
```
/* simple projection, specifying resulting schema, using both explicit column
names and positions */
a = FOREACH b GENERATE 1+s as x:long, $2+$3 as y:chararray, q-1 as z;
a = FOREACH b GENERATE FLATTEN(s) as (x:int, y:long, z:chararray); -- flattening tuples into individual columns
a = FOREACH b GENERATE FLATTEN(s) as x:int, 1 as y; -- flattening bags into multiple rows
/* complex projection, specifying resulting schema, using both explicit column
names and positions */
a = FOREACH b {
q = COUNT(s);
r = someUdf($1,$2);
GENERATE q as x:long, r as y;
}
/* simple update */
a = FOREACH b UPDATE q with r+s;
/* complex update */
a = FOREACH b {
q = COUNT(s);
r = someUdf($1, $2);
UPDATE qprime WITH q, rprime WITH r;
}
/* simple update using positional arguments */
a = FOREACH b UPDATE $1 with r+$2;
/* simple renaming of a column */
a = FOREACH b UPDATE q as r;
/* simple schema type change */
a = FOREACH b UPDATE q WITH (int)q AS q:int; -- change q from something to int
a = FOREACH b UPDATE q AS q:int; -- This should be illegal, right? If the type is changed, an explicit modification of the value should occur
/* rename, type, and value change together */
a = FOREACH b UPDATE q WITH computeR(q) as r:long;
/* simple column drop */
a = FOREACH b DROP q,r,$5; -- drops columns q, r, and whatever is at position $5 (the sixth column, since positions are zero-based)
a = FOREACH b DROP q:int; -- This should be illegal, right? No types should be present in a DROP statement
/* updating an individual field within a tuple - not implemented in the initial
version */
a = FOREACH b UPDATE q.$1.fieldN WITH r+s;
/* renaming an individual field within a tuple - not implemented in the initial
version */
a = FOREACH b UPDATE q.$1.fieldN AS newFieldN; -- has the result of renaming the field within q.$1, not renaming q or $1
/* flattening a tuple into existing fields - does this make sense?*/
a = FOREACH b UPDATE (q,r,s) WITH FLATTEN($5);
a = FOREACH b UPDATE (q,r,s) WITH FLATTEN($5) AS (q, r, t); -- renaming one column during flattening assignment
a = FOREACH b UPDATE (q,r,s) WITH FLATTEN($5) AS (q:int, r:chararray, s:long); -- re-typing arguments as part of flattening
/* flattening a bag into existing fields, exploding rows in the process -- does
this make sense? */
a = FOREACH b UPDATE f1 WITH FLATTEN(bagCol);
a = FOREACH b UPDATE f1 WITH FLATTEN(bagCol) as f2:int; -- rename field and possibly retype as part of the flatten
```
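For comparison, here is a rough sketch of how a few of the simpler cases above have to be written today with plain FOREACH ... GENERATE. It assumes a hypothetical relation b with exactly four columns (p, q, r, s); the input path, column names, and types are made up for illustration and are not part of the proposal.

```
-- hypothetical input: path and schema are assumptions for illustration only
b = LOAD 'input_data' AS (p:int, q:chararray, r:int, s:int);

-- UPDATE q WITH r+s: recompute q, pass every other column through by hand
a1 = FOREACH b GENERATE p, r+s AS q, r, s;

-- UPDATE q WITH (int)q AS q:int: change value and declared type together
a2 = FOREACH b GENERATE p, (int)q AS q:int, r, s;

-- UPDATE q AS t: rename only
a3 = FOREACH b GENERATE p, q AS t, r, s;

-- DROP q, r: list only the surviving columns
a4 = FOREACH b GENERATE p, s;
```

With four columns the pass-through list is short; with the 20-200 column relations mentioned in the issue description, it becomes the bulk of every statement.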
While I admit the WITH/AS syntax is useful, it still feels a bit weird to me as
a pig script writer. I'd love to have [~kpriceyahoo] weigh in on the proposal
to ensure it still makes sense to heavy pig script writers.
> FOREACH ... UPDATE
> ------------------
>
> Key: PIG-4608
> URL: https://issues.apache.org/jira/browse/PIG-4608
> Project: Pig
> Issue Type: New Feature
> Reporter: Haley Thrapp
> Priority: Major
>
> I would like to propose a new command in Pig, FOREACH ... UPDATE.
> Syntactically, it would look much like FOREACH ... GENERATE.
> Example:
> Input data:
> (1,2,3)
> (2,3,4)
> (3,4,5)
> -- Load the data
> three_numbers = LOAD 'input_data'
> USING PigStorage()
> AS (f1:int, f2:int, f3:int);
> -- Sum up the row
> updated = FOREACH three_numbers UPDATE
> 5 as f1,
> f1+f2 as new_sum
> ;
> Dump updated;
> (5,2,3,3)
> (5,3,4,5)
> (5,4,5,7)
> Fields to update must be specified by alias. Any fields in the UPDATE that do
> not match an existing field will be appended to the end of the tuple.
> This command is particularly desirable in scripts that deal with a large
> number of fields (in the 20-200 range). Often, we need to modify only a few
> of them. The FOREACH ... UPDATE statement allows the developer to focus on
> the actual logical changes instead of having to list all of the fields that
> are merely being passed through.
> My team has prototyped this with changes to FOREACH ... GENERATE. We believe
> this can be done with changes to the parser and the creation of a new
> LOUpdate. No physical plan changes should be needed because we will leverage
> what LOGenerate does.
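
As a point of comparison, the example from the description can be sketched with the existing FOREACH ... GENERATE as follows. This assumes that expressions in UPDATE read the original input values (so f1+f2 uses the old f1), which is what the sample output in the description implies; the relation and field names are reused from the description.

```
three_numbers = LOAD 'input_data'
    USING PigStorage()
    AS (f1:int, f2:int, f3:int);

-- f1 is replaced, f2 and f3 are passed through by hand, new_sum is appended
updated = FOREACH three_numbers GENERATE
    5 AS f1,
    f2,
    f3,
    f1+f2 AS new_sum
;
DUMP updated;
```

Under that assumption, this should produce the same rows as the proposed UPDATE form. The only difference is that f2 and f3 have to be listed explicitly.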