[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2018-02-15 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366248#comment-16366248
 ] 

Koji Noguchi commented on PIG-4608:
---

bq. To me, UPDATE $1 with r+$2 means update the first field, regardless of 
name, with r+second field.
You probably meant updating the second field with r + the third field.  (Pig 
positions start at 0.)

In any case, I get your point.  
[~daijy], [~rohini], [~kpriceyahoo], any preferences? 


bq. UPDATE $1 means n=$1 and updating the _n_th field accordingly.
My interpretation of $1 should probably be disallowed anyway, since it takes 
away an optimization opportunity: the fields being updated or dropped would 
not be known at compile time.


> FOREACH ... UPDATE
> --
>
> Key: PIG-4608
> URL: https://issues.apache.org/jira/browse/PIG-4608
> Project: Pig
>  Issue Type: New Feature
>Reporter: Haley Thrapp
>Priority: Major
>
> I would like to propose a new command in Pig, FOREACH...UPDATE.
> Syntactically, it would look much like FOREACH … GENERATE.
> Example:
> Input data:
> (1,2,3)
> (2,3,4)
> (3,4,5)
> -- Load the data
> three_numbers = LOAD 'input_data'
> USING PigStorage()
> AS (f1:int, f2:int, f3:int);
> -- Sum up the row
> updated = FOREACH three_numbers UPDATE
> 5 as f1,
> f1+f2 as new_sum
> ;
> Dump updated;
> (5,2,3,3)
> (5,3,4,5)
> (5,4,5,7)
> Fields to update must be specified by alias. Any fields in the UPDATE that do 
> not match an existing field will be appended to the end of the tuple.
> This command is particularly desirable in scripts that deal with a large 
> number of fields (in the 20-200 range). Often, we need to only make 
> modifications to a few fields. The FOREACH ... UPDATE statement, allows the 
> developer to focus on the actual logical changes instead of having to list 
> all of the fields that are also being passed through.
> My team has prototyped this with changes to FOREACH ... GENERATE. We believe 
> this can be done with changes to the parser and the creation of a new 
> LOUpdate. No physical plan changes should be needed because we will leverage 
> what LOGenerate does.
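
A minimal sketch of the semantics described in the proposal above, written in Python as a toy model rather than as Pig code. The function name `foreach_update` and the row representation are hypothetical. It assumes every expression is evaluated against the input row, which is what the expected output in the description implies (new_sum is 1+2=3, not 5+2).

```python
# Toy model of the proposed FOREACH ... UPDATE semantics (not Pig code).
# Assumption: expressions read the input row only, never earlier updates.

def foreach_update(schema, rows, updates):
    """schema: list of field names; updates: list of (alias, fn(row_dict))."""
    out_schema = list(schema)
    for alias, _ in updates:
        if alias not in out_schema:
            out_schema.append(alias)        # unknown aliases go to the end
    out_rows = []
    for row in rows:
        original = dict(zip(schema, row))   # values before any update
        updated = dict(original)
        for alias, expr in updates:
            updated[alias] = expr(original)
        out_rows.append(tuple(updated[name] for name in out_schema))
    return out_schema, out_rows


schema = ["f1", "f2", "f3"]
rows = [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
updates = [
    ("f1", lambda r: 5),                       # 5 as f1
    ("new_sum", lambda r: r["f1"] + r["f2"]),  # f1+f2 as new_sum
]
out_schema, out_rows = foreach_update(schema, rows, updates)
print(out_schema)
print(out_rows)
```

Under these assumptions the model reproduces the three output tuples listed in the description, with new_sum appended as a fourth field.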



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2018-02-15 Thread Will Lauer (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366223#comment-16366223
 ] 

Will Lauer commented on PIG-4608:
-

To me, {{UPDATE $1 with r+$2}} means update the first field, regardless of 
name, with r+second field. I think {{UPDATE 1 with r+$2}} means that the user 
is trying to update a field named "1". The fact that this is an illegal field 
name (not an identifier) should generate an error.
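
One way to picture this rule is the following Python sketch (a model, not Pig's parser). The helper `resolve_target` is hypothetical, the identifier pattern is a simplification of Pig's rules, and rejecting unknown names is an assumption. Positions are treated as zero-based, so `$1` resolves to index 1.

```python
import re

def resolve_target(target, schema):
    """Return the zero-based index of the field an UPDATE target refers to."""
    positional = re.fullmatch(r"\$(\d+)", target)
    if positional:
        index = int(positional.group(1))
        if index >= len(schema):
            raise ValueError(f"position {target} is out of range")
        return index
    # Simplified identifier rule: a letter, then letters, digits or underscores.
    if re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", target):
        if target not in schema:
            raise ValueError(f"unknown field {target!r}")
        return schema.index(target)
    raise ValueError(f"{target!r} is neither a position nor an identifier")


schema = ["q", "r", "s"]
print(resolve_target("$1", schema))   # positional reference
print(resolve_target("r", schema))    # reference by name
try:
    resolve_target("1", schema)       # bare number: not a legal field name
except ValueError as err:
    print("rejected:", err)
```

Because both forms resolve to a fixed index before any row is read, the set of updated fields is known up front.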



[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2018-02-15 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366197#comment-16366197
 ] 

Koji Noguchi commented on PIG-4608:
---

{quote}The idea is to make ".." syntax more flexible,
{quote}
I think one of the goals here is to let users manipulate records without using 
".." at all. 
For the initial version, let's just focus on the basics. We can add more 
later, though changing things afterwards is always tough.

I don't want this jira to go stale after having such a great contribution from 
Will.
I feel having UPDATE and DROP with simple column (field) updates is a good 
start.

The only thing I'm not clear on is:
{code:java}
/* simple update using positional arguments */
a = FOREACH b UPDATE $1 with r+$2;
{code}
Should this be {{UPDATE 1 with r+$2}}? 
To me, {{UPDATE $1}} means setting {{n=$1}} and updating the _n_-th field 
accordingly.



[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2018-01-25 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340555#comment-16340555
 ] 

Koji Noguchi commented on PIG-4608:
---

{quote}I didn't see UPDATE/DROP in a single statement in the example, are we 
not going to support both in the same statement? I actually prefer those in the 
same statement, as I feel users usually think about adjusting all columns in 
the same time.
{quote}
This could be because I asked in one of my previous comments: "For now, can we 
just require separate statements for update and delete?" 
I just wanted to keep it simple and leave the combining part for later, when 
we have more use cases.

Also, I'm worried about confusion over overlapping indexes/fields.
Say, {{A:(f0:int, f1:int, f2:int, f3:int)}}
{code:java}
B = FOREACH A drop f1 , update 2 with $1 ;
{code}
Is the code updating {{f2}} with the value of {{f1}}?
Or updating {{f3}} with the value of {{f2}}? Or something else?
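
The two readings can be made concrete with a small Python model (not Pig code). The function names are hypothetical and the sample values are made up for illustration. The row stands for (f0, f1, f2, f3).

```python
def positions_from_input(row):
    # Reading 1: "2" and "$1" refer to the input schema; the drop applies last.
    out = list(row)
    out[2] = row[1]          # f2 := f1
    del out[1]               # drop f1
    return tuple(out)


def positions_after_drop(row):
    # Reading 2: drop f1 first, then resolve positions on what is left.
    remaining = [v for i, v in enumerate(row) if i != 1]   # (f0, f2, f3)
    out = list(remaining)
    out[2] = remaining[1]    # f3 := f2
    return tuple(out)


row = (10, 11, 12, 13)       # (f0, f1, f2, f3)
print(positions_from_input(row))   # (10, 11, 13)
print(positions_after_drop(row))   # (10, 12, 12)
```

The same statement yields different tuples depending on when positions are resolved, so a combined form would have to fix one rule.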



[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2018-01-23 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336593#comment-16336593
 ] 

Daniel Dai commented on PIG-4608:
-

bq. a = FOREACH b UPDATE q AS q:int – This should be illegal, right? If the 
type is changed, an explicit modify of the value should occur
This should be valid; the AS clause can change types. The UPDATE clause is 
evaluated before the AS clause, so
a = FOREACH b UPDATE q WITH (int)q AS q:chararray;
will result in a chararray q.
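
The evaluation order can be modeled in Python roughly as follows (a sketch, not Pig code). The helper `update_with_as` is hypothetical, rows are dicts for readability, and Python's `int` and `str` stand in for Pig's int and chararray casts.

```python
def update_with_as(row, field, with_expr, as_name, as_type):
    value = with_expr(row)                  # step 1: WITH computes the value
    out = {}
    for name, old in row.items():
        if name == field:
            out[as_name] = as_type(value)   # step 2: AS renames and retypes
        else:
            out[name] = old                 # other fields pass through
    return out


row = {"p": 1, "q": 42.7}
# UPDATE q WITH (int)q AS q:chararray
result = update_with_as(row, "q", lambda r: int(r["q"]), "q", str)
print(result)   # {'p': 1, 'q': '42'}
```

The cast in the WITH expression runs first, and the AS clause then decides the final type, so q ends up as a string.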

bq. flattening a tuple into existing fields - does this make sense
This makes sense; it is symmetric with the AS clause.

I didn't see UPDATE/DROP in a single statement in the example. Are we not going 
to support both in the same statement? I actually prefer having them in the 
same statement, as I feel users usually think about adjusting all columns at 
the same time. How about APPEND? Actually, when I think about DROP/APPEND, I 
feel we have to have INSERT as well to close the loop. But if we add INSERT, 
another syntax might be more appropriate, such as:
a = FOREACH b generate .., UPDATE a10 WITH 1 as new_a10, ..a20, 2 as 
a_20_plus_half, ..a30, a32.., UPDATE a40 WITH 2 as new_a40, 1 as a41;
Here:
Update: a10, a40 using UPDATE clause
Insert: a_20_plus_half
Drop: a31
Append: a41

In the original use case, it can be written as:
intermediate = foreach i generate .., 3 as f3, .., 6 as f6, .. 48 as f48, ..;

The idea is to make the ".." syntax more flexible, skipping the prefix/suffix 
when it can be inferred. It is probably more natural to add support for INSERT 
this way, which makes the syntax complete. How does that sound?



[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2018-01-18 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331344#comment-16331344
 ] 

Rohini Palaniswamy commented on PIG-4608:
-

bq. a = FOREACH b UPDATE q AS q:int – This should be illegal, right? If the 
type is changed, an explicit modify of the value should occur
That should be supported after PIG-2315 (not pulled into our internal Y 
releases). [~knoguchi] can confirm. 

bq. This should be illegal, right? No types should be present in a DROP 
statement
yes.

bq. flattening a tuple into existing fields - does this make sense
Not sure if there is a use case, but I don't see a problem with adding support 
for it. What happens if $5 has more than 3 fields? I am assuming it will be 
something like
a = FOREACH b UPDATE q with $5.f1 , r WITH $5.f2 , s with $5.f3 as t; 

bq. flattening a bag into existing fields, exploding rows in the process
You will have to add support for maps as well. 



[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2018-01-18 Thread Will Lauer (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331309#comment-16331309
 ] 

Will Lauer commented on PIG-4608:
-

Ok, just to close the loop, here are several examples given the new proposed 
syntax. I want to make sure I understand which are correct and what the 
behavior is in each case.

```
/* simple projection, specifying resulting schema, using both explicit column names and positions */
a = FOREACH b GENERATE 1+s as x:long, $2+$3 as y:chararray, q-1 as z;
a = FOREACH b GENERATE FLATTEN(s) as (x:int, y:long, z:chararray); -- flattening tuples into individual columns
a = FOREACH b GENERATE FLATTEN(s) as x:int, 1 as y; -- flattening bags into multiple rows

/* complex projection, specifying resulting schema, using both explicit column names and positions */
a = FOREACH b {
    q = COUNT(s);
    r = someUdf($1,$2);
    GENERATE q as x:long, r as y;
}

/* simple update */
a = FOREACH b UPDATE q with r+s;

/* complex update */
a = FOREACH b {
    q = COUNT(s);
    r = someUdf($1, $2);
    UPDATE qprime WITH q, rprime WITH r;
}

/* simple update using positional arguments */
a = FOREACH b UPDATE $1 with r+$2;

/* simple renaming of a column */
a = FOREACH b UPDATE q as r;

/* simple schema type change */
a = FOREACH b UPDATE q WITH (int)q AS q:int; -- change q from something to int
a = FOREACH b UPDATE q AS q:int; -- This should be illegal, right? If the type is changed, an explicit modify of the value should occur

/* rename, type, and value change together */
a = FOREACH b UPDATE q WITH computeR(q) as r:long;

/* simple column drop */
a = FOREACH b DROP q,r,$5; -- drops columns q, r, and whatever column is at position $5
a = FOREACH b DROP q:int; -- This should be illegal, right? No types should be present in a DROP statement

/* updating an individual field within a tuple - not implemented in the initial version */
a = FOREACH b UPDATE q.$1.fieldN WITH r+s;

/* renaming an individual field within a tuple - not implemented in the initial version */
a = FOREACH b UPDATE q.$1.fieldN AS newFieldN; -- has the result of renaming the field within q.$1, not renaming q or $1

/* flattening a tuple into existing fields - does this make sense? */
a = FOREACH b UPDATE (q,r,s) WITH FLATTEN($5);
a = FOREACH b UPDATE (q,r,s) WITH FLATTEN($5) AS (q, r, t); -- renaming one column during flattening assignment
a = FOREACH b UPDATE (q,r,s) WITH FLATTEN($5) AS (q:int, r:chararray, s:long); -- re-typing arguments as part of flattening

/* flattening a bag into existing fields, exploding rows in the process -- does this make sense? */
a = FOREACH b UPDATE f1 WITH FLATTEN(bagCol);
a = FOREACH b UPDATE f1 WITH FLATTEN(bagCol) as f2:int; -- rename field and possibly retype as part of the flatten
```
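
As a rough illustration of the simple column drop case above, here is a Python model (not Pig code). The helper `foreach_drop` and the seven-field schema are hypothetical, and it assumes that names and positions are both resolved against the input schema, with positions zero-based.

```python
def foreach_drop(schema, rows, targets):
    """Drop fields given by name or by $position, resolved on the input schema."""
    drop = set()
    for target in targets:
        if target.startswith("$"):
            drop.add(int(target[1:]))         # positional reference, zero-based
        else:
            drop.add(schema.index(target))    # reference by name
    keep = [i for i in range(len(schema)) if i not in drop]
    out_schema = [schema[i] for i in keep]
    out_rows = [tuple(row[i] for i in keep) for row in rows]
    return out_schema, out_rows


schema = ["p", "q", "r", "s", "t", "u", "v"]
rows = [(0, 1, 2, 3, 4, 5, 6)]
# DROP q, r, $5
out_schema, out_rows = foreach_drop(schema, rows, ["q", "r", "$5"])
print(out_schema)   # ['p', 's', 't', 'v']
print(out_rows)     # [(0, 3, 4, 6)]
```

With zero-based positions, `$5` removes the sixth field (`u` here), not the fifth.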

While I admit the WITH/AS syntax is useful, it still feels a bit weird to me as 
a pig script writer. I'd love to have [~kpriceyahoo] weigh in on the proposal 
to ensure it still makes sense to heavy pig script writers.



[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2018-01-17 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329539#comment-16329539
 ] 

Daniel Dai commented on PIG-4608:
-

The "add" (or append?)/"update xxx as"/"drop" syntax sounds good to me. We also 
want to make sure it works with positional references ($0, $1, etc). You might 
take a look at PIG-3122 for keyword conflicts, if applicable.



[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2018-01-17 Thread Will Lauer (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329258#comment-16329258
 ] 

Will Lauer commented on PIG-4608:
-

{quote}This is Yahoo-centric, but would it be possible to grep our logs for 
existing pig jobs and see how many of them have keyword conflicts with 
'update', 'delete', 'drop', etc? I'm indifferent on 'delete' versus 'drop', but 
it'd be interesting to know which one would impact fewer existing 
scripts.{quote}
Keyword conflicts shouldn't really be an issue given the syntax we are talking 
about. Given the way the syntax works, it will always be obvious to the parser 
whether "drop"/"delete" is referring to a column or is a keyword.



[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2018-01-17 Thread Will Lauer (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329135#comment-16329135
 ] 

Will Lauer commented on PIG-4608:
-

I like drop. If everyone else agrees, I'll change my patch to use that instead 
of delete.

On a similar note, if we ever want to include adding columns in addition to 
update and delete, I'd suggest "append" since it correctly implies that the 
columns are added at the end of the schema.



[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2018-01-16 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327955#comment-16327955
 ] 

Rohini Palaniswamy commented on PIG-4608:
-

bq. As for the 'update val AS col' versus 'update col BY val', I think the 
former looks less confusing to a current pig user.
Agree and prefer usage of AS. Also BY does not make sense grammatically. It 
should either be AS (update val AS col) or WITH (update col WITH val). 

bq. As for delete, I somehow prefer using drop.
I prefer drop as well, as that is similar to SQL's ALTER TABLE ... DROP COLUMN. 
Delete is generally used for rows.







[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2018-01-16 Thread Kevin J. Price (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327736#comment-16327736
 ] 

Kevin J. Price commented on PIG-4608:
-

This is Yahoo-centric, but would it be possible to grep our logs for existing 
pig jobs and see how many of them have keyword conflicts with 'update', 
'delete', 'drop', etc? I'm indifferent on 'delete' versus 'drop', but it'd be 
interesting to know which one would impact fewer existing scripts.

As for the 'update val AS col' versus 'update col BY val', I think the former 
looks less confusing to a current pig user. 'BY' only gets used currently for 
key ordering in groups and joins, whereas 'AS' is already used for value 
assignment. I agree that there's a difference between 'GENERATE val AS col' and 
'UPDATE val AS col', but it's a fairly philosophical difference from the user's 
perspective. In both cases, they want col to have the value val after the 
statement, so having the same syntax makes sense.



[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2018-01-16 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327686#comment-16327686
 ] 

Koji Noguchi commented on PIG-4608:
---

For now, can we just require separate statements for update and delete?
Also, I'm not too excited about using {{AS}} to specify which field to 
update.
So far, "as" has only been used for naming fields. 
I wonder if we can require the field name up front. 

Instead of 
 * {{c = foreach a update "prefix"+x as x}}

can we write it as 
 * {{c = foreach a update x by "prefix"+x;}}

As for delete, I somehow prefer using {{drop}}.

[~daijy], love to hear your thoughts on this.



[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2018-01-16 Thread Will Lauer (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16327602#comment-16327602
 ] 

Will Lauer commented on PIG-4608:
-

While up in the middle of the night dealing with a sick child, I realized there 
was a way to make the parsing sane if updates, adds, and deletes were all to be 
included in a single statement. How does this syntax look?

{code}
a = load 'input' using mock.Storage() as (x:chararray, y:chararray, z:long);
b = foreach a generate x+y as q, y, z:long;
c = foreach a update "prefix"+x as x, (chararray)(z+1) as z:chararray;
d = foreach a delete x, z;
e = foreach a {
   nextInt = z+1;
   update nextInt as z:int
}
f = foreach a
add {
   1+oldCol as new:long,
   somethingElse as new2
 } delete {
   colToRemove,
   otherColToRemove
 } update {
   1+oldCol2 as updatedCol,
   "1"+oldCol2 as updatedTypeCol:chararray
 };
g = foreach a {
   nextInt = z+1;
   add {
      1+oldCol as new:long,
      somethingElse as new2
   } delete {
      colToRemove,
      otherColToRemove
   } update {
      1+oldCol2 as updatedCol,
      "1"+oldCol2 as updatedTypeCol:chararray
   };
}
{code}

In this case, the surrounding curly braces would be required only when 
combining multiple clauses in a single FOREACH. A lone add, delete, or update 
could still be written without the extra curly braces.
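
To illustrate why brace-delimited clauses are easy to separate, here is a toy sketch in Python (not the Pig grammar, and not the code under review). The regular expression and the {{parse_clauses}} helper are hypothetical, and the sketch assumes clause bodies contain no nested braces and no commas inside function calls.

```python
import re
from typing import Dict, List

# A clause is a keyword followed by a brace-delimited, comma-separated list.
# Assumption: no nested braces inside a clause body.
CLAUSE = re.compile(r"\b(add|delete|update)\s*\{([^{}]*)\}", re.IGNORECASE)


def parse_clauses(body: str) -> Dict[str, List[str]]:
    """Split 'add {...} delete {...} update {...}' into per-clause item lists."""
    clauses: Dict[str, List[str]] = {}
    for keyword, items in CLAUSE.findall(body):
        # Naive comma split: good enough for the toy, wrong for f(a, b).
        clauses[keyword.lower()] = [i.strip() for i in items.split(",") if i.strip()]
    return clauses


# Columns named like keywords ('update', 'add') stay inside their braces
# and are never mistaken for the start of a new clause.
parsed = parse_clauses("add { update+1 as total } delete { add, old2 }")
print(parsed)
```

Because a keyword only starts a clause when it is directly followed by an opening brace, a column that happens to be called {{update}} or {{add}} inside a clause body is left alone.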



[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2018-01-14 Thread Will Lauer (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16325889#comment-16325889
 ] 

Will Lauer commented on PIG-4608:
-

OK, a review request has been posted for my initial pass at the code: 
https://reviews.apache.org/r/65159/



[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2018-01-14 Thread Will Lauer (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16325888#comment-16325888
 ] 

Will Lauer commented on PIG-4608:
-

I took a pass at allowing add, update, and delete in the same command, with a 
syntax like
{noformat}
b = foreach a
  add
  new1 as x,
  new2 as y,
  update
  old1+new1 as z,
  delete
  old2;
{noformat}

While logically it makes sense, the parser code starts to get a bit brittle, as 
it becomes hard to tell the difference between the add/update/delete keywords 
and column names in expressions or schemas.
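
The brittleness can be shown with a toy sketch in Python. This is not the Pig parser; the {{split_on_keywords}} helper is hypothetical and simply treats every add/update/delete token as the start of a new clause, which is the naive reading of the brace-free syntax above.

```python
from typing import Dict, List, Optional

KEYWORDS = {"add", "update", "delete"}


def split_on_keywords(tokens: List[str]) -> Dict[str, List[str]]:
    """Group tokens into clauses, starting a new clause at every keyword."""
    clauses: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for token in tokens:
        if token.lower() in KEYWORDS:
            current = token.lower()
            clauses.setdefault(current, [])
        elif current is not None:
            clauses[current].append(token)
    return clauses


# Works when no column shares a name with a keyword.
ok = split_on_keywords("add new1 as x , update old1 + new1 as z , delete old2".split())

# Breaks when a column is literally named 'delete': the expression
# 'delete + 1 as z' is torn out of the update clause.
broken = split_on_keywords("update delete + 1 as z".split())
print(broken)
```

In the second call the update clause ends up empty and the expression is attributed to a delete clause that the author never wrote.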



[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2018-01-12 Thread Will Lauer (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324986#comment-16324986
 ] 

Will Lauer commented on PIG-4608:
-

I've gone ahead and made a patch to implement a version of this functionality 
as a starting point for discussion. Once I figure out how to upload it to 
reviewboard, everyone can take a look at it.

There are several requirements that we have here:
# Need to modify values of arbitrary fields 
#* without having to specify every field
#* without the field order changing unexpectedly
#* without having to know the current index to the field
# Need to remove fields
#* without having to know the index of the field
#* without reordering the rest of the fields
# Need ability to change the type of a field
# Ability to reference a field without specifying its disambiguating join 
prefix when field is unambiguous
# Update must support the FOREACH nested block syntax 

Additionally, I agree with Rohini that a "strict" mode is required to prevent 
typos from causing scripts to run with unexpected behavior (adding a new 
column instead of modifying an existing one).

While nice to have, being able to specify adds, deletes, and updates all in the 
same statement isn't a strict requirement, as that can be done simply with 
multiple successive FOREACH statements.

The syntax that I've made work is
{code}
a = load 'input' using mock.Storage() as (x:chararray, y:chararray, z:long);
b = foreach a generate x+y as q, y, z:long;
c = foreach a update "prefix"+x as x, (chararray)(z+1) as z:chararray;
d = foreach a delete x, z;
e = foreach a {
   nextInt = z+1;
   update nextInt as z:int
}
{code}

To me, the ... syntax seems weird, so I've gone with separate UPDATE and DELETE 
commands. For clarity, only a single command can exist per statement (no 
foreach update a, delete b). Similarly, there is no support for appending 
columns, as that is easily accomplished already with 
{code}
b = foreach a generate *, a+5 as newCol:chararray;
{code}
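
As a rough model of the requirements listed above, the following Python sketch mimics what the UPDATE and DELETE statements are meant to do to a single row and its schema. It is an illustration only, not the patch: {{update_fields}} and {{delete_fields}} are hypothetical names, the sample row is made up, and raising on unknown aliases is an assumption that follows the "strict" behavior.

```python
from typing import Callable, Dict, List, Tuple

Field = Tuple[str, str]  # (alias, Pig type name)
Update = Tuple[str, Callable[[Dict[str, object]], object]]  # (new type, expression)


def update_fields(schema: List[Field], row: tuple,
                  updates: Dict[str, Update]) -> Tuple[List[Field], tuple]:
    """Replace named fields in place; every field keeps its position."""
    by_alias = {name: value for (name, _), value in zip(schema, row)}
    unknown = [alias for alias in updates if alias not in by_alias]
    if unknown:
        raise KeyError(f"unknown field(s): {unknown}")
    out_schema, out_row = [], []
    for (name, type_name), value in zip(schema, row):
        if name in updates:
            new_type, expr = updates[name]
            out_schema.append((name, new_type))   # the type may change
            out_row.append(expr(by_alias))        # expressions read the input row
        else:
            out_schema.append((name, type_name))
            out_row.append(value)
    return out_schema, tuple(out_row)


def delete_fields(schema: List[Field], row: tuple,
                  aliases: List[str]) -> Tuple[List[Field], tuple]:
    """Drop named fields; remaining fields keep their relative order."""
    names = [name for name, _ in schema]
    unknown = [alias for alias in aliases if alias not in names]
    if unknown:
        raise KeyError(f"unknown field(s): {unknown}")
    keep = [i for i, name in enumerate(names) if name not in aliases]
    return [schema[i] for i in keep], tuple(row[i] for i in keep)


a_schema = [("x", "chararray"), ("y", "chararray"), ("z", "long")]
a_row = ("a", "b", 41)  # made-up sample row

# Mirrors statement c: prefix x, and turn z+1 into a chararray.
c_schema, c_row = update_fields(a_schema, a_row, {
    "x": ("chararray", lambda r: "prefix" + r["x"]),
    "z": ("chararray", lambda r: str(r["z"] + 1)),
})
# Mirrors statement d: remove x and z by name.
d_schema, d_row = delete_fields(a_schema, a_row, ["x", "z"])
print(c_schema, c_row, d_schema, d_row)
```

Neither helper needs a field index from the caller, which is the point of requirements 1 and 2.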



[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2015-06-22 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596409#comment-14596409
 ] 

Rohini Palaniswamy commented on PIG-4608:
-

Sounds good. Can we just add ... before the fields to be appended, to make the 
appending clear? i.e.,

updated = FOREACH three_numbers GENERATE 3 as f3, 6 as f6, 9 as f9 ... f1+f2 as new_sum;



[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2015-06-22 Thread Kevin J. Price (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596462#comment-14596462
 ] 

Kevin J. Price commented on PIG-4608:
-

Several of us actually discussed this at some length, and didn't think it was 
worth differentiating between modified columns and appended columns in the 
command. Two ideas we had:
# A token, like you have, indicating that the remaining fields are being added. 
We were considering using an 'ADD' keyword. As in:
{code}
updated = FOREACH three_numbers UPDATE 3 AS f3, 6 AS f6 ADD f1+f2 AS new_sum;
{code}
# Separate statements for 'strict' versus 'non-strict' mode, e.g., for updating 
without appending you would use
{code}
updated = FOREACH three_numbers UPDATE_STRICT 3 AS f3, 6 AS f6;
{code}
and for updating with appending, you could use
{code}
updated = FOREACH three_numbers UPDATE 3 AS f3, 6 AS f6, f1+f2 AS new_sum;
{code}

However, our overall view from writing Pig scripts is that very few people 
would ever want to use the strict mode, and we did not see much value in 
having the extra token (ADD or ...) separating out appended columns. From a 
programming viewpoint, it just makes more logical sense to us to view it as an 
implicit update-or-add construct.
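
The implicit update-or-add reading can be sketched in Python as follows. This is only a model of the semantics described in the issue, not the prototype; {{update_or_append}} is a hypothetical name, and evaluating every expression against the input row is an assumption inferred from the sample output in the issue description.

```python
from typing import Callable, Dict, List, Tuple

Expr = Callable[[Dict[str, object]], object]


def update_or_append(schema: List[str], row: tuple,
                     assignments: List[Tuple[str, Expr]]) -> Tuple[List[str], tuple]:
    """Overwrite fields whose alias exists; append the rest at the end."""
    fields = dict(zip(schema, row))  # input row by alias
    out_schema, out = list(schema), list(row)
    for alias, expr in assignments:
        value = expr(fields)  # always computed from the input row
        if alias in out_schema:
            out[out_schema.index(alias)] = value  # known alias: update in place
        else:
            out_schema.append(alias)              # unknown alias: implicit add
            out.append(value)
    return out_schema, tuple(out)


schema = ["f1", "f2", "f3"]
assignments = [
    ("f1", lambda r: 5),
    ("new_sum", lambda r: r["f1"] + r["f2"]),
]
rows = [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
results = [update_or_append(schema, row, assignments)[1] for row in rows]
for result in results:
    print(result)
```

With the three input rows from the issue description this prints (5, 2, 3, 3), (5, 3, 4, 5) and (5, 4, 5, 7), matching the expected dump there.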



[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2015-06-22 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596469#comment-14596469
 ] 

Rohini Palaniswamy commented on PIG-4608:
-

The problem with the implicit add is that user typos could turn an update into 
an add. For example, if a user specified

updated = FOREACH three_numbers UPDATE 3 AS f3, 6 AS f7;

but actually meant 6 AS f6, then the script will run fine and will require more 
debugging to find out why the output is not as expected. So I would prefer 
having ... at the end to make any additions explicit. That way one can throw 
errors for updates of columns that do not exist.
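
A small Python sketch of the difference, assuming a hypothetical six-field schema and a hypothetical {{apply_update}} helper with a {{strict}} flag (neither is part of Pig):

```python
from typing import Dict, List, Optional, Tuple


class UnknownFieldError(Exception):
    """Raised in strict mode when an update targets an alias not in the schema."""


def apply_update(schema: List[str], row: tuple, new_values: Dict[str, object],
                 strict: bool) -> Tuple[List[str], tuple]:
    out_schema, out = list(schema), list(row)
    for alias, value in new_values.items():
        if alias in out_schema:
            out[out_schema.index(alias)] = value
        elif strict:
            raise UnknownFieldError(f"cannot update unknown field {alias!r}")
        else:
            out_schema.append(alias)  # implicit add
            out.append(value)
    return out_schema, tuple(out)


schema = ["f1", "f2", "f3", "f4", "f5", "f6"]
row = (1, 2, 0, 4, 5, 0)
typo = {"f3": 3, "f7": 6}  # the author meant f6

# Implicit add: runs without complaint, f6 stays 0 and a seventh column appears.
lenient_schema, lenient_row = apply_update(schema, row, typo, strict=False)

# Explicit additions only: the typo is reported immediately.
strict_error: Optional[str] = None
try:
    apply_update(schema, row, typo, strict=True)
except UnknownFieldError as exc:
    strict_error = str(exc)
print(lenient_row, strict_error)
```

In the lenient run nothing signals the mistake; the only symptom is an unexpected extra column in the output.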



[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2015-06-19 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594020#comment-14594020
 ] 

Rohini Palaniswamy commented on PIG-4608:
-

Actually, why not just write it as

updated = FOREACH three_numbers GENERATE ... 5 as f1 ... f1+f2 as new_sum;

That should gel well with the current project-range syntax - 
http://pig.apache.org/docs/r0.14.0/basic.html#prexp.



[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

2015-06-19 Thread Jacob Tolar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594401#comment-14594401
 ] 

Jacob Tolar commented on PIG-4608:
--

Hi Rohini, are you suggesting this:

{code}
updated = FOREACH three_numbers GENERATE
   ...,
   5 as f1,
   ...,
   f1+f2 as new_sum;
{code}

?

Here's an exaggerated example of why we think something like foreach ... update 
would work better. Original Pig script:

{code}
-- assume we are using the schema load option
-- ( http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html )
-- with fields named f1, f2, ..., f50
i = load '/path/to/data' USING PigStorage();
intermediate = foreach i generate
  f1, 
  f2, 
  3 as f3, 
  f4, 
  f5, 
  6 as f6,  
  -- ... you get the idea, we're updating every 3rd field for some reason
  48 as f48,
  f49,
  f50;
store intermediate into '/path/to/output' USING PigStorage(',');
{code}

Here it is with the project-range notation that already exists in Pig. In this 
particularly nasty case we are still mentioning every single field, even though 
we're using project-range:
{code}
i = load '/path/to/data' USING PigStorage();
intermediate = foreach i generate
  f1..f2,
  3 as f3, 
  f4..f5,
  6 as f6, 
  -- etc
  48 as f48,
  f49..f50;
store intermediate into '/path/to/output' USING PigStorage(',');
{code}

I think this is what you're suggesting. It's a little better than the 
project-range but still not great (lots of extra dots): 
{code}
i = load '/path/to/data' USING PigStorage();
intermediate = foreach i generate
  ...,
  3 as f3, 
  ...,
  6 as f6, 
  ...,
  9 as f9, 
  -- etc
  48 as f48,
  ...;
store intermediate into '/path/to/output' USING PigStorage(',');
{code}

With foreach ... update, we only need to list the fields that are changing.

{code}
i = load '/path/to/data' USING PigStorage();
intermediate = foreach i update
  3 as f3, 
  6 as f6, 
  9 as f9, 
  -- etc
  48 as f48;
store intermediate into '/path/to/output' USING PigStorage(',');
{code}

The last one is much clearer (if 'foreach update' has clearly defined 
semantics) and is also the shortest because it has the least extra syntactic 
overhead: you only need to type exactly what you want, nothing more. That makes 
it easier to write, easier to read later, and (we believe...but we can't use it 
yet :)) less prone to error.
