[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594401#comment-14594401 ]
Jacob Tolar commented on PIG-4608:
----------------------------------
Hi Rohini, are you suggesting this:
{code}
updated = FOREACH three_numbers GENERATE
...,
5 as f1,
...,
f1+f2 as new_sum;
{code}
?
Here's an exaggerated example of why we think something like foreach ... update
would work better. Original Pig script:
{code}
-- assume we are using the schema load option
-- ( http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html )
-- with fields named f1, f2, ..., f50
i = load '/path/to/data' USING PigStorage();
intermediate = foreach i generate
f1,
f2,
3 as f3,
f4,
f5,
6 as f6,
-- ... you get the idea, we're updating every 3rd field for some reason
48 as f48,
f49,
f50;
store intermediate into '/path/to/output' USING PigStorage(',');
{code}
Here it is with the project-range notation that already exists in Pig. In this
particularly nasty case we still have to mention every single field, even
though we're using project-range:
{code}
i = load '/path/to/data' USING PigStorage();
intermediate = foreach i generate
f1..f2,
3 as f3,
f4..f5,
6 as f6,
-- etc
48 as f48,
f49..f50;
store intermediate into '/path/to/output' USING PigStorage(',');
{code}
I think this is what you're suggesting. It's a little better than the
project-range version but still not great (lots of extra dots):
{code}
i = load '/path/to/data' USING PigStorage();
intermediate = foreach i generate
...,
3 as f3,
...,
6 as f6,
...,
9 as f9,
-- etc
48 as f48,
...;
store intermediate into '/path/to/output' USING PigStorage(',');
{code}
With foreach ... update, we only need to list the fields that are changing.
{code}
i = load '/path/to/data' USING PigStorage();
intermediate = foreach i update
3 as f3,
6 as f6,
9 as f9,
-- etc
48 as f48;
store intermediate into '/path/to/output' USING PigStorage(',');
{code}
The last one is much clearer (provided 'foreach ... update' has clearly defined
semantics) and is also the shortest, because it has the least syntactic
overhead: you type exactly what you want to change and nothing more. That makes
it easier to write, easier to read later, and (we believe, though we can't use
it yet :)) less prone to error.
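To make "clearly defined semantics" a bit more concrete, here is a minimal Python model of one possible reading of the rules in the issue description: an alias that matches an existing field replaces it in place, and any other alias is appended to the end of the tuple. This is only a sketch, not Pig code and not the prototype. The name foreach_update and the (alias, expression) pairs are hypothetical, and evaluating every expression against the original input row is an assumption inferred from the sample output in the description (new_sum is 3 for the first row, so it must see the old f1).

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

# Hypothetical types for this sketch: an expression maps the input row
# (as an alias -> value dict) to the new value of one field.
Expr = Callable[[Dict[str, Any]], Any]


def foreach_update(
    schema: Sequence[str],
    rows: Sequence[Tuple[Any, ...]],
    updates: Sequence[Tuple[str, Expr]],
) -> Tuple[List[str], List[Tuple[Any, ...]]]:
    """Model of FOREACH ... UPDATE as read from the issue description."""
    # Existing aliases keep their position; unknown aliases are appended.
    out_schema = list(schema)
    for alias, _ in updates:
        if alias not in out_schema:
            out_schema.append(alias)

    out_rows = []
    for row in rows:
        original = dict(zip(schema, row))
        result = dict(original)
        for alias, expr in updates:
            # Assumption: expressions see the input row, not earlier updates.
            result[alias] = expr(original)
        out_rows.append(tuple(result[name] for name in out_schema))
    return out_schema, out_rows


# The three_numbers example from the issue description.
schema = ["f1", "f2", "f3"]
rows = [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
new_schema, updated = foreach_update(
    schema,
    rows,
    [
        ("f1", lambda r: 5),                       # 5 as f1
        ("new_sum", lambda r: r["f1"] + r["f2"]),  # f1+f2 as new_sum
    ],
)
print(new_schema)
print(updated)
```

Under these assumptions the model reproduces the output shown in the issue description. Whether a later expression should see an earlier update is one of the points a real definition would have to settle.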
> FOREACH ... UPDATE
> ------------------
>
> Key: PIG-4608
> URL: https://issues.apache.org/jira/browse/PIG-4608
> Project: Pig
> Issue Type: New Feature
> Reporter: Haley Thrapp
>
> I would like to propose a new command in Pig, FOREACH...UPDATE.
> Syntactically, it would look much like FOREACH … GENERATE.
> Example:
> Input data:
> (1,2,3)
> (2,3,4)
> (3,4,5)
> -- Load the data
> three_numbers = LOAD 'input_data'
> USING PigStorage()
> AS (f1:int, f2:int, f3:int);
> -- Set f1 to 5 and append the sum of f1 and f2
> updated = FOREACH three_numbers UPDATE
> 5 as f1,
> f1+f2 as new_sum
> ;
> Dump updated;
> (5,2,3,3)
> (5,3,4,5)
> (5,4,5,7)
> Fields to update must be specified by alias. Any fields in the UPDATE that do
> not match an existing field will be appended to the end of the tuple.
> This command is particularly desirable in scripts that deal with a large
> number of fields (in the 20-200 range). Often, we only need to modify a few
> fields. The FOREACH ... UPDATE statement allows the developer to focus on
> the actual logical changes instead of having to list all of the fields that
> are also being passed through.
> My team has prototyped this with changes to FOREACH ... GENERATE. We believe
> this can be done with changes to the parser and the creation of a new
> LOUpdate. No physical plan changes should be needed because we will leverage
> what LOGenerate does.