[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336593#comment-16336593 ]
Daniel Dai edited comment on PIG-4608 at 1/24/18 1:00 AM:
----------------------------------------------------------

bq. a = FOREACH b UPDATE q AS q:int – This should be illegal, right? If the type is changed, an explicit modify of the value should occur

This should be valid; the AS clause has the capacity to change types. The UPDATE clause is evaluated before the AS clause, so

a = FOREACH b UPDATE q WITH (int)q AS q:chararray;

will result in a chararray q.

bq. flattening a tuple into existing fields - does this make sense

This makes sense; it is symmetric to the AS clause.

I didn't see UPDATE/DROP in a single statement in the example. Are we not going to support both in the same statement? I actually prefer those in the same statement, as I feel users usually think about adjusting all columns at the same time. How about APPEND? Actually, when I think about DROP/APPEND, I feel we have to have INSERT as well to close the loop. But if we add INSERT, another syntax might be more appropriate, such as:

a = FOREACH b generate .., UPDATE a10 WITH 1 as new_a10, ..a20, 2 as a_20_plus_half, ..a30, a32.., UPDATE a40 WITH 2 as new_a40, 1 as a41;

Here:
Update: a10, a40 using UPDATE clause
Insert: a_20_plus_half
Drop: a31
Append: a41

In the original use case, it can be written as:

intermediate = foreach i generate .., 3 as f3, .., 6 as f6, .., 48 as f48, ..;

The idea is to make the ".." syntax more flexible, skipping the prefix/suffix when it can be inferred. It is probably more natural to add support for INSERT with this, thus making the syntax complete. How does that sound?


was (Author: daijy):
bq. a = FOREACH b UPDATE q AS q:int – This should be illegal, right? If the type is changed, an explicit modify of the value should occur

This should be valid; the AS clause has the capacity to change types. The UPDATE clause is evaluated before the AS clause, so

a = FOREACH b UPDATE q WITH (int)q AS q:chararray;

will result in a chararray q.

bq. flattening a tuple into existing fields - does this make sense

This makes sense; it is symmetric to the AS clause.

I didn't see UPDATE/DROP in a single statement in the example. Are we not going to support both in the same statement? I actually prefer those in the same statement, as I feel users usually think about adjusting all columns at the same time. How about APPEND? Actually, when I think about DROP/APPEND, I feel we have to have INSERT as well to close the loop. But if we add INSERT, another syntax might be more appropriate, such as:

a = FOREACH b generate .., UPDATE a10 WITH 1 as new_a10, ..a20, 2 as a_20_plus_half, ..a30, a32.., UPDATE a40 WITH 2 as new_a40, 1 as a41;

Here:
Update: a10, a40 using UPDATE clause
Insert: a_20_plus_half
Drop: a31
Append: a41

In the original use case, it can be written as:

intermediate = foreach i generate .., 3 as f3, .., 6 as f6, .. 48 as f48, ..;

The idea is to make the ".." syntax more flexible, skipping the prefix/suffix when it can be inferred. It is probably more natural to add support for INSERT with this, thus making the syntax complete. How does that sound?

> FOREACH ... UPDATE
> ------------------
>
>                 Key: PIG-4608
>                 URL: https://issues.apache.org/jira/browse/PIG-4608
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Haley Thrapp
>            Priority: Major
>
> I would like to propose a new command in Pig, FOREACH...UPDATE.
> Syntactically, it would look much like FOREACH … GENERATE.
> Example:
> Input data:
> (1,2,3)
> (2,3,4)
> (3,4,5)
> -- Load the data
> three_numbers = LOAD 'input_data'
> USING PigStorage()
> AS (f1:int, f2:int, f3:int);
> -- Sum up the row
> updated = FOREACH three_numbers UPDATE
> 5 as f1,
> f1+f2 as new_sum
> ;
> Dump updated;
> (5,2,3,3)
> (5,3,4,5)
> (5,4,5,7)
> Fields to update must be specified by alias. Any fields in the UPDATE that do
> not match an existing field will be appended to the end of the tuple.
> This command is particularly desirable in scripts that deal with a large
> number of fields (in the 20-200 range).
> Often, we only need to make
> modifications to a few fields. The FOREACH ... UPDATE statement allows the
> developer to focus on the actual logical changes instead of having to list
> all of the fields that are also being passed through.
> My team has prototyped this with changes to FOREACH ... GENERATE. We believe
> this can be done with changes to the parser and the creation of a new
> LOUpdate. No physical plan changes should be needed because we will leverage
> what LOGenerate does.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)