[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (38 issues)

Subscriber: pigdaily

Key         Summary
PIG-5317    Upgrade old dependencies: commons-lang, hsqldb, commons-logging
            https://issues-test.apache.org/jira/browse/PIG-5317
PIG-5316    Initialize mapred.task.id property for PoS jobs
            https://issues-test.apache.org/jira/browse/PIG-5316
PIG-5312    Uids not set in inner schemas after UNION ONSCHEMA
            https://issues-test.apache.org/jira/browse/PIG-5312
PIG-5310    MergeJoin throwing NullPointer Exception
            https://issues-test.apache.org/jira/browse/PIG-5310
PIG-5300    hashCode for Bag needs to be order independent
            https://issues-test.apache.org/jira/browse/PIG-5300
PIG-5273    _SUCCESS file should be created at the end of the job
            https://issues-test.apache.org/jira/browse/PIG-5273
PIG-5267    Review of org.apache.pig.impl.io.BufferedPositionedInputStream
            https://issues-test.apache.org/jira/browse/PIG-5267
PIG-5256    Bytecode generation for POFilter and POForeach
            https://issues-test.apache.org/jira/browse/PIG-5256
PIG-5191    Pig HBase 2.0.0 support
            https://issues-test.apache.org/jira/browse/PIG-5191
PIG-5160    SchemaTupleFrontend.java is not thread safe, cause PigServer thrown NPE in multithread env
            https://issues-test.apache.org/jira/browse/PIG-5160
PIG-5115    Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias
            https://issues-test.apache.org/jira/browse/PIG-5115
PIG-5106    Optimize when mapreduce.input.fileinputformat.input.dir.recursive set to true
            https://issues-test.apache.org/jira/browse/PIG-5106
PIG-5081    Can not run pig on spark source code distribution
            https://issues-test.apache.org/jira/browse/PIG-5081
PIG-5080    Support store alias as spark table
            https://issues-test.apache.org/jira/browse/PIG-5080
PIG-5057    IndexOutOfBoundsException when pig reducer processOnePackageOutput
            https://issues-test.apache.org/jira/browse/PIG-5057
PIG-5029    Optimize sort case when data is skewed
            https://issues-test.apache.org/jira/browse/PIG-5029
PIG-4926    Modify the content of start.xml for spark mode
            https://issues-test.apache.org/jira/browse/PIG-4926
PIG-4913    Reduce jython function initiation during compilation
            https://issues-test.apache.org/jira/browse/PIG-4913
PIG-4849    pig on tez will cause tez-ui to crash,because the content from timeline server is too long.
            https://issues-test.apache.org/jira/browse/PIG-4849
PIG-4750    REPLACE_MULTI should compile Pattern once and reuse it
            https://issues-test.apache.org/jira/browse/PIG-4750
PIG-4684    Exception should be changed to warning when job diagnostics cannot be fetched
            https://issues-test.apache.org/jira/browse/PIG-4684
PIG-4656    Improve String serialization and comparator performance in BinInterSedes
            https://issues-test.apache.org/jira/browse/PIG-4656
PIG-4598    Allow user defined plan optimizer rules
            https://issues-test.apache.org/jira/browse/PIG-4598
PIG-4551    Partition filter is not pushed down in case of SPLIT
            https://issues-test.apache.org/jira/browse/PIG-4551
PIG-4539    New PigUnit
            https://issues-test.apache.org/jira/browse/PIG-4539
PIG-4515    org.apache.pig.builtin.Distinct throws ClassCastException
            https://issues-test.apache.org/jira/browse/PIG-4515
PIG-4323    PackageConverter hanging in Spark
            https://issues-test.apache.org/jira/browse/PIG-4323
PIG-4313    StackOverflowError in LIMIT operation on Spark
            https://issues-test.apache.org/jira/browse/PIG-4313
PIG-4251    Pig on Storm
            https://issues-test.apache.org/jira/browse/PIG-4251
PIG-4002    Disable combiner when map-side aggregation is used
            https://issues-test.apache.org/jira/browse/PIG-4002
PIG-3952    PigStorage accepts '-tagSplit' to return full split information
            https://issues-test.apache.org/jira/browse/PIG-3952
PIG-3911    Define unique fields with @OutputSchema
            https://issues-test.apache.org/jira/browse/PIG-3911
PIG-3877    Getting Geo Latitude/Longitude from Address Lines
            https://issues-test.apache.org/jira/browse/PIG-3877
PIG-3873    Geo distance calculation using Haversine
            https://issues-test.apache.org/jira/browse/PIG-3873
PIG-3864    ToDate(userstring, format, timezone) computes DateTime with strange handling of Daylight Saving Time with location based timezones
            https://issues-test.apache.org/jira/browse/PIG-3864
PIG-3668    COR built-in function when atleast one of the coefficient values is NaN
            https://issues-test.apache.org/jira/browse/PIG-3668
PIG-3587    add functionality for rolling over dates
            https://issues-test.apache.org/jira/browse/PIG-3587
PIG-1804    Alow Jython
[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (33 issues)

Subscriber: pigdaily

Key         Summary
PIG-5323    Implement LastInputStreamingOptimizer in Tez
            https://issues.apache.org/jira/browse/PIG-5323
PIG-5273    _SUCCESS file should be created at the end of the job
            https://issues.apache.org/jira/browse/PIG-5273
PIG-5267    Review of org.apache.pig.impl.io.BufferedPositionedInputStream
            https://issues.apache.org/jira/browse/PIG-5267
PIG-5256    Bytecode generation for POFilter and POForeach
            https://issues.apache.org/jira/browse/PIG-5256
PIG-5191    Pig HBase 2.0.0 support
            https://issues.apache.org/jira/browse/PIG-5191
PIG-5160    SchemaTupleFrontend.java is not thread safe, cause PigServer thrown NPE in multithread env
            https://issues.apache.org/jira/browse/PIG-5160
PIG-5115    Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias
            https://issues.apache.org/jira/browse/PIG-5115
PIG-5106    Optimize when mapreduce.input.fileinputformat.input.dir.recursive set to true
            https://issues.apache.org/jira/browse/PIG-5106
PIG-5081    Can not run pig on spark source code distribution
            https://issues.apache.org/jira/browse/PIG-5081
PIG-5080    Support store alias as spark table
            https://issues.apache.org/jira/browse/PIG-5080
PIG-5057    IndexOutOfBoundsException when pig reducer processOnePackageOutput
            https://issues.apache.org/jira/browse/PIG-5057
PIG-5029    Optimize sort case when data is skewed
            https://issues.apache.org/jira/browse/PIG-5029
PIG-4926    Modify the content of start.xml for spark mode
            https://issues.apache.org/jira/browse/PIG-4926
PIG-4913    Reduce jython function initiation during compilation
            https://issues.apache.org/jira/browse/PIG-4913
PIG-4849    pig on tez will cause tez-ui to crash,because the content from timeline server is too long.
            https://issues.apache.org/jira/browse/PIG-4849
PIG-4750    REPLACE_MULTI should compile Pattern once and reuse it
            https://issues.apache.org/jira/browse/PIG-4750
PIG-4684    Exception should be changed to warning when job diagnostics cannot be fetched
            https://issues.apache.org/jira/browse/PIG-4684
PIG-4656    Improve String serialization and comparator performance in BinInterSedes
            https://issues.apache.org/jira/browse/PIG-4656
PIG-4598    Allow user defined plan optimizer rules
            https://issues.apache.org/jira/browse/PIG-4598
PIG-4551    Partition filter is not pushed down in case of SPLIT
            https://issues.apache.org/jira/browse/PIG-4551
PIG-4539    New PigUnit
            https://issues.apache.org/jira/browse/PIG-4539
PIG-4515    org.apache.pig.builtin.Distinct throws ClassCastException
            https://issues.apache.org/jira/browse/PIG-4515
PIG-4323    PackageConverter hanging in Spark
            https://issues.apache.org/jira/browse/PIG-4323
PIG-4313    StackOverflowError in LIMIT operation on Spark
            https://issues.apache.org/jira/browse/PIG-4313
PIG-4251    Pig on Storm
            https://issues.apache.org/jira/browse/PIG-4251
PIG-4002    Disable combiner when map-side aggregation is used
            https://issues.apache.org/jira/browse/PIG-4002
PIG-3952    PigStorage accepts '-tagSplit' to return full split information
            https://issues.apache.org/jira/browse/PIG-3952
PIG-3911    Define unique fields with @OutputSchema
            https://issues.apache.org/jira/browse/PIG-3911
PIG-3877    Getting Geo Latitude/Longitude from Address Lines
            https://issues.apache.org/jira/browse/PIG-3877
PIG-3873    Geo distance calculation using Haversine
            https://issues.apache.org/jira/browse/PIG-3873
PIG-3668    COR built-in function when atleast one of the coefficient values is NaN
            https://issues.apache.org/jira/browse/PIG-3668
PIG-3587    add functionality for rolling over dates
            https://issues.apache.org/jira/browse/PIG-3587
PIG-1804    Alow Jython function to implement Algebraic and/or Accumulator interfaces
            https://issues.apache.org/jira/browse/PIG-1804

You may edit this subscription at:
https://issues.apache.org/jira/secure/EditSubscription!default.jspa?subId=16328=12322384
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327955#comment-16327955 ]

Rohini Palaniswamy commented on PIG-4608:
-----------------------------------------

bq. As for the 'update val AS col' versus 'update col BY val', I think the former looks less confusing to a current pig user.

Agree and prefer usage of AS. Also, BY does not make sense grammatically. It should either be AS (update val AS col) or WITH (update col WITH val).

bq. As for delete, I somehow prefer using drop.

I prefer drop as well, as that is similar to the alter table drop column of SQL. Delete is generally used for rows.

> FOREACH ... UPDATE
> ------------------
>
> Key: PIG-4608
> URL: https://issues.apache.org/jira/browse/PIG-4608
> Project: Pig
> Issue Type: New Feature
> Reporter: Haley Thrapp
> Priority: Major
>
> I would like to propose a new command in Pig, FOREACH...UPDATE.
> Syntactically, it would look much like FOREACH … GENERATE.
> Example:
> Input data:
> (1,2,3)
> (2,3,4)
> (3,4,5)
> -- Load the data
> three_numbers = LOAD 'input_data'
> USING PigStorage()
> AS (f1:int, f2:int, f3:int);
> -- Sum up the row
> updated = FOREACH three_numbers UPDATE
> 5 as f1,
> f1+f2 as new_sum
> ;
> Dump updated;
> (5,2,3,3)
> (5,3,4,5)
> (5,4,5,7)
> Fields to update must be specified by alias. Any fields in the UPDATE that do
> not match an existing field will be appended to the end of the tuple.
> This command is particularly desirable in scripts that deal with a large
> number of fields (in the 20-200 range). Often, we need to only make
> modifications to a few fields. The FOREACH ... UPDATE statement allows the
> developer to focus on the actual logical changes instead of having to list
> all of the fields that are also being passed through.
> My team has prototyped this with changes to FOREACH ... GENERATE. We believe
> this can be done with changes to the parser and the creation of a new
> LOUpdate. No physical plan changes should be needed because we will leverage
> what LOGenerate does.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
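
The semantics described in the quoted issue (update fields that match an existing alias in place, append the rest to the end of the tuple) can be modelled outside Pig. The following is only a sketch in Python, not the prototype mentioned in the issue: `foreach_update` and its parameters are hypothetical names, and it assumes every expression is evaluated against the input tuple, which is what the example output in the issue implies (`new_sum` is 3 for the first row, not 7).

```python
from typing import Callable, Dict, List, Sequence, Tuple

# An expression receives the *input* row as a dict of alias -> value.
Expr = Callable[[Dict[str, int]], int]


def foreach_update(
    schema: Sequence[str],
    rows: Sequence[Tuple[int, ...]],
    updates: Sequence[Tuple[str, Expr]],
) -> Tuple[List[str], List[Tuple[int, ...]]]:
    """Hypothetical model of the proposed FOREACH ... UPDATE semantics."""
    # Aliases that do not match an existing field are appended to the schema.
    out_schema = list(schema)
    for alias, _ in updates:
        if alias not in out_schema:
            out_schema.append(alias)

    out_rows = []
    for row in rows:
        source = dict(zip(schema, row))   # values before any update
        result = dict(source)             # untouched fields pass through
        for alias, expr in updates:
            result[alias] = expr(source)  # assumption: expressions see input values
        out_rows.append(tuple(result[name] for name in out_schema))
    return out_schema, out_rows


if __name__ == "__main__":
    # Mirrors the example in the issue: UPDATE 5 as f1, f1+f2 as new_sum
    new_schema, new_rows = foreach_update(
        ["f1", "f2", "f3"],
        [(1, 2, 3), (2, 3, 4), (3, 4, 5)],
        [("f1", lambda r: 5), ("new_sum", lambda r: r["f1"] + r["f2"])],
    )
    print(new_schema)
    for r in new_rows:
        print(r)
```

With the input data from the issue, this model reproduces the three tuples shown after `Dump updated;`.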
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327736#comment-16327736 ]

Kevin J. Price commented on PIG-4608:
-------------------------------------

This is Yahoo-centric, but would it be possible to grep our logs for existing pig jobs and see how many of them have keyword conflicts with 'update', 'delete', 'drop', etc? I'm indifferent on 'delete' versus 'drop', but it'd be interesting to know which one would impact fewer existing scripts.

As for the 'update val AS col' versus 'update col BY val', I think the former looks less confusing to a current pig user. 'BY' only gets used currently for key ordering in groups and joins, whereas 'AS' is already used for value assignment. I agree that there's a difference between 'GENERATE val AS col' and 'UPDATE val AS col', but it's a fairly philosophical difference from the user's perspective. In both cases, they want col to have the value val after the statement, so having the same syntax makes sense.
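
The keyword-conflict survey suggested in the comment above could be approximated with a small script rather than a plain grep, so that comments and quoted strings do not produce false hits. This is a rough sketch only: `keyword_hits` is a hypothetical helper, the candidate list is taken from the words named in the comment, and the comment and string stripping is a simplification of Pig Latin's lexical rules, not a real parser.

```python
import re
from collections import Counter

# Candidate keywords named in the comment above.
CANDIDATES = ("update", "delete", "drop")

# Block comments, "--" line comments, and single-quoted strings are blanked
# out before counting, so a path like 'update' is not reported as a conflict.
_STRIP = re.compile(r"/\*.*?\*/|--[^\n]*|'(?:\\.|[^'\\])*'", re.DOTALL)
_WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def keyword_hits(script: str) -> Counter:
    """Count whole-word, case-insensitive uses of each candidate keyword."""
    code = _STRIP.sub(" ", script)
    words = (w.lower() for w in _WORD.findall(code))
    return Counter(w for w in words if w in CANDIDATES)


if __name__ == "__main__":
    sample = (
        "a = LOAD 'update' AS (id:int, drop:chararray); -- delete me\n"
        "b = FOREACH a GENERATE id, drop AS update;\n"
    )
    print(keyword_hits(sample))
```

Running it over a corpus of scripts and summing the counters per keyword would give the comparison between 'delete' and 'drop' that the comment asks about.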
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327686#comment-16327686 ]

Koji Noguchi commented on PIG-4608:
-----------------------------------

For now, can we just require separate statements for update and delete?

Also, not too excited with the use of {{AS}} for specifying which field to update. So far, "as" has been used only for naming the fields. Wondering if we can require the field name upfront. Instead of
* {{c = foreach a update "prefix"+x as x}}

can we write it as
* {{c = foreach a update x by "prefix"+x}};

As for delete, I somehow prefer using {{drop}}.

[~daijy], love to hear your thoughts on this.
[jira] [Comment Edited] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327602#comment-16327602 ]

Will Lauer edited comment on PIG-4608 at 1/16/18 7:08 PM:
----------------------------------------------------------

While up in the middle of the night dealing with a sick child, I realized there was a way to make the parsing sane if updates, adds and deletes were to be included all in a single statement. How does this syntax look?

{code}
a = load 'input' using mock.Storage() as (x:chararray, y:chararray, z:long);
b = foreach a generate x+y as q, y, z:long;
c = foreach a update "prefix"+x as x, (chararray)(z+1) as z:chararray;
d = foreach a delete x, z;
e = foreach a {
    nextInt = z+1;
    update nextInt as z:int
}
f = foreach a add { 1+oldCol as new:long, somethingElse as new2 }
    delete { colToRemove, otherColToRemove }
    update { 1+oldCol2 as updatedCol, "1"+oldCol2 as updatedTypeCol:chararray };
g = foreach a {
    nextInt = z+1;
    add { 1+oldCol as new:long, somethingElse as new2 }
    delete { colToRemove, otherColToRemove }
    update { 1+oldCol2 as updatedCol, "1"+oldCol2 as updatedTypeCol:chararray };
}
{code}

In this case, the surrounding curly braces would be required if putting multiple clauses in a single FOREACH. Add, delete, or update could all be included alone without the extra curly braces, but if you want to combine them, the curly braces would be required.
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327602#comment-16327602 ]

Will Lauer commented on PIG-4608:
---------------------------------

While up in the middle of the night dealing with a sick child, I realized there was a way to make the parsing sane if updates, adds and deletes were to be included all in a single statement. How does this syntax look?

{code}
a = load 'input' using mock.Storage() as (x:chararray, y:chararray, z:long);
b = foreach a generate x+y as q, y, z:long;
c = foreach a update "prefix"+x as x, (chararray)(z+1) as z:chararray;
d = foreach a delete x, z;
e = foreach a {
    nextInt = z+1;
    update nextInt as z:int
}
f = foreach a add { 1+oldCol as new:long, somethingElse as new2 }
    delete { colToRemove, otherColToRemove }
    update { 1+oldCol2 as updatedCol, "1"+oldCol2 as updatedTypeCol:chararray };
g = foreach a {
    nextInt = z+1;
    add { 1+oldCol as new:long, somethingElse as new2 }
    delete { colToRemove, otherColToRemove }
    update { 1+oldCol2 as updatedCol, "1"+oldCol2 as updatedTypeCol:chararray };
}
{code}

In this case, the surrounding curly braces would be required if putting multiple clauses in a single FOREACH. Add, delete, or update could all be included alone without the extra curly braces, but if you want to combine them, the curly braces would be required.
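
The comment above proposes combining add, delete and update clauses in one FOREACH but does not say how the clauses interact. One possible reading can be sketched in Python; everything here is an assumption for illustration, including the name `foreach_modify`, the rule that all expressions read the input tuple, that updated fields keep their position, that added fields are appended, and that unknown or clashing field names are rejected.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Expr = Callable[[Dict[str, Any]], Any]
Clause = Sequence[Tuple[str, Expr]]


def foreach_modify(
    schema: Sequence[str],
    rows: Sequence[Tuple[Any, ...]],
    add: Clause = (),
    delete: Sequence[str] = (),
    update: Clause = (),
) -> Tuple[List[str], List[Tuple[Any, ...]]]:
    """Hypothetical model of a FOREACH with add/delete/update clauses."""
    # Assumption: delete and update must name existing fields, add must not.
    unknown = [n for n in delete if n not in schema]
    unknown += [n for n, _ in update if n not in schema]
    clashes = [n for n, _ in add if n in schema]
    if unknown or clashes:
        raise ValueError(f"unknown fields {unknown}, already present {clashes}")

    # Surviving input fields keep their order; added fields go at the end.
    out_schema = [n for n in schema if n not in delete] + [n for n, _ in add]

    out_rows = []
    for row in rows:
        source = dict(zip(schema, row))    # input values, including deleted ones
        result = dict(source)
        for name, expr in list(update) + list(add):
            result[name] = expr(source)    # assumption: expressions see input values
        out_rows.append(tuple(result[n] for n in out_schema))
    return out_schema, out_rows


if __name__ == "__main__":
    print(foreach_modify(
        ["x", "y", "z"],
        [("a", "b", 1)],
        add=[("q", lambda r: r["x"] + r["y"])],
        delete=["x"],
        update=[("z", lambda r: str(r["z"] + 1))],
    ))
```

Under these assumptions the clause order in the statement does not matter, because no clause observes the effect of another; a real grammar would still need to decide that point explicitly.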
[jira] [Work started] (PIG-5253) Pig Hadoop 3 support
[ https://issues.apache.org/jira/browse/PIG-5253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on PIG-5253 started by Nandor Kollar.
------------------------------------------
> Pig Hadoop 3 support
> --------------------
>
> Key: PIG-5253
> URL: https://issues.apache.org/jira/browse/PIG-5253
> Project: Pig
> Issue Type: Improvement
> Reporter: Nandor Kollar
> Assignee: Nandor Kollar
> Priority: Major
> Fix For: 0.18.0