[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (25 issues)
Subscriber: pigdaily

Key / Summary:
PIG-4598 Allow user defined plan optimizer rules https://issues.apache.org/jira/browse/PIG-4598
PIG-4581 thread safe issue in NodeIdGenerator https://issues.apache.org/jira/browse/PIG-4581
PIG-4574 Eliminate identity vertex for order by and skewed join right after LOAD https://issues.apache.org/jira/browse/PIG-4574
PIG-4539 New PigUnit https://issues.apache.org/jira/browse/PIG-4539
PIG-4526 Make setting up the build environment easier https://issues.apache.org/jira/browse/PIG-4526
PIG-4468 Pig's jackson version conflicts with that of hadoop 2.6.0 https://issues.apache.org/jira/browse/PIG-4468
PIG-4455 Should use DependencyOrderWalker instead of DepthFirstWalker in MRPrinter https://issues.apache.org/jira/browse/PIG-4455
PIG-4417 Pig's register command should support automatic fetching of jars from repo. https://issues.apache.org/jira/browse/PIG-4417
PIG-4373 Implement Optimize the use of DistributedCache(PIG-2672) and PIG-3861 in Tez https://issues.apache.org/jira/browse/PIG-4373
PIG-4341 Add CMX support to pig.tmpfilecompression.codec https://issues.apache.org/jira/browse/PIG-4341
PIG-4323 PackageConverter hanging in Spark https://issues.apache.org/jira/browse/PIG-4323
PIG-4313 StackOverflowError in LIMIT operation on Spark https://issues.apache.org/jira/browse/PIG-4313
PIG-4251 Pig on Storm https://issues.apache.org/jira/browse/PIG-4251
PIG-4111 Make Pig compiles with avro-1.7.7 https://issues.apache.org/jira/browse/PIG-4111
PIG-4002 Disable combiner when map-side aggregation is used https://issues.apache.org/jira/browse/PIG-4002
PIG-3952 PigStorage accepts '-tagSplit' to return full split information https://issues.apache.org/jira/browse/PIG-3952
PIG-3911 Define unique fields with @OutputSchema https://issues.apache.org/jira/browse/PIG-3911
PIG-3877 Getting Geo Latitude/Longitude from Address Lines https://issues.apache.org/jira/browse/PIG-3877
PIG-3873 Geo distance calculation using Haversine https://issues.apache.org/jira/browse/PIG-3873
PIG-3866 Create ThreadLocal classloader per PigContext https://issues.apache.org/jira/browse/PIG-3866
PIG-3864 ToDate(userstring, format, timezone) computes DateTime with strange handling of Daylight Saving Time with location based timezones https://issues.apache.org/jira/browse/PIG-3864
PIG-3851 Upgrade jline to 2.11 https://issues.apache.org/jira/browse/PIG-3851
PIG-3668 COR built-in function when atleast one of the coefficient values is NaN https://issues.apache.org/jira/browse/PIG-3668
PIG-3635 Fix e2e tests for Hadoop 2.X on Windows https://issues.apache.org/jira/browse/PIG-3635
PIG-3587 add functionality for rolling over dates https://issues.apache.org/jira/browse/PIG-3587

You may edit this subscription at: https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328&filterId=12322384
[jira] [Commented] (PIG-4610) Enable "TestOrcStorage" unit test in spark mode
[ https://issues.apache.org/jira/browse/PIG-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597045#comment-14597045 ] Mohit Sabharwal commented on PIG-4610: -- +1 (non-binding)
> Enable "TestOrcStorage" unit test in spark mode
> ---
>
> Key: PIG-4610
> URL: https://issues.apache.org/jira/browse/PIG-4610
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: liyunzhang_intel
> Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4610.patch
>
> In https://builds.apache.org/job/Pig-spark/222/#showFailuresLink, it shows the following unit test failures about "TestOrcStorage":
> org.apache.pig.builtin.TestOrcStorage.testJoinWithPruning
> org.apache.pig.builtin.TestOrcStorage.testLoadStoreMoreDataType
> org.apache.pig.builtin.TestOrcStorage.testMultiStore
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PIG-4610) Enable "TestOrcStorage" unit test in spark mode
[ https://issues.apache.org/jira/browse/PIG-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated PIG-4610: -- Attachment: PIG-4610.patch
[~mohitsabharwal],[~kexianda],[~praveenr019],[~xuefuz]: PIG-4610.patch fixes the following unit test failures:
org.apache.pig.builtin.TestOrcStorage.testJoinWithPruning
org.apache.pig.builtin.TestOrcStorage.testLoadStoreMoreDataType
org.apache.pig.builtin.TestOrcStorage.testMultiStore
Here is an example that shows why these tests failed before. testOrcStorage.tmp.pig (orc-file-11-format.orc is found at $PIG_HOME/test/org/apache/pig/builtin/orc/orc-file-11-format.orc):
{code}
A = load './orc-file-11-format.orc' using OrcStorage();
B = foreach A generate int1,string1;
D = limit B 10;
store D into './testOrcStorage.tmp.out';
{code}
The result on Spark:
{code}
false 1
false 1
false 1
false 1
false 1
false 1
false 1
false 1
false 1
false 1
{code}
The result on MR:
{code}
65536 hi
65536 bye
65536 hi
65536 bye
65536 hi
65536 bye
65536 hi
65536 bye
65536 hi
65536 bye
{code}
A row in orc-file-11-format.orc looks like the following; the required columns are the 4th ({{int1}}) and 9th ({{string1}}):
{code}
{true, 100, 2048, 65536, 9223372036854775807, 2.0, -5.0, , bye, {[{1, bye}, {2, sigh}]}, [{1, cat}, {-10, in}, {1234, hat}], {chani={5, chani}, mauddib={1, mauddib}}, 2000-03-12 15:00:01, 12345678.6547457}
{code}
The difference between Spark and MR arises because [{{OrcStorage#mRequiredColumns}}|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/builtin/OrcStorage.java#L298] is not initialized ([{{UDFContext.getUDFContext().isFrontend()}}|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/builtin/OrcStorage.java#L296] returns true). {{UDFContext.getUDFContext().isFrontend()}} returns true because [{{jconf.get(MRConfiguration.JOB_APPLICATION_ATTEMPT_ID)}}|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/impl/util/UDFContext.java#L238] is null.
PIG-4610.patch sets {{MRConfiguration.JOB_APPLICATION_ATTEMPT_ID}} in {{SparkUtil#newJobConf}}.
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596469#comment-14596469 ] Rohini Palaniswamy commented on PIG-4608: - The problem with the implicit add is that a user typo could silently turn an update into an add. For example, if the user wrote updated = FOREACH three_numbers UPDATE_STRICT 3 AS f3, 6 AS f7; but actually meant 6 AS f6;, the script would still run fine, and it would take more debugging to find out why the output is not as expected. So I would prefer having ... at the end to make any additions explicit. That way errors can be thrown for updates of columns that do not exist.
> FOREACH ... UPDATE
> --
>
> Key: PIG-4608
> URL: https://issues.apache.org/jira/browse/PIG-4608
> Project: Pig
> Issue Type: New Feature
> Reporter: Haley Thrapp
>
> I would like to propose a new command in Pig, FOREACH...UPDATE. Syntactically, it would look much like FOREACH ... GENERATE.
> Example:
> Input data:
> (1,2,3)
> (2,3,4)
> (3,4,5)
> -- Load the data
> three_numbers = LOAD 'input_data'
> USING PigStorage()
> AS (f1:int, f2:int, f3:int);
> -- Sum up the row
> updated = FOREACH three_numbers UPDATE
> 5 as f1,
> f1+f2 as new_sum
> ;
> Dump updated;
> (5,2,3,3)
> (5,3,4,5)
> (5,4,5,7)
> Fields to update must be specified by alias. Any fields in the UPDATE that do not match an existing field will be appended to the end of the tuple.
> This command is particularly desirable in scripts that deal with a large number of fields (in the 20-200 range). Often, we need to only make modifications to a few fields. The FOREACH ... UPDATE statement allows the developer to focus on the actual logical changes instead of having to list all of the fields that are also being passed through.
> My team has prototyped this with changes to FOREACH ... GENERATE. We believe this can be done with changes to the parser and the creation of a new LOUpdate. No physical plan changes should be needed because we will leverage what LOGenerate does.
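For comparison, expressing the proposal's example with today's FOREACH ... GENERATE requires listing every pass-through field explicitly, which is exactly what UPDATE aims to avoid; a minimal sketch using the same input and field names as the example above:

{code}
-- plain GENERATE equivalent of the proposed UPDATE example:
-- the untouched fields f2 and f3 must be spelled out by hand
updated = FOREACH three_numbers GENERATE
    5       AS f1,
    f2,
    f3,
    f1 + f2 AS new_sum;
-- Dump updated; gives the same result:
-- (5,2,3,3)
-- (5,3,4,5)
-- (5,4,5,7)
{code}

With only three fields the difference is small, but at the 20-200 field scale mentioned above the GENERATE form means enumerating every column.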
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596462#comment-14596462 ] Kevin J. Price commented on PIG-4608: - Several of us actually discussed this at some length and didn't think it was worth differentiating between modified columns and appended columns in the command. Two ideas we had:
# A token, like you have, indicating that the remaining fields are being added. We were considering using an 'ADD' keyword, as in:
{code}
updated = FOREACH three_numbers UPDATE 3 AS f3, 6 AS f6 ADD f1+f2 AS new_sum;
{code}
# Separate statements for 'strict' versus 'non-strict' mode. E.g., for updating without appending you would use
{code}
updated = FOREACH three_numbers UPDATE_STRICT 3 AS f3, 6 AS f6;
{code}
and for updating with appending you could use
{code}
updated = FOREACH three_numbers UPDATE 3 AS f3, 6 AS f6, f1+f2 AS new_sum;
{code}
However, our overall view from writing Pig scripts is that very few people would ever want to use the strict mode, nor did we see much value in having an extra token (ADD or ...) separating out the appended columns. From a programming viewpoint, it just makes more logical sense to us to view it as an implicit "update or add" construct.
[jira] [Commented] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596409#comment-14596409 ] Rohini Palaniswamy commented on PIG-4608: - Sounds good. Can we just add ... for the ones to be appended, to make appending clear? I.e. updated = FOREACH three_numbers GENERATE 3 as f3, 6 as f6, 9 as f9 ... f1+f2 as new_sum;
[jira] [Comment Edited] (PIG-4608) FOREACH ... UPDATE
[ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596409#comment-14596409 ] Rohini Palaniswamy edited comment on PIG-4608 at 6/22/15 6:42 PM: -- Sounds good. Can we just add ... once at the end for the ones to be appended, to make appending clear? I.e. updated = FOREACH three_numbers GENERATE 3 as f3, 6 as f6, 9 as f9 ... f1+f2 as new_sum;
was (Author: rohini): Sounds good. Can we just add ... for the ones to be appended to make appending clear? i.e updated = FOREACH three_numbers GENERATE 3 as f3, 6 as f6, 9 as f9 ... f1+f2 as new_sum;
[jira] [Commented] (PIG-4443) Write inputsplits in Tez to disk if the size is huge and option to compress pig input splits
[ https://issues.apache.org/jira/browse/PIG-4443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596201#comment-14596201 ] Rohini Palaniswamy commented on PIG-4443: - Just to be sure, are you getting this error with Pig on Tez or on MapReduce? And does the error occur while submitting the job, or after it completes, when fetching task reports?
> Write inputsplits in Tez to disk if the size is huge and option to compress pig input splits
>
> Key: PIG-4443
> URL: https://issues.apache.org/jira/browse/PIG-4443
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.14.0
> Reporter: Rohini Palaniswamy
> Assignee: Rohini Palaniswamy
> Fix For: 0.15.0
>
> Attachments: PIG-4443-1.patch, PIG-4443-Fix-TEZ-2192-2.patch, PIG-4443-Fix-TEZ-2192.patch
>
> Pig sets the input split information in the user payload, and when running against a table with tens of thousands of partitions, DAG submission fails with java.io.IOException: Requested data length 305844060 is longer than maximum configured RPC length 67108864
[jira] [Commented] (PIG-4443) Write inputsplits in Tez to disk if the size is huge and option to compress pig input splits
[ https://issues.apache.org/jira/browse/PIG-4443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595756#comment-14595756 ] Ángel Álvarez commented on PIG-4443: I have a Pig script that loads data from Hive using org.apache.hive.hcatalog.pig.HCatLoader. The script works fine in Pig 0.14, but in Pig 0.15 I'm getting this error: Requested data length 160452289 is longer than maximum configured RPC length 67108864. In Pig 0.14 I had to deal with this issue too, but I could always work around it by reducing the number of splits in the Hive tables created by Sqoop (to no more than 60 splits). Is any special configuration needed?
How can I get the last tuple of a bag?
I have data like:
(lucy,{(34,,45),(34,,45),(34,,45),(34,,45),(34,,45),(34,,45),(34,,45),(34,,45),(34,,45)})
(lili,{(12,lili,23),(12,lili,23),(12,lili,23),(12,lili,34),(12,lili,23),(12,lili,89),(12,lili,23),(12,lili,23),(12,lili,23),(12,lili,34),(12,lili,23),(12,lili,89),(12,lili,23),(12,lili,23)})
(limaomao,{(,limaomao,56),(,limaomao,56),(,limaomao,56),(,limaomao,56),(,limaomao,56),(,limaomao,56),(,limaomao,56),(,limaomao,56),(,limaomao,56)})
Its schema is: t2: {group: chararray,t1: {(a: int,b: chararray,c: int)}}
I can get the first tuple of t1 using LIMIT or FirstTupleFromBag. But how can I get the last tuple of each bag in t1? Thanks.
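Pig bags have no guaranteed order, so "last" is only well defined relative to some sort key. Assuming "last" here means the tuple with the largest value of field c (adjust the ORDER BY to whatever defines your ordering), one sketch uses a nested FOREACH:

{code}
-- for each group, sort the bag descending on c and keep the top tuple
last_per_group = FOREACH t2 {
    sorted = ORDER t1 BY c DESC;
    last1  = LIMIT sorted 1;
    GENERATE group, FLATTEN(last1);
}
{code}

If "last" instead means positional order in the original file, bags do not preserve that by themselves; you would need to attach an explicit sequence number before grouping (e.g. with RANK) and then ORDER BY that number DESC in the nested block.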