[jira] Updated: (PIG-566) Dump and store outputs do not match for PigStorage
[ https://issues.apache.org/jira/browse/PIG-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-566: --- Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed

Yes, my mistake. Thanks Mridul. Fortunately Gianmarco doesn't listen to me :) I manually tested the patch; all tests pass. Committed to trunk. Thanks Gianmarco!

Dump and store outputs do not match for PigStorage -- Key: PIG-566 URL: https://issues.apache.org/jira/browse/PIG-566 Project: Pig Issue Type: Bug Affects Versions: 0.7.0, 0.8.0 Reporter: Santhosh Srinivasan Assignee: Gianmarco De Francisci Morales Priority: Minor Fix For: 0.8.0 Attachments: PIG-566.patch, PIG-566.patch, PIG-566.patch, PIG-566.patch, PIG-566.patch

The dump and store formats for PigStorage do not match for longs and floats.
{code}
grunt> y = foreach x generate {(2985671202194220139L)};
grunt> describe y;
y: {{(long)}}
grunt> dump y;
({(2985671202194220139L)})
grunt> store y into 'y';
grunt> cat y
{(2985671202194220139)}
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-566) Dump and store outputs do not match for PigStorage
[ https://issues.apache.org/jira/browse/PIG-566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867898#action_12867898 ] Daniel Dai commented on PIG-566:

Seems Hudson is down. I will manually run the tests.

Dump and store outputs do not match for PigStorage -- Key: PIG-566 URL: https://issues.apache.org/jira/browse/PIG-566 Project: Pig Issue Type: Bug Affects Versions: 0.7.0, 0.8.0 Reporter: Santhosh Srinivasan Assignee: Gianmarco De Francisci Morales Priority: Minor Fix For: 0.8.0 Attachments: PIG-566.patch, PIG-566.patch, PIG-566.patch, PIG-566.patch, PIG-566.patch

The dump and store formats for PigStorage do not match for longs and floats.
{code}
grunt> y = foreach x generate {(2985671202194220139L)};
grunt> describe y;
y: {{(long)}}
grunt> dump y;
({(2985671202194220139L)})
grunt> store y into 'y';
grunt> cat y
{(2985671202194220139)}
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1381) Need a way for Pig to take an alternative property file
[ https://issues.apache.org/jira/browse/PIG-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1381: Fix Version/s: (was: 0.7.0)

Need a way for Pig to take an alternative property file --- Key: PIG-1381 URL: https://issues.apache.org/jira/browse/PIG-1381 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: V.V.Chaitanya Krishna Fix For: 0.8.0 Attachments: PIG-1381-1.patch, PIG-1381-2.patch, PIG-1381-3.patch, PIG-1381-4.patch, PIG-1381-5.patch

Currently, Pig reads the first pig.properties it finds in the CLASSPATH. Pig ships with a default pig.properties, and if the user has a different pig.properties there will be a conflict, since only one can be read. There are a couple of ways to solve this:
1. Give a command line option for the user to pass an additional property file
2. Rename the default pig.properties to pig-default.properties, so that the user can supply a pig.properties to override it
3. Further, we could consider using pig-default.xml/pig-site.xml, which seems more natural for the Hadoop community. If so, we shall provide backward compatibility to also read pig.properties and pig-cluster-hadoop-site.xml.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
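To make the override idea in option 2 above concrete, a user-supplied pig.properties might look like the sketch below. The property names shown are existing Pig properties, but the override mechanism itself (pig-default.properties plus a user pig.properties) is only a proposal from this issue, not implemented behavior:

{code}
# pig.properties supplied by the user; under proposal 2 this would
# override the values shipped in pig-default.properties
pig.logfile=/tmp/pig-err.log
pig.cachedbag.memusage=0.1
{code}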
[jira] Updated: (PIG-566) Dump and store outputs do not match for PigStorage
[ https://issues.apache.org/jira/browse/PIG-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-566: --- Fix Version/s: (was: 0.7.0)

Dump and store outputs do not match for PigStorage -- Key: PIG-566 URL: https://issues.apache.org/jira/browse/PIG-566 Project: Pig Issue Type: Bug Affects Versions: 0.7.0, 0.8.0 Reporter: Santhosh Srinivasan Assignee: Gianmarco De Francisci Morales Priority: Minor Fix For: 0.8.0 Attachments: PIG-566.patch, PIG-566.patch, PIG-566.patch, PIG-566.patch, PIG-566.patch

The dump and store formats for PigStorage do not match for longs and floats.
{code}
grunt> y = foreach x generate {(2985671202194220139L)};
grunt> describe y;
y: {{(long)}}
grunt> dump y;
({(2985671202194220139L)})
grunt> store y into 'y';
grunt> cat y
{(2985671202194220139)}
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1391) pig unit tests leave behind files in temp directory because MiniCluster files don't get deleted
[ https://issues.apache.org/jira/browse/PIG-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1391: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed

pig unit tests leave behind files in temp directory because MiniCluster files don't get deleted --- Key: PIG-1391 URL: https://issues.apache.org/jira/browse/PIG-1391 Project: Pig Issue Type: Bug Affects Versions: 0.7.0, 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.7.0, 0.8.0, 0.6.0 Attachments: minicluster.patch, PIG-1391.06.2.patch, PIG-1391.06.patch, PIG-1391.07.patch, PIG-1391.trunk.patch

Pig unit test runs leave behind files in the temp dir (/tmp), and over time there are too many files in the directory. Most of the files are left behind by MiniCluster: it closes/shuts down the MiniDFSCluster, MiniMRCluster, and the FileSystem created in its constructor only in finalize(), and Java does not guarantee that finalize() will be called.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1417) Site changes for 0.7
[ https://issues.apache.org/jira/browse/PIG-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-1417. - Hadoop Flags: [Reviewed] Resolution: Fixed Site changes for 0.7 Key: PIG-1417 URL: https://issues.apache.org/jira/browse/PIG-1417 Project: Pig Issue Type: Improvement Components: documentation Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.7.0 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-598) Parameter substitution ($PARAMETER) should not be performed in comments
[ https://issues.apache.org/jira/browse/PIG-598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-598. --

Parameter substitution ($PARAMETER) should not be performed in comments --- Key: PIG-598 URL: https://issues.apache.org/jira/browse/PIG-598 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Reporter: David Ciemiewicz Assignee: Thejas M Nair Fix For: 0.7.0 Attachments: PIG-598.1.patch, PIG-598.patch

Compiling the following code example will generate an error that $NOT_A_PARAMETER is an Undefined Parameter. This is problematic, as sometimes you want to comment out parts of your code, including parameters, so that you don't have to define them. I think it would be really good if parameter substitution were not performed in comments.
{code}
-- $NOT_A_PARAMETER
{code}
{code}
-bash-3.00$ pig -exectype local -latest comment.pig
USING: /grid/0/gs/pig/current
java.lang.RuntimeException: Undefined parameter : NOT_A_PARAMETER
 at org.apache.pig.tools.parameters.PreprocessorContext.substitute(PreprocessorContext.java:221)
 at org.apache.pig.tools.parameters.ParameterSubstitutionPreprocessor.parsePigFile(ParameterSubstitutionPreprocessor.java:106)
 at org.apache.pig.tools.parameters.ParameterSubstitutionPreprocessor.genSubstitutedFile(ParameterSubstitutionPreprocessor.java:86)
 at org.apache.pig.Main.runParamPreprocessor(Main.java:394)
 at org.apache.pig.Main.main(Main.java:296)
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
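For illustration, once the preprocessor skips comments, a script like the following sketch should pass parameter substitution even though the commented-out line mentions a parameter that was never defined (the file and alias names here are made up):

{code}
-- store A into '$OUTPUT_DIR';   <- commented out; $OUTPUT_DIR is undefined,
--                                  but the preprocessor should ignore it
A = load 'input.txt' as (query:chararray);
dump A;
{code}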
[jira] Closed: (PIG-617) Using SUM with basic type fails
[ https://issues.apache.org/jira/browse/PIG-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-617. --

Using SUM with basic type fails --- Key: PIG-617 URL: https://issues.apache.org/jira/browse/PIG-617 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Reporter: Santhosh Srinivasan Fix For: 0.7.0

SUM is an aggregate function that expects a bag as an argument. When basic types are used as arguments to SUM, Pig fails at run time. The typechecker should catch this error and fail earlier. An example is given below:
{code}
grunt> a = load 'one' as (i: int);
grunt> b = foreach a generate SUM(i);
grunt> dump b;
2009-01-12 14:11:47,595 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2009-01-12 14:12:12,617 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Map reduce job failed
2009-01-12 14:12:12,618 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Job failed!
2009-01-12 14:12:12,623 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - Error message from task (map) task_200812151518_9683_m_00
java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.pig.data.DataBag
2009-01-12 14:12:12,623 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - Error message from task (map) task_200812151518_9683_m_00
java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.pig.data.DataBag
 at org.apache.pig.builtin.IntSum.sum(IntSum.java:141)
 at org.apache.pig.builtin.IntSum.exec(IntSum.java:41)
 at org.apache.pig.builtin.IntSum.exec(IntSum.java:36)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:185)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:247)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:265)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:197)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:187)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:175)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
 at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
 ...
2009-01-12 14:12:12,629 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias b
2009-01-12 14:12:12,629 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.FrontendException: Unable to open iterator for alias b
 at org.apache.pig.PigServer.openIterator(PigServer.java:425)
 at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:271)
 at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:178)
 at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84)
 at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:72)
 at org.apache.pig.Main.main(Main.java:302)
Caused by: java.io.IOException: Job terminated with anomalous status FAILED
 at org.apache.pig.PigServer.openIterator(PigServer.java:419)
 ... 5 more
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-257) Allow usage of custom Hadoop InputFormat in Pig
[ https://issues.apache.org/jira/browse/PIG-257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-257. --

Allow usage of custom Hadoop InputFormat in Pig --- Key: PIG-257 URL: https://issues.apache.org/jira/browse/PIG-257 Project: Pig Issue Type: New Feature Reporter: Pi Song Fix For: 0.7.0

This very cool idea sprang out of a discussion on the mailing list (Thanks Manish Shah). There is a semantic issue: a Hadoop InputFormat generally produces K,V pairs, but Pig expects Tuples. We can solve this by sticking K and V in as fields of a Tuple. Provided that we've got rich built-in string/binary manipulation functions, Hadoop users shouldn't find it too costly to use Pig. This should definitely help accelerate the Pig adoption process. After a brief look at the current code, this new feature will require changes in the Map Reduce execution engine, so I will wait until the type branch is complete before starting to work on this (if nobody expresses interest in doing it :) )
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
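From the Pig Latin side, the K,V-as-Tuple idea above might look like the sketch below. The loader name is hypothetical, and the sketch assumes the wrapped InputFormat's key and value simply become the first two fields of each tuple:

{code}
-- MySequenceFileLoader is a hypothetical wrapper around a Hadoop InputFormat;
-- each (key, value) pair becomes a two-field tuple
A = load 'events' using MySequenceFileLoader() as (key:chararray, value:chararray);
B = filter A by key matches '^user.*';
{code}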
[jira] Closed: (PIG-518) LOBinCond exception in LogicalPlanValidationExecutor when providing default values for bag
[ https://issues.apache.org/jira/browse/PIG-518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-518. --

LOBinCond exception in LogicalPlanValidationExecutor when providing default values for bag --- Key: PIG-518 URL: https://issues.apache.org/jira/browse/PIG-518 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Reporter: Viraj Bhat Fix For: 0.7.0 Attachments: queries.txt, sports_views.txt

The following piece of Pig script, which provides the default value {('','')} for a bag when COUNT returns 0, fails with the error below. (Note: the files used in this script are attached to this Jira.)
{code}
a = load 'sports_views.txt' as (col1, col2, col3);
b = load 'queries.txt' as (colb1, colb2, colb3);
mycogroup = cogroup a by col1 inner, b by colb1;
mynewalias = foreach mycogroup generate flatten(a), flatten((COUNT(b) > 0L ? b.(colb2,colb3) : {('','')}));
dump mynewalias;
{code}
java.io.IOException: Unable to open iterator for alias: mynewalias [Unable to store for alias: mynewalias [Can't overwrite cause]]
 at java.lang.Throwable.initCause(Throwable.java:320)
 at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:1494)
 at org.apache.pig.impl.logicalLayer.LOBinCond.visit(LOBinCond.java:85)
 at org.apache.pig.impl.logicalLayer.LOBinCond.visit(LOBinCond.java:28)
 at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
 at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
 at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.checkInnerPlan(TypeCheckingVisitor.java:2345)
 at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2252)
 at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:121)
 at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:40)
 at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
 at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
 at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101)
 at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40)
 at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30)
 at org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java:79)
 at org.apache.pig.PigServer.compileLp(PigServer.java:684)
 at org.apache.pig.PigServer.compileLp(PigServer.java:655)
 at org.apache.pig.PigServer.store(PigServer.java:433)
 at org.apache.pig.PigServer.store(PigServer.java:421)
 at org.apache.pig.PigServer.openIterator(PigServer.java:384)
 at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:269)
 at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:178)
 at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64)
 at org.apache.pig.Main.main(Main.java:306)
Caused by: java.io.IOException: Unable to store for alias: mynewalias [Can't overwrite cause]
 ... 26 more
Caused by: java.lang.IllegalStateException: Can't overwrite cause
 ... 26 more
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-756) UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path
[ https://issues.apache.org/jira/browse/PIG-756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-756. --

UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path Key: PIG-756 URL: https://issues.apache.org/jira/browse/PIG-756 Project: Pig Issue Type: Bug Reporter: David Ciemiewicz Fix For: 0.7.0

I have a utility function util.INSETFROMFILE() that I pass a file name during initialization.
{code}
define inQuerySet util.INSETFROMFILE('analysis/queries');
A = load 'logs' using PigStorage() as ( date: int, query: chararray );
B = filter A by inQuerySet(query);
{code}
This provides a computationally inexpensive way to effect map-side joins for small sets, plus functions of this style provide the ability to encapsulate more complex matching rules. For rapid development and debugging purposes, I want this code to run without modification both on my local file system when I do pig -exectype local and on HDFS. Pig needs to provide an API for UDFs which allows them to either:
1) know when they are in local or HDFS mode and let them open and read from files as appropriate, or
2) just provide a file name and read statements and have Pig transparently manage local or HDFS opens and reads for the UDF.
UDFs need to read configuration information off the filesystem, and it simplifies the process if one can just flip the switch of -exectype local.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-758) Converting load/store locations into fully qualified absolute paths
[ https://issues.apache.org/jira/browse/PIG-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-758. --

Converting load/store locations into fully qualified absolute paths --- Key: PIG-758 URL: https://issues.apache.org/jira/browse/PIG-758 Project: Pig Issue Type: Bug Reporter: Gunther Hagleitner Fix For: 0.7.0

As part of the multiquery optimization work, there is a need to use absolute paths for load and store operations (because the current directory changes during the execution of the script). In order to do so, we are suggesting a change to the semantics of the location/filename string used in LoadFunc and Slicer/Slice. The proposed change is:
* Load locations without a scheme part are expected to be hdfs (mapreduce mode) or local (local mode) paths
* Any hdfs or local path will be translated to a fully qualified absolute path before it is handed to either a LoadFunc or Slicer
* Any scheme other than file or hdfs will result in the load path being passed through to the LoadFunc or Slicer without any modification.
Example: If you have a LoadFunc that reads from a database, in the current system the following could be used:
{noformat}
a = load 'table' using DBLoader();
{noformat}
With the proposed changes, 'table' would be translated into an hdfs path (hdfs:///table), which is probably not what the DBLoader would want to see. In order to make it work, one could use:
{noformat}
a = load 'sql://table' using DBLoader();
{noformat}
Now the DBLoader would see the unchanged string 'sql://table'. This is an incompatible change, but hopefully not one affecting many existing Loaders/Slicers. Since this is needed for the multiquery feature, the behavior can be reverted by using the no_multiquery pig flag.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-613) Casting complex type(tuple/bag/map) does not take effect
[ https://issues.apache.org/jira/browse/PIG-613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-613. --

Casting complex type(tuple/bag/map) does not take effect Key: PIG-613 URL: https://issues.apache.org/jira/browse/PIG-613 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Reporter: Viraj Bhat Assignee: Daniel Dai Fix For: 0.7.0 Attachments: myfloatdata.txt, PIG-613-1.patch, PIG-613-2.patch, SQUARE.java

Consider the following Pig script, which casts the return values of the SQUARE UDF, tuples of doubles, to long. The describe output of B shows long, but the result is still double.
{code}
register statistics.jar;
A = load 'myfloatdata.txt' using PigStorage() as (doublecol:double);
B = foreach A generate (tuple(long))statistics.SQUARE(doublecol) as squares:(loadtimesq);
describe B;
explain B;
dump B;
{code}
=== Describe output of B:
B: {squares: (loadtimesq: long)}
=== Sample output of B:
((7885.44))
((792098.2200010001))
((1497360.9268889998))
((50023.7956))
((0.972196))
((0.30980356))
((9.9760144E-7))
=== Cause:
The cast for Tuples has not been implemented in POCast.java
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-726) Stop printing scope as part of Operator.toString()
[ https://issues.apache.org/jira/browse/PIG-726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-726. --

Stop printing scope as part of Operator.toString() -- Key: PIG-726 URL: https://issues.apache.org/jira/browse/PIG-726 Project: Pig Issue Type: Improvement Reporter: Thejas M Nair Assignee: Gunther Hagleitner Fix For: 0.7.0

When an operator is printed in Pig, it prints a string with the user name and the date at which the grunt shell was started. This information is not useful and makes the output very verbose. For example, a line in explain looks like:
ForEach tejas-Thu Mar 19 11:25:23 PDT 2009-4 Schema: {themap: map[ ]} Type: bag
I am proposing that it should change to:
ForEach (id:4) Schema: {themap: map[ ]} Type: bag
That string comes from scope in the OperatorKey class. We don't make use of it anywhere, so we should stop printing it. The change is only in OperatorKey.toString().
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-829) DECLARE statement stop processing after special characters such as dot . , + % etc..
[ https://issues.apache.org/jira/browse/PIG-829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-829. --

DECLARE statement stop processing after special characters such as dot . , + % etc.. -- Key: PIG-829 URL: https://issues.apache.org/jira/browse/PIG-829 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.3.0 Reporter: Viraj Bhat Fix For: 0.7.0

The Pig script below does not work well when special characters are used in the DECLARE statement.
{code}
%DECLARE OUT foo.bar
x = LOAD 'something' as (a:chararray, b:chararray);
y = FILTER x BY ( a MATCHES '^.*yahoo.*$' );
STORE y INTO '$OUT';
{code}
When the above script is run in dry run mode, the substituted file does not contain the special character.
{code}
java -cp pig.jar:/homes/viraj/hadoop-0.18.0-dev/conf -Dhod.server='' org.apache.pig.Main -r declaresp.pig
{code}
Resulting file: declaresp.pig.substituted
{code}
x = LOAD 'something' as (a:chararray, b:chararray);
y = FILTER x BY ( a MATCHES '^.*yahoo.*$' );
STORE y INTO 'foo';
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-760. --

Serialize schemas for PigStorage() and other storage types. --- Key: PIG-760 URL: https://issues.apache.org/jira/browse/PIG-760 Project: Pig Issue Type: New Feature Reporter: David Ciemiewicz Assignee: Dmitriy V. Ryaboy Fix For: 0.7.0 Attachments: pigstorageschema-2.patch, pigstorageschema.patch, pigstorageschema_3.patch, pigstorageschema_4.patch, pigstorageschema_5.patch, pigstorageschema_7.patch, TEST-org.apache.pig.piggybank.test.TestPigStorageSchema.txt

I'm finding PigStorage() really convenient for storage and data interchange because it compresses well and imports into Excel and other analysis environments well. However, it is a pain when it comes to maintenance because the columns are in fixed locations and I'd like to add columns in some cases. It would be great if load PigStorage() could read a default schema from a .schema file stored with the data, and if store PigStorage() could store a .schema file with the data. I have tested this out, and both Hadoop HDFS and Pig in -exectype local mode will ignore a file called .schema in a directory of part files. So, for example, if I have a chain of Pig scripts I execute such as:
{code}
A = load 'data-1' using PigStorage() as ( a: int, b: int );
store A into 'data-2' using PigStorage();
B = load 'data-2' using PigStorage();
describe B;
{code}
describe B should output something like { a: int, b: int }
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-803) Pig Latin Reference Manual - discussion of Pig streaming is incomplete
[ https://issues.apache.org/jira/browse/PIG-803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-803. -- Pig Latin Reference Manual - discussion of Pig streaming is incomplete -- Key: PIG-803 URL: https://issues.apache.org/jira/browse/PIG-803 Project: Pig Issue Type: Bug Components: documentation Reporter: David Ciemiewicz Assignee: Corinne Chandel Fix For: 0.7.0 The Pig Latin Reference Manual section on STREAM is missing broad swaths of information such as a discussion of the ship() clause. http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm#_STREAM_ A more complete definition seems to be here: http://wiki.apache.org/pig/PigStreamingFunctionalSpec However, it discusses auto shipping of scripts which doesn't seem to be working. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-887) document use of expressions in join,group,cogroup
[ https://issues.apache.org/jira/browse/PIG-887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-887. -- document use of expressions in join,group,cogroup - Key: PIG-887 URL: https://issues.apache.org/jira/browse/PIG-887 Project: Pig Issue Type: Improvement Components: documentation Reporter: Thejas M Nair Assignee: Olga Natkovich Fix For: 0.7.0 For join,group,cogroup relational operators, pig allows expressions to be used in place of the field aliases in the syntax documented in http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm . But this feature is not documented in the manual. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
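As a short sketch of the undocumented feature described above, the following uses an expression rather than a plain field alias as the group key (the file and alias names are made up):

{code}
A = load 'students' as (name:chararray, age:int, gpa:double);
-- an expression, not a bare field alias, as the group key:
B = group A by age / 10;
dump B;
{code}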
[jira] Closed: (PIG-834) incorrect plan when algebraic functions are nested
[ https://issues.apache.org/jira/browse/PIG-834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-834. --

incorrect plan when algebraic functions are nested -- Key: PIG-834 URL: https://issues.apache.org/jira/browse/PIG-834 Project: Pig Issue Type: Bug Components: impl Reporter: Thejas M Nair Assignee: Ashutosh Chauhan Fix For: 0.7.0 Attachments: pig-834.patch, pig-834_2.patch, pig-834_3.patch

a = load 'students.txt' as (c1,c2,c3,c4);
c = group a by c2;
f = foreach c generate COUNT(org.apache.pig.builtin.Distinct($1.$2));

Notice that the Distinct udf is missing in the Combiner and reduce stages. As a result distinct does not function, and incorrect results are produced. Distinct should have been evaluated in all 3 stages, and the output of Distinct should be given to COUNT in the reduce stage.
{code}
# Map Reduce Plan
#--
MapReduce node 1-122
Map Plan
Local Rearrange[tuple]{bytearray}(false) - 1-139
|   |
|   Project[bytearray][1] - 1-140
|
|---New For Each(false,false)[bag] - 1-127
|   |
|   POUserFunc(org.apache.pig.builtin.COUNT$Initial)[tuple] - 1-125
|   |
|   |---POUserFunc(org.apache.pig.builtin.Distinct)[bag] - 1-126
|       |
|       |---Project[bag][2] - 1-123
|           |
|           |---Project[bag][1] - 1-124
|   |
|   Project[bytearray][0] - 1-133
|
|---Pre Combiner Local Rearrange[tuple]{Unknown} - 1-141
    |
    |---Load(hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/tejas/students.txt:org.apache.pig.builtin.PigStorage) - 1-111
Combine Plan
Local Rearrange[tuple]{bytearray}(false) - 1-143
|   |
|   Project[bytearray][1] - 1-144
|
|---New For Each(false,false)[bag] - 1-132
|   |
|   POUserFunc(org.apache.pig.builtin.COUNT$Intermediate)[tuple] - 1-130
|   |
|   |---Project[bag][0] - 1-135
|   |
|   Project[bytearray][1] - 1-134
|
|---POCombinerPackage[tuple]{bytearray} - 1-137
Reduce Plan
Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-121
|
|---New For Each(false)[bag] - 1-120
    |   |
    |   POUserFunc(org.apache.pig.builtin.COUNT$Final)[long] - 1-119
    |   |
    |   |---Project[bag][0] - 1-136
    |
    |---POCombinerPackage[tuple]{bytearray} - 1-145
Global sort: false
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-843) PERFORMANCE: improvements in memory management
[ https://issues.apache.org/jira/browse/PIG-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-843. --

PERFORMANCE: improvements in memory management -- Key: PIG-843 URL: https://issues.apache.org/jira/browse/PIG-843 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Fix For: 0.7.0

Currently, Pig uses way too much memory. We need to understand where the memory goes and come up with a strategy to minimize the memory footprint.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-844) PERFORMANCE: streaming data to the UDFs in foreach
[ https://issues.apache.org/jira/browse/PIG-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-844. --

PERFORMANCE: streaming data to the UDFs in foreach -- Key: PIG-844 URL: https://issues.apache.org/jira/browse/PIG-844 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Fix For: 0.7.0

Currently, Pig places the data passed to UDFs into a bag. This can cause the process to use more memory than actually needed, as in many cases it would be better to push the data one tuple at a time to the UDFs. For the case where the combiner is invoked this might not be that important; however, for non-algebraic UDFs, as well as other cases where the combiner can't be used, this can provide a significant memory improvement. Another possible use case is where the data is already grouped going into Pig and we don't need to group it again. How this will affect the UDF interface needs to be further discussed.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-872) use distributed cache for the replicated data set in FR join
[ https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-872. --

use distributed cache for the replicated data set in FR join Key: PIG-872 URL: https://issues.apache.org/jira/browse/PIG-872 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Fix For: 0.7.0 Attachments: PIG_872.patch.1

Currently, the replicated file is read directly from DFS by all maps. If the number of concurrent maps is huge, we can overwhelm the NameNode with open calls. Using the distributed cache will address the issue and might also give a performance boost, since the file will be copied locally once and then reused by all tasks running on the same machine. The basic approach would be to use cacheArchive to place the file into the cache on the frontend; on the backend, the tasks would need to refer to the data using a path from the cache. Note that cacheArchive does not work in Hadoop local mode. (Not a problem for us right now as we don't use it.)
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
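For context, a fragment-replicate join is requested in Pig Latin as in the sketch below (the aliases and file names are made up). The replicated input is the one that every map task reads, which is exactly the file the issue proposes to serve from the distributed cache:

{code}
big   = load 'big_input'   as (k:chararray, v:int);
small = load 'small_input' as (k:chararray, w:int);
-- 'small' is replicated to every map task; this issue proposes shipping it
-- via the distributed cache instead of direct DFS reads from each map
J = join big by k, small by k using 'replicated';
{code}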
[jira] Closed: (PIG-879) Pig should provide a way for input location string in load statement to be passed as-is to the Loader
[ https://issues.apache.org/jira/browse/PIG-879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-879. -- Pig should provide a way for input location string in load statement to be passed as-is to the Loader - Key: PIG-879 URL: https://issues.apache.org/jira/browse/PIG-879 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Pradeep Kamath Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-879.patch, PIG-879.patch, PIG-879.patch, PIG-879.patch, PIG-879.patch Due to multiquery optimization, Pig always converts the filenames to absolute URIs (see http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification - section about Incompatible Changes - Path Names and Schemes). This is necessary since the script may have cd .. statements between load or store statements, and if the load statements have relative paths, we would need to convert to absolute paths to know where to load/store from. To do this, QueryParser.massageFilename() has the code below[1], which basically gives the fully qualified hdfs path. However, the issue with this approach is that if the filename string is something like hdfs://localhost.localdomain:39125/user/bla/1,hdfs://localhost.localdomain:39125/user/bla/2, the code below[1] actually translates this to hdfs://localhost.localdomain:38264/user/bla/1,hdfs://localhost.localdomain:38264/user/bla/2 and throws an exception that it is an incorrect path. Some loaders may want to interpret the filenames (the input location string in the load statement) in any way they wish and may want Pig to not make absolute paths out of them. There are a few options to address this: 1) A command line switch to indicate to Pig that pathnames in the script are all absolute and hence Pig should not alter them but pass them as-is to Loaders and Storers. 
2) A keyword in the load and store statements to indicate the same intent to Pig. 3) A property which users can supply on the command line or in pig.properties to indicate the same intent. 4) A method in LoadFunc - relativeToAbsolutePath(String filename, String curDir) - which does the conversion to absolute paths; this way the Loader can choose to implement it as a no-op. Thoughts? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
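Option 4 can be sketched as a template-method hook. This is a hypothetical miniature, not Pig's real LoadFunc: the framework calls the hook with the location string, the default behavior absolutizes relative paths, and a loader that wants the string passed through untouched overrides the hook as a no-op.

```java
// Hypothetical sketch of option 4; names are illustrative, not Pig's real API.
public class LocationHookSketch {

    public static abstract class Loader {
        // Default behavior: make relative paths absolute against curDir.
        public String relativeToAbsolutePath(String location, String curDir) {
            return location.startsWith("/") ? location : curDir + "/" + location;
        }
    }

    // Ordinary loader keeps the default absolutization.
    public static class DefaultLoader extends Loader {}

    // A loader that interprets the location string itself (e.g. a comma-separated
    // list of hdfs:// URIs) overrides the hook as a no-op.
    public static class PassThroughLoader extends Loader {
        @Override
        public String relativeToAbsolutePath(String location, String curDir) {
            return location;   // passed as-is, never mangled by the framework
        }
    }
}
```

This keeps the multiquery-driven absolutization as the default while letting a loader like the one in the bug report opt out, which is the behavior Pig 0.7 ultimately shipped in LoadFunc.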
[jira] Closed: (PIG-933) broken link in pig-latin reference manual to hadoop file glob pattern documentation
[ https://issues.apache.org/jira/browse/PIG-933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-933. -- broken link in pig-latin reference manual to hadoop file glob pattern documentation --- Key: PIG-933 URL: https://issues.apache.org/jira/browse/PIG-933 Project: Pig Issue Type: Bug Components: documentation Reporter: Thejas M Nair Assignee: Olga Natkovich Priority: Minor Fix For: 0.7.0 http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm#_LOAD has a link to http://lucene.apache.org/hadoop/api/org/apache/hadoop/fs/FileSystem.html#globPaths(org.apache.hadoop.fs.Path) ; it should instead be http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-937) Task get stuck in BasicTable's BTScaner's atEnd() method
[ https://issues.apache.org/jira/browse/PIG-937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-937. -- Task get stuck in BasicTable's BTScaner's atEnd() method Key: PIG-937 URL: https://issues.apache.org/jira/browse/PIG-937 Project: Pig Issue Type: Bug Reporter: He Yongqiang Fix For: 0.7.0 It seems to be caused by the infinite loop in the code: BasicTable, line 698
{noformat}
while (true) {
  int index = random.nextInt(cgScanners.length - 1) + 1;
  if (cgScanners[index] != null) {
    if (cgScanners[index].atEnd() != ret) {
      throw new IOException("atEnd() failed: Column Groups are not evenly positioned.");
    }
    break;
  }
}
{noformat}
I think it's fine to just use a for loop here, like:
{noformat}
for (int index = 0; index < cgScanners.length; index++) {
  if (cgScanners[index] != null) {
    if (cgScanners[index].atEnd() != ret) {
      throw new IOException("atEnd() failed: Column Groups are not evenly positioned.");
    }
    break;
  }
}
{noformat}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-940) Cross site HDFS access using the default.fs.name not possible in Pig
[ https://issues.apache.org/jira/browse/PIG-940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-940. -- Cross site HDFS access using the default.fs.name not possible in Pig Key: PIG-940 URL: https://issues.apache.org/jira/browse/PIG-940 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.5.0 Environment: Hadoop 20 Reporter: Viraj Bhat Fix For: 0.7.0 I have a script which does the following: access data from a remote HDFS location (via an HDFS installed at hdfs://remotemachine1.company.com/ ) [[as I do not want to copy this huge amount of data between HDFS locations]]. However, I want my Pig script to write data to the HDFS running on localmachine.company.com. Currently Pig does not support that behavior and complains that hdfs://localmachine.company.com/user/viraj/A1.txt does not exist. {code} A = LOAD 'hdfs://remotemachine1.company.com/user/viraj/A1.txt' as (a, b); B = LOAD 'hdfs://remotemachine1.company.com/user/viraj/B1.txt' as (c, d); C = JOIN A by a, B by c; store C into 'output' using PigStorage(); {code} === 2009-09-01 00:37:24,032 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localmachine.company.com:8020 2009-09-01 00:37:24,277 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localmachine.company.com:50300 2009-09-01 00:37:24,567 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer - Rewrite: POPackage-POForEach to POJoinPackage 2009-09-01 00:37:24,573 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1 2009-09-01 00:37:24,573 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1 2009-09-01 00:37:26,197 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job 2009-09-01 00:37:26,249 [Thread-9] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2009-09-01 00:37:26,746 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2009-09-01 00:37:26,746 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2009-09-01 00:37:26,747 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map reduce job(s) failed! 2009-09-01 00:37:26,756 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed to produce result in: hdfs:/localmachine.company.com/tmp/temp-1470407685/tmp-510854480 2009-09-01 00:37:26,756 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed! 2009-09-01 00:37:26,758 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2100: hdfs://localmachine.company.com/user/viraj/A1.txt does not exist. Details at logfile: /home/viraj/pigscripts/pig_1251765443851.log === The error file in Pig contains: === ERROR 2998: Unhandled internal error. org.apache.pig.backend.executionengine.ExecException: ERROR 2100: hdfs://localmachine.company.com/user/viraj/A1.txt does not exist. 
at org.apache.pig.backend.executionengine.PigSlicer.validate(PigSlicer.java:126) at org.apache.pig.impl.io.ValidatingInputFileSpec.validate(ValidatingInputFileSpec.java:59) at org.apache.pig.impl.io.ValidatingInputFileSpec.init(ValidatingInputFileSpec.java:44) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:228) at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378) at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247) at
[jira] Closed: (PIG-948) [Usability] Relating pig script with MR jobs
[ https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-948. -- [Usability] Relating pig script with MR jobs Key: PIG-948 URL: https://issues.apache.org/jira/browse/PIG-948 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.4.0 Reporter: Ashutosh Chauhan Assignee: Daniel Dai Priority: Minor Fix For: 0.7.0 Attachments: pig-948-2.patch, pig-948-3.patch, PIG-948-4.patch, PIG-948-5.patch, PIG-948-6.patch, pig-948.patch Currently it's hard to find a way to relate a pig script with a specific MR job. In a loaded cluster with multiple simultaneous job submissions, it's not easy to figure out which specific MR jobs were launched for a given pig script. If Pig can provide this info, it will be useful for debugging and monitoring the jobs resulting from a pig script. At the very least, Pig should be able to provide the user the following information: 1) Job id of the launched job. 2) Complete web url of the jobtracker running this job. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-977) exit status does not account for JOB_STATUS.TERMINATED
[ https://issues.apache.org/jira/browse/PIG-977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-977. -- exit status does not account for JOB_STATUS.TERMINATED -- Key: PIG-977 URL: https://issues.apache.org/jira/browse/PIG-977 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Ashutosh Chauhan Fix For: 0.7.0 Attachments: pig-977.patch For determining the exit status of a pig query, only JOB_STATUS.FAILED is being used and the status TERMINATED is ignored. I think the reason for this is that in ExecJob.JOB_STATUS only FAILED and COMPLETED are being used anywhere; the rest are unused. I think we should comment out the unused parts for now to indicate that, or fix the code for determining success/failure in GruntParser.executeBatch():
{code}
public enum JOB_STATUS {
    QUEUED,
    RUNNING,
    SUSPENDED,
    TERMINATED,
    FAILED,
    COMPLETED,
}
{code}
{code}
private void executeBatch() throws IOException {
    if (mPigServer.isBatchOn()) {
        if (mExplain != null) {
            explainCurrentBatch();
        }
        if (!mLoadOnly) {
            List<ExecJob> jobs = mPigServer.executeBatch();
            for (ExecJob job : jobs) {
                if (job.getStatus() == ExecJob.JOB_STATUS.FAILED) { // <== only FAILED is checked
                    mNumFailedJobs++;
                    if (job.getException() != null) {
                        LogUtils.writeLog(
                            job.getException(),
                            mPigServer.getPigContext().getProperties().getProperty("pig.logfile"),
                            log,
                            "true".equalsIgnoreCase(mPigServer.getPigContext().getProperties().getProperty("verbose")),
                            "Pig Stack Trace");
                    }
                } else {
                    mNumSucceededJobs++;
                }
            }
        }
    }
}
{code}
Any opinions? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
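The proposed fix amounts to treating TERMINATED like FAILED when computing the exit status, instead of counting only FAILED. A minimal self-contained sketch of that predicate (hypothetical class name, mirroring the enum in the report above):

```java
// Hypothetical sketch of the status check the report asks for.
public class JobStatusSketch {
    public enum JobStatus { QUEUED, RUNNING, SUSPENDED, TERMINATED, FAILED, COMPLETED }

    // A job that was terminated did not produce its result, so it should
    // count toward the failure tally just like an outright failure.
    public static boolean isFailure(JobStatus s) {
        return s == JobStatus.FAILED || s == JobStatus.TERMINATED;
    }

    public static int failedCount(JobStatus... statuses) {
        int n = 0;
        for (JobStatus s : statuses) if (isFailure(s)) n++;
        return n;
    }
}
```

In the real GruntParser loop this would replace the bare `== ExecJob.JOB_STATUS.FAILED` comparison when incrementing mNumFailedJobs.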
[jira] Closed: (PIG-952) [Zebra] Make Zebra Version Same as Pig Version
[ https://issues.apache.org/jira/browse/PIG-952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-952. -- [Zebra] Make Zebra Version Same as Pig Version -- Key: PIG-952 URL: https://issues.apache.org/jira/browse/PIG-952 Project: Pig Issue Type: Improvement Components: build Affects Versions: 0.4.0 Reporter: Gaurav Jain Assignee: Gaurav Jain Priority: Minor Fix For: 0.7.0 Zebra release versions need to be same as Pig release versions for consistency -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-961) Integration with Hadoop 21
[ https://issues.apache.org/jira/browse/PIG-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-961. -- Integration with Hadoop 21 -- Key: PIG-961 URL: https://issues.apache.org/jira/browse/PIG-961 Project: Pig Issue Type: New Feature Reporter: Olga Natkovich Assignee: Ying He Fix For: 0.7.0 Attachments: hadoop21.jar, PIG-961.patch, PIG-961.patch2 Hadoop 21 is not yet released, but we know that a switch to the new MR API is coming there. This JIRA is for early integration with this API. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-966) Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
[ https://issues.apache.org/jira/browse/PIG-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-966. -- Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces --- Key: PIG-966 URL: https://issues.apache.org/jira/browse/PIG-966 Project: Pig Issue Type: Improvement Components: impl Reporter: Alan Gates Assignee: Alan Gates Fix For: 0.7.0 I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces significantly. See http://wiki.apache.org/pig/LoadStoreRedesignProposal for full details -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-990) Provide a way to pin LogicalOperator Options
[ https://issues.apache.org/jira/browse/PIG-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-990. -- Provide a way to pin LogicalOperator Options Key: PIG-990 URL: https://issues.apache.org/jira/browse/PIG-990 Project: Pig Issue Type: Bug Components: impl Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Priority: Minor Fix For: 0.7.0 Attachments: pinned_options.patch, pinned_options_2.patch, pinned_options_3.patch, pinned_options_4.patch, pinned_options_5.patch, pinned_options_6.patch This is a proactive patch, setting up the groundwork for adding an optimizer. Some of the LogicalOperators have options. For example, LOJoin has a variety of join types (regular, fr, skewed, merge), which can be set by the user or chosen by a hypothetical optimizer. If a user selects a join type, Pig philosophy guides us to always respect the user's choice and not explore alternatives. Therefore, we need a way to pin options. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-973) type resolution inconsistency
[ https://issues.apache.org/jira/browse/PIG-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-973. -- type resolution inconsistency - Key: PIG-973 URL: https://issues.apache.org/jira/browse/PIG-973 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-973.patch This script works: A = load 'test' using PigStorage(':') as (name: chararray, age: int, gpa: float); B = group A by age; C = foreach B { D = filter A by gpa > 2.5; E = order A by name; F = A.age; describe F; G = distinct F; generate group, COUNT(D), MAX (E.name), MIN(G.$0);} dump C; This one produces an error: A = load 'test' using PigStorage(':') as (name: chararray, age: int, gpa: float); B = group A by age; C = foreach B { D = filter A by gpa > 2.5; E = order A by name; F = A.age; G = distinct F; generate group, COUNT(D), MAX (E.name), MIN(G);} dump C; Notice the difference in how MIN is passed the data. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-980) Optimizing nested order bys
[ https://issues.apache.org/jira/browse/PIG-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-980. -- Optimizing nested order bys --- Key: PIG-980 URL: https://issues.apache.org/jira/browse/PIG-980 Project: Pig Issue Type: Improvement Reporter: Alan Gates Assignee: Ying He Fix For: 0.7.0 Pig needs to take advantage of secondary sort in Hadoop to optimize nested order bys. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1022) optimizer pushes filter before the foreach that generates column used by filter
[ https://issues.apache.org/jira/browse/PIG-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1022. --- optimizer pushes filter before the foreach that generates column used by filter --- Key: PIG-1022 URL: https://issues.apache.org/jira/browse/PIG-1022 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.4.0 Reporter: Thejas M Nair Assignee: Daniel Dai Fix For: 0.7.0 Attachments: PIG-1022-1.patch grunt l = load 'students.txt' using PigStorage() as (name:chararray, gender:chararray, age:chararray, score:chararray); grunt f = foreach l generate name, gender, age,score, '200' as gid:chararray; grunt g = group f by (name, gid); grunt f2 = foreach g generate group.name as name: chararray, group.gid as gid: chararray; grunt filt = filter f2 by gid == '200'; grunt explain filt; In the generated plan, filt is pushed up after the load and before the first foreach, even though the filter is on gid, which is generated in the first foreach. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
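The missing safety condition can be stated concretely: a filter is only safe to push below a foreach if it references no column that the foreach generates. A hedged sketch of that check (a hypothetical helper, not Pig's actual optimizer code):

```java
import java.util.Set;

// Hypothetical sketch of the safety check the optimizer is missing here:
// in the example above, filt references gid, and gid is generated by the
// foreach, so the filter must stay above it.
public class PushdownCheckSketch {
    public static boolean canPushBelowForeach(Set<String> filterColumns,
                                              Set<String> generatedColumns) {
        for (String c : filterColumns) {
            if (generatedColumns.contains(c)) {
                return false;   // filter depends on a column created by the foreach
            }
        }
        return true;
    }
}
```

For the script in the report, filterColumns = {gid} and generatedColumns = {gid}, so the check returns false and the pushdown would be skipped.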
[jira] Closed: (PIG-1045) Integration with Hadoop 20 New API
[ https://issues.apache.org/jira/browse/PIG-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1045. --- Integration with Hadoop 20 New API -- Key: PIG-1045 URL: https://issues.apache.org/jira/browse/PIG-1045 Project: Pig Issue Type: New Feature Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-1045.patch, PIG-1045.patch Hadoop 21 is not yet released, but we know that a switch to the new MR API is coming there. This JIRA is for early integration with the portion of this API that has been implemented in Hadoop 20. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1053) Consider moving to Hadoop for local mode
[ https://issues.apache.org/jira/browse/PIG-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1053. --- Consider moving to Hadoop for local mode Key: PIG-1053 URL: https://issues.apache.org/jira/browse/PIG-1053 Project: Pig Issue Type: Improvement Reporter: Alan Gates Assignee: Ankit Modi Fix For: 0.7.0 Attachments: hadoopLocal.patch We need to consider moving Pig to use Hadoop's local mode instead of its own. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1072) ReversibleLoadStoreFunc interface should be removed to enable different load and store implementation classes to be used in a reversible manner
[ https://issues.apache.org/jira/browse/PIG-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1072. --- ReversibleLoadStoreFunc interface should be removed to enable different load and store implementation classes to be used in a reversible manner --- Key: PIG-1072 URL: https://issues.apache.org/jira/browse/PIG-1072 Project: Pig Issue Type: Sub-task Reporter: Pradeep Kamath Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-1072.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1079) Modify merge join to use distributed cache to maintain the index
[ https://issues.apache.org/jira/browse/PIG-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1079. --- Modify merge join to use distributed cache to maintain the index Key: PIG-1079 URL: https://issues.apache.org/jira/browse/PIG-1079 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-1079.patch, PIG-1079.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1086) Nested sort by * throw exception
[ https://issues.apache.org/jira/browse/PIG-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1086. --- Nested sort by * throw exception Key: PIG-1086 URL: https://issues.apache.org/jira/browse/PIG-1086 Project: Pig Issue Type: Bug Affects Versions: 0.5.0 Reporter: Daniel Dai Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-1086.patch The following script fails: A = load '1.txt' as (a0, a1, a2); B = group A by a0; C = foreach B { D = order A by *; generate group, D;}; explain C; Here is the stack: Caused by: java.lang.ArrayIndexOutOfBoundsException: -1 at java.util.ArrayList.get(ArrayList.java:324) at org.apache.pig.impl.logicalLayer.schema.Schema.getField(Schema.java:752) at org.apache.pig.impl.logicalLayer.LOSort.getSortInfo(LOSort.java:332) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:1365) at org.apache.pig.impl.logicalLayer.LOSort.visit(LOSort.java:176) at org.apache.pig.impl.logicalLayer.LOSort.visit(LOSort.java:43) at org.apache.pig.impl.plan.DependencyOrderWalkerWOSeenChk.walk(DependencyOrderWalkerWOSeenChk.java:69) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:1274) at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:130) at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:45) at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:234) at org.apache.pig.PigServer.compilePp(PigServer.java:864) at org.apache.pig.PigServer.explain(PigServer.java:583) ... 8 more -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1093) pig.properties file is missing from distributions
[ https://issues.apache.org/jira/browse/PIG-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1093. --- pig.properties file is missing from distributions - Key: PIG-1093 URL: https://issues.apache.org/jira/browse/PIG-1093 Project: Pig Issue Type: Bug Components: build Affects Versions: 0.5.0, 0.6.0 Reporter: Alan Gates Assignee: Alan Gates Fix For: 0.7.0 Attachments: PIG-1093.patch pig.properties (in fact the entire conf directory) is not included in the jars distributed as part of the 0.5 release. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1075) Error in Cogroup when key fields types don't match
[ https://issues.apache.org/jira/browse/PIG-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1075. --- Error in Cogroup when key fields types don't match -- Key: PIG-1075 URL: https://issues.apache.org/jira/browse/PIG-1075 Project: Pig Issue Type: Bug Affects Versions: 0.5.0 Reporter: Ankur Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-1075.patch When Cogrouping 2 relations on multiple key fields, pig throws an error if the corresponding types don't match. Consider the following script:- A = LOAD 'data' USING PigStorage() as (a:chararray, b:int, c:int); B = LOAD 'data' USING PigStorage() as (a:chararray, b:chararray, c:int); C = CoGROUP A BY (a,b,c), B BY (a,b,c); D = FOREACH C GENERATE FLATTEN(A), FLATTEN(B); describe D; dump D; The complete stack trace of the error thrown is Pig Stack Trace --- ERROR 1051: Cannot cast to Unknown org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1001: Unable to describe schema for alias D at org.apache.pig.PigServer.dumpSchema(PigServer.java:436) at org.apache.pig.tools.grunt.GruntParser.processDescribe(GruntParser.java:233) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:253) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:397) Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 0: An unexpected exception caused the validation to stop at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:104) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30) at 
org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java:83) at org.apache.pig.PigServer.compileLp(PigServer.java:821) at org.apache.pig.PigServer.dumpSchema(PigServer.java:428) ... 6 more Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1060: Cannot resolve COGroup output schema at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2463) at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:372) at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:45) at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101) ... 11 more Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1051: Cannot cast to Unknown at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.insertAtomicCastForCOGroupInnerPlan(TypeCheckingVisitor.java:2552) at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2451) ... 16 more The error message does not help the user in identifying the issue clearly especially if the pig script is large and complex. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1101) Pig parser does not recognize its own data type in LIMIT statement
[ https://issues.apache.org/jira/browse/PIG-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1101. --- Pig parser does not recognize its own data type in LIMIT statement -- Key: PIG-1101 URL: https://issues.apache.org/jira/browse/PIG-1101 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Assignee: Ashutosh Chauhan Priority: Minor Fix For: 0.7.0 Attachments: pig-1101.patch I have a Pig script in which I specify the number of records to limit as a long type. {code} A = LOAD '/user/viraj/echo.txt' AS (txt:chararray); B = LIMIT A 10L; DUMP B; {code} I get a parser error: 2009-11-21 02:25:51,100 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered LONGINTEGER 10L at line 3, column 13. Was expecting: INTEGER ... at org.apache.pig.impl.logicalLayer.parser.QueryParser.generateParseException(QueryParser.java:8963) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_consume_token(QueryParser.java:8839) at org.apache.pig.impl.logicalLayer.parser.QueryParser.LimitClause(QueryParser.java:1656) at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1280) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:893) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:682) at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1017) In fact 10L seems to work in the foreach generate construct. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1088) change merge join and merge join indexer to work with new LoadFunc interface
[ https://issues.apache.org/jira/browse/PIG-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1088. --- change merge join and merge join indexer to work with new LoadFunc interface Key: PIG-1088 URL: https://issues.apache.org/jira/browse/PIG-1088 Project: Pig Issue Type: Sub-task Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.7.0 Attachments: PIG-1088.1.patch, PIG-1088.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1106) FR join should not spill
[ https://issues.apache.org/jira/browse/PIG-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1106. --- FR join should not spill Key: PIG-1106 URL: https://issues.apache.org/jira/browse/PIG-1106 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Ankit Modi Fix For: 0.7.0 Attachments: frjoin-nonspill.patch Currently, the values for the replicated side of the data are placed in a spillable bag (POFRJoin, near line 275). This does not make sense because the whole point of the optimization is that the data on one side fits into memory. We already have a non-spillable bag implemented (NonSpillableDataBag.java) and we need to change the FRJoin code to use it. And of course we need to do lots of testing to make sure that we don't spill but instead die when we run out of memory. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
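The requested behavior (keep the replicated side strictly in memory and fail fast rather than spill) can be sketched with a capacity-limited list. This is an illustration only: the capacity parameter stands in for available heap, and Pig's NonSpillableDataBag is the real class the patch switches to.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a non-spillable bag: tuples live in a plain
// in-memory list and are never written to disk; exceeding capacity fails
// the job outright instead of silently falling back to spilling.
public class NonSpillableBagSketch<T> {
    private final List<T> tuples = new ArrayList<>();
    private final int capacity;   // stands in for available heap

    public NonSpillableBagSketch(int capacity) {
        this.capacity = capacity;
    }

    public void add(T t) {
        if (tuples.size() >= capacity) {
            throw new IllegalStateException("replicated input does not fit in memory");
        }
        tuples.add(t);            // kept in memory, never spilled
    }

    public int size() {
        return tuples.size();
    }
}
```

Failing fast is the right trade-off here because a spilling replicated side silently defeats the whole point of the fragment-replicate optimization.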
[jira] Closed: (PIG-1099) [zebra] version on APACHE trunk should be 0.7.0 to be in pace with PIG
[ https://issues.apache.org/jira/browse/PIG-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1099. --- [zebra] version on APACHE trunk should be 0.7.0 to be in pace with PIG -- Key: PIG-1099 URL: https://issues.apache.org/jira/browse/PIG-1099 Project: Pig Issue Type: Bug Reporter: Yan Zhou Assignee: Yan Zhou Priority: Trivial Fix For: 0.7.0 Attachments: PIG_1099.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1102) Collect number of spills per job
[ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1102. --- Collect number of spills per job Key: PIG-1102 URL: https://issues.apache.org/jira/browse/PIG-1102 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Fix For: 0.7.0 Attachments: PIG_1102.patch, PIG_1102.patch.1 Memory shortage is one of the main performance issues in Pig. Knowing when we spill to disk is useful for understanding query performance and also for seeing how certain changes in Pig affect that. Other interesting stats to collect would be average CPU usage and max memory usage, but I am not sure if this information is easily retrievable. Using Hadoop counters for this would make sense. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1103) refactor test-commit
[ https://issues.apache.org/jira/browse/PIG-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1103. --- refactor test-commit Key: PIG-1103 URL: https://issues.apache.org/jira/browse/PIG-1103 Project: Pig Issue Type: Task Reporter: Olga Natkovich Assignee: Olga Natkovich Fix For: 0.7.0 Attachments: PIG-1103.patch Due to the changes to the local mode, many tests are now taking longer. Need to make sure that test-commit still finishes within 10 minutes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1110) Handle compressed file formats -- Gz, BZip with the new proposal
[ https://issues.apache.org/jira/browse/PIG-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1110. --- Handle compressed file formats -- Gz, BZip with the new proposal Key: PIG-1110 URL: https://issues.apache.org/jira/browse/PIG-1110 Project: Pig Issue Type: Sub-task Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-1110.patch, PIG-1110.patch, PIG_1110_Jeff.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1115) [zebra] temp files are not cleaned.
[ https://issues.apache.org/jira/browse/PIG-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1115. --- [zebra] temp files are not cleaned. --- Key: PIG-1115 URL: https://issues.apache.org/jira/browse/PIG-1115 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Hong Tang Assignee: Gaurav Jain Fix For: 0.7.0 Attachments: PIG-1115.patch Temp files created by zebra during table creation are not cleaned when there is a task failure, which results in wasted disk space. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1117) Pig reading hive columnar rc tables
[ https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1117. --- Pig reading hive columnar rc tables --- Key: PIG-1117 URL: https://issues.apache.org/jira/browse/PIG-1117 Project: Pig Issue Type: New Feature Affects Versions: 0.7.0 Reporter: Gerrit Jansen van Vuuren Assignee: Gerrit Jansen van Vuuren Fix For: 0.7.0 Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch, PIG-1117-0.7.0-new.patch, PIG-1117-0.7.0-reviewed.patch, PIG-1117-0.7.0-reviewed.patch, PIG-1117.patch, PIG-117-v.0.6.0.patch, PIG-117-v.0.7.0.patch I've coded a LoadFunc implementation that can read from Hive Columnar RC tables; this is needed for a project that I'm working on because all our data is stored using the Hive thrift-serialized Columnar RC format. I have looked at the piggy bank but did not find any implementation that could do this. We've been running it on our cluster for the last week and have worked out most bugs. There are still some improvements I would like to make, such as setting the number of mappers based on date partitioning. It's been optimized to read only specific columns, and it can churn through a data set almost 8 times faster with this improvement because not all column data is read. I would like to contribute the class to the piggybank; can you guide me on what I need to do? I've used Hive-specific classes to implement this; is it possible to add this to the piggy bank build ivy for automatic download of the dependencies? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
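A rough usage sketch of such a loader (hypothetical: the jar name, table path, and schema string are illustrative, and the constructor arguments may differ from what was committed to piggybank):
{code}
register piggybank.jar;
-- hypothetical example: read two columns of a Hive Columnar RC table
a = LOAD '/data/hive_rc_table' USING org.apache.pig.piggybank.storage.HiveColumnarLoader('f1 string,f2 int');
b = FOREACH a GENERATE f1;
{code}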
[jira] Closed: (PIG-1122) [zebra] Zebra build.xml still uses 0.6 version
[ https://issues.apache.org/jira/browse/PIG-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1122. --- [zebra] Zebra build.xml still uses 0.6 version -- Key: PIG-1122 URL: https://issues.apache.org/jira/browse/PIG-1122 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Yan Zhou Assignee: Yan Zhou Fix For: 0.7.0 Attachments: PIG-1122.patch Zebra still uses pig-0.6.0-dev-core.jar in build-contrib.xml. It should be changed to pig-0.7.0-dev-core.jar on APACHE trunk only. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1131) Pig simple join does not work when it contains empty lines
[ https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1131. --- Pig simple join does not work when it contains empty lines -- Key: PIG-1131 URL: https://issues.apache.org/jira/browse/PIG-1131 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Viraj Bhat Assignee: Ashutosh Chauhan Fix For: 0.7.0 Attachments: junk1.txt, junk2.txt, pig-1131.patch, pig-1131.patch, simplejoinscript.pig I have a simple script, which does a JOIN. {code} input1 = load '/user/viraj/junk1.txt' using PigStorage(' '); describe input1; input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001'); describe input2; joineddata = JOIN input1 by $0, input2 by $0; describe joineddata; store joineddata into 'result'; {code} The input data contains empty lines. The join fails in the Map phase with the following error in POLocalRearrange.java: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1 at java.util.ArrayList.RangeCheck(ArrayList.java:547) at java.util.ArrayList.get(ArrayList.java:322) at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:159) I am surprised that the test cases did not detect this error. Could we add this data which contains empty lines to the testcases? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1140) [zebra] Use of Hadoop 2.0 APIs
[ https://issues.apache.org/jira/browse/PIG-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1140. --- [zebra] Use of Hadoop 2.0 APIs Key: PIG-1140 URL: https://issues.apache.org/jira/browse/PIG-1140 Project: Pig Issue Type: Improvement Affects Versions: 0.6.0 Reporter: Yan Zhou Assignee: Xuefu Zhang Fix For: 0.7.0 Attachments: zebra.0209, zebra.0211, zebra.0212, zebra.0213 Currently, Zebra is still using the already-deprecated Hadoop 0.18 APIs. It needs to upgrade to the 0.20 APIs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1146) Inconsistent column pruning in LOUnion
[ https://issues.apache.org/jira/browse/PIG-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1146. --- Inconsistent column pruning in LOUnion -- Key: PIG-1146 URL: https://issues.apache.org/jira/browse/PIG-1146 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.7.0 Attachments: PIG-1146-1.patch, PIG-1146-2.patch This happens when we do a union on two relations where one column comes from a loader, the matching column in the other relation comes from a constant, and this column gets pruned. We prune the one from the loader but do not prune the constant, which leaves the union in an inconsistent state. Here is a script: {code} a = load '1.txt' as (a0, a1:chararray, a2); b = load '2.txt' as (b0, b2); c = foreach b generate b0, 'hello', b2; d = union a, c; e = foreach d generate $0, $2; dump e; {code} 1.txt: {code} ulysses thompson64 1.90 katie carson25 3.65 {code} 2.txt: {code} luke king 0.73 holly davidson 2.43 {code} expected output: (ulysses thompson,1.90) (katie carson,3.65) (luke king,0.73) (holly davidson,2.43) real output: (ulysses thompson,) (katie carson,) (luke king,0.73) (holly davidson,2.43) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1124) Unable to set Custom Job Name using the -Dmapred.job.name parameter
[ https://issues.apache.org/jira/browse/PIG-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1124. --- Unable to set Custom Job Name using the -Dmapred.job.name parameter --- Key: PIG-1124 URL: https://issues.apache.org/jira/browse/PIG-1124 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Assignee: Ashutosh Chauhan Priority: Minor Fix For: 0.7.0 Attachments: pig-1124.patch As a Hadoop user I want to control the Job name for my analysis via the command line using the following construct: java -cp pig.jar:$HADOOP_HOME/conf -Dmapred.job.name=hadoop_junkie org.apache.pig.Main broken.pig -Dmapred.job.name should normally set my Hadoop Job name, but somehow during the formation of the job.xml in Pig this information is lost and the job name turns out to be: PigLatin:broken.pig The current workaround seems to be wiring it in the script itself (or using parameter substitution): set job.name 'my job' Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
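The in-script workaround mentioned in the report looks like this in a complete script (the input file name is illustrative; the job name is taken from the report's command line):
{code}
set job.name 'hadoop_junkie'
a = LOAD 'broken_input';
DUMP a;
{code}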
[jira] Closed: (PIG-1149) Allow instantiation of SampleLoaders with parametrized LoadFuncs
[ https://issues.apache.org/jira/browse/PIG-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1149. --- Allow instantiation of SampleLoaders with parametrized LoadFuncs Key: PIG-1149 URL: https://issues.apache.org/jira/browse/PIG-1149 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Priority: Minor Fix For: 0.7.0 Attachments: pig_1149.patch, pig_1149_lsr-branch.patch Currently, it is not possible to instantiate a SampleLoader with something like PigStorage(':'). We should allow passing parameters to the loaders being sampled. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1136) [zebra] Map Split of Storage info do not allow for leading underscore char '_'
[ https://issues.apache.org/jira/browse/PIG-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1136. --- [zebra] Map Split of Storage info do not allow for leading underscore char '_' -- Key: PIG-1136 URL: https://issues.apache.org/jira/browse/PIG-1136 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Yan Zhou Priority: Minor Fix For: 0.7.0 Attachments: pig-1136-xuefu-new.patch Some users need support for that type of map key. Pig's column names do not allow a leading underscore, but apparently no restriction is placed on map keys. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1153) [zebra] splitting columns at different levels in a complex record column into different column groups throws exception
[ https://issues.apache.org/jira/browse/PIG-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1153. --- [zebra] splitting columns at different levels in a complex record column into different column groups throws exception - Key: PIG-1153 URL: https://issues.apache.org/jira/browse/PIG-1153 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Xuefu Zhang Assignee: Yan Zhou Fix For: 0.7.0 Attachments: PIG-1153.patch, PIG-1153.patch The following code sample: String strSch = "r1:record(f1:int, f2:int), r2:record(f5:int, r3:record(f3:float, f4))"; String strStorage = "[r1.f1, r2.r3.f3, r2.f5]; [r1.f2, r2.r3.f4]"; Partition p = new Partition(schema.toString(), strStorage, null); gives the following exception: org.apache.hadoop.zebra.parser.ParseException: Different Split Types Set on the same field: r2.f5 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1141) Make streaming work with the new load-store interfaces
[ https://issues.apache.org/jira/browse/PIG-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1141. --- Make streaming work with the new load-store interfaces --- Key: PIG-1141 URL: https://issues.apache.org/jira/browse/PIG-1141 Project: Pig Issue Type: Sub-task Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-1141.patch, PIG-1141.patch, PIG-1141.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1154) local mode fails when hadoop config directory is specified in classpath
[ https://issues.apache.org/jira/browse/PIG-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1154. --- local mode fails when hadoop config directory is specified in classpath --- Key: PIG-1154 URL: https://issues.apache.org/jira/browse/PIG-1154 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Thejas M Nair Assignee: Ankit Modi Fix For: 0.7.0 Attachments: pig_1154.patch In local mode, the hadoop configuration should not be taken from the classpath. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1148) Move splitable logic from pig latin to InputFormat
[ https://issues.apache.org/jira/browse/PIG-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1148. --- Move splitable logic from pig latin to InputFormat -- Key: PIG-1148 URL: https://issues.apache.org/jira/browse/PIG-1148 Project: Pig Issue Type: Sub-task Reporter: Jeff Zhang Assignee: Jeff Zhang Fix For: 0.7.0 Attachments: PIG-1148.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1157) Successive replicated joins do not generate a Map Reduce plan and fail due to OOM
[ https://issues.apache.org/jira/browse/PIG-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1157. --- Successive replicated joins do not generate a Map Reduce plan and fail due to OOM --- Key: PIG-1157 URL: https://issues.apache.org/jira/browse/PIG-1157 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Assignee: Richard Ding Fix For: 0.7.0 Attachments: oomreplicatedjoin.pig, PIG-1157.patch, PIG-1157.patch, replicatedjoinexplain.log Hi all, I have a script which does 2 replicated joins in succession. Please note that the inputs do not exist on the HDFS. {code} A = LOAD '/tmp/abc' USING PigStorage('\u0001') AS (a:long, b, c); A1 = FOREACH A GENERATE a; B = GROUP A1 BY a; C = LOAD '/tmp/xyz' USING PigStorage('\u0001') AS (x:long, y); D = JOIN C BY x, B BY group USING replicated; E = JOIN A BY a, D by x USING replicated; dump E; {code} 2009-12-16 19:12:00,253 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 4 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-only splittees. 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-reduce splittees. 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 2 splittees. 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2 2009-12-16 19:12:00,713 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. unable to create new native thread Details at logfile: pig_1260990666148.log Looking at the log file: Pig Stack Trace --- ERROR 2998: Unhandled internal error. 
unable to create new native thread java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:597) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265) at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:773) at org.apache.pig.PigServer.store(PigServer.java:522) at org.apache.pig.PigServer.openIterator(PigServer.java:458) at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:397) If we want to look at the explain output, we find that there is no Map Reduce plan that is generated. Why is the M/R plan not generated? Attaching the script and explain output. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1161) Add missing apache headers to a few classes
[ https://issues.apache.org/jira/browse/PIG-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1161. --- Add missing apache headers to a few classes --- Key: PIG-1161 URL: https://issues.apache.org/jira/browse/PIG-1161 Project: Pig Issue Type: Task Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Priority: Trivial Fix For: 0.7.0 Attachments: pig_missing_licenses.patch The following java classes are missing Apache License headers: StoreConfig MapRedUtil SchemaUtil TestDataBagAccess TestNullConstant TestSchemaUtil We should add the missing headers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1169) Top-N queries produce incorrect results when a store statement is added between order by and limit statement
[ https://issues.apache.org/jira/browse/PIG-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1169. --- Top-N queries produce incorrect results when a store statement is added between order by and limit statement Key: PIG-1169 URL: https://issues.apache.org/jira/browse/PIG-1169 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-1169.patch ??We tried to get top N results after a groupby and sort, and got different results with or without storing the full sorted results. Here is a skeleton of our pig script.?? {code} raw_data = Load 'input_files' AS (f1, f2, ..., fn); grouped = group raw_data by (f1, f2); data = foreach grouped generate FLATTEN(group), SUM(raw_data.fk) as value; ordered = order data by value DESC parallel 10; topn = limit ordered 10; store ordered into 'outputdir/full'; store topn into 'outputdir/topn'; {code} ??With the statement 'store ordered ...', top N results are incorrect, but without the statement, results are correct. Has anyone seen this before? I know a similar bug has been fixed in the multi-query release. We are on pig .4 and hadoop .20.1.?? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1156) Add aliases to ExecJobs and PhysicalOperators
[ https://issues.apache.org/jira/browse/PIG-1156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1156. --- Add aliases to ExecJobs and PhysicalOperators - Key: PIG-1156 URL: https://issues.apache.org/jira/browse/PIG-1156 Project: Pig Issue Type: Improvement Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.7.0 Attachments: pig_batchAliases.patch Currently, the way to use multi-query from Java is as follows: 1. pigServer.setBatchOn(); 2. register your queries with pigServer 3. List<ExecJob> jobs = pigServer.executeBatch(); 4. for (ExecJob job : jobs) { Iterator<Tuple> results = job.getResults(); } This will cause all stores to get evaluated in a single batch. However, there is no way to identify which of the ExecJobs corresponds to which store. We should add aliases by which the stored relations are known to ExecJob in order to allow the user to identify what the jobs correspond to. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1158) pig command line -M option doesn't support table union correctly (comma separated paths)
[ https://issues.apache.org/jira/browse/PIG-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1158. --- pig command line -M option doesn't support table union correctly (comma separated paths) Key: PIG-1158 URL: https://issues.apache.org/jira/browse/PIG-1158 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Jing Huang Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-1158.patch For example, with load '1.txt,2.txt' USING org.apache.hadoop.zebra.pig.TableLoader() I see this error on standard out: [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2100: hdfs://gsbl90380.blue.ygrid.yahoo.com/user/hadoopqa/1.txt,2.txt does not exist. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1176) Column Pruner issues in union of loader with and without schema
[ https://issues.apache.org/jira/browse/PIG-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1176. --- Column Pruner issues in union of loader with and without schema --- Key: PIG-1176 URL: https://issues.apache.org/jira/browse/PIG-1176 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.7.0 Attachments: PIG-1176-1.patch The column pruner for union could fail if one source of the union has a schema and the other does not. For example, the following script fails: {code} a = load '1.txt' as (a0, a1, a2); b = foreach a generate a0; c = load '2.txt'; d = foreach c generate $0; e = union b, d; dump e; {code} However, this issue is in trunk only and is not applicable to the 0.6 branch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1164) [zebra]smoke test
[ https://issues.apache.org/jira/browse/PIG-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1164. --- [zebra]smoke test - Key: PIG-1164 URL: https://issues.apache.org/jira/browse/PIG-1164 Project: Pig Issue Type: Test Affects Versions: 0.6.0 Reporter: Jing Huang Fix For: 0.7.0 Attachments: PIG-1164.patch, PIG-SMOKE.patch, smoke.patch Change the zebra build.xml file to add a smoke target, and add env.sh and a run script under the zebra/src/test/smoke dir. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1170) [zebra] end to end test and stress test
[ https://issues.apache.org/jira/browse/PIG-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1170. --- [zebra] end to end test and stress test --- Key: PIG-1170 URL: https://issues.apache.org/jira/browse/PIG-1170 Project: Pig Issue Type: Test Affects Versions: 0.6.0 Reporter: Jing Huang Fix For: 0.7.0 Attachments: e2eStress.patch Add test cases for the zebra end-to-end test, stress test, and stress test verification tool. No unit test is needed for this jira. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1187) UTF-8 (international code) breaks with loader when load with schema is specified
[ https://issues.apache.org/jira/browse/PIG-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1187. --- UTF-8 (international code) breaks with loader when load with schema is specified Key: PIG-1187 URL: https://issues.apache.org/jira/browse/PIG-1187 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Viraj Bhat Assignee: Ashutosh Chauhan Fix For: 0.7.0 I have a set of Pig statements which dump an international dataset. {code} INPUT_OBJECT = load 'internationalcode'; describe INPUT_OBJECT; dump INPUT_OBJECT; {code} Sample output (756a6196-ebcd-4789-ad2f-175e5df65d55,{(labelAaÂâÀ),(labelあいうえお1),(labelஜார்க2),(labeladfadf)}) It works and dumps results, but when I use a schema for loading, it fails. {code} INPUT_OBJECT = load 'internationalcode' AS (object_id:chararray, labels: bag {T: tuple(label:chararray)}); describe INPUT_OBJECT; {code} The error message is as follows: 2010-01-14 02:23:27,320 FATAL org.apache.hadoop.mapred.Child: Error running child : org.apache.pig.data.parser.TokenMgrError: Error: Bailing out of infinite loop caused by repeated empty string matches at line 1, column 21. 
at org.apache.pig.data.parser.TextDataParserTokenManager.TokenLexicalActions(TextDataParserTokenManager.java:620) at org.apache.pig.data.parser.TextDataParserTokenManager.getNextToken(TextDataParserTokenManager.java:569) at org.apache.pig.data.parser.TextDataParser.jj_ntk(TextDataParser.java:651) at org.apache.pig.data.parser.TextDataParser.Tuple(TextDataParser.java:152) at org.apache.pig.data.parser.TextDataParser.Bag(TextDataParser.java:100) at org.apache.pig.data.parser.TextDataParser.Datum(TextDataParser.java:382) at org.apache.pig.data.parser.TextDataParser.Parse(TextDataParser.java:42) at org.apache.pig.builtin.Utf8StorageConverter.parseFromBytes(Utf8StorageConverter.java:68) at org.apache.pig.builtin.Utf8StorageConverter.bytesToBag(Utf8StorageConverter.java:76) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:845) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:250) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:159) Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1171) Top-N queries produce incorrect results when followed by a cross statement
[ https://issues.apache.org/jira/browse/PIG-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1171. --- Top-N queries produce incorrect results when followed by a cross statement -- Key: PIG-1171 URL: https://issues.apache.org/jira/browse/PIG-1171 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-1171.patch ??I am not sure if this is a bug, or something more subtle, but here is the problem that I am having.?? ??When I LOAD a dataset, change it with an ORDER, LIMIT it, then CROSS it with itself, the results are not correct. I expect to see the cross of the limited, ordered dataset, but instead I see the cross of the limited dataset. Effectively, it's like the LIMIT is being excluded.?? ??Example code follows:?? {code} A = load 'foo' as (f1:int, f2:int, f3:int); B = load 'foo' as (f1:int, f2:int, f3:int); a = ORDER A BY f1 DESC; b = ORDER B BY f1 DESC; aa = LIMIT a 1; bb = LIMIT b 1; C = CROSS aa, bb; DUMP C; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1173) pig cannot be built without an internet connection
[ https://issues.apache.org/jira/browse/PIG-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1173. --- pig cannot be built without an internet connection -- Key: PIG-1173 URL: https://issues.apache.org/jira/browse/PIG-1173 Project: Pig Issue Type: Bug Reporter: Jeff Hodges Assignee: Jeff Hodges Priority: Minor Fix For: 0.7.0 Attachments: offlinebuild-v2.patch, offlinebuild.patch Pig's build.xml does not allow for offline building even when it's been built before. This is because the ivy-download target has no conditional associated with it to turn it off. Hadoop seems to be adding an unless=offline attribute to its ivy-download target. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1184) PruneColumns optimization does not handle the case of foreach flatten correctly if flattened bag is not used later
[ https://issues.apache.org/jira/browse/PIG-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1184. --- PruneColumns optimization does not handle the case of foreach flatten correctly if flattened bag is not used later -- Key: PIG-1184 URL: https://issues.apache.org/jira/browse/PIG-1184 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Pradeep Kamath Assignee: Daniel Dai Fix For: 0.7.0 Attachments: PIG-1184-1.patch, PIG-1184-2.patch The following script: {noformat} -e a = load 'input.txt' as (f1:chararray, f2:chararray, f3:bag{t:tuple(id:chararray)}, f4:bag{t:tuple(loc:chararray)}); b = foreach a generate f1, f2, flatten(f3), flatten(f4), 10; b = foreach b generate f1, f2, \$4; dump b; {noformat} gives the following result: (oiue,M,10) {noformat} cat input.txt: oiueM {(3),(4)} {(toronto),(montreal)} {noformat} If the PruneColumns optimization is disabled, we get the right result: (oiue,M,10) (oiue,M,10) (oiue,M,10) (oiue,M,10) The flatten results in 4 records - so the output should contain 4 records. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1189) StoreFunc UDF should ship to the backend automatically without register
[ https://issues.apache.org/jira/browse/PIG-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1189. --- StoreFunc UDF should ship to the backend automatically without register - Key: PIG-1189 URL: https://issues.apache.org/jira/browse/PIG-1189 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.7.0 Attachments: multimapstore.pig, multireducestore.pig, PIG-1189-1.patch, PIG-1189-2.patch, PIG-1189-3.patch, singlemapstore.pig, singlereducestore.pig Pig should ship a store UDF to the backend even if the user does not use register. The prerequisite is that the UDF is in the classpath on the frontend. We made that work for load UDFs in [PIG-881|https://issues.apache.org/jira/browse/PIG-881]; we shall do the same thing for store UDFs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
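In other words, with this fix a script like the following should run without a register statement, provided the jar containing the StoreFunc is already on the frontend classpath (MyStore and its package are hypothetical names for illustration):
{code}
a = LOAD '1.txt';
-- com.example.MyStore is a hypothetical StoreFunc; no register of its jar is needed
STORE a INTO 'out' USING com.example.MyStore();
{code}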
[jira] Closed: (PIG-1194) ERROR 2055: Received Error while processing the map plan
[ https://issues.apache.org/jira/browse/PIG-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1194. --- ERROR 2055: Received Error while processing the map plan Key: PIG-1194 URL: https://issues.apache.org/jira/browse/PIG-1194 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.5.0, 0.6.0 Reporter: Viraj Bhat Assignee: Richard Ding Fix For: 0.7.0 Attachments: inputdata.txt, PIG-1194.patch, PIG-1294_1.patch I have a simple Pig script which takes 3 columns out of which one is null. {code} input = load 'inputdata.txt' using PigStorage() as (col1, col2, col3); a = GROUP input BY (((double) col3)/((double) col2) > .001 OR col1 > 11 ? col1 : -1); b = FOREACH a GENERATE group as col1, SUM(input.col2) as col2, SUM(input.col3) as col3; store b into 'finalresult'; {code} When I run this script I get the following error: ERROR 2055: Received Error while processing the map plan. org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan. at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:277) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) A more useful error message for the purpose of debugging would be helpful. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1200) Using TableInputFormat in HBaseStorage
[ https://issues.apache.org/jira/browse/PIG-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1200. --- Using TableInputFormat in HBaseStorage -- Key: PIG-1200 URL: https://issues.apache.org/jira/browse/PIG-1200 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Fix For: 0.7.0 Attachments: Pig_1200.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1190) Handling of quoted strings in pig-latin/grunt commands
[ https://issues.apache.org/jira/browse/PIG-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1190. --- Handling of quoted strings in pig-latin/grunt commands -- Key: PIG-1190 URL: https://issues.apache.org/jira/browse/PIG-1190 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Ashutosh Chauhan Fix For: 0.7.0 Attachments: correct-testcase.patch, pig-1190.patch, pig-1190_1.patch There is some inconsistency in the way quoted strings are used/handled in pig-latin. In load/store and define-ship commands, files are specified as quoted strings, and the file name is the content within the quotes. But in the case of register, set, and file system commands, if a string is specified in quotes, the quotes are also included as part of the string. This is not only inconsistent, it is also unintuitive. It is also inconsistent with the way the hdfs command line (or bash shell) interprets file names. For example, currently with the command - set job.name 'job123' - the job name is set to 'job123' (including the quotes), not job123. This needs to be fixed, and the above command should be considered equivalent to - set job.name job123. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
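The normalization PIG-1190 asks for can be sketched as a small helper that strips one pair of surrounding quotes from a command argument, so `set job.name 'job123'` behaves like `set job.name job123`. This is an illustrative stand-in, not the actual patch; the class and method names are hypothetical.

```java
// Hypothetical sketch of the quote handling PIG-1190 requests: treat a
// quoted argument in grunt commands (register, set, fs) the same way
// load/store already does, i.e. the value is the content inside the quotes.
public class QuoteNormalizer {

    // Strip one pair of matching single or double quotes, if present;
    // otherwise return the argument unchanged.
    public static String stripQuotes(String arg) {
        if (arg != null && arg.length() >= 2) {
            char first = arg.charAt(0);
            char last = arg.charAt(arg.length() - 1);
            if (first == last && (first == '\'' || first == '"')) {
                return arg.substring(1, arg.length() - 1);
            }
        }
        return arg;
    }

    public static void main(String[] args) {
        // set job.name 'job123' should behave like set job.name job123
        System.out.println(stripQuotes("'job123'")); // job123
        System.out.println(stripQuotes("job123"));   // job123
    }
}
```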
[jira] Closed: (PIG-1198) [zebra] performance improvements
[ https://issues.apache.org/jira/browse/PIG-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1198. --- [zebra] performance improvements Key: PIG-1198 URL: https://issues.apache.org/jira/browse/PIG-1198 Project: Pig Issue Type: Improvement Affects Versions: 0.6.0 Reporter: Yan Zhou Assignee: Yan Zhou Fix For: 0.7.0 Attachments: PIG-1198.patch, PIG-1198.patch, PIG-1198.patch Current input split generation is row-based split on individual TFiles. This leaves the undesired fact that even for TFiles smaller than one block, one split is still generated for each. Consequently, there will be many mappers, in many waves, needed to handle the many small TFiles generated by as many mappers/reducers that wrote the data. This issue can be addressed by generating input splits that can include multiple TFiles. For sorted tables, key distribution generation by table, which is used to generate proper input splits, includes key distributions from column groups even if they are not in the projection. This incurs extra cost to perform unnecessary computations and, more inappropriately, creates unreasonable results in input split generation; For unsorted tables, when a row split is generated on a union of tables, the FileSplits are generated for each table and then lumped together to form the final list of splits for Map/Reduce. This has the undesirable consequence that the number of splits is subject to the number of tables in the table union and not just controlled by the number of splits used by the Map/Reduce framework; The input split's goal size is calculated over all column groups even if some of them are not in the projection; For input splits of multiple files in one column group, all files are opened at startup. This is unnecessary and holds resources from start to end. The files should be opened when needed and closed when no longer needed; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1203) Temporarily disable failed unit test in load-store-redesign branch which have external dependency
[ https://issues.apache.org/jira/browse/PIG-1203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1203. --- Temporarily disable failed unit test in load-store-redesign branch which have external dependency - Key: PIG-1203 URL: https://issues.apache.org/jira/browse/PIG-1203 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.7.0 Attachments: PIG-1203-1.patch In the load-store-redesign branch, two test suites, TestHBaseStorage and TestCounters, always fail. TestHBaseStorage depends on https://issues.apache.org/jira/browse/PIG-1200; TestCounters depends on a future version of hadoop. We disable these two test suites temporarily, and will enable them once the dependent issues are solved. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1209) Port POJoinPackage to proactively spill
[ https://issues.apache.org/jira/browse/PIG-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1209. --- Port POJoinPackage to proactively spill --- Key: PIG-1209 URL: https://issues.apache.org/jira/browse/PIG-1209 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Ashutosh Chauhan Fix For: 0.7.0 Attachments: pig-1209.patch POPackage proactively spills the bag whereas POJoinPackage still uses the SpillableMemoryManager. We should port this to use InternalCachedBag, which proactively spills. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1212) LogicalPlan.replaceAndAddSucessors produce wrong result when successors are null
[ https://issues.apache.org/jira/browse/PIG-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1212. --- LogicalPlan.replaceAndAddSucessors produce wrong result when successors are null Key: PIG-1212 URL: https://issues.apache.org/jira/browse/PIG-1212 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.7.0 Attachments: PIG-1212-1.patch, PIG-1212-2.patch The following script throws an NPE: a = load '1.txt' as (a0:chararray); b = load '2.txt' as (b0:chararray); c = join a by a0, b by b0; d = filter c by a0 == 'a'; explain d; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1204) Pig hangs when joining two streaming relations in local mode
[ https://issues.apache.org/jira/browse/PIG-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1204. --- Pig hangs when joining two streaming relations in local mode Key: PIG-1204 URL: https://issues.apache.org/jira/browse/PIG-1204 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-1204.patch The following script hangs running in local mode when the input files contain many lines (e.g. 10K). The same script works when running in MR mode. {code} A = load 'input1' as (a0, a1, a2); B = stream A through `head -1` as (a0, a1, a2); C = load 'input2' as (a0, a1, a2); D = stream C through `head -1` as (a0, a1, a2); E = join B by a0, D by a0; dump E {code} Here is one stack trace: Thread-13 prio=10 tid=0x09938400 nid=0x1232 in Object.wait() [0x8fffe000..0x8030] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x9b8e0a40 (a org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStream) at java.lang.Object.wait(Object.java:485) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStream.getNextHelper(POStream.java:291) - locked 0x9b8e0a40 (a org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStream) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStream.getNext(POStream.java:214) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:272) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:232) at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:227) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:52) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1207) [zebra] Data sanity check should be performed at the end of writing instead of later at query time
[ https://issues.apache.org/jira/browse/PIG-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1207. --- [zebra] Data sanity check should be performed at the end of writing instead of later at query time --- Key: PIG-1207 URL: https://issues.apache.org/jira/browse/PIG-1207 Project: Pig Issue Type: Improvement Reporter: Yan Zhou Assignee: Yan Zhou Fix For: 0.7.0 Attachments: PIG-1207.patch, PIG-1207.patch Currently the equality check on the number of rows across different column groups is performed at query time. And the error info is sketchy: it only emits a "Column groups are not evenly distributed" message, or worse, throws an IndexOutOfBounds exception from CGScanner.getCGValue. This is because BasicTable.atEnd and BasicTable.getKey, which are called just before BasicTable.getValue, only check the first column group in the projection; any discrepancy in the number of rows per file across multiple column groups in the projection can have BasicTable.atEnd return false and BasicTable.getKey return a key normally while another column group has already exhausted its current file, so the call to its CGScanner.getCGValue throws the exception. This check should also be performed at the end of writing, and the error info should be more informative. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1215) Make Hadoop jobId more prominent in the client log
[ https://issues.apache.org/jira/browse/PIG-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1215. --- Make Hadoop jobId more prominent in the client log -- Key: PIG-1215 URL: https://issues.apache.org/jira/browse/PIG-1215 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Ashutosh Chauhan Fix For: 0.7.0 Attachments: pig-1215.patch, pig-1215.patch, pig-1215_1.patch, pig-1215_3.patch, pig-1215_4.patch This is a request from applications that want to be able to programmatically parse client logs to find hadoop Ids. They would like to see each job id on a separate line in the following format: hadoopJobId: job_123456789 They would also like to see the jobs in the order they are executed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1216) New load store design does not allow Pig to validate inputs and outputs up front
[ https://issues.apache.org/jira/browse/PIG-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1216. --- New load store design does not allow Pig to validate inputs and outputs up front Key: PIG-1216 URL: https://issues.apache.org/jira/browse/PIG-1216 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Alan Gates Assignee: Ashutosh Chauhan Fix For: 0.7.0 Attachments: pig-1216.patch, pig-1216_1.patch In Pig 0.6 and before, Pig attempts to verify existence of inputs and non-existence of outputs during parsing to avoid run time failures when inputs don't exist or outputs can't be overwritten. The downside to this was that Pig assumed all inputs and outputs were HDFS files, which made implementation harder for non-HDFS based load and store functions. In the load store redesign (PIG-966) this was delegated to InputFormats and OutputFormats to avoid this problem and to make use of the checks already being done in those implementations. Unfortunately, for Pig Latin scripts that run more than one MR job, this does not work well. MR does not do input/output verification on all the jobs at once. It does them one at a time. So if a Pig Latin script results in 10 MR jobs and the file to store to at the end already exists, the first 9 jobs will be run before the 10th job discovers that the whole thing was doomed from the beginning. To avoid this a validate call needs to be added to the new LoadFunc and StoreFunc interfaces. Pig needs to pass this method enough information that the load function implementer can delegate to InputFormat.getSplits() and the store function implementer to OutputFormat.checkOutputSpecs() if s/he decides to. Since 90% of all load and store functions use HDFS and PigStorage will also need to, the Pig team should implement a default file existence check on HDFS and make it available as a static method to other Load/Store function implementers. -- This message is automatically generated by JIRA. 
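The "default file existence check" proposed above can be sketched as a static helper that fails fast, before any job runs, when the output location already exists. This is a local-filesystem stand-in only: the real helper would use Hadoop's FileSystem API against HDFS; `java.nio.file` is used here purely to keep the sketch self-contained, and the class and method names are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical sketch of an up-front output validation step, as PIG-1216
// proposes: check the output location before the first MR job is launched,
// instead of letting the last job of a 10-job pipeline discover the conflict.
public class StoreValidation {

    // Throw before any job runs if the output location already exists.
    public static void checkOutputSpec(String output) {
        if (Files.exists(Paths.get(output))) {
            throw new IllegalStateException(
                "Output location already exists: " + output);
        }
    }

    public static void main(String[] args) {
        // A fresh path passes validation; an existing one would throw.
        checkOutputSpec("/tmp/pig-out-" + System.nanoTime());
        System.out.println("output spec ok");
    }
}
```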
- You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1217) [piggybank] evaluation.util.Top is broken
[ https://issues.apache.org/jira/browse/PIG-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1217. --- [piggybank] evaluation.util.Top is broken - Key: PIG-1217 URL: https://issues.apache.org/jira/browse/PIG-1217 Project: Pig Issue Type: Bug Affects Versions: 0.3.0, 0.4.0, site, 0.5.0, 0.6.0, 0.7.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Priority: Minor Fix For: 0.7.0 Attachments: fix_top_udf.diff, fix_top_udf.diff, fix_top_udf.diff The Top udf has been broken for a while, due to an incorrect implementation of getArgToFuncMapping. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1230) Streaming input in POJoinPackage should use nonspillable bag to collect tuples
[ https://issues.apache.org/jira/browse/PIG-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1230. --- Streaming input in POJoinPackage should use nonspillable bag to collect tuples -- Key: PIG-1230 URL: https://issues.apache.org/jira/browse/PIG-1230 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Fix For: 0.7.0 Attachments: pig-1230.patch, pig-1230_1.patch, pig-1230_2.patch The last table of a join statement is streamed through instead of collecting all its tuples in a bag. As a further optimization of that, tuples of that relation are collected in chunks in a bag. Since we don't want to spill the tuples from this bag, NonSpillableBag should be used to hold tuples for this relation. Initially, DefaultDataBag was used, which was later changed to InternalCachedBag as part of PIG-1209. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1218) Use distributed cache to store samples
[ https://issues.apache.org/jira/browse/PIG-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1218. --- Use distributed cache to store samples -- Key: PIG-1218 URL: https://issues.apache.org/jira/browse/PIG-1218 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-1218.patch, PIG-1218_2.patch, PIG-1218_3.patch Currently, in the case of skew join and order by, we use a sample that is just written to the dfs (not the distributed cache) and, as a result, it gets opened and copied around more than necessary. This impacts query performance and also places unnecessary load on the name node. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1220) Document unknown keywords as missing or to do in future
[ https://issues.apache.org/jira/browse/PIG-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1220. --- Document unknown keywords as missing or to do in future --- Key: PIG-1220 URL: https://issues.apache.org/jira/browse/PIG-1220 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.7.0 To get help at the grunt shell I do the following: grunt> touchz 2010-02-04 00:59:28,714 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered IDENTIFIER touchz at line 1, column 1. Was expecting one of: EOF cat ... fs ... cd ... cp ... copyFromLocal ... copyToLocal ... dump ... describe ... aliases ... explain ... help ... kill ... ls ... mv ... mkdir ... pwd ... quit ... register ... rm ... rmf ... set ... illustrate ... run ... exec ... scriptDone ... ... EOL ... ; ... I looked at the code and found that we do nothing at: scriptDone: Is there some future value of that command? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1241) Accumulator is turned on when a map is used with a non-accumulative UDF
[ https://issues.apache.org/jira/browse/PIG-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1241. --- Accumulator is turned on when a map is used with a non-accumulative UDF --- Key: PIG-1241 URL: https://issues.apache.org/jira/browse/PIG-1241 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ying He Assignee: Ying He Fix For: 0.7.0 Attachments: accum.patch An exception is thrown for a script like the following: register /homes/yinghe/owl/string.jar; a = load 'a.txt' as (id, url); b = group a by (id, url); c = foreach b generate COUNT(a), (CHARARRAY) string.URLPARSE(group.url)#'url'; dump c; In this query, URLPARSE() is not accumulative, and it returns a map. The accumulator optimizer fails to check the UDF in this case, and tries to run the job in accumulative mode. A ClassCastException is thrown when trying to cast the UDF to the Accumulator interface. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
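The guard this bug implies can be sketched as an `instanceof` check before enabling accumulative mode, instead of a blind cast that throws ClassCastException. The interface and UDF class names below are illustrative stand-ins, not Pig's real classes.

```java
// Hypothetical sketch of the check PIG-1241 calls for: only run a job in
// accumulative mode when every UDF actually implements the Accumulator
// interface. Blindly casting a non-accumulative UDF throws
// ClassCastException, which is the reported failure.
interface Accumulator {
    void accumulate(Object input);
}

// Stand-in for COUNT-like UDFs that support accumulation.
class CountUdf implements Accumulator {
    public void accumulate(Object input) { /* fold input into a running count */ }
}

// Stand-in for URLPARSE: returns a map, NOT accumulative.
class UrlParseUdf { }

class AccumulatorOptimizer {
    // Safe capability check instead of an unchecked cast.
    static boolean canAccumulate(Object udf) {
        return udf instanceof Accumulator;
    }
}

public class Demo {
    public static void main(String[] args) {
        System.out.println(AccumulatorOptimizer.canAccumulate(new CountUdf()));    // true
        System.out.println(AccumulatorOptimizer.canAccumulate(new UrlParseUdf())); // false
    }
}
```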
[jira] Closed: (PIG-1224) Collected group should change to use new (internal) bag
[ https://issues.apache.org/jira/browse/PIG-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1224. --- Collected group should change to use new (internal) bag --- Key: PIG-1224 URL: https://issues.apache.org/jira/browse/PIG-1224 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Ashutosh Chauhan Fix For: 0.7.0 Attachments: pig-1224.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1240) [Zebra] suggestion to have zebra manifest file contain version and svn-revision etc.
[ https://issues.apache.org/jira/browse/PIG-1240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1240. --- [Zebra] suggestion to have zebra manifest file contain version and svn-revision etc. - Key: PIG-1240 URL: https://issues.apache.org/jira/browse/PIG-1240 Project: Pig Issue Type: Improvement Affects Versions: 0.7.0 Reporter: Gaurav Jain Assignee: Gaurav Jain Priority: Minor Fix For: 0.7.0 Attachments: PIG-1240.patch The Zebra jars' manifest file should contain the version and svn-revision etc. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1226) Need to be able to register jars on the command line
[ https://issues.apache.org/jira/browse/PIG-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1226. --- Need to be able to register jars on the command line Key: PIG-1226 URL: https://issues.apache.org/jira/browse/PIG-1226 Project: Pig Issue Type: Bug Reporter: Alan Gates Assignee: Thejas M Nair Fix For: 0.7.0 Attachments: PIG-1126.patch Currently 'register' can only be done inside a Pig Latin script. Users often run their scripts in different environments, so jar locations or versions may change. But they don't want to edit their script to fit each environment. Instead they could register on the command line, something like: pig -Dpig.additional.jars=my.jar:your.jar script.pig These would not override registers in the Pig Latin script itself. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1248) [piggybank] useful String functions
[ https://issues.apache.org/jira/browse/PIG-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1248. --- [piggybank] useful String functions --- Key: PIG-1248 URL: https://issues.apache.org/jira/browse/PIG-1248 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.7.0 Attachments: PIG_1248.diff, PIG_1248.diff, PIG_1248.diff Pig ships with very few evalFuncs for working with strings. This jira is for adding a few more. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1233) NullPointerException in AVG
[ https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1233. --- NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Ankur Fix For: 0.7.0 Attachments: jira-1233.patch The overridden method getValue() in AVG throws a null pointer exception in case accumulate() is not called, leaving the variable 'intermediateCount' initialized to null. This causes java to throw the exception when it tries to 'unbox' the value for a numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
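The failure mode described above is plain Java auto-unboxing: comparing a null `Long` against a primitive unboxes the wrapper and throws NullPointerException. A minimal standalone reproduction (the field name mirrors the issue's `intermediateCount`; everything else is illustrative):

```java
// Minimal reproduction of the PIG-1233 failure mode: if accumulate() never
// runs, the wrapper field stays null, and the first numeric comparison
// auto-unboxes it, throwing NullPointerException.
public class UnboxNpe {

    static Long intermediateCount = null; // accumulate() was never called

    public static void main(String[] args) {
        try {
            // The comparison forces unboxing: intermediateCount.longValue()
            if (intermediateCount > 0) {
                System.out.println("positive count");
            }
        } catch (NullPointerException e) {
            System.out.println("NPE on unboxing, as the issue reports");
        }
    }
}
```

The fix pattern is simply a null check (or a primitive counter initialized to 0) before any comparison.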
[jira] Closed: (PIG-1250) Make StoreFunc an abstract class and create a mirror interface called StoreFuncInterface
[ https://issues.apache.org/jira/browse/PIG-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1250. --- Make StoreFunc an abstract class and create a mirror interface called StoreFuncInterface Key: PIG-1250 URL: https://issues.apache.org/jira/browse/PIG-1250 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: 0.7.0 Attachments: PIG-1250-2.patch, PIG-1250.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1238) Dump does not respect the schema
[ https://issues.apache.org/jira/browse/PIG-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1238. --- Dump does not respect the schema Key: PIG-1238 URL: https://issues.apache.org/jira/browse/PIG-1238 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-1238.patch For complex data types and certain sequences of operations, dump produces results with a non-existent field in the relation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1243) Passing Complex map types to and from streaming causes a problem
[ https://issues.apache.org/jira/browse/PIG-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1243. --- Passing Complex map types to and from streaming causes a problem Key: PIG-1243 URL: https://issues.apache.org/jira/browse/PIG-1243 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Viraj Bhat Assignee: Richard Ding Fix For: 0.7.0 I have a program which generates different types of Map fields and stores them into PigStorage. {code} A = load '/user/viraj/three.txt' using PigStorage(); B = foreach A generate ['a'#'12'] as b:map[], ['b'#['c'#'12']] as c, ['c'#{(['d'#'15']),(['e'#'16'])}] as d; store B into '/user/viraj/pigtest' using PigStorage(); {code} Now I test the previous output in the below script to make sure I have the right results. I also pass this data to a Perl script and I observe that the complex Map types I have generated are lost when I get the result back. {code} DEFINE CMD `simple.pl` SHIP('simple.pl'); A = load '/user/viraj/pigtest' using PigStorage() as (simpleFields, mapFields, mapListFields); B = foreach A generate $0, $1, $2; dump B; C = foreach A generate (chararray)simpleFields#'a' as value, $0,$1,$2; D = stream C through CMD as (a0:map[], a1:map[], a2:map[]); dump D; {code} dumping B results in: ([a#12],[b#[c#12]],[c#{([d#15]),([e#16])}]) ([a#12],[b#[c#12]],[c#{([d#15]),([e#16])}]) ([a#12],[b#[c#12]],[c#{([d#15]),([e#16])}]) dumping D results in: ([a#12],,) ([a#12],,) ([a#12],,) The Perl script used here is: {code} #!/usr/local/bin/perl use warnings; use strict; while(<STDIN>) { my($bc,$s,$m,$l) = split /\t/; print "$s\t$m\t$l"; } {code} Is there an issue with the handling of complex Map fields within streaming? How can I fix this to obtain the right result? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1234) Unable to create input slice for har:// files
[ https://issues.apache.org/jira/browse/PIG-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai closed PIG-1234. --- Unable to create input slice for har:// files - Key: PIG-1234 URL: https://issues.apache.org/jira/browse/PIG-1234 Project: Pig Issue Type: Bug Reporter: Tsz Wo (Nicholas), SZE Assignee: Pradeep Kamath Fix For: 0.7.0 Attachments: PIG-1234.patch Tried to load har:// files {noformat} grunt> a = LOAD 'har://hdfs-namenode/user/tsz/t20.har/t20' USING PigStorage('\n') AS (line); grunt> dump {noformat} but pig says {noformat} 2010-02-10 18:42:20,750 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2118: Unable to create input slice for: har://hdfs-namenode/user/tsz/t20.har/t20 {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.