[jira] [Commented] (PIG-1429) Add Boolean Data Type to Pig
[ https://issues.apache.org/jira/browse/PIG-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066365#comment-13066365 ] Daniel Dai commented on PIG-1429: - I would vote for the strings "true"/"false" (regardless of case), and null otherwise, for Utf8StorageConverter. > Add Boolean Data Type to Pig > > > Key: PIG-1429 > URL: https://issues.apache.org/jira/browse/PIG-1429 > Project: Pig > Issue Type: New Feature > Components: data >Affects Versions: 0.7.0 >Reporter: Russell Jurney >Assignee: Zhijie Shen > Labels: boolean, gsoc2011, pig, type > Attachments: working_boolean.patch > > Original Estimate: 8h > Remaining Estimate: 8h > > Pig needs a Boolean data type. Pig-1097 is dependent on doing this. > I volunteer. Is there anything beyond the work in src/org/apache/pig/data/ > plus unit tests to make this work? > This is a candidate project for Google summer of code 2011. More information > about the program can be found at http://wiki.apache.org/pig/GSoc2011 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
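The rule proposed in this comment can be sketched in Java roughly as follows. This is only an illustration: `BooleanBytes` and `bytesToBoolean` are hypothetical names, not the actual Utf8StorageConverter code, and decoding the bytes as UTF-8 is an assumption.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Pig: it only illustrates the proposed rule.
final class BooleanBytes {
    private BooleanBytes() {}

    // "true"/"false" in any letter case map to a Boolean; everything else,
    // including null input, maps to null.
    static Boolean bytesToBoolean(byte[] bytes) {
        if (bytes == null) {
            return null;
        }
        // Assumption: the raw bytes hold UTF-8 text.
        String s = new String(bytes, StandardCharsets.UTF_8);
        if (s.equalsIgnoreCase("true")) {
            return Boolean.TRUE;
        }
        if (s.equalsIgnoreCase("false")) {
            return Boolean.FALSE;
        }
        return null;
    }
}
```

Under this rule a numeric-looking value such as "1" yields null rather than true, so the bytes are never interpreted as a number.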
[jira] [Updated] (PIG-1866) Dereference a bag within a tuple does not work
[ https://issues.apache.org/jira/browse/PIG-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Kimball updated PIG-1866: --- Attachment: PIG-1866-4-cdh3-0.8.0.patch Here is a version of patch #4 that applies cleanly to CDH3u0 (pig 0.8.0) > Dereference a bag within a tuple does not work > -- > > Key: PIG-1866 > URL: https://issues.apache.org/jira/browse/PIG-1866 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.9.0 > > Attachments: PIG-1866-1.patch, PIG-1866-2.patch, PIG-1866-3.patch, > PIG-1866-4-cdh3-0.8.0.patch, PIG-1866-4.patch > > > The following script does not work (both in new and old logical plan): > {code} > a = load '1.txt' as (t : tuple(i: int, b1: bag { b_tuple : tuple ( b_str: > chararray) })); > b = foreach a generate t.b1; > dump b; > {code} > 1.txt: > (1,{(one),(two)}) > Error from old logical plan: > java.lang.ClassCastException: org.apache.pig.data.BinSedesTuple cannot be > cast to org.apache.pig.data.DataBag > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:482) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:197) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:480) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:197) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:339) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:237) > at > 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:232) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177) > Error from new logical plan: > java.lang.NullPointerException > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.consumeInputBag(POProject.java:246) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:200) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:339) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:237) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:232) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177) > If we change "b = foreach a generate t.b1;" to "b = foreach a generate t.i;", > it works fine, only refer to a bag does not work. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-1429) Add Boolean Data Type to Pig
[ https://issues.apache.org/jira/browse/PIG-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066358#comment-13066358 ] Zhijie Shen commented on PIG-1429: -- Does anyone have an opinion on casting DataByteArray to Boolean? 1. A DataByteArray can represent a numeric value, in which case a non-zero value should be converted to True. 2. A DataByteArray can also represent a string, in which case the string "true" should be converted to True. However, these two cases conflict to some extent: the same raw DataByteArray can be read both as a non-zero number and as a string other than "true". It is then hard to say whether it should be converted to True or False. > Add Boolean Data Type to Pig > > > Key: PIG-1429 > URL: https://issues.apache.org/jira/browse/PIG-1429 > Project: Pig > Issue Type: New Feature > Components: data >Affects Versions: 0.7.0 >Reporter: Russell Jurney >Assignee: Zhijie Shen > Labels: boolean, gsoc2011, pig, type > Attachments: working_boolean.patch > > Original Estimate: 8h > Remaining Estimate: 8h > > Pig needs a Boolean data type. Pig-1097 is dependent on doing this. > I volunteer. Is there anything beyond the work in src/org/apache/pig/data/ > plus unit tests to make this work? > This is a candidate project for Google summer of code 2011. More information > about the program can be found at http://wiki.apache.org/pig/GSoc2011
[jira] [Commented] (PIG-1904) Default split destination
[ https://issues.apache.org/jira/browse/PIG-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066287#comment-13066287 ] Thejas M Nair commented on PIG-1904: The approach you are proposing for @NonDeterministic udf sounds good. PIG-1904.1.patch looks good. Some comments - I think it is better to retain the restriction that a split needs at least two output aliases. This will prevent split from being used instead of filter, and pig from becoming perl ;). Maybe something like - split_clause : SPLIT rel INTO split_branch ( COMMA split_branch )* ( ( COMMA split_branch ) | ( COMMA split_otherwise ) ) In LogicalPlanBuilder.java, I think it is better to change the assertion to an if(root == null){throw exception;}, as assertions are not enabled by default. > Default split destination > - > > Key: PIG-1904 > URL: https://issues.apache.org/jira/browse/PIG-1904 > Project: Pig > Issue Type: New Feature >Reporter: Daniel Dai > Labels: gsoc2011 > Fix For: 0.10 > > Attachments: PIG-1904.1.patch > > > The "split" statement should have a default destination, e.g.: > {code} > SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6), OTHER otherwise; -- > OTHER has all tuples with f1>=7 && f2!=5 && f3==6 > {code} > This is a candidate project for Google summer of code 2011. More information > about the program can be found at http://wiki.apache.org/pig/GSoc2011
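The remark about assertions can be illustrated with a small Java sketch. The class and method names below are hypothetical and are not taken from LogicalPlanBuilder.java; the choice of IllegalStateException is also an assumption, since the comment only says "throw exception".

```java
// Hypothetical stand-in for a builder method; not Pig's actual code.
final class RootCheck {
    private RootCheck() {}

    // Relies on `assert`, which the JVM skips unless it is started with -ea,
    // so a null root can slip through in a default run.
    static String describeWithAssert(Object root) {
        assert root != null : "root must not be null";
        return "root=" + root;
    }

    // Explicit check: always enforced, regardless of JVM flags.
    static String describeWithCheck(Object root) {
        if (root == null) {
            throw new IllegalStateException("root must not be null");
        }
        return "root=" + root;
    }
}
```

With the explicit check, a missing root fails at the point where it is detected instead of surfacing later as a less informative error.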
[jira] [Updated] (PIG-1904) Default split destination
[ https://issues.apache.org/jira/browse/PIG-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1904: --- Release Note: This feature introduces a new keyword - OTHERWISE, and that is not backward compatible - it can break scripts that use it as an alias. Adding a note in release notes, about how this feature affects backward compatibility. > Default split destination > - > > Key: PIG-1904 > URL: https://issues.apache.org/jira/browse/PIG-1904 > Project: Pig > Issue Type: New Feature >Reporter: Daniel Dai > Labels: gsoc2011 > Fix For: 0.10 > > Attachments: PIG-1904.1.patch > > > The "split" statement should have a default destination, e.g.: > {code} > SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6), OTHER otherwise; -- > OTHER has all tuples with f1>=7 && f2!=5 && f3==6 > {code} > This is a candidate project for Google summer of code 2011. More information > about the program can be found at http://wiki.apache.org/pig/GSoc2011
[jira] [Commented] (PIG-2143) Improvements for PigStorage
[ https://issues.apache.org/jira/browse/PIG-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066275#comment-13066275 ] Thejas M Nair commented on PIG-2143: Thanks for adding the comprehensive documentation and fixing the incorrect old one! Review of PIG-2143.4.patch - - In PigStorage.getSchema(..), it should check (!dontLoadSchema) to decide whether the schema file should be read (instead of (storeschema)). - A test case where pig loads schema with the default constructor will be useful. One of the new test cases in the patch can be modified for this. I think we need one for the -noschema as well. - In javadoc for constructor PigStorage(String delimiter, String options), the line about "-Dprop=value" can be removed as it's not used right now. - A nitpick - In the PigStorage class javadoc, I think 'An optional second constructor' is a bit misleading. There are 3 constructors including the default one, and all 3 constructors are 'optional' :) . Maybe calling it 'Another constructor' is better. > Improvements for PigStorage > --- > > Key: PIG-2143 > URL: https://issues.apache.org/jira/browse/PIG-2143 > Project: Pig > Issue Type: Improvement >Reporter: Dmitriy V. Ryaboy >Assignee: Dmitriy V. Ryaboy > Fix For: 0.10 > > Attachments: PIG-2143.2.diff, PIG-2143.3.patch, PIG-2143.4.patch, > PIG-2143.diff > > > I'd like to propose that we allow for a greater degree of customization in > PigStorage. > An incomplete list of features that we might want to add: > - flag to tell it to overwrite existing output if it exists > - flag to tell it to compress output using gzip|bzip|lzo (currently this can > be achieved by setting the directory name to end in .gz or .bz2, which is a > bit awkward) > - flag to tell it to store the schema and header (perhaps by merging in > PigStorageSchema work?)
[jira] [Updated] (PIG-1942) script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects
[ https://issues.apache.org/jira/browse/PIG-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1942: Status: Patch Available (was: Open) Marking as SubmitPatch so this can be reviewed and committed. > script UDF (jython) should utilize the intended output schema to more > directly convert Py objects to Pig objects > > > Key: PIG-1942 > URL: https://issues.apache.org/jira/browse/PIG-1942 > Project: Pig > Issue Type: Improvement > Components: impl >Affects Versions: 0.8.0, 0.9.0 >Reporter: Woody Anderson >Assignee: Woody Anderson >Priority: Minor > Labels: python, schema, udf > Fix For: 0.10 > > Attachments: 1942.patch, 1942_with_junit.patch > > > from https://issues.apache.org/jira/browse/PIG-1824 > {code} > import re > @outputSchema("y:bag{t:tuple(word:chararray)}") > def strsplittobag(content,regex): > return re.compile(regex).split(content) > {code} > does not work because split returns a list of strings. However, the output > schema is known, and it would be quite simple to implicitly promote the > string element to a tupled element. > also, a list/array/tuple/set etc. are all equally convertible to bag, and > list/array/tuple are equally convertible to Tuple, this conversion can be > done in a much less rigid way with the use of the schema. > this allows much more facile re-use of existing python code and less memory > overhead to create intermediate re-converting of object types. > I have written the code to do this a while back as part of my version of the > jython script framework, i'll isolate that and attach.
[jira] [Updated] (PIG-1973) UDFContext.getUDFContext has a thread race condition around it's ThreadLocal
[ https://issues.apache.org/jira/browse/PIG-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1973: Status: Patch Available (was: Open) Marking as submitpatch so we can get this reviewed and committed. > UDFContext.getUDFContext has a thread race condition around it's ThreadLocal > > > Key: PIG-1973 > URL: https://issues.apache.org/jira/browse/PIG-1973 > Project: Pig > Issue Type: Improvement > Components: impl >Affects Versions: 0.8.0, 0.9.0 >Reporter: Woody Anderson >Assignee: Woody Anderson >Priority: Minor > Attachments: 1973.patch > > > This probably isn't manifesting anywhere, but it's an incorrect use of the > ThreadLocal pattern.
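The issue text does not spell out the race, and the attached patch is not shown here. As a general illustration only, one common way the ThreadLocal pattern goes wrong is creating the ThreadLocal field itself lazily from several threads. A minimal sketch of the usual safe form follows; `PerThreadContext` is a hypothetical class, not Pig's UDFContext.

```java
// Hypothetical context class, for illustration only; not Pig's UDFContext.
final class PerThreadContext {
    // The ThreadLocal is created exactly once, during class initialization,
    // so no thread can see it missing or replace it with another instance.
    private static final ThreadLocal<PerThreadContext> CURRENT =
            ThreadLocal.withInitial(PerThreadContext::new);

    private PerThreadContext() {}

    // Each thread lazily receives its own instance on first access.
    static PerThreadContext get() {
        return CURRENT.get();
    }
}
```

Because the field is `static final`, callers never need a null check or a lock around `get()`.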
[jira] [Resolved] (PIG-1991) Leading Underscore (_) not allowed in schema names
[ https://issues.apache.org/jira/browse/PIG-1991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates resolved PIG-1991. - Resolution: Won't Fix The definition of variable names for Pig is: [a-zA-Z][a-zA-Z0-9]* I don't see any compelling reason to change that. > Leading Underscore (_) not allowed in schema names > -- > > Key: PIG-1991 > URL: https://issues.apache.org/jira/browse/PIG-1991 > Project: Pig > Issue Type: Wish > Components: grunt >Affects Versions: 0.9.0 >Reporter: Viraj Bhat > > I have a Pig script which uses underscore in its schema name (_a) > {code} > a = load 'test.txt' as (_a:long, b:chararray); > dump a; > {code} > This causes an error in Pig: > {quote} > Unexpected character '_' > 2011-04-12 11:58:59,624 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 1200: Unexpected character '_' > {quote} > Stack trace: > Pig Stack Trace > --- > ERROR 1200: Unexpected character '_' > Failed to parse: Unexpected character '_' > at > org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:83) > at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1555) > at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1527) > at org.apache.pig.PigServer.registerQuery(PigServer.java:582) > at > org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:917) > at > org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:176) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:152) > at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:76) > at org.apache.pig.Main.run(Main.java:489) > at org.apache.pig.Main.main(Main.java:108) > > Schema names should be allowed to have underscores. > Viraj -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
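The naming rule quoted in this resolution can be checked mechanically. A minimal Java sketch using the pattern exactly as written in the comment; `NameRule` and `isValidName` are hypothetical names, and this is not Pig's parser code.

```java
import java.util.regex.Pattern;

// Hypothetical checker for the rule quoted in the comment above.
final class NameRule {
    // Pattern copied verbatim from the comment: a letter, then letters or digits.
    private static final Pattern NAME = Pattern.compile("[a-zA-Z][a-zA-Z0-9]*");

    private NameRule() {}

    // True only if the whole string matches the rule.
    static boolean isValidName(String s) {
        return s != null && NAME.matcher(s).matches();
    }
}
```

Applied to the script in the report, `isValidName("_a")` is false while `isValidName("b")` is true, which matches the "Unexpected character '_'" error.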
[jira] [Updated] (PIG-2010) Bundle registered jars via distributed cache
[ https://issues.apache.org/jira/browse/PIG-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-2010: Status: Patch Available (was: Open) Marking this as submitpatch so we can review it. > Bundle registered jars via distributed cache > > > Key: PIG-2010 > URL: https://issues.apache.org/jira/browse/PIG-2010 > Project: Pig > Issue Type: Improvement >Reporter: Dmitriy V. Ryaboy > Attachments: pig-2010.patch > > > Currently registered jars get collapsed into a single job megajar that gets > submitted to Hadoop. > A better pattern would be to take advantage of the distributed cache.
[jira] [Updated] (PIG-2027) NPE if Pig don't have permission for log file
[ https://issues.apache.org/jira/browse/PIG-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-2027: Status: Patch Available (was: Open) Marking as submitpatch so it can get reviewed and committed. > NPE if Pig don't have permission for log file > - > > Key: PIG-2027 > URL: https://issues.apache.org/jira/browse/PIG-2027 > Project: Pig > Issue Type: Bug >Reporter: Daniel Dai >Assignee: Daniel Dai >Priority: Trivial > Fix For: 0.10 > > Attachments: PIG-2027-1.patch > > > If a log file is specified to Pig but Pig doesn't have write permission for it, then on any > failure in the Pig script we get an NPE in addition to the Pig script failure: > 2011-05-02 13:18:36,493 [main] ERROR org.apache.pig.tools.grunt.Grunt - > java.lang.NullPointerException > at org.apache.pig.impl.util.LogUtils.writeLog(LogUtils.java:172) > at org.apache.pig.impl.util.LogUtils.writeLog(LogUtils.java:79) > at > org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:131) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:180) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:152) > at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90) > at org.apache.pig.Main.run(Main.java:554) > at org.apache.pig.Main.main(Main.java:109)
[jira] [Commented] (PIG-2050) Pig shows auto-generated schema name for TOTUPLE in describe
[ https://issues.apache.org/jira/browse/PIG-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066245#comment-13066245 ] Alan Gates commented on PIG-2050: - The issue here is not that the user cannot access the tuple by this name, but that the name shows up in describe. The semantics of Pig Latin are that if the user does not name an expression in foreach and it is not a simple column expression, then it has no name. But describe should not show this internal name that Pig is using, as that is confusing. > Pig shows auto-generated schema name for TOTUPLE in describe > > > Key: PIG-2050 > URL: https://issues.apache.org/jira/browse/PIG-2050 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0, 0.9.0 >Reporter: Richard Ding >Priority: Minor > > Here is the use case: > {code} > grunt> A = load 'data' as (a0, a1, a2); > grunt> B = foreach A generate TOTUPLE(a0, a2); > grunt> describe B > B: {org.apache.pig.builtin.totuple_a0_3: (a0: bytearray,a2: bytearray)} > grunt> C = foreach B generate org.apache.pig.builtin.totuple_a0_3; > 2011-05-06 14:38:14,635 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 1000: Error during parsing. Invalid alias: org in > {org.apache.pig.builtin.totuple_a0_1: (a0: bytearray,a2: bytearray)} > {code} > The workaround is to specify a user-defined schema name: > {code} > grunt> A = load 'data' as (a0, a1, a2); > > grunt> B = foreach A generate TOTUPLE(a0, a2) as aa; > grunt> describe B > B: {aa: (a0: bytearray,a2: bytearray)} > grunt> C = foreach B generate aa; > grunt> > {code}
[jira] [Updated] (PIG-2050) Pig shows auto-generated schema name for TOTUPLE in describe
[ https://issues.apache.org/jira/browse/PIG-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-2050: Summary: Pig shows auto-generated schema name for TOTUPLE in describe (was: Pig can't reference auto-generated schema name for TOTUPLE) > Pig shows auto-generated schema name for TOTUPLE in describe > > > Key: PIG-2050 > URL: https://issues.apache.org/jira/browse/PIG-2050 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0, 0.9.0 >Reporter: Richard Ding >Priority: Minor > > Here is the use case: > {code} > grunt> A = load 'data' as (a0, a1, a2); > grunt> B = foreach A generate TOTUPLE(a0, a2); > grunt> describe B > B: {org.apache.pig.builtin.totuple_a0_3: (a0: bytearray,a2: bytearray)} > grunt> C = foreach B generate org.apache.pig.builtin.totuple_a0_3; > 2011-05-06 14:38:14,635 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 1000: Error during parsing. Invalid alias: org in > {org.apache.pig.builtin.totuple_a0_1: (a0: bytearray,a2: bytearray)} > {code} > The workaround is to specify a user-defined schema name: > {code} > grunt> A = load 'data' as (a0, a1, a2); > > grunt> B = foreach A generate TOTUPLE(a0, a2) as aa; > grunt> describe B > B: {aa: (a0: bytearray,a2: bytearray)} > grunt> C = foreach B generate aa; > grunt> > {code}
[jira] [Updated] (PIG-2051) new LogicalSchema column prune code does not preserve type information for map subfields
[ https://issues.apache.org/jira/browse/PIG-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-2051: Status: Patch Available (was: Open) Marking as Submitpatch so we can get this reviewed and committed. > new LogicalSchema column prune code does not preserve type information for > map subfields > > > Key: PIG-2051 > URL: https://issues.apache.org/jira/browse/PIG-2051 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.10 >Reporter: Woody Anderson >Assignee: Woody Anderson > Fix For: 0.10 > > Attachments: 2051.patch > > > current impl of ColumnPruneVisitor.visit ignores field type info and passes > type BYTEARRAY for all map fields. > the corrected type is pretty easy to fill in, especially since map field info > is only attempted 1 level deep. > i came across this b/c i utilize the type information in the pushProjection > call, and this was previously of the 'correct' type information, the change > over to LogicalSchema caused a regression.
[jira] [Updated] (PIG-2053) PigInputFormat uses class.isAssignableFrom() where instanceof is more appropriate
[ https://issues.apache.org/jira/browse/PIG-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-2053: Status: Patch Available (was: Open) Submitting patch so this can get reviewed and committed. > PigInputFormat uses class.isAssignableFrom() where instanceof is more > appropriate > - > > Key: PIG-2053 > URL: https://issues.apache.org/jira/browse/PIG-2053 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.10 >Reporter: Woody Anderson >Priority: Minor > Labels: newbie > Fix For: 0.10 > > Attachments: 2053.patch > > > This is a code style/quality improvement. > isAssignableFrom is appropriate when the class is not known at compile time, > but assignment needs to be checked. > e.g. foo.getClass().isAssignableFrom(bar.getClass()) > but, if the class of foo is known (e.g. X.class), then instanceof is more > appropriate and readable. > I also made use of De Morgan's laws to simplify the "is combinable" boolean > statement, which is hard to grok as written.
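The distinction drawn in this issue can be shown with a short Java sketch. `TypeChecks` and its methods are hypothetical names chosen for illustration; they are not taken from PigInputFormat or from the attached patch.

```java
// Hypothetical helper contrasting the two checks; not code from the patch.
final class TypeChecks {
    private TypeChecks() {}

    // The target class is known at compile time: instanceof is direct,
    // readable, and already false for null.
    static boolean isNumber(Object o) {
        return o instanceof Number;
    }

    // The target class is only known at run time: isAssignableFrom is needed.
    // Note the direction: target.isAssignableFrom(classOfValue).
    static boolean isInstanceOf(Class<?> target, Object o) {
        return o != null && target.isAssignableFrom(o.getClass());
    }
}
```

For a fixed class the two forms agree, for example `isNumber(1)` and `isInstanceOf(Number.class, 1)` are both true, which is why the compile-time case reads better with `instanceof`.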
[jira] [Updated] (PIG-2077) Project UDF output inside a non-foreach statement fail on 0.8
[ https://issues.apache.org/jira/browse/PIG-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-2077: Status: Patch Available (was: Open) Marking SubmitPatch so we can get this reviewed and committed. > Project UDF output inside a non-foreach statement fail on 0.8 > - > > Key: PIG-2077 > URL: https://issues.apache.org/jira/browse/PIG-2077 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.1 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.1 > > Attachments: PIG-2077-1.patch > > > The following script fail on 0.8: > {code} > A = load '1.txt' as (tracking_id, day:chararray); > B = load '2.txt' as (tracking_id, timestamp:chararray); > C = JOIN A by (tracking_id, day) LEFT OUTER, B by (tracking_id, > STRSPLIT(timestamp, ' ').$0); > explain C; > {code} > Error stack: > Caused by: java.lang.ArrayIndexOutOfBoundsException: -1 > at java.util.ArrayList.get(ArrayList.java:324) > at > org.apache.pig.newplan.logical.expression.ProjectExpression.findReferent(ProjectExpression.java:207) > at > org.apache.pig.newplan.logical.expression.ProjectExpression.getFieldSchema(ProjectExpression.java:121) > at > org.apache.pig.newplan.logical.optimizer.FieldSchemaResetter.execute(SchemaResetter.java:193) > at > org.apache.pig.newplan.logical.expression.AllSameExpressionVisitor.visit(AllSameExpressionVisitor.java:53) > at > org.apache.pig.newplan.logical.expression.ProjectExpression.accept(ProjectExpression.java:75) > at > org.apache.pig.newplan.ReverseDependencyOrderWalker.walk(ReverseDependencyOrderWalker.java:70) > at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50) > at > org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:83) > at > org.apache.pig.newplan.logical.relational.LOJoin.accept(LOJoin.java:149) > at > org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75) > at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50) > at > 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:262) > This is not a problem on 0.9, trunk, since LogicalExpPlanMigrationVistor is > dropped in 0.9. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2124) Script never ending when joining from the same source
[ https://issues.apache.org/jira/browse/PIG-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-2124: Status: Patch Available (was: Open) Marking as ready for review. > Script never ending when joining from the same source > - > > Key: PIG-2124 > URL: https://issues.apache.org/jira/browse/PIG-2124 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.1 >Reporter: Tristan Croiset >Assignee: Daniel Dai > Fix For: 0.10 > > Attachments: PIG-2124-1.patch > > > Considering the following script, it works perfectly fine or the script never > ends depending on the fields used at output. > input ("scores" file) contains: > -- > test1;0.1 > test2;0.9 > test1;0.3 > -- > -- > score_list = LOAD 'scores' USING PigStorage(';') > AS (word: chararray, score: double); > score_list_ = FOREACH score_list GENERATE > word, > score, > 0 AS joinField; > group_score = GROUP score_list ALL; > sum_score = FOREACH group_score GENERATE > 0 AS joinField, > SUM(score_list.score) as scoreTotal; > score_with_sum = JOIN score_list_ BY joinField, sum_score BY joinField; > out = FOREACH score_with_sum GENERATE word, (score / scoreTotal); > DUMP out; > -- > This works fine > But if I change "out" to : out = FOREACH score_with_sum GENERATE word; > Then the script never ends and the output keeps repeating lines likes: > 2011-06-15 15:00:22,536 [SpillThread] INFO org.apache.hadoop.mapred.MapTask > - Finished spill 24 > 2011-06-15 15:00:22,889 [Thread-13] INFO org.apache.hadoop.mapred.MapTask - > Spilling map output: record full = true > 2011-06-15 15:00:22,889 [Thread-13] INFO org.apache.hadoop.mapred.MapTask - > bufstart = 65535810; bufend = 68157240; bufvoid = 99614720 > 2011-06-15 15:00:22,889 [Thread-13] INFO org.apache.hadoop.mapred.MapTask - > kvstart = 327661; kvend = 262124; length = 327680 > 2011-06-15 15:00:22,994 [SpillThread] INFO org.apache.hadoop.mapred.MapTask > - Finished spill 25 > 2011-06-15 15:00:23,345 [Thread-13] INFO 
org.apache.hadoop.mapred.MapTask - > Spilling map output: record full = true > 2011-06-15 15:00:23,345 [Thread-13] INFO org.apache.hadoop.mapred.MapTask - > bufstart = 68157240; bufend = 70778670; bufvoid = 99614720 > 2011-06-15 15:00:23,345 [Thread-13] INFO org.apache.hadoop.mapred.MapTask - > kvstart = 262124; kvend = 196587; length = 327680 > 2011-06-15 15:00:23,447 [SpillThread] INFO org.apache.hadoop.mapred.MapTask > - Finished spill 26 > 2011-06-15 15:00:23,794 [Thread-13] INFO org.apache.hadoop.mapred.MapTask - > Spilling map output: record full = true > 2011-06-15 15:00:23,794 [Thread-13] INFO org.apache.hadoop.mapred.MapTask - > bufstart = 70778670; bufend = 73400100; bufvoid = 99614720 > 2011-06-15 15:00:23,794 [Thread-13] INFO org.apache.hadoop.mapred.MapTask - > kvstart = 196587; kvend = 131050; length = 327680 > 2011-06-15 15:00:23,896 [SpillThread] INFO org.apache.hadoop.mapred.MapTask > - Finished spill 27 > 2011-06-15 15:00:24,243 [Thread-13] INFO org.apache.hadoop.mapred.MapTask - > Spilling map output: record full = true > 2011-06-15 15:00:24,243 [Thread-13] INFO org.apache.hadoop.mapred.MapTask - > bufstart = 73400100; bufend = 76021530; bufvoid = 99614720 > 2011-06-15 15:00:24,243 [Thread-13] INFO org.apache.hadoop.mapred.MapTask - > kvstart = 131050; kvend = 65513; length = 327680 > 2011-06-15 15:00:24,346 [SpillThread] INFO org.apache.hadoop.mapred.MapTask > - Finished spill 28 > 2011-06-15 15:00:24,692 [Thread-13] INFO org.apache.hadoop.mapred.MapTask - > Spilling map output: record full = true > 2011-06-15 15:00:24,692 [Thread-13] INFO org.apache.hadoop.mapred.MapTask - > bufstart = 76021530; bufend = 78642970; bufvoid = 99614720 > 2011-06-15 15:00:24,693 [Thread-13] INFO org.apache.hadoop.mapred.MapTask - > kvstart = 65513; kvend = 327657; length = 327680 > 2011-06-15 15:00:24,793 [SpillThread] INFO org.apache.hadoop.mapred.MapTask > - Finished spill 29 > 2011-06-15 15:00:25,144 [Thread-13] INFO org.apache.hadoop.mapred.MapTask - > Spilling 
map output: record full = true > 2011-06-15 15:00:25,144 [Thread-13] INFO org.apache.hadoop.mapred.MapTask - > bufstart = 78642970; bufend = 81264400; bufvoid = 99614720 > 2011-06-15 15:00:25,144 [Thread-13] INFO org.apache.hadoop.mapred.MapTask - > kvstart = 327657; kvend = 262120; length = 327680 > P.S. I know it's possible to refactor the script using casting to scalar ;) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2149) ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to recreate exception from backed error: Error: Java heap space
[ https://issues.apache.org/jira/browse/PIG-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066222#comment-13066222 ] Alan Gates commented on PIG-2149: - It means Pig ran out of memory. Can you attach your script (or some script that replicates the problem)? Also an idea of how much data you are reading in the script would be helpful. > ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to recreate > exception from backed error: Error: Java heap space > - > > Key: PIG-2149 > URL: https://issues.apache.org/jira/browse/PIG-2149 > Project: Pig > Issue Type: Bug > Components: grunt >Affects Versions: 0.8.0 > Environment: hadoop 0.20.2 > Linux 2.6.18-194.8.1.el5PAE #1 SMP Thu Jul 1 19:46:23 EDT 2010 i686 i686 > i386 GNU/Linux >Reporter: Kim Sang hyun > > Backend error message > - > Error: Java heap space > Pig Stack Trace > --- > ERROR 2997: Unable to recreate exception from backed error: Error: Java heap > space > org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to > recreate exception from backed error: Error: Java heap space > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:221) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:151) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:337) > at > org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:378) > at > org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1198) > at org.apache.pig.PigServer.execute(PigServer.java:1190) > at org.apache.pig.PigServer.access$100(PigServer.java:128) > at org.apache.pig.PigServer$Graph.execute(PigServer.java:1517) > at org.apache.pig.PigServer.executeBatchEx(PigServer.java:362) > at org.apache.pig.PigServer.executeBatch(PigServer.java:329) > at > 
org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:169) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141) > at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90) > at org.apache.pig.Main.run(Main.java:510) > at org.apache.pig.Main.main(Main.java:107) > > Why this error occur? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
Pig 0.9 release take 2
Hi guys,

We have fixed several release blockers since the last attempt, and I think it is time to get this release out! I am planning to start the release process around 2:30. Please let me know before then if you have any concerns.

Please don't make any changes to the 0.9 branch while I roll the release candidate.

Thanks,
Olga
[jira] [Created] (PIG-2170) NPE thrown during illustrate
NPE thrown during illustrate
----------------------------

Key: PIG-2170
URL: https://issues.apache.org/jira/browse/PIG-2170
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.10
Reporter: Mat Kelcey

working with version https://svn.apache.org/repos/asf/pig/trunk@1146777 fetched from git git://git.apache.org/pig.git a7e1228a0fdfe76c3cff0e749e252dba8d387052

using file /tmp/data.tsv

id1	123
id1	234
id2	345
id2	456

this is the most cutdown/simplest script i can make that illustrates (no pun intended) the problem

grunt> data = load '/tmp/data.tsv' as (id:chararray, value:long);
grunt> cogrouped = cogroup data by id;
grunt> exists = foreach cogrouped generate (IsEmpty(data.value) ? 0 : 1) as exists;

grunt> dump exists

is ok but

grunt> illustrate exists

throws

java.lang.NullPointerException
    at org.apache.pig.pen.IllustratorAttacher.visitBinCond(IllustratorAttacher.java:360)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POBinCond.visit(POBinCond.java:145)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POBinCond.visit(POBinCond.java:36)
    at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:69)
    at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:71)
    at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:71)
    at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:71)
    at org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:52)
    at org.apache.pig.pen.IllustratorAttacher.innerPlanAttach(IllustratorAttacher.java:417)
    at org.apache.pig.pen.IllustratorAttacher.visitPOForEach(IllustratorAttacher.java:229)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.visit(POForEach.java:117)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.visit(POForEach.java:47)
    at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:69)
    at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:71)
    at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:71)
    at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:71)
    at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:71)
    at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:71)
    at org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:52)
    at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
    at org.apache.pig.pen.ExampleGenerator.getData(ExampleGenerator.java:246)
    at org.apache.pig.pen.ExampleGenerator.getData(ExampleGenerator.java:238)
    at org.apache.pig.pen.LineageTrimmingVisitor.init(LineageTrimmingVisitor.java:103)
    at org.apache.pig.pen.LineageTrimmingVisitor.<init>(LineageTrimmingVisitor.java:98)
    at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:166)
    at org.apache.pig.PigServer.getExamples(PigServer.java:1201)
    at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:698)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.Illustrate(PigScriptParser.java:591)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:306)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:67)
    at org.apache.pig.Main.run(Main.java:487)
    at org.apache.pig.Main.main(Main.java:108)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

with pig.log containing

java.io.IOException: Exception : null
    at org.apache.pig.PigServer.getExamples(PigServer.java:1207)
    at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:698)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.Illustrate(PigScriptParser.java:591)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:306)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
[jira] [Commented] (PIG-2165) Need a way to deal with params and param_file in embedded pig in python
[ https://issues.apache.org/jira/browse/PIG-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066084#comment-13066084 ] Julien Le Dem commented on PIG-2165: I agree, on second thoughts my second option is a bad idea as the script can be launched either through the command line or programmatically through the java API. In that case the only parameters that are available are the ones passed through -p. There's no good way to reuse sys.argv here, it is not really a good fit as it is a list of strings. Suggestions: - a separate dictionary: Pig.getParameters() ? - placed in global variables > Need a way to deal with params and param_file in embedded pig in python > --- > > Key: PIG-2165 > URL: https://issues.apache.org/jira/browse/PIG-2165 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.9.0 >Reporter: Supreeth > Fix For: 0.10 > > > I am using embedded pig in python and cannot pass param key value pairs to > the python script. The only way to pass params seem to be by passing it in > the bind command. > Is there a plan to have command line parameters to a pig embedded python > script? Similar needs for param_file and using the environment variables. > Thanks > Supreeth -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2159) New logical plan uses incorrect class for SUM causing for ClassCastException
[ https://issues.apache.org/jira/browse/PIG-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066082#comment-13066082 ]

Alan Gates commented on PIG-2159:
---------------------------------

Dmitry, I don't see this as a blocker for 0.9. It does not produce wrong results and users can rewrite their scripts to work around it. I agree it should go on the 0.9 branch and be part of the anticipated 0.9.1 release.

> New logical plan uses incorrect class for SUM causing for ClassCastException
> ----------------------------------------------------------------------------
>
> Key: PIG-2159
> URL: https://issues.apache.org/jira/browse/PIG-2159
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.9.0
> Reporter: Vivek Padmanabhan
> Priority: Blocker
> Fix For: 0.9.0
>
> Attachments: PIG-2159-1.patch, PIG-2159-2.patch
>
>
> Below is my script:
> {code}
> A = load 'input1' using PigStorage(',') as (f1:int,f2:int,f3:int,f4:long,f5:double);
> B = load 'input2' using PigStorage(',') as (f1:int,f2:int,f3:int,f4:long,f5:double);
> C = load 'input_Main' using PigStorage(',') as (f1:int,f2:int,f3:int);
> U = UNION ONSCHEMA A,B;
> J = join C by (f1,f2,f3) LEFT OUTER, U by (f1,f2,f3);
> Porj = foreach J generate C::f1 as f1 ,C::f2 as f2,C::f3 as f3,U::f4 as f4,U::f5 as f5;
> G = GROUP Porj by (f1,f2,f3,f5);
> Final = foreach G generate SUM(Porj.f4) as total;
> dump Final;
> {code}
> The script fails while computing the sum, with a ClassCastException:
> Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Double
>     at org.apache.pig.builtin.DoubleSum$Initial.exec(DoubleSum.java:82)
>     ... 19 more
> This is clearly a bug in the logical plan created in 0.9. The sum operation should have been processed using org.apache.pig.builtin.LongSum, but instead the 0.9 logical plan has used org.apache.pig.builtin.DoubleSum, which is meant for sums of doubles. Hence the ClassCastException.
> The same script works fine with Pig 0.8.
[jira] [Updated] (PIG-1946) HBaseStorage constructor syntax is error prone
[ https://issues.apache.org/jira/browse/PIG-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Graham updated PIG-1946:
-----------------------------

    Attachment: PIG-1946_3.patch

All good suggestions, here's patch #3. The {{--no-patch}} option doesn't seem to be valid in git merge. I tried {{--no-prefix}} but the format still looks gitish. Any other suggestions for an SVN-friendly git patch? I've made manual changes in the past when applying git patches to SVN, but I'd love to find a way to not require that.

> HBaseStorage constructor syntax is error prone
> ----------------------------------------------
>
> Key: PIG-1946
> URL: https://issues.apache.org/jira/browse/PIG-1946
> Project: Pig
> Issue Type: Improvement
> Reporter: Bill Graham
> Assignee: Bill Graham
> Fix For: 0.10
>
> Attachments: PIG-1946_1.patch, PIG-1946_2.patch, PIG-1946_3.patch
>
>
> Using {{HBaseStorage}} like so seems like a reasonable thing to do, but it will yield unexpected results:
> {code}
> STORE result INTO 'hbase://foo' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
>     'info:first_name, info:last_name');
> {code}
> The problem is that a column named {{info:first_name,}} will be created, with the trailing comma included. I've had numerous developers get tripped up on this issue, since everywhere else in Pig variables are separated by commas, so I propose we fix it.
> I propose we trim leading/trailing commas from column names, but I'm open to other ideas.
> Also, should we accept column names that are comma-delimited without spaces?
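The trimming proposed in PIG-1946 can be sketched roughly as follows. This is a Python illustration of the idea only, not the contents of the attached patch; `parse_columns` and its `split_on_commas` flag are hypothetical names introduced here, and the flag stands in for the open question about comma-delimited names without spaces.

```python
def parse_columns(spec, split_on_commas=False):
    """Turn a column spec string into a list of column names.

    Hypothetical helper, not HBaseStorage's real parser. Tokens are
    separated by whitespace; leading/trailing commas are trimmed from
    each token. With split_on_commas=True, commas inside a token are
    treated as separators too.
    """
    names = []
    for token in spec.split():
        if split_on_commas:
            parts = token.split(",")
        else:
            parts = [token.strip(",")]
        # Drop empty pieces left behind by stray commas.
        names.extend(part for part in parts if part)
    return names


print(parse_columns("info:first_name, info:last_name"))
```

With the default flag, a spec such as 'info:a,info:b' stays a single name, which matches the behavior the issue describes before any decision on the last question.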
[jira] [Commented] (PIG-2143) Improvements for PigStorage
[ https://issues.apache.org/jira/browse/PIG-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066061#comment-13066061 ] Raghu Angadi commented on PIG-2143: --- Thanks for the detailed javadoc. noticed the compression changes in the prev patch. PigStorageSchema class does not set "-schema" option. Is that correct? didn't know .pig_schema didn't store the delimiter. > Improvements for PigStorage > --- > > Key: PIG-2143 > URL: https://issues.apache.org/jira/browse/PIG-2143 > Project: Pig > Issue Type: Improvement >Reporter: Dmitriy V. Ryaboy >Assignee: Dmitriy V. Ryaboy > Fix For: 0.10 > > Attachments: PIG-2143.2.diff, PIG-2143.3.patch, PIG-2143.4.patch, > PIG-2143.diff > > > I'd like to propose that we allow for a greater degree of customization in > PigStorage. > An incomplete list features that we might want to add: > - flag to tell it to overwrite existing output if it exists > - flag to tell it to compress output using gzip|bzip|lzo (currently this can > be achieved by setting the directory name to end in .gz or .bz2, which is a > bit awkward) > - flag to tell it to store the schema and header (perhaps by merging in > PigStorageSchema work?) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2165) Need a way to deal with params and param_file in embedded pig in python
[ https://issues.apache.org/jira/browse/PIG-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066048#comment-13066048 ] Alan Gates commented on PIG-2165: - I like the idea of only passing the parameters from either the command line or the parameters file. I don't see why the Python script should care about other parameters that will be Pig specific. Whether those parameters are placed in global variables, in sys.argv, or in a separate dictionary (pig_argv?) I don't have an opinion on. > Need a way to deal with params and param_file in embedded pig in python > --- > > Key: PIG-2165 > URL: https://issues.apache.org/jira/browse/PIG-2165 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.9.0 >Reporter: Supreeth > Fix For: 0.10 > > > I am using embedded pig in python and cannot pass param key value pairs to > the python script. The only way to pass params seem to be by passing it in > the bind command. > Is there a plan to have command line parameters to a pig embedded python > script? Similar needs for param_file and using the environment variables. > Thanks > Supreeth -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
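One way to picture the "separate dictionary" option from this thread is the Python sketch below. It is not Pig's implementation: `collect_params` is a hypothetical helper, the `name=value` file format with `#` comments is an assumption, and letting command-line values override file values is a choice made for the sketch.

```python
def collect_params(argv, param_file_lines=()):
    """Collect parameters into one dictionary (hypothetical helper).

    param_file_lines: lines of a parameter file, assumed to look like
    'name = value', with blank lines and '#' comments ignored.
    argv: command-line arguments; '-p name=value' and '-param name=value'
    pairs are picked up, everything else is ignored.
    """
    params = {}
    for line in param_file_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        params[name.strip()] = value.strip()

    args = iter(argv)
    for arg in args:
        if arg in ("-p", "-param"):
            name, _, value = next(args, "").partition("=")
            if name:
                # Assumption of this sketch: the command line wins.
                params[name.strip()] = value.strip()
    return params


print(collect_params(["-p", "input=/data/a", "script.py"],
                     ["# defaults", "input = /old", "out=/tmp/o"]))
```

An embedded script would then read from the resulting dictionary rather than from sys.argv, which keeps Pig-specific options out of the script's view.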
[jira] [Commented] (PIG-1904) Default split destination
[ https://issues.apache.org/jira/browse/PIG-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066046#comment-13066046 ]

Gianmarco De Francisci Morales commented on PIG-1904:
-----------------------------------------------------

Created PIG-2169 for this. Anyway, given the benefit/cost ratio I wouldn't try to fix it. A Nondeterministic UDF in a Split is probably better expressed as a Sample.
Anyway, I think this simple workaround should work:
{code}
a = LOAD 'a.txt' AS (f1,f2,f3);
b = FOREACH a GENERATE f1, f2, f3, NonDetUDF(f1,f2,f3) AS f4;
SPLIT b INTO c IF f4 < 0.5, D OTHERWISE;
{code}

> Default split destination
> -------------------------
>
> Key: PIG-1904
> URL: https://issues.apache.org/jira/browse/PIG-1904
> Project: Pig
> Issue Type: New Feature
> Reporter: Daniel Dai
> Labels: gsoc2011
> Fix For: 0.10
>
> Attachments: PIG-1904.1.patch
>
>
> The "split" statement would be better with a default destination, e.g.:
> {code}
> SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6), OTHER otherwise; -- OTHER has all tuples with f1>=7 && f2!=5 && f3==6
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011
[jira] [Created] (PIG-2169) Allow Nondeterministic UDFs in Split-Otherwise
Allow Nondeterministic UDFs in Split-Otherwise -- Key: PIG-2169 URL: https://issues.apache.org/jira/browse/PIG-2169 Project: Pig Issue Type: Wish Reporter: Gianmarco De Francisci Morales Priority: Trivial PIG-1904 allows an Otherwise option in Split. Because of how it is implemented, Nondeterministic UDFs are not allowed. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2060) Fix errors in pig grammars reported by ANTLRWorks
[ https://issues.apache.org/jira/browse/PIG-2060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-2060: --- Resolution: Fixed Status: Resolved (was: Patch Available) Patch committed to trunk. Thanks Gianmarco! > Fix errors in pig grammars reported by ANTLRWorks > - > > Key: PIG-2060 > URL: https://issues.apache.org/jira/browse/PIG-2060 > Project: Pig > Issue Type: Bug >Reporter: Gianmarco De Francisci Morales >Assignee: Gianmarco De Francisci Morales >Priority: Minor > Attachments: PIG-2060.1.patch, PIG-2060.patch > > > There are various errors in pig's grammar files highlighted by ANTLRWorks. > In particular, on token MATCHES, ANY and EVAL. > The first one should be removed, as there is already STR_OP_MATCHES, > the second one is an imaginary tokens that should be defined in the > appropriate section. > On the third one I am not sure. > I have been told it is from the old parsers but it is not used anywhere. Is > it correct? > Is it reserved for future uses? Has it anything to do with FUNC_EVAL? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2167) CUBE operation in Pig
[ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066023#comment-13066023 ]

Dmitriy V. Ryaboy commented on PIG-2167:
----------------------------------------

I believe there is value to providing the naive solution and improving on it later, rather than trying to build the optimal plan from the get-go.

Initial (naive) implementation plan:

Add an optional "WITH CUBE" clause to the group operator.

In LogicalPlanBuilder, if "WITH CUBE" is present, insert operators equivalent to the following above the group operator:
{code}
relation = foreach relation generate FLATTEN(CubeDimensions(dim1, dim2, dim3)) as (dim1, dim2, dim3), other_fields;
{code}

It may be desirable in some cases to group by a superset of dimensions one wants to cube on: group by dim1, dim2, dim3 with cube on (dim1, dim2). If we want to support that use case, we simply need to know to call the UDF on (dim1, dim2) and push dim3 into the other_fields list.

Note also that there's a bit of a problem if null values are legitimate values for the dimensions, as we use null to indicate "all". The UDF provided in PIG-2168 allows one to use custom strings instead of null for the "all" marker. We can optionally support this in the grammar, as well.

> CUBE operation in Pig
> ---------------------
>
> Key: PIG-2167
> URL: https://issues.apache.org/jira/browse/PIG-2167
> Project: Pig
> Issue Type: New Feature
> Reporter: Dmitriy V. Ryaboy
> Fix For: 0.10
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
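The naive plan in this comment, including the "group by a superset, cube on a subset" case, might look like the Python sketch below. It only mimics the data flow (expand each row into its cube points, then group and sum); `naive_cube_sum` and the dictionary row format are assumptions of the sketch, not anything in Pig.

```python
from collections import defaultdict
from itertools import product


def naive_cube_sum(rows, cube_dims, other_dims, measure):
    """Sum `measure` over every cube point of `cube_dims`.

    rows: list of dicts (assumed row format for this sketch).
    cube_dims: fields to cube on; None marks "all" for a dimension.
    other_dims: extra group-by fields passed through unchanged.
    """
    totals = defaultdict(int)
    for row in rows:
        # Each cubed dimension contributes either its value or None.
        choices = [(row[d], None) for d in cube_dims]
        passthrough = tuple(row[d] for d in other_dims)
        for point in product(*choices):
            totals[point + passthrough] += row[measure]
    return dict(totals)


rows = [
    {"dim1": "a", "dim2": "x", "dim3": "p", "n": 1},
    {"dim1": "a", "dim2": "y", "dim3": "p", "n": 2},
]
for key, total in sorted(naive_cube_sum(rows, ["dim1", "dim2"], ["dim3"], "n").items(), key=repr):
    print(key, total)
```

If a row already holds None in a cubed dimension, the expansion emits the same point twice and the row is counted twice. That is the null problem the comment raises, and the reason for a configurable "all" marker.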
[jira] [Updated] (PIG-2060) Fix errors in pig grammars reported by ANTLRWorks
[ https://issues.apache.org/jira/browse/PIG-2060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-2060: --- Attachment: PIG-2060.1.patch +1 Regenerated patch for latest svn trunk (PIG-2060.1.patch). > Fix errors in pig grammars reported by ANTLRWorks > - > > Key: PIG-2060 > URL: https://issues.apache.org/jira/browse/PIG-2060 > Project: Pig > Issue Type: Bug >Reporter: Gianmarco De Francisci Morales >Assignee: Gianmarco De Francisci Morales >Priority: Minor > Attachments: PIG-2060.1.patch, PIG-2060.patch > > > There are various errors in pig's grammar files highlighted by ANTLRWorks. > In particular, on token MATCHES, ANY and EVAL. > The first one should be removed, as there is already STR_OP_MATCHES, > the second one is an imaginary tokens that should be defined in the > appropriate section. > On the third one I am not sure. > I have been told it is from the old parsers but it is not used anywhere. Is > it correct? > Is it reserved for future uses? Has it anything to do with FUNC_EVAL? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2161) TOTUPLE should use no-copy tuple creation
[ https://issues.apache.org/jira/browse/PIG-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066016#comment-13066016 ] Thejas M Nair commented on PIG-2161: +1. I don't see a reason why the tuple should be copied. > TOTUPLE should use no-copy tuple creation > - > > Key: PIG-2161 > URL: https://issues.apache.org/jira/browse/PIG-2161 > Project: Pig > Issue Type: Improvement >Reporter: Dmitriy V. Ryaboy >Assignee: Dmitriy V. Ryaboy >Priority: Trivial > Attachments: pig_2161.patch > > > TOTUPLE udf gets an input tuple, creates a new list, puts every field from > the tuple into the list, and creates a new tuple by calling > TupleFactory.newTuple(List) method -- which in turn allocates > *another* list and copies everything in there. > Simply returning the input tuple should be sufficient -- Pig already did the > work of putting the arguments into a tuple. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2168) CubeDimensions UDF
[ https://issues.apache.org/jira/browse/PIG-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-2168: --- Attachment: PIG-2168.patch > CubeDimensions UDF > -- > > Key: PIG-2168 > URL: https://issues.apache.org/jira/browse/PIG-2168 > Project: Pig > Issue Type: Sub-task >Reporter: Dmitriy V. Ryaboy >Assignee: Dmitriy V. Ryaboy > Fix For: 0.10 > > Attachments: PIG-2168.patch > > > A prerequisite for a naive cubing implementation: > A UDF that, given a set of dimensions (a, b, c) generates all the points on > the cube: > (a, b, c), (a, b, null), (a, null, c), (null, b, c), (null, null, c), (a, > null, null), (null, b, null), (null, null, null). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2168) CubeDimensions UDF
[ https://issues.apache.org/jira/browse/PIG-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-2168: --- Status: Patch Available (was: Open) > CubeDimensions UDF > -- > > Key: PIG-2168 > URL: https://issues.apache.org/jira/browse/PIG-2168 > Project: Pig > Issue Type: Sub-task >Reporter: Dmitriy V. Ryaboy >Assignee: Dmitriy V. Ryaboy > Fix For: 0.10 > > Attachments: PIG-2168.patch > > > A prerequisite for a naive cubing implementation: > A UDF that, given a set of dimensions (a, b, c) generates all the points on > the cube: > (a, b, c), (a, b, null), (a, null, c), (null, b, c), (null, null, c), (a, > null, null), (null, b, null), (null, null, null). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-2168) CubeDimensions UDF
CubeDimensions UDF -- Key: PIG-2168 URL: https://issues.apache.org/jira/browse/PIG-2168 Project: Pig Issue Type: Sub-task Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy A prerequisite for a naive cubing implementation: A UDF that, given a set of dimensions (a, b, c) generates all the points on the cube: (a, b, c), (a, b, null), (a, null, c), (null, b, c), (null, null, c), (a, null, null), (null, b, null), (null, null, null). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
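As a rough illustration of what such a UDF computes, here is a Python sketch that enumerates the cube points for a set of dimension values. `cube_dimensions` and its `all_marker` parameter are names made up for this sketch; the real UDF is the Java code in the attached patch, which this does not reproduce.

```python
from itertools import product


def cube_dimensions(dims, all_marker=None):
    """Return every cube point for the given dimension values.

    Each position holds either the original value or `all_marker`,
    so n dimensions yield 2**n points.
    """
    choices = [(value, all_marker) for value in dims]
    return list(product(*choices))


for point in cube_dimensions(("a", "b", "c")):
    print(point)
```

Passing a string such as `all_marker="*"` instead of None is one way to keep genuine null values distinguishable from the "all" marker.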
[jira] [Created] (PIG-2167) CUBE operation in Pig
CUBE operation in Pig - Key: PIG-2167 URL: https://issues.apache.org/jira/browse/PIG-2167 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Fix For: 0.10 Computing aggregates over a cube of several dimensions is a common operation in data warehousing. The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all". A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk We can start with the naive solution which only works for algebraic measures, and work up from there. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-1904) Default split destination
[ https://issues.apache.org/jira/browse/PIG-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065966#comment-13065966 ] Dmitriy V. Ryaboy commented on PIG-1904: Nice catch about @NonDeterministic. Seems like it doesn't work due to the implementation details, the issue isn't fundamental. I'm cool with the partial solution for now, but please file a jira to fix this later. > Default split destination > - > > Key: PIG-1904 > URL: https://issues.apache.org/jira/browse/PIG-1904 > Project: Pig > Issue Type: New Feature >Reporter: Daniel Dai > Labels: gsoc2011 > Fix For: 0.10 > > Attachments: PIG-1904.1.patch > > > "split" statement is better to have a default destination, eg: > {code} > SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6), OTHER otherwise; -- > OTHERS has all tuples with f1>=7 && f2!=5 && f3==6 > {code} > This is a candidate project for Google summer of code 2011. More information > about the program can be found at http://wiki.apache.org/pig/GSoc2011 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-1914) Support load/store JSON data in Pig
[ https://issues.apache.org/jira/browse/PIG-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065961#comment-13065961 ]

Dmitriy V. Ryaboy commented on PIG-1914:
----------------------------------------

Very cool. Some quick code review notes:

Tiny typo here: "e = foreach d generate flatten(men#'value') as val;" -- that should read menu#'value'

{code}
boolean notDone = in.nextKeyValue();
if (!notDone) {
    return null;
}
{code}
Better:
{code}
if (!in.nextKeyValue()) {
    return null;
}
{code}

Parse exceptions: it's better to increment a counter and move on than to break on a bad input string. Throwing an exception kills the whole job. So maybe something like
{code}
t = null;
while (t == null && in.nextKeyValue()) {
    ...
}
return t;
{code}

In flatten_array, if the value is an array, you allocate a new bag, populate it recursively, and add the contents of the new bag to the old bag. Why not skip the object allocation and copy, and simply pass the original bag into the recursive call?

Also: are null values for keys just plain unsupported? You skip them.

setLocation: not that it really matters, but for consistency, you should use PigTextInputFormat instead of PigFileInputFormat here.

schema: probably makes sense to implement getSchema?

> Support load/store JSON data in Pig
> -----------------------------------
>
> Key: PIG-1914
> URL: https://issues.apache.org/jira/browse/PIG-1914
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.8.0, 0.9.0
> Reporter: Chao Tian
> Attachments: PIG-1914.patch
>
>
> The JSON is a commonly used data storage format. It is popular for storing structured data, especially for JavaScript data exchange.
> Pig should have the ability to load/store JSON format data. I plan to write one for the piggy bank.
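The "count bad records and move on" pattern from the review can be sketched in Python as below. It mirrors the shape of the suggested loop rather than the loader in the patch; `next_record` and the `bad_records` counter name are hypothetical.

```python
import json


def next_record(lines, counters):
    """Return the next parseable JSON record, or None at end of input.

    lines: an iterator of input strings, so repeated calls resume
    where the previous call stopped.
    counters: dict used as a stand-in for job counters.
    Note: a line containing the JSON literal null also yields None.
    """
    for line in lines:
        try:
            return json.loads(line)
        except ValueError:
            # Bad input: count it and keep reading instead of failing.
            counters["bad_records"] = counters.get("bad_records", 0) + 1
    return None


counters = {}
source = iter(['{"a": 1}', "not json", '{"b": 2}'])
record = next_record(source, counters)
while record is not None:
    print(record)
    record = next_record(source, counters)
print(counters)
```

Keeping the count visible lets a user notice when a large share of the input was skipped, which a silent skip would hide.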
[jira] [Updated] (PIG-1904) Default split destination
[ https://issues.apache.org/jira/browse/PIG-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1904:
-----------------------------------------------

    Attachment: PIG-1904.1.patch

PIG-1904.1.patch contains the first working implementation of the feature.

The grammar now recognizes statements like:
SPLIT a INTO b IF x1 < 0, c OTHERWISE;
but also like:
SPLIT a INTO b IF x1 < 0;
This is a side-effect of making the otherwise branch optional and is a change from past behavior. It shouldn't be a problem, as the Split maps to a Filter in any case.

Implemented by copying the other LOSplitOutput plans and building a negated disjunction (OR) of the expressions.

Added a unit test for Split-Otherwise.

TODO:
Disable the feature if the expression contains a @NonDeterministic UDF. I plan to do it by spawning a visitor on the expression. The visitor will throw an error and explain the reason in the error message. Is this a reasonable approach?

> Default split destination
> -------------------------
>
> Key: PIG-1904
> URL: https://issues.apache.org/jira/browse/PIG-1904
> Project: Pig
> Issue Type: New Feature
> Reporter: Daniel Dai
> Labels: gsoc2011
> Fix For: 0.10
>
> Attachments: PIG-1904.1.patch
>
>
> The "split" statement would be better with a default destination, e.g.:
> {code}
> SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6), OTHER otherwise; -- OTHER has all tuples with f1>=7 && f2!=5 && f3==6
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011
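The "negated disjunction" approach described in this update can be illustrated with a small Python sketch. It models each branch as an independent filter, as the message says Split does, and is not the LOSplitOutput code itself; `split_with_otherwise` and the dictionary rows are assumptions of the sketch.

```python
def split_with_otherwise(rows, conditions):
    """Filter rows into named branches plus an 'otherwise' branch.

    conditions: dict mapping branch name to a predicate over a row.
    The otherwise branch keeps rows for which NOT (c1 OR c2 OR ...).
    """
    outputs = {
        name: [row for row in rows if cond(row)]
        for name, cond in conditions.items()
    }
    # Every predicate is evaluated a second time here, so a
    # nondeterministic predicate could disagree with its first result.
    outputs["otherwise"] = [
        row for row in rows
        if not any(cond(row) for cond in conditions.values())
    ]
    return outputs


rows = [
    {"f1": 1, "f2": 0, "f3": 6},
    {"f1": 9, "f2": 5, "f3": 6},
    {"f1": 9, "f2": 0, "f3": 6},
    {"f1": 9, "f2": 0, "f3": 7},
]
conditions = {
    "X": lambda r: r["f1"] < 7,
    "Y": lambda r: r["f2"] == 5,
    "Z": lambda r: r["f3"] < 6 or r["f3"] > 6,
}
print(split_with_otherwise(rows, conditions)["otherwise"])
```

The double evaluation is why the TODO singles out @NonDeterministic UDFs: a row could then land in no branch or in both a branch and the otherwise output.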
[jira] [Assigned] (PIG-1429) Add Boolean Data Type to Pig
[ https://issues.apache.org/jira/browse/PIG-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen reassigned PIG-1429: Assignee: Zhijie Shen (was: Russell Jurney) > Add Boolean Data Type to Pig > > > Key: PIG-1429 > URL: https://issues.apache.org/jira/browse/PIG-1429 > Project: Pig > Issue Type: New Feature > Components: data >Affects Versions: 0.7.0 >Reporter: Russell Jurney >Assignee: Zhijie Shen > Labels: boolean, gsoc2011, pig, type > Attachments: working_boolean.patch > > Original Estimate: 8h > Remaining Estimate: 8h > > Pig needs a Boolean data type. Pig-1097 is dependent on doing this. > I volunteer. Is there anything beyond the work in src/org/apache/pig/data/ > plus unit tests to make this work? > This is a candidate project for Google summer of code 2011. More information > about the program can be found at http://wiki.apache.org/pig/GSoc2011 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira