[jira] Updated: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1038:
    Attachment: PIG-1038-3.patch

Attach a patch to address Pradeep's comments.

Optimize nested distinct/sort to use secondary key
--------------------------------------------------
                 Key: PIG-1038
                 URL: https://issues.apache.org/jira/browse/PIG-1038
             Project: Pig
          Issue Type: Improvement
          Components: impl
    Affects Versions: 0.4.0
            Reporter: Olga Natkovich
            Assignee: Daniel Dai
             Fix For: 0.6.0
         Attachments: PIG-1038-1.patch, PIG-1038-2.patch, PIG-1038-3.patch

If a nested foreach plan contains sort/distinct, it is possible to use Hadoop secondary sort instead of SortedDataBag and DistinctDataBag to optimize the query.

Eg1:
A = load 'mydata';
B = group A by $0;
C = foreach B {
    D = order A by $1;
    generate group, D;
}
store C into 'myresult';

We can specify a secondary sort on A.$1, and drop "order A by $1".

Eg2:
A = load 'mydata';
B = group A by $0;
C = foreach B {
    D = A.$1;
    E = distinct D;
    generate group, E;
}
store C into 'myresult';

We can specify a secondary sort key on A.$1, and simplify "D = A.$1; E = distinct D" to a special version of distinct which does not do the sorting.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
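The rewrite the issue describes hinges on Hadoop-style secondary sort: shuffle on a composite (group key, sort key) so that each group's values arrive at the reducer already ordered, making the nested order/distinct free. A minimal, self-contained Java sketch of the idea only; the class and method names are hypothetical, not Pig's actual operators:

```java
import java.util.*;

public class SecondarySortSketch {
    // Each record is [groupKey, value]. Sorting once by the composite key
    // (groupKey, value) mimics the shuffle's secondary sort; grouping then
    // compares groupKey only, so within each group the values arrive already
    // ordered and no SortedDataBag is needed on the reduce side.
    public static Map<String, List<Integer>> groupWithSortedValues(List<int[]> records) {
        records.sort(Comparator.<int[]>comparingInt(r -> r[0])   // primary: group key
                               .thenComparingInt(r -> r[1]));    // secondary: sort key
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        for (int[] r : records) {
            out.computeIfAbsent(String.valueOf(r[0]), k -> new ArrayList<>()).add(r[1]);
        }
        return out;
    }
}
```

The same trick covers the distinct case: once values arrive sorted, duplicates are adjacent and can be dropped in a single pass without a DistinctDataBag.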
[jira] Updated: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1038:
    Status: Patch Available (was: Open)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1038:
    Status: Open (was: Patch Available)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
package org.apache.hadoop.zebra.parse missing
Hi guys,

I checked out Pig from trunk, and found the package org.apache.hadoop.zebra.parse missing. Are you sure this package has been committed? See this link:
http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/zebra/src/java/org/apache/hadoop/zebra/

Min
--
My research interests are distributed systems, parallel computing and bytecode based virtual machines.
My profile: http://www.linkedin.com/in/coderplay
My blog: http://coderplay.javaeye.com
Re: package org.apache.hadoop.zebra.parse missing
The parser package is generated as part of the build. Invoking ant in the contrib/zebra directory should result in the parser package being created at ./src-gen/org/apache/hadoop/zebra/parser

Alan.

On Nov 11, 2009, at 12:54 AM, Min Zhou wrote:
> Hi guys,
> I checked out pig from trunk, and found package org.apache.hadoop.zebra.parse missing. Do you assure this package has been committed? see this link
> http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/zebra/src/java/org/apache/hadoop/zebra/
> Min
[jira] Commented: (PIG-966) Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
[ https://issues.apache.org/jira/browse/PIG-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776509#action_12776509 ]

Alan Gates commented on PIG-966:

Size on disk. It's not quite useless, as it can be used to estimate the number of splits, etc. It should also be possible to estimate size in memory based on size on disk by applying an average explosion factor (about 4x at the moment, I believe).

Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
---------------------------------------------------------------
                 Key: PIG-966
                 URL: https://issues.apache.org/jira/browse/PIG-966
             Project: Pig
          Issue Type: Improvement
          Components: impl
            Reporter: Alan Gates
            Assignee: Alan Gates

I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces significantly. See http://wiki.apache.org/pig/LoadStoreRedesignProposal for full details.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1064) Behaviour of COGROUP with and without schema when using * operator
[ https://issues.apache.org/jira/browse/PIG-1064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776520#action_12776520 ]

Alan Gates commented on PIG-1064:

Why is cogrouping on * without a schema causing trouble? Because we can't guarantee that the inputs have the same number of fields? Why would anyone ever want to cogroup on *? Do we need to spend any effort fixing this?

Behaviour of COGROUP with and without schema when using * operator
------------------------------------------------------------------
                 Key: PIG-1064
                 URL: https://issues.apache.org/jira/browse/PIG-1064
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: 0.6.0
            Reporter: Viraj Bhat
             Fix For: 0.6.0

I have 2 tab-separated files, 1.txt and 2.txt:

$ cat 1.txt
1	2
2	3
$ cat 2.txt
1	2
2	3

I use the COGROUP feature of Pig in the following way:

$ java -cp pig.jar:$HADOOP_HOME org.apache.pig.Main

{code}
grunt> A = load '1.txt';
grunt> B = load '2.txt' as (b0, b1);
grunt> C = cogroup A by *, B by *;
{code}

2009-10-29 12:46:04,150 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1012: Each COGroup input has to have the same number of inner plans
Details at logfile: pig_1256845224752.log

If I reverse the order of the schemas:

{code}
grunt> A = load '1.txt' as (a0, a1);
grunt> B = load '2.txt';
grunt> C = cogroup A by *, B by *;
{code}

2009-10-29 12:49:27,869 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1013: Grouping attributes can either be star (*) or a list of expressions, but not both.
Details at logfile: pig_1256845224752.log

Now running without a schema:

{code}
grunt> A = load '1.txt';
grunt> B = load '2.txt';
grunt> C = cogroup A by *, B by *;
grunt> dump C;
{code}

2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully stored result in: file:/tmp/temp-319926700/tmp-1990275961
2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written : 2
2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 154
2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!

((1,2),{(1,2)},{(1,2)})
((2,3),{(2,3)},{(2,3)})

Is this a bug or a feature?

Viraj

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
[ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776526#action_12776526 ]

Thejas M Nair commented on PIG-1062:

Proposal for sampling in RandomSampleLoader (as well as the SampleLoader class), used for order-by queries:

Problem: With the new interface, we cannot use the old approach of dividing the size of the file by the number of samples required and skipping that many bytes to get a new sample.

Proposal: The approach proposed by Dmitriy for sampling is used:

bq. In getNext(), we can now allocate a buffer for T elements, populate it with the first T tuples, and continue scanning the partition. For every ith next() call, we generate a random number r s.t. 0 <= r < i, and if r < T we insert the new tuple into our buffer at position r. This gives us a nicely random sample of the tuples in the partition.

To avoid parsing all tuples, RecordReader.nextKeyValue() will be called (instead of loader.getNext()) if the current tuple is to be skipped.

bq. It looks like ReduceContext has a getCounter() method. Am I missing a subtlety?

Arun C Murthy (mapreduce committer) has agreed to elaborate on his recommendation on this in the jira.

load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
--------------------------------------------------------------------------------------------------
                 Key: PIG-1062
                 URL: https://issues.apache.org/jira/browse/PIG-1062
             Project: Pig
          Issue Type: Sub-task
            Reporter: Thejas M Nair
            Assignee: Thejas M Nair

This is part of the effort to implement the new load/store interfaces as laid out in http://wiki.apache.org/pig/LoadStoreRedesignProposal . PigStorage and BinStorage are now working. SampleLoader and its subclasses RandomSampleLoader and PoissonSampleLoader need to be changed to work with the new LoadFunc interface. Fixing SampleLoader and RandomSampleLoader will get order-by queries working. PoissonSampleLoader is used by skew join.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
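The buffer-replacement scheme Dmitriy describes above is classic reservoir sampling (Algorithm R). A small stand-alone Java sketch of the technique, as an illustration rather than the actual RandomSampleLoader code; note the standard variant draws r from [0, i] inclusive, which keeps the sample uniform:

```java
import java.util.*;

public class ReservoirSampler {
    // Reservoir sampling: keep a buffer of `capacity` elements; fill it with
    // the first `capacity` items, then for the i-th item (0-based) draw r
    // uniformly from [0, i] and overwrite buffer[r] whenever r < capacity.
    // Every item ends up in the buffer with equal probability.
    public static <E> List<E> sample(Iterator<E> tuples, int capacity, Random rnd) {
        List<E> buffer = new ArrayList<>(capacity);
        long i = 0;
        while (tuples.hasNext()) {
            E t = tuples.next();
            if (i < capacity) {
                buffer.add(t);                                 // fill initial slots
            } else {
                long r = (long) (rnd.nextDouble() * (i + 1));  // 0 <= r <= i
                if (r < capacity) buffer.set((int) r, t);      // replace random slot
            }
            i++;
        }
        return buffer;
    }
}
```

This matches the comment about skipping tuples cheaply: once the reservoir is full, a tuple only needs full parsing when its random index actually lands inside the buffer.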
[jira] Commented: (PIG-1064) Behaviour of COGROUP with and without schema when using * operator
[ https://issues.apache.org/jira/browse/PIG-1064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776525#action_12776525 ]

Pradeep Kamath commented on PIG-1064:

Cogroup needs the same arity for the grouping key from both inputs. If there is a cogroup by *, the '*' needs to be expanded so we know the arity. This is done in ProjectStarTranslator - the current code leaves the '*' as is when there is no schema. This causes problems in the backend - hence the proposed fix to catch this and error out. If we feel that users should not cogroup on '*' we should prevent it in the parser. The proposed fix is easy enough that I don't think we need to restrict the use of '*'.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
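Pradeep's point about arity can be made concrete: expanding '*' into positional projections is only possible when a schema fixes the input's arity. A hypothetical Java sketch of that rule (illustrative, not ProjectStarTranslator's real code):

```java
import java.util.*;

public class StarExpansion {
    // '*' can only be expanded into column projections ($0, $1, ...) when the
    // input's arity is known from its schema. With no schema the arity is
    // unknown, so COGROUP by * cannot be given a fixed number of inner plans
    // and the only safe behavior is to error out.
    public static List<Integer> expandStar(Integer arity) {
        if (arity == null) {
            throw new IllegalStateException("cannot expand '*' without a schema");
        }
        List<Integer> cols = new ArrayList<>();
        for (int i = 0; i < arity; i++) cols.add(i); // project $0 .. $(arity-1)
        return cols;
    }
}
```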
[jira] Commented: (PIG-1064) Behaviour of COGROUP with and without schema when using * operator
[ https://issues.apache.org/jira/browse/PIG-1064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776527#action_12776527 ]

Pradeep Kamath commented on PIG-1064:

The last paragraph in my previous comment should read: If we feel that users should not cogroup on star we should prevent it in the parser. The proposed fix is easy enough that I don't think we need to restrict the use of star.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-979) Accumulator Interface for UDFs
[ https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ying He updated PIG-979:
    Attachment: PIG-979.patch

Patch to address Alan's comments.

Accumulator Interface for UDFs
------------------------------
                 Key: PIG-979
                 URL: https://issues.apache.org/jira/browse/PIG-979
             Project: Pig
          Issue Type: New Feature
            Reporter: Alan Gates
            Assignee: Ying He
         Attachments: PIG-979.patch, PIG-979.patch

Add an accumulator interface for UDFs that would allow them to take a set number of records at a time instead of the entire bag.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1001) Generate more meaningful error message when one input file does not exist
[ https://issues.apache.org/jira/browse/PIG-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776545#action_12776545 ]

Nigel Daley commented on PIG-1001:

Please ensure the issue is Assigned to the patch author when committing. Also, please provide a justification for why there is no unit test, then describe how you DID test it before you uploaded the patch.

Generate more meaningful error message when one input file does not exist
-------------------------------------------------------------------------
                 Key: PIG-1001
                 URL: https://issues.apache.org/jira/browse/PIG-1001
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.4.0
            Reporter: Daniel Dai
             Fix For: 0.6.0
         Attachments: PIG-1001-1.patch, PIG-1001-2.patch

In the following query, if 1.txt does not exist:

a = load '1.txt';
b = group a by $0;
c = group b all;
dump c;

Pig throws the error message "ERROR 2100: file:/tmp/temp155054664/tmp1144108421 does not exist." Pig should handle it with the error message "Input file 1.txt not exist" instead of such confusing messages.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-966) Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
[ https://issues.apache.org/jira/browse/PIG-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776547#action_12776547 ]

Dmitriy V. Ryaboy commented on PIG-966:

I agree, useless was a strong word (in fact I've been assuming it means size on disk and using it to estimate the number of splits in my cbo code already...). The explosion factor is very iffy when we are dealing with compressed data. But, ok, let's not overthink it -- we can save the memory question for the next iteration. I'll edit the wiki to note that mBytes is size on disk.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
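The estimate being discussed is simple arithmetic over the on-disk mBytes value; a toy Java sketch, with the ~4x explosion factor treated purely as an illustrative assumption rather than a number fixed anywhere in Pig's API:

```java
public class SizeEstimate {
    // Rough in-memory footprint from on-disk size, using an assumed average
    // explosion factor (about 4x per the discussion above; illustrative only,
    // and notably unreliable for compressed data, as Dmitriy points out).
    static final double EXPLOSION_FACTOR = 4.0;

    public static long estimateMemoryBytes(long bytesOnDisk) {
        return (long) (bytesOnDisk * EXPLOSION_FACTOR);
    }
}
```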
[jira] Assigned: (PIG-1001) Generate more meaningful error message when one input file does not exist
[ https://issues.apache.org/jira/browse/PIG-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai reassigned PIG-1001:
    Assignee: Daniel Dai

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1001) Generate more meaningful error message when one input file does not exist
[ https://issues.apache.org/jira/browse/PIG-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776558#action_12776558 ]

Daniel Dai commented on PIG-1001:

Hi Nigel,

Since this patch is all about error messages and user experience, it is hard to write a unit test case for it. I tested it manually with the following situations:
1. All jobs are successful
2. One completely failed job, where we stop launching dependent jobs
3. Two independent jobs, one failing, with and without the -F option

All of these give us the desired error messages.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1080) PigStorage may miss records when loading a file
[ https://issues.apache.org/jira/browse/PIG-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1080:
    Resolution: Fixed
        Status: Resolved (was: Patch Available)

Patch committed, thanks Richard!

PigStorage may miss records when loading a file
-----------------------------------------------
                 Key: PIG-1080
                 URL: https://issues.apache.org/jira/browse/PIG-1080
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.6.0
            Reporter: Richard Ding
            Assignee: Richard Ding
         Attachments: PIG-1080.patch, PIG-1080.patch, PIG-1080.patch

When a file is assigned to multiple mappers (one block per mapper), the blocks may not end exactly at a record boundary. Special care is taken to ensure that all records are loaded by the mappers (and exactly once), even records that cross a block boundary. PigStorage, however, doesn't correctly handle the case where a block ends exactly at a record boundary, and this results in missing records.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
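The exactly-once convention the description alludes to can be stated as: a record belongs to the split that contains its first byte, and that split reads past its own end to finish a record that crosses the boundary. A hypothetical Java sketch of that ownership rule (not PigStorage's actual code); the record starting exactly at a split's end byte belongs to the next split, which is precisely the boundary case a loader can get wrong:

```java
import java.util.*;

public class SplitRecordAssignment {
    // Ownership rule: record i (with byte extent [start, end)) is read by the
    // split whose half-open byte range [splitStart, splitEnd) contains the
    // record's start byte. The owning split keeps reading past splitEnd to
    // finish a record that crosses the boundary; a record starting exactly at
    // splitEnd belongs to the next split. Together the splits cover every
    // record exactly once.
    public static List<Integer> recordsForSplit(long splitStart, long splitEnd, long[][] records) {
        List<Integer> owned = new ArrayList<>();
        for (int i = 0; i < records.length; i++) {
            long recStart = records[i][0];
            if (recStart >= splitStart && recStart < splitEnd) owned.add(i);
        }
        return owned;
    }
}
```

With records laid out as {[0,15), [15,30)} and splits [0,15) and [15,30), each split owns exactly one record; mishandling the `recStart == splitEnd` edge would drop or duplicate the second record.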
[jira] Created: (PIG-1086) Nexted sort by * throw exception
Nexted sort by * throw exception
--------------------------------
                 Key: PIG-1086
                 URL: https://issues.apache.org/jira/browse/PIG-1086
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.5.0
            Reporter: Daniel Dai

The following script fails:

a = load '1.txt' as (a0, a1, a2);
b = group a by *;
c = foreach b { d = distinct a; generate d;};
explain c;

Here is the stack:

Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
        at java.util.ArrayList.get(ArrayList.java:324)
        at org.apache.pig.impl.logicalLayer.schema.Schema.getField(Schema.java:752)
        at org.apache.pig.impl.logicalLayer.LOSort.getSortInfo(LOSort.java:332)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:1365)
        at org.apache.pig.impl.logicalLayer.LOSort.visit(LOSort.java:176)
        at org.apache.pig.impl.logicalLayer.LOSort.visit(LOSort.java:43)
        at org.apache.pig.impl.plan.DependencyOrderWalkerWOSeenChk.walk(DependencyOrderWalkerWOSeenChk.java:69)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:1274)
        at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:130)
        at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:45)
        at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
        at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:234)
        at org.apache.pig.PigServer.compilePp(PigServer.java:864)
        at org.apache.pig.PigServer.explain(PigServer.java:583)
        ... 8 more

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1086) Nested sort by * throw exception
[ https://issues.apache.org/jira/browse/PIG-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1086:
    Summary: Nested sort by * throw exception (was: Nexted sort by * throw exception)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1086) Nested sort by * throw exception
[ https://issues.apache.org/jira/browse/PIG-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1086:
    Description:
The following script fails:

A = load '1.txt' as (a0, a1, a2);
B = group A by a0;
C = foreach B { D = order A by *; generate group, D;};
explain C;

Here is the stack:

Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
        at java.util.ArrayList.get(ArrayList.java:324)
        at org.apache.pig.impl.logicalLayer.schema.Schema.getField(Schema.java:752)
        at org.apache.pig.impl.logicalLayer.LOSort.getSortInfo(LOSort.java:332)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:1365)
        at org.apache.pig.impl.logicalLayer.LOSort.visit(LOSort.java:176)
        at org.apache.pig.impl.logicalLayer.LOSort.visit(LOSort.java:43)
        at org.apache.pig.impl.plan.DependencyOrderWalkerWOSeenChk.walk(DependencyOrderWalkerWOSeenChk.java:69)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:1274)
        at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:130)
        at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:45)
        at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
        at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:234)
        at org.apache.pig.PigServer.compilePp(PigServer.java:864)
        at org.apache.pig.PigServer.explain(PigServer.java:583)
        ... 8 more

    was:
The following script fails:

a = load '1.txt' as (a0, a1, a2);
b = group a by *;
c = foreach b { d = distinct a; generate d;};
explain c;

Here is the stack:

Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
        at java.util.ArrayList.get(ArrayList.java:324)
        at org.apache.pig.impl.logicalLayer.schema.Schema.getField(Schema.java:752)
        at org.apache.pig.impl.logicalLayer.LOSort.getSortInfo(LOSort.java:332)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:1365)
        at org.apache.pig.impl.logicalLayer.LOSort.visit(LOSort.java:176)
        at org.apache.pig.impl.logicalLayer.LOSort.visit(LOSort.java:43)
        at org.apache.pig.impl.plan.DependencyOrderWalkerWOSeenChk.walk(DependencyOrderWalkerWOSeenChk.java:69)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:1274)
        at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:130)
        at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:45)
        at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
        at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:234)
        at org.apache.pig.PigServer.compilePp(PigServer.java:864)
        at org.apache.pig.PigServer.explain(PigServer.java:583)
        ... 8 more
[jira] Updated: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat
[ https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Wang updated PIG-1077: --- Release Note: In this jira, we plan to also resolve the dependency issue that Zebra record-based split needs Hadoop TFile split support to work. Because of this dependency, Zebra has to maintain its own copy of the Hadoop jar in svn in order to build. Furthermore, the fact that Zebra currently sits inside Pig in svn, and Pig itself maintains its own copy of the Hadoop jar in its lib directory, makes things even messier. Finally, we note that Zebra is new, changes rapidly, and needs to cut new revisions quickly, while Hadoop and Pig are more mature, move more slowly, and thus can't make new releases for Zebra all the time. After carefully thinking all this through, we plan to fork the TFile part off Hadoop and port it into Zebra's own code base. This will greatly simplify Zebra's build process and also enable it to make quick revisions. Last, we would like to point out that this is a short-term solution for Zebra, and we plan to: 1) port all changes to Zebra TFile back into Hadoop TFile; 2) in the long run, have a single unified solution for this. [Zebra] to support record(row)-based file split in Zebra's TableInputFormat --- Key: PIG-1077 URL: https://issues.apache.org/jira/browse/PIG-1077 Project: Pig Issue Type: New Feature Affects Versions: 0.4.0 Reporter: Chao Wang Assignee: Chao Wang Fix For: 0.6.0 TFile currently supports split by record sequence number (see jira HADOOP-6218). We want to utilize this to provide record(row)-based input split support in Zebra. One prominent benefit is that in cases where we have very large data files, we can create much more fine-grained input splits than before, when we could only create one big split for one big file. In more detail, the new row-based getSplits() works by default (when the user does not specify the number of splits to be generated) as follows: 1) Select the biggest column group in terms of data size, split all of its TFiles according to the hdfs block size (64 MB or 128 MB), and get a list of physical byte offsets as the output per TFile. For example, assume for the 1st TFile we get offset1, offset2, ..., offset10; 2) Invoke TFile.getRecordNumNear(long offset) to get the RecordNum of a key-value pair near each byte offset. For the example above, say we get recordNum1, recordNum2, ..., recordNum10; 3) Stitch [0, recordNum1], [recordNum1+1, recordNum2], ..., [recordNum9+1, recordNum10], [recordNum10+1, lastRecordNum] splits of all column groups, respectively, to form 11 record-based input splits for the 1st TFile. 4) For each input split, create a TFile scanner through TFile.createScannerByRecordNum(long beginRecNum, long endRecNum). Note: the conversion from byte offset to record number is done by each mapper, rather than at job initialization, for performance reasons, since the conversion incurs some TFile reading overhead. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
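Step 3 above can be sketched in isolation: given the record numbers found near each HDFS block boundary and the file's last record number, stitch contiguous, non-overlapping record ranges. The class and method names below are illustrative, not Zebra's actual API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the "stitch" step: boundary record numbers become inclusive
// [begin, end] record ranges covering the whole TFile. N boundaries
// yield N+1 record-based splits.
public class RecordRangeStitcher {
    static List<long[]> stitch(long[] boundaryRecordNums, long lastRecordNum) {
        List<long[]> ranges = new ArrayList<>();
        long begin = 0;
        for (long boundary : boundaryRecordNums) {
            ranges.add(new long[] { begin, boundary });
            begin = boundary + 1; // next range starts right after this one
        }
        ranges.add(new long[] { begin, lastRecordNum }); // tail range
        return ranges;
    }

    public static void main(String[] args) {
        // 2 block boundaries -> 3 record-based splits
        for (long[] r : stitch(new long[] { 10, 20 }, 35)) {
            System.out.println(Arrays.toString(r));
        }
    }
}
```

With the 10 offsets of the example, the same loop produces the 11 splits described in step 3.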
[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776729#action_12776729 ] Pradeep Kamath commented on PIG-1038: - In JobControlCompiler: == {code} 583 jobConf.set("pig.secondarySortOrder", 584 ObjectSerializer.serialize(mro.getSecondarySortOrder())); 585 } {code} Looks like the above should also be set in the case of a non-order-by Mro which uses the secondary key. {code} 638 valuea = ((Tuple)wa.getValueAsPigType()).get(0); {code} We should put a comment explaining that we extract the first field out since that represents the actual group-by key. In SecondaryKeyOptimizer: {code} } else if (mapLeaf instanceof POUnion || mapLeaf instanceof POSplit) { {code} The above should not contain POSplit since POSplit would only occur after multi-query optimization, which happens later. {code} 94 } else if (plan.getRoots().size() != 1) { 95 // POLocalRearrange plan contains more than 1 root. 96 // Probably there is an Expression operator in the local 97 // rearrangement plan, 98 // skip secondary key optimizing 99 return null; {code} Should we do "continue nextPlan" instead of "return null" here, since this is similar to the udf-or-constant-in-local-rearrange case? {code} 105 columnChainInfo.insert(false, columns, DataType.TUPLE); {code} It would be useful to put a comment explaining this is put into the ColumnChainInfo only for comparing that different components of SortKeyInfo are all coming from the same input index. Also, should the datatype be BAG? {code} 118 log.debug(node + " have more than 1 predecessor"); {code} "predecessor" should change to "successor". {code} 217 if (currentNode instanceof POPackage 218 || currentNode instanceof POFilter 219 || currentNode instanceof POLimit) { {code} In line 217 we should ensure we don't optimize when we encounter POJoinPackage, using something like {code} if ((currentNode instanceof POPackage) && !(currentNode instanceof POJoinPackage)) {code} {code} 307 int errorCode = 1000; 327 int errorCode = 1000; 526 int errorCode = 1000; {code} This error code is already in use. {code} 336 } else if (mapLeaf instanceof POUnion || mapLeaf instanceof POSplit) { 337 List<PhysicalOperator> preds = mr.mapPlan 338 .getPredecessors(mapLeaf); 339 for (PhysicalOperator pred : preds) { 340 POLocalRearrange rearrange = (POLocalRearrange) pred; 341 rearrange.setUseSecondaryKey(true); 342 if (rearrange.getIndex() == indexOfRearrangeToChange) // Try to find the POLocalRearrange for the secondary key 351 setSecondaryPlan(mr.mapPlan, rearrange, 352 secondarySortKeyInfo); 353 } 354 } {code} The above should not contain POSplit since POSplit would only occur after multi-query optimization, which happens later. Also, in the if statement on line 342, what if the condition evaluates to false? Should we throw an Exception as earlier in the same method? {code} 530 if (r) 531 sawInvalidPhysicalOper = true; .. 557 if (r) // if we saw physical operator other than project in sort 558 // plan 559 return; {code} At line 559 should we be setting sawInvalidPhysicalOper? General comments: == A comment on ColumnChainInfo and SortKeyInfo explaining how they track the POProjects in the plan would be useful. POMultiQueryPackage should not change since SecondaryKeyOptimizer runs before MultiQueryOptimizer. Optimize nested distinct/sort to use secondary key -- Key: PIG-1038 URL: https://issues.apache.org/jira/browse/PIG-1038 Project: Pig Issue Type: Improvement
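The "special version of distinct" this optimization enables is worth spelling out: once the secondary sort guarantees that a group's values arrive already ordered, distinct no longer needs a DistinctDataBag; it reduces to dropping adjacent duplicates in one streaming pass. A standalone sketch of that idea, not Pig's actual operator:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Streaming distinct over values already sorted by the secondary key:
// a single pass that keeps only the first of each run of equal values.
public class StreamingDistinct {
    static <T> List<T> distinctOnSorted(List<T> sortedValues) {
        List<T> out = new ArrayList<>();
        T prev = null;
        for (T v : sortedValues) {
            if (prev == null || !v.equals(prev)) { // new value: keep it
                out.add(v);
            }
            prev = v;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(distinctOnSorted(Arrays.asList(1, 1, 2, 3, 3, 3)));
        // -> [1, 2, 3]
    }
}
```

This is why the optimizer can simplify "D = A.$1; E = distinct D" once the secondary sort key is in place: the sort work has already been paid for in the shuffle.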
FYI - forking TFile off Hadoop into Zebra
Hi all, In Jira PIG-1077, we, the Zebra team, plan to utilize Hadoop TFile's split-by-record-sequence-number support to provide record(row)-based input split support in Zebra. Here we would like to point out that along the way we plan to also resolve the dependency issue that Zebra record-based split needs Hadoop TFile split support to work. Because of this dependency, Zebra has to maintain its own copy of the Hadoop jar in svn in order to build. Furthermore, the fact that Zebra currently sits inside Pig in svn, and Pig itself maintains its own copy of the Hadoop jar in its lib directory, makes things even messier. Finally, we note that Zebra is new, changes rapidly, and needs to cut new revisions quickly, while Hadoop and Pig are more mature, move more slowly, and thus can't make new releases for Zebra all the time. After carefully thinking all this through, we plan to fork the TFile part off Hadoop and port it into Zebra's own code base. This will greatly simplify Zebra's build process and also enable it to make quick revisions. Last, we would like to point out that this is a short-term solution for Zebra, and we plan to: 1) port all changes to Zebra TFile back into Hadoop TFile; 2) in the long run, have a single unified solution for this. For more information, please see https://issues.apache.org/jira/browse/PIG-1077 We welcome your feedback on this. Regards, Chao
[jira] Created: (PIG-1087) Use Pig's version for Zebra's own version.
Use Pig's version for Zebra's own version. -- Key: PIG-1087 URL: https://issues.apache.org/jira/browse/PIG-1087 Project: Pig Issue Type: Task Affects Versions: 0.4.0 Reporter: Chao Wang Assignee: Chao Wang Fix For: 0.6.0 Zebra is a contrib project of Pig now. It should use Pig's version for its own version. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1064) Behaviour of COGROUP with and without schema when using * operator
[ https://issues.apache.org/jira/browse/PIG-1064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-1064: Assignee: Pradeep Kamath Status: Patch Available (was: Open) The patch implements the proposal to catch the situation wherein the user specifies '*' as the cogrouping key and does not have a schema for the corresponding input to the cogroup. In these situations we issue the error message "Cogroup by * is only allowed if the input has a schema" and error out. Behaviour of COGROUP with and without schema when using * operator Key: PIG-1064 URL: https://issues.apache.org/jira/browse/PIG-1064 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Assignee: Pradeep Kamath Fix For: 0.6.0 Attachments: PIG-1064.patch I have 2 tab-separated files, 1.txt and 2.txt: $ cat 1.txt 1 2 2 3 $ cat 2.txt 1 2 2 3 I use the COGROUP feature of Pig in the following way: $ java -cp pig.jar:$HADOOP_HOME org.apache.pig.Main {code} grunt> A = load '1.txt'; grunt> B = load '2.txt' as (b0, b1); grunt> C = cogroup A by *, B by *; {code} 2009-10-29 12:46:04,150 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1012: Each COGroup input has to have the same number of inner plans Details at logfile: pig_1256845224752.log == If I reverse the order of the schemas: {code} grunt> A = load '1.txt' as (a0, a1); grunt> B = load '2.txt'; grunt> C = cogroup A by *, B by *; {code} 2009-10-29 12:49:27,869 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1013: Grouping attributes can either be star (*) or a list of expressions, but not both. Details at logfile: pig_1256845224752.log == Now running without schemas:
{code} grunt> A = load '1.txt'; grunt> B = load '2.txt'; grunt> C = cogroup A by *, B by *; grunt> dump C; {code} 2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully stored result in: file:/tmp/temp-319926700/tmp-1990275961 2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written : 2 2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 154 2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete! 2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!! ((1,2),{(1,2)},{(1,2)}) ((2,3),{(2,3)},{(2,3)}) == Is this a bug or a feature? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
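The inconsistent errors above come down to how `*` is expanded: with a schema, star expands to the declared columns, so both cogroup inputs get a known number of grouping expressions; without one, that count is unknowable at plan time. A toy illustration of the rule the patch enforces (illustrative only; Pig's real expansion lives in the logical-plan layer):

```java
import java.util.Arrays;
import java.util.List;

// Why "cogroup by *" needs a schema: '*' can only become a concrete
// column list when the input's schema is known; otherwise we reject
// the plan up front instead of failing with a confusing count mismatch.
public class StarExpansion {
    static List<String> expandStar(List<String> schemaOrNull) {
        if (schemaOrNull == null) {
            throw new IllegalArgumentException(
                "Cogroup by * is only allowed if the input has a schema");
        }
        return schemaOrNull; // '*' simply becomes every declared column
    }

    public static void main(String[] args) {
        System.out.println(expandStar(Arrays.asList("b0", "b1"))); // [b0, b1]
    }
}
```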
[jira] Updated: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1038: Status: Open (was: Patch Available) Optimize nested distinct/sort to use secondary key -- Key: PIG-1038 URL: https://issues.apache.org/jira/browse/PIG-1038 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.4.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.6.0 Attachments: PIG-1038-1.patch, PIG-1038-2.patch, PIG-1038-3.patch, PIG-1038-4.patch If nested foreach plan contains sort/distinct, it is possible to use hadoop secondary sort instead of SortedDataBag and DistinctDataBag to optimize the query. Eg1: A = load 'mydata'; B = group A by $0; C = foreach B { D = order A by $1; generate group, D; } store C into 'myresult'; We can specify a secondary sort on A.$1, and drop order A by $1. Eg2: A = load 'mydata'; B = group A by $0; C = foreach B { D = A.$1; E = distinct D; generate group, E; } store C into 'myresult'; We can specify a secondary sort key on A.$1, and simplify D=A.$1; E=distinct D to a special version of distinct, which does not do the sorting. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1038: Status: Patch Available (was: Open) Optimize nested distinct/sort to use secondary key -- Key: PIG-1038 URL: https://issues.apache.org/jira/browse/PIG-1038 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.4.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.6.0 Attachments: PIG-1038-1.patch, PIG-1038-2.patch, PIG-1038-3.patch, PIG-1038-4.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1038: Attachment: PIG-1038-4.patch Attached a new patch in response to Pradeep's comments. Optimize nested distinct/sort to use secondary key -- Key: PIG-1038 URL: https://issues.apache.org/jira/browse/PIG-1038 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.4.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.6.0 Attachments: PIG-1038-1.patch, PIG-1038-2.patch, PIG-1038-3.patch, PIG-1038-4.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: FYI - forking TFile off Hadoop into Zebra
On Wed, Nov 11, 2009 at 18:26, Chao Wang ch...@yahoo-inc.com wrote: Last, we would like to point out that this is a short-term solution for Zebra and we plan to: 1) port all changes to Zebra TFile back into Hadoop TFile. 2) in the long run have a single unified solution for this. Just for clarity: in the long run, as Zebra stabilizes and Pig adopts hadoop-0.22, Zebra will get rid of this fork? Ashutosh
[jira] Commented: (PIG-979) Accumulator Interface for UDFs
[ https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776760#action_12776760 ] Ying He commented on PIG-979: - Performance tests don't show a noticeable difference between trunk and the accumulator patch when calling non-accumulator UDFs. The script used to test performance is: register /homes/yinghe/pig_test/pigperf.jar; register /homes/yinghe/pig_test/string.jar; register /homes/yinghe/pig_test/piggybank.jar; A = load '/user/pig/tests/data/pigmix_large/page_views' using org.apache.pig.test.utils.datagen.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B = foreach A generate user, org.apache.pig.piggybank.evaluation.string.STRINGCAT(user, ip_addr) as id; C = group B by id parallel 10; D = foreach C { generate group, string.BagCount2(B)*string.ColumnLen2(B, 0); } store D into 'test2'; The input data has 100M rows and the output has 57M rows, so the UDFs are called 57M times. The results are: with patch: 5min 14sec; w/o patch: 5min 17sec. Accumulator Interface for UDFs -- Key: PIG-979 URL: https://issues.apache.org/jira/browse/PIG-979 Project: Pig Issue Type: New Feature Reporter: Alan Gates Assignee: Ying He Attachments: PIG-979.patch, PIG-979.patch Add an accumulator interface for UDFs that would allow them to take a set number of records at a time instead of the entire bag. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
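The pattern being benchmarked is easy to see in miniature: instead of materializing the whole bag, the framework feeds the UDF batches of records and asks for the result at the end. The interface below imitates that idea in standalone form; it is not Pig's actual Accumulator API.

```java
import java.util.Arrays;
import java.util.List;

// Accumulator pattern sketch: state lives in the UDF object, each call
// folds in one chunk of records, and getValue() is read once at the end.
public class AccumulatorSketch {
    interface Accumulator<T> {
        void accumulate(List<T> batch); // called once per chunk of records
        T getValue();                   // called after the last chunk
    }

    static class Count implements Accumulator<Long> {
        private long n = 0;
        public void accumulate(List<Long> batch) { n += batch.size(); }
        public Long getValue() { return n; }
    }

    public static void main(String[] args) {
        Count c = new Count();
        c.accumulate(Arrays.asList(1L, 2L)); // first chunk
        c.accumulate(Arrays.asList(3L));     // second chunk
        System.out.println(c.getValue());    // 3
    }
}
```

The benchmark above suggests the batching machinery adds no measurable overhead for UDFs that don't opt in.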
[jira] Commented: (PIG-742) Spaces could be optional in Pig syntax
[ https://issues.apache.org/jira/browse/PIG-742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776767#action_12776767 ] Pradeep Kamath commented on PIG-742: I took a shot at introducing a new production in QueryParser.jjt but it didn't work. I am wondering if this issue is really because javacc's tokenizer needs whitespace to tokenize. Anybody with more experience with javacc want to comment? Here's the patch of what I tried: {code} Index: src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt === --- src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt (revision 834628) +++ src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt (working copy) @@ -979,7 +979,8 @@ | #DIGIT : [0-9] | #SPECIALCHAR : [_] | #FSSPECIALCHAR: [-, :, /] -| IDENTIFIER: ( LETTER )+ ( DIGIT | LETTER | SPECIALCHAR | ::)* +| IDENTIFIER: ( LETTER )+ ( DIGIT | LETTER | SPECIALCHAR | :: )* +| IDENTIFIEREQUAL: ( LETTER )+ ( DIGIT | LETTER | SPECIALCHAR | :: )* (=) } // Define Numeric Constants TOKEN : @@ -1010,12 +1011,15 @@ // Pig has special variables starting with $ TOKEN : { DOLLARVAR : $ INTEGER } +TOKEN : { EQUAL: = } + // Parse is the Starting function.
LogicalPlan Parse() : { LogicalOperator root = null; Token t1; Token t2; + String alias; LogicalPlan lp = new LogicalPlan(); log.trace(Entering Parse); } @@ -1028,7 +1032,8 @@ throw new ParseException( Currently PIG does not support assigning an existing relation ( + t1.image + ) to another alias ( + t2.image + ));}) | LOOKAHEAD(2) - (t1 = IDENTIFIER = root = Expr(lp) ; { + ( +(t1 = IDENTIFIER EQUAL { alias = t1.image;}| t1 = IDENTIFIEREQUAL { alias = t1.image.replaceAll(=, ); }) root = Expr(lp) ; { root.setAlias(t1.image); addAlias(t1.image, root); pigContext.setLastAlias(t1.image); {code} Spaces could be optional in Pig syntax -- Key: PIG-742 URL: https://issues.apache.org/jira/browse/PIG-742 Project: Pig Issue Type: Wish Components: grunt Affects Versions: 0.3.0 Reporter: Viraj Bhat Priority: Minor The following Pig statements generate an error if there is no space between A and = {code} A=load 'quf.txt' using PigStorage() as (q, u, f:long); B = group A by (q); C = foreach B { F = order A by f desc; generate F; }; describe C; dump C; {code} 2009-03-31 17:14:15,959 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered PATH A=load at line 1, column 1. Was expecting one of: EOF cat ... cd ... cp ... copyFromLocal ... copyToLocal ... dump ... describe ... aliases ... explain ... help ... kill ... ls ... mv ... mkdir ... pwd ... quit ... register ... rm ... rmf ... set ... illustrate ... run ... exec ... scriptDone ... ... EOL ... ; ... It would be nice if the parser would not expect these space requirements between an alias and = -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
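The failure reported in this issue is consistent with longest-match (maximal munch) tokenization: a lexer with a greedy PATH-like token swallows "A=load" whole before the grammar ever sees IDENTIFIER followed by "=". The toy lexer below mimics that behavior; it is not Pig's QueryParser, and the token classes are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Maximal-munch demo: because '=' is legal inside the path-like token
// class, "A=load" lexes as ONE token, while "A = load" lexes as three.
public class MaximalMunch {
    // Path-style token (letters plus a few special chars, '=' among them),
    // then a bare '=', then whitespace. The regex engine prefers the
    // longest match at each position.
    private static final Pattern TOKEN =
        Pattern.compile("[A-Za-z][A-Za-z0-9=_/:-]*|=|\\s+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (!m.group().trim().isEmpty()) tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("A = load")); // [A, =, load]
        System.out.println(tokenize("A=load"));   // [A=load] -- one token
    }
}
```

This is why the proposed IDENTIFIEREQUAL token tries to recognize "identifier followed by =" as a single unit and then strip the trailing "=" in the semantic action.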
[jira] Assigned: (PIG-1039) Pig 0.5 Doc Updates
[ https://issues.apache.org/jira/browse/PIG-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-1039: --- Assignee: Corinne Chandel Pig 0.5 Doc Updates --- Key: PIG-1039 URL: https://issues.apache.org/jira/browse/PIG-1039 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.5.0 Reporter: Corinne Chandel Assignee: Corinne Chandel Priority: Blocker Fix For: 0.5.0 Attachments: branch-0.5.patch, trunk.patch Pig 0.5 doc updates (to be applied to Trunk and branch-0.5) 1. updates to tutorial 2. updates to pig latin reference manual 3. updated doc tab to 0.5.0 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1033) javac warnings: deprecated hadoop APIs
[ https://issues.apache.org/jira/browse/PIG-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-1033: --- Assignee: Daniel Dai javac warnings: deprecated hadoop APIs -- Key: PIG-1033 URL: https://issues.apache.org/jira/browse/PIG-1033 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.4.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.6.0 Attachments: PIG-1033-1.patch Suppress javac warnings related to deprecated hadoop APIs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: package org.apache.hadoop.zebra.parse missing
Thanks, Alan. On Thu, Nov 12, 2009 at 12:51 AM, Alan Gates ga...@yahoo-inc.com wrote: The parser package is generated as part of the build. Invoking ant in the contrib/zebra directory should result in the parser package being created at ./src-gen/org/apache/hadoop/zebra/parser Alan. On Nov 11, 2009, at 12:54 AM, Min Zhou wrote: Hi guys, I checked out pig from trunk, and found the package org.apache.hadoop.zebra.parse missing. Are you sure this package has been committed? See this link: http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/zebra/src/java/org/apache/hadoop/zebra/ Min -- My research interests are distributed systems, parallel computing and bytecode based virtual machine. My profile: http://www.linkedin.com/in/coderplay My blog: http://coderplay.javaeye.com
[jira] Updated: (PIG-1085) Pass JobConf and UDF specific configuration information to UDFs
[ https://issues.apache.org/jira/browse/PIG-1085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1085: Status: Open (was: Patch Available) Pass JobConf and UDF specific configuration information to UDFs --- Key: PIG-1085 URL: https://issues.apache.org/jira/browse/PIG-1085 Project: Pig Issue Type: New Feature Components: impl Reporter: Alan Gates Assignee: Alan Gates Attachments: udfconf.patch Users have long asked for a way to get the JobConf structure in their UDFs. It would also be nice to have a way to pass properties between the front end and back end so that UDFs can store state during parse time and use it at runtime. This patch does part of what is proposed in PIG-602, but not all of it. It does not provide a way to give user specified configuration files to UDFs. So I will mark 602 as depending on this bug, but it isn't a duplicate. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1085) Pass JobConf and UDF specific configuration information to UDFs
[ https://issues.apache.org/jira/browse/PIG-1085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1085: Status: Patch Available (was: Open) Uploading a new patch that addresses the javac warnings and release audit issues. If I read the output correctly, all of Pig's unit tests passed in the previous run. Pass JobConf and UDF specific configuration information to UDFs --- Key: PIG-1085 URL: https://issues.apache.org/jira/browse/PIG-1085 Project: Pig Issue Type: New Feature Components: impl Reporter: Alan Gates Assignee: Alan Gates Attachments: udfconf-2.patch, udfconf.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
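The front-end/back-end handoff this issue describes follows a common pattern: the UDF stores state into a per-process Properties bag at parse time, the bag is serialized with the job, and the same UDF reads it back at run time. The singleton below is a hypothetical stand-in for the real mechanism (names are not Pig's API).

```java
import java.util.Properties;

// Sketch of front-end-to-back-end property passing for UDFs. In a real
// system the Properties would be serialized into the job configuration
// between the two phases; here both phases share one JVM for brevity.
public class UdfContextSketch {
    private static final UdfContextSketch INSTANCE = new UdfContextSketch();
    private final Properties udfProps = new Properties();

    static UdfContextSketch get() { return INSTANCE; }

    Properties properties() { return udfProps; }

    /** Write at "parse time", read back at "run time". */
    static String roundTrip(String key, String value) {
        get().properties().setProperty(key, value);  // front end writes
        return get().properties().getProperty(key);  // back end reads
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("my.udf.delimiter", ","));
    }
}
```

The key design point is scoping: each UDF gets its own namespace of properties so two UDFs cannot clobber each other's state.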
[jira] Assigned: (PIG-1032) FINDBUGS: DM_STRING_CTOR: Method invokes inefficient new String(String) constructor
[ https://issues.apache.org/jira/browse/PIG-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-1032: --- Assignee: Olga Natkovich FINDBUGS: DM_STRING_CTOR: Method invokes inefficient new String(String) constructor --- Key: PIG-1032 URL: https://issues.apache.org/jira/browse/PIG-1032 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Olga Natkovich Attachments: PIG-1032.patch Dm Method org.apache.pig.backend.executionengine.PigSlice.init(DataStorage) invokes toString() method on a String Dm org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.copyHadoopConfLocally(String) invokes inefficient new String(String) constructor Dm org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getFirstLineFromMessage(String) invokes inefficient new String(String) constructor Dm org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.BinaryComparisonOperator.initializeRefs() invokes inefficient Boolean constructor; use Boolean.valueOf(...) instead Dm org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.ExpressionOperator.clone() invokes inefficient new String(String) constructor Dm org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(String) invokes inefficient new String(String) constructor Dm org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.PONot.getNext(Boolean) invokes inefficient Boolean constructor; use Boolean.valueOf(...) instead Dm org.apache.pig.backend.hadoop.executionengine.physicalLayer.plans.PhysicalPlan.clone() invokes inefficient new String(String) constructor Dm new org.apache.pig.data.TimestampedTuple(String, String, int, SimpleDateFormat) invokes inefficient new String(String) constructor Dm org.apache.pig.impl.io.PigNullableWritable.toString() invokes inefficient new String(String) constructor Dm org.apache.pig.impl.logicalLayer.LOForEach.clone() invokes inefficient Boolean constructor; use Boolean.valueOf(...) instead Dm org.apache.pig.impl.logicalLayer.LOGenerate.clone() invokes inefficient Boolean constructor; use Boolean.valueOf(...) instead Dm org.apache.pig.impl.logicalLayer.LogicalPlan.clone() invokes inefficient new String(String) constructor Dm org.apache.pig.impl.logicalLayer.LOSort.clone() invokes inefficient Boolean constructor; use Boolean.valueOf(...) instead Dm org.apache.pig.impl.logicalLayer.optimizer.ImplicitSplitInserter.transform(List) invokes inefficient Boolean constructor; use Boolean.valueOf(...) instead Dm org.apache.pig.impl.logicalLayer.RemoveRedundantOperators.visit(LOProject) invokes inefficient new String(String) constructor Dm org.apache.pig.impl.logicalLayer.schema.Schema.getField(String) invokes inefficient new String(String) constructor Dm org.apache.pig.impl.logicalLayer.schema.Schema.reconcile(Schema) invokes inefficient new String(String) constructor Dm org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.insertCastForEachInBetweenIfNecessary(LogicalOperator, LogicalOperator, Schema) invokes inefficient Boolean constructor; use Boolean.valueOf(...) instead Dm org.apache.pig.impl.util.SpillableMemoryManager.handleNotification(Notification, Object) forces garbage collection; extremely dubious except in benchmarking code Dm org.apache.pig.pen.AugmentBaseDataVisitor.GetLargerValue(Object) invokes inefficient new String(String) constructor Dm org.apache.pig.pen.AugmentBaseDataVisitor.GetSmallerValue(Object) invokes inefficient new String(String) constructor Dm org.apache.pig.tools.cmdline.CmdLineParser.getNextOpt() invokes inefficient new String(String) constructor Dm org.apache.pig.tools.parameters.PreprocessorContext.substitute(String) invokes inefficient new String(String) constructor -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
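The two constructor patterns flagged throughout the list above are worth a side-by-side look: Strings are immutable, so new String(String) buys nothing over the original reference, and Boolean.valueOf(...) returns one of two cached instances instead of allocating.

```java
// DM_STRING_CTOR and the Boolean-constructor warning, with the
// preferred alternatives shown next to each.
public class CtorFixes {
    public static void main(String[] args) {
        String s = "abc";
        String copy = new String(s); // DM_STRING_CTOR: pointless copy
        System.out.println(copy.equals(s)); // true: same contents
        System.out.println(copy == s);      // false: needlessly distinct object

        Boolean b = Boolean.valueOf(true);  // preferred over new Boolean(true)
        System.out.println(b == Boolean.TRUE); // true: valueOf returns the cached instance
    }
}
```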
[jira] Updated: (PIG-1085) Pass JobConf and UDF specific configuration information to UDFs
[ https://issues.apache.org/jira/browse/PIG-1085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1085: Attachment: udfconf-2.patch Pass JobConf and UDF specific configuration information to UDFs --- Key: PIG-1085 URL: https://issues.apache.org/jira/browse/PIG-1085 Project: Pig Issue Type: New Feature Components: impl Reporter: Alan Gates Assignee: Alan Gates Attachments: udfconf-2.patch, udfconf.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1088) change merge join and merge join indexer to work with new LoadFunc interface
change merge join and merge join indexer to work with new LoadFunc interface Key: PIG-1088 URL: https://issues.apache.org/jira/browse/PIG-1088 Project: Pig Issue Type: Sub-task Reporter: Thejas M Nair Assignee: Thejas M Nair -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1018) FINDBUGS: NM_FIELD_NAMING_CONVENTION: Field names should start with a lower case letter
[ https://issues.apache.org/jira/browse/PIG-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-1018: --- Assignee: Olga Natkovich FINDBUGS: NM_FIELD_NAMING_CONVENTION: Field names should start with a lower case letter --- Key: PIG-1018 URL: https://issues.apache.org/jira/browse/PIG-1018 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Olga Natkovich Attachments: PIG-1018.patch Nm The field name org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.LogToPhyMap doesn't start with a lower case letter Nm The method name org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.CreateTuple(Object[]) doesn't start with a lower case letter Nm The class name org.apache.pig.backend.hadoop.executionengine.physicalLayer.util.operatorHelper doesn't start with an upper case letter Nm Class org.apache.pig.impl.util.WrappedIOException is not derived from an Exception, even though it is named as such Nm The method name org.apache.pig.pen.EquivalenceClasses.GetEquivalenceClasses(LogicalOperator, Map) doesn't start with a lower case letter Nm The field name org.apache.pig.pen.util.DisplayExamples.Result doesn't start with a lower case letter Nm The method name org.apache.pig.pen.util.DisplayExamples.PrintSimple(LogicalOperator, Map) doesn't start with a lower case letter Nm The method name org.apache.pig.pen.util.DisplayExamples.PrintTabular(LogicalPlan, Map) doesn't start with a lower case letter Nm The method name org.apache.pig.tools.parameters.TokenMgrError.LexicalError(boolean, int, int, int, String, char) doesn't start with a lower case letter -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1013) FINDBUGS: DMI_INVOKING_TOSTRING_ON_ARRAY: Invocation of toString on an array
[ https://issues.apache.org/jira/browse/PIG-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-1013: --- Assignee: Olga Natkovich FINDBUGS: DMI_INVOKING_TOSTRING_ON_ARRAY: Invocation of toString on an array Key: PIG-1013 URL: https://issues.apache.org/jira/browse/PIG-1013 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Olga Natkovich Attachments: PIG-1013.patch DMI Invocation of toString on stackTraceLines in org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getExceptionFromStrings(String[], int) DMI Invocation of toString on b in org.apache.pig.builtin.Utf8StorageConverter.bytesToBag(byte[]) DMI Invocation of toString on b in org.apache.pig.builtin.Utf8StorageConverter.bytesToDouble(byte[]) DMI Invocation of toString on b in org.apache.pig.builtin.Utf8StorageConverter.bytesToFloat(byte[]) DMI Invocation of toString on b in org.apache.pig.builtin.Utf8StorageConverter.bytesToInteger(byte[]) DMI Invocation of toString on b in org.apache.pig.builtin.Utf8StorageConverter.bytesToLong(byte[]) DMI Invocation of toString on b in org.apache.pig.builtin.Utf8StorageConverter.bytesToMap(byte[]) DMI Invocation of toString on b in org.apache.pig.builtin.Utf8StorageConverter.bytesToTuple(byte[]) DMI Invocation of toString on args in org.apache.pig.impl.PigContext.instantiateFuncFromSpec(FuncSpec) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
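What DMI_INVOKING_TOSTRING_ON_ARRAY flags: arrays inherit Object.toString, which prints a type tag and identity hash rather than the contents. Arrays.toString is the intended call, as this small demo shows.

```java
import java.util.Arrays;

// toString() on an array vs. Arrays.toString(array).
public class ArrayToString {
    public static void main(String[] args) {
        String[] lines = { "first", "second" };
        System.out.println(lines.toString());       // [Ljava.lang.String;@... (useless)
        System.out.println(Arrays.toString(lines)); // [first, second]
    }
}
```

For nested arrays, Arrays.deepToString is the analogous fix.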
[jira] Assigned: (PIG-1011) FINDBUGS: SE_NO_SERIALVERSIONID: Class is Serializable, but doesn't define serialVersionUID
[ https://issues.apache.org/jira/browse/PIG-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-1011: --- Assignee: Olga Natkovich FINDBUGS: SE_NO_SERIALVERSIONID: Class is Serializable, but doesn't define serialVersionUID --- Key: PIG-1011 URL: https://issues.apache.org/jira/browse/PIG-1011 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Olga Natkovich Attachments: PIG-1011.patch SnVI org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODistinct is Serializable; consider declaring a serialVersionUID SnVI org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PORead is Serializable; consider declaring a serialVersionUID -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
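The fix this warning asks for is a one-line declaration; a minimal sketch, with a hypothetical PODemo standing in for operators like PODistinct and PORead:

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

// SE_NO_SERIALVERSIONID fix in miniature: declare an explicit serialVersionUID
// so the serialized form does not break whenever the compiler-generated UID
// changes (e.g. after adding a method).
public class PODemo implements Serializable {
    private static final long serialVersionUID = 1L; // explicit, stable across builds
    private int requestedParallelism;

    public static void main(String[] args) {
        // Serialization now uses the declared UID.
        System.out.println(ObjectStreamClass.lookup(PODemo.class).getSerialVersionUID()); // 1
    }
}
```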
[jira] Assigned: (PIG-1015) [piggybank] DateExtractor should take into account timezones
[ https://issues.apache.org/jira/browse/PIG-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-1015: --- Assignee: Dmitriy V. Ryaboy [piggybank] DateExtractor should take into account timezones Key: PIG-1015 URL: https://issues.apache.org/jira/browse/PIG-1015 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.6.0 Attachments: date_extractor.patch The current implementation defaults to the local timezone when parsing strings, thereby providing inconsistent results depending on the settings of the computer the program is executing on (this is causing unit test failures). We should set the timezone to a consistent default, and allow users to override this default. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
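The fix idea can be sketched as follows: pin the date formatter to a fixed zone so the same string maps to the same instant on every machine, instead of inheriting the host's default. The format string and the GMT default here are illustrative, not the piggybank patch itself:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Without setTimeZone, parse() interprets the string in the JVM's default
// zone, so the same input yields different epoch millis on different hosts --
// the inconsistency described in the report.
public class FixedZoneParse {
    public static void main(String[] args) throws ParseException {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));   // consistent default
        long millis = fmt.parse("1970-01-01 00:00:00").getTime();
        System.out.println(millis);                     // 0, regardless of host zone
    }
}
```

A user-facing override could simply accept a zone ID string and pass it to `TimeZone.getTimeZone`.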
[jira] Created: (PIG-1089) Pig 0.6.0 Documentation
Pig 0.6.0 Documentation --- Key: PIG-1089 URL: https://issues.apache.org/jira/browse/PIG-1089 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.6.0 Reporter: Corinne Chandel Assignee: Corinne Chandel Priority: Blocker Fix For: 0.6.0 Pig 0.6.0 documentation: Ability to use Hadoop dfs commands from Pig Replicated left outer join Skewed outer join Map-side group Accumulate Interface for UDFs Improved Memory Mgt Integration with Zebra -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1012) FINDBUGS: SE_BAD_FIELD: Non-transient non-serializable instance field in serializable class
[ https://issues.apache.org/jira/browse/PIG-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-1012: --- Assignee: Olga Natkovich FINDBUGS: SE_BAD_FIELD: Non-transient non-serializable instance field in serializable class --- Key: PIG-1012 URL: https://issues.apache.org/jira/browse/PIG-1012 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Olga Natkovich Fix For: 0.6.0 Attachments: PIG-1012-2.patch, PIG-1012-3.patch, PIG-1012.patch Se Class org.apache.pig.backend.executionengine.PigSlice defines non-transient non-serializable instance field is Se Class org.apache.pig.backend.executionengine.PigSlice defines non-transient non-serializable instance field loader Se java.util.zip.GZIPInputStream stored into non-transient field PigSlice.is Se org.apache.pig.backend.datastorage.SeekableInputStream stored into non-transient field PigSlice.is Se org.apache.tools.bzip2r.CBZip2InputStream stored into non-transient field PigSlice.is Se org.apache.pig.builtin.PigStorage stored into non-transient field PigSlice.loader Se org.apache.pig.backend.hadoop.DoubleWritable$Comparator implements Comparator but not Serializable Se org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigBagWritableComparator implements Comparator but not Serializable Se org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigCharArrayWritableComparator implements Comparator but not Serializable Se org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigDBAWritableComparator implements Comparator but not Serializable Se org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigDoubleWritableComparator implements Comparator but not Serializable Se org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigFloatWritableComparator implements Comparator but not Serializable Se org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigIntWritableComparator implements Comparator but not Serializable Se org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigLongWritableComparator implements Comparator but not Serializable Se org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigTupleWritableComparator implements Comparator but not Serializable Se org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigWritableComparator implements Comparator but not Serializable Se Class org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper defines non-transient non-serializable instance field nig Se Class org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.EqualToExpr defines non-transient non-serializable instance field log Se Class org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.GreaterThanExpr defines non-transient non-serializable instance field log Se Class org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.GTOrEqualToExpr defines non-transient non-serializable instance field log Se Class org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.LessThanExpr defines non-transient non-serializable instance field log Se Class org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.LTOrEqualToExpr defines non-transient non-serializable instance field log Se Class org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.NotEqualToExpr defines non-transient non-serializable instance field log Se Class org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast defines non-transient non-serializable instance field log Se Class org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject defines non-transient non-serializable instance field bagIterator Se Class org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserComparisonFunc defines non-transient non-serializable instance field log Se Class org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc defines non-transient non-serializable instance field log Se Class org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage defines non-transient non-serializable instance field log Se Class
[jira] Assigned: (PIG-1010) FINDBUGS: RV_RETURN_VALUE_IGNORED_BAD_PRACTICE
[ https://issues.apache.org/jira/browse/PIG-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-1010: --- Assignee: Olga Natkovich FINDBUGS: RV_RETURN_VALUE_IGNORED_BAD_PRACTICE -- Key: PIG-1010 URL: https://issues.apache.org/jira/browse/PIG-1010 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Olga Natkovich RV org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.deleteLocalDir(File) ignores exceptional return value of java.io.File.delete() RV org.apache.pig.backend.local.datastorage.LocalPath.delete() ignores exceptional return value of java.io.File.delete() RV org.apache.pig.data.DefaultAbstractBag.clear() ignores exceptional return value of java.io.File.delete() RV org.apache.pig.data.DefaultAbstractBag.finalize() ignores exceptional return value of java.io.File.delete() RV org.apache.pig.impl.io.FileLocalizer.create(String, boolean, PigContext) ignores exceptional return value of java.io.File.mkdirs() -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
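The pattern behind this warning: `File.delete()` and `File.mkdirs()` signal failure only through their boolean return value, which the flagged methods discard. A minimal sketch of the fix (method and message are illustrative, not Pig's patch):

```java
import java.io.File;
import java.io.IOException;

// RV_RETURN_VALUE_IGNORED_BAD_PRACTICE fix in miniature: a bare f.delete();
// silently leaks files on failure. Checking the result -- and at least
// logging it -- surfaces the error.
public class SafeDelete {
    static boolean deleteOrWarn(File f) {
        boolean deleted = f.delete();
        if (!deleted) {
            System.err.println("could not delete " + f);
        }
        return deleted;
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("pig-demo", ".tmp");
        System.out.println(deleteOrWarn(tmp));   // true: file existed and was removed
        System.out.println(deleteOrWarn(tmp));   // false: already gone, warning printed
    }
}
```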
[jira] Assigned: (PIG-1009) FINDBUGS: OS_OPEN_STREAM: Method may fail to close stream
[ https://issues.apache.org/jira/browse/PIG-1009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-1009: --- Assignee: Olga Natkovich FINDBUGS: OS_OPEN_STREAM: Method may fail to close stream - Key: PIG-1009 URL: https://issues.apache.org/jira/browse/PIG-1009 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Olga Natkovich Attachments: PIG-1009.patch OS org.apache.pig.impl.io.FileLocalizer.parseCygPath(String, int) may fail to close stream OS org.apache.pig.impl.logicalLayer.parser.QueryParser.which(String) may fail to close stream OS org.apache.pig.impl.util.PropertiesUtil.loadPropertiesFromFile(Properties) may fail to close stream OS org.apache.pig.Main.configureLog4J(Properties, PigContext) may fail to close stream OS org.apache.pig.tools.parameters.PreprocessorContext.executeShellCommand(String) may fail to close stream OS org.apache.pig.tools.parameters.PreprocessorContext.executeShellCommand(String) may fail to close stream -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
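The standard fix pattern for this warning is to release the stream in a finally block so it is closed on the exceptional path too (try-with-resources did not exist in the Java versions Pig targeted at the time). A hedged sketch; the method name is illustrative, not PropertiesUtil's actual code:

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// OS_OPEN_STREAM fix in miniature: close in finally, so the descriptor is
// released whether props.load() succeeds or throws.
public class LoadProps {
    static Properties loadFromFile(File file) throws IOException {
        Properties props = new Properties();
        InputStream in = new BufferedInputStream(new FileInputStream(file));
        try {
            props.load(in);
        } finally {
            in.close();   // runs on both the normal and the exceptional path
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("pig", ".properties");
        FileOutputStream out = new FileOutputStream(f);
        try {
            out.write("pig.logfile=/tmp/pig.log\n".getBytes("UTF-8"));
        } finally {
            out.close();
        }
        System.out.println(loadFromFile(f).getProperty("pig.logfile"));   // /tmp/pig.log
        f.delete();
    }
}
```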
[jira] Assigned: (PIG-1008) FINDBUGS: NP_TOSTRING_COULD_RETURN_NULL
[ https://issues.apache.org/jira/browse/PIG-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-1008: --- Assignee: Olga Natkovich FINDBUGS: NP_TOSTRING_COULD_RETURN_NULL --- Key: PIG-1008 URL: https://issues.apache.org/jira/browse/PIG-1008 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Olga Natkovich Attachments: PIG-1008.patch NP org.apache.pig.data.DataByteArray.toString() may return null NP org.apache.pig.impl.streaming.StreamingCommand$HandleSpec.equals(Object) does not check for null argument -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
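For the first warning, the usual remedy is to degrade a nullable backing field to the empty string, since callers of `toString()` (string concatenation, printing, logging) assume a non-null result. A sketch with a hypothetical class; DataByteArray's real fix may differ in detail:

```java
// NP_TOSTRING_COULD_RETURN_NULL fix in miniature: toString() never returns
// null, even when the wrapped data is absent.
public class ByteHolder {
    private byte[] data;   // may legitimately be null

    public ByteHolder(byte[] data) { this.data = data; }

    @Override public String toString() {
        return data == null ? "" : new String(data);   // "" instead of null
    }
}
```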
[jira] Assigned: (PIG-1006) FINDBUGS: EQ_COMPARETO_USE_OBJECT_EQUALS in bags and tuples
[ https://issues.apache.org/jira/browse/PIG-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-1006: --- Assignee: Olga Natkovich FINDBUGS: EQ_COMPARETO_USE_OBJECT_EQUALS in bags and tuples --- Key: PIG-1006 URL: https://issues.apache.org/jira/browse/PIG-1006 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Olga Natkovich Eq org.apache.pig.data.DistinctDataBag$DistinctDataBagIterator$TContainer defines compareTo(DistinctDataBag$DistinctDataBagIterator$TContainer) and uses Object.equals() Eq org.apache.pig.data.SingleTupleBag defines compareTo(Object) and uses Object.equals() Eq org.apache.pig.data.SortedDataBag$SortedDataBagIterator$PQContainer defines compareTo(SortedDataBag$SortedDataBagIterator$PQContainer) and uses Object.equals() Eq org.apache.pig.data.TargetedTuple defines compareTo(Object) and uses Object.equals() Eq org.apache.pig.pen.util.ExampleTuple defines compareTo(Object) and uses Object.equals() -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
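The contract at stake: a class that defines `compareTo` but inherits `Object.equals` makes sorted and hash-based containers disagree (a TreeSet treats two items with `compareTo == 0` as duplicates while a HashSet keeps both). A sketch of the consistent version, with a hypothetical Item standing in for the tuple and container classes listed above:

```java
// EQ_COMPARETO_USE_OBJECT_EQUALS fix in miniature: equals and hashCode are
// overridden so that equals(o) is true exactly when compareTo(o) == 0.
public class Item implements Comparable<Item> {
    final int key;
    Item(int key) { this.key = key; }

    @Override public int compareTo(Item other) {
        return (key < other.key) ? -1 : (key == other.key ? 0 : 1);
    }
    @Override public boolean equals(Object o) {
        return o instanceof Item && ((Item) o).key == key;   // agrees with compareTo == 0
    }
    @Override public int hashCode() { return key; }
}
```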
[jira] Assigned: (PIG-989) Allow type merge between numerical type and non-numerical type
[ https://issues.apache.org/jira/browse/PIG-989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-989: -- Assignee: Daniel Dai Allow type merge between numerical type and non-numerical type -- Key: PIG-989 URL: https://issues.apache.org/jira/browse/PIG-989 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.5.0 Reporter: Daniel Dai Assignee: Daniel Dai Attachments: PIG-989-1.patch, PIG-989-2.patch Currently, we do not allow a type merge between a numerical type and a non-numerical type, and the error message is confusing. For example, if you run: a = load '1.txt' as (a0:chararray, a1:chararray); b = load '2.txt' as (b0:long, b1:chararray); c = join a by a0, b by b0; dump c; the error message is ERROR 1051: Cannot cast to Unknown. We should: 1. Allow the type merge between a numerical type and a non-numerical type 2. Or at least provide a more meaningful error message to the user -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-979) Acummulator Interface for UDFs
[ https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-979: --- Status: Open (was: Patch Available) Acummulator Interface for UDFs -- Key: PIG-979 URL: https://issues.apache.org/jira/browse/PIG-979 Project: Pig Issue Type: New Feature Reporter: Alan Gates Assignee: Ying He Attachments: PIG-979.patch, PIG-979.patch Add an accumulator interface for UDFs that would allow them to take a set number of records at a time instead of the entire bag. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-979) Acummulator Interface for UDFs
[ https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-979: --- Status: Patch Available (was: Open) Acummulator Interface for UDFs -- Key: PIG-979 URL: https://issues.apache.org/jira/browse/PIG-979 Project: Pig Issue Type: New Feature Reporter: Alan Gates Assignee: Ying He Attachments: PIG-979.patch, PIG-979.patch Add an accumulator interface for UDFs that would allow them to take a set number of records at a time instead of the entire bag. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
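The idea behind PIG-979 can be sketched as follows: rather than materializing the whole bag for a UDF, the runtime pushes batches of records through an accumulate method and asks for the result once at the end. The interface and class names here are illustrative, not Pig's actual Accumulator API:

```java
import java.util.Arrays;
import java.util.Iterator;

// Accumulator-style evaluation in miniature: the runtime calls accumulate()
// once per batch, then getValue() once after the last batch, so the UDF never
// needs the entire bag in memory at once.
interface Accumulating<T> {
    void accumulate(Iterator<T> batch);   // called once per batch of records
    T getValue();                         // final answer after the last batch
}

class SumAccumulator implements Accumulating<Long> {
    private long sum;
    public void accumulate(Iterator<Long> batch) {
        while (batch.hasNext()) sum += batch.next();
    }
    public Long getValue() { return sum; }

    public static void main(String[] args) {
        SumAccumulator acc = new SumAccumulator();
        acc.accumulate(Arrays.asList(1L, 2L).iterator());   // first batch
        acc.accumulate(Arrays.asList(3L).iterator());       // second batch
        System.out.println(acc.getValue());                 // 6
    }
}
```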
[jira] Updated: (PIG-1089) Pig 0.6.0 Documentation
[ https://issues.apache.org/jira/browse/PIG-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corinne Chandel updated PIG-1089: - Status: Patch Available (was: Open) Apply patch to trunk: http://svn.apache.org/repos/asf/hadoop/pig/trunk Note: No new test code required; changes to documentation only. Pig 0.6.0 Documentation --- Key: PIG-1089 URL: https://issues.apache.org/jira/browse/PIG-1089 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.6.0 Reporter: Corinne Chandel Assignee: Corinne Chandel Priority: Blocker Fix For: 0.6.0 Attachments: Pig-6-Beta.patch Pig 0.6.0 documentation: Ability to use Hadoop dfs commands from Pig Replicated left outer join Skewed outer join Map-side group Accumulate Interface for UDFs Improved Memory Mgt Integration with Zebra -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-968) findContainingJar fails when there's a + in the path
[ https://issues.apache.org/jira/browse/PIG-968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-968: -- Assignee: Todd Lipcon findContainingJar fails when there's a + in the path Key: PIG-968 URL: https://issues.apache.org/jira/browse/PIG-968 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.4.0, 0.5.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: pig-968.txt This is the same bug as in MAPREDUCE-714. Please see discussion there. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
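The bug class in miniature: findContainingJar URL-decodes the classpath entry, and `URLDecoder` treats a literal `+` as an encoded space, mangling paths like `/tmp/a+b/pig.jar`. Escaping `+` before decoding (the approach discussed in MAPREDUCE-714) preserves it. This sketch is illustrative, not the committed patch:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// '+' in a URL query means space, but in a filesystem path it is literal.
// Re-escaping '+' as %2B before decoding keeps it intact while still
// decoding genuine %-escapes.
public class PlusSafeDecode {
    static String decodePath(String raw) {
        try {
            return URLDecoder.decode(raw.replaceAll("\\+", "%2B"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new AssertionError("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(URLDecoder.decode("/tmp/a+b/pig.jar", "UTF-8"));  // "/tmp/a b/pig.jar" -- broken
        System.out.println(decodePath("/tmp/a+b/pig.jar"));                  // "/tmp/a+b/pig.jar" -- intact
    }
}
```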
[jira] Assigned: (PIG-938) Pig Docs for 0.4.0
[ https://issues.apache.org/jira/browse/PIG-938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-938: -- Assignee: Corinne Chandel Pig Docs for 0.4.0 -- Key: PIG-938 URL: https://issues.apache.org/jira/browse/PIG-938 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.4.0 Reporter: Corinne Chandel Assignee: Corinne Chandel Priority: Minor Attachments: PIG-938-2.patch, PIG-938-2b.patch, PIG-938-3.patch, PIG-938-4.patch, PIG-938.patch Pig docs for 0.4.0 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-935) Skewed join throws an exception when used with map keys
[ https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-935: -- Assignee: Sriranjan Manjunath Skewed join throws an exception when used with map keys --- Key: PIG-935 URL: https://issues.apache.org/jira/browse/PIG-935 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Attachments: skmapbug.patch Skewed join throws a runtime exception for the following query: A = load 'map.txt' as (e); B = load 'map.txt' as (f); C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed; explain C; Exception: Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894) ... 27 more -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-958) Splitting output data on key field
[ https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-958: -- Assignee: Ankur Splitting output data on key field -- Key: PIG-958 URL: https://issues.apache.org/jira/browse/PIG-958 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Ankur Assignee: Ankur Fix For: 0.6.0 Attachments: 958.v3.patch, 958.v4.patch Pig users often face the need to split the output records into a bunch of files and directories depending on the type of record. Pig's SPLIT operator is useful when record types are few and known in advance. In cases where type is not directly known but is derived dynamically from values of a key field in the output tuple, a custom store function is a better solution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-960) Using Hadoop's optimized LineRecordReader for reading Tuples in PigStorage
[ https://issues.apache.org/jira/browse/PIG-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-960: -- Assignee: Ankit Modi Using Hadoop's optimized LineRecordReader for reading Tuples in PigStorage --- Key: PIG-960 URL: https://issues.apache.org/jira/browse/PIG-960 Project: Pig Issue Type: Improvement Components: impl Reporter: Ankit Modi Assignee: Ankit Modi Fix For: 0.6.0 Attachments: pig_rlr.patch PigStorage's reading of Tuples (lines) can be optimized using Hadoop's {{LineRecordReader}}. This can help in the following areas - Improved performance when reading Tuples (lines) in {{PigStorage}} - Any future improvements in line reading done in Hadoop's {{LineRecordReader}} are automatically carried over to Pig Issues that are handled by this patch - BZip uses internal buffers and positioning for determining the number of bytes read. Hence buffering done by {{LineRecordReader}} has to be turned off - The current implementation of {{LocalSeekableInputStream}} does not implement the {{available}} method. This method has to be implemented. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-923) Allow setting logfile location in pig.properties
[ https://issues.apache.org/jira/browse/PIG-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-923: -- Assignee: Dmitriy V. Ryaboy Allow setting logfile location in pig.properties Key: PIG-923 URL: https://issues.apache.org/jira/browse/PIG-923 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.4.0 Attachments: pig_923.patch The local log file location can be specified through the -l flag, but it cannot be set in pig.properties. This JIRA proposes a change to Main.java that allows it to read the pig.logfile property from the configuration. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
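A minimal sketch of the behavior asked for: prefer the -l flag when given, otherwise fall back to a `pig.logfile` property. Only the property name comes from the report; the surrounding logic is illustrative, not the patch:

```java
import java.util.Properties;

// Flag-over-property resolution in miniature: the command-line value wins,
// the properties file is the fallback, and null means "no logfile configured".
public class LogfileConfig {
    static String resolveLogfile(String flagValue, Properties props) {
        if (flagValue != null) {
            return flagValue;                       // -l on the command line wins
        }
        return props.getProperty("pig.logfile");    // else pig.properties, may be null
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("pig.logfile", "/var/log/pig.log");
        System.out.println(resolveLogfile(null, props));         // /var/log/pig.log
        System.out.println(resolveLogfile("/tmp/p.log", props)); // /tmp/p.log
    }
}
```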
[jira] Assigned: (PIG-929) Default value of memusage for skewed join is not correct
[ https://issues.apache.org/jira/browse/PIG-929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-929: -- Assignee: Ying He Default value of memusage for skewed join is not correct Key: PIG-929 URL: https://issues.apache.org/jira/browse/PIG-929 Project: Pig Issue Type: Improvement Reporter: Ying He Assignee: Ying He Attachments: memusage.patch default value pig.skewedjoin.reduce.memusage , which is used in skewed join, should be set to 0.3 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-924) Make Pig work with multiple versions of Hadoop
[ https://issues.apache.org/jira/browse/PIG-924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-924: -- Assignee: Dmitriy V. Ryaboy Make Pig work with multiple versions of Hadoop -- Key: PIG-924 URL: https://issues.apache.org/jira/browse/PIG-924 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Attachments: pig_924.2.patch, pig_924.3.patch, pig_924.patch The current Pig build scripts package hadoop and other dependencies into the pig.jar file. This means that if users upgrade Hadoop, they also need to upgrade Pig. Pig has relatively few dependencies on Hadoop interfaces that changed between 18, 19, and 20. It is possible to write a dynamic shim that allows Pig to use the correct calls for any of the above versions of Hadoop. Unfortunately, the build process precludes doing this at runtime, and forces an unnecessary Pig rebuild even if dynamic shims are created. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
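The dynamic-shim idea described above can be sketched as version-specific adapters behind one interface, selected by reflection at runtime so a single jar can drive several Hadoop versions. All names here are hypothetical, not the attached patch:

```java
// Shim pattern in miniature: each Shim class wraps the API calls that differ
// between Hadoop releases; the loader picks one from the detected version
// string, so callers never reference a version-specific class directly.
interface HadoopShim {
    String versionSupported();
}

class Shim18 implements HadoopShim {
    public String versionSupported() { return "0.18"; }
}

class Shim20 implements HadoopShim {
    public String versionSupported() { return "0.20"; }
}

public class ShimLoader {
    static HadoopShim load(String hadoopVersion) throws Exception {
        // e.g. "0.20.1" -> "Shim20"; a real loader would read the version
        // reported by the Hadoop jar on the classpath at startup.
        String[] parts = hadoopVersion.split("\\.");
        return (HadoopShim) Class.forName("Shim" + parts[1]).newInstance();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(load("0.20.1").versionSupported());   // 0.20
    }
}
```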
[jira] Assigned: (PIG-919) Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText when doing simple group
[ https://issues.apache.org/jira/browse/PIG-919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-919: -- Assignee: Viraj Bhat Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText when doing simple group -- Key: PIG-919 URL: https://issues.apache.org/jira/browse/PIG-919 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Reporter: Viraj Bhat Assignee: Viraj Bhat Fix For: 0.3.0 Attachments: GenHashList.java, mapscript.pig, mymapudf.jar I have a Pig script, which takes in a student file and generates a bag of maps. I later want to group on the value of the key name0 which corresponds to the first name of the student. {code} register mymapudf.jar; data = LOAD '/user/viraj/studenttab10k' AS (somename:chararray,age:long,marks:float); genmap = foreach data generate flatten(mymapudf.GenHashList(somename,' ')) as bp:map[], age, marks; getfirstnames = foreach genmap generate bp#'name0' as firstname, age, marks; filternonnullfirstnames = filter getfirstnames by firstname is not null; groupgenmap = group filternonnullfirstnames by firstname; dump groupgenmap; {code} When I execute this code, I get an error in the Map Phase: === java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:242) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:227) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209) === -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-913) Error in Pig script when grouping on chararray column
[ https://issues.apache.org/jira/browse/PIG-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-913: -- Assignee: Daniel Dai Error in Pig script when grouping on chararray column - Key: PIG-913 URL: https://issues.apache.org/jira/browse/PIG-913 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Viraj Bhat Assignee: Daniel Dai Priority: Critical Fix For: 0.4.0 Attachments: PIG-913-2.patch, PIG-913.patch I have a very simple script which fails at parsetime due to the schema I specified in the loader. {code} data = LOAD '/user/viraj/studenttab10k' AS (s:chararray); dataSmall = limit data 100; bb = GROUP dataSmall by $0; dump bb; {code} = 2009-08-06 18:47:56,297 [main] INFO org.apache.pig.Main - Logging error messages to: /homes/viraj/pig-svn/trunk/pig_1249609676296.log 09/08/06 18:47:56 INFO pig.Main: Logging error messages to: /homes/viraj/pig-svn/trunk/pig_1249609676296.log 2009-08-06 18:47:56,459 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000 09/08/06 18:47:56 INFO executionengine.HExecutionEngine: Connecting to hadoop file system at: hdfs://localhost:9000 2009-08-06 18:47:56,694 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001 09/08/06 18:47:56 INFO executionengine.HExecutionEngine: Connecting to map-reduce job tracker at: localhost:9001 2009-08-06 18:47:57,008 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1002: Unable to store alias bb 09/08/06 18:47:57 ERROR grunt.Grunt: ERROR 1002: Unable to store alias bb Details at logfile: /homes/viraj/pig-svn/trunk/pig_1249609676296.log = = Pig Stack Trace --- ERROR 1002: Unable to store alias bb org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias bb at org.apache.pig.PigServer.openIterator(PigServer.java:481) at 
org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:531) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:397) Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias bb at org.apache.pig.PigServer.store(PigServer.java:536) at org.apache.pig.PigServer.openIterator(PigServer.java:464) ... 6 more Caused by: java.lang.NullPointerException at org.apache.pig.impl.logicalLayer.LOCogroup.unsetSchema(LOCogroup.java:359) at org.apache.pig.impl.logicalLayer.optimizer.SchemaRemover.visit(SchemaRemover.java:64) at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:335) at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:46) at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.impl.logicalLayer.optimizer.LogicalTransformer.rebuildSchemas(LogicalTransformer.java:67) at org.apache.pig.impl.logicalLayer.optimizer.LogicalOptimizer.optimize(LogicalOptimizer.java:187) at org.apache.pig.PigServer.compileLp(PigServer.java:854) at org.apache.pig.PigServer.compileLp(PigServer.java:791) at org.apache.pig.PigServer.store(PigServer.java:509) ... 7 more = -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-911) [Piggybank] SequenceFileLoader
[ https://issues.apache.org/jira/browse/PIG-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-911: -- Assignee: Dmitriy V. Ryaboy [Piggybank] SequenceFileLoader --- Key: PIG-911 URL: https://issues.apache.org/jira/browse/PIG-911 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.5.0 Attachments: pig_911.2.patch, pig_sequencefile.patch The proposed piggybank contribution adds a SequenceFileLoader to the piggybank. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-905) TOKENIZE throws exception on null data
[ https://issues.apache.org/jira/browse/PIG-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-905: -- Assignee: Daniel Dai TOKENIZE throws exception on null data -- Key: PIG-905 URL: https://issues.apache.org/jira/browse/PIG-905 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.4.0 Attachments: PIG-905-1.patch, PIG-905-2.patch, PIG-905-3.patch it should just return null -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-895) Default parallel for Pig
[ https://issues.apache.org/jira/browse/PIG-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-895: -- Assignee: Daniel Dai Default parallel for Pig Key: PIG-895 URL: https://issues.apache.org/jira/browse/PIG-895 Project: Pig Issue Type: New Feature Components: impl Affects Versions: 0.3.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.4.0 Attachments: PIG-895-1.patch, PIG-895-2.patch, PIG-895-3.patch For Hadoop 20, if the user doesn't specify the number of reducers, Hadoop will use 1 reducer as the default. This differs from previous versions of Hadoop, in which the default reducer number was usually good. One reducer is certainly not what the user wants. Although the user can use the parallel keyword to specify the number of reducers for each statement, that is wordy. We need a convenient way for users to express a desired number of reducers. Here is my proposal: 1. Add one property default_parallel to Pig. The user can set default_parallel in a script. Eg: set default_parallel 10; 2. default_parallel is a hint to Pig. Pig is free to optimize the number of reducers (unlike the parallel keyword). Currently, since we do not have a mechanism to determine the optimal number of reducers, default_parallel will always be granted, unless it is overridden by the parallel keyword. 3. If the user puts multiple default_parallel statements inside a script, the last entry will be taken. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-890) Create a sampler interface and improve the skewed join sampler
[ https://issues.apache.org/jira/browse/PIG-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-890: -- Assignee: Sriranjan Manjunath Create a sampler interface and improve the skewed join sampler -- Key: PIG-890 URL: https://issues.apache.org/jira/browse/PIG-890 Project: Pig Issue Type: Improvement Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Fix For: 0.4.0 Attachments: samplerinterface.patch We need a different sampler for order by and skewed join. We thus need a better sampling interface. The design of the same is described here: http://wiki.apache.org/pig/PigSampler -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-837) docs ant target is broken
[ https://issues.apache.org/jira/browse/PIG-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-837: -- Assignee: Olga Natkovich docs ant target is broken -- Key: PIG-837 URL: https://issues.apache.org/jira/browse/PIG-837 Project: Pig Issue Type: Bug Reporter: Giridharan Kesavan Assignee: Olga Natkovich The docs ant target is broken; this would fail the trunk builds. [exec] Java Result: 1 [exec] [exec] Copying broken links file to site root. [exec] [exec] Copying 1 file to /home/hudson/hudson-slave/workspace/Pig-Patch-minerva.apache.org/trunk/src/docs/build/site [exec] [exec] BUILD FAILED [exec] /home/nigel/tools/forrest/latest/main/targets/site.xml:180: Error building site. [exec] [exec] There appears to be a problem with your site build. [exec] [exec] Read the output above: [exec] * Cocoon will report the status of each document: [exec] - in column 1: *=okay X=brokenLink ^=pageSkipped (see FAQ). [exec] * Even if only one link is broken, you will still get failed. [exec] * Your site would still be generated, but some pages would be broken. [exec] - See /home/hudson/hudson-slave/workspace/Pig-Patch-minerva.apache.org/trunk/src/docs/build/site/broken-links.xml [exec] [exec] Total time: 28 seconds BUILD FAILED /home/hudson/hudson-slave/workspace/Pig-Patch-minerva.apache.org/trunk/build.xml:326: exec returned: 1 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-830) Port Apache Log parsing piggybank contrib to Pig 0.2
[ https://issues.apache.org/jira/browse/PIG-830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-830: -- Assignee: Dmitriy V. Ryaboy Port Apache Log parsing piggybank contrib to Pig 0.2 Key: PIG-830 URL: https://issues.apache.org/jira/browse/PIG-830 Project: Pig Issue Type: New Feature Affects Versions: 0.2.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Priority: Minor Fix For: 0.3.0 Attachments: pig-830-v2.patch, pig-830-v3.patch, pig-830.patch, TEST-org.apache.pig.piggybank.test.storage.TestMyRegExLoader.txt The piggybank contribs (pig-472, pig-473, pig-474, pig-476, pig-486, pig-487, pig-488, pig-503, pig-509) got dropped after the types branch was merged in. They should be updated to work with the current APIs and added back into trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1087) Use Pig's version for Zebra's own version.
[ https://issues.apache.org/jira/browse/PIG-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Wang updated PIG-1087: --- Attachment: patch_Pig1087 Use Pig's version for Zebra's own version. -- Key: PIG-1087 URL: https://issues.apache.org/jira/browse/PIG-1087 Project: Pig Issue Type: Task Affects Versions: 0.4.0 Reporter: Chao Wang Assignee: Chao Wang Fix For: 0.6.0 Attachments: patch_Pig1087 Zebra is a contrib project of Pig now. It should use Pig's version for its own version. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-825) PIG_HADOOP_VERSION should be 18
[ https://issues.apache.org/jira/browse/PIG-825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-825: -- Assignee: Dmitriy V. Ryaboy PIG_HADOOP_VERSION should be 18 --- Key: PIG-825 URL: https://issues.apache.org/jira/browse/PIG-825 Project: Pig Issue Type: Bug Components: grunt Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.3.0 Attachments: pig-825.patch, pig-825.patch PIG_HADOOP_VERSION should be set to 18, not 17, as Hadoop 0.18 is now considered default. Patch coming. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1087) Use Pig's version for Zebra's own version.
[ https://issues.apache.org/jira/browse/PIG-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776786#action_12776786 ] Yan Zhou commented on PIG-1087: --- +1 Use Pig's version for Zebra's own version. -- Key: PIG-1087 URL: https://issues.apache.org/jira/browse/PIG-1087 Project: Pig Issue Type: Task Affects Versions: 0.4.0 Reporter: Chao Wang Assignee: Chao Wang Fix For: 0.6.0 Attachments: patch_Pig1087 Zebra is a contrib project of Pig now. It should use Pig's version for its own version. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1087) Use Pig's version for Zebra's own version.
[ https://issues.apache.org/jira/browse/PIG-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Wang updated PIG-1087: --- Status: Patch Available (was: Open) Use Pig's version for Zebra's own version. -- Key: PIG-1087 URL: https://issues.apache.org/jira/browse/PIG-1087 Project: Pig Issue Type: Task Affects Versions: 0.4.0 Reporter: Chao Wang Assignee: Chao Wang Fix For: 0.6.0 Attachments: patch_Pig1087 Zebra is a contrib project of Pig now. It should use Pig's version for its own version. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-796) support conversion from numeric types to chararray
[ https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-796: -- Assignee: Ashutosh Chauhan support conversion from numeric types to chararray --- Key: PIG-796 URL: https://issues.apache.org/jira/browse/PIG-796 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Ashutosh Chauhan Fix For: 0.3.0 Attachments: 796.patch, pig-796.patch, pig-796.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT
[ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-795: -- Assignee: Eric Gaudet Command that selects a random sample of the rows, similar to LIMIT -- Key: PIG-795 URL: https://issues.apache.org/jira/browse/PIG-795 Project: Pig Issue Type: New Feature Components: impl Reporter: Eric Gaudet Assignee: Eric Gaudet Priority: Trivial Attachments: sample2.diff, sample3.diff When working with very large data sets (imagine that!), running a pig script can take time. It may be useful to run on a small subset of the data in some situations (eg: debugging / testing, or to get fast results even if less accurate.) The command LIMIT N selects the first N rows of the data, but these are not necessarily randomized. A command SAMPLE X would retain each row with probability x%. Note: it is possible to implement this feature with FILTER BY and a UDF, but so is LIMIT, and limit is built-in. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
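[Editor's note] The proposed operator reads naturally next to LIMIT; a hypothetical usage sketch, assuming SAMPLE takes a probability between 0 and 1 (the form the later built-in operator settled on):

```pig
A = load 'mydata';
-- keep each row independently with probability 0.01 (roughly a 1% sample)
B = sample A 0.01;
-- contrast with LIMIT, which keeps the first N rows, not a random subset
C = limit A 100;
store B into 'sampled';
```

Because the sample is probabilistic, the output size is only approximately 1% of the input and varies from run to run.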
[jira] Assigned: (PIG-792) PERFORMANCE: Support skewed join in pig
[ https://issues.apache.org/jira/browse/PIG-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-792: -- Assignee: Sriranjan Manjunath PERFORMANCE: Support skewed join in pig --- Key: PIG-792 URL: https://issues.apache.org/jira/browse/PIG-792 Project: Pig Issue Type: Improvement Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Attachments: skewedjoin.patch Fragmented replicated join has a few limitations: - One of the tables needs to be loaded into memory - Join is limited to two tables Skewed join partitions the table and joins the records in the reduce phase. It computes a histogram of the key space to account for skewing in the input records. Further, it adjusts the number of reducers depending on the key distribution. We need to implement the skewed join in pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
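[Editor's note] For context, a skewed join would be requested the way other specialized joins are; a sketch under the assumption that the feature is exposed through a USING clause (the clause name and relation names are illustrative of where this work ended up, not quoted from the patch):

```pig
daily = load 'daily' as (symbol:chararray, price:double);
divs  = load 'divs'  as (symbol:chararray, amount:double);
-- a sampling pass builds a histogram of join keys so that heavily
-- skewed keys can be split across multiple reducers
J = join daily by symbol, divs by symbol using 'skewed' parallel 40;
store J into 'joined';
```

Unlike fragment replicate join, neither input has to fit in memory; the cost is the extra sampling job before the join itself.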
[jira] Assigned: (PIG-782) javadoc throws warnings - this would break hudson patch test process.
[ https://issues.apache.org/jira/browse/PIG-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-782: -- Assignee: Santhosh Srinivasan javadoc throws warnings - this would break hudson patch test process. - Key: PIG-782 URL: https://issues.apache.org/jira/browse/PIG-782 Project: Pig Issue Type: Bug Environment: javadoc throws warnings - this would break the hudson patch test process. Reporter: Giridharan Kesavan Assignee: Santhosh Srinivasan [javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:233: warning - @return tag has no arguments. [javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:205: warning - @return tag has no arguments. [javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:185: warning - @return tag has no arguments. [javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:220: warning - @return tag has no arguments. [javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:158: warning - @return tag has no arguments. [javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:134: warning - @return tag has no arguments. [javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:105: warning - @return tag has no arguments. [javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:92: warning - @return tag has no arguments. [javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:120: warning - @return tag has no arguments. [javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:48: warning - @return tag has no arguments. 
[javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:77: warning - @return tag has no arguments. [javadoc] /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:92: warning - @param argument names is not a parameter name. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-781) Error reporting for failed MR jobs
[ https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-781: -- Assignee: Gunther Hagleitner Error reporting for failed MR jobs -- Key: PIG-781 URL: https://issues.apache.org/jira/browse/PIG-781 Project: Pig Issue Type: Improvement Reporter: Gunther Hagleitner Assignee: Gunther Hagleitner Fix For: 0.3.0 Attachments: partial_failure.patch, partial_failure.patch, partial_failure.patch, partial_failure.patch If we have multiple MR jobs to run and some of them fail the behavior of the system is to not stop on the first failure but to keep going. That way jobs that do not depend on the failed job might still succeed. The question is to how best report this scenario to a user. How do we tell which jobs failed and which didn't? One way could be to tie jobs to stores and report which store locations won't have data and which ones do. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-745) Please add DataTypes.toString() conversion function
[ https://issues.apache.org/jira/browse/PIG-745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-745: -- Assignee: David Ciemiewicz Please add DataTypes.toString() conversion function --- Key: PIG-745 URL: https://issues.apache.org/jira/browse/PIG-745 Project: Pig Issue Type: Improvement Reporter: David Ciemiewicz Assignee: David Ciemiewicz Fix For: 0.3.0 Attachments: PIG-745.patch I'm doing some work in string manipulation UDFs and I've found that it would be very convenient if I could always convert the argument to a chararray (internally a Java String). For example, TOLOWERCASE(arg) shouldn't really care whether arg is a bytearray, chararray, int, long, double, or float; it should be treated as a string and operated on. The simplest and most foolproof method would be if DataTypes added a static function DataTypes.toString which did all of the argument type checking and provided consistent translation. I believe that this function might be coded as:
{code}
public static String toString(Object o) throws ExecException {
    try {
        switch (findType(o)) {
        case BOOLEAN:
            if (((Boolean)o) == true) return "1"; else return "0";
        case BYTE: return ((Byte)o).toString();
        case INTEGER: return ((Integer)o).toString();
        case LONG: return ((Long)o).toString();
        case FLOAT: return ((Float)o).toString();
        case DOUBLE: return ((Double)o).toString();
        case BYTEARRAY: return ((DataByteArray)o).toString();
        case CHARARRAY: return (String)o;
        case NULL: return null;
        case MAP:
        case TUPLE:
        case BAG:
        case UNKNOWN:
        default:
            int errCode = 1071;
            String msg = "Cannot convert a " + findTypeName(o) + " to a String";
            throw new ExecException(msg, errCode, PigException.INPUT);
        }
    } catch (ExecException ee) {
        throw ee;
    } catch (Exception e) {
        int errCode = 2054;
        String msg = "Internal error. Could not convert " + o + " to String.";
        throw new ExecException(msg, errCode, PigException.BUG);
    }
}
{code}
-- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-753) Provide support for UDFs without parameters
[ https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-753: -- Assignee: Jeff Zhang Provide support for UDFs without parameters --- Key: PIG-753 URL: https://issues.apache.org/jira/browse/PIG-753 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Fix For: 0.4.0 Attachments: Pig_753_Patch.txt Pig does not support UDFs without parameters; it forces me to provide a parameter, as in the following statement: B = FOREACH A GENERATE bagGenerator(); this will generate an error. I have to provide a parameter, like the following: B = FOREACH A GENERATE bagGenerator($0); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-833) Storage access layer
[ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-833: -- Assignee: Raghu Angadi Storage access layer Key: PIG-833 URL: https://issues.apache.org/jira/browse/PIG-833 Project: Pig Issue Type: New Feature Reporter: Jay Tang Assignee: Raghu Angadi Fix For: 0.4.0 Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code. This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata. Eventually it could also support predicate pushdown for further performance improvement. Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1089) Pig 0.6.0 Documentation
[ https://issues.apache.org/jira/browse/PIG-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776789#action_12776789 ] Dmitriy V. Ryaboy commented on PIG-1089: I am not sure where this goes, but can we add a bit about using the pig.logfile property in the pig.properties file to control where Pig logs get written? It takes a directory or a filename on a local system, and defaults to current working directory. Pig 0.6.0 Documentation --- Key: PIG-1089 URL: https://issues.apache.org/jira/browse/PIG-1089 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.6.0 Reporter: Corinne Chandel Assignee: Corinne Chandel Priority: Blocker Fix For: 0.6.0 Attachments: Pig-6-Beta.patch Pig 0.6.0 documentation: Ability to use Hadoop dfs commands from Pig Replicated left outer join Skewed outer join Map-side group Accumulate Interface for UDFs Improved Memory Mgt Integration with Zebra -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
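[Editor's note] The property Dmitriy mentions would sit alongside the other entries in conf/pig.properties; a minimal sketch (the path and comment text are illustrative):

```properties
# Where Pig client logs are written.
# Accepts a local directory or a file name; if unset, logs go to the
# current working directory.
pig.logfile=/var/log/pig/
```

The same setting can presumably be overridden on the command line the way other Pig properties are, so the properties file only needs to carry the site-wide default.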
[jira] Assigned: (PIG-732) Utility UDFs
[ https://issues.apache.org/jira/browse/PIG-732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-732: -- Assignee: Ankur Utility UDFs - Key: PIG-732 URL: https://issues.apache.org/jira/browse/PIG-732 Project: Pig Issue Type: New Feature Reporter: Ankur Assignee: Ankur Priority: Minor Attachments: udf.v1.patch, udf.v2.patch, udf.v3.patch, udf.v4.patch, udf.v5.patch Two utility UDFs and their respective test cases. 1. TopN - Accepts the number of tuples (N) to retain in the output, the field number (of type long) to use for comparison, and a sorted/unsorted bag of tuples. It outputs a bag containing the top N tuples. 2. SearchQuery - Accepts an encoded URL from any of the 4 search engines (Yahoo, Google, AOL, Live) and extracts and normalizes the search query present in it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
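[Editor's note] A hypothetical invocation of the TopN UDF as described above (the jar name, relation names, and field position are made up for illustration; the argument order follows the description: N, comparison field number, bag):

```pig
register udf.jar;
A = load 'queries' as (user:chararray, count:long);
B = group A by user;
-- per user, keep the 5 tuples with the largest value in field 1 (count)
C = foreach B generate group, TopN(5, 1, A);
store C into 'topn';
```

Because TopN accepts an unsorted bag, this avoids a nested `order` inside the foreach when only the top few tuples per group are needed.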
[jira] Assigned: (PIG-715) Remove 2 doc files: hello.pdf and overview.html
[ https://issues.apache.org/jira/browse/PIG-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-715: -- Assignee: Corinne Chandel Remove 2 doc files: hello.pdf and overview.html --- Key: PIG-715 URL: https://issues.apache.org/jira/browse/PIG-715 Project: Pig Issue Type: Bug Components: documentation Reporter: Corinne Chandel Assignee: Corinne Chandel Priority: Minor Please remove these 2 doc files. They don't belong with the Pig 2.0 documentation and will cause confusion. (1) hello.pdf ... located in: trunk/src/docs/src/documentation/content/xdocs (2) overview.html ... located in: trunk/docs -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-713) Autocompletion doesn't complete aliases
[ https://issues.apache.org/jira/browse/PIG-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-713: -- Assignee: Eric Gaudet Autocompletion doesn't complete aliases --- Key: PIG-713 URL: https://issues.apache.org/jira/browse/PIG-713 Project: Pig Issue Type: New Feature Components: grunt Reporter: Eric Gaudet Assignee: Eric Gaudet Priority: Minor Fix For: 0.3.0 Attachments: alias_completion.patch Autocompletion only knows about keywords, but in different contexts, it would be nice if it completed aliases where an alias is expected. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-712) Need utilities to create schemas for bags and tuples
[ https://issues.apache.org/jira/browse/PIG-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-712: -- Assignee: Jeff Zhang Need utilities to create schemas for bags and tuples Key: PIG-712 URL: https://issues.apache.org/jira/browse/PIG-712 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Santhosh Srinivasan Assignee: Jeff Zhang Priority: Minor Fix For: 0.3.0 Attachments: Pig_712_Patch.txt Pig should provide utilities to create bag and tuple schemas. Currently, users return schemas in the outputSchema method and end up with very verbose boilerplate code. It would be very nice if Pig encapsulated this boilerplate code in utility methods. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-704) Interactive mode doesn't list defined aliases
[ https://issues.apache.org/jira/browse/PIG-704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-704: -- Assignee: Eric Gaudet Interactive mode doesn't list defined aliases - Key: PIG-704 URL: https://issues.apache.org/jira/browse/PIG-704 Project: Pig Issue Type: Improvement Components: grunt Reporter: Eric Gaudet Assignee: Eric Gaudet Priority: Trivial Fix For: 0.2.0 Attachments: aliases_last.patch, aliases_last2.patch I'm using the interactive mode to test my scripts, and I'm struggling to keep track of 2 things: 1) the aliases. A typical test script has 10 aliases or more. As the test goes on, different versions are created, or aliases are created with typos. There's no command in grunt to get the list of defined aliases. Proposed solution: add a new command aliases that prints the list of aliases. 2) I prefer to give meaningful (long) names to my aliases. But as I try different things, I find it hard to predict what the schema will look like, so I use DESCRIBE a lot. It's a pain to type these long names all the time, especially since most of the time I only want to describe the last alias created. A shortcut to the describe command describing the last alias would be very useful. Proposed solution: use the special name _ as a shortcut to the last created alias: DESCRIBE _. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-703) Pig trunk/src/docs folders and files for forrest xml doc builds
[ https://issues.apache.org/jira/browse/PIG-703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-703: -- Assignee: Corinne Chandel Pig trunk/src/docs folders and files for forrest xml doc builds Key: PIG-703 URL: https://issues.apache.org/jira/browse/PIG-703 Project: Pig Issue Type: Task Components: documentation Affects Versions: site Reporter: Corinne Chandel Assignee: Corinne Chandel Fix For: site Attachments: logos-1.zip, trunk-1.patch Add src/docs directory folders and files to trunk branch. Patch includes: src/docs ... forrrest.properties src/docs/src/documentation ... skinconf.xml src/docs/src/documentation/content/xdocs ... doc files src/docs/src/documentation/content/xdocs/images ... image files Please review. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-692) when running script file, automatically set up job name based on the file name
[ https://issues.apache.org/jira/browse/PIG-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-692: -- Assignee: Vadim Zaliva when running script file, automatically set up job name based on the file name -- Key: PIG-692 URL: https://issues.apache.org/jira/browse/PIG-692 Project: Pig Issue Type: Improvement Components: tools Affects Versions: 0.2.0 Reporter: Vadim Zaliva Assignee: Vadim Zaliva Priority: Trivial Fix For: 0.2.0 Attachments: pig-job-name.patch When running a pig script from the command line like this: pig scriptfile, right now the default job name is used. It would be convenient to have the job name automatically set based on the script name. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-623) Fix spelling errors in output messages
[ https://issues.apache.org/jira/browse/PIG-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-623: -- Assignee: Tom White Fix spelling errors in output messages -- Key: PIG-623 URL: https://issues.apache.org/jira/browse/PIG-623 Project: Pig Issue Type: Improvement Reporter: Tom White Assignee: Tom White Priority: Trivial Attachments: pig-623.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-620) find Max Tuple by 1st field UDF (for piggybank)
[ https://issues.apache.org/jira/browse/PIG-620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-620: -- Assignee: Vadim Zaliva find Max Tuple by 1st field UDF (for piggybank) --- Key: PIG-620 URL: https://issues.apache.org/jira/browse/PIG-620 Project: Pig Issue Type: New Feature Components: impl Affects Versions: 0.2.0 Reporter: Vadim Zaliva Assignee: Vadim Zaliva Fix For: 0.2.0 Attachments: MaxTupleBy1stField.java This is a simple UDF which takes a bag of tuples and returns the one with the maximum 1st column. It is fairly trivial, but I have seen people asking for it. Detailed usage comments are in the Javadoc. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1087) Use Pig's version for Zebra's own version.
[ https://issues.apache.org/jira/browse/PIG-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1087: Resolution: Fixed Status: Resolved (was: Patch Available) patch committed. thanks Chao! Use Pig's version for Zebra's own version. -- Key: PIG-1087 URL: https://issues.apache.org/jira/browse/PIG-1087 Project: Pig Issue Type: Task Affects Versions: 0.4.0 Reporter: Chao Wang Assignee: Chao Wang Fix For: 0.6.0 Attachments: patch_Pig1087 Zebra is a contrib project of Pig now. It should use Pig's version for its own version. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-622) Include pig executable in distribution
[ https://issues.apache.org/jira/browse/PIG-622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-622: -- Assignee: Tom White Include pig executable in distribution -- Key: PIG-622 URL: https://issues.apache.org/jira/browse/PIG-622 Project: Pig Issue Type: Bug Reporter: Tom White Assignee: Tom White Attachments: pig-622.patch Running ant tar does not generate the bin directory with the pig executable in it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-592) schema inferred incorrectly
[ https://issues.apache.org/jira/browse/PIG-592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-592: -- Assignee: Daniel Dai schema inferred incorrectly --- Key: PIG-592 URL: https://issues.apache.org/jira/browse/PIG-592 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Christopher Olston Assignee: Daniel Dai Fix For: 0.6.0 Attachments: PIG-592-1.patch, PIG-592-2.patch, PIG-592-3.patch A simple pig script that never introduces any schema information: A = load 'foo'; B = foreach (group A by $8) generate group, COUNT($1); C = load 'bar'; // ('bar' has two columns) D = join B by $0, C by $0; E = foreach D generate $0, $1, $3; Fails, complaining that $3 does not exist: java.io.IOException: Out of bound access. Trying to access non-existent column: 3. Schema {B::group: bytearray,long,bytearray} has 3 column(s). Apparently Pig gets confused and thinks it knows the schema for C (a single bytearray column). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-595) Use of Combiner causes java.lang.ClassCastException in ForEach
[ https://issues.apache.org/jira/browse/PIG-595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-595: -- Assignee: Viraj Bhat Use of Combiner causes java.lang.ClassCastException in ForEach -- Key: PIG-595 URL: https://issues.apache.org/jira/browse/PIG-595 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Reporter: Viraj Bhat Assignee: Viraj Bhat Attachments: querypairs.txt The following Pig script causes a ClassCastException when QueryPairs is used in the ForEach statement. This is due to the use of the combiner. {code} QueryPairs = load 'querypairs.txt' using PigStorage() as ( q1: chararray, q2: chararray ); describe QueryPairs; QueryPairsGrouped = group QueryPairs by ( q1 ); describe QueryPairsGrouped; QueryGroups = foreach QueryPairsGrouped generate group as q1, COUNT(QueryPairs) as paircount, QueryPairs as QueryPairs; describe QueryGroups; dump QueryGroups; {code} = 2008-12-31 15:01:48,713 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - Error message from task (map) task_200812151518_4922_m_00java.lang.ClassCastException: org.apache.pig.data.DefaultDataBag cannot be cast to org.apache.pig.data.Tuple at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:122) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:152) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:143) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:57) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:904) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:785) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:228) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207) = -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-546) FilterFunc calls empty constructor when it should be calling parameterized constructor
[ https://issues.apache.org/jira/browse/PIG-546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-546: -- Assignee: Santhosh Srinivasan FilterFunc calls empty constructor when it should be calling parameterized constructor -- Key: PIG-546 URL: https://issues.apache.org/jira/browse/PIG-546 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Reporter: Viraj Bhat Assignee: Santhosh Srinivasan Fix For: 0.2.0 Attachments: FILTERFROMFILE.java, insetfilterfile, mydata.txt, PIG-546.patch The following piece of Pig Script uses a custom UDF known as FILTERFROMFILE which extends the FilterFunc. It contains two constructors, an empty constructor which is mandatory and the parameterized constructor. The parameterized constructor passes the HDFS filename, which the exec function uses to construct a HashMap. The HashMap is later used for filtering records based on the match criteria in the HDFS file. {code} register util.jar; --util.jar contains the FILTERFROMFILE class define FILTER_CRITERION util.FILTERFROMFILE('/user/viraj/insetfilterfile'); RAW_LOGS = load 'mydata.txt' as (url:chararray, numvisits:int); FILTERED_LOGS = filter RAW_LOGS by FILTER_CRITERION(numvisits); dump FILTERED_LOGS; {code} When you execute the above script, it results in a single Map only job with 1 Map. It seems that the empty constructor is called 5 times, and ultimately results in failure of the job. 
=== parameterized constructor: /user/viraj/insetfilterfile parameterized constructor: /user/viraj/insetfilterfile empty constructor empty constructor empty constructor empty constructor empty constructor === Error in the Hadoop backend === java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:82) at org.apache.hadoop.fs.Path.<init>(Path.java:90) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.isContainer(HDataStorage.java:199) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:130) at org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:164) at util.FILTERFROMFILE.init(FILTERFROMFILE.java:70) at util.FILTERFROMFILE.exec(FILTERFROMFILE.java:89) at util.FILTERFROMFILE.exec(FILTERFROMFILE.java:52) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:179) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:217) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:148) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:170) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:158) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209) === Attaching the sample data and the filter function UDF. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
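One plausible reading of the constructor counts above: the front end constructs the UDF with its arguments (the two parameterized-constructor calls), but the backend tasks re-instantiate the function reflectively, which can only reach the no-arg constructor unless the arguments are shipped along, so the filename is lost by the time exec runs. A minimal, hypothetical Java sketch (FilterFromFile here is a stand-in, not the attached UDF, and this is not Pig's actual instantiation code):

```java
public class CtorDemo {
    // Stand-in for a UDF with the mandatory empty constructor plus a
    // parameterized one that carries the HDFS filename.
    public static class FilterFromFile {
        final String path;
        public FilterFromFile() { this.path = ""; }              // backend path
        public FilterFromFile(String path) { this.path = path; } // front-end path
    }

    // Reflection from the class alone can only reach the no-arg
    // constructor, so `path` comes back empty -- consistent with the
    // "Can not create a Path from an empty string" error above.
    static FilterFromFile instantiateLikeBackend(Class<? extends FilterFromFile> cls)
            throws Exception {
        return cls.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        FilterFromFile front = new FilterFromFile("/user/viraj/insetfilterfile");
        FilterFromFile back = instantiateLikeBackend(FilterFromFile.class);
        System.out.println(front.path);           // prints "/user/viraj/insetfilterfile"
        System.out.println(back.path.isEmpty());  // prints "true"
    }
}
```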
[jira] Assigned: (PIG-570) Large BZip files Seem to lose data in Pig
[ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-570: -- Assignee: Benjamin Reed Large BZip files Seem to lose data in Pig --- Key: PIG-570 URL: https://issues.apache.org/jira/browse/PIG-570 Project: Pig Issue Type: Bug Affects Versions: 0.0.0, 0.1.0, 0.2.0, site Environment: Pig 0.1.1/Linux / 8 Nodes hadoop 0.18.2 Reporter: Alex Newman Assignee: Benjamin Reed Fix For: 0.2.0 Attachments: bzipTest.bz2, PIG-570.patch So I don't believe bzip2 input to pig is working, at least not with large files. It seems as though map files are getting cut off. The maps complete way too quickly and the actual row of data that pig tries to process often randomly gets cut, and becomes incomplete. Here are my symptoms: - Maps seem to be completing at an unbelievably fast rate With uncompressed data Status: Succeeded Started at: Wed Dec 17 21:31:10 EST 2008 Finished at: Wed Dec 17 22:42:09 EST 2008 Finished in: 1hrs, 10mins, 59sec map 100.00% 4670 0 0 46700 0 / 21 reduce 57.72% 130 0 13 0 0 / 4 With bzip compressed data Started at: Wed Dec 17 21:17:28 EST 2008 Failed at: Wed Dec 17 21:17:52 EST 2008 Failed in: 24sec Black-listed TaskTrackers: 2 Kind % Complete Num Tasks Pending Running Complete Killed Failed/Killed Task Attempts map 100.00% 183 0 0 15 168 54 / 22 reduce 100.00% 130 0 0 13 0 / 0 The errors we get: java.lang.IndexOutOfBoundsException: Requested index 11 from tuple (rec A, 0HAW, CHIX, ) at org.apache.pig.data.Tuple.getField(Tuple.java:176) at org.apache.pig.impl.eval.ProjectSpec.eval(ProjectSpec.java:84) at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:38) at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:223) at org.apache.pig.impl.eval.cond.CompCond.eval(CompCond.java:58) at org.apache.pig.impl.eval.FilterSpec$1.add(FilterSpec.java:60) at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:117) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:227) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207) Last 4KB attempt_200812161759_0045_m_07_0 task_200812161759_0045_m_07 tsdhb06.factset.com FAILED java.lang.IndexOutOfBoundsException: Requested index 11 from tuple (rec A, CSGN, VTX, VTX, 0, 20080303, 90919, 380, 1543, 206002) at org.apache.pig.data.Tuple.getField(Tuple.java:176) at org.apache.pig.impl.eval.ProjectSpec.eval(ProjectSpec.java:84) at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:38) at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:223) at org.apache.pig.impl.eval.cond.CompCond.eval(CompCond.java:58) at org.apache.pig.impl.eval.FilterSpec$1.add(FilterSpec.java:60) at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:117) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
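The IndexOutOfBoundsException above is consistent with records being cut off mid-row: a truncated line simply has fewer fields than the script projects, so a reference to $11 fails. A small, self-contained Java sketch of that symptom (the field-splitting and message format are illustrative, not Pig's internals):

```java
public class TruncatedRowDemo {
    // Splits a row on commas and fetches one field, roughly the way a
    // projection of $11 would.
    static String getField(String row, int index) {
        String[] fields = row.split(",\\s*", -1);
        if (index >= fields.length) {
            throw new IndexOutOfBoundsException(
                "Requested index " + index + " from tuple (" + row + ")");
        }
        return fields[index];
    }

    public static void main(String[] args) {
        String full = "rec A, CSGN, VTX, VTX, 0, 20080303, 90919, 380, 1543, 206002, x, y";
        String cut  = "rec A, 0HAW, CHIX, "; // record cut off mid-split
        System.out.println(getField(full, 11)); // an intact row has the field
        try {
            getField(cut, 11);                  // a truncated row does not
        } catch (IndexOutOfBoundsException e) {
            System.out.println(e.getMessage()); // matches the report's error shape
        }
    }
}
```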
[jira] Assigned: (PIG-572) A PigServer.registerScript() method, which lets a client programmatically register a Pig Script.
[ https://issues.apache.org/jira/browse/PIG-572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-572: -- Assignee: Shubham Chopra A PigServer.registerScript() method, which lets a client programmatically register a Pig Script. Key: PIG-572 URL: https://issues.apache.org/jira/browse/PIG-572 Project: Pig Issue Type: New Feature Affects Versions: 0.2.0 Reporter: Shubham Chopra Assignee: Shubham Chopra Priority: Minor Fix For: 0.2.0 Attachments: registerScript.patch A PigServer.registerScript() method, which lets a client programmatically register a Pig Script. For example, say there's a script my_script.pig with the following content: a = load '/data/my_data.txt'; b = filter a by $0 > '0'; The function lets you use something like the following: pigServer.registerScript("my_script.pig"); pigServer.registerQuery("c = foreach b generate $2, $3;"); pigServer.store("c"); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
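As a rough illustration of what such a method might do (this is not the code in registerScript.patch): read the script text and hand each ';'-terminated statement to a registerQuery-style call. A real implementation needs to be smarter than this naive split, e.g. around semicolons inside string literals:

```java
import java.util.ArrayList;
import java.util.List;

public class RegisterScriptDemo {
    // Hypothetical sketch: break a script's text into statements that
    // would each be passed to pigServer.registerQuery(stmt).
    static List<String> splitStatements(String script) {
        List<String> out = new ArrayList<>();
        for (String stmt : script.split(";")) {
            String s = stmt.trim();
            if (!s.isEmpty()) out.add(s + ";");
        }
        return out;
    }

    public static void main(String[] args) {
        String script =
            "a = load '/data/my_data.txt';\n" +
            "b = filter a by $0 > '0';\n";
        for (String stmt : splitStatements(script)) {
            System.out.println(stmt); // each would go to registerQuery
        }
    }
}
```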
[jira] Assigned: (PIG-574) run command for grunt
[ https://issues.apache.org/jira/browse/PIG-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-574: -- Assignee: Olga Natkovich run command for grunt - Key: PIG-574 URL: https://issues.apache.org/jira/browse/PIG-574 Project: Pig Issue Type: New Feature Components: grunt Reporter: David Ciemiewicz Assignee: Olga Natkovich Priority: Minor Attachments: PIG-574.patch, run_command.patch, run_command_params.patch, run_command_params_021109.patch This is a request for a run file command in grunt which will read a script from the local file system and execute the script interactively while in the grunt shell. One of the things that slows down iterative development of large, complicated Pig scripts that must operate on hadoop fs data is that the edit, run, debug cycle is slow because I must wait to allocate a Hadoop-on-Demand (hod) cluster for each iteration. I would prefer not to preallocate a cluster of nodes (though I could). Instead, I'd like to have one window open and edit my Pig script using vim or emacs, write it, and then type run myscript.pig at the grunt shell until I get things right. I'm used to doing similar things with Oracle, MySQL, and R. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-627: -- Assignee: Gunther Hagleitner PERFORMANCE: multi-query optimization - Key: PIG-627 URL: https://issues.apache.org/jira/browse/PIG-627 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Gunther Hagleitner Fix For: 0.3.0 Attachments: doc-fix.patch, error_handling_0415.patch, error_handling_0416.patch, file_cmds-0305.patch, fix_store_prob.patch, merge-041409.patch, merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, multiquery-phase3_0423.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, non_reversible_store_load_dependencies_2.patch, noop_filter_absolute_path_flag.patch, noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch Currently, if your Pig script contains multiple stores and some shared computation, Pig will execute several independent queries. For instance: A = load 'data' as (a, b, c); B = filter A by a > 5; store B into 'output1'; C = group B by b; store C into 'output2'; This script will result in a map-only job that generates output1, followed by a map-reduce job that generates output2. As a result, the data is read, parsed, and filtered twice, which is unnecessary and costly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
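The effect the optimization is after can be sketched outside Pig: compute the shared prefix (load + filter) once and feed both outputs from the same pass, instead of two jobs that each re-read and re-filter the data. A minimal Java illustration (names and data are made up for the example, not part of the patches):

```java
import java.util.HashMap;
import java.util.Map;

public class MultiQueryDemo {
    // Counts per b among rows passing the shared filter a > 5, in a
    // single pass over the input. output1 (the filtered rows themselves)
    // would be collected in the same loop body; only the grouped counts
    // (output2's content) are returned here.
    static Map<Integer, Integer> countPerB(int[][] data) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int[] row : data) {            // single read of 'data'
            if (row[0] > 5) {               // shared filter: a > 5
                counts.merge(row[1], 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[][] data = {{6, 1}, {3, 2}, {9, 1}, {7, 2}};
        System.out.println(countPerB(data)); // prints "{1=2, 2=1}"
    }
}
```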
[jira] Commented: (PIG-1088) change merge join and merge join indexer to work with new LoadFunc interface
[ https://issues.apache.org/jira/browse/PIG-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776792#action_12776792 ] Thejas M Nair commented on PIG-1088: *Problem*: With the old load/store interface, the index created by MergeJoinIndexer consisted of tuples with join key(s), filename, offset. With the new load/store interface, the split index is available (RecordReader.getSplitIndex) instead of filename and offset. But there is no guarantee that split indexes are in sorted order of the file. If more than one split has tuples with the same join key in it, it is necessary to know which split needs to be read first. *Proposal*: (thanks to Alan Gates) We should add an interface to the list of load interfaces:
{code}
public interface LoadOrderedInput {
    WritableComparable getPosition();
}
{code}
If the load function implements this interface it can then be used in a merge join. This getPosition call could then be called in the map phase of the sampling MR job, and the tuples in the index will have the sort(/join) key(s) followed by the resulting value. This value will then be used when sorting the index in the reduce phase of the sampling MR job. For LoadFuncs that use FileInputFormat, getPosition can return the following class:
{code}
public class TextInputOrder implements WritableComparable {
    private String basename; // basename of the file
    private long offset;     // offset at which this split starts

    public int compareTo(TextInputOrder other) {
        int rc = basename.compareTo(other.basename);
        if (rc == 0)
            rc = (offset < other.offset) ? -1 : ((offset == other.offset) ? 0 : 1);
        return rc;
    }
}
{code}
This means that we would take the filenames sorted lexicographically (which will work for things like part-0, map-0, bucket001 (warehouse data), etc.) and then offsets into those files after that. To make it easier for authors of new LoadFuncs to implement this interface, an implementation of this interface for load functions that use FileInputFormat will be provided through an abstract base class. 
change merge join and merge join indexer to work with new LoadFunc interface Key: PIG-1088 URL: https://issues.apache.org/jira/browse/PIG-1088 Project: Pig Issue Type: Sub-task Reporter: Thejas M Nair Assignee: Thejas M Nair -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
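The proposed ordering (basename lexicographically, then offset within the file) can be exercised on its own. A self-contained sketch of the comparator, without Hadoop's Writable plumbing (the class shape follows the proposal in the comment above, but this is not the eventual patch):

```java
public class SplitOrderDemo {
    // Stand-in for the proposed TextInputOrder, minus WritableComparable's
    // serialization methods: order by basename first, then by offset.
    static class TextInputOrder implements Comparable<TextInputOrder> {
        final String basename;
        final long offset;

        TextInputOrder(String basename, long offset) {
            this.basename = basename;
            this.offset = offset;
        }

        public int compareTo(TextInputOrder other) {
            int rc = basename.compareTo(other.basename);
            if (rc == 0) rc = Long.compare(offset, other.offset);
            return rc;
        }
    }

    public static void main(String[] args) {
        TextInputOrder a = new TextInputOrder("part-00000", 4096);
        TextInputOrder b = new TextInputOrder("part-00000", 0);
        TextInputOrder c = new TextInputOrder("part-00001", 0);
        System.out.println(b.compareTo(a) < 0); // earlier offset in the same file sorts first
        System.out.println(a.compareTo(c) < 0); // part-00000 sorts before part-00001
    }
}
```

This is why lexicographic filenames like part-00000/part-00001 give the right split order for FileInputFormat-based loaders.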