[jira] Created: (PIG-1301) Problem pruning columns with UDF
Problem pruning columns with UDF
Key: PIG-1301 URL: https://issues.apache.org/jira/browse/PIG-1301 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Andrew Groh

I just upgraded to pig 0.6.0. I have a pig file like:

{code}
raw = load 'foo.csv' using PigStorage() as (field1:chararray, field2:chararray);
define contains com.mycompany.pig.Contains();
rawactions = foreach raw generate contains(field1, field2) as junk, field1;
reqcnt = foreach rawactions generate field1;
dump reqcnt;
{code}

When I try to run this code, I get an error:

{code}
Problem with input: (Name: Project 1-40 Projections: [1] Overloaded: false Operator Key: 1-40) of User-defined function: (Name: UserFunc 1-39 function: com.mycompany.pig.Contains Operator Key: 1-39)
Thrown from line 98 of LOUserFunction.java
This was caused by another FrontEndException
Attempt to access field: 1 from schema: {field1: chararray} from Schema.java
{code}

I also investigated changing the pig code. If you change

{code}
rawactions = foreach raw generate contains(field1, field2) as junk, field1;
{code}

to

{code}
rawactions = foreach raw generate contains(field2, field2) as junk, field1;
{code}

or if you change

{code}
reqcnt = foreach rawactions generate field1;
{code}

to

{code}
reqcnt = foreach rawactions generate field1, junk;
{code}

it all works correctly. The problem appears to be that the optimizer prunes out field2 but then gets confused and does not prune the plan associated with the UDF contains, since field1 is not pruned. So if the UDF references only field2, its plan gets removed; if it references only field1, the field has not been pruned and the script can run. The failure shows up when the UDF references both a pruned and an unpruned column.
I eventually tracked this down to the code around line 947 of LOForEach.java:

{code}
for (LOProject loProject : projectFinder.getProjectSet()) {
    Pair<Integer, Integer> pair = new Pair<Integer, Integer>(0, loProject.getCol());
    if (!columns.contains(pair)) {
        allPruned = false;
        break;
    }
}
if (allPruned) {
    planToRemove.add(i);
}
{code}

In the example pig, allPruned is false for the plan associated with the UDF. This is because field1 is both a column for the UDF and for the ForEach in general. Since field1 is not pruned, the plan is not removed and bad things happen later. I don't really understand the pruning code all that well, so I don't have a fix for it. I hope that it will be clear to someone who understands this code better. I can provide a better test case for this if necessary.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
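The all-columns-pruned check described above can be contrasted with an any-column-pruned variant in a standalone sketch (hypothetical code, not the actual LOForEach internals; the column sets are made up to mirror the example script, where the UDF's inner plan projects columns 0 and 1 and only column 1 is pruned):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not the real LOForEach code: the UDF's inner plan
// projects columns 0 (field1) and 1 (field2); only column 1 is pruned.
public class PruneCheckSketch {

    // 0.6.0 behavior: remove the plan only if EVERY projected column is pruned
    static boolean allPruned(Set<Integer> projected, Set<Integer> pruned) {
        for (int col : projected) {
            if (!pruned.contains(col)) {
                return false;
            }
        }
        return true;
    }

    // alternative: remove the plan if ANY projected column is pruned
    static boolean anyPruned(Set<Integer> projected, Set<Integer> pruned) {
        for (int col : projected) {
            if (pruned.contains(col)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<Integer> udfColumns = new HashSet<>(Arrays.asList(0, 1));
        Set<Integer> prunedColumns = Collections.singleton(1);
        // false: the plan is kept even though its field2 input is gone
        System.out.println(allPruned(udfColumns, prunedColumns));
        // true: the plan is removed, avoiding the dangling field2 reference
        System.out.println(anyPruned(udfColumns, prunedColumns));
    }
}
```

This is exactly the situation in the report: field1 keeps allPruned false, so the broken plan survives.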
[jira] Updated: (PIG-1292) Interface Refinements
[ https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1292: -- Resolution: Fixed Status: Resolved (was: Patch Available) Patch checked-in with changes suggested in previous comment. Core test failure reported by hudson was transient. It passed on my machine. Interface Refinements - Key: PIG-1292 URL: https://issues.apache.org/jira/browse/PIG-1292 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Fix For: 0.7.0 Attachments: pig-1292.patch, pig-interfaces.patch A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both are abstract classes instead of being interfaces. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1257) PigStorage per the new load-store redesign should support splitting of bzip files
[ https://issues.apache.org/jira/browse/PIG-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846027#action_12846027 ] Benjamin Reed commented on PIG-1257:
-
excellent work pradeep. just one minor thing: you always append a \n before inputData in your test case, so you never test the case when you end with just \r

PigStorage per the new load-store redesign should support splitting of bzip files
Key: PIG-1257 URL: https://issues.apache.org/jira/browse/PIG-1257 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: 0.7.0 Attachments: blockEndingInCR.txt.bz2, blockHeaderEndsAt136500.txt.bz2, PIG-1257-2.patch, PIG-1257-3.patch, PIG-1257.patch, recordLossblockHeaderEndsAt136500.txt.bz2

PigStorage implemented per new load-store-redesign (PIG-966) is based on TextInputFormat for reading data. TextInputFormat has support for reading bzip data but without support for splitting bzip files. In pig 0.6, splitting was enabled for bzip files - we should attempt to enable that feature.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1258) [zebra] Number of sorted input splits is unusually high
[ https://issues.apache.org/jira/browse/PIG-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1258: -- Attachment: PIG-1258.patch

[zebra] Number of sorted input splits is unusually high
Key: PIG-1258 URL: https://issues.apache.org/jira/browse/PIG-1258 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Yan Zhou Attachments: PIG-1258.patch

Number of sorted input splits is unusually high if the projections are on multiple column groups, or a union of tables, or column group(s) that hold many small tfiles. In one test, the number is about 100 times bigger than that from unsorted input splits on the same input tables.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1257) PigStorage per the new load-store redesign should support splitting of bzip files
[ https://issues.apache.org/jira/browse/PIG-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846038#action_12846038 ] Pradeep Kamath commented on PIG-1257:
-
In the following case in inputData the record will end with \r won't it? (notice the \r in the middle after 2)
{code}
1\t2\r3\t4, // '\r' case - this will be split into two tuples
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
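The splitting behavior under discussion can be shown with a toy stand-in (this is not Hadoop's LineReader, only a mimic of the rule that '\n', '\r', and "\r\n" each terminate a record):

```java
// Toy stand-in for the record-splitting rule under discussion: '\n',
// '\r', and "\r\n" each terminate a record, so "1\t2\r3\t4" holds two
// records even though it contains no '\n'. Not Hadoop's actual LineReader.
public class CrSplitDemo {

    static String[] splitRecords(String data) {
        // order matters: match "\r\n" before the single-character endings
        return data.split("\r\n|\r|\n");
    }

    public static void main(String[] args) {
        String[] records = splitRecords("1\t2\r3\t4");
        System.out.println(records.length); // 2
    }
}
```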
Branching for Pig 0.7.0 release
Hi, It has been a few weeks since we merged the Load-Store redesign changes into the trunk. We have been doing a lot of testing and fixing bugs. I think it is time to branch the code in preparation for the Pig 0.7.0 release. Unless I hear objections, I will do this next Monday, 3/22. Olga
[jira] Commented: (PIG-1257) PigStorage per the new load-store redesign should support splitting of bzip files
[ https://issues.apache.org/jira/browse/PIG-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846080#action_12846080 ] Pradeep Kamath commented on PIG-1257:
-
I ran all unit tests on my local machines and also the test-patch ant target:
[exec] +1 overall.
[exec] +1 @author. The patch does not contain any @author tags.
[exec] +1 tests included. The patch appears to include 12 new or modified tests.
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1257) PigStorage per the new load-store redesign should support splitting of bzip files
[ https://issues.apache.org/jira/browse/PIG-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846083#action_12846083 ] Benjamin Reed commented on PIG-1257:
-
+1 you are right. thanx pradeep. i think it is ready to commit.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1301) Problem pruning columns with UDF
[ https://issues.apache.org/jira/browse/PIG-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-1301. - Resolution: Fixed Fix Version/s: 0.7.0

Problem pruning columns with UDF
Key: PIG-1301 URL: https://issues.apache.org/jira/browse/PIG-1301 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Andrew Groh Fix For: 0.7.0
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1301) Problem pruning columns with UDF
[ https://issues.apache.org/jira/browse/PIG-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846087#action_12846087 ] Daniel Dai commented on PIG-1301:
-
Thanks for reporting. I tried the script on trunk and it seems we have already fixed this there. The code you mentioned does have the problem you describe, but on trunk we have already changed it to:

{code}
boolean anyPruned = false;
for (LOProject loProject : projectFinder.getProjectSet()) {
    Pair<Integer, Integer> pair = new Pair<Integer, Integer>(0, loProject.getCol());
    if (columns.contains(pair)) {
        anyPruned = true;
        break;
    }
}
{code}

The fix will come with the next Pig release.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1289) PIG Join fails while doing a filter on joined data
[ https://issues.apache.org/jira/browse/PIG-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1289: Fix Version/s: 0.7.0

PIG Join fails while doing a filter on joined data
Key: PIG-1289 URL: https://issues.apache.org/jira/browse/PIG-1289 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Karim Saadah Assignee: Daniel Dai Priority: Minor Fix For: 0.7.0 Attachments: PIG-1289-1.patch

PIG Join fails while doing a filter on joined data. Here are the steps to reproduce it:

{code}
-bash-3.1$ pig -latest -x local
grunt> a = load 'first.dat' using PigStorage('\u0001') as (f1:int, f2:chararray);
grunt> DUMP a;
(1,A)
(2,B)
(3,C)
(4,D)
grunt> b = load 'second.dat' using PigStorage() as (f3:chararray);
grunt> DUMP b;
(A)
(D)
(E)
grunt> c = join a by f2 LEFT OUTER, b by f3;
grunt> DUMP c;
(1,A,A)
(2,B,)
(3,C,)
(4,D,D)
grunt> describe c;
c: {a::f1: int,a::f2: chararray,b::f3: chararray}
grunt> d = filter c by (f3 is null or f3 == '');
grunt> dump d;
2010-03-03 15:00:37,129 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for b
2010-03-03 15:00:37,129 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned for b
2010-03-03 15:00:37,129 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for a
2010-03-03 15:00:37,130 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned for a
2010-03-03 15:00:37,130 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1002: Unable to store alias d
{code}

This one is failing too:

{code}
grunt> d = filter c by (b::f3 is null or b::f3 == '');
{code}

And this one does not return results as expected:

{code}
grunt> d = foreach c generate f1 as f1, f2 as f2, f3 as f3;
grunt> e = filter d by (f3 is null or f3 == '');
grunt> DUMP e;
(1,A,)
(2,B,)
(3,C,)
(4,D,)
{code}

while the expected result is

{code}
(2,B,)
(3,C,)
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1284) pig UDF is lacking XMLLoader. Plan to add the XMLLoader
[ https://issues.apache.org/jira/browse/PIG-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846111#action_12846111 ] Alok Singh commented on PIG-1284:
-
Hi. As mentioned earlier, I have run the test locally and it is passing. The timeout issue is not related to this. Can a moderator review my patch and commit it? Thanks, Alok

pig UDF is lacking XMLLoader. Plan to add the XMLLoader
Key: PIG-1284 URL: https://issues.apache.org/jira/browse/PIG-1284 Project: Pig Issue Type: New Feature Affects Versions: 0.7.0 Reporter: Alok Singh Fix For: 0.7.0 Attachments: pigudf_xmlLoader.patch, pigudf_xmlLoader.patch Original Estimate: 168h Remaining Estimate: 168h

Hi All, We are planning to add the XMLLoader UDF in the piggybank repository. Here is the proposal with the user docs:

The load function to load the XML file. This implements the LoadFunc interface, which is used to parse records from a dataset. It takes an xmlTag as the argument, which it uses to split the input dataset into multiple records. For example, if the input xml (input.xml) is like this:

{code}
<configuration>
  <property>
    <name>foobar</name>
    <value>barfoo</value>
  </property>
  <ignoreProperty>
    <name>foo</name>
  </ignoreProperty>
  <property>
    <name>justname</name>
  </property>
</configuration>
{code}

And your pig script is like this:

{code}
--load the jar files
register loader.jar;
-- load the dataset using XMLLoader
-- A is the bag containing the tuple which contains one atom i.e doc, see output
A = load '/user/aloks/pig/input.xml' using loader.XMLLoader('property') as (doc:chararray);
--dump the result
dump A;
{code}

Then you will get the output:

{code}
(<property><name>foobar</name><value>barfoo</value></property>)
(<property><name>justname</name></property>)
{code}

Where each () indicates one record.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
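The record-splitting idea in the proposal can be sketched outside Pig (hypothetical illustration only - the actual XMLLoader in the patch parses the input stream, not an in-memory string via regex):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the idea behind XMLLoader: every
// <tag>...</tag> span becomes one record. The real patch works on the
// input stream; this regex version only demonstrates the split.
public class XmlTagSplitter {

    static List<String> split(String xml, String tag) {
        List<String> records = new ArrayList<>();
        Pattern p = Pattern.compile("<" + tag + ">.*?</" + tag + ">", Pattern.DOTALL);
        Matcher m = p.matcher(xml);
        while (m.find()) {
            records.add(m.group());
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<configuration>"
                + "<property><name>foobar</name><value>barfoo</value></property>"
                + "<ignoreProperty><name>foo</name></ignoreProperty>"
                + "<property><name>justname</name></property>"
                + "</configuration>";
        // only the two <property> spans are emitted; <ignoreProperty> is skipped
        for (String record : split(xml, "property")) {
            System.out.println("(" + record + ")");
        }
    }
}
```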
[jira] Created: (PIG-1302) Include zebra's
Include zebra's Key: PIG-1302 URL: https://issues.apache.org/jira/browse/PIG-1302 Project: Pig Issue Type: Improvement Reporter: Pradeep Kamath -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1302) Include zebra's pigtest ant target as a part of pig's ant test target
[ https://issues.apache.org/jira/browse/PIG-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-1302: Description: There are changes made in Pig interfaces which break zebra loaders/storers. It would be good to run the pig tests in the zebra unit tests as part of running pig's core-test for each patch submission. So essentially in the test ant target in pig, we would need to invoke zebra's pigtest target. Affects Version/s: 0.7.0 Fix Version/s: 0.7.0 Summary: Include zebra's pigtest ant target as a part of pig's ant test target (was: Include zebra's ) Include zebra's pigtest ant target as a part of pig's ant test target --- Key: PIG-1302 URL: https://issues.apache.org/jira/browse/PIG-1302 Project: Pig Issue Type: Improvement Affects Versions: 0.7.0 Reporter: Pradeep Kamath Fix For: 0.7.0 There are changes made in Pig interfaces which break zebra loaders/storers. It would be good to run the pig tests in the zebra unit tests as part of running pig's core-test for each patch submission. So essentially in the test ant target in pig, we would need to invoke zebra's pigtest target. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1284) pig UDF is lacking XMLLoader. Plan to add the XMLLoader
[ https://issues.apache.org/jira/browse/PIG-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846130#action_12846130 ] Alan Gates commented on PIG-1284:
-
I'll take a look at the patch.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1302) Include zebra's pigtest ant target as a part of pig's ant test target
[ https://issues.apache.org/jira/browse/PIG-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846179#action_12846179 ] Alan Gates commented on PIG-1302:
-
-1. Pig must build independent of its contrib projects. I'm fine with changing the hudson process to run some of Zebra's tests as well. But ant test at the Pig level should not invoke Zebra.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1302) Include zebra's pigtest ant target as a part of pig's ant test target
[ https://issues.apache.org/jira/browse/PIG-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846184#action_12846184 ] Olga Natkovich commented on PIG-1302:
-
That's the approach we initially favored, but according to Giri this is not the way hadoop is doing this, and we wanted to be consistent with them.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1257) PigStorage per the new load-store redesign should support splitting of bzip files
[ https://issues.apache.org/jira/browse/PIG-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-1257: Resolution: Fixed Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) Patch committed.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1302) Include zebra's pigtest ant target as a part of pig's ant test target
[ https://issues.apache.org/jira/browse/PIG-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846194#action_12846194 ] Alan Gates commented on PIG-1302:
-
I still maintain my -1. It just seems wrong for main projects to depend on their contribs. 99% of Pig users (counting by organization, not by individual users) don't care about Zebra. Making them test Zebra in addition to Pig is not helpful for them. Perhaps we could add a test-stack target or something that tests Pig plus its contrib projects and have hudson call that.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
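A test-stack target along the lines Alan suggests might look like the following hypothetical build.xml fragment (the target name and contrib path are assumptions, not anything committed):

```xml
<!-- Hypothetical sketch only: runs Pig's own tests first, then delegates
     to zebra's pigtest target, so plain "ant test" stays contrib-free. -->
<target name="test-stack" depends="test"
        description="Run Pig tests plus contrib project tests">
  <ant dir="${basedir}/contrib/zebra" target="pigtest" inheritAll="false"/>
</target>
```

Hudson could then call test-stack while developers keep using test.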
[jira] Commented: (PIG-1302) Include zebra's pigtest ant target as a part of pig's ant test target
[ https://issues.apache.org/jira/browse/PIG-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846197#action_12846197 ] Olga Natkovich commented on PIG-1302:
-
The idea is that we don't want to commit things that break contrib projects, and that's why integrating it into test rather than another target makes sense. I am fine re-visiting this issue with Giri and just adding it to the test-patch process, though it seems that the end result is exactly the same - you can't commit patches that break contrib projects.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1287) Use hadoop-0.20.2 with pig 0.7.0 release
[ https://issues.apache.org/jira/browse/PIG-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-1287: Attachment: PIG-1287-2.patch The new patch also fixes warning aggregation in PigHadoopLogger to use the counter support now available in hadoop 0.20.2 Use hadoop-0.20.2 with pig 0.7.0 release Key: PIG-1287 URL: https://issues.apache.org/jira/browse/PIG-1287 Project: Pig Issue Type: Task Affects Versions: 0.7.0 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: 0.7.0 Attachments: hadoop20.jar, PIG-1287-2.patch, PIG-1287.patch Use hadoop-0.20.2 with pig 0.7.0 release -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
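The patch itself is not shown in the thread; purely as a hypothetical illustration of the warning-aggregation idea (tally each warning kind instead of logging every occurrence - the real PigHadoopLogger reports through Hadoop 0.20.2's counter API, not a local map, and the warning names below are invented):

```java
import java.util.EnumMap;

// Hypothetical illustration of warning aggregation: count warnings by
// kind rather than emitting each one. The real PigHadoopLogger feeds
// such tallies into Hadoop counters; this local map is only a stand-in.
public class WarningCounters {

    enum Warn { DIVIDE_BY_ZERO, FIELD_DISCARDED } // invented kinds

    private final EnumMap<Warn, Long> counts = new EnumMap<>(Warn.class);

    void warn(Warn kind) {
        counts.merge(kind, 1L, Long::sum);
    }

    long get(Warn kind) {
        return counts.getOrDefault(kind, 0L);
    }

    public static void main(String[] args) {
        WarningCounters logger = new WarningCounters();
        logger.warn(Warn.DIVIDE_BY_ZERO);
        logger.warn(Warn.DIVIDE_BY_ZERO);
        logger.warn(Warn.FIELD_DISCARDED);
        System.out.println(logger.get(Warn.DIVIDE_BY_ZERO)); // 2
    }
}
```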
[jira] Updated: (PIG-1287) Use hadoop-0.20.2 with pig 0.7.0 release
[ https://issues.apache.org/jira/browse/PIG-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-1287: Status: Patch Available (was: Open) Use hadoop-0.20.2 with pig 0.7.0 release Key: PIG-1287 URL: https://issues.apache.org/jira/browse/PIG-1287 Project: Pig Issue Type: Task Affects Versions: 0.7.0 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: 0.7.0 Attachments: hadoop20.jar, PIG-1287-2.patch, PIG-1287.patch Use hadoop-0.20.2 with pig 0.7.0 release -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1289) PIG Join fails while doing a filter on joined data
[ https://issues.apache.org/jira/browse/PIG-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846222#action_12846222 ] Daniel Dai commented on PIG-1289:
-
Yes, it is safe not to push a filter up a branch that will be producing nulls. I might be wrong, but what I did is try to be a little bit more aggressive. Since the only extra value an outer join will produce is null, if the filter is not testing for null we can still push it up even if it is on the inner branch. Eg:

{code}
A = load 'foo' as (q, r, s);
B = load 'bar' as (t, u, v);
C = join A on q outer, B on t;
D = filter C by t > 0;
{code}

The production C consists of two parts: A + B, and A + null. If we do a filter after the join, it is a union on these two parts: filter(A + B) union filter(A + null). If we are not testing nullability (eg, t > 0), then filter(A + null) will not have any production, so filter(A + B) union filter(A + null) = filter(A + B). In this case, the outer join is equivalent to a regular join (since all generated null B records are filtered away), so we can still push the filter up.
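The argument above can be checked with a toy join (illustration only, not Pig code; single-column relations stand in for A and B, and the predicate t > 0 rejects nulls):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the reasoning above: filtering a left outer join with a
// null-rejecting predicate keeps exactly the rows an inner join would
// keep, so the filter can be pushed past the outer join.
public class OuterJoinFilterSketch {

    // left outer join of a's keys against b: each row is (q, t-or-null)
    static List<Integer[]> leftOuterJoin(List<Integer> a, Set<Integer> b) {
        List<Integer[]> rows = new ArrayList<>();
        for (int q : a) {
            rows.add(new Integer[] { q, b.contains(q) ? q : null });
        }
        return rows;
    }

    // "filter C by t > 0": rejects the A + null part entirely
    static List<Integer[]> filterPositiveT(List<Integer[]> rows) {
        List<Integer[]> kept = new ArrayList<>();
        for (Integer[] row : rows) {
            if (row[1] != null && row[1] > 0) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Integer> a = Arrays.asList(1, 2, 3);
        Set<Integer> b = new HashSet<>(Arrays.asList(2, 3));
        // only the matched rows (the A + B part) survive the filter
        System.out.println(filterPositiveT(leftOuterJoin(a, b)).size()); // 2
    }
}
```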
PIG Join fails while doing a filter on joined data
--------------------------------------------------
                Key: PIG-1289
                URL: https://issues.apache.org/jira/browse/PIG-1289
            Project: Pig
         Issue Type: Bug
   Affects Versions: 0.6.0
           Reporter: Karim Saadah
           Assignee: Daniel Dai
           Priority: Minor
            Fix For: 0.7.0
        Attachments: PIG-1289-1.patch

PIG Join fails while doing a filter on joined data. Here are the steps to reproduce it:

-bash-3.1$ pig -latest -x local
grunt> a = load 'first.dat' using PigStorage('\u0001') as (f1:int, f2:chararray);
grunt> DUMP a;
(1,A)
(2,B)
(3,C)
(4,D)
grunt> b = load 'second.dat' using PigStorage() as (f3:chararray);
grunt> DUMP b;
(A)
(D)
(E)
grunt> c = join a by f2 LEFT OUTER, b by f3;
grunt> DUMP c;
(1,A,A)
(2,B,)
(3,C,)
(4,D,D)
grunt> describe c;
c: {a::f1: int,a::f2: chararray,b::f3: chararray}
grunt> d = filter c by (f3 is null or f3 =='');
grunt> dump d;
2010-03-03 15:00:37,129 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for b
2010-03-03 15:00:37,129 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned for b
2010-03-03 15:00:37,129 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for a
2010-03-03 15:00:37,130 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned for a
2010-03-03 15:00:37,130 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1002: Unable to store alias d

This one is failing too:

grunt> d = filter c by (b::f3 is null or b::f3 =='');

or this one not returning results as expected:

grunt> d = foreach c generate f1 as f1, f2 as f2, f3 as f3;
grunt> e = filter d by (f3 is null or f3 =='');
grunt> DUMP e;
(1,A,)
(2,B,)
(3,C,)
(4,D,)

while the expected result is

(2,B,)
(3,C,)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
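Daniel's equivalence argument above can be checked outside Pig: when the predicate rejects nulls, filtering after a left outer join yields the same records as filtering after an inner join. A small Python sketch with made-up data (an illustration only, not Pig's implementation):

```python
def left_outer_join(left, right, key_l, key_r):
    """Left outer join of two lists of dicts; unmatched left rows get a None field."""
    out = []
    for l in left:
        matches = [r for r in right if r[key_r] == l[key_l]]
        if matches:
            out.extend({**l, **r} for r in matches)   # inner part: A + B
        else:
            out.append({**l, key_r: None})            # outer part: A + null
    return out

def inner_join(left, right, key_l, key_r):
    return [{**l, **r} for l in left for r in right if r[key_r] == l[key_l]]

A = [{"q": 1}, {"q": 2}]
B = [{"t": 1}, {"t": 3}]

# Null-rejecting predicate: evaluates False (not True) when t is null/None,
# so the "A + null" part contributes nothing after filtering.
pred = lambda row: row["t"] is not None and row["t"] > 0

outer_then_filter = [r for r in left_outer_join(A, B, "q", "t") if pred(r)]
inner_then_filter = [r for r in inner_join(A, B, "q", "t") if pred(r)]
assert outer_then_filter == inner_then_filter
```

This is exactly why the pushdown is safe only for null-rejecting predicates; a predicate like `t is null` would keep the outer part and break the equivalence.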
[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846224#action_12846224 ]

Pradeep Kamath commented on PIG-1205:
-------------------------------------

Jeff, if the only issue blocking the commit is the javac warning: unless the warning is due to use of a deprecated hadoop API, we should fix it; if it is due to a deprecated hadoop API, then it's ok to ignore. Very soon trunk will be branched for Pig 0.7.0, so if this feature is to be in Pig 0.7.0, we should get it committed soon.

Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
                Key: PIG-1205
                URL: https://issues.apache.org/jira/browse/PIG-1205
            Project: Pig
         Issue Type: Sub-task
   Affects Versions: 0.7.0
           Reporter: Jeff Zhang
           Assignee: Jeff Zhang
            Fix For: 0.7.0
        Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1293) pig wrapper script tends to fail if pig is in the path and PIG_HOME isn't set
[ https://issues.apache.org/jira/browse/PIG-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846226#action_12846226 ]

Alan Gates commented on PIG-1293:
---------------------------------

Allen, I'm having trouble reproducing this issue, so I'm not sure how to test your fix. If I take top of trunk and install it, then do:

{code}
gates> echo $PIG_HOME
gates> PATH=/usr/bin:/usr/local/bin:/bin:./bin which pig
/home/gates/tmp/pig-0.7.0-dev/bin/pig
gates> PATH=/usr/bin:/usr/local/bin:/bin:./bin pig -x local ~/pig/scripts/Checkin_2.local.pig
10/03/16 17:09:24 INFO pig.Main: Logging error messages to: /home/gates/tmp/pig-0.7.0-dev/pig_1268784564902.log
2010-03-16 17:09:25,205 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2010-03-16 17:09:26,047 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_INT 2 time(s).
...
{code}

What am I doing wrong here?

pig wrapper script tends to fail if pig is in the path and PIG_HOME isn't set
                Key: PIG-1293
                URL: https://issues.apache.org/jira/browse/PIG-1293
            Project: Pig
         Issue Type: Bug
   Affects Versions: 0.6.0
           Reporter: Allen Wittenauer
        Attachments: PIG-1293.txt

If PIG_HOME isn't set and pig is in the path, the pig wrapper script can't find its home. Setting PIG_HOME makes it hard to support multiple versions of pig.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1289) PIG Join fails while doing a filter on joined data
[ https://issues.apache.org/jira/browse/PIG-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846228#action_12846228 ]

Alan Gates commented on PIG-1289:
---------------------------------

In the case of D = filter C by t > 0, the filter will evaluate to null when t is null. By definition, filters return only records that evaluate to true. So t > 0 will have the effect of filtering out all outer records of A, because t will be null for every one of them. That is, it turns the join into an inner join. However, if the filter is pushed above the join, it will remain an outer join, since it will only filter the records from B where t > 0 and not the outer records from A. Thus this transformation is not output neutral.

PIG Join fails while doing a filter on joined data
                Key: PIG-1289
                URL: https://issues.apache.org/jira/browse/PIG-1289
            Project: Pig
         Issue Type: Bug
   Affects Versions: 0.6.0
           Reporter: Karim Saadah
           Assignee: Daniel Dai
           Priority: Minor
            Fix For: 0.7.0
        Attachments: PIG-1289-1.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1150) VAR() Variance UDF
[ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846229#action_12846229 ]

Olga Natkovich commented on PIG-1150:
-------------------------------------

We would like to cut the release branch next Monday. This means that the code needs to be committed by the end of the week. Is this likely to happen? If not, I would like to unlink this from the 0.7.0 release and leave it for inclusion in one of the future releases when the patch is ready.

VAR() Variance UDF
                Key: PIG-1150
                URL: https://issues.apache.org/jira/browse/PIG-1150
            Project: Pig
         Issue Type: New Feature
   Affects Versions: 0.5.0
        Environment: UDF, written in Pig 0.5 contrib/
           Reporter: Russell Jurney
            Fix For: 0.7.0
        Attachments: var.patch

I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in a distributed manner, based on the AVG() builtin. It works by calculating the count, sum and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
Is this a worthwhile contribution? Taking the square root of this value using the contrib SQRT() function gives Standard Deviation, which is missing from Pig.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
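The parallel scheme the UDF description references (combine per-partition count, sum, and sum of squares, per the linked Wikipedia algorithm) can be sketched in Python. This illustrates the algorithm only, not the actual var.patch code:

```python
def partial(xs):
    # Per-partition statistics computed map-side: (count, sum, sum of squares)
    return (len(xs), sum(xs), sum(x * x for x in xs))

def combine(p1, p2):
    # Partials combine by component-wise addition; this associativity is
    # what lets the UDF implement the Algebraic interface.
    return (p1[0] + p2[0], p1[1] + p2[1], p1[2] + p2[2])

def variance(p):
    # Finalize: population variance = E[X^2] - E[X]^2
    n, s, ss = p
    return ss / n - (s / n) ** 2

data = [1.0, 2.0, 3.0, 4.0]
# Split across two "mappers", combine, then finalize
v = variance(combine(partial(data[:2]), partial(data[2:])))
assert abs(v - variance(partial(data))) < 1e-9  # same as single-pass result
```

As the report notes, taking the square root of this value yields the standard deviation. (Numerically, the E[X^2] - E[X]^2 form can lose precision for large means; the Wikipedia page also gives a more stable pairwise formulation.)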
[jira] Updated: (PIG-1284) pig UDF is lacking XMLLoader. Plan to add the XMLLoader
[ https://issues.apache.org/jira/browse/PIG-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1284:
--------------------------------

Since we are planning to branch for release next Monday, 3/22, it needs to be ready to be committed by the end of the week. Otherwise, we should schedule it for the next release. Please update the target version accordingly.

pig UDF is lacking XMLLoader. Plan to add the XMLLoader
                Key: PIG-1284
                URL: https://issues.apache.org/jira/browse/PIG-1284
            Project: Pig
         Issue Type: New Feature
   Affects Versions: 0.7.0
           Reporter: Alok Singh
            Fix For: 0.7.0
        Attachments: pigudf_xmlLoader.patch, pigudf_xmlLoader.patch
  Original Estimate: 168h
 Remaining Estimate: 168h

Hi All,
We are planning to add the XMLLoader UDF in the piggybank repository. Here is the proposal with the user docs:

The load function to load the XML file. This implements the LoadFunc interface, which is used to parse records from a dataset. It takes an xmlTag as the argument, which it uses to split the input dataset into multiple records. For example, if the input xml (input.xml) is like this:

<configuration>
  <property>
    <name>foobar</name>
    <value>barfoo</value>
  </property>
  <ignoreProperty>
    <name>foo</name>
  </ignoreProperty>
  <property>
    <name>justname</name>
  </property>
</configuration>

And your pig script is like this:

--load the jar files
register loader.jar;
-- load the dataset using XMLLoader
-- A is the bag containing the tuple which contains one atom i.e. doc (see output)
A = load '/user/aloks/pig/input.xml' using loader.XMLLoader('property') as (doc:chararray);
--dump the result
dump A;

Then you will get the output

(<property><name>foobar</name><value>barfoo</value></property>)
(<property><name>justname</name></property>)

where each () indicates one record.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
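The record-splitting behavior the proposal describes (cut the input at a given tag, emit each tagged span as one record) can be illustrated with a short Python sketch. This mimics the documented output on an in-memory string; the actual piggybank XMLLoader operates on input splits and is not this code:

```python
import re

def split_by_tag(text, tag):
    # Return every <tag>...</tag> span as one record, non-greedy so that
    # consecutive records are not merged into a single match.
    pattern = r"<%s>.*?</%s>" % (re.escape(tag), re.escape(tag))
    return re.findall(pattern, text, re.DOTALL)

doc = ("<configuration><property><name>foobar</name><value>barfoo</value></property>"
       "<ignoreProperty><name>foo</name></ignoreProperty>"
       "<property><name>justname</name></property></configuration>")

records = split_by_tag(doc, "property")
# Two records, matching the output shown in the proposal; <ignoreProperty>
# is skipped because its tag name does not match exactly.
```

Note that tag matching is exact and case-sensitive here, which reproduces why the `ignoreProperty` element is not emitted as a record in the example.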
[jira] Commented: (PIG-1293) pig wrapper script tends to fail if pig is in the path and PIG_HOME isn't set
[ https://issues.apache.org/jira/browse/PIG-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846236#action_12846236 ]

Allen Wittenauer commented on PIG-1293:
---------------------------------------

You likely have PIG_HOME configured. Unset it, then try running bash -x pig and the message about being unable to find pig-env.sh won't be hidden by bash. BTW, the hadoop equiv jira is HADOOP-6630, as it suffers from the same problem.

pig wrapper script tends to fail if pig is in the path and PIG_HOME isn't set
                Key: PIG-1293
                URL: https://issues.apache.org/jira/browse/PIG-1293
            Project: Pig
         Issue Type: Bug
   Affects Versions: 0.6.0
           Reporter: Allen Wittenauer
        Attachments: PIG-1293.txt

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Branching for Pig 0.7.0
Hi, If you have an issue assigned to you for Pig 0.7.0 release, please, make sure that it can be committed by the end of the week since we are aiming to branch for the release by next Monday, 3/22. If you don't think the issue can be addressed by then but feel strongly that it needs to be in Pig 0.7.0, please, respond with your reasoning. Otherwise, please, unlink from the release any issues that will not meet the deadline. Thanks, Olga
[jira] Commented: (PIG-1293) pig wrapper script tends to fail if pig is in the path and PIG_HOME isn't set
[ https://issues.apache.org/jira/browse/PIG-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846243#action_12846243 ]

Allen Wittenauer commented on PIG-1293:
---------------------------------------

Err, not PIG_HOME, PIG_CONF_DIR.

pig wrapper script tends to fail if pig is in the path and PIG_HOME isn't set
                Key: PIG-1293
                URL: https://issues.apache.org/jira/browse/PIG-1293
            Project: Pig
         Issue Type: Bug
   Affects Versions: 0.6.0
           Reporter: Allen Wittenauer
        Attachments: PIG-1293.txt

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846273#action_12846273 ]

Dmitriy V. Ryaboy commented on PIG-1205:
----------------------------------------

You can suppress the unchecked warning with @SuppressWarnings("unchecked"), and comment why it's ok to suppress the warning.

I've been playing with using HBase through pig using the 0.6 loader, and I must say, it's very far from being ready for prime time. I don't know whether we need to exert too much effort to get this in under the wire when it won't really be usable anyway until much further love is applied.

-D

Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
                Key: PIG-1205
                URL: https://issues.apache.org/jira/browse/PIG-1205
            Project: Pig
         Issue Type: Sub-task
   Affects Versions: 0.7.0
           Reporter: Jeff Zhang
           Assignee: Jeff Zhang
            Fix For: 0.7.0
        Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1287) Use hadoop-0.20.2 with pig 0.7.0 release
[ https://issues.apache.org/jira/browse/PIG-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846287#action_12846287 ]

Hadoop QA commented on PIG-1287:
--------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12438984/PIG-1287-2.patch
against trunk revision 924034.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 5 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 findbugs. The patch does not introduce any new Findbugs warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    -1 core tests. The patch failed core unit tests.
    +1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/240/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/240/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/240/console

This message is automatically generated.

Use hadoop-0.20.2 with pig 0.7.0 release
                Key: PIG-1287
                URL: https://issues.apache.org/jira/browse/PIG-1287
            Project: Pig
         Issue Type: Task
   Affects Versions: 0.7.0
           Reporter: Pradeep Kamath
           Assignee: Pradeep Kamath
            Fix For: 0.7.0
        Attachments: hadoop20.jar, PIG-1287-2.patch, PIG-1287.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.