[jira] Commented: (PIG-1231) Default DataBagIterator.hasNext() should be idempotent in all cases
[ https://issues.apache.org/jira/browse/PIG-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831400#action_12831400 ] Hadoop QA commented on PIG-1231:

-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12435230/PIG-1231-1.patch against trunk revision 907760.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 6 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/206/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/206/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/206/console

This message is automatically generated.

Default DataBagIterator.hasNext() should be idempotent in all cases
---
Key: PIG-1231
URL: https://issues.apache.org/jira/browse/PIG-1231
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.6.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: 0.6.0
Attachments: PIG-1231-1.patch

DefaultDataBagIterator.hasNext() is not repeatable when both of the conditions below are met:
1. There are no more tuples in the last spill file.
2. There are no tuples in memory (all contents have been spilled to files).
This is not acceptable, because the name hasNext() implies that it is idempotent.
In BagFormat, we misuse DataBagIterator.hasNext() on the assumption that hasNext() is always idempotent, which leads to some mysterious errors. Condition 2 seems very restrictive, but when the databag is really big and memory can hold only a couple of tuples, the chance of hitting condition 2 is high enough. Here is one error we saw:

Caused by: java.io.IOException: Stream closed
    at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:145)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:189)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    at java.io.DataInputStream.readByte(DataInputStream.java:248)
    at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:278)
    at org.apache.pig.data.DefaultDataBag$DefaultDataBagIterator.readFromFile(DefaultDataBag.java:237)
    ... 20 more

This happens because we call hasNext(), which reaches EOF, and we close the file. Then we call hasNext() again, assuming that it is idempotent. However, the stream is closed, so we get this error.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
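The fix the issue calls for can be sketched as a buffering iterator: read ahead one element in hasNext(), and remember EOF so repeated calls never touch a closed stream. This is a minimal sketch with hypothetical names (BufferingIterator is not the actual DefaultDataBagIterator code):

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Sketch of an idempotent hasNext(): buffer one element ahead and latch EOF,
// so calling hasNext() repeatedly after exhaustion is always safe.
// (Hypothetical helper, not the actual Pig DefaultDataBag implementation.)
class BufferingIterator<T> implements Iterator<T> {
    private final Iterator<T> source;
    private T buffered;          // element read ahead, if any
    private boolean hasBuffered; // true when 'buffered' holds a value
    private boolean eof;         // true once the source is exhausted

    BufferingIterator(Iterator<T> source) {
        this.source = source;
    }

    @Override
    public boolean hasNext() {
        if (hasBuffered) return true;   // repeated calls hit the buffer
        if (eof) return false;          // repeated calls after EOF are safe
        if (source.hasNext()) {
            buffered = source.next();
            hasBuffered = true;
            return true;
        }
        eof = true; // close underlying streams here, exactly once
        return false;
    }

    @Override
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        hasBuffered = false;
        return buffered;
    }
}
```

The key property is that the state transition (reading the file, closing the stream) happens at most once, while hasNext() itself only inspects the latched state.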
[jira] Commented: (PIG-834) incorrect plan when algebraic functions are nested
[ https://issues.apache.org/jira/browse/PIG-834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831421#action_12831421 ] Hadoop QA commented on PIG-834:

-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12435027/pig-834_2.patch against trunk revision 907760.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/195/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/195/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/195/console

This message is automatically generated.

incorrect plan when algebraic functions are nested
---
Key: PIG-834
URL: https://issues.apache.org/jira/browse/PIG-834
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Thejas M Nair
Assignee: Ashutosh Chauhan
Fix For: 0.7.0
Attachments: pig-834.patch, pig-834_2.patch

{code}
a = load 'students.txt' as (c1,c2,c3,c4);
c = group a by c2;
f = foreach c generate COUNT(org.apache.pig.builtin.Distinct($1.$2));
{code}
Notice that the Distinct UDF is missing in the Combiner and reduce stages. As a result, distinct does not function and incorrect results are produced. Distinct should have been evaluated in all three stages, and the output of Distinct should have been given to COUNT in the reduce stage.
{code}
# Map Reduce Plan
#-- MapReduce node 1-122
Map Plan
Local Rearrange[tuple]{bytearray}(false) - 1-139
|   |
|   Project[bytearray][1] - 1-140
|
|---New For Each(false,false)[bag] - 1-127
    |   |
    |   POUserFunc(org.apache.pig.builtin.COUNT$Initial)[tuple] - 1-125
    |   |
    |   |---POUserFunc(org.apache.pig.builtin.Distinct)[bag] - 1-126
    |       |
    |       |---Project[bag][2] - 1-123
    |           |
    |           |---Project[bag][1] - 1-124
    |   |
    |   Project[bytearray][0] - 1-133
    |
    |---Pre Combiner Local Rearrange[tuple]{Unknown} - 1-141
        |
        |---Load(hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/tejas/students.txt:org.apache.pig.builtin.PigStorage) - 1-111

Combine Plan
Local Rearrange[tuple]{bytearray}(false) - 1-143
|   |
|   Project[bytearray][1] - 1-144
|
|---New For Each(false,false)[bag] - 1-132
    |   |
    |   POUserFunc(org.apache.pig.builtin.COUNT$Intermediate)[tuple] - 1-130
    |   |
    |   |---Project[bag][0] - 1-135
    |   |
    |   Project[bytearray][1] - 1-134
    |
    |---POCombinerPackage[tuple]{bytearray} - 1-137

Reduce Plan
Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-121
|
|---New For Each(false)[bag] - 1-120
    |   |
    |   POUserFunc(org.apache.pig.builtin.COUNT$Final)[long] - 1-119
    |   |
    |   |---Project[bag][0] - 1-136
    |
    |---POCombinerPackage[tuple]{bytearray} - 1-145

Global sort: false
{code}
[jira] Commented: (PIG-259) allow store to overwrite existing directory
[ https://issues.apache.org/jira/browse/PIG-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831481#action_12831481 ] Jeff Zhang commented on PIG-259:

Response to Alan: I agree that it makes more sense to do the overwrite in StoreFunc, and I notice that there's a JIRA, PIG-1216, which is related to this.

allow store to overwrite existing directory
---
Key: PIG-259
URL: https://issues.apache.org/jira/browse/PIG-259
Project: Pig
Issue Type: Sub-task
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Jeff Zhang
Fix For: 0.8.0
Attachments: Pig_259.patch, Pig_259_2.patch

We have users who are asking for a flag to overwrite an existing directory.
[jira] Commented: (PIG-259) allow store to overwrite existing directory
[ https://issues.apache.org/jira/browse/PIG-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831484#action_12831484 ] Jeff Zhang commented on PIG-259:

Response to Dmitriy: Thanks for your suggestion of implementing overwrite at the StoreFunc level rather than at the language level. I can bug in this. And I think another advantage of putting it in StoreFunc is that it is more flexible than putting it in the language: we have more control over StoreFunc than over Pig Latin.
[jira] Commented: (PIG-259) allow store to overwrite existing directory
[ https://issues.apache.org/jira/browse/PIG-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831487#action_12831487 ] Jeff Zhang commented on PIG-259:

Sorry, I meant: I can buy in to your suggestion.
[jira] Commented: (PIG-834) incorrect plan when algebraic functions are nested
[ https://issues.apache.org/jira/browse/PIG-834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831535#action_12831535 ] Ashutosh Chauhan commented on PIG-834:

Another Hudson quirk :( The failed test passes successfully on a local machine. The patch is ready for review.
[jira] Commented: (PIG-1231) Default DataBagIterator.hasNext() should be idempotent in all cases
[ https://issues.apache.org/jira/browse/PIG-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831558#action_12831558 ] Daniel Dai commented on PIG-1231:

testCompressed1 fails with java.lang.IllegalArgumentException: port out of range:-1. Not a real problem; the manual test passes.
[jira] Commented: (PIG-1231) Default DataBagIterator.hasNext() should be idempotent in all cases
[ https://issues.apache.org/jira/browse/PIG-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831572#action_12831572 ] Alan Gates commented on PIG-1231:

+1. Changes look good.
[jira] Commented: (PIG-1224) Collected group should change to use new (internal) bag
[ https://issues.apache.org/jira/browse/PIG-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831575#action_12831575 ] Olga Natkovich commented on PIG-1224:

This patch is already covered by existing tests. It only changes the internals of the implementation.

Collected group should change to use new (internal) bag
---
Key: PIG-1224
URL: https://issues.apache.org/jira/browse/PIG-1224
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Ashutosh Chauhan
Fix For: 0.7.0
Attachments: pig-1224.patch
[jira] Commented: (PIG-1209) Port POJoinPackage to proactively spill
[ https://issues.apache.org/jira/browse/PIG-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831578#action_12831578 ] Olga Natkovich commented on PIG-1209:

The current unit tests adequately cover this internal change. Additionally, Ashutosh ran several e2e tests and verified that this change fixed a user problem: the user's script no longer ran out of memory.

Port POJoinPackage to proactively spill
---
Key: PIG-1209
URL: https://issues.apache.org/jira/browse/PIG-1209
Project: Pig
Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Ashutosh Chauhan
Fix For: 0.7.0
Attachments: pig-1209.patch

POPackage proactively spills the bag, whereas POJoinPackage still uses the SpillableMemoryManager. We should port this to use InternalCachedBag, which proactively spills.
[jira] Commented: (PIG-1230) Streaming input in POJoinPackage should use nonspillable bag to collect tuples
[ https://issues.apache.org/jira/browse/PIG-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831585#action_12831585 ] Ashutosh Chauhan commented on PIG-1230:

This patch switches POJoinPackage to use NonSpillableDataBag for the last bag instead of the currently used InternalCachedBag. Both of these bag implementations are already covered by existing unit tests, so this patch needs no new tests.

Streaming input in POJoinPackage should use nonspillable bag to collect tuples
---
Key: PIG-1230
URL: https://issues.apache.org/jira/browse/PIG-1230
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.6.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Fix For: 0.7.0
Attachments: pig-1230.patch, pig-1230_1.patch

The last table of a join statement is streamed through instead of having all its tuples collected in a bag. As a further optimization, tuples of that relation are collected in chunks in a bag. Since we don't want to spill the tuples from this bag, NonSpillableDataBag should be used to hold the tuples for this relation. Initially DefaultDataBag was used, which was later changed to InternalCachedBag as part of PIG-1209.
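The chunked collection described above can be sketched like this. ChunkedCollector and its method names are hypothetical, for illustration only; the real code accumulates tuples of the streamed relation in a NonSpillableDataBag inside POJoinPackage:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of collecting a streamed relation's tuples in fixed-size chunks,
// as the issue describes: each chunk lives purely in memory and is handed
// downstream before the next chunk is started, so nothing is ever spilled.
// (Hypothetical helper, not the actual Pig POJoinPackage code.)
class ChunkedCollector {
    static List<List<Object>> collectInChunks(List<Object> stream, int chunkSize) {
        List<List<Object>> chunks = new ArrayList<>();
        List<Object> current = new ArrayList<>(chunkSize);
        for (Object tuple : stream) {
            current.add(tuple);
            if (current.size() == chunkSize) { // chunk full: hand it off
                chunks.add(current);
                current = new ArrayList<>(chunkSize);
            }
        }
        if (!current.isEmpty()) chunks.add(current); // trailing partial chunk
        return chunks;
    }
}
```

Because each chunk is bounded and short-lived, a spillable bag's bookkeeping (registration with the memory manager, spill files) is pure overhead here, which is the rationale for the switch to a non-spillable bag.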
[jira] Created: (PIG-1232) [zebra] Column Group schema file versioning
[zebra] Column Group schema file versioning
---
Key: PIG-1232
URL: https://issues.apache.org/jira/browse/PIG-1232
Project: Pig
Issue Type: Improvement
Affects Versions: 0.6.0
Reporter: Yan Zhou
Priority: Minor

The missing versioning in the column group schema file makes it difficult to evolve the index. For instance, prior to the fix of PIG-1201, the index was empty for unsorted tables. However, the index is useful even for unsorted tables, to save a listStatus call to the name node, which has been found expensive for directories with many disk entries inside them. As part of that fix, an index is built. Without versioning, but with the demand to support backward compatibility, another non-classical approach has to be figured out to build the index when necessary. As part of this fix, we may want to address that issue as well.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1230) Streaming input in POJoinPackage should use nonspillable bag to collect tuples
[ https://issues.apache.org/jira/browse/PIG-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831591#action_12831591 ] Olga Natkovich commented on PIG-1230:

The patch looks good. One comment: when iterating through the bags, we should say numInputs - 1 rather than lastBagIndex (which happens to have the right value) to make the code more readable and the intent more clear. After the change is made, the patch can be committed.
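The readability point in the review can be illustrated with a tiny sketch (all names here are illustrative, not the actual POJoinPackage fields): writing numInputs - 1 states directly that the last input is excluded, instead of routing that fact through a separate variable that merely happens to hold the same value.

```java
// Illustrative sketch of the suggested style: iterate over every input
// except the last (streamed) one, and say so in the loop bound itself.
// JoinSketch is a hypothetical class, not real Pig code.
class JoinSketch {
    static int countMaterializedBags(int numInputs) {
        int materialized = 0;
        // 'numInputs - 1' makes the "all but the streamed last input" intent
        // explicit, rather than comparing against a lastBagIndex variable.
        for (int i = 0; i < numInputs - 1; i++) {
            materialized++; // collect input i into a bag (elided)
        }
        return materialized;
    }
}
```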
[jira] Updated: (PIG-1231) Default DataBagIterator.hasNext() should be idempotent in all cases
[ https://issues.apache.org/jira/browse/PIG-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1231:

Resolution: Fixed
Hadoop Flags: [Reviewed]
Status: Resolved (was: Patch Available)

Patch committed to both trunk and the 0.6 branch.
[jira] Commented: (PIG-259) allow store to overwrite existing directory
[ https://issues.apache.org/jira/browse/PIG-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831628#action_12831628 ] Olga Natkovich commented on PIG-259:

+1 on passing the information in the constructor. Since we need the store function to do the validation, we don't have control over the semantics, and it is better not to have constructs in the language whose semantics are not well defined. One thing we need to provide to the store function writer is guidance on when the information they get in the constructor can be acted on.
merging load-store-redesign branch back into trunk
Pig Developers,

As most of you know, we have spent the last couple of months mostly working on the LSR branch. We believe that in about a week the code in the branch will be stable enough to merge back into the trunk. If you are using trunk or making any modifications to it, you will be impacted. Please see the following documents for details:

http://wiki.apache.org/pig/Pig070IncompatibleChanges
http://wiki.apache.org/pig/LoadStoreRedesignProposal

Please let us know if you have any questions or concerns.

Thanks,
Olga
[jira] Commented: (PIG-1188) Padding nulls to the input tuple according to input schema
[ https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831749#action_12831749 ] Ashutosh Chauhan commented on PIG-1188:

I have a different take on this. Referring to the original description of this JIRA, I would expect Pig's behavior to be the one given in "Current result," not the one given in "Desired result." Pig should not try to do anything behind the scenes with the data, which "Desired result" is proposing to do.

In cases where columns are not consistent, there are two scenarios: with or without a schema. If the user did supply a schema, then I would consider that the user is telling Pig the data is consistent with the schema he is providing; if that's not the case, it's perfectly fine to throw an exception at runtime.

The tricky case is when a schema is not provided and the user tries to access a non-existent field. I think even in such cases it's valid to throw an exception at runtime instead of returning null. First, if the user is trying to access a non-existent field, that's an error condition in any case. Second, it can't be assumed that the user wants those non-existent fields to be treated as null; if he wants it that way, he should implement a LoadFunc which treats them that way. Third, doing further operations on these columns down the pipeline may result in non-predictable results in other operators. Fourth, returning null will obscure bugs in Pig where Pig (instead of the user himself) tries to access non-existent fields to construct new tuples at runtime, e.g. to do joins (see PIG-1131).

In short, I am suggesting that Pig should continue to have the behavior it has today. That is, it can load a variable number of columns in a tuple; but if the user accesses a non-existent field, throw the exception and let the user deal with such scenarios himself by implementing his own LoadFunc. Thoughts?
Padding nulls to the input tuple according to input schema
---
Key: PIG-1188
URL: https://issues.apache.org/jira/browse/PIG-1188
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.6.0
Reporter: Daniel Dai
Assignee: Richard Ding
Fix For: 0.7.0

Currently, the number of fields in the input tuple is determined by the data. When we have a schema, we should generate the input data according to the schema, padding nulls if necessary. Here is one example:

Pig script:
{code}
a = load '1.txt' as (a0, a1);
dump a;
{code}
Input file:
{code}
1 2
1 2 3
1
{code}
Current result:
{code}
(1,2)
(1,2,3)
(1)
{code}
Desired result:
{code}
(1,2)
(1,2)
(1, null)
{code}
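The "Desired result" above amounts to normalizing each input row to the declared schema width: pad missing trailing fields with null, and drop extra fields. A minimal sketch (SchemaPad and padToSchema are hypothetical names, not Pig's actual loader code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "Desired result" behavior: normalize a row to the declared
// schema width by padding missing trailing fields with null and dropping
// extras. Hypothetical helper, not the actual Pig implementation.
class SchemaPad {
    static <T> List<T> padToSchema(List<T> row, int schemaWidth) {
        int keep = Math.min(row.size(), schemaWidth);
        List<T> out = new ArrayList<>(row.subList(0, keep)); // drop extras
        while (out.size() < schemaWidth) {
            out.add(null); // pad missing trailing fields with null
        }
        return out;
    }
}
```

With schema (a0, a1), i.e. width 2, rows [1,2], [1,2,3], and [1] become [1,2], [1,2], and [1,null], matching the desired output in the description.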
[jira] Resolved: (PIG-1090) Update sources to reflect recent changes in load-store interfaces
[ https://issues.apache.org/jira/browse/PIG-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath resolved PIG-1090.

Resolution: Fixed

+1 for PIG-1090-22.patch; patch committed. Closing this JIRA as resolved, since all changes to accommodate the new load-store interfaces have now been checked in.

Update sources to reflect recent changes in load-store interfaces
---
Key: PIG-1090
URL: https://issues.apache.org/jira/browse/PIG-1090
Project: Pig
Issue Type: Sub-task
Affects Versions: 0.7.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
Fix For: 0.7.0
Attachments: PIG-1090-10.patch, PIG-1090-11.patch, PIG-1090-12.patch, PIG-1090-13.patch, PIG-1090-14.patch, PIG-1090-15.patch, PIG-1090-16.patch, PIG-1090-17.patch, PIG-1090-18.patch, PIG-1090-19.patch, PIG-1090-2.patch, PIG-1090-20.patch, PIG-1090-21.patch, PIG-1090-22.patch, PIG-1090-3.patch, PIG-1090-4.patch, PIG-1090-6.patch, PIG-1090-7.patch, PIG-1090-8.patch, PIG-1090-9.patch, PIG-1090.patch, PIG-1190-5.patch

There have been some changes (as recorded in the Changes section, Nov 2 2009 subsection, of http://wiki.apache.org/pig/LoadStoreRedesignProposal) in the load/store interfaces. This JIRA is to track the task of making those changes under src; changes under test will be addressed in a different JIRA.
[jira] Reopened: (PIG-1131) Pig simple join does not work when it contains empty lines
[ https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan reopened PIG-1131:

Reopening, since this is related to PIG-1188 but not a duplicate of it.

Pig simple join does not work when it contains empty lines
---
Key: PIG-1131
URL: https://issues.apache.org/jira/browse/PIG-1131
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.7.0
Reporter: Viraj Bhat
Assignee: Ashutosh Chauhan
Fix For: 0.7.0
Attachments: junk1.txt, junk2.txt, simplejoinscript.pig

I have a simple script which does a JOIN:
{code}
input1 = load '/user/viraj/junk1.txt' using PigStorage(' ');
describe input1;
input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001');
describe input2;
joineddata = JOIN input1 by $0, input2 by $0;
describe joineddata;
store joineddata into 'result';
{code}
The input data contains empty lines. The join fails in the Map phase with the following error in POLocalRearrange.java:

java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.Child.main(Child.java:159)

I am surprised that the test cases did not detect this error. Could we add this data, which contains empty lines, to the test cases?

Viraj

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
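The failure mode in the stack trace above can be illustrated minimally: an empty input line yields a tuple with fewer fields than the join key expects, so an unguarded positional get throws IndexOutOfBoundsException, while a bounds-checked accessor could map the missing field to null instead. GuardedGet is a hypothetical helper for illustration, not the actual DefaultTuple or POLocalRearrange code:

```java
import java.util.List;

// Minimal illustration of the PIG-1131 failure mode: an empty input line
// produces an empty (or short) tuple, and an unguarded tuple.get(i) then
// throws IndexOutOfBoundsException in the map phase. A bounds-checked
// accessor returns null for missing fields instead.
// (Hypothetical helper, not real Pig code.)
class GuardedGet {
    static Object getOrNull(List<?> tuple, int field) {
        return field >= 0 && field < tuple.size() ? tuple.get(field) : null;
    }
}
```

Whether a missing field should become null or an error is exactly the question debated in PIG-1188 below; this sketch only shows the null-returning option.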
[jira] Commented: (PIG-1188) Padding nulls to the input tuple according to input schema
[ https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831776#action_12831776 ] Alan Gates commented on PIG-1188: - A few thoughts: In a job that is going to process a billion rows and run for 3 hours, one bad row should not cause the whole job to fail. This invalid access should certainly cause a warning. Users can look at the warnings at the end of the query and decide they do not want to keep the output because of the warnings. But failure should not be the default case (see the previous point). Perhaps we should have a warnings = errors option, as compilers do, so users who are very worried about the warnings can make sure they fail. But that's a different proposal for a different JIRA. bq. Third, doing further operations on these columns down the pipeline may result in non-predictable results in other operators. I don't follow. Nulls in the pipeline shouldn't cause a problem. UDFs and operators need to be able to handle null values whether they come from processing or from the data itself. bq. Second, it can't be assumed that user wants those non-existent field to be treated as null. If he wants it that way, he should implement LoadFunc interface which treats them that way. One could argue that it can't be assumed the user wants his query to fail when a field is missing. We have to assume one way or the other. Null is a better assumption than failure, since a user who doesn't want that behavior can detect it and deal with it. As it is now, the user has to modify his data or write a new load function to pad his data. I agree with you that in the schema case, it would be ideal if not having a field were an error. However, given the architecture this is difficult. And stipulating that load functions test every record to ensure it matches the schema is too much of a performance penalty. But for the non-schema case I don't agree.
Pig's philosophy of 'Pigs eat anything' doesn't mean much if Pig gags as soon as it gets a record that doesn't match its expectations. Padding nulls to the input tuple according to input schema -- Key: PIG-1188 URL: https://issues.apache.org/jira/browse/PIG-1188 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Daniel Dai Assignee: Richard Ding Fix For: 0.7.0 Currently, the number of fields in the input tuple is determined by the data. When we have a schema, we should generate the input tuple according to the schema, padding with nulls if necessary. Here is one example: Pig script: {code} a = load '1.txt' as (a0, a1); dump a; {code} Input file: {code} 1 2 1 2 3 1 {code} Current result: {code} (1,2) (1,2,3) (1) {code} Desired result: {code} (1,2) (1,2) (1,null) {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
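The padding behavior described in PIG-1188 can be sketched as a small stand-alone helper. This is an illustration only, not Pig's actual implementation; the class and method names are hypothetical. Following the desired result in the example, a tuple shorter than the schema is padded with nulls, and extra trailing fields are dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SchemaPadding {
    // Adjust a tuple to the declared schema size: drop extra trailing fields
    // and pad missing ones with null, mirroring the desired result above.
    public static List<Object> padToSchema(List<Object> tuple, int schemaSize) {
        List<Object> out = new ArrayList<>(tuple.subList(0, Math.min(tuple.size(), schemaSize)));
        while (out.size() < schemaSize) {
            out.add(null); // missing field becomes null instead of failing the job
        }
        return out;
    }

    public static void main(String[] args) {
        // schema (a0, a1) has two fields
        System.out.println(padToSchema(Arrays.<Object>asList("1", "2", "3"), 2)); // [1, 2]
        System.out.println(padToSchema(Arrays.<Object>asList("1"), 2));           // [1, null]
    }
}
```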
[jira] Updated: (PIG-1207) [zebra] Data sanity check should be performed at the end of writing instead of later at query time
[ https://issues.apache.org/jira/browse/PIG-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1207: -- Attachment: PIG-1207.patch [zebra] Data sanity check should be performed at the end of writing instead of later at query time --- Key: PIG-1207 URL: https://issues.apache.org/jira/browse/PIG-1207 Project: Pig Issue Type: Improvement Reporter: Yan Zhou Assignee: Yan Zhou Attachments: PIG-1207.patch Currently the equality check on the number of rows across different column groups is performed at query time, and the error info is sketchy: it either emits only "Column groups are not evenly distributed" or, worse, throws an IndexOutOfBoundsException from CGScanner.getCGValue. This happens because BasicTable.atEnd and BasicTable.getKey, which are called just before BasicTable.getValue, check only the first column group in the projection; given a discrepancy in the number of rows per file across multiple column groups in the projection, BasicTable.atEnd can return false and BasicTable.getKey can return a key normally while another column group has already exhausted its current file, so the call to its CGScanner.getCGValue throws the exception. This check should also be performed at the end of writing, and the error info should be more informative. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
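The write-time check proposed above can be illustrated with a minimal, self-contained sketch. This is not Zebra's actual code; the class, method, and message text are invented. The point is simply to verify that every column group wrote the same number of rows and to fail with an informative message at write time, instead of a late IndexOutOfBoundsException at query time.

```java
public class RowCountCheck {
    // Verify that all column groups report the same row count; throw an
    // informative error naming the offending group and the mismatch.
    public static void checkRowCounts(long[] rowsPerColumnGroup) {
        long expected = rowsPerColumnGroup[0];
        for (int i = 1; i < rowsPerColumnGroup.length; i++) {
            if (rowsPerColumnGroup[i] != expected) {
                throw new IllegalStateException(
                    "Column group " + i + " wrote " + rowsPerColumnGroup[i]
                    + " rows, expected " + expected);
            }
        }
    }

    public static void main(String[] args) {
        checkRowCounts(new long[] {100, 100, 100}); // consistent: passes silently
        try {
            checkRowCounts(new long[] {100, 99}); // inconsistent: fails at write time
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```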
[jira] Assigned: (PIG-1140) [zebra] Use of Hadoop 2.0 APIs
[ https://issues.apache.org/jira/browse/PIG-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Tang reassigned PIG-1140: - Assignee: Xuefu Zhang [zebra] Use of Hadoop 2.0 APIs Key: PIG-1140 URL: https://issues.apache.org/jira/browse/PIG-1140 Project: Pig Issue Type: Improvement Affects Versions: 0.6.0 Reporter: Yan Zhou Assignee: Xuefu Zhang Fix For: 0.7.0 Attachments: zebra.0209 Currently, Zebra is still using the already-deprecated Hadoop 1.8 APIs; it needs to be upgraded to the 2.0 APIs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1137) [zebra] get* methods of Zebra Map/Reduce APIs need improvements
[ https://issues.apache.org/jira/browse/PIG-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Tang reassigned PIG-1137: - Assignee: Yan Zhou [zebra] get* methods of Zebra Map/Reduce APIs need improvements --- Key: PIG-1137 URL: https://issues.apache.org/jira/browse/PIG-1137 Project: Pig Issue Type: Improvement Affects Versions: 0.6.0 Reporter: Yan Zhou Assignee: Yan Zhou Fix For: 0.7.0 Currently the set* methods take external Zebra objects, namely objects of ZebraStorageHint, ZebraSchema, ZebraSortInfo or ZebraProjection. Correspondingly, the get* methods should return such objects instead of String or Zebra internal objects like Schema. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1139) [zebra] Encapsulation of check of ZebraSortInfo by a Zebra reader; the check by a writer could be better encapsulated
[ https://issues.apache.org/jira/browse/PIG-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Tang updated PIG-1139: -- Fix Version/s: (was: 0.7.0) 0.8.0 [zebra] Encapsulation of check of ZebraSortInfo by a Zebra reader; the check by a writer could be better encapsulated - Key: PIG-1139 URL: https://issues.apache.org/jira/browse/PIG-1139 Project: Pig Issue Type: Improvement Affects Versions: 0.6.0 Reporter: Yan Zhou Priority: Minor Fix For: 0.8.0 Currently the user's ZebraSortInfo passed to the Map/Reduce writer, namely BasicTableOutputFormat.setStorageInfo, is sanity-checked by SortInfo.parse(), although the check could be performed entirely in the method that takes the ZebraSortInfo object. The sanity check on the reader side, however, is left entirely to the caller of the TableInputFormat.requireSortedTable method; it should be encapsulated in a new SortInfo method. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1131) Pig simple join does not work when it contains empty lines
[ https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1131: -- Attachment: pig-1131.patch In POLocalRearrange, the number of tuple elements not present in the key (and thus put in the value) is computed for the first tuple and then cached as an optimization. This patch removes that caching because of the problem illustrated in the bug. A test case that reproduces the bug is included. Pig simple join does not work when it contains empty lines -- Key: PIG-1131 URL: https://issues.apache.org/jira/browse/PIG-1131 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Viraj Bhat Assignee: Ashutosh Chauhan Fix For: 0.7.0 Attachments: junk1.txt, junk2.txt, pig-1131.patch, simplejoinscript.pig -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
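Why caching the field count breaks on ragged input can be shown with a small stand-alone sketch. This is not the real POLocalRearrange code; it only mimics the caching pattern the patch removes: the number of non-key fields is computed from the first tuple and reused, so a shorter tuple from an empty input line triggers an IndexOutOfBoundsException, as in the stack trace.

```java
import java.util.Arrays;
import java.util.List;

public class CachedArityBug {
    private Integer cachedValueFields = null; // computed once, then reused

    // Return the non-key fields of a tuple, using a count cached from the
    // first tuple seen (the optimization the patch removes).
    public List<Object> valueFields(List<Object> tuple) {
        if (cachedValueFields == null) {
            cachedValueFields = tuple.size() - 1; // fields after the key
        }
        // Fails for a later, shorter tuple: the cached count no longer fits.
        return tuple.subList(1, 1 + cachedValueFields);
    }

    public static void main(String[] args) {
        CachedArityBug op = new CachedArityBug();
        op.valueFields(Arrays.<Object>asList("k", "v1", "v2")); // caches 2 value fields
        try {
            op.valueFields(Arrays.<Object>asList("k")); // tuple from an empty line
        } catch (IndexOutOfBoundsException e) {
            System.out.println("IndexOutOfBoundsException, as in the bug report");
        }
    }
}
```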
[jira] Updated: (PIG-1131) Pig simple join does not work when it contains empty lines
[ https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1131: -- Status: Patch Available (was: Reopened) Pig simple join does not work when it contains empty lines -- Key: PIG-1131 URL: https://issues.apache.org/jira/browse/PIG-1131 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Viraj Bhat Assignee: Ashutosh Chauhan Fix For: 0.7.0 Attachments: junk1.txt, junk2.txt, pig-1131.patch, simplejoinscript.pig -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1215) Make Hadoop jobId more prominent in the client log
[ https://issues.apache.org/jira/browse/PIG-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831812#action_12831812 ] Olga Natkovich commented on PIG-1215: - I would like to request an additional change to make sure that we can write the Hadoop jobId information to the client-side log file, not just stdout. This would happen only if a special property is used. So the additional ask is to implement handling of this new property and, when it is present, to make sure that all messages at the INFO level are written to the log file. This can be accomplished by changing the log listener for the log file so that it picks up INFO-level log events. We don't want to do this by default because it would drastically increase the number of log files created by Pig, since currently we only create the file when there is a real problem executing the query. Make Hadoop jobId more prominent in the client log -- Key: PIG-1215 URL: https://issues.apache.org/jira/browse/PIG-1215 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Ashutosh Chauhan Fix For: 0.7.0 Attachments: pig-1215.patch This is a request from applications that want to be able to programmatically parse client logs to find Hadoop job ids. They would like to see each job id on a separate line in the following format: hadoopJobId: job_123456789 They would also like to see the jobs in the order they are executed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
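With each id on its own "hadoopJobId: job_..." line, the programmatic parsing this request enables becomes a one-regex job. A hypothetical sketch of such a client-side tool (the class name and log text are invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JobIdParser {
    // One job id per line, in the requested format: hadoopJobId: job_123456789
    private static final Pattern JOB_ID =
        Pattern.compile("^hadoopJobId: (job_\\S+)$", Pattern.MULTILINE);

    // Collect job ids in the order they appear, i.e. execution order.
    public static List<String> extractJobIds(String log) {
        List<String> ids = new ArrayList<>();
        Matcher m = JOB_ID.matcher(log);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        String log = "INFO launching jobs...\n"
                   + "hadoopJobId: job_123456789\n"
                   + "hadoopJobId: job_123456790\n";
        System.out.println(extractJobIds(log)); // [job_123456789, job_123456790]
    }
}
```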
[jira] Commented: (PIG-1215) Make Hadoop jobId more prominent in the client log
[ https://issues.apache.org/jira/browse/PIG-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831817#action_12831817 ] Olga Natkovich commented on PIG-1215: - Can we also make the value NOT_AVAILABLE rather than NOT AVAILABLE, to make it easier for tools to parse? Make Hadoop jobId more prominent in the client log -- Key: PIG-1215 URL: https://issues.apache.org/jira/browse/PIG-1215 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Ashutosh Chauhan Fix For: 0.7.0 Attachments: pig-1215.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1131) Pig simple join does not work when it contains empty lines
[ https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831818#action_12831818 ] Hadoop QA commented on PIG-1131: -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12435394/pig-1131.patch against trunk revision 908177. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/196/console This message is automatically generated. Pig simple join does not work when it contains empty lines -- Key: PIG-1131 URL: https://issues.apache.org/jira/browse/PIG-1131 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Viraj Bhat Assignee: Ashutosh Chauhan Fix For: 0.7.0 Attachments: junk1.txt, junk2.txt, pig-1131.patch, simplejoinscript.pig -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1230) Streaming input in POJoinPackage should use nonspillable bag to collect tuples
[ https://issues.apache.org/jira/browse/PIG-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1230: -- Attachment: pig-1230_2.patch As per the comment, changed lastBagIndex to numInputs - 1; no other changes. Streaming input in POJoinPackage should use nonspillable bag to collect tuples -- Key: PIG-1230 URL: https://issues.apache.org/jira/browse/PIG-1230 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Fix For: 0.7.0 Attachments: pig-1230.patch, pig-1230_1.patch, pig-1230_2.patch The last table of a join statement is streamed through instead of having all its tuples collected in a bag. As a further optimization, tuples of that relation are collected in chunks in a bag. Since we don't want to spill the tuples from this bag, NonSpillableBag should be used to hold tuples for this relation. Initially, DefaultDataBag was used, which was later changed to InternalCachedBag as part of PIG-1209. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1230) Streaming input in POJoinPackage should use nonspillable bag to collect tuples
[ https://issues.apache.org/jira/browse/PIG-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1230: -- Resolution: Fixed Status: Resolved (was: Patch Available) Patch checked in. Streaming input in POJoinPackage should use nonspillable bag to collect tuples -- Key: PIG-1230 URL: https://issues.apache.org/jira/browse/PIG-1230 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Fix For: 0.7.0 Attachments: pig-1230.patch, pig-1230_1.patch, pig-1230_2.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1131) Pig simple join does not work when it contains empty lines
[ https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1131: -- Attachment: pig-1131.patch Previous patch was stale. Merged with trunk and regenerated the patch. Pig simple join does not work when it contains empty lines -- Key: PIG-1131 URL: https://issues.apache.org/jira/browse/PIG-1131 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Viraj Bhat Assignee: Ashutosh Chauhan Fix For: 0.7.0 Attachments: junk1.txt, junk2.txt, pig-1131.patch, pig-1131.patch, simplejoinscript.pig -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1131) Pig simple join does not work when it contains empty lines
[ https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1131: -- Status: Open (was: Patch Available) Pig simple join does not work when it contains empty lines -- Key: PIG-1131 URL: https://issues.apache.org/jira/browse/PIG-1131 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Viraj Bhat Assignee: Ashutosh Chauhan Fix For: 0.7.0 Attachments: junk1.txt, junk2.txt, pig-1131.patch, pig-1131.patch, simplejoinscript.pig -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1131) Pig simple join does not work when it contains empty lines
[ https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1131: -- Status: Patch Available (was: Open) Pig simple join does not work when it contains empty lines -- Key: PIG-1131 URL: https://issues.apache.org/jira/browse/PIG-1131 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Viraj Bhat Assignee: Ashutosh Chauhan Fix For: 0.7.0 Attachments: junk1.txt, junk2.txt, pig-1131.patch, pig-1131.patch, simplejoinscript.pig -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831886#action_12831886 ] Ashutosh Chauhan commented on PIG-1178: --- Was wondering about the different optimizations that we do on a compiled MR plan. Not sure if this has already been discussed or is in some doc, but essentially those optimizations are also done through visitors and would benefit greatly from a framework, just as there is one for the front-end. Is there any plan to also subsume those visitors (possibly by rewriting them as rule-transform pairs) in this new optimizer, or will they be dealt with separately later on? LogicalPlan and Optimizer are too complex and hard to work with --- Key: PIG-1178 URL: https://issues.apache.org/jira/browse/PIG-1178 Project: Pig Issue Type: Improvement Reporter: Alan Gates Assignee: Ying He Attachments: expressions-2.patch, expressions.patch, lp.patch, lp.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch The current implementation of the logical plan and the logical optimizer in Pig has proven not to be easily extensible. Developer feedback has indicated that adding new rules to the optimizer is quite burdensome. In addition, the logical plan has been an area of numerous bugs, many of which have been difficult to fix. Developers also feel that the logical plan is difficult to understand and maintain. The root cause for these issues is that a number of design decisions made as part of the 0.2 rewrite of the front end have now proven to be sub-optimal. The heart of this proposal is to revisit a number of those decisions and rebuild the logical plan with a simpler design that will make it much easier to maintain the logical plan as well as extend the logical optimizer. See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1217) [piggybank] evaluation.util.Top is broken
[ https://issues.apache.org/jira/browse/PIG-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1217: --- Status: Open (was: Patch Available) [piggybank] evaluation.util.Top is broken - Key: PIG-1217 URL: https://issues.apache.org/jira/browse/PIG-1217 Project: Pig Issue Type: Bug Affects Versions: 0.3.0, 0.3.1, 0.4.0, site, 0.5.0, 0.6.0, 0.7.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Priority: Minor Fix For: 0.7.0 Attachments: fix_top_udf.diff, fix_top_udf.diff The Top udf has been broken for a while, due to an incorrect implementation of getArgToFuncMapping. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1217) [piggybank] evaluation.util.Top is broken
[ https://issues.apache.org/jira/browse/PIG-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1217: --- Attachment: fix_top_udf.diff Simplified Initial per Alan's comments (just returning the tuple doesn't work, by the way). Also made it a bit safer around nulls. [piggybank] evaluation.util.Top is broken - Key: PIG-1217 URL: https://issues.apache.org/jira/browse/PIG-1217 Project: Pig Issue Type: Bug Affects Versions: 0.3.0, 0.3.1, 0.4.0, site, 0.5.0, 0.6.0, 0.7.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Priority: Minor Fix For: 0.7.0 Attachments: fix_top_udf.diff, fix_top_udf.diff, fix_top_udf.diff The Top UDF has been broken for a while due to an incorrect implementation of getArgToFuncMapping. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1217) [piggybank] evaluation.util.Top is broken
[ https://issues.apache.org/jira/browse/PIG-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1217: --- Status: Patch Available (was: Open) [piggybank] evaluation.util.Top is broken - Key: PIG-1217 URL: https://issues.apache.org/jira/browse/PIG-1217 Project: Pig Issue Type: Bug Affects Versions: 0.3.0, 0.3.1, 0.4.0, site, 0.5.0, 0.6.0, 0.7.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Priority: Minor Fix For: 0.7.0 Attachments: fix_top_udf.diff, fix_top_udf.diff, fix_top_udf.diff The Top udf has been broken for a while, due to an incorrect implementation of getArgToFuncMapping. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
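For context on the bug class: getArgToFuncMapping is the hook a Pig EvalFunc uses to declare which concrete implementation class handles which input schema; if the declared schemas do not match what callers actually pass, Pig never routes to the right class. The self-contained sketch below illustrates that schema-to-implementation dispatch idea only; the map keys and class names are hypothetical, not piggybank's actual Top code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of schema-to-implementation dispatch, the job
// getArgToFuncMapping performs for a Pig EvalFunc. A mapping keyed on the
// wrong schema means lookups miss and the UDF misbehaves -- the failure
// mode this issue describes.
public class SchemaDispatchSketch {
    static final Map<String, String> ARG_TO_FUNC = new HashMap<>();
    static {
        // Correct entry: the schema key matches the real call signature.
        ARG_TO_FUNC.put("(int, int, bag)", "TopForBag");
    }

    // Returns the implementation class for a call schema, or null on a miss.
    static String resolve(String callSchema) {
        return ARG_TO_FUNC.get(callSchema);
    }

    public static void main(String[] args) {
        System.out.println(resolve("(int, int, bag)"));   // hit
        System.out.println(resolve("(int, int, tuple)")); // miss -> null
    }
}
```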
[jira] Created: (PIG-1233) NullPointerException in AVG
NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Fix For: 0.6.0 The overridden getValue() method in AVG throws a NullPointerException when accumulate() is never called, leaving the variable 'intermediateCount' null. Java then throws the exception when it tries to unbox the value for a numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1233) NullPointerException in AVG
[ https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1233: --- Attachment: jira-1233.patch Attached is a simple patch that adds the required null checks. Since the code change is trivial, I don't think any new test cases are needed. NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Fix For: 0.6.0 Attachments: jira-1233.patch The overridden getValue() method in AVG throws a NullPointerException when accumulate() is never called, leaving the variable 'intermediateCount' null. Java then throws the exception when it tries to unbox the value for a numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1233) NullPointerException in AVG
[ https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated PIG-1233: --- Status: Patch Available (was: Open) NullPointerException in AVG Key: PIG-1233 URL: https://issues.apache.org/jira/browse/PIG-1233 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Assignee: Ankur Fix For: 0.6.0 Attachments: jira-1233.patch The overridden getValue() method in AVG throws a NullPointerException when accumulate() is never called, leaving the variable 'intermediateCount' null. Java then throws the exception when it tries to unbox the value for a numeric comparison. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
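The null-unboxing failure described in this issue can be reproduced with plain Java. This is a sketch of the failure mode and of the shape of the null-check fix, not the actual AVG code or the attached patch; the field names mirror those mentioned in the report.

```java
// Sketch of the PIG-1233 failure mode: if accumulate() is never called,
// intermediateCount stays null, and unboxing it in a numeric comparison
// throws NullPointerException.
public class AvgNullCheckSketch {
    static Long intermediateCount = null;  // names mirror the bug report
    static Double intermediateSum = null;

    // Buggy shape: "intermediateCount > 0" unboxes a null Long and throws.
    static Double getValueBuggy() {
        if (intermediateCount > 0) { // NPE here when intermediateCount is null
            return intermediateSum / intermediateCount;
        }
        return null;
    }

    // Fixed shape with the kind of null check the patch adds (a sketch).
    static Double getValueFixed() {
        if (intermediateCount != null && intermediateCount > 0) {
            return intermediateSum / intermediateCount;
        }
        return null;
    }

    public static void main(String[] args) {
        boolean threw = false;
        try {
            getValueBuggy();
        } catch (NullPointerException e) {
            threw = true;
        }
        System.out.println("buggy throws NPE: " + threw);
        System.out.println("fixed returns: " + getValueFixed());
    }
}
```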
[jira] Commented: (PIG-1131) Pig simple join does not work when it contains empty lines
[ https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831898#action_12831898 ] Hadoop QA commented on PIG-1131: +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12435402/pig-1131.patch against trunk revision 908324. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/197/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/197/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/197/console This message is automatically generated. Pig simple join does not work when it contains empty lines -- Key: PIG-1131 URL: https://issues.apache.org/jira/browse/PIG-1131 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Viraj Bhat Assignee: Ashutosh Chauhan Fix For: 0.7.0 Attachments: junk1.txt, junk2.txt, pig-1131.patch, pig-1131.patch, simplejoinscript.pig I have a simple script, which does a JOIN. 
{code}
input1 = load '/user/viraj/junk1.txt' using PigStorage(' ');
describe input1;
input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001');
describe input2;
joineddata = JOIN input1 by $0, input2 by $0;
describe joineddata;
store joineddata into 'result';
{code}
The input data contains empty lines. The join fails in the Map phase with the following error in POLocalRearrange.java:
{code}
java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
	at java.util.ArrayList.RangeCheck(ArrayList.java:547)
	at java.util.ArrayList.get(ArrayList.java:322)
	at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
	at org.apache.hadoop.mapred.Child.main(Child.java:159)
{code}
I am surprised that the test cases did not detect this error. Could we add data containing empty lines to the test cases? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
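The stack trace above shows index 1 being read from a size-1 tuple. A plausible mechanism, sketched here with a plain-Java stand-in for PigStorage-style line splitting (the helper name is hypothetical): an empty input line still splits into a single empty field, so the resulting tuple has only one field, and any access past field 0 raises exactly this IndexOutOfBoundsException.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of why an empty input line can break downstream field access.
// toTuple is a hypothetical stand-in for PigStorage's line splitting.
public class EmptyLineSketch {
    static List<String> toTuple(String line, String delim) {
        // limit -1 keeps trailing empty fields, as a loader typically would
        return Arrays.asList(line.split(delim, -1));
    }

    public static void main(String[] args) {
        List<String> normal = toTuple("key1 value1", " ");
        List<String> empty = toTuple("", " ");
        System.out.println(normal.size()); // 2 fields
        System.out.println(empty.size());  // 1 field: a single empty string
        // Reading field 1 of the empty-line tuple reproduces the
        // "Index: 1, Size: 1" failure from the stack trace.
        boolean threw = false;
        try {
            empty.get(1);
        } catch (IndexOutOfBoundsException e) {
            threw = true;
        }
        System.out.println(threw);
    }
}
```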