[jira] Commented: (PIG-1231) Default DataBagIterator.hasNext() should be idempotent in all cases

2010-02-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831400#action_12831400
 ] 

Hadoop QA commented on PIG-1231:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12435230/PIG-1231-1.patch
  against trunk revision 907760.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/206/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/206/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/206/console

This message is automatically generated.

 Default DataBagIterator.hasNext() should be idempotent in all cases
 ---

 Key: PIG-1231
 URL: https://issues.apache.org/jira/browse/PIG-1231
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.6.0

 Attachments: PIG-1231-1.patch


 DefaultDataBagIterator.hasNext() is not repeatable when the following 
 conditions are met:
 1. There are no more tuples in the last spill file
 2. There are no tuples in memory (all contents have been spilled to files)
 This is not acceptable because the name hasNext() implies that it is 
 idempotent. In BagFormat, we do misuse DataBagIterator.hasNext() on the 
 assumption that hasNext() is always idempotent, which leads to some 
 mysterious errors. 
 Condition 2 seems very restrictive, but when the databag is really big and 
 memory can hold only a couple of tuples, the chance of hitting condition 2 
 is high enough.
 Here is one error we saw:
 Caused by: java.io.IOException: Stream closed
 at 
 java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:145)
 at java.io.BufferedInputStream.fill(BufferedInputStream.java:189)
 at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
 at java.io.DataInputStream.readByte(DataInputStream.java:248)
 at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:278)
 at 
 org.apache.pig.data.DefaultDataBag$DefaultDataBagIterator.readFromFile(DefaultDataBag.java:237)
 ... 20 more
 This happens because we call hasNext(), which reaches EOF, so we close the 
 file. Then we call hasNext() again on the assumption that it is idempotent; 
 however, the stream is closed, so we get this error.
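The usual way to make hasNext() idempotent over a stream is to read ahead once and cache the result, so repeated calls never touch the (possibly closed) stream again. A dependency-free sketch of that pattern, with a hypothetical class name (not Pig's actual fix):

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical illustration: hasNext() reads ahead once and caches the
// element, so calling it repeatedly (including at EOF) is always safe.
class BufferedLookaheadIterator<T> implements Iterator<T> {
    private final Iterator<T> source; // e.g. tuples read from a spill file
    private T buffered;               // element fetched by the last hasNext()
    private boolean hasBuffered;

    BufferedLookaheadIterator(Iterator<T> source) { this.source = source; }

    @Override
    public boolean hasNext() {
        if (hasBuffered) return true;        // repeated calls are idempotent
        if (!source.hasNext()) return false; // EOF: nothing is re-read
        buffered = source.next();
        hasBuffered = true;
        return true;
    }

    @Override
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        hasBuffered = false;
        return buffered;
    }
}
```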

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-834) incorrect plan when algebraic functions are nested

2010-02-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831421#action_12831421
 ] 

Hadoop QA commented on PIG-834:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12435027/pig-834_2.patch
  against trunk revision 907760.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/195/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/195/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/195/console

This message is automatically generated.

 incorrect plan when algebraic functions are nested
 --

 Key: PIG-834
 URL: https://issues.apache.org/jira/browse/PIG-834
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Thejas M Nair
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: pig-834.patch, pig-834_2.patch


 {code}
 a = load 'students.txt' as (c1,c2,c3,c4); 
 c = group a by c2;  
 f = foreach c generate COUNT(org.apache.pig.builtin.Distinct($1.$2));
 {code}
 Notice that the Distinct UDF is missing in the Combine and Reduce stages. As 
 a result, distinct does not run and incorrect results are produced. Distinct 
 should be evaluated in all 3 stages, and the output of Distinct should be 
 given to COUNT in the Reduce stage.
 {code}
 # Map Reduce Plan  
 #--
 MapReduce node 1-122
 Map Plan
 Local Rearrange[tuple]{bytearray}(false) - 1-139
 |   |
 |   Project[bytearray][1] - 1-140
 |
 |---New For Each(false,false)[bag] - 1-127
 |   |
 |   POUserFunc(org.apache.pig.builtin.COUNT$Initial)[tuple] - 1-125
 |   |
 |   |---POUserFunc(org.apache.pig.builtin.Distinct)[bag] - 1-126
 |   |
 |   |---Project[bag][2] - 1-123
 |   |
 |   |---Project[bag][1] - 1-124
 |   |
 |   Project[bytearray][0] - 1-133
 |
 |---Pre Combiner Local Rearrange[tuple]{Unknown} - 1-141
 |
 
 |---Load(hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/tejas/students.txt:org.apache.pig.builtin.PigStorage)
  - 1-111
 Combine Plan
 Local Rearrange[tuple]{bytearray}(false) - 1-143
 |   |
 |   Project[bytearray][1] - 1-144
 |
 |---New For Each(false,false)[bag] - 1-132
 |   |
 |   POUserFunc(org.apache.pig.builtin.COUNT$Intermediate)[tuple] - 1-130
 |   |
 |   |---Project[bag][0] - 1-135
 |   |
 |   Project[bytearray][1] - 1-134
 |
 |---POCombinerPackage[tuple]{bytearray} - 1-137
 Reduce Plan
 Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-121
 |
 |---New For Each(false)[bag] - 1-120
 |   |
 |   POUserFunc(org.apache.pig.builtin.COUNT$Final)[long] - 1-119
 |   |
 |   |---Project[bag][0] - 1-136
 |
 |---POCombinerPackage[tuple]{bytearray} - 1-145
 Global sort: false
 {code}
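The plan above shows why the result is wrong: COUNT is split into its three algebraic stages (COUNT$Initial, COUNT$Intermediate, COUNT$Final), but the nested Distinct only appears in the map stage. For COUNT(Distinct(...)) to be correct, de-duplication has to run in every stage. A dependency-free sketch of that decomposition (not Pig's actual Algebraic classes):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of how COUNT(Distinct(...)) must decompose across the three
// algebraic stages: every stage de-duplicates; only the final stage counts.
class CountDistinct {
    // Map side: de-duplicate the tuples seen by this task.
    static <T> Set<T> initial(List<T> values) {
        return new HashSet<>(values);
    }

    // Combiner: union the partial distinct sets, de-duplicating again.
    static <T> Set<T> intermediate(List<Set<T>> partials) {
        Set<T> union = new HashSet<>();
        for (Set<T> p : partials) union.addAll(p);
        return union;
    }

    // Reduce side: union once more, then hand the distinct set to COUNT.
    static <T> long fin(List<Set<T>> partials) {
        return intermediate(partials).size();
    }
}
```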

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-259) allow store to overwrite existing directory

2010-02-09 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831481#action_12831481
 ] 

Jeff Zhang commented on PIG-259:


Response to Alan:

I agree that it makes more sense to do the overwrite in StoreFunc, and I notice 
that there's a JIRA, PIG-1216, which is related to this.



 allow store to overwrite existing directory
 ---

 Key: PIG-259
 URL: https://issues.apache.org/jira/browse/PIG-259
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Jeff Zhang
 Fix For: 0.8.0

 Attachments: Pig_259.patch, Pig_259_2.patch


 We have users who are asking for a flag to overwrite an existing directory.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-259) allow store to overwrite existing directory

2010-02-09 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831484#action_12831484
 ] 

Jeff Zhang commented on PIG-259:


Response to Dmitriy:

Thanks for your suggestion of implementing overwrite at the StoreFunc level 
rather than at the language level. I can bug in this. And I think another 
advantage of putting it in StoreFunc is that it's more flexible than putting 
it in the language: we have more control over StoreFunc than over Pig Latin. 



 allow store to overwrite existing directory
 ---

 Key: PIG-259
 URL: https://issues.apache.org/jira/browse/PIG-259
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Jeff Zhang
 Fix For: 0.8.0

 Attachments: Pig_259.patch, Pig_259_2.patch


 We have users who are asking for a flag to overwrite an existing directory.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-259) allow store to overwrite existing directory

2010-02-09 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831487#action_12831487
 ] 

Jeff Zhang commented on PIG-259:


Sorry, I meant I can buy in to your suggestion.

 allow store to overwrite existing directory
 ---

 Key: PIG-259
 URL: https://issues.apache.org/jira/browse/PIG-259
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Jeff Zhang
 Fix For: 0.8.0

 Attachments: Pig_259.patch, Pig_259_2.patch


 We have users who are asking for a flag to overwrite an existing directory.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-834) incorrect plan when algebraic functions are nested

2010-02-09 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831535#action_12831535
 ] 

Ashutosh Chauhan commented on PIG-834:
--

Another Hudson quirk :( The failed test passes on a local machine. The patch 
is ready for review.

 incorrect plan when algebraic functions are nested
 --

 Key: PIG-834
 URL: https://issues.apache.org/jira/browse/PIG-834
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Thejas M Nair
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: pig-834.patch, pig-834_2.patch


 {code}
 a = load 'students.txt' as (c1,c2,c3,c4); 
 c = group a by c2;  
 f = foreach c generate COUNT(org.apache.pig.builtin.Distinct($1.$2));
 {code}
 Notice that the Distinct UDF is missing in the Combine and Reduce stages. As 
 a result, distinct does not run and incorrect results are produced. Distinct 
 should be evaluated in all 3 stages, and the output of Distinct should be 
 given to COUNT in the Reduce stage.
 {code}
 # Map Reduce Plan  
 #--
 MapReduce node 1-122
 Map Plan
 Local Rearrange[tuple]{bytearray}(false) - 1-139
 |   |
 |   Project[bytearray][1] - 1-140
 |
 |---New For Each(false,false)[bag] - 1-127
 |   |
 |   POUserFunc(org.apache.pig.builtin.COUNT$Initial)[tuple] - 1-125
 |   |
 |   |---POUserFunc(org.apache.pig.builtin.Distinct)[bag] - 1-126
 |   |
 |   |---Project[bag][2] - 1-123
 |   |
 |   |---Project[bag][1] - 1-124
 |   |
 |   Project[bytearray][0] - 1-133
 |
 |---Pre Combiner Local Rearrange[tuple]{Unknown} - 1-141
 |
 
 |---Load(hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/tejas/students.txt:org.apache.pig.builtin.PigStorage)
  - 1-111
 Combine Plan
 Local Rearrange[tuple]{bytearray}(false) - 1-143
 |   |
 |   Project[bytearray][1] - 1-144
 |
 |---New For Each(false,false)[bag] - 1-132
 |   |
 |   POUserFunc(org.apache.pig.builtin.COUNT$Intermediate)[tuple] - 1-130
 |   |
 |   |---Project[bag][0] - 1-135
 |   |
 |   Project[bytearray][1] - 1-134
 |
 |---POCombinerPackage[tuple]{bytearray} - 1-137
 Reduce Plan
 Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-121
 |
 |---New For Each(false)[bag] - 1-120
 |   |
 |   POUserFunc(org.apache.pig.builtin.COUNT$Final)[long] - 1-119
 |   |
 |   |---Project[bag][0] - 1-136
 |
 |---POCombinerPackage[tuple]{bytearray} - 1-145
 Global sort: false
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1231) Default DataBagIterator.hasNext() should be idempotent in all cases

2010-02-09 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831558#action_12831558
 ] 

Daniel Dai commented on PIG-1231:
-

testCompressed1 failed with java.lang.IllegalArgumentException: port out of 
range:-1. Not a real problem; the manual test passes.

 Default DataBagIterator.hasNext() should be idempotent in all cases
 ---

 Key: PIG-1231
 URL: https://issues.apache.org/jira/browse/PIG-1231
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.6.0

 Attachments: PIG-1231-1.patch


 DefaultDataBagIterator.hasNext() is not repeatable when the following 
 conditions are met:
 1. There are no more tuples in the last spill file
 2. There are no tuples in memory (all contents have been spilled to files)
 This is not acceptable because the name hasNext() implies that it is 
 idempotent. In BagFormat, we do misuse DataBagIterator.hasNext() on the 
 assumption that hasNext() is always idempotent, which leads to some 
 mysterious errors. 
 Condition 2 seems very restrictive, but when the databag is really big and 
 memory can hold only a couple of tuples, the chance of hitting condition 2 
 is high enough.
 Here is one error we saw:
 Caused by: java.io.IOException: Stream closed
 at 
 java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:145)
 at java.io.BufferedInputStream.fill(BufferedInputStream.java:189)
 at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
 at java.io.DataInputStream.readByte(DataInputStream.java:248)
 at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:278)
 at 
 org.apache.pig.data.DefaultDataBag$DefaultDataBagIterator.readFromFile(DefaultDataBag.java:237)
 ... 20 more
 This happens because we call hasNext(), which reaches EOF, so we close the 
 file. Then we call hasNext() again on the assumption that it is idempotent; 
 however, the stream is closed, so we get this error.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1231) Default DataBagIterator.hasNext() should be idempotent in all cases

2010-02-09 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831572#action_12831572
 ] 

Alan Gates commented on PIG-1231:
-

+1 Changes look good.  

 Default DataBagIterator.hasNext() should be idempotent in all cases
 ---

 Key: PIG-1231
 URL: https://issues.apache.org/jira/browse/PIG-1231
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.6.0

 Attachments: PIG-1231-1.patch


 DefaultDataBagIterator.hasNext() is not repeatable when the following 
 conditions are met:
 1. There are no more tuples in the last spill file
 2. There are no tuples in memory (all contents have been spilled to files)
 This is not acceptable because the name hasNext() implies that it is 
 idempotent. In BagFormat, we do misuse DataBagIterator.hasNext() on the 
 assumption that hasNext() is always idempotent, which leads to some 
 mysterious errors. 
 Condition 2 seems very restrictive, but when the databag is really big and 
 memory can hold only a couple of tuples, the chance of hitting condition 2 
 is high enough.
 Here is one error we saw:
 Caused by: java.io.IOException: Stream closed
 at 
 java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:145)
 at java.io.BufferedInputStream.fill(BufferedInputStream.java:189)
 at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
 at java.io.DataInputStream.readByte(DataInputStream.java:248)
 at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:278)
 at 
 org.apache.pig.data.DefaultDataBag$DefaultDataBagIterator.readFromFile(DefaultDataBag.java:237)
 ... 20 more
 This happens because we call hasNext(), which reaches EOF, so we close the 
 file. Then we call hasNext() again on the assumption that it is idempotent; 
 however, the stream is closed, so we get this error.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1224) Collected group should change to use new (internal) bag

2010-02-09 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831575#action_12831575
 ] 

Olga Natkovich commented on PIG-1224:
-

This patch is already covered by existing tests; it only changes the 
internals of the implementation.

 Collected group should change to use new (internal) bag
 ---

 Key: PIG-1224
 URL: https://issues.apache.org/jira/browse/PIG-1224
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: pig-1224.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1209) Port POJoinPackage to proactively spill

2010-02-09 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831578#action_12831578
 ] 

Olga Natkovich commented on PIG-1209:
-

The current unit tests adequately cover this internal change. Additionally, 
Ashutosh ran several e2e tests and verified that this change fixed the user's 
problem: the user's script no longer ran out of memory.

 Port POJoinPackage to proactively spill
 ---

 Key: PIG-1209
 URL: https://issues.apache.org/jira/browse/PIG-1209
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: pig-1209.patch


 POPackage proactively spills the bag whereas POJoinPackage still uses the 
 SpillableMemoryManager. We should port this to use InternalCacheBag which 
 proactively spills.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1230) Streaming input in POJoinPackage should use nonspillable bag to collect tuples

2010-02-09 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831585#action_12831585
 ] 

Ashutosh Chauhan commented on PIG-1230:
---

This patch switches POJoinPackage to use NonSpillableDataBag for the last bag 
instead of the currently used InternalCachedBag. Both of these bag 
implementations are already covered by existing unit tests, so this patch 
needs no new tests. 

 Streaming input in POJoinPackage should use nonspillable bag to collect tuples
 --

 Key: PIG-1230
 URL: https://issues.apache.org/jira/browse/PIG-1230
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: pig-1230.patch, pig-1230_1.patch


 The last table of a join statement is streamed through instead of having 
 all of its tuples collected in a bag. As a further optimization, tuples of 
 that relation are collected in chunks in a bag. Since we don't want to 
 spill the tuples from this bag, NonSpillableDataBag should be used to hold 
 tuples for this relation. Initially DefaultDataBag was used; it was later 
 changed to InternalCachedBag as part of PIG-1209.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1232) [zebra] Column Group schema file versioning

2010-02-09 Thread Yan Zhou (JIRA)
[zebra] Column Group schema file versioning
---

 Key: PIG-1232
 URL: https://issues.apache.org/jira/browse/PIG-1232
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.6.0
Reporter: Yan Zhou
Priority: Minor


The missing versioning in the column group schema file makes it difficult to 
evolve the index. For instance, prior to the fix for PIG-1201, the index was 
empty for unsorted tables. However, the index is useful even for unsorted 
tables, since it saves a listStatus call to the name node, which has been 
found expensive for directories with many entries. As part of that fix, an 
index is now built. Without versioning, but with the demand to support 
backward compatibility, another non-classical approach had to be devised to 
build the index when necessary. As part of this fix, we may want to address 
that issue as well.
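The kind of versioning being asked for can be sketched very simply: write an explicit version number at the head of the schema file, and treat header-less files as the pre-versioning format for backward compatibility. The class and field names below are hypothetical, not Zebra's actual layout:

```java
// Hypothetical sketch: a schema file that begins with an explicit version
// header, so readers can branch on the version instead of guessing the
// format from the file's shape.
class VersionedSchemaFile {
    static final int CURRENT_VERSION = 2;

    // Prepend the version header when writing.
    static String serialize(String schema) {
        return "version=" + CURRENT_VERSION + "\n" + schema;
    }

    // Files written before versioning existed have no header; treat them
    // as version 1 for backward compatibility.
    static int readVersion(String contents) {
        if (contents.startsWith("version=")) {
            int eol = contents.indexOf('\n');
            return Integer.parseInt(contents.substring(8, eol));
        }
        return 1;
    }
}
```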

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1230) Streaming input in POJoinPackage should use nonspillable bag to collect tuples

2010-02-09 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831591#action_12831591
 ] 

Olga Natkovich commented on PIG-1230:
-

The patch looks good. One comment: when iterating through the bags, we should 
say numInputs - 1 rather than lastBagIndex (which happens to have the right 
value) to make the code more readable and the intent clearer. Once that 
change is made, the patch can be committed.
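The readability point can be illustrated with a small sketch (class and strings are hypothetical, not POJoinPackage's actual code): the streamed input is always the last one, so indexing it as numInputs - 1 states the intent directly.

```java
// Sketch of the reviewer's suggestion: refer to the streamed (last) input
// as numInputs - 1 directly, rather than through a separate lastBagIndex
// variable that merely happens to hold the same value.
class JoinPackageSketch {
    static String[] collectBags(int numInputs) {
        String[] bags = new String[numInputs];
        for (int i = 0; i < numInputs - 1; i++) {
            bags[i] = "materialized";     // earlier inputs go into real bags
        }
        bags[numInputs - 1] = "streamed"; // the last input is streamed through
        return bags;
    }
}
```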

 Streaming input in POJoinPackage should use nonspillable bag to collect tuples
 --

 Key: PIG-1230
 URL: https://issues.apache.org/jira/browse/PIG-1230
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: pig-1230.patch, pig-1230_1.patch


 The last table of a join statement is streamed through instead of having 
 all of its tuples collected in a bag. As a further optimization, tuples of 
 that relation are collected in chunks in a bag. Since we don't want to 
 spill the tuples from this bag, NonSpillableDataBag should be used to hold 
 tuples for this relation. Initially DefaultDataBag was used; it was later 
 changed to InternalCachedBag as part of PIG-1209.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1231) Default DataBagIterator.hasNext() should be idempotent in all cases

2010-02-09 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1231:


  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

Patch committed to both trunk and 0.6 branch.

 Default DataBagIterator.hasNext() should be idempotent in all cases
 ---

 Key: PIG-1231
 URL: https://issues.apache.org/jira/browse/PIG-1231
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.6.0

 Attachments: PIG-1231-1.patch


 DefaultDataBagIterator.hasNext() is not repeatable when the following 
 conditions are met:
 1. There are no more tuples in the last spill file
 2. There are no tuples in memory (all contents have been spilled to files)
 This is not acceptable because the name hasNext() implies that it is 
 idempotent. In BagFormat, we do misuse DataBagIterator.hasNext() on the 
 assumption that hasNext() is always idempotent, which leads to some 
 mysterious errors. 
 Condition 2 seems very restrictive, but when the databag is really big and 
 memory can hold only a couple of tuples, the chance of hitting condition 2 
 is high enough.
 Here is one error we saw:
 Caused by: java.io.IOException: Stream closed
 at 
 java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:145)
 at java.io.BufferedInputStream.fill(BufferedInputStream.java:189)
 at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
 at java.io.DataInputStream.readByte(DataInputStream.java:248)
 at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:278)
 at 
 org.apache.pig.data.DefaultDataBag$DefaultDataBagIterator.readFromFile(DefaultDataBag.java:237)
 ... 20 more
 This happens because we call hasNext(), which reaches EOF, so we close the 
 file. Then we call hasNext() again on the assumption that it is idempotent; 
 however, the stream is closed, so we get this error.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-259) allow store to overwrite existing directory

2010-02-09 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831628#action_12831628
 ] 

Olga Natkovich commented on PIG-259:


+1 on passing the information in the constructor. Since we need the store 
function to do the validation, we don't have control over the semantics, and 
it is better not to have constructs in the language whose semantics are not 
well defined.

One thing we need to provide to store function writers is guidance on when 
the information they get in the constructor can be acted on. 
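The approach the thread converges on can be sketched as follows. The class name, constructor argument, and helper method are all hypothetical (this is not Pig's actual StoreFunc API): the overwrite flag arrives through the store function's constructor, and the function itself decides whether to delete existing output.

```java
// Hypothetical sketch of an overwrite-aware store function: the flag is
// passed in the constructor (as the comment proposes), and the function
// owns the decision to remove existing output before writing.
class OverwritingStore {
    private final boolean overwrite;

    OverwritingStore(boolean overwrite) { this.overwrite = overwrite; }

    // Called before writing; outputExists stands in for a real
    // file-system existence check against the target directory.
    boolean shouldDeleteExisting(boolean outputExists) {
        return overwrite && outputExists;
    }
}
```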

 allow store to overwrite existing directory
 ---

 Key: PIG-259
 URL: https://issues.apache.org/jira/browse/PIG-259
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Jeff Zhang
 Fix For: 0.8.0

 Attachments: Pig_259.patch, Pig_259_2.patch


 We have users who are asking for a flag to overwrite an existing directory.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



merging load-store-redesign branch back into trunk

2010-02-09 Thread Olga Natkovich
Pig Developers,

 

As most of you know, we have spent the last couple of months mostly
working on the LSR branch. We believe that in about a week the code in the
branch will be stable enough to merge back into the trunk. 

 

If you are using trunk or making any modifications to it, you will be
impacted. Please see the following documents for details:

 

http://wiki.apache.org/pig/Pig070IncompatibleChanges

http://wiki.apache.org/pig/LoadStoreRedesignProposal

 

Please, let us know if you have any questions or concerns.

 

Thanks,

 

Olga



[jira] Commented: (PIG-1188) Padding nulls to the input tuple according to input schema

2010-02-09 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831749#action_12831749
 ] 

Ashutosh Chauhan commented on PIG-1188:
---

I have a different take on this. Referring to the original description of this 
JIRA, I would expect Pig's behavior to be the one given under "Current 
result", not the one under "Desired result". Pig should not do anything behind 
the scenes with the data, which is what the desired result proposes. In cases 
where the columns are not consistent, there are two scenarios: with or without 
a schema. If the user supplied a schema, I would take that as the user telling 
Pig that the data is consistent with that schema; if that is not the case, it 
is perfectly fine to throw an exception at runtime. The tricky case is when no 
schema is provided and the user tries to access a non-existent field. I think 
even in such cases it is valid to throw an exception at runtime instead of 
returning null. First, accessing a non-existent field is an error condition in 
any case. Second, it cannot be assumed that the user wants non-existent fields 
to be treated as null; if he wants that, he should implement a LoadFunc that 
treats them that way. Third, doing further operations on these columns down 
the pipeline may produce unpredictable results in other operators. Fourth, 
returning null will obscure bugs in Pig where Pig (rather than the user) 
accesses non-existent fields while constructing new tuples at runtime, e.g. 
for joins (see PIG-1131). 

In short, I am suggesting that Pig keep the behavior it has today: it can load 
a variable number of columns into a tuple, but if the user accesses a 
non-existent field, it should throw an exception and let the user handle such 
data himself by implementing his own LoadFunc. 

Thoughts ?

 Padding nulls to the input tuple according to input schema
 --

 Key: PIG-1188
 URL: https://issues.apache.org/jira/browse/PIG-1188
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Daniel Dai
Assignee: Richard Ding
 Fix For: 0.7.0


 Currently, the number of fields in the input tuple is determined by the 
 data. When we have a schema, we should generate the input tuple according 
 to the schema, padding with nulls if necessary. Here is one example:
 Pig script:
 {code}
 a = load '1.txt' as (a0, a1);
 dump a;
 {code}
 Input file:
 {code}
 1   2
 1   2   3
 1
 {code}
 Current result:
 {code}
 (1,2)
 (1,2,3)
 (1)
 {code}
 Desired result:
 {code}
 (1,2)
 (1,2)
 (1,null)
 {code}
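The transformation the "Desired result" implies can be sketched in a few lines (class and method names hypothetical, not Pig's implementation): every input row is normalized to the declared schema width, truncating extra fields and padding short rows with nulls.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of schema-driven normalization: rows longer than the schema are
// truncated, and rows shorter than the schema are padded with nulls.
class SchemaPadding {
    static List<String> normalize(List<String> fields, int schemaWidth) {
        int keep = Math.min(fields.size(), schemaWidth);
        List<String> out = new ArrayList<>(fields.subList(0, keep));
        while (out.size() < schemaWidth) {
            out.add(null); // pad short rows, e.g. (1) -> (1,null)
        }
        return out;
    }
}
```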

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-1090) Update sources to reflect recent changes in load-store interfaces

2010-02-09 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath resolved PIG-1090.
-

Resolution: Fixed

+1 for PIG-1090-22.patch, patch committed.
Closing this JIRA as resolved since all changes to accommodate the new 
load-store interfaces have now been checked in. 

 Update sources to reflect recent changes in load-store interfaces
 -

 Key: PIG-1090
 URL: https://issues.apache.org/jira/browse/PIG-1090
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.7.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.7.0

 Attachments: PIG-1090-10.patch, PIG-1090-11.patch, PIG-1090-12.patch, 
 PIG-1090-13.patch, PIG-1090-14.patch, PIG-1090-15.patch, PIG-1090-16.patch, 
 PIG-1090-17.patch, PIG-1090-18.patch, PIG-1090-19.patch, PIG-1090-2.patch, 
 PIG-1090-20.patch, PIG-1090-21.patch, PIG-1090-22.patch, PIG-1090-3.patch, 
 PIG-1090-4.patch, PIG-1090-6.patch, PIG-1090-7.patch, PIG-1090-8.patch, 
 PIG-1090-9.patch, PIG-1090.patch, PIG-1190-5.patch


 There have been some changes (as recorded in the Changes Section, Nov 2 2009 
 sub section of http://wiki.apache.org/pig/LoadStoreRedesignProposal) in the 
 load/store interfaces - this jira is to track the task of making those 
 changes under src. Changes under test will be addresses in a different jira.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Reopened: (PIG-1131) Pig simple join does not work when it contains empty lines

2010-02-09 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan reopened PIG-1131:
---


Reopening since it is related to PIG-1188, not a duplicate of it.

 Pig simple join does not work when it contains empty lines
 --

 Key: PIG-1131
 URL: https://issues.apache.org/jira/browse/PIG-1131
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Viraj Bhat
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: junk1.txt, junk2.txt, simplejoinscript.pig


 I have a simple script, which does a JOIN.
 {code}
 input1 = load '/user/viraj/junk1.txt' using PigStorage(' ');
 describe input1;
 input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001');
 describe input2;
 joineddata = JOIN input1 by $0, input2 by $0;
 describe joineddata;
 store joineddata into 'result';
 {code}
 The input data contains empty lines.  
 The join fails in the Map phase with the following error in the 
  POLocalRearrange.java
 java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
   at java.util.ArrayList.RangeCheck(ArrayList.java:547)
   at java.util.ArrayList.get(ArrayList.java:322)
   at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
   at org.apache.hadoop.mapred.Child.main(Child.java:159)
 I am surprised that the test cases did not detect this error. Could we add 
 this data which contains empty lines to the testcases?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1188) Padding nulls to the input tuple according to input schema

2010-02-09 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831776#action_12831776
 ] 

Alan Gates commented on PIG-1188:
-

A few thoughts:

In a job that is going to process a billion rows and run for 3 hours, one bad 
row should not cause the whole job to fail.

This invalid access should certainly cause a warning.  Users can look at the 
warnings at the end of the query and decide they do not want to keep the output 
because of the warnings.  But failure should not be the default case (see 
previous point).  Perhaps we should have a warnings = error option like 
compilers do so users who are very worried about the warnings can make sure 
they fail.  But that's a different proposal for a different JIRA.

bq. Third, doing further operations on these columns down the pipeline may 
result in non-predictable results in other operators.

I don't follow.  Nulls in the pipeline shouldn't cause a problem.  UDFs and 
operators need to be able to handle null values whether they come from 
processing or from the data itself.

bq. Second, it can't be assumed that the user wants those non-existent fields to 
be treated as null. If he wants it that way, he should implement the LoadFunc 
interface so that it treats them that way.

One could argue that it can't be assumed the user wants his query to fail when 
a field is missing.  We have to assume one way or another.  Null is a better 
assumption than failure, since it is possible for a user who doesn't want that 
behavior to detect it and deal with it.  As it is now, the user has to modify 
his data or write a new load function to deal with padding his data.

I agree with you that in the schema case, it would be ideal if not having a 
field was an error.  However, given the architecture this is difficult.  And 
stipulating that load functions test every record to assure it matches the 
schema is too much of a performance penalty.  But for the non-schema case I 
don't agree.  Pig's philosophy of "Pigs eat anything" doesn't mean much if Pig 
gags as soon as it gets a record that doesn't match its expectation.




 Padding nulls to the input tuple according to input schema
 --

 Key: PIG-1188
 URL: https://issues.apache.org/jira/browse/PIG-1188
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Daniel Dai
Assignee: Richard Ding
 Fix For: 0.7.0


 Currently, the number of fields in the input tuple is determined by the data. 
 When we have a schema, we should generate the input tuple according to the 
 schema, padding nulls if necessary. Here is one example:
 Pig script:
 {code}
 a = load '1.txt' as (a0, a1);
 dump a;
 {code}
 Input file:
 {code}
 1   2
 1   2   3
 1
 {code}
 Current result:
 {code}
 (1,2)
 (1,2,3)
 (1)
 {code}
 Desired result:
 {code}
 (1,2)
 (1,2)
 (1, null)
 {code}
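 The desired behavior can be sketched in Python (an illustrative model under 
 assumed semantics, not Pig's actual implementation): truncate each row to the 
 declared schema width and pad short rows with nulls.
 {code}
def conform_to_schema(fields, schema_width):
    """Truncate or null-pad a row so it matches the declared schema width."""
    fields = fields[:schema_width]  # drop extra fields beyond the schema
    return tuple(fields) + (None,) * (schema_width - len(fields))  # pad nulls

rows = [["1", "2"], ["1", "2", "3"], ["1"]]
padded = [conform_to_schema(r, 2) for r in rows]
# -> [('1', '2'), ('1', '2'), ('1', None)]
 {code}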

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1207) [zebra] Data sanity check should be performed at the end of writing instead of later at query time

2010-02-09 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1207:
--

Attachment: PIG-1207.patch

 [zebra] Data sanity check should be performed at the end  of writing instead 
 of later at query time
 ---

 Key: PIG-1207
 URL: https://issues.apache.org/jira/browse/PIG-1207
 Project: Pig
  Issue Type: Improvement
Reporter: Yan Zhou
Assignee: Yan Zhou
 Attachments: PIG-1207.patch


 Currently the equality check on the number of rows across different column 
 groups is performed at query time. The error info is sketchy: it only emits a 
 "Column groups are not evenly distributed" message or, worse, throws an 
 IndexOutOfBoundsException from CGScanner.getCGValue. This is because 
 BasicTable.atEnd and BasicTable.getKey, which are called just before 
 BasicTable.getValue, only check the first column group in the projection, so 
 any discrepancy in the number of rows per file across the projected column 
 groups can make BasicTable.atEnd return false and BasicTable.getKey return a 
 key normally while another column group has already exhausted its current 
 file, and the call to its CGScanner.getCGValue then throws the exception.
 This check should also be performed at the end of writing, and the error info 
 should be more informative.
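 A minimal sketch of the proposed write-time check (Python model with 
 hypothetical names, not Zebra's actual code): verify at close time that every 
 column group wrote the same number of rows, and fail with an informative 
 message instead of a vague query-time error.
 {code}
def check_column_groups(row_counts):
    """Sanity-check per-column-group row counts at the end of writing.

    row_counts: mapping of column-group name -> rows written.
    Raises ValueError with a detailed message instead of deferring to a
    vague "not evenly distributed" error at query time.
    """
    counts = set(row_counts.values())
    if len(counts) > 1:
        detail = ", ".join(f"{cg}={n}" for cg, n in sorted(row_counts.items()))
        raise ValueError(f"Column groups have unequal row counts: {detail}")

check_column_groups({"cg0": 100, "cg1": 100})  # equal counts: passes silently
 {code}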

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1140) [zebra] Use of Hadoop 2.0 APIs

2010-02-09 Thread Jay Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jay Tang reassigned PIG-1140:
-

Assignee: Xuefu Zhang

 [zebra] Use of Hadoop 2.0 APIs  
 

 Key: PIG-1140
 URL: https://issues.apache.org/jira/browse/PIG-1140
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.6.0
Reporter: Yan Zhou
Assignee: Xuefu Zhang
 Fix For: 0.7.0

 Attachments: zebra.0209


 Currently, Zebra is still using the already-deprecated Hadoop 1.8 APIs and 
 needs to upgrade to the 2.0 APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1137) [zebra] get* methods of Zebra Map/Reduce APIs need improvements

2010-02-09 Thread Jay Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jay Tang reassigned PIG-1137:
-

Assignee: Yan Zhou

 [zebra] get* methods of Zebra Map/Reduce APIs need improvements
 ---

 Key: PIG-1137
 URL: https://issues.apache.org/jira/browse/PIG-1137
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.6.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.7.0


 Currently the set* methods take external Zebra objects, namely objects of 
 ZebraStorageHint, ZebraSchema, ZebraSortInfo or ZebraProjection. 
 Correspondingly, the get* methods should return such objects instead of 
 String or Zebra-internal objects like Schema.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1139) [zebra] Encapsulation of check of ZebraSortInfo by a Zebra reader; the check by a writer could be better encapsulated

2010-02-09 Thread Jay Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jay Tang updated PIG-1139:
--

Fix Version/s: 0.8.0  (was: 0.7.0)

 [zebra] Encapsulation of check of ZebraSortInfo by a Zebra reader; the check 
 by a writer could be better encapsulated
 -

 Key: PIG-1139
 URL: https://issues.apache.org/jira/browse/PIG-1139
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.6.0
Reporter: Yan Zhou
Priority: Minor
 Fix For: 0.8.0


 Currently the user's ZebraSortInfo, passed to the Map/Reduce writer via 
 BasicTableOutputFormat.setStorageInfo, is sanity-checked by SortInfo.parse(), 
 although the whole sanity check could be performed in that method, which takes 
 a ZebraSortInfo object.
 On the reader side, however, the sanity check is left entirely to the caller 
 of the TableInputFormat.requireSortedTable method; it would be better 
 encapsulated in a new SortInfo method.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1131) Pig simple join does not work when it contains empty lines

2010-02-09 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1131:
--

Attachment: pig-1131.patch

In POLocalRearrange, the number of tuple elements not present in the key (and 
thus put in the value) is computed the first time and then cached as an 
optimization. This patch removes that caching because of the problem 
illustrated in the bug. A test case which reproduces the bug is included.
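The failure mode can be illustrated with a small Python model (hypothetical 
names, not the actual POLocalRearrange code): if the value-field indexes are 
computed from the first tuple and cached, a later, shorter tuple (e.g. from an 
empty input line) triggers an out-of-bounds access.

{code}
class CachingRearrange:
    """Model of caching the value-field indexes from the first tuple seen."""
    def __init__(self, key_index):
        self.key_index = key_index
        self.cached_value_indexes = None

    def split(self, tup):
        if self.cached_value_indexes is None:  # computed once, then cached
            self.cached_value_indexes = [
                i for i in range(len(tup)) if i != self.key_index
            ]
        key = tup[self.key_index]
        # IndexError when a later tuple is shorter than the first one
        value = [tup[i] for i in self.cached_value_indexes]
        return key, value

lr = CachingRearrange(key_index=0)
lr.split(("a", "b"))   # caches value indexes [1]
try:
    lr.split(("a",))   # row from an empty line: only one field
    failed = False
except IndexError:
    failed = True      # reproduces the IndexOutOfBoundsException pattern
{code}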

 Pig simple join does not work when it contains empty lines
 --

 Key: PIG-1131
 URL: https://issues.apache.org/jira/browse/PIG-1131
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Viraj Bhat
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: junk1.txt, junk2.txt, pig-1131.patch, 
 simplejoinscript.pig


 I have a simple script, which does a JOIN.
 {code}
 input1 = load '/user/viraj/junk1.txt' using PigStorage(' ');
 describe input1;
 input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001');
 describe input2;
 joineddata = JOIN input1 by $0, input2 by $0;
 describe joineddata;
 store joineddata into 'result';
 {code}
 The input data contains empty lines.  
 The join fails in the Map phase with the following error in the 
  POLocalRearrange.java
 java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
   at java.util.ArrayList.RangeCheck(ArrayList.java:547)
   at java.util.ArrayList.get(ArrayList.java:322)
   at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
   at org.apache.hadoop.mapred.Child.main(Child.java:159)
 I am surprised that the test cases did not detect this error. Could we add 
 this data which contains empty lines to the testcases?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1131) Pig simple join does not work when it contains empty lines

2010-02-09 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1131:
--

Status: Patch Available  (was: Reopened)

 Pig simple join does not work when it contains empty lines
 --

 Key: PIG-1131
 URL: https://issues.apache.org/jira/browse/PIG-1131
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Viraj Bhat
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: junk1.txt, junk2.txt, pig-1131.patch, 
 simplejoinscript.pig


 I have a simple script, which does a JOIN.
 {code}
 input1 = load '/user/viraj/junk1.txt' using PigStorage(' ');
 describe input1;
 input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001');
 describe input2;
 joineddata = JOIN input1 by $0, input2 by $0;
 describe joineddata;
 store joineddata into 'result';
 {code}
 The input data contains empty lines.  
 The join fails in the Map phase with the following error in the 
  POLocalRearrange.java
 java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
   at java.util.ArrayList.RangeCheck(ArrayList.java:547)
   at java.util.ArrayList.get(ArrayList.java:322)
   at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
   at org.apache.hadoop.mapred.Child.main(Child.java:159)
 I am surprised that the test cases did not detect this error. Could we add 
 this data which contains empty lines to the testcases?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1215) Make Hadoop jobId more prominent in the client log

2010-02-09 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831812#action_12831812
 ] 

Olga Natkovich commented on PIG-1215:
-

I would like to request an additional change: make sure that we can write the 
Hadoop job id information to the client-side log file, not just stdout. This 
would happen only if a special property is used.

So the additional ask is to implement handling of this new property and, when 
it is present, make sure that all messages at INFO level are written to the 
log file. This can be accomplished by changing the log listener for the log 
file so it picks up INFO-level log events.

We don't want to do this by default because it would drastically increase the 
number of log files created by Pig; today we only create the file when there 
is a real problem executing the query.
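
Pig's actual logging is log4j-based; as a rough sketch of the idea in Python's 
logging module (the property name "pig.log.jobids" is a hypothetical stand-in 
for the proposed property): the file handler accepts INFO events only when the 
property is set, and otherwise stays at WARNING so a log file is only worth 
creating for real problems.

{code}
import logging

def info_to_logfile_level(property_set):
    """Level for the client-side log-file handler: INFO when the opt-in
    property (e.g. a hypothetical pig.log.jobids) is present, else WARNING."""
    return logging.INFO if property_set else logging.WARNING

def configure_client_log(logger, logfile, property_set=False):
    # delay=True: the file is only created on the first record written
    handler = logging.FileHandler(logfile, delay=True)
    handler.setLevel(info_to_logfile_level(property_set))
    logger.addHandler(handler)
    return handler
{code}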


 Make Hadoop jobId more prominent in the client log
 --

 Key: PIG-1215
 URL: https://issues.apache.org/jira/browse/PIG-1215
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: pig-1215.patch


 This is a request from applications that want to be able to programmatically 
 parse client logs to find hadoop Ids.
 They would like to see each job id on a separate line in the following format:
 hadoopJobId: job_123456789
 They would also like to see the jobs in the order they are executed.
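 A minimal sketch of how a consuming tool might parse that format (the log 
 content around the "hadoopJobId:" lines is assumed):
 {code}
import re

JOB_ID_RE = re.compile(r"^hadoopJobId:\s*(job_\S+)", re.MULTILINE)

def extract_job_ids(log_text):
    """Return Hadoop job ids in the order they appear in the client log."""
    return JOB_ID_RE.findall(log_text)

log = """\
some other client log line
hadoopJobId: job_201002091234_0001
hadoopJobId: job_201002091234_0002
"""
# extract_job_ids(log) -> ['job_201002091234_0001', 'job_201002091234_0002']
 {code}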

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1215) Make Hadoop jobId more prominent in the client log

2010-02-09 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831817#action_12831817
 ] 

Olga Natkovich commented on PIG-1215:
-

Can we also make the value NOT_AVAILABLE rather than NOT AVAILABLE, to make it 
easier for tools to parse?

 Make Hadoop jobId more prominent in the client log
 --

 Key: PIG-1215
 URL: https://issues.apache.org/jira/browse/PIG-1215
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: pig-1215.patch


 This is a request from applications that want to be able to programmatically 
 parse client logs to find hadoop Ids.
 They would like to see each job id on a separate line in the following format:
 hadoopJobId: job_123456789
 They would also like to see the jobs in the order they are executed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1131) Pig simple join does not work when it contains empty lines

2010-02-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831818#action_12831818
 ] 

Hadoop QA commented on PIG-1131:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12435394/pig-1131.patch
  against trunk revision 908177.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/196/console

This message is automatically generated.

 Pig simple join does not work when it contains empty lines
 --

 Key: PIG-1131
 URL: https://issues.apache.org/jira/browse/PIG-1131
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Viraj Bhat
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: junk1.txt, junk2.txt, pig-1131.patch, 
 simplejoinscript.pig


 I have a simple script, which does a JOIN.
 {code}
 input1 = load '/user/viraj/junk1.txt' using PigStorage(' ');
 describe input1;
 input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001');
 describe input2;
 joineddata = JOIN input1 by $0, input2 by $0;
 describe joineddata;
 store joineddata into 'result';
 {code}
 The input data contains empty lines.  
 The join fails in the Map phase with the following error in the 
  POLocalRearrange.java
 java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
   at java.util.ArrayList.RangeCheck(ArrayList.java:547)
   at java.util.ArrayList.get(ArrayList.java:322)
   at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
   at org.apache.hadoop.mapred.Child.main(Child.java:159)
 I am surprised that the test cases did not detect this error. Could we add 
 this data which contains empty lines to the testcases?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1230) Streaming input in POJoinPackage should use nonspillable bag to collect tuples

2010-02-09 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1230:
--

Attachment: pig-1230_2.patch

As per the comment, changed lastBagIndex to numInputs - 1; no other changes.

 Streaming input in POJoinPackage should use nonspillable bag to collect tuples
 --

 Key: PIG-1230
 URL: https://issues.apache.org/jira/browse/PIG-1230
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: pig-1230.patch, pig-1230_1.patch, pig-1230_2.patch


 The last table of a join statement is streamed through instead of collecting 
 all its tuples in a bag. As a further optimization, tuples of that relation 
 are collected in chunks in a bag. Since we don't want to spill the tuples 
 from this bag, NonSpillableBag should be used to hold them. Initially, 
 DefaultDataBag was used, which was later changed to InternalCachedBag as part 
 of PIG-1209.
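 The chunking idea can be modeled in Python (names hypothetical, not the 
 POJoinPackage code): the last relation's tuples are read in fixed-size chunks 
 into a plain in-memory (non-spillable) bag, processed, and discarded, so the 
 bag never needs to spill.
 {code}
from itertools import islice

def stream_in_chunks(tuples, chunk_size):
    """Yield the last join input as short-lived in-memory bags (lists) of at
    most chunk_size tuples; each bag is discarded after use, so a
    non-spillable bag suffices."""
    it = iter(tuples)
    while True:
        bag = list(islice(it, chunk_size))  # non-spillable: plain in-memory list
        if not bag:
            return
        yield bag

chunks = list(stream_in_chunks(range(7), 3))
# -> [[0, 1, 2], [3, 4, 5], [6]]
 {code}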

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1230) Streaming input in POJoinPackage should use nonspillable bag to collect tuples

2010-02-09 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1230:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch checked-in.

 Streaming input in POJoinPackage should use nonspillable bag to collect tuples
 --

 Key: PIG-1230
 URL: https://issues.apache.org/jira/browse/PIG-1230
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: pig-1230.patch, pig-1230_1.patch, pig-1230_2.patch


 The last table of a join statement is streamed through instead of collecting 
 all its tuples in a bag. As a further optimization, tuples of that relation 
 are collected in chunks in a bag. Since we don't want to spill the tuples 
 from this bag, NonSpillableBag should be used to hold them. Initially, 
 DefaultDataBag was used, which was later changed to InternalCachedBag as part 
 of PIG-1209.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1131) Pig simple join does not work when it contains empty lines

2010-02-09 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1131:
--

Attachment: pig-1131.patch

Previous patch was stale. Merged with trunk and regenerated the patch.

 Pig simple join does not work when it contains empty lines
 --

 Key: PIG-1131
 URL: https://issues.apache.org/jira/browse/PIG-1131
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Viraj Bhat
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: junk1.txt, junk2.txt, pig-1131.patch, pig-1131.patch, 
 simplejoinscript.pig


 I have a simple script, which does a JOIN.
 {code}
 input1 = load '/user/viraj/junk1.txt' using PigStorage(' ');
 describe input1;
 input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001');
 describe input2;
 joineddata = JOIN input1 by $0, input2 by $0;
 describe joineddata;
 store joineddata into 'result';
 {code}
 The input data contains empty lines.  
 The join fails in the Map phase with the following error in the 
  POLocalRearrange.java
 java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
   at java.util.ArrayList.RangeCheck(ArrayList.java:547)
   at java.util.ArrayList.get(ArrayList.java:322)
   at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
   at org.apache.hadoop.mapred.Child.main(Child.java:159)
 I am surprised that the test cases did not detect this error. Could we add 
 this data which contains empty lines to the testcases?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1131) Pig simple join does not work when it contains empty lines

2010-02-09 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1131:
--

Status: Open  (was: Patch Available)

 Pig simple join does not work when it contains empty lines
 --

 Key: PIG-1131
 URL: https://issues.apache.org/jira/browse/PIG-1131
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Viraj Bhat
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: junk1.txt, junk2.txt, pig-1131.patch, pig-1131.patch, 
 simplejoinscript.pig


 I have a simple script, which does a JOIN.
 {code}
 input1 = load '/user/viraj/junk1.txt' using PigStorage(' ');
 describe input1;
 input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001');
 describe input2;
 joineddata = JOIN input1 by $0, input2 by $0;
 describe joineddata;
 store joineddata into 'result';
 {code}
 The input data contains empty lines.  
 The join fails in the Map phase with the following error in the 
  POLocalRearrange.java
 java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
   at java.util.ArrayList.RangeCheck(ArrayList.java:547)
   at java.util.ArrayList.get(ArrayList.java:322)
   at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
   at org.apache.hadoop.mapred.Child.main(Child.java:159)
 I am surprised that the test cases did not detect this error. Could we add 
 this data which contains empty lines to the testcases?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1131) Pig simple join does not work when it contains empty lines

2010-02-09 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1131:
--

Status: Patch Available  (was: Open)

 Pig simple join does not work when it contains empty lines
 --

 Key: PIG-1131
 URL: https://issues.apache.org/jira/browse/PIG-1131
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Viraj Bhat
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: junk1.txt, junk2.txt, pig-1131.patch, pig-1131.patch, 
 simplejoinscript.pig


 I have a simple script, which does a JOIN.
 {code}
 input1 = load '/user/viraj/junk1.txt' using PigStorage(' ');
 describe input1;
 input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001');
 describe input2;
 joineddata = JOIN input1 by $0, input2 by $0;
 describe joineddata;
 store joineddata into 'result';
 {code}
 The input data contains empty lines.  
 The join fails in the Map phase with the following error in the 
  POLocalRearrange.java
 java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
   at java.util.ArrayList.RangeCheck(ArrayList.java:547)
   at java.util.ArrayList.get(ArrayList.java:322)
   at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
   at org.apache.hadoop.mapred.Child.main(Child.java:159)
 I am surprised that the test cases did not detect this error. Could we add 
 this data which contains empty lines to the testcases?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-02-09 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831886#action_12831886
 ] 

Ashutosh Chauhan commented on PIG-1178:
---

I was wondering about the different optimizations we do on a compiled MR plan. 
I'm not sure whether this has already been discussed or is in some doc, but 
those optimizations are also done through visitors and would benefit greatly 
from a framework like the one for the front end. Is there any plan to also 
subsume those visitors (possibly by rewriting them as rule-transform pairs) in 
this new optimizer, or will they be dealt with separately later on?
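The rule-transform pairing mentioned here can be sketched in miniature (hypothetical names; Pig's real optimizer framework is far richer): a "rule" recognizes a pattern in the plan and its paired "transform" rewrites the matched portion, here a toy filter-pushdown over a plan modeled as an ordered list of operator names:

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of a rule/transform pair. The rule matches a FILTER that
// appears after a JOIN; the transform pushes the FILTER before the JOIN --
// the classic filter-pushdown rewrite.
class FilterPushdown {
    // Rule: does the pattern occur in the plan?
    static boolean matches(List<String> plan) {
        int join = plan.indexOf("JOIN");
        int filter = plan.indexOf("FILTER");
        return join >= 0 && filter > join;
    }

    // Transform: rewrite the matched portion, leaving the plan untouched
    // when the rule does not apply.
    static List<String> transform(List<String> plan) {
        if (!matches(plan)) {
            return plan;
        }
        List<String> out = new ArrayList<>(plan);
        out.remove("FILTER");
        out.add(out.indexOf("JOIN"), "FILTER");  // move FILTER before JOIN
        return out;
    }
}
```

An optimizer framework then just iterates its registered rules over the plan until no rule matches, which is what makes new optimizations cheap to add.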

 LogicalPlan and Optimizer are too complex and hard to work with
 ---

 Key: PIG-1178
 URL: https://issues.apache.org/jira/browse/PIG-1178
 Project: Pig
  Issue Type: Improvement
Reporter: Alan Gates
Assignee: Ying He
 Attachments: expressions-2.patch, expressions.patch, lp.patch, 
 lp.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch


 The current implementation of the logical plan and the logical optimizer in 
 Pig has proven to not be easily extensible. Developer feedback has indicated 
 that adding new rules to the optimizer is quite burdensome. In addition, the 
 logical plan has been an area of numerous bugs, many of which have been 
 difficult to fix. Developers also feel that the logical plan is difficult to 
 understand and maintain. The root cause for these issues is that a number of 
 design decisions that were made as part of the 0.2 rewrite of the front end 
 have now proven to be sub-optimal. The heart of this proposal is to revisit a 
 number of those proposals and rebuild the logical plan with a simpler design 
 that will make it much easier to maintain the logical plan as well as extend 
 the logical optimizer. 
 See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full 
 details.




[jira] Updated: (PIG-1217) [piggybank] evaluation.util.Top is broken

2010-02-09 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1217:
---

Status: Open  (was: Patch Available)

 [piggybank] evaluation.util.Top is broken
 -

 Key: PIG-1217
 URL: https://issues.apache.org/jira/browse/PIG-1217
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0, 0.3.1, 0.4.0, site, 0.5.0, 0.6.0, 0.7.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
 Fix For: 0.7.0

 Attachments: fix_top_udf.diff, fix_top_udf.diff


 The Top UDF has been broken for a while due to an incorrect implementation 
 of getArgToFuncMapping.




[jira] Updated: (PIG-1217) [piggybank] evaluation.util.Top is broken

2010-02-09 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1217:
---

Attachment: fix_top_udf.diff

Simplified Initial per Alan's comments (just returning the tuple doesn't work, 
by the way). Also made it a bit safer around nulls.

 [piggybank] evaluation.util.Top is broken
 -

 Key: PIG-1217
 URL: https://issues.apache.org/jira/browse/PIG-1217
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0, 0.3.1, 0.4.0, site, 0.5.0, 0.6.0, 0.7.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
 Fix For: 0.7.0

 Attachments: fix_top_udf.diff, fix_top_udf.diff, fix_top_udf.diff


 The Top UDF has been broken for a while due to an incorrect implementation 
 of getArgToFuncMapping.




[jira] Updated: (PIG-1217) [piggybank] evaluation.util.Top is broken

2010-02-09 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1217:
---

Status: Patch Available  (was: Open)

 [piggybank] evaluation.util.Top is broken
 -

 Key: PIG-1217
 URL: https://issues.apache.org/jira/browse/PIG-1217
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0, 0.3.1, 0.4.0, site, 0.5.0, 0.6.0, 0.7.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
 Fix For: 0.7.0

 Attachments: fix_top_udf.diff, fix_top_udf.diff, fix_top_udf.diff


 The Top UDF has been broken for a while due to an incorrect implementation 
 of getArgToFuncMapping.




[jira] Created: (PIG-1233) NullPointerException in AVG

2010-02-09 Thread Ankur (JIRA)
NullPointerException in AVG 


 Key: PIG-1233
 URL: https://issues.apache.org/jira/browse/PIG-1233
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Ankur
 Fix For: 0.6.0


The overridden method getValue() in AVG throws a NullPointerException when 
accumulate() is never called, leaving the variable 'intermediateCount' null. 
Java then throws the exception when it tries to unbox the value for a numeric 
comparison.
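The unboxing failure described above can be reproduced in a minimal, self-contained sketch (hypothetical class and field names; the real class is org.apache.pig.builtin.AVG):

```java
// Minimal illustration of the unboxing NPE: a boxed Long that is still null
// blows up the moment it is compared numerically.
class AvgAccumulator {
    private Long intermediateCount;  // stays null if accumulate() is never called
    private Double intermediateSum;

    void accumulate(double value) {
        intermediateCount = (intermediateCount == null) ? 1L : intermediateCount + 1;
        intermediateSum = (intermediateSum == null) ? value : intermediateSum + value;
    }

    // Buggy shape: unboxing a null Long throws NullPointerException.
    Double getValueBuggy() {
        if (intermediateCount > 0) {  // NPE here when intermediateCount == null
            return intermediateSum / intermediateCount;
        }
        return null;
    }

    // Patched shape: guard the null before unboxing, as the patch's null
    // checks do.
    Double getValue() {
        if (intermediateCount != null && intermediateCount > 0) {
            return intermediateSum / intermediateCount;
        }
        return null;
    }
}
```

The fix is just the extra `!= null` test before the numeric comparison forces an unboxing conversion.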




[jira] Updated: (PIG-1233) NullPointerException in AVG

2010-02-09 Thread Ankur (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated PIG-1233:
---

Attachment: jira-1233.patch

Attached is a simple patch that adds the required null checks. The code change 
is trivial, so I don't think any new test cases are needed.

 NullPointerException in AVG 
 

 Key: PIG-1233
 URL: https://issues.apache.org/jira/browse/PIG-1233
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Ankur
 Fix For: 0.6.0

 Attachments: jira-1233.patch


 The overridden method getValue() in AVG throws a NullPointerException when 
 accumulate() is never called, leaving the variable 'intermediateCount' null. 
 Java then throws the exception when it tries to unbox the value for a 
 numeric comparison.




[jira] Updated: (PIG-1233) NullPointerException in AVG

2010-02-09 Thread Ankur (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated PIG-1233:
---

Status: Patch Available  (was: Open)

 NullPointerException in AVG 
 

 Key: PIG-1233
 URL: https://issues.apache.org/jira/browse/PIG-1233
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Ankur
Assignee: Ankur
 Fix For: 0.6.0

 Attachments: jira-1233.patch


 The overridden method getValue() in AVG throws a NullPointerException when 
 accumulate() is never called, leaving the variable 'intermediateCount' null. 
 Java then throws the exception when it tries to unbox the value for a 
 numeric comparison.




[jira] Commented: (PIG-1131) Pig simple join does not work when it contains empty lines

2010-02-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831898#action_12831898
 ] 

Hadoop QA commented on PIG-1131:


+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12435402/pig-1131.patch
  against trunk revision 908324.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/197/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/197/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/197/console

This message is automatically generated.

 Pig simple join does not work when it contains empty lines
 --

 Key: PIG-1131
 URL: https://issues.apache.org/jira/browse/PIG-1131
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Viraj Bhat
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: junk1.txt, junk2.txt, pig-1131.patch, pig-1131.patch, 
 simplejoinscript.pig


 I have a simple script, which does a JOIN.
 {code}
 input1 = load '/user/viraj/junk1.txt' using PigStorage(' ');
 describe input1;
 input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001');
 describe input2;
 joineddata = JOIN input1 by $0, input2 by $0;
 describe joineddata;
 store joineddata into 'result';
 {code}
 The input data contains empty lines.
 The join fails in the Map phase with the following error in
 POLocalRearrange.java:
 java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
   at java.util.ArrayList.RangeCheck(ArrayList.java:547)
   at java.util.ArrayList.get(ArrayList.java:322)
   at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464)
   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360)
   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
   at org.apache.hadoop.mapred.Child.main(Child.java:159)
 I am surprised that the test cases did not detect this error. Could we add
 this data, which contains empty lines, to the test cases?
 Viraj
