[jira] Updated: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-11 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1038:


Attachment: PIG-1038-3.patch

Attach a patch to address Pradeep's comments.

 Optimize nested distinct/sort to use secondary key
 --

 Key: PIG-1038
 URL: https://issues.apache.org/jira/browse/PIG-1038
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.4.0
Reporter: Olga Natkovich
Assignee: Daniel Dai
 Fix For: 0.6.0

 Attachments: PIG-1038-1.patch, PIG-1038-2.patch, PIG-1038-3.patch


 If a nested foreach plan contains a sort/distinct, it is possible to use Hadoop 
 secondary sort instead of SortedDataBag and DistinctDataBag to optimize the 
 query. 
 Eg1:
 A = load 'mydata';
 B = group A by $0;
 C = foreach B {
 D = order A by $1;
 generate group, D;
 }
 store C into 'myresult';
 We can specify a secondary sort on A.$1 and drop the order A by $1 statement.
 Eg2:
 A = load 'mydata';
 B = group A by $0;
 C = foreach B {
 D = A.$1;
 E = distinct D;
 generate group, E;
 }
 store C into 'myresult';
 We can specify a secondary sort key on A.$1, and simplify D=A.$1; E=distinct 
 D to a special version of distinct, which does not do the sorting.
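
As a rough illustration of the mechanism (a hypothetical sketch in plain Java, not Pig's or Hadoop's actual classes): secondary sort keys each record by a composite (group key, sort key), sorts the shuffle on both fields, and groups reduce input on the group key alone, so each group's values arrive already ordered and no in-memory SortedDataBag is needed.

```java
import java.util.*;

// Hypothetical sketch of Hadoop-style secondary sort (illustrative names).
public class SecondarySortSketch {
    static class CompositeKey {
        final String group;   // primary (grouping) key
        final int sortKey;    // secondary key, e.g. A.$1
        CompositeKey(String g, int s) { group = g; sortKey = s; }
    }

    // Sort comparator: orders the shuffle output by (group, sortKey).
    static final Comparator<CompositeKey> SORT =
        Comparator.<CompositeKey, String>comparing(k -> k.group)
                  .thenComparingInt(k -> k.sortKey);

    public static void main(String[] args) {
        List<CompositeKey> shuffle = new ArrayList<>(Arrays.asList(
            new CompositeKey("a", 3), new CompositeKey("b", 1),
            new CompositeKey("a", 1), new CompositeKey("a", 2)));
        shuffle.sort(SORT);
        // A grouping comparator that compares only 'group' would now hand the
        // reducer group "a" with its values already in order 1, 2, 3.
        for (CompositeKey k : shuffle)
            System.out.println(k.group + "\t" + k.sortKey);
    }
}
```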

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-11 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1038:


Status: Patch Available  (was: Open)




[jira] Updated: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-11 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1038:


Status: Open  (was: Patch Available)




package org.apache.hadoop.zebra.parse missing

2009-11-11 Thread Min Zhou
Hi guys,

I checked out pig from trunk, and found the package
org.apache.hadoop.zebra.parse missing. Are you sure this package has
been committed?
see this link
http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/zebra/src/java/org/apache/hadoop/zebra/


Min

-- 
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com


Re: package org.apache.hadoop.zebra.parse missing

2009-11-11 Thread Alan Gates
The parser package is generated as part of the build.  Invoking ant in the 
contrib/zebra directory should result in the parser package being created at 
./src-gen/org/apache/hadoop/zebra/parser


Alan.

On Nov 11, 2009, at 12:54 AM, Min Zhou wrote:


Hi guys,

I checked out pig from trunk, and found the package
org.apache.hadoop.zebra.parse missing. Are you sure this package has
been committed?
see this link
http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/zebra/src/java/org/apache/hadoop/zebra/


Min

--
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com




[jira] Commented: (PIG-966) Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces

2009-11-11 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776509#action_12776509
 ] 

Alan Gates commented on PIG-966:


Size on disk.  It's not quite useless, as it can be used to estimate the number 
of splits, etc.  It should also be possible to estimate size in memory from size 
on disk by applying an average explosion factor (about 4x at the moment, I 
believe).

 Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
 ---

 Key: PIG-966
 URL: https://issues.apache.org/jira/browse/PIG-966
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Alan Gates
Assignee: Alan Gates

 I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces 
 significantly.  See http://wiki.apache.org/pig/LoadStoreRedesignProposal for 
 full details




[jira] Commented: (PIG-1064) Behaviour of COGROUP with and without schema when using * operator

2009-11-11 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776520#action_12776520
 ] 

Alan Gates commented on PIG-1064:
-

Why is cogrouping on * without a schema causing trouble?  Because we can't 
guarantee that inputs have the same number of fields?

Why would anyone ever want to cogroup on *?  Do we need to spend any effort 
fixing this?

 Behaviour of COGROUP with and without schema when using * operator
 

 Key: PIG-1064
 URL: https://issues.apache.org/jira/browse/PIG-1064
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.6.0


 I have 2 tab separated files, 1.txt and 2.txt
 $ cat 1.txt 
 
 1   2
 2   3
 
 $ cat 2.txt 
 1   2
 2   3
 I use COGROUP feature of Pig in the following way:
 $ java -cp pig.jar:$HADOOP_HOME org.apache.pig.Main
 {code}
 grunt> A = load '1.txt';
 grunt> B = load '2.txt' as (b0, b1);
 grunt> C = cogroup A by *, B by *;  
 {code}
 2009-10-29 12:46:04,150 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1012: Each COGroup input has to have the same number of inner plans
 Details at logfile: pig_1256845224752.log
 ==
 If I reverse, the order of the schema's
 {code}
 grunt> A = load '1.txt' as (a0, a1);
 grunt> B = load '2.txt';
 grunt> C = cogroup A by *, B by *;  
 {code}
 2009-10-29 12:49:27,869 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1013: Grouping attributes can either be star (*) or a list of expressions, 
 but not both.
 Details at logfile: pig_1256845224752.log
 ==
 Now running without schema??
 {code}
 grunt> A = load '1.txt';
 grunt> B = load '2.txt';
 grunt> C = cogroup A by *, B by *;
 grunt> dump C; 
 {code}
 2009-10-29 12:55:37,202 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully 
 stored result in: file:/tmp/temp-319926700/tmp-1990275961
 2009-10-29 12:55:37,202 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records 
 written : 2
 2009-10-29 12:55:37,202 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written 
 : 154
 2009-10-29 12:55:37,202 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
 2009-10-29 12:55:37,202 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
 ((1,2),{(1,2)},{(1,2)})
 ((2,3),{(2,3)},{(2,3)})
 ==
 Is this a bug or a feature?
 Viraj




[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

2009-11-11 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776526#action_12776526
 ] 

Thejas M Nair commented on PIG-1062:


Proposal for sampling in RandomSampleLoader (as well as the SampleLoader class), 
used for order-by queries:
Problem: With the new interface, we cannot use the old approach of dividing the 
size of the file by the number of samples required and skipping that many bytes 
to get a new sample.
Proposal: The approach proposed by Dmitriy for sampling is used -
bq. In getNext(), we can now allocate a buffer for T elements, populate it with 
the first T tuples, and continue scanning the partition. For every ith next() 
call, we generate a random number r s.t. 0 <= r < i, and if r < T we insert the 
new tuple into our buffer at position r. This gives us a nicely random sample of 
the tuples in the partition.
To avoid parsing all tuples, RecordReader.nextKeyValue() will be called (instead 
of loader.getNext()) if the current tuple is to be skipped.
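
The buffer scheme described above is classic reservoir sampling; a minimal sketch in Java (illustrative only, not the patch's actual code):

```java
import java.util.*;

// Reservoir sampling (Algorithm R), as described above: keep the first T
// tuples; then for the i-th tuple (0-based, i >= T) draw r uniformly in
// [0, i], and if r < T replace reservoir[r] with the new tuple.
public class ReservoirSample {
    static List<Integer> sample(Iterator<Integer> tuples, int T, Random rnd) {
        List<Integer> reservoir = new ArrayList<>(T);
        long i = 0;
        while (tuples.hasNext()) {
            int t = tuples.next();
            if (i < T) {
                reservoir.add(t);                 // fill the buffer first
            } else {
                long r = (long) (rnd.nextDouble() * (i + 1));
                if (r < T) reservoir.set((int) r, t);
            }
            i++;
        }
        return reservoir;                         // uniform sample of size <= T
    }

    public static void main(String[] args) {
        List<Integer> s = sample(
            java.util.stream.IntStream.range(0, 1000).iterator(),
            10, new Random(42));
        System.out.println(s.size());             // 10
    }
}
```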

bq. It looks like ReduceContext has a getCounter() method. Am I missing a 
subtlety?
Arun C Murthy (mapreduce committer) has agreed to elaborate on his 
recommendation on this in the jira.


 load-store-redesign branch: change SampleLoader and subclasses to work with 
 new LoadFunc interface 
 ---

 Key: PIG-1062
 URL: https://issues.apache.org/jira/browse/PIG-1062
 Project: Pig
  Issue Type: Sub-task
Reporter: Thejas M Nair
Assignee: Thejas M Nair

 This is part of the effort to implement new load store interfaces as laid out 
 in http://wiki.apache.org/pig/LoadStoreRedesignProposal .
 PigStorage and BinStorage are now working.
 SampleLoader and subclasses -RandomSampleLoader, PoissonSampleLoader need to 
 be changed to work with new LoadFunc interface.  
 Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
 PoissonSampleLoader is used by skew join. 




[jira] Commented: (PIG-1064) Behaviour of COGROUP with and without schema when using * operator

2009-11-11 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776525#action_12776525
 ] 

Pradeep Kamath commented on PIG-1064:
-

Cogroup needs the same arity for the grouping key from both inputs. If there is 
a cogroup by *, the '*' needs to be expanded so we know the arity. This is done 
in ProjectStarTranslator - the current code leaves the '*' as is when there is 
no schema. This causes problems in the backend - hence the proposed fix to 
catch this and error out.

If we feel that users should not cogroup on '*' we should prevent it in the 
parser. The proposed fix is easy enough that I don't think we need to restrict 
the use of '*'.
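
To make the arity point concrete, here is a hypothetical sketch (plain Java, not ProjectStarTranslator's actual code) of why '*' expansion needs a schema: without a known field count the star cannot be turned into positional projections, so the equal-arity check between cogroup inputs cannot run.

```java
import java.util.*;

// Illustrative only: expanding a '*' grouping key into positional columns.
public class StarExpansionSketch {
    // Expand '*' to ($0..$n-1) when the schema's field count is known;
    // with no schema the expansion (and the arity check) is impossible.
    static List<String> expandStar(Integer schemaFieldCount) {
        if (schemaFieldCount == null)
            throw new IllegalStateException("cannot expand '*' without a schema");
        List<String> cols = new ArrayList<>();
        for (int i = 0; i < schemaFieldCount; i++) cols.add("$" + i);
        return cols;
    }

    public static void main(String[] args) {
        // B has schema (b0, b1), so '*' expands to two columns.
        System.out.println(expandStar(2));   // [$0, $1]
    }
}
```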




[jira] Commented: (PIG-1064) Behaviour of COGROUP with and without schema when using * operator

2009-11-11 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776527#action_12776527
 ] 

Pradeep Kamath commented on PIG-1064:
-

The last paragraph in my previous comment should read:
If we feel that users should not cogroup on star we should prevent it in the 
parser. The proposed fix is easy enough that I don't think we need to restrict 
the use of star.




[jira] Updated: (PIG-979) Accumulator Interface for UDFs

2009-11-11 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-979:


Attachment: PIG-979.patch

patch to address Alan's comments. 

 Accumulator Interface for UDFs
 --

 Key: PIG-979
 URL: https://issues.apache.org/jira/browse/PIG-979
 Project: Pig
  Issue Type: New Feature
Reporter: Alan Gates
Assignee: Ying He
 Attachments: PIG-979.patch, PIG-979.patch


 Add an accumulator interface for UDFs that would allow them to take a set 
 number of records at a time instead of the entire bag.
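
A hypothetical sketch of what such an accumulator contract could look like (illustrative names only, not the interface defined in the attached patch): the runtime feeds the UDF successive slices of the bag instead of materializing it whole.

```java
import java.util.*;

// Illustrative accumulator-style UDF contract (not the patch's interface).
public class AccumulatorSketch {
    interface Accumulator<T> {
        void accumulate(Iterable<Long> batch);  // one slice of the bag
        T getValue();                           // final result after all slices
    }

    // Example UDF: a running sum that never sees the whole bag at once.
    static class SumAccumulator implements Accumulator<Long> {
        private long sum = 0;
        public void accumulate(Iterable<Long> batch) {
            for (long v : batch) sum += v;
        }
        public Long getValue() { return sum; }
    }

    public static void main(String[] args) {
        SumAccumulator acc = new SumAccumulator();
        acc.accumulate(Arrays.asList(1L, 2L));   // first slice
        acc.accumulate(Arrays.asList(3L, 4L));   // next slice
        System.out.println(acc.getValue());      // 10
    }
}
```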




[jira] Commented: (PIG-1001) Generate more meaningful error message when one input file does not exist

2009-11-11 Thread Nigel Daley (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776545#action_12776545
 ] 

Nigel Daley commented on PIG-1001:
--

Please ensure the issue is Assigned to the patch author when committing.

Also, please provide a justification for why there is no unit test, then 
describe how you DID test it before you uploaded the patch.

 Generate more meaningful error message when one input file does not exist
 -

 Key: PIG-1001
 URL: https://issues.apache.org/jira/browse/PIG-1001
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Daniel Dai
 Fix For: 0.6.0

 Attachments: PIG-1001-1.patch, PIG-1001-2.patch


 In the following query, if 1.txt does not exist: 
 a = load '1.txt';
 b = group a by $0;
 c = group b all;
 dump c;
 Pig throws the error message "ERROR 2100: file:/tmp/temp155054664/tmp1144108421 
 does not exist." Pig should instead report "Input file 1.txt does not exist" 
 rather than this confusing message.




[jira] Commented: (PIG-966) Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces

2009-11-11 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776547#action_12776547
 ] 

Dmitriy V. Ryaboy commented on PIG-966:
---

I agree, "useless" was a strong word (in fact, I've been assuming it means size 
on disk and using it to estimate the number of splits in my cbo code already...).  
The explosion factor is very iffy when we are dealing with compressed data.  
But, ok, let's not overthink it -- we can save the memory question for the next 
iteration.  I'll edit the wiki to note that mBytes is size on disk.




[jira] Assigned: (PIG-1001) Generate more meaningful error message when one input file does not exist

2009-11-11 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai reassigned PIG-1001:
---

Assignee: Daniel Dai




[jira] Commented: (PIG-1001) Generate more meaningful error message when one input file does not exist

2009-11-11 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776558#action_12776558
 ] 

Daniel Dai commented on PIG-1001:
-

Hi Nigel,
Since this patch is all about error messages and user experience, it is hard 
to write a unit test case for it. I tested it manually in the following 
situations:
1. All jobs are successful
2. One completely failed job, after which we stop launching dependent jobs
3. Two independent jobs, one failing, with and without the -F option

All of these give the desired error messages.




[jira] Updated: (PIG-1080) PigStorage may miss records when loading a file

2009-11-11 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1080:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

patch committed, thanks Richard!

 PigStorage may miss records when loading a file
 ---

 Key: PIG-1080
 URL: https://issues.apache.org/jira/browse/PIG-1080
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Richard Ding
Assignee: Richard Ding
 Attachments: PIG-1080.patch, PIG-1080.patch, PIG-1080.patch


 When a file is assigned to multiple mappers (one block per mapper), the 
 blocks may not end at the exact record boundary. Special care is taken to 
 ensure that all records are loaded by mappers (and exactly once), even for 
 records that cross the block boundary. 
 PigStorage, however, doesn't correctly handle the case where a block ends 
 exactly at a record boundary, which results in missing records.
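
A simplified model of the usual Hadoop text-split convention may clarify the boundary handling (this is an illustrative sketch, not PigStorage's actual code): a reader whose split does not start at offset 0 backs up one byte and discards through the next newline, so a split that begins exactly on a record boundary skips nothing, and it keeps reading whole records while each record starts inside the split. Every record is then read exactly once, even when a block ends exactly on a record boundary.

```java
import java.util.*;

// Illustrative split reader modeling the convention described above.
public class SplitBoundarySketch {
    // Return the records a mapper for byte range [start, end) would emit.
    static List<String> readSplit(String data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Back up one byte and consume through the first newline; the
            // previous split owns any record that crosses into this split.
            int nl = data.indexOf('\n', start - 1);
            if (nl == -1) return Collections.emptyList();
            pos = nl + 1;
        }
        List<String> records = new ArrayList<>();
        // Read whole records while the record's start offset is in the split.
        while (pos < end && pos < data.length()) {
            int nl = data.indexOf('\n', pos);
            records.add(nl == -1 ? data.substring(pos) : data.substring(pos, nl));
            pos = nl == -1 ? data.length() : nl + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "aa\nbb\ncc\ndd\n";
        // The boundary at offset 6 falls exactly on a record boundary.
        System.out.println(readSplit(data, 0, 6));   // [aa, bb]
        System.out.println(readSplit(data, 6, 12));  // [cc, dd]
    }
}
```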
  




[jira] Created: (PIG-1086) Nexted sort by * throw exception

2009-11-11 Thread Daniel Dai (JIRA)
Nexted sort by * throw exception


 Key: PIG-1086
 URL: https://issues.apache.org/jira/browse/PIG-1086
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.5.0
Reporter: Daniel Dai


The following script fail:
a = load '1.txt' as (a0, a1, a2);
b = group a by *;
c = foreach b { d = distinct a; generate d;};
explain c;

Here is the stack:
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at java.util.ArrayList.get(ArrayList.java:324)
at 
org.apache.pig.impl.logicalLayer.schema.Schema.getField(Schema.java:752)
at org.apache.pig.impl.logicalLayer.LOSort.getSortInfo(LOSort.java:332)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:1365)
at org.apache.pig.impl.logicalLayer.LOSort.visit(LOSort.java:176)
at org.apache.pig.impl.logicalLayer.LOSort.visit(LOSort.java:43)
at 
org.apache.pig.impl.plan.DependencyOrderWalkerWOSeenChk.walk(DependencyOrderWalkerWOSeenChk.java:69)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:1274)
at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:130)
at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:45)
at 
org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:234)
at org.apache.pig.PigServer.compilePp(PigServer.java:864)
at org.apache.pig.PigServer.explain(PigServer.java:583)
... 8 more




[jira] Updated: (PIG-1086) Nested sort by * throw exception

2009-11-11 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1086:


Summary: Nested sort by * throw exception  (was: Nexted sort by * throw 
exception)




[jira] Updated: (PIG-1086) Nested sort by * throw exception

2009-11-11 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1086:


Description: 
The following script fail:
A = load '1.txt' as (a0, a1, a2);
B = group A by a0;
C = foreach B { D = order A by *; generate group, D;};
explain C;

Here is the stack:
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at java.util.ArrayList.get(ArrayList.java:324)
at 
org.apache.pig.impl.logicalLayer.schema.Schema.getField(Schema.java:752)
at org.apache.pig.impl.logicalLayer.LOSort.getSortInfo(LOSort.java:332)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:1365)
at org.apache.pig.impl.logicalLayer.LOSort.visit(LOSort.java:176)
at org.apache.pig.impl.logicalLayer.LOSort.visit(LOSort.java:43)
at 
org.apache.pig.impl.plan.DependencyOrderWalkerWOSeenChk.walk(DependencyOrderWalkerWOSeenChk.java:69)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:1274)
at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:130)
at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:45)
at 
org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:234)
at org.apache.pig.PigServer.compilePp(PigServer.java:864)
at org.apache.pig.PigServer.explain(PigServer.java:583)
... 8 more

  was:
The following script fail:
a = load '1.txt' as (a0, a1, a2);
b = group a by *;
c = foreach b { d = distinct a; generate d;};
explain c;

Here is the stack:
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at java.util.ArrayList.get(ArrayList.java:324)
at 
org.apache.pig.impl.logicalLayer.schema.Schema.getField(Schema.java:752)
at org.apache.pig.impl.logicalLayer.LOSort.getSortInfo(LOSort.java:332)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:1365)
at org.apache.pig.impl.logicalLayer.LOSort.visit(LOSort.java:176)
at org.apache.pig.impl.logicalLayer.LOSort.visit(LOSort.java:43)
at 
org.apache.pig.impl.plan.DependencyOrderWalkerWOSeenChk.walk(DependencyOrderWalkerWOSeenChk.java:69)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:1274)
at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:130)
at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:45)
at 
org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:234)
at org.apache.pig.PigServer.compilePp(PigServer.java:864)
at org.apache.pig.PigServer.explain(PigServer.java:583)
... 8 more



[jira] Updated: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat

2009-11-11 Thread Chao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Wang updated PIG-1077:
---

Release Note: 
In this jira, we plan to also resolve the dependency issue that Zebra 
record-based split needs Hadoop TFile split support to work. For this 
dependency, Zebra has to maintain its own copy of Hadoop jar in svn for it to 
be able to build. Furthermore, the fact that Zebra currently sits inside Pig in 
svn and Pig itself maintains its own copy of Hadoop jar in lib directory makes 
things even messier. Finally, we notice that Zebra is new and making many 
changes and needs to get new revisions quickly, while Hadoop and Pig are more 
mature and moving slowly and thus can't make new releases for Zebra all the 
time. 

After carefully thinking through all this, we plan to fork the TFile part off 
Hadoop and port it into Zebra's own code base. This will greatly simplify the 
build process of Zebra and also enable it to make quick revisions.

Last, we would like to point out that this is a short term solution for Zebra 
and we plan to: 
1) port all changes to Zebra TFile back into Hadoop TFile. 
2) in the long run have a single unified solution for this.


 [Zebra] to support record(row)-based file split in Zebra's TableInputFormat
 ---

 Key: PIG-1077
 URL: https://issues.apache.org/jira/browse/PIG-1077
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.4.0
Reporter: Chao Wang
Assignee: Chao Wang
 Fix For: 0.6.0


 TFile currently supports split by record sequence number (see Jira 
 HADOOP-6218). We want to utilize this to provide record(row)-based input 
 split support in Zebra.
 One prominent benefit is that in cases where we have very large data files, 
 we can create much more fine-grained input splits than before, when we could 
 only create one big split per file.
 In more detail, the new row-based getSplits() works by default (user does not 
 specify no. of splits to be generated) as follows: 
 1) Select the biggest column group in terms of data size, split all of its 
 TFiles according to hdfs block size (64 MB or 128 MB) and get a list of 
 physical byte offsets as the output per TFile. For example, let us assume for 
 the 1st TFile we get offset1, offset2, ..., offset10; 
 2) Invoke TFile.getRecordNumNear(long offset) to get the RecordNum of a 
 key-value pair near a byte offset. For the example above, say we get 
 recordNum1, recordNum2, ..., recordNum10; 
 3) Stitch [0, recordNum1], [recordNum1+1, recordNum2], ..., [recordNum9+1, 
 recordNum10], [recordNum10+1, lastRecordNum] splits of all column groups, 
 respectively to form 11 record-based input splits for the 1st TFile. 
 4) For each input split, we need to create a TFile scanner through: 
 TFile.createScannerByRecordNum(long beginRecNum, long endRecNum). 
 Note: conversion from byte offset to record number will be done by each 
 mapper, rather than being done at the job initialization phase. This is due 
 to performance concern since the conversion incurs some TFile reading 
 overhead.
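The four steps above can be sketched in plain Python. This is an illustrative model only, with toy numbers; record_num_near stands in for TFile.getRecordNumNear(long offset) and is not a real Zebra API.

```python
# Hedged sketch of the row-based getSplits() logic described in steps 1-3:
# byte offsets at block boundaries -> nearest record numbers -> stitched
# record-number ranges, one per split.

def record_splits(file_len, block_size, record_num_near, last_record_num):
    """Turn byte offsets at block boundaries into record-number ranges."""
    # Step 1: physical byte offsets at block boundaries of the biggest
    # column group's TFile.
    offsets = list(range(block_size, file_len, block_size))
    # Step 2: RecordNum of a key-value pair near each byte offset.
    rec_nums = [record_num_near(off) for off in offsets]
    # Step 3: stitch [0, r1], [r1+1, r2], ..., [rN+1, lastRecordNum].
    starts = [0] + [r + 1 for r in rec_nums]
    ends = rec_nums + [last_record_num]
    return list(zip(starts, ends))

# Toy example: a 640-byte "file", 64-byte blocks, ~1 record per 10 bytes.
splits = record_splits(640, 64, lambda off: off // 10, 63)
```

With 9 block-boundary offsets this yields 10 contiguous record ranges, mirroring the "10 offsets give 11 splits" example in the description.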

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-11 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776729#action_12776729
 ] 

Pradeep Kamath commented on PIG-1038:
-

In JobControlCompiler:
==
{code}
583 jobConf.set("pig.secondarySortOrder",
584         ObjectSerializer.serialize(mro.getSecondarySortOrder()));
585 }
{code}
Looks like the above code should be set only in the case of a non-order-by MRO 
which uses a secondary key.

{code}
638 valuea = ((Tuple)wa.getValueAsPigType()).get(0);
{code}
We should put a comment explaining that we extract the first field out since 
that represents
the actual group by key.

In SecondaryKeyOptimizer:

{code}
} else if (mapLeaf instanceof POUnion || mapLeaf instanceof 
POSplit) {
{code}

The above should not contain POSplit since POSplit would only occur after multi 
query optimization
which happens later.

{code}
94 } else if (plan.getRoots().size() != 1) {
 95 // POLocalRearrange plan contains more than 1 root.
 96 // Probably there is an Expression operator in the local
 97 // rearrangement plan,
 98 // skip secondary key optimizing
 99 return null;
{code}
Should we continue to the next plan instead of returning null here, since this 
is similar to the udf or constant in the local rearrange case?

{code}
105 columnChainInfo.insert(false, columns, DataType.TUPLE);
{code}
It would be useful to put a comment explaining this is put into the 
ColumnChainInfo only for checking that different components of SortKeyInfo 
all come from the same input index. Also, should the datatype be BAG?

{code}
118 log.debug(node + " have more than 1 predecessor");
{code}
predecessor should change to successor.

{code}
217 if (currentNode instanceof POPackage
218 || currentNode instanceof POFilter
219 || currentNode instanceof POLimit) {
{code}
In line 217 we should ensure, we don't optimize when we encounter POJoinPackage 
using something like
{code}
if ((currentNode instanceof POPackage) && !(currentNode instanceof 
POJoinPackage))

{code}
307 int errorCode = 1000;
327 int errorCode = 1000;
526 int errorCode = 1000;
{code}
This error code is already in use.

{code}
336 } else if (mapLeaf instanceof POUnion || mapLeaf instanceof POSplit) {
337     List<PhysicalOperator> preds = mr.mapPlan
338             .getPredecessors(mapLeaf);
339     for (PhysicalOperator pred : preds) {
340         POLocalRearrange rearrange = (POLocalRearrange) pred;
341         rearrange.setUseSecondaryKey(true);
342         if (rearrange.getIndex() == indexOfRearrangeToChange) {
343             // Try to find the POLocalRearrange for the secondary key
351             setSecondaryPlan(mr.mapPlan, rearrange,
352                     secondarySortKeyInfo);
353         }
354     }
{code}
The above should not contain POSplit since POSplit would only occur after multi 
query optimization
which happens later.

Also in the if statement on line 342, what if the condition evaluates to false 
- should we throw an Exception like earlier in the same
method?

{code}
530 if (r)
531 sawInvalidPhysicalOper = true;
..
557 if (r) // if we saw physical operator other than project in 
sort
558// plan
559 return;
{code}
At line 559 should we be setting sawInvalidPhysicalOper?

General comments:
=
A comment on ColumnChainInfo and SortKeyInfo explaining how they track the 
POProjects in the plan would be useful.

POMultiQueryPackage should not change since SecondaryKeyOptimizer runs before
MultiQueryOptimizer.




 Optimize nested distinct/sort to use secondary key
 --

 Key: PIG-1038
 URL: https://issues.apache.org/jira/browse/PIG-1038
 Project: Pig
  Issue Type: Improvement
  

FYI - forking TFile off Hadoop into Zebra

2009-11-11 Thread Chao Wang
Hi all,

In Jira PIG-1077, we, the Zebra team, plan to utilize Hadoop TFile's
split-by-record-sequence-number support to provide record(row)-based
input split support in Zebra.

Here we would like to point out that: along the way we plan to also
resolve the dependency issue that Zebra record-based split needs Hadoop
TFile split support to work. For this dependency, Zebra has to maintain
its own copy of Hadoop jar in svn for it to be able to build.
Furthermore, the fact that Zebra currently sits inside Pig in svn and
Pig itself maintains its own copy of Hadoop jar in lib directory makes
things even messier. Finally, we notice that Zebra is new and making
many changes and needs to get new revisions quickly, while Hadoop and
Pig are more mature and moving slowly and thus can't make new releases
for Zebra all the time. 

After carefully thinking through all this, we plan to fork the TFile
part off Hadoop and port it into Zebra's own code base. This will
greatly simplify the build process of Zebra and also enable it to make
quick revisions. 

Last, we would like to point out that this is a short term solution for
Zebra and we plan to: 
1) port all changes to Zebra TFile back into Hadoop TFile. 
2) in the long run have a single unified solution for this. 

For more information, please see
https://issues.apache.org/jira/browse/PIG-1077

We welcome your feedback on this.

Regards,

Chao



[jira] Created: (PIG-1087) Use Pig's version for Zebra's own version.

2009-11-11 Thread Chao Wang (JIRA)
Use Pig's version for Zebra's own version.
--

 Key: PIG-1087
 URL: https://issues.apache.org/jira/browse/PIG-1087
 Project: Pig
  Issue Type: Task
Affects Versions: 0.4.0
Reporter: Chao Wang
Assignee: Chao Wang
 Fix For: 0.6.0


Zebra is a contrib project of Pig now. It should use Pig's version for its own 
version. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1064) Behaviour of COGROUP with and without schema when using * operator

2009-11-11 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-1064:


Assignee: Pradeep Kamath
  Status: Patch Available  (was: Open)

The patch implements the proposal to catch the situation wherein the user 
specifies '*' as the cogrouping key and does not have a schema for the 
corresponding input to the cogroup. In these situations we would issue the 
error message "Cogroup by * is only allowed if the input has a schema" and 
error out.

 Behaviour of COGROUP with and without schema when using * operator
 

 Key: PIG-1064
 URL: https://issues.apache.org/jira/browse/PIG-1064
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Pradeep Kamath
 Fix For: 0.6.0

 Attachments: PIG-1064.patch


 I have 2 tab separated files, 1.txt and 2.txt
 $ cat 1.txt 
 
 1   2
 2   3
 
 $ cat 2.txt 
 1   2
 2   3
 I use COGROUP feature of Pig in the following way:
 $java -cp pig.jar:$HADOOP_HOME org.apache.pig.Main
 {code}
 grunt> A = load '1.txt';
 grunt> B = load '2.txt' as (b0, b1);
 grunt> C = cogroup A by *, B by *;  
 {code}
 2009-10-29 12:46:04,150 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1012: Each COGroup input has to have the same number of inner plans
 Details at logfile: pig_1256845224752.log
 ==
 If I reverse the order of the schemas:
 {code}
 grunt> A = load '1.txt' as (a0, a1);
 grunt> B = load '2.txt';
 grunt> C = cogroup A by *, B by *;  
 {code}
 2009-10-29 12:49:27,869 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1013: Grouping attributes can either be star (*) or a list of expressions, 
 but not both.
 Details at logfile: pig_1256845224752.log
 ==
 Now running without schemas:
 {code}
 grunt> A = load '1.txt';
 grunt> B = load '2.txt';
 grunt> C = cogroup A by *, B by *;
 grunt> dump C; 
 {code}
 2009-10-29 12:55:37,202 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully 
 stored result in: file:/tmp/temp-319926700/tmp-1990275961
 2009-10-29 12:55:37,202 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records 
 written : 2
 2009-10-29 12:55:37,202 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written 
 : 154
 2009-10-29 12:55:37,202 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
 2009-10-29 12:55:37,202 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
 ((1,2),{(1,2)},{(1,2)})
 ((2,3),{(2,3)},{(2,3)})
 ==
 Is this a bug or a feature?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-11 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1038:


Status: Open  (was: Patch Available)

 Optimize nested distinct/sort to use secondary key
 --

 Key: PIG-1038
 URL: https://issues.apache.org/jira/browse/PIG-1038
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.4.0
Reporter: Olga Natkovich
Assignee: Daniel Dai
 Fix For: 0.6.0

 Attachments: PIG-1038-1.patch, PIG-1038-2.patch, PIG-1038-3.patch, 
 PIG-1038-4.patch


 If nested foreach plan contains sort/distinct, it is possible to use hadoop 
 secondary sort instead of SortedDataBag and DistinctDataBag to optimize the 
 query. 
 Eg1:
 A = load 'mydata';
 B = group A by $0;
 C = foreach B {
 D = order A by $1;
 generate group, D;
 }
 store C into 'myresult';
 We can specify a secondary sort on A.$1, and drop order A by $1.
 Eg2:
 A = load 'mydata';
 B = group A by $0;
 C = foreach B {
 D = A.$1;
 E = distinct D;
 generate group, E;
 }
 store C into 'myresult';
 We can specify a secondary sort key on A.$1, and simplify D=A.$1; E=distinct 
 D to a special version of distinct, which does not do the sorting.
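As a rough illustration of why the nested order can be dropped (plain Python standing in for Hadoop's shuffle, not Pig internals): if the map output is sorted by the composite (group key, secondary key), every reducer-side bag already arrives in secondary-key order, so no SortedDataBag is needed.

```python
# Sketch: simulate a shuffle that sorts on (group key, secondary key).
from itertools import groupby

rows = [(1, 9), (2, 3), (1, 4), (2, 7), (1, 5)]  # ($0, $1) pairs

# Simulated shuffle: sort map output by the composite key.
shuffled = sorted(rows, key=lambda r: (r[0], r[1]))

# Reducer side: group on the primary key only; values inside each group
# are already in secondary-key order, so "order A by $1" becomes a no-op.
grouped = {k: [r[1] for r in g]
           for k, g in groupby(shuffled, key=lambda r: r[0])}
# grouped == {1: [4, 5, 9], 2: [3, 7]}
```

The same ordering guarantee is what lets the nested distinct skip its sort and just drop adjacent duplicates.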

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-11 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1038:


Status: Patch Available  (was: Open)

 Optimize nested distinct/sort to use secondary key
 --

 Key: PIG-1038
 URL: https://issues.apache.org/jira/browse/PIG-1038
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.4.0
Reporter: Olga Natkovich
Assignee: Daniel Dai
 Fix For: 0.6.0

 Attachments: PIG-1038-1.patch, PIG-1038-2.patch, PIG-1038-3.patch, 
 PIG-1038-4.patch


 If nested foreach plan contains sort/distinct, it is possible to use hadoop 
 secondary sort instead of SortedDataBag and DistinctDataBag to optimize the 
 query. 
 Eg1:
 A = load 'mydata';
 B = group A by $0;
 C = foreach B {
 D = order A by $1;
 generate group, D;
 }
 store C into 'myresult';
 We can specify a secondary sort on A.$1, and drop order A by $1.
 Eg2:
 A = load 'mydata';
 B = group A by $0;
 C = foreach B {
 D = A.$1;
 E = distinct D;
 generate group, E;
 }
 store C into 'myresult';
 We can specify a secondary sort key on A.$1, and simplify D=A.$1; E=distinct 
 D to a special version of distinct, which does not do the sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-11 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1038:


Attachment: PIG-1038-4.patch

Attached a new patch in response to Pradeep's comments.

 Optimize nested distinct/sort to use secondary key
 --

 Key: PIG-1038
 URL: https://issues.apache.org/jira/browse/PIG-1038
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.4.0
Reporter: Olga Natkovich
Assignee: Daniel Dai
 Fix For: 0.6.0

 Attachments: PIG-1038-1.patch, PIG-1038-2.patch, PIG-1038-3.patch, 
 PIG-1038-4.patch


 If nested foreach plan contains sort/distinct, it is possible to use hadoop 
 secondary sort instead of SortedDataBag and DistinctDataBag to optimize the 
 query. 
 Eg1:
 A = load 'mydata';
 B = group A by $0;
 C = foreach B {
 D = order A by $1;
 generate group, D;
 }
 store C into 'myresult';
 We can specify a secondary sort on A.$1, and drop order A by $1.
 Eg2:
 A = load 'mydata';
 B = group A by $0;
 C = foreach B {
 D = A.$1;
 E = distinct D;
 generate group, E;
 }
 store C into 'myresult';
 We can specify a secondary sort key on A.$1, and simplify D=A.$1; E=distinct 
 D to a special version of distinct, which does not do the sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: FYI - forking TFile off Hadoop into Zebra

2009-11-11 Thread Ashutosh Chauhan
On Wed, Nov 11, 2009 at 18:26, Chao Wang ch...@yahoo-inc.com wrote:


 Last, we would like to point out that this is a short term solution for
 Zebra and we plan to:
 1) port all changes to Zebra TFile back into Hadoop TFile.
 2) in the long run have a single unified solution for this.

 Just for clarity: in the long run, as Zebra stabilizes and Pig adopts
hadoop-0.22, will Zebra get rid of this fork?

Ashutosh


[jira] Commented: (PIG-979) Acummulator Interface for UDFs

2009-11-11 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776760#action_12776760
 ] 

Ying He commented on PIG-979:
-

Performance tests don't show a noticeable difference between trunk and the 
accumulator patch when calling non-accumulator UDFs.

The script to test performance is:

register /homes/yinghe/pig_test/pigperf.jar;
register /homes/yinghe/pig_test/string.jar;
register /homes/yinghe/pig_test/piggybank.jar;

A = load '/user/pig/tests/data/pigmix_large/page_views' using 
org.apache.pig.test.utils.datagen.PigPerformanceLoader() as (user, action, 
timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, 
page_links);

B = foreach A generate user, 
org.apache.pig.piggybank.evaluation.string.STRINGCAT(user, ip_addr) as id;

C = group B by id parallel 10;

D = foreach C {
generate group, string.BagCount2(B)*string.ColumnLen2(B, 0);
}

store D into 'test2';

The input data has 100M rows, output has 57M rows, so the UDFs are called 57M 
times.
The result is

 with patch:  5min 14sec
 w/o patch:   5min 17sec

 Acummulator Interface for UDFs
 --

 Key: PIG-979
 URL: https://issues.apache.org/jira/browse/PIG-979
 Project: Pig
  Issue Type: New Feature
Reporter: Alan Gates
Assignee: Ying He
 Attachments: PIG-979.patch, PIG-979.patch


 Add an accumulator interface for UDFs that would allow them to take a set 
 number of records at a time instead of the entire bag.
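A hedged Python sketch of the accumulator idea: the framework feeds the UDF batches of tuples instead of materializing the whole bag. The method names mirror the accumulate/getValue/cleanup style discussed in the issue, but this is an illustration, not Pig's actual Java interface.

```python
# Hypothetical accumulator-style COUNT UDF; names are illustrative only.
class CountAccumulator:
    def __init__(self):
        self.count = 0

    def accumulate(self, batch):   # called once per chunk of records
        self.count += len(batch)

    def get_value(self):           # final result after the last batch
        return self.count

    def cleanup(self):             # reset so the instance can be reused
        self.count = 0

def run_accumulator(udf, records, batch_size=2):
    """Simulates the framework driving the UDF in fixed-size batches."""
    for i in range(0, len(records), batch_size):
        udf.accumulate(records[i:i + batch_size])
    result = udf.get_value()
    udf.cleanup()
    return result

total = run_accumulator(CountAccumulator(), list(range(7)))  # 7 records
```

Because only one batch is held at a time, memory use no longer scales with bag size, which is the point of the interface.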

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-742) Spaces could be optional in Pig syntax

2009-11-11 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776767#action_12776767
 ] 

Pradeep Kamath commented on PIG-742:


I gave a shot at introducing a new production in QueryParser.jjt but it 
didn't work. I am wondering if this issue is really because javacc's 
tokenizer needs whitespace to tokenize - anybody with more experience with 
javacc want to comment?

Here's the patch of what I tried:
{code}
Index: src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt
===
--- src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt (revision 
834628)
+++ src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt (working copy)
@@ -979,7 +979,8 @@
 |  < #DIGIT : ["0"-"9"] >
 |  < #SPECIALCHAR : ["_"] >
 |  < #FSSPECIALCHAR: ["-", ":", "/"] >
-|  < IDENTIFIER: ( <LETTER> )+ ( <DIGIT> | <LETTER> | <SPECIALCHAR> | "::" )* >
+|  < IDENTIFIER: ( <LETTER> )+ ( <DIGIT> | <LETTER> | <SPECIALCHAR> | "::" )* >
+|  < IDENTIFIEREQUAL: ( <LETTER> )+ ( <DIGIT> | <LETTER> | <SPECIALCHAR> | "::" )* ("=") >
 }
 // Define Numeric Constants
 TOKEN :
@@ -1010,12 +1011,15 @@
 // Pig has special variables starting with $
 TOKEN : { < DOLLARVAR : "$" <INTEGER> > }
 
+TOKEN : { < EQUAL : "=" > }
+
 // Parse is the Starting function.
 LogicalPlan Parse() : 
 {
LogicalOperator root = null; 
Token t1;
Token t2; 
+   String alias;
LogicalPlan lp = new LogicalPlan();
log.trace("Entering Parse");
 }
@@ -1028,7 +1032,8 @@
throw new ParseException(
    "Currently PIG does not support assigning an existing " +
    "relation (" + t1.image + ") to another alias (" + t2.image + ")");})
 |  LOOKAHEAD(2) 
-   (t1 = <IDENTIFIER> "=" root = Expr(lp) ";" {
+   (
+    (t1 = <IDENTIFIER> <EQUAL> { alias = t1.image;} |  t1 = 
+    <IDENTIFIEREQUAL> { alias = t1.image.replaceAll("=", ""); }) root = Expr(lp) ";" {
  root.setAlias(t1.image);
  addAlias(t1.image, root);
  pigContext.setLastAlias(t1.image);

{code}
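A toy longest-match lexer (plain Python with a hypothetical token set, not the real QueryParser.jjt grammar) illustrates the suspected tokenizer behavior: when a greedy PATH-like token admits '=', "A=load" is consumed as one token, matching the reported Encountered <PATH> "A=load" error.

```python
# Sketch of javacc-style maximal-munch tokenization; token names and
# patterns are invented for illustration.
import re

TOKENS = [
    ("PATH", r"[A-Za-z][A-Za-z0-9_=/:-]+"),   # greedy, admits '='
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9_]*"),
    ("EQUAL", r"="),
    ("WS", r"\s+"),
]

def lex(text):
    out, pos = [], 0
    while pos < len(text):
        # Maximal munch: take the longest match at this position.
        name, m = max(((n, re.match(p, text[pos:])) for n, p in TOKENS),
                      key=lambda c: len(c[1].group()) if c[1] else -1)
        if not m:
            raise SyntaxError(text[pos:])
        if name != "WS":
            out.append((name, m.group()))
        pos += len(m.group())
    return out

tokens_nospace = lex("A=load")   # one greedy PATH token swallows "A=load"
tokens_space = lex("A = load")   # three tokens: A, =, load
```

If this is indeed the cause, no grammar production can fix it; the token definitions themselves would have to change, which is what the IDENTIFIEREQUAL experiment above attempts.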


 Spaces could be optional in Pig syntax
 --

 Key: PIG-742
 URL: https://issues.apache.org/jira/browse/PIG-742
 Project: Pig
  Issue Type: Wish
  Components: grunt
Affects Versions: 0.3.0
Reporter: Viraj Bhat
Priority: Minor

 The following Pig statements generate an error if there is no space between 
 the alias A and the = sign:
 {code}
 A=load 'quf.txt' using PigStorage() as (q, u, f:long);
 B = group A by (q);
 C = foreach B {
 F = order A by f desc;
 generate F;
 };
 describe C;
 dump C;
 {code}
 2009-03-31 17:14:15,959 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1000: Error during parsing. Encountered
 <PATH> "A=load " at line 1, column 1.
 Was expecting one of:
 EOF 
 cat ...
 cd ...
 cp ...
 copyFromLocal ...
 copyToLocal ...
 dump ...
 describe ...
 aliases ...
 explain ...
 help ...
 kill ...
 ls ...
 mv ...
 mkdir ...
 pwd ...
 quit ...
 register ...
 rm ...
 rmf ...
 set ...
 illustrate ...
 run ...
 exec ...
 scriptDone ...
  ...
 EOL ...
 ; ...
 It would be nice if the parser would not expect these space requirements 
 between an alias and =

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1039) Pig 0.5 Doc Updates

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-1039:
---

Assignee: Corinne Chandel

 Pig 0.5 Doc Updates
 ---

 Key: PIG-1039
 URL: https://issues.apache.org/jira/browse/PIG-1039
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.5.0
Reporter: Corinne Chandel
Assignee: Corinne Chandel
Priority: Blocker
 Fix For: 0.5.0

 Attachments: branch-0.5.patch, trunk.patch


 Pig 0.5 doc updates (to be applied to Trunk and branch-0.5)
 1. updates to tutorial
 2. updates to pig latin reference manual
 3. updated doc tab to 0.5.0

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1033) javac warnings: deprecated hadoop APIs

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-1033:
---

Assignee: Daniel Dai

 javac warnings: deprecated hadoop APIs
 --

 Key: PIG-1033
 URL: https://issues.apache.org/jira/browse/PIG-1033
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.4.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.6.0

 Attachments: PIG-1033-1.patch


 Suppress javac warnings related to deprecated hadoop APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: package org.apache.hadoop.zebra.parse missing

2009-11-11 Thread Min Zhou
Thanks, Alan.


On Thu, Nov 12, 2009 at 12:51 AM, Alan Gates ga...@yahoo-inc.com wrote:
 The parser package is generated as part of the build.  Invoking ant in
 the contrib/zebra directory should result in the parser package being
 created at ./src-gen/org/apache/hadoop/zebra/parser

 Alan.

 On Nov 11, 2009, at 12:54 AM, Min Zhou wrote:

 Hi guys,

 I checked out pig from trunk, and found the package
 org.apache.hadoop.zebra.parse missing. Are you sure this package has
 been committed?
 see this link

 http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/zebra/src/java/org/apache/hadoop/zebra/


 Min

 --
 My research interests are distributed systems, parallel computing and
 bytecode based virtual machine.

 My profile:
 http://www.linkedin.com/in/coderplay
 My blog:
 http://coderplay.javaeye.com





-- 
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com


[jira] Updated: (PIG-1085) Pass JobConf and UDF specific configuration information to UDFs

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1085:


Status: Open  (was: Patch Available)

 Pass JobConf and UDF specific configuration information to UDFs
 ---

 Key: PIG-1085
 URL: https://issues.apache.org/jira/browse/PIG-1085
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Alan Gates
Assignee: Alan Gates
 Attachments: udfconf.patch


 Users have long asked for a way to get the JobConf structure in their UDFs.  
 It would also be nice to have a way to pass properties between the front end 
 and back end so that UDFs can store state during parse time and use it at 
 runtime.
 This patch does part of what is proposed in PIG-602, but not all of it.  It 
 does not provide a way to give user specified configuration files to UDFs.  
 So I will mark 602 as depending on this bug, but it isn't a duplicate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1085) Pass JobConf and UDF specific configuration information to UDFs

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1085:


Status: Patch Available  (was: Open)

Uploading new patch that addresses javac warnings and release audit issues.  If 
I read the output correctly all of pig's unit tests passed on the previous test.

 Pass JobConf and UDF specific configuration information to UDFs
 ---

 Key: PIG-1085
 URL: https://issues.apache.org/jira/browse/PIG-1085
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Alan Gates
Assignee: Alan Gates
 Attachments: udfconf-2.patch, udfconf.patch


 Users have long asked for a way to get the JobConf structure in their UDFs.  
 It would also be nice to have a way to pass properties between the front end 
 and back end so that UDFs can store state during parse time and use it at 
 runtime.
 This patch does part of what is proposed in PIG-602, but not all of it.  It 
 does not provide a way to give user specified configuration files to UDFs.  
 So I will mark 602 as depending on this bug, but it isn't a duplicate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1032) FINDBUGS: DM_STRING_CTOR: Method invokes inefficient new String(String) constructor

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-1032:
---

Assignee: Olga Natkovich

 FINDBUGS: DM_STRING_CTOR: Method invokes inefficient new String(String) 
 constructor
 ---

 Key: PIG-1032
 URL: https://issues.apache.org/jira/browse/PIG-1032
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Attachments: PIG-1032.patch


 DmMethod 
 org.apache.pig.backend.executionengine.PigSlice.init(DataStorage) invokes 
 toString() method on a String
 Dm
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.copyHadoopConfLocally(String)
  invokes inefficient new String(String) constructor
 Dm
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getFirstLineFromMessage(String)
  invokes inefficient new String(String) constructor
 Dm
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.BinaryComparisonOperator.initializeRefs()
  invokes inefficient Boolean constructor; use Boolean.valueOf(...) instead
 Dm
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.ExpressionOperator.clone()
  invokes inefficient new String(String) constructor
 Dm
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(String)
  invokes inefficient new String(String) constructor
 Dm
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.PONot.getNext(Boolean)
  invokes inefficient Boolean constructor; use Boolean.valueOf(...) instead
 Dm
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.plans.PhysicalPlan.clone()
  invokes inefficient new String(String) constructor
 Dm  new org.apache.pig.data.TimestampedTuple(String, String, int, SimpleDateFormat) invokes inefficient new String(String) constructor
 Dm  org.apache.pig.impl.io.PigNullableWritable.toString() invokes inefficient new String(String) constructor
 Dm  org.apache.pig.impl.logicalLayer.LOForEach.clone() invokes inefficient Boolean constructor; use Boolean.valueOf(...) instead
 Dm  org.apache.pig.impl.logicalLayer.LOGenerate.clone() invokes inefficient Boolean constructor; use Boolean.valueOf(...) instead
 Dm  org.apache.pig.impl.logicalLayer.LogicalPlan.clone() invokes inefficient new String(String) constructor
 Dm  org.apache.pig.impl.logicalLayer.LOSort.clone() invokes inefficient Boolean constructor; use Boolean.valueOf(...) instead
 Dm  org.apache.pig.impl.logicalLayer.optimizer.ImplicitSplitInserter.transform(List) invokes inefficient Boolean constructor; use Boolean.valueOf(...) instead
 Dm  org.apache.pig.impl.logicalLayer.RemoveRedundantOperators.visit(LOProject) invokes inefficient new String(String) constructor
 Dm  org.apache.pig.impl.logicalLayer.schema.Schema.getField(String) invokes inefficient new String(String) constructor
 Dm  org.apache.pig.impl.logicalLayer.schema.Schema.reconcile(Schema) invokes inefficient new String(String) constructor
 Dm  org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.insertCastForEachInBetweenIfNecessary(LogicalOperator, LogicalOperator, Schema) invokes inefficient Boolean constructor; use Boolean.valueOf(...) instead
 Dm  org.apache.pig.impl.util.SpillableMemoryManager.handleNotification(Notification, Object) forces garbage collection; extremely dubious except in benchmarking code
 Dm  org.apache.pig.pen.AugmentBaseDataVisitor.GetLargerValue(Object) invokes inefficient new String(String) constructor
 Dm  org.apache.pig.pen.AugmentBaseDataVisitor.GetSmallerValue(Object) invokes inefficient new String(String) constructor
 Dm  org.apache.pig.tools.cmdline.CmdLineParser.getNextOpt() invokes inefficient new String(String) constructor
 Dm  org.apache.pig.tools.parameters.PreprocessorContext.substitute(String) invokes inefficient new String(String) constructor
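For readers unfamiliar with the Dm pattern: the flagged constructors copy values that already exist as objects. A minimal illustrative sketch (not taken from any Pig patch) of the preferred forms:

```java
public class DmExample {
    // FindBugs-flagged forms (each allocates a needless object):
    //   String copy = new String(s);
    //   Boolean flag = new Boolean(b);
    // Preferred forms: Strings are immutable, so reuse the reference,
    // and Boolean.valueOf returns the cached TRUE/FALSE instances.
    static String copy(String s) { return s; }
    static Boolean box(boolean b) { return Boolean.valueOf(b); }
}
```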

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1085) Pass JobConf and UDF specific configuration information to UDFs

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1085:


Attachment: udfconf-2.patch

 Pass JobConf and UDF specific configuration information to UDFs
 ---

 Key: PIG-1085
 URL: https://issues.apache.org/jira/browse/PIG-1085
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Alan Gates
Assignee: Alan Gates
 Attachments: udfconf-2.patch, udfconf.patch


 Users have long asked for a way to get the JobConf structure in their UDFs.  
 It would also be nice to have a way to pass properties between the front end 
 and back end so that UDFs can store state during parse time and use it at 
 runtime.
 This patch does part of what is proposed in PIG-602, but not all of it.  It 
 does not provide a way to give user-specified configuration files to UDFs, 
 so I will mark 602 as depending on this bug, but it isn't a duplicate.
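As a rough illustration of the proposed mechanism (the class and method names below are invented for this sketch, not the patch's actual API), the front end records a UDF-specific property that the same UDF reads back at runtime:

```java
import java.util.Properties;

// Hypothetical sketch: a shared Properties object keyed by UDF class name,
// written on the front end (parse time) and read on the back end (runtime).
// In the real patch the properties would travel via the job configuration.
public class UdfConfSketch {
    private static final Properties props = new Properties();

    static void setAtFrontend(Class<?> udf, String key, String value) {
        props.setProperty(udf.getName() + "." + key, value);
    }

    static String getAtRuntime(Class<?> udf, String key) {
        return props.getProperty(udf.getName() + "." + key);
    }
}
```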

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1088) change merge join and merge join indexer to work with new LoadFunc interface

2009-11-11 Thread Thejas M Nair (JIRA)
change merge join and merge join indexer to work with new LoadFunc interface


 Key: PIG-1088
 URL: https://issues.apache.org/jira/browse/PIG-1088
 Project: Pig
  Issue Type: Sub-task
Reporter: Thejas M Nair
Assignee: Thejas M Nair




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1018) FINDBUGS: NM_FIELD_NAMING_CONVENTION: Field names should start with a lower case letter

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-1018:
---

Assignee: Olga Natkovich

 FINDBUGS: NM_FIELD_NAMING_CONVENTION: Field names should start with a lower 
 case letter
 ---

 Key: PIG-1018
 URL: https://issues.apache.org/jira/browse/PIG-1018
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Attachments: PIG-1018.patch


 Nm  The field name org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.LogToPhyMap doesn't start with a lower case letter
 Nm  The method name org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.CreateTuple(Object[]) doesn't start with a lower case letter
 Nm  The class name org.apache.pig.backend.hadoop.executionengine.physicalLayer.util.operatorHelper doesn't start with an upper case letter
 Nm  Class org.apache.pig.impl.util.WrappedIOException is not derived from an Exception, even though it is named as such
 Nm  The method name org.apache.pig.pen.EquivalenceClasses.GetEquivalenceClasses(LogicalOperator, Map) doesn't start with a lower case letter
 Nm  The field name org.apache.pig.pen.util.DisplayExamples.Result doesn't start with a lower case letter
 Nm  The method name org.apache.pig.pen.util.DisplayExamples.PrintSimple(LogicalOperator, Map) doesn't start with a lower case letter
 Nm  The method name org.apache.pig.pen.util.DisplayExamples.PrintTabular(LogicalPlan, Map) doesn't start with a lower case letter
 Nm  The method name org.apache.pig.tools.parameters.TokenMgrError.LexicalError(boolean, int, int, int, String, char) doesn't start with a lower case letter

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1013) FINDBUGS: DMI_INVOKING_TOSTRING_ON_ARRAY: Invocation of toString on an array

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-1013:
---

Assignee: Olga Natkovich

 FINDBUGS: DMI_INVOKING_TOSTRING_ON_ARRAY: Invocation of toString on an array
 

 Key: PIG-1013
 URL: https://issues.apache.org/jira/browse/PIG-1013
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Attachments: PIG-1013.patch


 DMI  Invocation of toString on stackTraceLines in org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getExceptionFromStrings(String[], int)
 DMI  Invocation of toString on b in org.apache.pig.builtin.Utf8StorageConverter.bytesToBag(byte[])
 DMI  Invocation of toString on b in org.apache.pig.builtin.Utf8StorageConverter.bytesToDouble(byte[])
 DMI  Invocation of toString on b in org.apache.pig.builtin.Utf8StorageConverter.bytesToFloat(byte[])
 DMI  Invocation of toString on b in org.apache.pig.builtin.Utf8StorageConverter.bytesToInteger(byte[])
 DMI  Invocation of toString on b in org.apache.pig.builtin.Utf8StorageConverter.bytesToLong(byte[])
 DMI  Invocation of toString on b in org.apache.pig.builtin.Utf8StorageConverter.bytesToMap(byte[])
 DMI  Invocation of toString on b in org.apache.pig.builtin.Utf8StorageConverter.bytesToTuple(byte[])
 DMI  Invocation of toString on args in org.apache.pig.impl.PigContext.instantiateFuncFromSpec(FuncSpec)
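The DMI pattern fires because arrays inherit Object.toString(), which prints a type tag and hash rather than the contents. An illustrative sketch (not from the patch) of the problem and the usual fixes:

```java
import java.util.Arrays;

public class DmiExample {
    // Flagged form: yields something like "[B@1b6d3586", not the bytes.
    static String bad(byte[] b) { return b.toString(); }

    // Typical fix when the bytes are text: decode them into a String
    // (platform default charset here, for brevity).
    static String asText(byte[] b) { return new String(b); }

    // Typical fix for debugging output: Arrays.toString prints the elements.
    static String asDebug(byte[] b) { return Arrays.toString(b); }
}
```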

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1011) FINDBUGS: SE_NO_SERIALVERSIONID: Class is Serializable, but doesn't define serialVersionUID

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-1011:
---

Assignee: Olga Natkovich

 FINDBUGS: SE_NO_SERIALVERSIONID: Class is Serializable, but doesn't define 
 serialVersionUID
 ---

 Key: PIG-1011
 URL: https://issues.apache.org/jira/browse/PIG-1011
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Attachments: PIG-1011.patch


 SnVI  org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODistinct is Serializable; consider declaring a serialVersionUID
 SnVI  org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PORead is Serializable; consider declaring a serialVersionUID
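The fix for SnVI is a one-line declaration. Pinning serialVersionUID fixes the serialized form so that recompiling the class (which can change the computed default UID) does not break deserialization of existing data. A minimal sketch with a stand-in class name:

```java
import java.io.Serializable;

// Stand-in for a Serializable physical operator such as PODistinct.
public class PoOperatorExample implements Serializable {
    // Explicit UID: without it, the JVM derives one from the class shape,
    // and any signature change makes old serialized instances unreadable.
    private static final long serialVersionUID = 1L;

    int requestedParallelism;
}
```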

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1015) [piggybank] DateExtractor should take into account timezones

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-1015:
---

Assignee: Dmitriy V. Ryaboy

 [piggybank] DateExtractor should take into account timezones
 

 Key: PIG-1015
 URL: https://issues.apache.org/jira/browse/PIG-1015
 Project: Pig
  Issue Type: Bug
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.6.0

 Attachments: date_extractor.patch


 The current implementation defaults to the local timezone when parsing 
 strings, thereby providing inconsistent results depending on the settings of 
 the computer the program is executing on (this is causing unit test 
 failures). We should set the timezone to a consistent default, and allow 
 users to override this default.
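A sketch of the kind of fix described (GMT as the consistent default is an assumption for illustration; the actual patch may choose differently): pinning the formatter's timezone makes parsing reproducible regardless of the host machine's settings.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class TzExample {
    // Parse a date string against a fixed zone instead of the local default,
    // so the same input yields the same instant on every machine.
    static long parseUtcMillis(String s) {
        SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd");
        df.setTimeZone(TimeZone.getTimeZone("GMT"));
        try {
            return df.parse(s).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("unparseable date: " + s, e);
        }
    }
}
```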

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1089) Pig 0.6.0 Documentation

2009-11-11 Thread Corinne Chandel (JIRA)
Pig 0.6.0 Documentation
---

 Key: PIG-1089
 URL: https://issues.apache.org/jira/browse/PIG-1089
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.6.0
Reporter: Corinne Chandel
Assignee: Corinne Chandel
Priority: Blocker
 Fix For: 0.6.0


Pig 0.6.0 documentation:
 Ability to use Hadoop dfs commands from Pig
 Replicated left outer join
 Skewed outer join
 Map-side group
 Accumulate Interface for UDFs
 Improved Memory Mgt
 Integration with Zebra

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1012) FINDBUGS: SE_BAD_FIELD: Non-transient non-serializable instance field in serializable class

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-1012:
---

Assignee: Olga Natkovich

 FINDBUGS: SE_BAD_FIELD: Non-transient non-serializable instance field in 
 serializable class
 ---

 Key: PIG-1012
 URL: https://issues.apache.org/jira/browse/PIG-1012
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Fix For: 0.6.0

 Attachments: PIG-1012-2.patch, PIG-1012-3.patch, PIG-1012.patch


 Se  Class org.apache.pig.backend.executionengine.PigSlice defines non-transient non-serializable instance field is
 Se  Class org.apache.pig.backend.executionengine.PigSlice defines non-transient non-serializable instance field loader
 Se  java.util.zip.GZIPInputStream stored into non-transient field PigSlice.is
 Se  org.apache.pig.backend.datastorage.SeekableInputStream stored into non-transient field PigSlice.is
 Se  org.apache.tools.bzip2r.CBZip2InputStream stored into non-transient field PigSlice.is
 Se  org.apache.pig.builtin.PigStorage stored into non-transient field PigSlice.loader
 Se  org.apache.pig.backend.hadoop.DoubleWritable$Comparator implements Comparator but not Serializable
 Se  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigBagWritableComparator implements Comparator but not Serializable
 Se  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigCharArrayWritableComparator implements Comparator but not Serializable
 Se  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigDBAWritableComparator implements Comparator but not Serializable
 Se  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigDoubleWritableComparator implements Comparator but not Serializable
 Se  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigFloatWritableComparator implements Comparator but not Serializable
 Se  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigIntWritableComparator implements Comparator but not Serializable
 Se  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigLongWritableComparator implements Comparator but not Serializable
 Se  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigTupleWritableComparator implements Comparator but not Serializable
 Se  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigWritableComparator implements Comparator but not Serializable
 Se  Class org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper defines non-transient non-serializable instance field nig
 Se  Class org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.EqualToExpr defines non-transient non-serializable instance field log
 Se  Class org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.GreaterThanExpr defines non-transient non-serializable instance field log
 Se  Class org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.GTOrEqualToExpr defines non-transient non-serializable instance field log
 Se  Class org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.LessThanExpr defines non-transient non-serializable instance field log
 Se  Class org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.LTOrEqualToExpr defines non-transient non-serializable instance field log
 Se  Class org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.NotEqualToExpr defines non-transient non-serializable instance field log
 Se  Class org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast defines non-transient non-serializable instance field log
 Se  Class org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject defines non-transient non-serializable instance field bagIterator
 Se  Class org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserComparisonFunc defines non-transient non-serializable instance field log
 Se  Class org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc defines non-transient non-serializable instance field log
 Se  Class org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage defines non-transient non-serializable instance field log
 SeClass 
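Many of the entries above are logger fields in Serializable operators. An illustrative sketch (not from the patch) of the usual remedies: make such a field static (static fields are never serialized) or mark it transient so serialization skips it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SeExample implements Serializable {
    private static final long serialVersionUID = 1L;

    // Flagged form: a non-serializable instance field in a Serializable class,
    //   private Log log = LogFactory.getLog(SeExample.class);
    // which throws NotSerializableException when the operator is shipped.
    // Fix: transient (or static), so serialization ignores the field.
    private transient Object log = new Object();

    // Demonstrates that an instance now serializes without error.
    static boolean roundTrips() {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new ObjectOutputStream(bos).writeObject(new SeExample());
            return bos.size() > 0;
        } catch (IOException e) {
            return false;
        }
    }
}
```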
 

[jira] Assigned: (PIG-1010) FINDBUGS: RV_RETURN_VALUE_IGNORED_BAD_PRACTICE

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-1010:
---

Assignee: Olga Natkovich

 FINDBUGS: RV_RETURN_VALUE_IGNORED_BAD_PRACTICE
 --

 Key: PIG-1010
 URL: https://issues.apache.org/jira/browse/PIG-1010
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Olga Natkovich

 RV  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.deleteLocalDir(File) ignores exceptional return value of java.io.File.delete()
 RV  org.apache.pig.backend.local.datastorage.LocalPath.delete() ignores exceptional return value of java.io.File.delete()
 RV  org.apache.pig.data.DefaultAbstractBag.clear() ignores exceptional return value of java.io.File.delete()
 RV  org.apache.pig.data.DefaultAbstractBag.finalize() ignores exceptional return value of java.io.File.delete()
 RV  org.apache.pig.impl.io.FileLocalizer.create(String, boolean, PigContext) ignores exceptional return value of java.io.File.mkdirs()
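File.delete() and File.mkdirs() signal failure through their boolean return value rather than an exception, so a bare call silently swallows failures. A minimal sketch (not from the patch) of the flagged form and a checked version:

```java
import java.io.File;
import java.io.IOException;

public class RvExample {
    // Flagged form: f.delete();  -- a false return goes unnoticed.
    // Checking the result surfaces the failure to logs or callers.
    static boolean deleteOrWarn(File f) {
        boolean ok = f.delete();
        if (!ok) {
            System.err.println("could not delete " + f);
        }
        return ok;
    }

    // First delete succeeds; the second fails because the file is gone,
    // which the checked version now reports instead of hiding.
    static boolean demo() {
        try {
            File f = File.createTempFile("rv", ".tmp");
            return deleteOrWarn(f) && !deleteOrWarn(f);
        } catch (IOException e) {
            return false;
        }
    }
}
```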

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1009) FINDBUGS: OS_OPEN_STREAM: Method may fail to close stream

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-1009:
---

Assignee: Olga Natkovich

 FINDBUGS: OS_OPEN_STREAM: Method may fail to close stream
 -

 Key: PIG-1009
 URL: https://issues.apache.org/jira/browse/PIG-1009
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Attachments: PIG-1009.patch


 OS  org.apache.pig.impl.io.FileLocalizer.parseCygPath(String, int) may fail to close stream
 OS  org.apache.pig.impl.logicalLayer.parser.QueryParser.which(String) may fail to close stream
 OS  org.apache.pig.impl.util.PropertiesUtil.loadPropertiesFromFile(Properties) may fail to close stream
 OS  org.apache.pig.Main.configureLog4J(Properties, PigContext) may fail to close stream
 OS  org.apache.pig.tools.parameters.PreprocessorContext.executeShellCommand(String) may fail to close stream
 OS  org.apache.pig.tools.parameters.PreprocessorContext.executeShellCommand(String) may fail to close stream
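The OS pattern means a stream opened in a method can leak its file descriptor when an exception skips the close. The period-appropriate fix (pre-Java-7, so no try-with-resources) is a try/finally; a minimal sketch, not taken from the patch:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class OsExample {
    // Flagged form: return r.readLine(); with no finally -- an exception
    // between open and close leaks the descriptor.
    static String firstLine(File f) throws IOException {
        BufferedReader r = new BufferedReader(new FileReader(f));
        try {
            return r.readLine();
        } finally {
            r.close();   // runs even if readLine throws
        }
    }

    static boolean demo() {
        try {
            File f = File.createTempFile("os", ".txt");
            f.deleteOnExit();
            FileWriter w = new FileWriter(f);
            try {
                w.write("hello\nworld\n");
            } finally {
                w.close();
            }
            return "hello".equals(firstLine(f));
        } catch (IOException e) {
            return false;
        }
    }
}
```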

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1008) FINDBUGS: NP_TOSTRING_COULD_RETURN_NULL

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-1008:
---

Assignee: Olga Natkovich

 FINDBUGS: NP_TOSTRING_COULD_RETURN_NULL
 ---

 Key: PIG-1008
 URL: https://issues.apache.org/jira/browse/PIG-1008
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Attachments: PIG-1008.patch


 NP  org.apache.pig.data.DataByteArray.toString() may return null
 NP  org.apache.pig.impl.streaming.StreamingCommand$HandleSpec.equals(Object) does not check for null argument
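Both warnings are contract violations: toString() must never return null, and equals() must return false (not throw) for a null argument. An illustrative sketch with a stand-in class, not the actual Pig fix:

```java
public class NpExample {
    private final byte[] data;

    NpExample(byte[] data) { this.data = data; }

    // toString() contract: return a placeholder, never null.
    public String toString() {
        return data == null ? "" : new String(data);
    }

    // equals() contract: instanceof is false for null, so a null
    // argument falls through to false instead of an NPE.
    public boolean equals(Object other) {
        if (!(other instanceof NpExample)) return false;
        return toString().equals(other.toString());
    }

    public int hashCode() { return toString().hashCode(); }
}
```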

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1006) FINDBUGS: EQ_COMPARETO_USE_OBJECT_EQUALS in bags and tuples

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-1006:
---

Assignee: Olga Natkovich

 FINDBUGS: EQ_COMPARETO_USE_OBJECT_EQUALS in bags and tuples
 ---

 Key: PIG-1006
 URL: https://issues.apache.org/jira/browse/PIG-1006
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Olga Natkovich

 Eq  org.apache.pig.data.DistinctDataBag$DistinctDataBagIterator$TContainer defines compareTo(DistinctDataBag$DistinctDataBagIterator$TContainer) and uses Object.equals()
 Eq  org.apache.pig.data.SingleTupleBag defines compareTo(Object) and uses Object.equals()
 Eq  org.apache.pig.data.SortedDataBag$SortedDataBagIterator$PQContainer defines compareTo(SortedDataBag$SortedDataBagIterator$PQContainer) and uses Object.equals()
 Eq  org.apache.pig.data.TargetedTuple defines compareTo(Object) and uses Object.equals()
 Eq  org.apache.pig.pen.util.ExampleTuple defines compareTo(Object) and uses Object.equals()
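The Eq pattern flags classes where compareTo() can return 0 for objects that equals() considers different, which confuses sorted collections (they treat compareTo()==0 as equality). A minimal sketch, not the actual Pig change, of keeping the two consistent:

```java
public class EqExample implements Comparable<EqExample> {
    final int key;

    EqExample(int key) { this.key = key; }

    public int compareTo(EqExample other) {
        return key < other.key ? -1 : key > other.key ? 1 : 0;
    }

    // equals (and hashCode) defined from the same key as compareTo,
    // so compareTo()==0 and equals() always agree.
    public boolean equals(Object other) {
        return other instanceof EqExample && compareTo((EqExample) other) == 0;
    }

    public int hashCode() { return key; }
}
```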

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-989) Allow type merge between numerical type and non-numerical type

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-989:
--

Assignee: Daniel Dai

 Allow type merge between numerical type and non-numerical type
 --

 Key: PIG-989
 URL: https://issues.apache.org/jira/browse/PIG-989
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.5.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: PIG-989-1.patch, PIG-989-2.patch


 Currently, we do not allow a type merge between a numerical type and a 
 non-numerical type, and the error message is confusing. 
 E.g., if you run:
 {code}
 a = load '1.txt' as (a0:chararray, a1:chararray);
 b = load '2.txt' as (b0:long, b1:chararray);
 c = join a by a0, b by b0;
 dump c;
 {code}
 the error message is ERROR 1051: Cannot cast to Unknown.
 We shall:
 1. Allow the type merge between numerical and non-numerical types
 2. Or at least provide a more meaningful error message to the user

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-979) Acummulator Interface for UDFs

2009-11-11 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-979:
---

Status: Open  (was: Patch Available)

 Acummulator Interface for UDFs
 --

 Key: PIG-979
 URL: https://issues.apache.org/jira/browse/PIG-979
 Project: Pig
  Issue Type: New Feature
Reporter: Alan Gates
Assignee: Ying He
 Attachments: PIG-979.patch, PIG-979.patch


 Add an accumulator interface for UDFs that would allow them to take a set 
 number of records at a time instead of the entire bag.
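The shape of such an interface can be sketched as follows (names and signatures here are illustrative assumptions, not Pig's actual API): the runtime hands the UDF successive batches of values and asks for the result only after the last batch, so the whole bag never has to be materialized.

```java
import java.util.Iterator;

// Hypothetical accumulator contract: accumulate() is called once per batch,
// getValue() once after the final batch.
interface Accumulator<T> {
    void accumulate(Iterator<Long> batch);
    T getValue();
}

// A SUM-style function only needs a running total, so it benefits directly.
class RunningSum implements Accumulator<Long> {
    private long sum = 0;

    public void accumulate(Iterator<Long> batch) {
        while (batch.hasNext()) {
            sum += batch.next();
        }
    }

    public Long getValue() { return sum; }
}
```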

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-979) Acummulator Interface for UDFs

2009-11-11 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-979:
---

Status: Patch Available  (was: Open)

 Acummulator Interface for UDFs
 --

 Key: PIG-979
 URL: https://issues.apache.org/jira/browse/PIG-979
 Project: Pig
  Issue Type: New Feature
Reporter: Alan Gates
Assignee: Ying He
 Attachments: PIG-979.patch, PIG-979.patch


 Add an accumulator interface for UDFs that would allow them to take a set 
 number of records at a time instead of the entire bag.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1089) Pig 0.6.0 Documentation

2009-11-11 Thread Corinne Chandel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Corinne Chandel updated PIG-1089:
-

Status: Patch Available  (was: Open)

Apply patch to trunk: http://svn.apache.org/repos/asf/hadoop/pig/trunk

Note: No new test code required; changes to documentation only.

 Pig 0.6.0 Documentation
 ---

 Key: PIG-1089
 URL: https://issues.apache.org/jira/browse/PIG-1089
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.6.0
Reporter: Corinne Chandel
Assignee: Corinne Chandel
Priority: Blocker
 Fix For: 0.6.0

 Attachments: Pig-6-Beta.patch


 Pig 0.6.0 documentation:
  Ability to use Hadoop dfs commands from Pig
  Replicated left outer join
  Skewed outer join
  Map-side group
  Accumulate Interface for UDFs
  Improved Memory Mgt
  Integration with Zebra

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-968) findContainingJar fails when there's a + in the path

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-968:
--

Assignee: Todd Lipcon

 findContainingJar fails when there's a + in the path
 

 Key: PIG-968
 URL: https://issues.apache.org/jira/browse/PIG-968
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.4.0, 0.5.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Attachments: pig-968.txt


 This is the same bug as in MAPREDUCE-714. Please see discussion there.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-938) Pig Docs for 0.4.0

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-938:
--

Assignee: Corinne Chandel

 Pig Docs for 0.4.0
 --

 Key: PIG-938
 URL: https://issues.apache.org/jira/browse/PIG-938
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.4.0
Reporter: Corinne Chandel
Assignee: Corinne Chandel
Priority: Minor
 Attachments: PIG-938-2.patch, PIG-938-2b.patch, PIG-938-3.patch, 
 PIG-938-4.patch, PIG-938.patch


 Pig docs for 0.4.0

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-935) Skewed join throws an exception when used with map keys

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-935:
--

Assignee: Sriranjan Manjunath

 Skewed join throws an exception when used with map keys
 ---

 Key: PIG-935
 URL: https://issues.apache.org/jira/browse/PIG-935
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: skmapbug.patch


 Skewed join throws a runtime exception for the following query:
 {code}
 A = load 'map.txt' as (e);
 B = load 'map.txt' as (f);
 C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
 explain C;
 {code}
 Exception:
 {code}
 Caused by: java.lang.ClassCastException: 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast
  cannot be cast to 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
 ... 27 more
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-958) Splitting output data on key field

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-958:
--

Assignee: Ankur

 Splitting output data on key field
 --

 Key: PIG-958
 URL: https://issues.apache.org/jira/browse/PIG-958
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Ankur
Assignee: Ankur
 Fix For: 0.6.0

 Attachments: 958.v3.patch, 958.v4.patch


 Pig users often face the need to split the output records into a bunch of 
 files and directories depending on the type of record. Pig's SPLIT operator 
 is useful when record types are few and known in advance. In cases where type 
 is not directly known but is derived dynamically from values of a key field 
 in the output tuple, a custom store function is a better solution.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-960) Using Hadoop's optimized LineRecordReader for reading Tuples in PigStorage

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-960:
--

Assignee: Ankit Modi

 Using Hadoop's optimized LineRecordReader for reading Tuples in PigStorage 
 ---

 Key: PIG-960
 URL: https://issues.apache.org/jira/browse/PIG-960
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Ankit Modi
Assignee: Ankit Modi
 Fix For: 0.6.0

 Attachments: pig_rlr.patch


 PigStorage's reading of Tuples (lines) can be optimized using Hadoop's 
 {{LineRecordReader}}.
 This can help in the following areas:
 - Improved performance when reading Tuples (lines) in {{PigStorage}}
 - Any future improvements to line reading in Hadoop's {{LineRecordReader}} 
 are automatically carried over to Pig
 Issues handled by this patch:
 - BZip uses internal buffers and positioning to determine the number of 
 bytes read, so the buffering done by {{LineRecordReader}} has to be turned off
 - The current implementation of {{LocalSeekableInputStream}} does not 
 implement the {{available}} method; it has to be implemented.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-923) Allow setting logfile location in pig.properties

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-923:
--

Assignee: Dmitriy V. Ryaboy

 Allow setting logfile location in pig.properties
 

 Key: PIG-923
 URL: https://issues.apache.org/jira/browse/PIG-923
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.4.0

 Attachments: pig_923.patch


 Local log file location can be specified through the -l flag, but it cannot 
 be set in pig.properties.
 This JIRA proposes a change to Main.java that allows it to read the 
 pig.logfile property from the configuration.
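With the change, the flag's behavior could presumably be expressed directly in pig.properties; a hypothetical entry (the path is an example, only the property name comes from the report):

```properties
# equivalent of passing -l on the command line
pig.logfile=/var/log/pig/pig.log
```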

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-929) Default value of memusage for skewed join is not correct

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-929:
--

Assignee: Ying He

 Default value of memusage for skewed join is not correct
 

 Key: PIG-929
 URL: https://issues.apache.org/jira/browse/PIG-929
 Project: Pig
  Issue Type: Improvement
Reporter: Ying He
Assignee: Ying He
 Attachments: memusage.patch


 The default value of pig.skewedjoin.reduce.memusage, which is used in skewed 
 join, should be set to 0.3.
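Expressed as a configuration entry, the proposed default would look like this (the property name and value are from the report above; the fraction-of-reducer-memory interpretation is an assumption):

```properties
# fraction of reducer memory available to the skewed-join partitioner
pig.skewedjoin.reduce.memusage=0.3
```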

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-924) Make Pig work with multiple versions of Hadoop

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-924:
--

Assignee: Dmitriy V. Ryaboy

 Make Pig work with multiple versions of Hadoop
 --

 Key: PIG-924
 URL: https://issues.apache.org/jira/browse/PIG-924
 Project: Pig
  Issue Type: Bug
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Attachments: pig_924.2.patch, pig_924.3.patch, pig_924.patch


 The current Pig build scripts package Hadoop and other dependencies into the 
 pig.jar file.
 This means that if users upgrade Hadoop, they also need to upgrade Pig.
 Pig has relatively few dependencies on Hadoop interfaces that changed between 
 18, 19, and 20.  It is possible to write a dynamic shim that allows Pig to 
 use the correct calls for any of the above versions of Hadoop. Unfortunately, 
 the build process precludes doing this at runtime, and 
 forces an unnecessary Pig rebuild even if dynamic shims are created.
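The shim idea can be sketched generically (all names below are invented for illustration; the patch's actual classes differ): version-specific adapters implement one interface, and a factory picks the right one from the detected Hadoop version string at runtime.

```java
// One interface hides the calls whose signatures differ across versions.
interface HadoopShim {
    String submitJob(String jobName);
}

class Hadoop18Shim implements HadoopShim {
    public String submitJob(String jobName) { return "18:" + jobName; }
}

class Hadoop20Shim implements HadoopShim {
    public String submitJob(String jobName) { return "20:" + jobName; }
}

class ShimFactory {
    // Dispatch on the runtime-detected version string, so one pig.jar
    // works against any supported Hadoop.
    static HadoopShim forVersion(String version) {
        return version.startsWith("0.20") ? new Hadoop20Shim()
                                          : new Hadoop18Shim();
    }
}
```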

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-919) Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText when doing simple group

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-919:
--

Assignee: Viraj Bhat

 Type mismatch in key from map: expected 
 org.apache.pig.impl.io.NullableBytesWritable, recieved 
 org.apache.pig.impl.io.NullableText when doing simple group
 --

 Key: PIG-919
 URL: https://issues.apache.org/jira/browse/PIG-919
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Viraj Bhat
Assignee: Viraj Bhat
 Fix For: 0.3.0

 Attachments: GenHashList.java, mapscript.pig, mymapudf.jar


 I have a Pig script which takes in a student file and generates a bag of 
 maps.  I later want to group on the value of the key name0, which 
 corresponds to the first name of the student.
 {code}
 register mymapudf.jar;
 data = LOAD '/user/viraj/studenttab10k' AS 
 (somename:chararray,age:long,marks:float);
 genmap = foreach data generate flatten(mymapudf.GenHashList(somename,' ')) as 
 bp:map[], age, marks;
 getfirstnames = foreach genmap generate bp#'name0' as firstname, age, marks;
 filternonnullfirstnames = filter getfirstnames by firstname is not null;
 groupgenmap = group filternonnullfirstnames by firstname;
 dump groupgenmap;
 {code}
 When I execute this code, I get an error in the Map Phase:
 ===
 java.io.IOException: Type mismatch in key from map: expected 
 org.apache.pig.impl.io.NullableBytesWritable, recieved 
 org.apache.pig.impl.io.NullableText
   at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:242)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
   at 
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
 ===

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-913) Error in Pig script when grouping on chararray column

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-913:
--

Assignee: Daniel Dai

 Error in Pig script when grouping on chararray column
 -

 Key: PIG-913
 URL: https://issues.apache.org/jira/browse/PIG-913
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Viraj Bhat
Assignee: Daniel Dai
Priority: Critical
 Fix For: 0.4.0

 Attachments: PIG-913-2.patch, PIG-913.patch


 I have a very simple script which fails at parsetime due to the schema I 
 specified in the loader.
 {code}
 data = LOAD '/user/viraj/studenttab10k' AS (s:chararray);
 dataSmall = limit data 100;
 bb = GROUP dataSmall by $0;
 dump bb;
 {code}
 =
 2009-08-06 18:47:56,297 [main] INFO  org.apache.pig.Main - Logging error 
 messages to: /homes/viraj/pig-svn/trunk/pig_1249609676296.log
 09/08/06 18:47:56 INFO pig.Main: Logging error messages to: 
 /homes/viraj/pig-svn/trunk/pig_1249609676296.log
 2009-08-06 18:47:56,459 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to hadoop file system at: hdfs://localhost:9000
 09/08/06 18:47:56 INFO executionengine.HExecutionEngine: Connecting to hadoop 
 file system at: hdfs://localhost:9000
 2009-08-06 18:47:56,694 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to map-reduce job tracker at: localhost:9001
 09/08/06 18:47:56 INFO executionengine.HExecutionEngine: Connecting to 
 map-reduce job tracker at: localhost:9001
 2009-08-06 18:47:57,008 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1002: Unable to store alias bb
 09/08/06 18:47:57 ERROR grunt.Grunt: ERROR 1002: Unable to store alias bb
 Details at logfile: /homes/viraj/pig-svn/trunk/pig_1249609676296.log
 =
 =
 Pig Stack Trace
 ---
 ERROR 1002: Unable to store alias bb
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
 open iterator for alias bb
 at org.apache.pig.PigServer.openIterator(PigServer.java:481)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:531)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:397)
 Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: 
 Unable to store alias bb
 at org.apache.pig.PigServer.store(PigServer.java:536)
 at org.apache.pig.PigServer.openIterator(PigServer.java:464)
 ... 6 more
 Caused by: java.lang.NullPointerException
 at 
 org.apache.pig.impl.logicalLayer.LOCogroup.unsetSchema(LOCogroup.java:359)
 at 
 org.apache.pig.impl.logicalLayer.optimizer.SchemaRemover.visit(SchemaRemover.java:64)
 at 
 org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:335)
 at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:46)
 at 
 org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
 at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
 at 
 org.apache.pig.impl.logicalLayer.optimizer.LogicalTransformer.rebuildSchemas(LogicalTransformer.java:67)
 at 
 org.apache.pig.impl.logicalLayer.optimizer.LogicalOptimizer.optimize(LogicalOptimizer.java:187)
 at org.apache.pig.PigServer.compileLp(PigServer.java:854)
 at org.apache.pig.PigServer.compileLp(PigServer.java:791)
 at org.apache.pig.PigServer.store(PigServer.java:509)
 ... 7 more
 =

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-911) [Piggybank] SequenceFileLoader

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-911:
--

Assignee: Dmitriy V. Ryaboy

 [Piggybank] SequenceFileLoader 
 ---

 Key: PIG-911
 URL: https://issues.apache.org/jira/browse/PIG-911
 Project: Pig
  Issue Type: New Feature
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.5.0

 Attachments: pig_911.2.patch, pig_sequencefile.patch


 The proposed piggybank contribution adds a SequenceFileLoader to the 
 piggybank.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-905) TOKENIZE throws exception on null data

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-905:
--

Assignee: Daniel Dai

 TOKENIZE throws exception on null data
 --

 Key: PIG-905
 URL: https://issues.apache.org/jira/browse/PIG-905
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Olga Natkovich
Assignee: Daniel Dai
 Fix For: 0.4.0

 Attachments: PIG-905-1.patch, PIG-905-2.patch, PIG-905-3.patch


 it should just return null

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-895) Default parallel for Pig

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-895:
--

Assignee: Daniel Dai

 Default parallel for Pig
 

 Key: PIG-895
 URL: https://issues.apache.org/jira/browse/PIG-895
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.3.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.4.0

 Attachments: PIG-895-1.patch, PIG-895-2.patch, PIG-895-3.patch


 For hadoop 20, if the user doesn't specify the number of reducers, hadoop will use 
 1 reducer as the default value. This is different from previous versions of hadoop, 
 in which the default reducer number is usually good. 1 reducer is surely not what 
 the user wants. Although the user can use the parallel keyword to specify the number 
 of reducers for each statement, it is wordy. We need a convenient way for users 
 to express a desired number of reducers. Here is my proposal:
 1. Add one property default_parallel to Pig. The user can set default_parallel 
 in the script. Eg:
set default_parallel 10;
 2. default_parallel is a hint to Pig. Pig is free to optimize the number of 
 reducers (unlike the parallel keyword). Currently, since we do not have a 
 mechanism to determine the optimal number of reducers, default_parallel will 
 always be granted, unless it is overridden by the parallel keyword.
 3. If the user puts multiple default_parallel statements inside a script, the 
 last entry will be taken.
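 For example, the proposal would let a script set a script-wide default and 
 still override it per statement (a hypothetical sketch based on the proposal 
 above, not the attached patch):
 {code}
 set default_parallel 10;
 A = load 'mydata';
 B = group A by $0;              -- uses the default of 10 reducers
 C = group A by $1 parallel 20;  -- explicit parallel keyword overrides the default
 store B into 'out1';
 store C into 'out2';
 {code}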

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-890) Create a sampler interface and improve the skewed join sampler

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-890:
--

Assignee: Sriranjan Manjunath

 Create a sampler interface and improve the skewed join sampler
 --

 Key: PIG-890
 URL: https://issues.apache.org/jira/browse/PIG-890
 Project: Pig
  Issue Type: Improvement
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Fix For: 0.4.0

 Attachments: samplerinterface.patch


 We need a different sampler for order by and skewed join. We thus need a 
 better sampling interface. The design of the same is described here: 
 http://wiki.apache.org/pig/PigSampler

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-837) docs ant target is broken

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-837:
--

Assignee: Olga Natkovich

 docs ant target is broken 
 --

 Key: PIG-837
 URL: https://issues.apache.org/jira/browse/PIG-837
 Project: Pig
  Issue Type: Bug
Reporter: Giridharan Kesavan
Assignee: Olga Natkovich

 The docs ant target is broken; this would fail the trunk builds:
  [exec] Java Result: 1
  [exec] 
  [exec]   Copying broken links file to site root.
  [exec]   
  [exec] Copying 1 file to 
 /home/hudson/hudson-slave/workspace/Pig-Patch-minerva.apache.org/trunk/src/docs/build/site
  [exec] 
  [exec] BUILD FAILED
  [exec] /home/nigel/tools/forrest/latest/main/targets/site.xml:180: Error 
 building site.
  [exec] 
  [exec] There appears to be a problem with your site build.
  [exec] 
  [exec] Read the output above:
  [exec] * Cocoon will report the status of each document:
  [exec] - in column 1: *=okay X=brokenLink ^=pageSkipped (see FAQ).
  [exec] * Even if only one link is broken, you will still get failed.
  [exec] * Your site would still be generated, but some pages would be 
 broken.
  [exec]   - See 
 /home/hudson/hudson-slave/workspace/Pig-Patch-minerva.apache.org/trunk/src/docs/build/site/broken-links.xml
  [exec] 
  [exec] Total time: 28 seconds
 BUILD FAILED
 /home/hudson/hudson-slave/workspace/Pig-Patch-minerva.apache.org/trunk/build.xml:326:
  exec returned: 1

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-830) Port Apache Log parsing piggybank contrib to Pig 0.2

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-830:
--

Assignee: Dmitriy V. Ryaboy

 Port Apache Log parsing piggybank contrib to Pig 0.2
 

 Key: PIG-830
 URL: https://issues.apache.org/jira/browse/PIG-830
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.2.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
 Fix For: 0.3.0

 Attachments: pig-830-v2.patch, pig-830-v3.patch, pig-830.patch, 
 TEST-org.apache.pig.piggybank.test.storage.TestMyRegExLoader.txt


 The piggybank contribs (pig-472, pig-473,  pig-474, pig-476, pig-486, 
 pig-487, pig-488, pig-503, pig-509) got dropped after the types branch was 
 merged in.
 They should be updated to work with the current APIs and added back into 
 trunk.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1087) Use Pig's version for Zebra's own version.

2009-11-11 Thread Chao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Wang updated PIG-1087:
---

Attachment: patch_Pig1087

 Use Pig's version for Zebra's own version.
 --

 Key: PIG-1087
 URL: https://issues.apache.org/jira/browse/PIG-1087
 Project: Pig
  Issue Type: Task
Affects Versions: 0.4.0
Reporter: Chao Wang
Assignee: Chao Wang
 Fix For: 0.6.0

 Attachments: patch_Pig1087


 Zebra is a contrib project of Pig now. It should use Pig's version for its 
 own version. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-825) PIG_HADOOP_VERSION should be 18

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-825:
--

Assignee: Dmitriy V. Ryaboy

 PIG_HADOOP_VERSION should be 18
 ---

 Key: PIG-825
 URL: https://issues.apache.org/jira/browse/PIG-825
 Project: Pig
  Issue Type: Bug
  Components: grunt
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.3.0

 Attachments: pig-825.patch, pig-825.patch


 PIG_HADOOP_VERSION should be set to 18, not 17, as Hadoop 0.18 is now 
 considered default.
 Patch coming.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1087) Use Pig's version for Zebra's own version.

2009-11-11 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776786#action_12776786
 ] 

Yan Zhou commented on PIG-1087:
---

+1

 Use Pig's version for Zebra's own version.
 --

 Key: PIG-1087
 URL: https://issues.apache.org/jira/browse/PIG-1087
 Project: Pig
  Issue Type: Task
Affects Versions: 0.4.0
Reporter: Chao Wang
Assignee: Chao Wang
 Fix For: 0.6.0

 Attachments: patch_Pig1087


 Zebra is a contrib project of Pig now. It should use Pig's version for its 
 own version. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1087) Use Pig's version for Zebra's own version.

2009-11-11 Thread Chao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Wang updated PIG-1087:
---

Status: Patch Available  (was: Open)

 Use Pig's version for Zebra's own version.
 --

 Key: PIG-1087
 URL: https://issues.apache.org/jira/browse/PIG-1087
 Project: Pig
  Issue Type: Task
Affects Versions: 0.4.0
Reporter: Chao Wang
Assignee: Chao Wang
 Fix For: 0.6.0

 Attachments: patch_Pig1087


 Zebra is a contrib project of Pig now. It should use Pig's version for its 
 own version. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-796) support conversion from numeric types to chararray

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-796:
--

Assignee: Ashutosh Chauhan

 support  conversion from numeric types to chararray
 ---

 Key: PIG-796
 URL: https://issues.apache.org/jira/browse/PIG-796
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
Assignee: Ashutosh Chauhan
 Fix For: 0.3.0

 Attachments: 796.patch, pig-796.patch, pig-796.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-795:
--

Assignee: Eric Gaudet

 Command that selects a random sample of the rows, similar to LIMIT
 --

 Key: PIG-795
 URL: https://issues.apache.org/jira/browse/PIG-795
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Eric Gaudet
Assignee: Eric Gaudet
Priority: Trivial
 Attachments: sample2.diff, sample3.diff


 When working with very large data sets (imagine that!), running a pig script 
 can take time. It may be useful to run on a small subset of the data in some 
 situations (eg: debugging / testing, or to get fast results even if less 
 accurate.) 
 The command LIMIT N selects the first N rows of the data, but these are not 
 necessarily randomized. A command SAMPLE X would retain each row only with 
 probability x%.
 Note: it is possible to implement this feature with FILTER BY and a UDF, but 
 so is LIMIT, and LIMIT is built-in.
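 Under the proposed syntax, usage might look like the following (a hypothetical 
 sketch; the exact keyword spelling and whether the argument is a percentage or 
 a fraction are assumptions, not the attached patch):
 {code}
 A = load 'studenttab10k';
 B = sample A 0.01;  -- keep each row with probability 1%
 dump B;
 {code}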

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-792) PERFORMANCE: Support skewed join in pig

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-792:
--

Assignee: Sriranjan Manjunath

 PERFORMANCE: Support skewed join in pig
 ---

 Key: PIG-792
 URL: https://issues.apache.org/jira/browse/PIG-792
 Project: Pig
  Issue Type: Improvement
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: skewedjoin.patch


 Fragmented replicated join has a few limitations:
  - One of the tables needs to be loaded into memory
  - Join is limited to two tables
 Skewed join partitions the table and joins the records in the reduce phase. 
 It computes a histogram of the key space to account for skewing in the input 
 records. Further, it adjusts the number of reducers depending on the key 
 distribution.
 We need to implement the skewed join in pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-782) javadoc throws warnings - this would break hudson patch test process.

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-782:
--

Assignee: Santhosh Srinivasan

 javadoc throws warnings - this would break hudson patch test process.
 -

 Key: PIG-782
 URL: https://issues.apache.org/jira/browse/PIG-782
 Project: Pig
  Issue Type: Bug
 Environment: javadoc throws warnings - this would break the hudson 
 patch test process.
Reporter: Giridharan Kesavan
Assignee: Santhosh Srinivasan

   [javadoc] 
 /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:233:
  warning - @return tag has no arguments.
   [javadoc] 
 /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:205:
  warning - @return tag has no arguments.
   [javadoc] 
 /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:185:
  warning - @return tag has no arguments.
   [javadoc] 
 /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:220:
  warning - @return tag has no arguments.
   [javadoc] 
 /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:158:
  warning - @return tag has no arguments.
   [javadoc] 
 /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:134:
  warning - @return tag has no arguments.
   [javadoc] 
 /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:105:
  warning - @return tag has no arguments.
   [javadoc] 
 /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:92:
  warning - @return tag has no arguments.
   [javadoc] 
 /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:120:
  warning - @return tag has no arguments.
   [javadoc] 
 /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:48:
  warning - @return tag has no arguments.
   [javadoc] 
 /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:77:
  warning - @return tag has no arguments.
   [javadoc] 
 /home/gkesavan/pig-trunk/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java:92:
  warning - @param argument names is not a parameter name.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-781) Error reporting for failed MR jobs

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-781:
--

Assignee: Gunther Hagleitner

 Error reporting for failed MR jobs
 --

 Key: PIG-781
 URL: https://issues.apache.org/jira/browse/PIG-781
 Project: Pig
  Issue Type: Improvement
Reporter: Gunther Hagleitner
Assignee: Gunther Hagleitner
 Fix For: 0.3.0

 Attachments: partial_failure.patch, partial_failure.patch, 
 partial_failure.patch, partial_failure.patch


 If we have multiple MR jobs to run and some of them fail, the behavior of the 
 system is to not stop on the first failure but to keep going. That way, jobs 
 that do not depend on the failed job might still succeed.
 The question is how best to report this scenario to a user. How do we tell 
 which jobs failed and which didn't?
 One way could be to tie jobs to stores and report which store locations won't 
 have data and which ones will.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-745) Please add DataTypes.toString() conversion function

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-745:
--

Assignee: David Ciemiewicz

 Please add DataTypes.toString() conversion function
 ---

 Key: PIG-745
 URL: https://issues.apache.org/jira/browse/PIG-745
 Project: Pig
  Issue Type: Improvement
Reporter: David Ciemiewicz
Assignee: David Ciemiewicz
 Fix For: 0.3.0

 Attachments: PIG-745.patch


 I'm doing some work on string manipulation UDFs and I've found that it would 
 be very convenient if I could always convert the argument to a chararray 
 (internally a Java String).
 For example, TOLOWERCASE(arg) shouldn't really care whether arg is a 
 bytearray, chararray, int, long, double, or float; it should be treated as a 
 string and operated on.
 The simplest and most foolproof method would be if DataTypes added a 
 static function DataTypes.toString which did all of the argument type 
 checking and provided consistent translation.
 I believe that this function might be coded as:
 {code}
 public static String toString(Object o) throws ExecException {
     try {
         switch (findType(o)) {
         case BOOLEAN:
             if (((Boolean)o) == true) return "1";
             else return "0";
         case BYTE:
             return ((Byte)o).toString();
         case INTEGER:
             return ((Integer)o).toString();
         case LONG:
             return ((Long)o).toString();
         case FLOAT:
             return ((Float)o).toString();
         case DOUBLE:
             return ((Double)o).toString();
         case BYTEARRAY:
             return ((DataByteArray)o).toString();
         case CHARARRAY:
             return (String)o;
         case NULL:
             return null;
         case MAP:
         case TUPLE:
         case BAG:
         case UNKNOWN:
         default:
             int errCode = 1071;
             String msg = "Cannot convert a " + findTypeName(o) +
                 " to a String";
             throw new ExecException(msg, errCode, PigException.INPUT);
         }
     } catch (ExecException ee) {
         throw ee;
     } catch (Exception e) {
         int errCode = 2054;
         String msg = "Internal error. Could not convert " + o +
             " to String.";
         throw new ExecException(msg, errCode, PigException.BUG);
     }
 }
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-753) Provide support for UDFs without parameters

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-753:
--

Assignee: Jeff Zhang

 Provide support for UDFs without parameters
 ---

 Key: PIG-753
 URL: https://issues.apache.org/jira/browse/PIG-753
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.3.0
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Fix For: 0.4.0

 Attachments: Pig_753_Patch.txt


 Pig does not support UDFs without parameters; it forces me to provide one.
 For example, the following statement will generate an error:
  B = FOREACH A GENERATE bagGenerator();
 I have to provide a parameter like the following:
  B = FOREACH A GENERATE bagGenerator($0);
 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-833) Storage access layer

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-833:
--

Assignee: Raghu Angadi

 Storage access layer
 

 Key: PIG-833
 URL: https://issues.apache.org/jira/browse/PIG-833
 Project: Pig
  Issue Type: New Feature
Reporter: Jay Tang
Assignee: Raghu Angadi
 Fix For: 0.4.0

 Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, 
 PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, 
 TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz


 A layer is needed to provide a high level data access abstraction and a 
 tabular view of data in Hadoop, and could free Pig users from implementing 
 their own data storage/retrieval code.  This layer should also include a 
 columnar storage format in order to provide fast data projection, 
 CPU/space-efficient data serialization, and a schema language to manage 
 physical storage metadata.  Eventually it could also support predicate 
 pushdown for further performance improvement.  Initially, this layer could be 
 a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1089) Pig 0.6.0 Documentation

2009-11-11 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776789#action_12776789
 ] 

Dmitriy V. Ryaboy commented on PIG-1089:


I am not sure where this goes, but can we add a bit about using the 
pig.logfile property in the pig.properties file to control where Pig logs get 
written? It takes a directory or a filename on the local system, and defaults to 
the current working directory.

 Pig 0.6.0 Documentation
 ---

 Key: PIG-1089
 URL: https://issues.apache.org/jira/browse/PIG-1089
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.6.0
Reporter: Corinne Chandel
Assignee: Corinne Chandel
Priority: Blocker
 Fix For: 0.6.0

 Attachments: Pig-6-Beta.patch


 Pig 0.6.0 documentation:
  Ability to use Hadoop dfs commands from Pig
  Replicated left outer join
  Skewed outer join
  Map-side group
  Accumulate Interface for UDFs
  Improved Memory Mgt
  Integration with Zebra

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-732) Utility UDFs

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-732:
--

Assignee: Ankur

 Utility UDFs 
 -

 Key: PIG-732
 URL: https://issues.apache.org/jira/browse/PIG-732
 Project: Pig
  Issue Type: New Feature
Reporter: Ankur
Assignee: Ankur
Priority: Minor
 Attachments: udf.v1.patch, udf.v2.patch, udf.v3.patch, udf.v4.patch, 
 udf.v5.patch


 Two utility UDFs and their respective test cases.
 1. TopN - Accepts the number of tuples (N) to retain in the output, the field 
 number (type long) to use for comparison, and a sorted/unsorted bag of tuples. 
 It outputs a bag containing the top N tuples.
 2. SearchQuery - Accepts an encoded URL from any of the 4 search engines 
 (Yahoo, Google, AOL, Live) and extracts and normalizes the search query 
 present in it.
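 A hypothetical usage sketch of the TopN UDF described above (the exact 
 signature and argument order are assumptions based on the description, not the 
 attached patch):
 {code}
 register piggybank.jar;
 A = load 'queries' as (query:chararray, count:long);
 B = group A by query;
 C = foreach B generate group, TopN(5, 1, A);  -- top 5 tuples per bag, by field 1
 {code}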

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-715) Remove 2 doc files: hello.pdf and overview.html

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-715:
--

Assignee: Corinne Chandel

 Remove 2 doc files: hello.pdf and overview.html
 ---

 Key: PIG-715
 URL: https://issues.apache.org/jira/browse/PIG-715
 Project: Pig
  Issue Type: Bug
  Components: documentation
Reporter: Corinne Chandel
Assignee: Corinne Chandel
Priority: Minor

 Please remove these 2 doc files. They don't belong with the Pig 2.0 
 documentation and will cause confusion.
 (1) hello.pdf ... located in:
 trunk/src/docs/src/documentation/content/xdocs
 (2) overview.html ... located in:
 trunk/docs

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-713) Autocompletion doesn't complete aliases

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-713:
--

Assignee: Eric Gaudet

 Autocompletion doesn't complete aliases
 ---

 Key: PIG-713
 URL: https://issues.apache.org/jira/browse/PIG-713
 Project: Pig
  Issue Type: New Feature
  Components: grunt
Reporter: Eric Gaudet
Assignee: Eric Gaudet
Priority: Minor
 Fix For: 0.3.0

 Attachments: alias_completion.patch


 Autocompletion only knows about keywords, but in different contexts, it would 
 be nice if it completed aliases where an alias is expected.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-712) Need utilities to create schemas for bags and tuples

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-712:
--

Assignee: Jeff Zhang

 Need utilities to create schemas for bags and tuples
 

 Key: PIG-712
 URL: https://issues.apache.org/jira/browse/PIG-712
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Santhosh Srinivasan
Assignee: Jeff Zhang
Priority: Minor
 Fix For: 0.3.0

 Attachments: Pig_712_Patch.txt


 Pig should provide utilities to create bag and tuple schemas. Currently, 
 users return schemas in the outputSchema method and end up with very verbose 
 boilerplate code. It would be very nice if Pig encapsulated the boilerplate 
 code in utility methods.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-704) Interactive mode doesn't list defined aliases

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-704:
--

Assignee: Eric Gaudet

 Interactive mode doesn't list defined aliases
 -

 Key: PIG-704
 URL: https://issues.apache.org/jira/browse/PIG-704
 Project: Pig
  Issue Type: Improvement
  Components: grunt
Reporter: Eric Gaudet
Assignee: Eric Gaudet
Priority: Trivial
 Fix For: 0.2.0

 Attachments: aliases_last.patch, aliases_last2.patch


 I'm using the interactive mode to test my scripts, and I'm struggling to keep 
 track of 2 things:
 1) the aliases. A typical test script has 10 aliases or more. As the test 
 goes on, different versions are created, or aliases are created with typos. 
 There's no command in grunt to get the list of defined aliases. Proposed 
 solution: add a new command aliases that prints the list of aliases.
 2) I prefer to give meaningful (long) names to my aliases. But as I try 
 different things, I find it hard to predict what the schema will look like, 
 so I use DESCRIBE a lot. It's a pain to type these long names all the time, 
 especially since most of the time I only want to describe the last alias 
 created. A shortcut to the describe command for the last-created alias would 
 be very useful. Proposed solution: use the special name _ as a shortcut to 
 the last created alias: DESCRIBE _.




[jira] Assigned: (PIG-703) Pig trunk/src/docs folders and files for forrest xml doc builds

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-703:
--

Assignee: Corinne Chandel

 Pig trunk/src/docs folders and files for forrest xml doc builds
 

 Key: PIG-703
 URL: https://issues.apache.org/jira/browse/PIG-703
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: site
Reporter: Corinne Chandel
Assignee: Corinne Chandel
 Fix For: site

 Attachments: logos-1.zip, trunk-1.patch


 Add src/docs directory folders and files to trunk branch. 
 Patch includes:
 src/docs ... forrest.properties
 src/docs/src/documentation ...  skinconf.xml
 src/docs/src/documentation/content/xdocs ... doc files
 src/docs/src/documentation/content/xdocs/images ... image files 
 Please review.




[jira] Assigned: (PIG-692) when running script file, automatically set up job name based on the file name

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-692:
--

Assignee: Vadim Zaliva

 when running script file, automatically set up job name based on the file name
 --

 Key: PIG-692
 URL: https://issues.apache.org/jira/browse/PIG-692
 Project: Pig
  Issue Type: Improvement
  Components: tools
Affects Versions: 0.2.0
Reporter: Vadim Zaliva
Assignee: Vadim Zaliva
Priority: Trivial
 Fix For: 0.2.0

 Attachments: pig-job-name.patch


 When running a Pig script from the command line, like this:
 pig scriptfile
 the default job name is currently used. It would be convenient to set the 
 job name automatically based on the script name.




[jira] Assigned: (PIG-623) Fix spelling errors in output messages

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-623:
--

Assignee: Tom White

 Fix spelling errors in output messages
 --

 Key: PIG-623
 URL: https://issues.apache.org/jira/browse/PIG-623
 Project: Pig
  Issue Type: Improvement
Reporter: Tom White
Assignee: Tom White
Priority: Trivial
 Attachments: pig-623.patch







[jira] Assigned: (PIG-620) find Max Tuple by 1st field UDF (for piggybank)

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-620:
--

Assignee: Vadim Zaliva

 find Max Tuple by 1st field UDF (for piggybank)
 ---

 Key: PIG-620
 URL: https://issues.apache.org/jira/browse/PIG-620
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.2.0
Reporter: Vadim Zaliva
Assignee: Vadim Zaliva
 Fix For: 0.2.0

 Attachments: MaxTupleBy1stField.java


 This is a simple UDF which takes a bag of tuples and returns the one with 
 the maximum first column.
 It is fairly trivial, but I have seen people asking for it. Detailed usage 
 comments are in the Javadoc.
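For readers without the attachment, the core of such a UDF is a single pass over the bag that keeps the tuple with the largest first field. A self-contained sketch of that logic (plain Java lists stand in here for Pig's DataBag/Tuple types, which the real MaxTupleBy1stField.java would use):

```java
import java.util.Arrays;
import java.util.List;

public class MaxTupleBy1stField {
    // Stand-in for the UDF's exec(): scan the bag once, keep the tuple
    // whose first field is largest. Returns null for an empty bag.
    static List<Object> maxByFirstField(List<List<Object>> bag) {
        List<Object> max = null;
        for (List<Object> tuple : bag) {
            long key = ((Number) tuple.get(0)).longValue();
            if (max == null || key > ((Number) max.get(0)).longValue()) {
                max = tuple;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        List<List<Object>> bag = Arrays.asList(
                Arrays.<Object>asList(3L, "c"),
                Arrays.<Object>asList(7L, "a"),
                Arrays.<Object>asList(5L, "b"));
        System.out.println(maxByFirstField(bag)); // the tuple starting with 7
    }
}
```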




[jira] Updated: (PIG-1087) Use Pig's version for Zebra's own version.

2009-11-11 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1087:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

patch committed. thanks Chao!

 Use Pig's version for Zebra's own version.
 --

 Key: PIG-1087
 URL: https://issues.apache.org/jira/browse/PIG-1087
 Project: Pig
  Issue Type: Task
Affects Versions: 0.4.0
Reporter: Chao Wang
Assignee: Chao Wang
 Fix For: 0.6.0

 Attachments: patch_Pig1087


 Zebra is a contrib project of Pig now. It should use Pig's version for its 
 own version. 




[jira] Assigned: (PIG-622) Include pig executable in distribution

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-622:
--

Assignee: Tom White

 Include pig executable in distribution
 --

 Key: PIG-622
 URL: https://issues.apache.org/jira/browse/PIG-622
 Project: Pig
  Issue Type: Bug
Reporter: Tom White
Assignee: Tom White
 Attachments: pig-622.patch


 Running ant tar does not generate the bin directory with the pig executable 
 in it.




[jira] Assigned: (PIG-592) schema inferred incorrectly

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-592:
--

Assignee: Daniel Dai

 schema inferred incorrectly
 ---

 Key: PIG-592
 URL: https://issues.apache.org/jira/browse/PIG-592
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Christopher Olston
Assignee: Daniel Dai
 Fix For: 0.6.0

 Attachments: PIG-592-1.patch, PIG-592-2.patch, PIG-592-3.patch


 A simple Pig script that never introduces any schema information:
 A = load 'foo';
 B = foreach (group A by $8) generate group, COUNT($1);
 C = load 'bar';   -- ('bar' has two columns)
 D = join B by $0, C by $0;
 E = foreach D generate $0, $1, $3;
 Fails, complaining that $3 does not exist:
 java.io.IOException: Out of bound access. Trying to access non-existent 
 column: 3. Schema {B::group: bytearray,long,bytearray} has 3 column(s).
 Apparently Pig gets confused and thinks it knows the schema for C (a single 
 bytearray column).




[jira] Assigned: (PIG-595) Use of Combiner causes java.lang.ClassCastException in ForEach

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-595:
--

Assignee: Viraj Bhat

 Use of Combiner causes java.lang.ClassCastException in ForEach
 --

 Key: PIG-595
 URL: https://issues.apache.org/jira/browse/PIG-595
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
Reporter: Viraj Bhat
Assignee: Viraj Bhat
 Attachments: querypairs.txt


 The following Pig script causes a ClassCastException when QueryPairs is used 
 in the ForEach statement. This is due to the use of the combiner.
 {code}
 QueryPairs = load 'querypairs.txt' using PigStorage()  as ( q1: chararray, 
 q2: chararray );
 describe QueryPairs;
 QueryPairsGrouped = group QueryPairs by ( q1 );
 describe QueryPairsGrouped;
 QueryGroups = foreach QueryPairsGrouped generate
 group as q1,
 COUNT(QueryPairs)   as paircount,
 QueryPairs as QueryPairs;
 describe QueryGroups;
 dump QueryGroups;
 {code}
 =
 2008-12-31 15:01:48,713 [main] ERROR 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - Error 
 message from task (map) 
 task_200812151518_4922_m_00: java.lang.ClassCastException: 
 org.apache.pig.data.DefaultDataBag cannot be cast to org.apache.pig.data.Tuple
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:122)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:152)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:143)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:57)
 at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:904)
 at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:785)
 at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:228)
 at 
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
 =




[jira] Assigned: (PIG-546) FilterFunc calls empty constructor when it should be calling parameterized constructor

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-546:
--

Assignee: Santhosh Srinivasan

 FilterFunc calls empty constructor when it should be calling parameterized 
 constructor
 --

 Key: PIG-546
 URL: https://issues.apache.org/jira/browse/PIG-546
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
Reporter: Viraj Bhat
Assignee: Santhosh Srinivasan
 Fix For: 0.2.0

 Attachments: FILTERFROMFILE.java, insetfilterfile, mydata.txt, 
 PIG-546.patch


 The following piece of Pig script uses a custom UDF known as FILTERFROMFILE, 
 which extends FilterFunc. It contains two constructors: an empty constructor, 
 which is mandatory, and a parameterized constructor. The parameterized 
 constructor passes the HDFS filename, which the exec function uses to 
 construct a HashMap. The HashMap is later used for filtering records based 
 on the match criteria in the HDFS file.
 {code}
 register util.jar;
 --util.jar contains the FILTERFROMFILE class
 define FILTER_CRITERION util.FILTERFROMFILE('/user/viraj/insetfilterfile');
 RAW_LOGS = load 'mydata.txt' as (url:chararray, numvisits:int);
 FILTERED_LOGS = filter RAW_LOGS by FILTER_CRITERION(numvisits);
 dump FILTERED_LOGS;
 {code}
 When you execute the above script, it results in a map-only job with a 
 single map. It seems that the empty constructor is called 5 times, which 
 ultimately results in failure of the job.
 ===
 parameterized constructor: /user/viraj/insetfilterfile
 parameterized constructor: /user/viraj/insetfilterfile
 empty constructor
 empty constructor
 empty constructor
 empty constructor
 empty constructor
 ===
 Error in the Hadoop backend
 ===
 java.lang.IllegalArgumentException: Can not create a Path from an empty string
   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:82)
   at org.apache.hadoop.fs.Path.&lt;init&gt;(Path.java:90)
   at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.isContainer(HDataStorage.java:199)
   at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:130)
   at 
 org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:164)
   at util.FILTERFROMFILE.init(FILTERFROMFILE.java:70)
   at util.FILTERFROMFILE.exec(FILTERFROMFILE.java:89)
   at util.FILTERFROMFILE.exec(FILTERFROMFILE.java:52)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:179)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:217)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:148)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:170)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:158)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
   at 
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
 ===
 Attaching the sample data and the filter function UDF.
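The constructor behaviour matters because Pig may re-instantiate a UDF on the backend via the empty constructor, so state set only in the parameterized constructor is lost. A minimal self-contained sketch of the pattern (hypothetical class names; the real UDF extends Pig's FilterFunc and opens an HDFS file) shows why the empty-constructor instances fail with an empty path:

```java
public class FilterFromFileSketch {
    private String lookupFile;

    // Mandatory empty constructor: if the backend instantiates the UDF
    // this way, lookupFile stays empty and init() fails.
    public FilterFromFileSketch() {
        this.lookupFile = "";
    }

    // Parameterized constructor used on the front end via DEFINE.
    public FilterFromFileSketch(String lookupFile) {
        this.lookupFile = lookupFile;
    }

    // Stand-in for the exec-time init that opens the HDFS file.
    void init() {
        if (lookupFile.isEmpty()) {
            // Mirrors the reported "Can not create a Path from an empty string".
            throw new IllegalArgumentException("Can not create a Path from an empty string");
        }
    }

    public static void main(String[] args) {
        new FilterFromFileSketch("/user/viraj/insetfilterfile").init(); // ok
        try {
            new FilterFromFileSketch().init(); // reproduces the backend failure
        } catch (IllegalArgumentException e) {
            System.out.println("empty-constructor instance failed: " + e.getMessage());
        }
    }
}
```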




[jira] Assigned: (PIG-570) Large BZip files seem to lose data in Pig

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-570:
--

Assignee: Benjamin Reed

 Large BZip files seem to lose data in Pig
 ---

 Key: PIG-570
 URL: https://issues.apache.org/jira/browse/PIG-570
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.0.0, 0.1.0, 0.2.0, site
 Environment: Pig 0.1.1/Linux / 8 Nodes hadoop 0.18.2
Reporter: Alex Newman
Assignee: Benjamin Reed
 Fix For: 0.2.0

 Attachments: bzipTest.bz2, PIG-570.patch


 So I don't believe bzip2 input to Pig is working, at least not with large 
 files. It seems as though map files are getting cut off. The maps complete 
 far too quickly, and the actual row of data that Pig tries to process often 
 gets cut randomly and becomes incomplete. Here are my symptoms:
 - Maps seem to be completing at an unbelievably fast rate
 With uncompressed data:
 Status: Succeeded
 Started at: Wed Dec 17 21:31:10 EST 2008
 Finished at: Wed Dec 17 22:42:09 EST 2008
 Finished in: 1hrs, 10mins, 59sec
 Kind    % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed
 map     100.00%     4670       0        0        4670      0       0 / 21
 reduce  57.72%      13         0        0        13        0       0 / 4
 With bzip compressed data:
 Started at: Wed Dec 17 21:17:28 EST 2008
 Failed at: Wed Dec 17 21:17:52 EST 2008
 Failed in: 24sec
 Black-listed TaskTrackers: 2
 Kind    % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed
 map     100.00%     183        0        0        15        168     54 / 22
 reduce  100.00%     13         0        0        13        0       0 / 0
 The errors we get:
 java.lang.IndexOutOfBoundsException: Requested index 11 from tuple (rec
 A, 0HAW, CHIX, )
   at org.apache.pig.data.Tuple.getField(Tuple.java:176)
   at org.apache.pig.impl.eval.ProjectSpec.eval(ProjectSpec.java:84)
   at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:38)
   at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:223)
   at org.apache.pig.impl.eval.cond.CompCond.eval(CompCond.java:58)
   at org.apache.pig.impl.eval.FilterSpec$1.add(FilterSpec.java:60)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:117)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
   at 
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
 Last 4KB
 attempt_200812161759_0045_m_07_0  task_200812161759_0045_m_07 
 tsdhb06.factset.com FAILED  
 java.lang.IndexOutOfBoundsException: Requested index 11 from tuple (rec   
 A, CSGN, VTX, VTX, 0, 20080303, 90919, 380, 1543, 206002)
   at org.apache.pig.data.Tuple.getField(Tuple.java:176)
   at org.apache.pig.impl.eval.ProjectSpec.eval(ProjectSpec.java:84)
   at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:38)
   at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:223)
   at org.apache.pig.impl.eval.cond.CompCond.eval(CompCond.java:58)
   at org.apache.pig.impl.eval.FilterSpec$1.add(FilterSpec.java:60)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:117)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
   at 
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)




[jira] Assigned: (PIG-572) A PigServer.registerScript() method, which lets a client programmatically register a Pig Script.

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-572:
--

Assignee: Shubham Chopra

 A PigServer.registerScript() method, which lets a client programmatically 
 register a Pig Script.
 

 Key: PIG-572
 URL: https://issues.apache.org/jira/browse/PIG-572
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.2.0
Reporter: Shubham Chopra
Assignee: Shubham Chopra
Priority: Minor
 Fix For: 0.2.0

 Attachments: registerScript.patch


 A PigServer.registerScript() method, which lets a client programmatically 
 register a Pig Script.
 For example, say there's a script my_script.pig with the following content:
 a = load '/data/my_data.txt';
 b = filter a by $0 > '0';
 The function lets you use something like the following:
 pigServer.registerScript("my_script.pig");
 pigServer.registerQuery("c = foreach b generate $2, $3;");
 pigServer.store("c");




[jira] Assigned: (PIG-574) run command for grunt

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-574:
--

Assignee: Olga Natkovich

 run command for grunt
 -

 Key: PIG-574
 URL: https://issues.apache.org/jira/browse/PIG-574
 Project: Pig
  Issue Type: New Feature
  Components: grunt
Reporter: David Ciemiewicz
Assignee: Olga Natkovich
Priority: Minor
 Attachments: PIG-574.patch, run_command.patch, 
 run_command_params.patch, run_command_params_021109.patch


 This is a request for a "run file" command in grunt which will read a script 
 from the local file system and execute it interactively while in the grunt 
 shell.
 One of the things that slows down iterative development of large, complicated 
 Pig scripts that must operate on Hadoop fs data is that the edit-run-debug 
 cycle is slow, because I must wait to allocate a Hadoop-on-Demand (hod) 
 cluster for each iteration. I would prefer not to preallocate a cluster of 
 nodes (though I could).
 Instead, I'd like to have one window open and edit my Pig script using vim or 
 emacs, write it, and then type "run myscript.pig" at the grunt shell until I 
 get things right.
 I'm used to doing similar things with Oracle, MySQL, and R. 




[jira] Assigned: (PIG-627) PERFORMANCE: multi-query optimization

2009-11-11 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-627:
--

Assignee: Gunther Hagleitner

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
Assignee: Gunther Hagleitner
 Fix For: 0.3.0

 Attachments: doc-fix.patch, error_handling_0415.patch, 
 error_handling_0416.patch, file_cmds-0305.patch, fix_store_prob.patch, 
 merge-041409.patch, merge_741727_HEAD__0324.patch, 
 merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, 
 multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
 multiquery-phase2_0323.patch, multiquery-phase3_0423.patch, 
 multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, 
 multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, 
 non_reversible_store_load_dependencies_2.patch, 
 noop_filter_absolute_path_flag.patch, 
 noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch


 Currently, if your Pig script contains multiple stores and some shared 
 computation, Pig will execute several independent queries. For instance:
 A = load 'data' as (a, b, c);
 B = filter A by a > 5;
 store B into 'output1';
 C = group B by b;
 store C into 'output2';
 This script will result in a map-only job that generates output1, followed 
 by a map-reduce job that generates output2. As a result, the data is read, 
 parsed and filtered twice, which is unnecessary and costly.




[jira] Commented: (PIG-1088) change merge join and merge join indexer to work with new LoadFunc interface

2009-11-11 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776792#action_12776792
 ] 

Thejas M Nair commented on PIG-1088:


*Problem*: With the old load/store interface, the index created by 
MergeJoinIndexer consisted of tuples with join key(s), filename, offset. With 
the new load/store interface, the split index is available 
(RecordReader.getSplitIndex) instead of filename and offset. But there is no 
guarantee that split indexes are in sorted order of the file. If more than one 
split has tuples with the same join key in it, it is necessary to know which 
split needs to be read first.

*Proposal*: (thanks to Alan Gates)
We should add an interface to the list of load interfaces:

public interface LoadOrderedInput {
    WritableComparable getPosition();
}

If the load function implements this interface, it can then be used in a merge 
join. getPosition could be called in the map phase of the sampling MR job, and 
the tuples in the index would have the sort/join key(s) followed by the 
resulting value. This value would then be used when sorting the index in the 
reduce phase of the sampling MR job.

For LoadFuncs that use FileInputFormat, getPosition can return the following 
class:

public class TextInputOrder implements WritableComparable<TextInputOrder> {

    private String basename;  // basename of the file
    private long offset;      // offset at which this split starts

    public int compareTo(TextInputOrder other) {
        int rc = basename.compareTo(other.basename);
        if (rc == 0) rc = (offset < other.offset) ? -1
                        : (offset == other.offset) ? 0 : 1;
        return rc;
    }
}

This means that we would take the filenames sorted lexicographically (which will 
work for things like part-0, map-0, bucket001 (warehouse data), etc.) 
and then offsets into those files after that.
To make it easier for authors of new LoadFuncs to implement this interface, 
implementation of this interface for load functions that use FileInputFormat  
will be provided through an abstract base class. 
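To make the proposed ordering concrete, here is a self-contained, runnable version of the (basename, offset) comparison. It uses plain Comparable instead of Hadoop's WritableComparable so it compiles without Hadoop on the classpath; class and file names are illustrative only:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SplitOrderDemo {
    static class TextInputOrder implements Comparable<TextInputOrder> {
        final String basename; // basename of the file
        final long offset;     // offset at which this split starts

        TextInputOrder(String basename, long offset) {
            this.basename = basename;
            this.offset = offset;
        }

        // Lexicographic filename order first, then byte offset within the file.
        public int compareTo(TextInputOrder other) {
            int rc = basename.compareTo(other.basename);
            if (rc == 0) rc = (offset < other.offset) ? -1
                            : (offset == other.offset) ? 0 : 1;
            return rc;
        }

        public String toString() {
            return basename + "@" + offset;
        }
    }

    public static void main(String[] args) {
        // Splits arrive in arbitrary order; sorting by (basename, offset)
        // recovers the order in which the merge join must read them.
        List<TextInputOrder> splits = Arrays.asList(
                new TextInputOrder("part-00001", 0),
                new TextInputOrder("part-00000", 67108864),
                new TextInputOrder("part-00000", 0));
        Collections.sort(splits);
        System.out.println(splits);
    }
}
```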


 change merge join and merge join indexer to work with new LoadFunc interface
 

 Key: PIG-1088
 URL: https://issues.apache.org/jira/browse/PIG-1088
 Project: Pig
  Issue Type: Sub-task
Reporter: Thejas M Nair
Assignee: Thejas M Nair





