[jira] Updated: (PIG-794) Use Avro serialization in Pig

2010-09-02 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-794:
---

Attachment: AvroStorage_4.patch

Attached a patch that follows Doug's suggestion and extends GenericDatumReader 
and GenericDatumWriter.
However, it cannot handle InternalMap.
Doug, could you help look into what the problem is?

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
Assignee: Dmitriy V. Ryaboy
 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 AvroStorage_2.patch, AvroStorage_3.patch, AvroStorage_4.patch, AvroTest.java, 
 jackson-asl-0.9.4.jar, PIG-794.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1543) IsEmpty returns the wrong value after using LIMIT

2010-09-02 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905587#action_12905587
 ] 

Daniel Dai commented on PIG-1543:
-

test-patch result:

 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.

All tests pass

 IsEmpty returns the wrong value after using LIMIT
 -

 Key: PIG-1543
 URL: https://issues.apache.org/jira/browse/PIG-1543
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Justin Hu
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1543-1.patch


 1. Two input files:
 1a: limit_empty.input_a
 1
 1
 1
 1b: limit_empty.input_b
 2
 2
 2. The pig script: limit_empty.pig
 -- A contains only 1's  B contains only 2's
 A = load 'limit_empty.input_a' as (a1:int);
 B = load 'limit_empty.input_b' as (b1:int);
 C =COGROUP A by a1, B by b1;
 D = FOREACH C generate A, B, (IsEmpty(A)? 0:1), (IsEmpty(B)? 0:1), COUNT(A), 
 COUNT(B);
 store D into 'limit_empty.output/d';
 -- After the script is done, we see the right results:
 -- {(1),(1),(1)}   {}  1   0   3   0
 -- {} {(2),(2)}  0   1   0   2
 C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; }
 D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? 
 0:1), COUNT(Alim), COUNT(Blim);
 store D1 into 'limit_empty.output/d1';
 -- After the script is done, we see the unexpected results:
 -- {(1)}   {}1   1   1   0
 -- {}  {(2)} 1   1   0   1
 dump D;
 dump D1;
 3. Run the script and redirect stdout (the two dumps) to a file. There are 
 two issues:
 The major one:
 IsEmpty() returns FALSE for an empty bag in limit_empty.output/d1/*, while 
 IsEmpty() returns the correct value in limit_empty.output/d/*.
 The difference is that, in the first case, LIMIT was applied before 
 IsEmpty().
 The minor one:
 The redirected output only contains the first dump:
 ({(1),(1),(1)},{},1,0,3L,0L)
 ({},{(2),(2)},0,1,0L,2L)
 We expect two more lines like:
 ({(1)},{},1,1,1L,0L)
 ({},{(2)},1,1,0L,1L)
 In addition, the following error is reported:
 [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - 
 java.lang.ClassCastException: java.lang.Integer cannot be cast to 
 org.apache.pig.data.Tuple
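
To make the expected D1 values concrete, here is a minimal plain-Java model of 
the nested LIMIT followed by IsEmpty and COUNT. It is only a sketch: it does 
not use any Pig classes, bags are modelled as lists, and the class and method 
names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model (not Pig code) of the per-group values D1 should contain. */
final class LimitIsEmptyModel {
    /** Models LIMIT inside a nested FOREACH: keep at most n tuples of the bag. */
    static List<Integer> limit(List<Integer> bag, int n) {
        return new ArrayList<>(bag.subList(0, Math.min(n, bag.size())));
    }

    /** Models the script's expression (IsEmpty(bag) ? 0 : 1). */
    static int nonEmptyFlag(List<Integer> bag) {
        return bag.isEmpty() ? 0 : 1;
    }

    /** Returns {flag(Alim), flag(Blim), COUNT(Alim), COUNT(Blim)} for one group. */
    static int[] row(List<Integer> a, List<Integer> b) {
        List<Integer> alim = limit(a, 1);
        List<Integer> blim = limit(b, 1);
        return new int[] {nonEmptyFlag(alim), nonEmptyFlag(blim), alim.size(), blim.size()};
    }
}
```

Under this model, limiting an empty bag leaves it empty, so the flag for the 
empty side of each group is 0 rather than the 1 seen in limit_empty.output/d1.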




[jira] Updated: (PIG-1550) better error handling in casting relations to scalars

2010-09-02 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1550:
---

Status: Patch Available  (was: Open)

 better error handling in casting relations to scalars
 -

 Key: PIG-1550
 URL: https://issues.apache.org/jira/browse/PIG-1550
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1550.1.patch


 I ran the following script:
 Input data:
 joe 100
 sam 20
 bob 134
 Script:
 A = load 'user_clicks' as (user: chararray, clicks: int);
 B = group A by user;
 C = foreach B generate group, SUM(A.clicks);
 D = foreach A generate clicks/(double)C.$1;
 dump C;
 Since C contains more than one tuple, I expected an error, and I got one.
 However, the error was not very clear. When the job failed, I did see a valid 
 error, although it lacked an error code:
 210630 [main] ERROR org.apache.pig.tools.pigstats.PigStats  - ERROR 0: Scalar 
 has more than one row in the output
 However, at the end of processing, I saw a misleading error:
 210709 [main] ERROR org.apache.pig.tools.grunt.Grunt  - ERROR 2088: Unable to 
 get results for: 
 hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/tmp/temp818551960/tmp1063730945:org.apache.pig.impl.io.InterStorage
 10/08/19 17:16:22 ERROR grunt.Grunt: ERROR 2088: Unable to get results for: 
 hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/tmp/temp818551960/tmp1063730945:org.apache.pig.impl.io.InterStorage
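
As an illustration of the kind of check being asked for, here is a minimal 
Java sketch that treats a relation as a scalar and fails early with one clear 
message when it has more than one row. This is not Pig's implementation: the 
ScalarCheck class, the asScalar method, and the choice to return null for an 
empty relation are all assumptions made for the example.

```java
import java.util.List;

/** Hypothetical helper (not a Pig class): reads a relation as a single scalar value. */
final class ScalarCheck {
    static <T> T asScalar(String alias, List<T> rows) {
        if (rows.size() > 1) {
            // Fail once, with the real cause, instead of a later "unable to get results".
            throw new IllegalStateException(
                "Scalar has more than one row in the output (relation " + alias
                    + " has " + rows.size() + " rows)");
        }
        // Returning null for an empty relation is an illustrative choice only.
        return rows.isEmpty() ? null : rows.get(0);
    }
}
```

With the input data above, C would hold one row per user (three rows), so a 
call such as asScalar("C", rows) would throw instead of returning a value.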




[jira] Updated: (PIG-1550) better error handling in casting relations to scalars

2010-09-02 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1550:
---

Attachment: PIG-1550.1.patch

PIG-1550.1.patch
test-patch has succeeded. Unit tests are still running.
 [exec] +1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec]
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.





[jira] Assigned: (PIG-1548) Optimize scalar to consolidate the part file

2010-09-02 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair reassigned PIG-1548:
--

Assignee: Richard Ding  (was: Thejas M Nair)

 Optimize scalar to consolidate the part file
 

 Key: PIG-1548
 URL: https://issues.apache.org/jira/browse/PIG-1548
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Daniel Dai
Assignee: Richard Ding
 Fix For: 0.8.0


 The current scalar implementation writes a scalar file to DFS. When Pig 
 needs the scalar, it opens the DFS file directly. Each scalar file 
 contains more than one part file even though it holds only one record, which 
 puts a heavy load on the namenode. We should consolidate the part files 
 before opening them. Another optional step is to put the consolidated file 
 into the distributed cache, which would further reduce the load on the 
 namenode.
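
The consolidation step can be sketched as follows. This is a stand-in that 
works on the local filesystem with java.nio rather than on DFS through the 
Hadoop API, and the PartFileMerger class and the "part-" name prefix are 
assumptions for illustration, not code from any patch.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical local-filesystem stand-in for consolidating part files. */
final class PartFileMerger {
    /** Concatenates every part-* file in dir, in name order, into target. */
    static void consolidate(Path dir, Path target) throws IOException {
        List<Path> parts;
        try (Stream<Path> files = Files.list(dir)) {
            parts = files
                .filter(p -> p.getFileName().toString().startsWith("part-"))
                .sorted()
                .collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out); // append this part's bytes to the single output file
            }
        }
    }
}
```

A reader of the scalar then opens one file instead of one file per part, which 
is the reduction in namenode requests the description is after.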




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-09-02 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905612#action_12905612
 ] 

Dmitriy V. Ryaboy commented on PIG-794:
---

Doug and Scott will know better of course, but afaik, Avro doesn't support 
Object keys.

You can cheat and turn Object keys into strings by Base64-encoding their 
serialized representations. You would have to know to reverse the process 
when deserializing, though.

Or we can try to get rid of InternalMap.
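
A minimal sketch of the Base64 idea, for illustration only. It uses plain Java 
serialization as a stand-in for however the keys are actually serialized, and 
java.util.Base64, which needs Java 8 or later. The KeyCodec class and its 
method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;

/** Hypothetical codec: turns an Object map key into a String key and back. */
final class KeyCodec {
    /** Serializes the key and Base64-encodes the bytes so it can be a string map key. */
    static String encodeKey(Serializable key) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(key);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    /** Reverses encodeKey; the reader has to know the key was encoded this way. */
    static Object decodeKey(String encoded) throws IOException, ClassNotFoundException {
        byte[] raw = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        }
    }
}
```

The cost is the one noted above: nothing in the encoded string says it is an 
encoded object, so the writer and reader have to agree on the convention.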




[jira] Commented: (PIG-1544) proactive-spill bags should share the memory alloted for it

2010-09-02 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905628#action_12905628
 ] 

Olga Natkovich commented on PIG-1544:
-

I am going to take my previous comment back and say that we should make this 
work for UDFs as well. The main reason is that we don't have another way to 
make sure that UDFs do not run out of memory. One approach that Alan proposed 
was to have bags ask for memory when they are created and to put a central 
broker in charge of the memory pool. The details of this, or whether there is 
a better approach, still need to be thought through.

 proactive-spill bags should share the memory alloted for it
 ---

 Key: PIG-1544
 URL: https://issues.apache.org/jira/browse/PIG-1544
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair

 Initially, proactive-spill bags were designed for use in (co)group 
 (InternalCacheBag). They knew the total number of proactive bags that were 
 present and shared the memory limit specified by the property 
 pig.cachedbag.memusage.
 However, the two proactive bag implementations that were added later, 
 InternalDistinctBag and InternalSortedBag, are not aware of the actual number 
 of bags being used; their users always assume total-numbags = 3.
 This needs to be fixed so that all proactive-spill bags share the 
 memory limit.
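
One way to picture the central-broker idea from the comment above is the 
following Java sketch. It is an assumption about how such a broker could look, 
not a design anyone has agreed on: the BagMemoryBroker class, its methods, and 
the even split between live bags are all hypothetical.

```java
/** Hypothetical broker: divides one memory budget among all live proactive-spill bags. */
final class BagMemoryBroker {
    private final long totalBytes;
    private int liveBags;

    /** memUsageFraction plays the role of pig.cachedbag.memusage in this sketch. */
    BagMemoryBroker(long maxHeapBytes, double memUsageFraction) {
        this.totalBytes = (long) (maxHeapBytes * memUsageFraction);
    }

    /** Called when a bag is created. */
    synchronized void register() {
        liveBags++;
    }

    /** Called when a bag is no longer used. */
    synchronized void unregister() {
        if (liveBags > 0) {
            liveBags--;
        }
    }

    /** Current per-bag limit, based on the actual number of bags rather than a fixed 3. */
    synchronized long limitPerBag() {
        return liveBags == 0 ? totalBytes : totalBytes / liveBags;
    }
}
```

Each bag type would consult limitPerBag() when deciding whether to spill, so 
InternalDistinctBag and InternalSortedBag would see the same budget as 
InternalCacheBag.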




[jira] Updated: (PIG-1544) proactive-spill bags should share the memory alloted for it

2010-09-02 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1544:


 Assignee: Thejas M Nair
Fix Version/s: 0.9.0

 proactive-spill bags should share the memory alloted for it
 ---

 Key: PIG-1544
 URL: https://issues.apache.org/jira/browse/PIG-1544
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.9.0





[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-09-02 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905663#action_12905663
 ] 

Doug Cutting commented on PIG-794:
--

Some quick comments on the new patch:
  - you might define a java enum type for the union elements, using 
Enum#ordinal() for the union indexes
  - instead of name.equals(union), s.getType()==Type.UNION would be faster, 
but better yet would be to simply call read() recursively, since it already 
handles unions, no?
 - peekArray() can simply return null, and that might be faster
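
The first suggestion could look roughly like the sketch below. It is only an 
illustration: the UnionBranch enum and its branch order are made up here, and 
the order would have to match the order of the types in the actual union 
schema for ordinal() to be a valid union index.

```java
import java.util.Map;

/** Hypothetical enum whose declaration order mirrors the branches of the union schema. */
enum UnionBranch {
    NULL, BOOLEAN, INT, LONG, FLOAT, DOUBLE, STRING, MAP;

    /** Picks the branch for a datum; ordinal() of the result is the union index. */
    static UnionBranch of(Object datum) {
        if (datum == null) return NULL;
        if (datum instanceof Boolean) return BOOLEAN;
        if (datum instanceof Integer) return INT;
        if (datum instanceof Long) return LONG;
        if (datum instanceof Float) return FLOAT;
        if (datum instanceof Double) return DOUBLE;
        if (datum instanceof CharSequence) return STRING;
        if (datum instanceof Map) return MAP;
        throw new IllegalArgumentException("No union branch for " + datum.getClass());
    }
}
```

A writer would then pass UnionBranch.of(datum).ordinal() wherever it currently 
passes a hard-coded index, and a reader could map an index back with 
UnionBranch.values()[index].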






[jira] Updated: (PIG-1309) Sort Merge Cogroup

2010-09-02 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1309:


Summary: Sort Merge Cogroup  (was: Map-side Cogroup)

 Sort Merge Cogroup
 --

 Key: PIG-1309
 URL: https://issues.apache.org/jira/browse/PIG-1309
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0, 0.8.0

 Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, 
 PIG_1309_7.patch


 In the never-ending quest to make Pig go faster, we want to parallelize as 
 many relational operations as possible. It's already possible to do Group-by 
 (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira 
 is to add a map-side implementation of Cogroup in Pig. Details to follow.




[jira] Commented: (PIG-1550) better error handling in casting relations to scalars

2010-09-02 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905715#action_12905715
 ] 

Thejas M Nair commented on PIG-1550:


Unit tests have succeeded. Patch is ready for review.





[jira] Commented: (PIG-1550) better error handling in casting relations to scalars

2010-09-02 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905717#action_12905717
 ] 

Olga Natkovich commented on PIG-1550:
-

I will review the patch





[jira] Commented: (PIG-1550) better error handling in casting relations to scalars

2010-09-02 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905731#action_12905731
 ] 

Olga Natkovich commented on PIG-1550:
-

+1, looks good




[jira] Updated: (PIG-1550) better error handling in casting relations to scalars

2010-09-02 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1550:
---

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Patch committed to trunk and 0.8 branch.





[jira] Commented: (PIG-1334) Make pig artifacts available through maven

2010-09-02 Thread Scott Carey (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905736#action_12905736
 ] 

Scott Carey commented on PIG-1334:
--

This ticket is incomplete.
* It did not properly package javadoc.
* JUnit is marked as a runtime dependency, not as a test-time dependency.
* I suspect HBase is not a runtime dependency, but an 'optional' 
(non-transitive) or 'provided' dependency.

Should this be re-opened, or should I make a new ticket?


There is a -sources.jar that has java source and additionally other 
documentation, but no javadoc that I can find, and if it is in there it doesn't 
have the right folder structure.

A properly packaged Maven javadoc jar has a file structure like this:
https://repository.apache.org/content/repositories/public/org/apache/avro/avro/1.4.0-SNAPSHOT/avro-1.4.0-20100825.231911-4-javadoc.jar

When packaged properly, third party tools (IDEs like Eclipse) will 
automatically import the javadoc and java sources for the dependency, making 
them available in the IDE when coding or debugging.

 Make pig artifacts available through maven
 --

 Key: PIG-1334
 URL: https://issues.apache.org/jira/browse/PIG-1334
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: mvn-pig.patch, mvn_pig_2.patch, mvn_pig_3.patch, 
 mvn_pig_4.patch, mvn_pig_5.patch, mvn_pig_6.patch







[jira] Commented: (PIG-1334) Make pig artifacts available through maven

2010-09-02 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905744#action_12905744
 ] 

Richard Ding commented on PIG-1334:
---

Scott,

Please create a new Jira for this. Another follow-up jira (PIG-1562) has 
already been opened. 

-Richard




[jira] Updated: (PIG-1548) Optimize scalar to consolidate the part file

2010-09-02 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1548:
--

Attachment: PIG-1458.patch


Results of test-patch:

{code}
 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.

{code}

 Optimize scalar to consolidate the part file
 

 Key: PIG-1548
 URL: https://issues.apache.org/jira/browse/PIG-1548
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Daniel Dai
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1458.patch





[jira] Updated: (PIG-1575) Complete the migration of optimization rule PushUpFilter including missing test cases

2010-09-02 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-1575:
-

Status: Patch Available  (was: Open)

 Complete the migration of optimization rule PushUpFilter including missing 
 test cases
 -

 Key: PIG-1575
 URL: https://issues.apache.org/jira/browse/PIG-1575
 Project: Pig
  Issue Type: Bug
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Fix For: 0.8.0

 Attachments: jira-1575-1.patch


 The optimization rule under the new logical plan, PushUpFilter, only covers a 
 subset of the optimization scenarios handled by the same rule under the old 
 logical plan. For instance, it only considers a filter after a join, but the 
 old optimization also considers other operators such as CoGroup, Union, 
 Cross, etc. The migration of the rule should be completed.
 Also, the test cases created for testing the old PushUpFilter weren't 
 migrated to the new logical plan code base. They should also be migrated. (A 
 few have been migrated in JIRA-1574.)




[jira] Created: (PIG-1598) Pig gobbles up error messages - Part 2

2010-09-02 Thread Ashutosh Chauhan (JIRA)
Pig gobbles up error messages - Part 2
--

 Key: PIG-1598
 URL: https://issues.apache.org/jira/browse/PIG-1598
 Project: Pig
  Issue Type: Improvement
Reporter: Ashutosh Chauhan


Another case of PIG-1531 .




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-09-02 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905775#action_12905775
 ] 

Dmitriy V. Ryaboy commented on PIG-794:
---

Jeff, that's what I am saying -- since they are writables, we can turn them 
into strings and not need InternalMap at all.
