[jira] Commented: (PIG-957) Tutorial is broken with 0.4 branch and trunk

2009-09-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754478#action_12754478
 ] 

Hadoop QA commented on PIG-957:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12419363/PIG-957.patch
  against trunk revision 814075.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/6/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/6/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/6/console

This message is automatically generated.

> Tutorial is broken with 0.4 branch and trunk
> 
>
> Key: PIG-957
> URL: https://issues.apache.org/jira/browse/PIG-957
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Olga Natkovich
>Assignee: Pradeep Kamath
> Fix For: 0.4.0
>
> Attachments: PIG-957.patch
>
>
> As I was testing the Pig Tutorial in preparation for the release, I found 
> that we broke the second script both in local mode and in MR mode. The issue 
> has to do with schema and naming fields.  
> Here is what I see:
>  
> java -cp pig.jar org.apache.pig.Main -x local script2-local.pig
> 2009-09-11 12:52:46,961 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 1000: Error during parsing. Invalid alias: hour00::group::ngram in 
> {group::ngram: chararray,group::hour: chararray,hour00::count: long,ngram: 
> chararray,hour: chararray,hour12::count: long}
> 09/09/11 12:52:46 ERROR grunt.Grunt: ERROR 1000: Error during parsing. 
> Invalid alias: hour00::group::ngram in {group::ngram: chararray,group::hour: 
> chararray,hour00::count: long,ngram: chararray,hour: chararray,hour12::count: 
> long}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-957) Tutorial is broken with 0.4 branch and trunk

2009-09-11 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754468#action_12754468
 ] 

Daniel Dai commented on PIG-957:


+1




[jira] Commented: (PIG-955) Skewed join generates incorrect results

2009-09-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754466#action_12754466
 ] 

Hadoop QA commented on PIG-955:
---

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12419352/PIG-955.patch2
  against trunk revision 814075.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/25/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/25/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/25/console

This message is automatically generated.

> Skewed join generates  incorrect results 
> -
>
> Key: PIG-955
> URL: https://issues.apache.org/jira/browse/PIG-955
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-955.patch, PIG-955.patch2
>
>
> SkewedPartitioner doesn't partition the skewed keys in partition table (first 
> table) correctly. This can cause data loss.




[jira] Updated: (PIG-957) Tutorial is broken with 0.4 branch and trunk

2009-09-11 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-957:
---

Status: Patch Available  (was: Open)




[jira] Updated: (PIG-957) Tutorial is broken with 0.4 branch and trunk

2009-09-11 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-957:
---

Attachment: PIG-957.patch

Attached a patch to address the issue. LOJoin's getSchema() now keeps both the 
disambiguated alias (outeralias::inneralias) and the simple inner alias for 
non-duplicate columns coming out of the LOJoin.
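
The aliasing rule described above can be illustrated with a small Java sketch. This is not the code from PIG-957.patch: the class and method names (JoinSchemaSketch, joinAliases) are hypothetical, and the field lists used in main are an assumption inferred from the schema printed in the error message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Pig's LOJoin implementation.
final class JoinSchemaSketch {
    // For each output column of a join, list the aliases it can be addressed by.
    // inputs maps a relation alias (e.g. "hour00") to its field aliases, in order.
    static List<List<String>> joinAliases(Map<String, List<String>> inputs) {
        // Count how often each simple field alias occurs across all inputs.
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> fields : inputs.values()) {
            for (String field : fields) {
                counts.merge(field, 1, Integer::sum);
            }
        }
        List<List<String>> columns = new ArrayList<>();
        for (Map.Entry<String, List<String>> input : inputs.entrySet()) {
            for (String field : input.getValue()) {
                List<String> aliases = new ArrayList<>();
                // The disambiguated alias is always kept.
                aliases.add(input.getKey() + "::" + field);
                // The simple alias is kept only when it is not a duplicate.
                if (counts.get(field) == 1) {
                    aliases.add(field);
                }
                columns.add(aliases);
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        // Assumed inputs, inferred from the schema in the error message.
        Map<String, List<String>> inputs = new LinkedHashMap<>();
        inputs.put("hour00", Arrays.asList("group::ngram", "group::hour", "count"));
        inputs.put("hour12", Arrays.asList("ngram", "hour", "count"));
        System.out.println(joinAliases(inputs));
    }
}
```

Under these assumptions the first column answers to both hour00::group::ngram and group::ngram, while the two count columns remain reachable only through their disambiguated names.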




[jira] Created: (PIG-957) Tutorial is broken with 0.4 branch and trunk

2009-09-11 Thread Olga Natkovich (JIRA)
Tutorial is broken with 0.4 branch and trunk


 Key: PIG-957
 URL: https://issues.apache.org/jira/browse/PIG-957
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Olga Natkovich
Assignee: Pradeep Kamath
 Fix For: 0.4.0


As I was testing the Pig Tutorial in preparation for the release, I found that 
we broke the second script both in local mode and in MR mode. The issue has to 
do with schema and naming fields.  

Here is what I see:

 

java -cp pig.jar org.apache.pig.Main -x local script2-local.pig


2009-09-11 12:52:46,961 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1000: Error during parsing. Invalid alias: hour00::group::ngram in 
{group::ngram: chararray,group::hour: chararray,hour00::count: long,ngram: 
chararray,hour: chararray,hour12::count: long}

09/09/11 12:52:46 ERROR grunt.Grunt: ERROR 1000: Error during parsing. Invalid 
alias: hour00::group::ngram in {group::ngram: chararray,group::hour: 
chararray,hour00::count: long,ngram: chararray,hour: chararray,hour12::count: 
long}





[jira] Updated: (PIG-955) Skewed join generates incorrect results

2009-09-11 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-955:
---

Status: Patch Available  (was: Open)




[jira] Updated: (PIG-955) Skewed join generates incorrect results

2009-09-11 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-955:


Attachment: PIG-955.patch2

Added JUnit test.




[jira] Updated: (PIG-954) Skewed join fails when pig.skewedjoin.reduce.memusage is not configured

2009-09-11 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-954:
---

   Resolution: Fixed
Fix Version/s: 0.4.0
   Status: Resolved  (was: Patch Available)

Patch committed. Thanks, Ying, for the quick fix!

> Skewed join fails when pig.skewedjoin.reduce.memusage is not configured
> ---
>
> Key: PIG-954
> URL: https://issues.apache.org/jira/browse/PIG-954
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Fix For: 0.4.0
>
> Attachments: PIG-954.patch, PIG-954.patch2
>
>
> query fails if pig.skewedjoin.reduce.memusage is not configured. 




[jira] Commented: (PIG-954) Skewed join fails when pig.skewedjoin.reduce.memusage is not configured

2009-09-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754396#action_12754396
 ] 

Hadoop QA commented on PIG-954:
---

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12419336/PIG-954.patch2
  against trunk revision 814016.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/5/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/5/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/5/console

This message is automatically generated.

> Skewed join fails when pig.skewedjoin.reduce.memusage is not configured
> ---
>
> Key: PIG-954
> URL: https://issues.apache.org/jira/browse/PIG-954
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-954.patch, PIG-954.patch2
>
>
> query fails if pig.skewedjoin.reduce.memusage is not configured. 




[jira] Resolved: (PIG-641) Fragment replicate join does not work in local mode

2009-09-11 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath resolved PIG-641.


   Resolution: Fixed
Fix Version/s: 0.4.0

This issue is fixed in current trunk since LocalLogToPhyTranslationVisitor 
always translates LOJoin into POCogroup followed by foreach flatten regardless 
of join type.

Here is a script I tried to validate:
[prade...@chargesize:~/dev/pig-apache/pig/trunk]cat a.txt 
1   2   3
2   3   4
3   4   5
[prade...@chargesize:~/dev/pig-apache/pig/trunk]cat b.txt 
3   a
1   x
4   b
[prade...@chargesize:~/dev/pig-apache/pig/trunk]cat c.txt 
1   20  30
[prade...@chargesize:~/dev/pig-apache/pig/trunk]java -cp 
/tmp/svncheckout/trunk/pig.jar org.apache.pig.Main -x local -e "a = load 
'a.txt'; b = load 'b.txt'; c = load 'c.txt'; d = join a by \$0, b by \$0 using 
\"replicated\";  dump d; e = join a by \$0, c by \$0 using \"replicated\"; dump 
 e;"
2009-09-11 15:27:54,852 [main] INFO  org.apache.pig.Main - Logging error 
messages to: /homes/pradeepk/dev/pig-apache/pig/trunk/pig_1252708074851.log
2009-09-11 15:27:55,217 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully 
stored result in: "file:/tmp/temp-1388892738/tmp1991974517"
2009-09-11 15:27:55,218 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written 
: 2
2009-09-11 15:27:55,218 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 0
2009-09-11 15:27:55,218 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-09-11 15:27:55,218 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(1,2,3,1,x)
(3,4,5,3,a)
2009-09-11 15:27:55,253 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully 
stored result in: "file:/tmp/temp-1388892738/tmp84396309"
2009-09-11 15:27:55,253 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written 
: 1
2009-09-11 15:27:55,253 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 0
2009-09-11 15:27:55,254 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-09-11 15:27:55,254 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(1,2,3,1,20,30)
[prade...@chargesize:~/dev/pig-apache/pig/trunk]
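
The "cogroup followed by foreach flatten" translation mentioned above can be sketched in plain Java. This is only an illustration of the idea, not LocalLogToPhyTranslationVisitor or POCogroup; the class name CogroupFlattenSketch and its methods are hypothetical, and tuples are modelled as lists of strings joined on their first field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an inner join expressed as cogroup + flatten.
final class CogroupFlattenSketch {
    // Cogroup step: collect the tuples of one input into per-key bags,
    // using the first field ($0) as the join key.
    private static Map<String, List<List<String>>> bagsByFirstField(List<List<String>> tuples) {
        Map<String, List<List<String>>> bags = new LinkedHashMap<>();
        for (List<String> tuple : tuples) {
            bags.computeIfAbsent(tuple.get(0), k -> new ArrayList<>()).add(tuple);
        }
        return bags;
    }

    // Flatten step: for every key, emit the cross product of the two bags.
    // A key that is missing on either side produces no output (inner join).
    static List<List<String>> join(List<List<String>> left, List<List<String>> right) {
        Map<String, List<List<String>>> leftBags = bagsByFirstField(left);
        Map<String, List<List<String>>> rightBags = bagsByFirstField(right);
        List<List<String>> out = new ArrayList<>();
        for (Map.Entry<String, List<List<String>>> entry : leftBags.entrySet()) {
            List<List<String>> rightBag = rightBags.get(entry.getKey());
            if (rightBag == null) {
                continue;
            }
            for (List<String> l : entry.getValue()) {
                for (List<String> r : rightBag) {
                    List<String> joined = new ArrayList<>(l);
                    joined.addAll(r);
                    out.add(joined);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Same data as a.txt and b.txt in the transcript above.
        List<List<String>> a = Arrays.asList(
                Arrays.asList("1", "2", "3"),
                Arrays.asList("2", "3", "4"),
                Arrays.asList("3", "4", "5"));
        List<List<String>> b = Arrays.asList(
                Arrays.asList("3", "a"),
                Arrays.asList("1", "x"),
                Arrays.asList("4", "b"));
        System.out.println(join(a, b));
    }
}
```

With the a.txt and b.txt data this yields the same two tuples as the first dump in the transcript, (1,2,3,1,x) and (3,4,5,3,a).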


> Fragment replicate join does not work in local mode
> ---
>
> Key: PIG-641
> URL: https://issues.apache.org/jira/browse/PIG-641
> Project: Pig
>  Issue Type: Bug
>Reporter: Olga Natkovich
>Assignee: Shubham Chopra
> Fix For: 0.4.0
>
> Attachments: 641.patch, 641.patch
>
>





[jira] Updated: (PIG-929) Default value of memusage for skewed join is not correct

2009-09-11 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-929:


Description: default value pig.skewedjoin.reduce.memusage , which is used 
in skewed join, should be set to 0.3  (was: Fragmented replicated join has a 
few limitations:
 - One of the tables needs to be loaded into memory
 - Join is limited to two tables

Skewed join partitions the table and joins the records in the reduce phase. It 
computes a histogram of the key space to account for skewing in the input 
records. Further, it adjusts the number of reducers depending on the key 
distribution.

We need to implement the skewed join in pig.)
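
A minimal Java sketch of reading this property with a fallback, so that a missing setting yields 0.3 instead of a failure. This is not the code from memusage.patch or PIG-954.patch; the class name MemUsageSketch and its method are hypothetical, and only the property name and the 0.3 default come from the issue text.

```java
import java.util.Properties;

// Hypothetical sketch of a property lookup with a default value.
final class MemUsageSketch {
    static final String KEY = "pig.skewedjoin.reduce.memusage";
    // Default proposed in this issue.
    static final String DEFAULT_MEMUSAGE = "0.3";

    // Returns the configured value, or the default when the key is absent.
    static float memUsage(Properties props) {
        return Float.parseFloat(props.getProperty(KEY, DEFAULT_MEMUSAGE));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(memUsage(props)); // key not set, falls back to the default
        props.setProperty(KEY, "0.5");
        System.out.println(memUsage(props)); // explicit setting wins
    }
}
```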

> Default value of memusage for skewed join is not correct
> 
>
> Key: PIG-929
> URL: https://issues.apache.org/jira/browse/PIG-929
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: memusage.patch
>
>
> default value pig.skewedjoin.reduce.memusage , which is used in skewed join, 
> should be set to 0.3




[jira] Updated: (PIG-954) Skewed join fails when pig.skewedjoin.reduce.memusage is not configured

2009-09-11 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-954:


Description: query fails if pig.skewedjoin.reduce.memusage is not 
configured.   (was: Fragmented replicated join has a few limitations:
 - One of the tables needs to be loaded into memory
 - Join is limited to two tables

Skewed join partitions the table and joins the records in the reduce phase. It 
computes a histogram of the key space to account for skewing in the input 
records. Further, it adjusts the number of reducers depending on the key 
distribution.

We need to implement the skewed join in pig.)




[jira] Commented: (PIG-955) Skewed join generates incorrect results

2009-09-11 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754370#action_12754370
 ] 

Ying He commented on PIG-955:
-

This is not related to replicate join. The original description is misleading. 
It came from the JIRA that this one was cloned from. I've updated it to the 
correct one.

> Skewed join generates  incorrect results 
> -
>
> Key: PIG-955
> URL: https://issues.apache.org/jira/browse/PIG-955
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-955.patch
>
>
> SkewedPartitioner doesn't partition the skewed keys in partition table (first 
> table) correctly. This can cause data loss.




[jira] Updated: (PIG-955) Skewed join generates incorrect results

2009-09-11 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-955:


Description: SkewedPartitioner doesn't the skewed keys in partition table 
correctly. This can cause data loss.  (was: Fragmented replicated join has a 
few limitations:
 - One of the tables needs to be loaded into memory
 - Join is limited to two tables

Skewed join partitions the table and joins the records in the reduce phase. It 
computes a histogram of the key space to account for skewing in the input 
records. Further, it adjusts the number of reducers depending on the key 
distribution.

We need to implement the skewed join in pig.)

> Skewed join generates  incorrect results 
> -
>
> Key: PIG-955
> URL: https://issues.apache.org/jira/browse/PIG-955
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-955.patch
>
>
> SkewedPartitioner doesn't the skewed keys in partition table correctly. This 
> can cause data loss.




[jira] Updated: (PIG-955) Skewed join generates incorrect results

2009-09-11 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-955:


Description: SkewedPartitioner doesn't partition the skewed keys in 
partition table (first table) correctly. This can cause data loss.  (was: 
SkewedPartitioner doesn't the skewed keys in partition table correctly. This 
can cause data loss.)
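
For readers unfamiliar with the idea, here is a rough Java sketch of how tuples of a skewed key in the partitioned (first) table might be spread over several reducers. It is an assumption-laden illustration, not Pig's SkewedPartitioner and not the fix in PIG-955.patch: the class name, the reducer-range map, and the round-robin policy are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of partitioning with special handling for skewed keys.
final class SkewedPartitionSketch {
    // Skewed key -> {first reducer, number of reducers assigned to the key}.
    private final Map<String, int[]> reducerRanges;
    // Number of tuples seen so far per skewed key, used for round-robin.
    private final Map<String, Integer> seen = new HashMap<>();
    private final int numReducers;

    SkewedPartitionSketch(Map<String, int[]> reducerRanges, int numReducers) {
        this.reducerRanges = reducerRanges;
        this.numReducers = numReducers;
    }

    int partition(String key) {
        int[] range = reducerRanges.get(key);
        if (range == null) {
            // Ordinary key: plain hash partitioning (floorMod avoids negatives).
            return Math.floorMod(key.hashCode(), numReducers);
        }
        // Skewed key: rotate over the reducers assigned to it, wrapping
        // around the total reducer count so every tuple gets a valid target.
        int offset = seen.merge(key, 1, Integer::sum) - 1;
        return (range[0] + offset % range[1]) % numReducers;
    }
}
```

In a scheme like this, every tuple of a skewed key must land on a reducer that also receives the matching rows of the other table; otherwise matches are silently dropped, which is the kind of data loss the description warns about.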




[jira] Commented: (PIG-949) Zebra Bug: splitting map into multiple column group using storage hint causes unexpected behaviour

2009-09-11 Thread Jing Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754353#action_12754353
 ] 

Jing Huang commented on PIG-949:


Thanks Alok. 
I am able to reproduce the problem. 
I was only using i/o layer (not pig loader) to test map split. 
This is what I did:
  final static String STR_SCHEMA = "m1:map(string),m2:map(map(int))";
  final static String STR_STORAGE = "[m1#{a}];[m2#{x|y}]; [m1#{b}, m2#{z}];[m1]";
...create table and insert data ..

load:  String projection = new String("m1#{a}");

I only got null returned. 



Without the storage hint [m1], everything works fine, i.e. 
 final static String STR_STORAGE = "[m1#{a}];[m2#{x|y}]; [m1#{b}, m2#{z}]";
 ...create table and insert data ..
load:  String projection = new String("m1#{a}");
I am able to get value m1#{a}. 

Zebra team is working on the fix.
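
To make the two configurations easier to compare, the following Java sketch lists which column groups in a storage hint mention a given map column. It is not Zebra's storage-hint parser and does not implement its grammar; the class StorageHintSketch and its methods are hypothetical helpers that only split on brackets and commas.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for inspecting a storage hint string.
final class StorageHintSketch {
    // Split a hint such as "[m1#{a}];[m2#{x|y}]" into the text of each
    // bracketed column group.
    static List<String> columnGroups(String hint) {
        List<String> groups = new ArrayList<>();
        int open = hint.indexOf('[');
        while (open >= 0) {
            int close = hint.indexOf(']', open);
            if (close < 0) {
                throw new IllegalArgumentException("unclosed '[' in: " + hint);
            }
            groups.add(hint.substring(open + 1, close).trim());
            open = hint.indexOf('[', close);
        }
        return groups;
    }

    // Indices of the column groups that mention the given map column,
    // either as a whole ("m1") or by individual keys ("m1#{a}").
    static List<Integer> groupsFor(String hint, String column) {
        List<Integer> hits = new ArrayList<>();
        List<String> groups = columnGroups(hint);
        for (int i = 0; i < groups.size(); i++) {
            for (String field : groups.get(i).split(",")) {
                String name = field.trim();
                if (name.equals(column) || name.startsWith(column + "#")) {
                    hits.add(i);
                    break;
                }
            }
        }
        return hits;
    }
}
```

For the failing hint, m1 shows up in groups 0, 2 and 3; for the working hint it shows up only in groups 0 and 2, so the explicit trailing [m1] group is the only difference between the two cases.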



> Zebra Bug: splitting map into multiple column group using storage hint causes 
> unexpected behaviour
> --
>
> Key: PIG-949
> URL: https://issues.apache.org/jira/browse/PIG-949
> Project: Pig
>  Issue Type: Bug
> Environment: linux
>Reporter: Alok Singh
>
> Hi,
> The storage hint specification plays an important part in whether the output 
> table is readable or not. Say we have a map named 'map'.
> One can split the map into a column group using [map#{k1}, map#{k2}...]; 
> the remaining map fields will automatically be added to the default group.
> If the user tries to create a new column group for the remaining fields, as in 
> [map#{k1}, map#{k2}, ..][map], i.e. creates a separate column group, 
> the table writer will create the table.
> However, if one then tries to load the created table via Pig or via MapReduce 
> using TableInputFormat, the reader has problems reading the map.
> We get the following stack trace:
> 09/09/09 00:09:45 INFO mapred.JobClient: Task Id : 
> attempt_200908191538_33939_m_21_2, Status : FAILED
> java.io.IOException: getValue() failed: null
> at 
> org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getValue(BasicTable.java:775)
> at 
> org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableInputFormat.java:717)
> at 
> org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableInputFormat.java:651)
> at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
> at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:356)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> at org.apache.hadoop.mapred.Child.main(Child.java:170)
> Alok




[jira] Commented: (PIG-955) Skewed join generates incorrect results

2009-09-11 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754349#action_12754349
 ] 

Santhosh Srinivasan commented on PIG-955:
-

Hi Ying,

How are Fragment Replicate Join and Skewed Join related as you mention in your 
bug description? Also, skewed join has been part of trunk for more than a month 
now. Your bug description states that Pig needs skewed join.

Thanks,
Santhosh

> Skewed join generates  incorrect results 
> -
>
> Key: PIG-955
> URL: https://issues.apache.org/jira/browse/PIG-955
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-955.patch
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.




[jira] Commented: (PIG-955) Skewed join generates incorrect results

2009-09-11 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754336#action_12754336
 ] 

Olga Natkovich commented on PIG-955:


Updated wrong JIRA




[jira] Updated: (PIG-954) Skewed join fails when pig.skewedjoin.reduce.memusage is not configured

2009-09-11 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-954:
---

Status: Patch Available  (was: Open)

> Skewed join fails when pig.skewedjoin.reduce.memusage is not configured
> ---
>
> Key: PIG-954
> URL: https://issues.apache.org/jira/browse/PIG-954
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-954.patch, PIG-954.patch2
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.




[jira] Commented: (PIG-955) Skewed join generates incorrect results

2009-09-11 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754335#action_12754335
 ] 

Olga Natkovich commented on PIG-955:


+1. Changes look good. Just need to wait for test results




[jira] Updated: (PIG-955) Skewed join generates incorrect results

2009-09-11 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-955:
---

Status: Open  (was: Patch Available)

> Skewed join generates  incorrect results 
> -
>
> Key: PIG-955
> URL: https://issues.apache.org/jira/browse/PIG-955
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-955.patch
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.




[jira] Commented: (PIG-954) Skewed join fails when pig.skewedjoin.reduce.memusage is not configured

2009-09-11 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754338#action_12754338
 ] 

Olga Natkovich commented on PIG-954:


+1 on the code changes. Need to wait for test results

> Skewed join fails when pig.skewedjoin.reduce.memusage is not configured
> ---
>
> Key: PIG-954
> URL: https://issues.apache.org/jira/browse/PIG-954
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-954.patch, PIG-954.patch2
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.




[jira] Updated: (PIG-955) Skewed join generates incorrect results

2009-09-11 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-955:
---

Status: Patch Available  (was: Open)

> Skewed join generates  incorrect results 
> -
>
> Key: PIG-955
> URL: https://issues.apache.org/jira/browse/PIG-955
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-955.patch
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.




[jira] Updated: (PIG-882) log level not propogated to loggers

2009-09-11 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-882:
---

   Resolution: Fixed
Fix Version/s: 0.4.0
   Status: Resolved  (was: Patch Available)

I don't have a unit test case, for the same reason as with the first patch. 
See my first comment.

> log level not propogated to loggers 
> 
>
> Key: PIG-882
> URL: https://issues.apache.org/jira/browse/PIG-882
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.3.0
>Reporter: Thejas M Nair
>Assignee: Daniel Dai
> Fix For: 0.4.0
>
> Attachments: duplicate_message.patch, PIG-882-1.patch, 
> PIG-882-2.patch, PIG-882-3.patch, PIG-882-4.patch, PIG-882-5.patch
>
>
> Pig accepts log level as a parameter. But the log level it captures is not 
> set appropriately, so that loggers in different classes log at the specified 
> level.




[jira] Updated: (PIG-954) Skewed join fails when pig.skewedjoin.reduce.memusage is not configured

2009-09-11 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-954:


Attachment: PIG-954.patch2

add JUnit test

> Skewed join fails when pig.skewedjoin.reduce.memusage is not configured
> ---
>
> Key: PIG-954
> URL: https://issues.apache.org/jira/browse/PIG-954
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-954.patch, PIG-954.patch2
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.




[jira] Commented: (PIG-955) Skewed join generates incorrect results

2009-09-11 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754319#action_12754319
 ] 

Ying He commented on PIG-955:
-

The sampling process generates a file which contains the skewed keys and their 
pre-allocated reducer indexes. Each (key, beginning index, ending index) is 
stored as a tuple.

During the join, this file is loaded by SkewedPartitioner as a lookup table. 
For each tuple from the partition table, its key is matched against this lookup 
table. If a match is found, the partitioner returns a value in the range 
[beginning index, ending index] in round-robin fashion. If no match is found, 
it uses hash() to calculate the index.

The problem is in SkewedPartitioner: when looking up the table, the 
PigNullableWritable form of the input tuple is used, while the lookup table 
uses the Pig type Tuple for its keys. Therefore, no match is ever found, and 
the indexes are calculated using hash() even for skewed keys. This causes all 
the data for such a key to go to the same reducer.

For the streaming table, however, if a key is a skewed key, each tuple is 
replicated to every reducer that was pre-allocated for that key during the 
sampling process.

Because the reducer indexes are calculated wrongly for skewed keys in the 
partition table, tuples from the first table are sent to the wrong reducers. 
If a tuple's reducer doesn't fall into the key's pre-calculated index range, 
the join with the second table ends up with an empty data set for that key. 
The query still appears to succeed, but it has data loss.

The fix is to change SkewedPartitioner to use the correct object type to look 
up the skewed key table.
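
The mismatch described above can be reproduced in a few lines. This is only a 
sketch in Python, not Pig's Java code: SKEWED, Wrapper, stable_hash and 
make_partitioner are hypothetical names, and Wrapper merely plays the role of a 
wrapper type that never compares equal to the Tuple stored in the lookup table.

```python
# Illustrative only: names and data are hypothetical, not Pig's actual code.

# Lookup table from sampling: skewed key -> (begin, end) reducer indexes, inclusive.
SKEWED = {("apple",): (0, 2)}
NUM_REDUCERS = 5


class Wrapper:
    """Stand-in for a wrapper type; it never compares equal to the tuple it holds."""

    def __init__(self, key):
        self.key = key


def stable_hash(key):
    # Deterministic stand-in for hash(); Python's str hash varies between runs.
    return sum(ord(c) for c in "".join(key))


def make_partitioner(lookup_with_wrapper):
    position = {}  # round-robin position per skewed key

    def partition(key):
        probe = Wrapper(key) if lookup_with_wrapper else key
        index_range = SKEWED.get(probe)
        if index_range is None:
            # No match: fall back to hashing, so one reducer gets every tuple.
            return stable_hash(key) % NUM_REDUCERS
        begin, end = index_range
        i = position.get(key, 0)
        position[key] = i + 1
        return begin + i % (end - begin + 1)

    return partition


buggy = make_partitioner(lookup_with_wrapper=True)   # probes with the wrapper type
fixed = make_partitioner(lookup_with_wrapper=False)  # probes with the tuple itself

buggy_targets = [buggy(("apple",)) for _ in range(6)]
fixed_targets = [fixed(("apple",)) for _ in range(6)]
```

With the wrapper as the probe, every tuple for the skewed key lands on a single 
hashed reducer; with the tuple as the probe, the tuples cycle through the 
pre-allocated range.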



> Skewed join generates  incorrect results 
> -
>
> Key: PIG-955
> URL: https://issues.apache.org/jira/browse/PIG-955
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-955.patch
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.




[jira] Commented: (PIG-955) Skewed join generates incorrect results

2009-09-11 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754306#action_12754306
 ] 

Olga Natkovich commented on PIG-955:


Hi Ying,

Thanks for the patch. From the description it is not clear what kind of scripts 
would be affected by this issue. Adding an example to the JIRA description 
would be helpful.

Also, the patch needs a unit test.

> Skewed join generates  incorrect results 
> -
>
> Key: PIG-955
> URL: https://issues.apache.org/jira/browse/PIG-955
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-955.patch
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.




[jira] Updated: (PIG-955) Skewed join generates incorrect results

2009-09-11 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-955:


Attachment: PIG-955.patch

use tuple type to lookup skewed key map 

> Skewed join generates  incorrect results 
> -
>
> Key: PIG-955
> URL: https://issues.apache.org/jira/browse/PIG-955
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: PIG-955.patch
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.




[jira] Issue Comment Edited: (PIG-956) Reduce patch testing time

2009-09-11 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754298#action_12754298
 ] 

Olga Natkovich edited comment on PIG-956 at 9/11/09 12:41 PM:
--

My plan is to do the following:

(1) Take all the tests that take 5 seconds or less and put them into 10 minute 
tests
(2) Create a TestCheckin - that runs a few end-to-end tests

(1) + (2) combined will be the Ten-minute test group.

Going forward, any files (this is at the test file level) that take 5 seconds 
or less can be added to the Ten-minute tests. Also, when any really major 
feature is added, an end-2-end query can be added or existing one modified in 
the TestCheckin.

  was (Author: olgan):
My plan is to do the following:

(1) Take all the tests that take 5 seconds or less and put them into 10 minute 
tests
(2) Create a TestCheckin - that runs a few end-to-end tests

(1) + (2) combined will be the Ten-minute test group.

Goint forward, any files (this is at the test file level) that take 5 seconds 
or less can be added to the Ten-minute tests. Also, when any really major 
feature is added, an end-2-end query can be added or existing one modified in 
the TestCheckin.
  
> Reduce patch testing time
> -
>
> Key: PIG-956
> URL: https://issues.apache.org/jira/browse/PIG-956
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.4.0
>Reporter: Olga Natkovich
>Assignee: Olga Natkovich
> Fix For: 0.6.0
>
>
> The proposal is to split the tests into 2 groups:
> (1) Ten-minute tests - this is a set of tests that runs with every patch 
> submission and takes approximately 10 minutes
> (2) All tests - these include all tests and they will run nightly
> This is similar to work done in Hadoop: 
> http://issues.apache.org/jira/browse/HDFS-458




[jira] Commented: (PIG-956) Reduce patch testing time

2009-09-11 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754298#action_12754298
 ] 

Olga Natkovich commented on PIG-956:


My plan is to do the following:

(1) Take all the tests that take 5 seconds or less and put them into 10 minute 
tests
(2) Create a TestCheckin - that runs a few end-to-end tests

(1) + (2) combined will be the Ten-minute test group.

Going forward, any files (this is at the test file level) that take 5 seconds 
or less can be added to the Ten-minute tests. Also, when any really major 
feature is added, an end-2-end query can be added or existing one modified in 
the TestCheckin.
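
As a rough illustration of the grouping rule in this plan, the sketch below 
splits test files by measured run time. The file names and durations are made 
up, and split_tests is a hypothetical helper, not part of the Pig build.

```python
# Hypothetical per-test-file run times in seconds; the names are invented.
DURATIONS = {"TestA": 2.5, "TestB": 5.0, "TestC": 41.0, "TestD": 0.8}
THRESHOLD_SECONDS = 5.0


def split_tests(durations, threshold=THRESHOLD_SECONDS):
    """Return (ten_minute_group, all_tests) following the plan above."""
    # (1) every test file that takes `threshold` seconds or less ...
    fast = sorted(name for name, secs in durations.items() if secs <= threshold)
    # (2) ... plus the end-to-end TestCheckin suite.
    ten_minute = fast + ["TestCheckin"]
    all_tests = sorted(durations) + ["TestCheckin"]
    return ten_minute, all_tests


ten_minute, all_tests = split_tests(DURATIONS)
```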

> Reduce patch testing time
> -
>
> Key: PIG-956
> URL: https://issues.apache.org/jira/browse/PIG-956
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.4.0
>Reporter: Olga Natkovich
>Assignee: Olga Natkovich
> Fix For: 0.6.0
>
>
> The proposal is to split the tests into 2 groups:
> (1) Ten-minute tests - this is a set of tests that runs with every patch 
> submission and takes approximately 10 minutes
> (2) All tests - these include all tests and they will run nightly
> This is similar to work done in Hadoop: 
> http://issues.apache.org/jira/browse/HDFS-458




[jira] Created: (PIG-956) Reduce patch testing time

2009-09-11 Thread Olga Natkovich (JIRA)
Reduce patch testing time
-

 Key: PIG-956
 URL: https://issues.apache.org/jira/browse/PIG-956
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.4.0
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Fix For: 0.6.0


The proposal is to split the tests into 2 groups:

(1) Ten-minute tests - this is a set of tests that runs with every patch 
submission and takes approximately 10 minutes
(2) All tests - these include all tests and they will run nightly

This is similar to work done in Hadoop: 
http://issues.apache.org/jira/browse/HDFS-458




[jira] Updated: (PIG-660) Integration with Hadoop 0.20

2009-09-11 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-660:
---

Affects Version/s: (was: 0.5.0)
   0.4.0
Fix Version/s: 0.5.0

> Integration with Hadoop 0.20
> 
>
> Key: PIG-660
> URL: https://issues.apache.org/jira/browse/PIG-660
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.4.0
> Environment: Hadoop 0.20
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
> Fix For: 0.5.0
>
> Attachments: hadoop20.jar.gz, PIG-660-for-branch-0.3.patch, 
> PIG-660.patch, PIG-660_1.patch, PIG-660_2.patch, PIG-660_3.patch, 
> PIG-660_4.patch, PIG-660_5.patch, PIG-660_trunk.patch, PIG-660_trunk_2.patch, 
> pig_660_shims.patch, pig_660_shims_2.patch, pig_660_shims_3.patch
>
>
> With Hadoop 0.20, it will be possible to query the status of each map and 
> reduce in a map reduce job. This will allow better error reporting. Some of 
> the other items that could be on Hadoop's feature requests/bugs are 
> documented here for tracking.
> 1. Hadoop should return objects instead of strings when exceptions are thrown
> 2. The JobControl should handle all exceptions and report them appropriately. 
> For example, when the JobControl fails to launch jobs, it should handle 
> exceptions appropriately and should support APIs that query this state, i.e., 
> failure to launch jobs.




[jira] Created: (PIG-955) Skewed join generates incorrect results

2009-09-11 Thread Ying He (JIRA)
Skewed join generates  incorrect results 
-

 Key: PIG-955
 URL: https://issues.apache.org/jira/browse/PIG-955
 Project: Pig
  Issue Type: Improvement
Reporter: Ying He


Fragmented replicated join has a few limitations:
 - One of the tables needs to be loaded into memory
 - Join is limited to two tables

Skewed join partitions the table and joins the records in the reduce phase. It 
computes a histogram of the key space to account for skewing in the input 
records. Further, it adjusts the number of reducers depending on the key 
distribution.

We need to implement the skewed join in pig.
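
To make the histogram idea concrete, here is a small Python sketch. The 
allocation rule (one reducer per records_per_reducer sampled records) and the 
name allocate_reducers are assumptions for illustration only; the description 
above does not say how Pig actually sizes the ranges.

```python
from collections import Counter
from math import ceil


def allocate_reducers(sampled_keys, records_per_reducer):
    """Return (key, begin_index, end_index) for each key needing more than one reducer."""
    histogram = Counter(sampled_keys)  # key -> number of sampled records
    table = []
    next_index = 0
    for key, n in histogram.most_common():
        needed = ceil(n / records_per_reducer)
        if needed > 1:  # treat the key as skewed
            table.append((key, next_index, next_index + needed - 1))
            next_index += needed
    return table


sample = ["a"] * 7 + ["b"] * 3 + ["c"]
allocation = allocate_reducers(sample, records_per_reducer=2)
```

In this toy sample, key "a" gets four reducers and key "b" gets two, while "c" 
is left to ordinary hash partitioning.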




Re: Request for feedback: cost-based optimizer

2009-09-11 Thread Dmitriy Ryaboy
Hi Alan,
Thanks for the detailed review.

After getting Daniel's feedback (and grokking the relationship between
Pig's logical and physical operators, which is a little different than
that described in the literature), we agree that the proper place to
put the optimizer is at the logical layer, although we will need to
compile to the physical layer to get cost estimates (for example, the
number of generated MR jobs, which have associated
network/queueing/startup costs). In order to adaptively adjust
estimates, we will need to be able to trace back from an executed MR
job ("job set", really, as some operations like order and join may
require several jobs that are considered a single unit) to the logical
operators this job covered. Adding that ability will have the
additional benefit of enabling more helpful debugging output to end
users by associating a failed MR job with what it was supposed to be
doing.

Totally agree with respect to PigServer and MapReduceLauncher.  Making
PigServer an actual "server" would be good, but is somewhat orthogonal
to this work.

Great to know you are working on statistics, looking forward to
looking at the proposal.  Are you working on just data stats or also
execution stats (time per operator per record, that sort of thing)?

Thanks
-Dmitriy

On Fri, Sep 11, 2009 at 1:56 PM, Alan Gates wrote:
> This is a good start at adding a cost based optimizer to Pig.  I have a
> number of comments:
>
> 1) Your argument for putting it in the physical layer rather than the
> logical is that the logical layer does not know physical statistics.  This
> need not be true.  You suggest adding a getStatistics call to the loader to
> give statistics.  The logical layer can make this call and make decisions
> based on the results without understanding the underlying physical layer.
>  It seems that the real reason you want to put the optimizer in the physical
> layer is, rather than trying to do predictive statistics (such as we guess
> this join will result in a 2x data explosion) you want to see the results of
> actual MR jobs and then make decisions.  This seems like a reasonable choice
> for a couple of reasons:  a) statistical guesses are hard to get right, and
> Pig has limited statistics to begin with; b) since Pig Latin scripts can be
> arbitrarily long, bad guesses at the beginning will have a worse ripple
> effect than bad guesses in a SQL optimizer.
>
> 2) The changes you propose in Pig Server are quite complex.  Would it be
> possible instead to put the changes in MapReduceLauncher?  It could run the
> first MR job in a Pig Latin script, look at the results, and then rerun your
> CBO on the remaining physical plan and re-translate this to a new MR plan
> and resubmit.  This would require annotations to the MR plan to indicate
> where in a physical plan the MR boundaries fall, so that correct portions of
> the original physical plan could be used for reoptimization and
> recompilation.  But it would contain the complexity of your changes to
> MapReduceLauncher instead of scattering them through the entire system.
>
> 3) On adding getStatistics, I am currently working on a proposal to make a
> number of changes to the load interface, including getStatistics.  I hope to
> publish that proposal by next week.  Similarly I am working on a proposal of
> how Pig will interact with metadata systems (such as Owl) which I also hope
> to propose next week.  We will be actively working in these areas because we
> need them for our SQL implementation.  So, one, you'll get a lot of this for
> free; two, we should stay connected on these things so what we implement
> works for what you need.
>
> Alan.
>
> On Sep 1, 2009, at 9:54 AM, Dmitriy Ryaboy wrote:
>
>> Whoops :-)
>> Here's the Google doc:
>>
>> http://docs.google.com/Doc?docid=0Adqb7pZsloe6ZGM4Z3o1OG1fMjFrZjViZ21jdA&hl=en
>>
>> -Dmitriy
>>
>> On Tue, Sep 1, 2009 at 12:51 PM, Santhosh Srinivasan
>> wrote:
>>>
>>> Dmitriy and Gang,
>>>
>>> The mailing list does not allow attachments. Can you post it on a
>>> website and just send the URL ?
>>>
>>> Thanks,
>>> Santhosh
>>>
>>> -Original Message-
>>> From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
>>> Sent: Tuesday, September 01, 2009 9:48 AM
>>> To: pig-dev@hadoop.apache.org
>>> Subject: Request for feedback: cost-based optimizer
>>>
>>> Hi everyone,
>>> Attached is a (very) preliminary document outlining a rough design we
>>> are proposing for a cost-based optimizer for Pig.
>>> This is being done as a capstone project by three CMU Master's students
>>> (myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not
>>> necessarily meant for immediate incorporation into the Pig codebase,
>>> although it would be nice if it, or parts of it, are found to be useful
>>> in the mainline.
>>>
>>> We would love to get some feedback from the developer community
>>> regarding the ideas expressed in the document, any concerns about the
>>> design, suggestions for improvement, etc.
>>>
>>> Thanks

[jira] Updated: (PIG-660) Integration with Hadoop 0.20

2009-09-11 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-660:
---

Affects Version/s: (was: 0.2.0)
   0.5.0

> Integration with Hadoop 0.20
> 
>
> Key: PIG-660
> URL: https://issues.apache.org/jira/browse/PIG-660
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.5.0
> Environment: Hadoop 0.20
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
> Fix For: 0.4.0
>
> Attachments: hadoop20.jar.gz, PIG-660-for-branch-0.3.patch, 
> PIG-660.patch, PIG-660_1.patch, PIG-660_2.patch, PIG-660_3.patch, 
> PIG-660_4.patch, PIG-660_5.patch, PIG-660_trunk.patch, PIG-660_trunk_2.patch, 
> pig_660_shims.patch, pig_660_shims_2.patch, pig_660_shims_3.patch
>
>
> With Hadoop 0.20, it will be possible to query the status of each map and 
> reduce in a map reduce job. This will allow better error reporting. Some of 
> the other items that could be on Hadoop's feature requests/bugs are 
> documented here for tracking.
> 1. Hadoop should return objects instead of strings when exceptions are thrown
> 2. The JobControl should handle all exceptions and report them appropriately. 
> For example, when the JobControl fails to launch jobs, it should handle 
> exceptions appropriately and should support APIs that query this state, i.e., 
> failure to launch jobs.




[jira] Updated: (PIG-892) Make COUNT and AVG deal with nulls accordingly with SQL standar

2009-09-11 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-892:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Make COUNT and AVG deal with nulls accordingly with SQL standar
> ---
>
> Key: PIG-892
> URL: https://issues.apache.org/jira/browse/PIG-892
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Olga Natkovich
>Assignee: Olga Natkovich
> Fix For: 0.4.0
>
> Attachments: PIG-892.patch, PIG-892_v2.patch, PIG-892_v3.patch
>
>
> both COUNT and AVG need to ignore nulls. Also add COUNT_STAR to match 
> COUNT(*) in SQL
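
A minimal Python sketch of the intended semantics, with None standing in for a 
Pig null. The function names are illustrative and are not the Pig builtins 
themselves.

```python
def count(values):
    """COUNT: number of non-null values."""
    return sum(1 for v in values if v is not None)


def count_star(values):
    """COUNT_STAR: number of rows, nulls included, like COUNT(*) in SQL."""
    return len(values)


def avg(values):
    """AVG: mean of the non-null values, or null if there are none."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


column = [1, None, 3]
```

For the column above, count gives 2, count_star gives 3, and avg gives 2.0 
rather than 4/3.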




[jira] Updated: (PIG-895) Default parallel for Pig

2009-09-11 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-895:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Default parallel for Pig
> 
>
> Key: PIG-895
> URL: https://issues.apache.org/jira/browse/PIG-895
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Affects Versions: 0.3.0
>Reporter: Daniel Dai
> Fix For: 0.4.0
>
> Attachments: PIG-895-1.patch, PIG-895-2.patch, PIG-895-3.patch
>
>
> For hadoop 20, if the user doesn't specify the number of reducers, hadoop 
> will use 1 reducer as the default value. This differs from previous versions 
> of hadoop, in which the default reducer number was usually good. 1 reducer is 
> surely not what the user wants. Although the user can use the "parallel" 
> keyword to specify the number of reducers for each statement, it is wordy. We 
> need a convenient way for users to express a desired number of reducers. Here 
> is my proposal:
> 1. Add one property "default_parallel" to Pig. The user can set 
> default_parallel in the script. E.g.:
>    set default_parallel 10;
> 2. default_parallel is a hint to Pig. Pig is free to optimize the number of 
> reducers (unlike the parallel keyword). Currently, since we do not have a 
> mechanism to determine the optimal number of reducers, default_parallel will 
> always be granted, unless it is overridden by the "parallel" keyword.
> 3. If the user puts multiple default_parallel settings inside a script, the 
> last entry will be taken.
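
The three rules can be modelled as follows. This is a Python sketch of the 
proposed behaviour, not Pig code; the statement encoding and resolve_reducers 
are hypothetical.

```python
HADOOP_DEFAULT_REDUCERS = 1  # the hadoop 20 default mentioned above


def resolve_reducers(script):
    """Map each operator name to the number of reducers it would get.

    `script` is a list of ("set_default_parallel", n) and
    ("op", name, parallel_or_None) entries in script order.
    """
    default = HADOOP_DEFAULT_REDUCERS
    for stmt in script:
        if stmt[0] == "set_default_parallel":
            default = stmt[1]  # rule 3: the last entry is taken
    result = {}
    for stmt in script:
        if stmt[0] == "op":
            _, name, parallel = stmt
            # rule 2: an explicit "parallel" overrides default_parallel
            result[name] = parallel if parallel is not None else default
    return result


script = [
    ("set_default_parallel", 10),
    ("op", "grouped", None),
    ("op", "joined", 20),
    ("set_default_parallel", 4),
]
reducers = resolve_reducers(script)
```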




proposed changes to Pig UDFs

2009-09-11 Thread Olga Natkovich
Hi,

 

As you know, a lot of work this year went into performance optimization
of Pig. One of the main sources of performance problems is high memory
usage. In an effort to address this problem we propose switching
internal implementation of strings from Java Strings to Hadoop Text
because Text has lower memory overhead. Examples (assumes ASCII data;
sizes are in bytes):

 

Real String    Java String    Hadoop Text
5              46             37
10             56             42
20             76             52
40             116            72
80             196            112

 

As the size of the strings grows so does the gap between the two
implementations.
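
The numbers in the table fit two simple linear formulas, which makes the 
growing gap easy to see. The constants 36 and 32 below are fitted to the table 
rather than taken from JVM or Hadoop documentation, so treat them as 
assumptions; the per-character costs match Java Strings using 2 bytes per 
character and Text storing ASCII as 1 byte per character.

```python
# Rows from the table above: (string length, Java String bytes, Hadoop Text bytes).
TABLE = [(5, 46, 37), (10, 56, 42), (20, 76, 52), (40, 116, 72), (80, 196, 112)]


def java_string_bytes(n):
    # Fitted: 36 bytes of fixed overhead plus 2 bytes per character.
    return 36 + 2 * n


def hadoop_text_bytes(n):
    # Fitted: 32 bytes of fixed overhead plus 1 byte per ASCII character.
    return 32 + n


def gap_bytes(n):
    # Difference between the two; grows by one byte per character.
    return java_string_bytes(n) - hadoop_text_bytes(n)


gaps = [gap_bytes(n) for n, _, _ in TABLE]
```

Under these fitted formulas the saving is 4 + n bytes for an ASCII string of 
length n.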

 

Making this change would have no impact on pig users; however, it will
have impact on existing UDFs that work with Strings. Our question is
whether UDF writers/owners are comfortable with the proposed transition
and will update their UDFs.

 

Please, let us know by the end of next week if you strongly object to
this proposal. Otherwise, we will go forward with this plan.

 

Thanks,

 

Olga 

 

 



Re: Request for feedback: cost-based optimizer

2009-09-11 Thread Alan Gates
This is a good start at adding a cost based optimizer to Pig.  I have  
a number of comments:


1) Your argument for putting it in the physical layer rather than the  
logical is that the logical layer does not know physical statistics.   
This need not be true.  You suggest adding a getStatistics call to the  
loader to give statistics.  The logical layer can make this call and  
make decisions based on the results without understanding the  
underlying physical layer.  It seems that the real reason you want to  
put the optimizer in the physical layer is, rather than trying to do  
predictive statistics (such as we guess this join will result in a 2x  
data explosion) you want to see the results of actual MR jobs and then  
make decisions.  This seems like a reasonable choice for a couple of  
reasons:  a) statistical guesses are hard to get right, and Pig has  
limited statistics to begin with; b) since Pig Latin scripts can be  
arbitrarily long, bad guesses at the beginning will have a worse  
ripple effect than bad guesses in a SQL optimizer.


2) The changes you propose in Pig Server are quite complex.  Would it  
be possible instead to put the changes in MapReduceLauncher?  It could  
run the first MR job in a Pig Latin script, look at the results, and  
then rerun your CBO on the remaining physical plan and re-translate  
this to a new MR plan and resubmit.  This would require annotations to  
the MR plan to indicate where in a physical plan the MR boundaries  
fall, so that correct portions of the original physical plan could be  
used for reoptimization and recompilation.  But it would contain the  
complexity of your changes to MapReduceLauncher instead of scattering  
them through the entire system.


3) On adding getStatistics, I am currently working on a proposal to  
make a number of changes to the load interface, including  
getStatistics.  I hope to publish that proposal by next week.   
Similarly I am working on a proposal of how Pig will interact with  
metadata systems (such as Owl) which I also hope to propose next  
week.  We will be actively working in these areas because we need them  
for our SQL implementation.  So, one, you'll get a lot of this for  
free; two, we should stay connected on these things so what we  
implement works for what you need.
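
The run, observe, re-optimize loop suggested in point 2 might look roughly like 
the sketch below. Everything here is hypothetical (the plan layout, the row 
counts, and the rule that picks a replicated join when the observed input is 
small); it shows the control flow only and is not MapReduceLauncher's API.

```python
def run_adaptively(plan, run_job, reoptimize):
    """Run jobs one at a time, re-planning the remainder after each result."""
    stats = {}
    remaining = list(plan)
    executed = []
    while remaining:
        job = remaining.pop(0)
        stats[job["name"]] = run_job(job)          # observe actual output size
        executed.append((job["name"], job["strategy"]))
        remaining = reoptimize(remaining, stats)   # re-plan what is left
    return executed


# Toy stand-ins so the loop can be exercised.
ROWS_OUT = {"filter": 200, "join": 180}
MEMORY_LIMIT_ROWS = 1000


def run_job(job):
    return ROWS_OUT[job["name"]]


def reoptimize(remaining, stats):
    updated = []
    for job in remaining:
        rows_in = stats.get(job.get("input"))
        if rows_in is not None and rows_in <= MEMORY_LIMIT_ROWS:
            # The observed input is small enough to hold in memory.
            job = dict(job, strategy="replicated")
        updated.append(job)
    return updated


plan = [
    {"name": "filter", "strategy": "scan"},
    {"name": "join", "input": "filter", "strategy": "regular"},
]
executed = run_adaptively(plan, run_job, reoptimize)
```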


Alan.

On Sep 1, 2009, at 9:54 AM, Dmitriy Ryaboy wrote:


Whoops :-)
Here's the Google doc:
http://docs.google.com/Doc?docid=0Adqb7pZsloe6ZGM4Z3o1OG1fMjFrZjViZ21jdA&hl=en

-Dmitriy

On Tue, Sep 1, 2009 at 12:51 PM, Santhosh Srinivasan wrote:

Dmitriy and Gang,

The mailing list does not allow attachments. Can you post it on a
website and just send the URL ?

Thanks,
Santhosh

-Original Message-
From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
Sent: Tuesday, September 01, 2009 9:48 AM
To: pig-dev@hadoop.apache.org
Subject: Request for feedback: cost-based optimizer

Hi everyone,
Attached is a (very) preliminary document outlining a rough design we
are proposing for a cost-based optimizer for Pig.
This is being done as a capstone project by three CMU Master's students
(myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not
necessarily meant for immediate incorporation into the Pig codebase,
although it would be nice if it, or parts of it, are found to be useful
in the mainline.

We would love to get some feedback from the developer community
regarding the ideas expressed in the document, any concerns about the
design, suggestions for improvement, etc.

Thanks,
Dmitriy, Ashutosh, Tejal





[jira] Resolved: (PIG-950) Pig Loader does not handle unix hidden files ( files starting with dot)

2009-09-11 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai resolved PIG-950.


   Resolution: Invalid
Fix Version/s: 0.4.0

It is a limitation of Hadoop map-reduce, so we cannot solve it on the Pig side.
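
For context, as far as I know Hadoop's FileInputFormat skips input paths whose 
file name starts with "." or "_". The Python sketch below only mimics that 
filter to show why '.btschema' yields no records; is_hidden and visible_inputs 
are illustrative names, not Hadoop APIs.

```python
def is_hidden(path):
    """True if the last path component starts with '.' or '_'."""
    name = path.rstrip("/").rsplit("/", 1)[-1]
    return name.startswith((".", "_"))


def visible_inputs(paths):
    """Keep only the paths an input format with this filter would read."""
    return [p for p in paths if not is_hidden(p)]


inputs = visible_inputs([".btschema", "data/part-00000", "data/_logs"])
```

With this filter in place the job still runs and reports success, but it reads 
zero input records.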

> Pig Loader does not handle unix hidden files ( files starting with dot)
> ---
>
> Key: PIG-950
> URL: https://issues.apache.org/jira/browse/PIG-950
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.4.0
>Reporter: Jing Huang
> Fix For: 0.4.0
>
>
> I am trying to load .btschema file using pig loader, ( .btschema is not an 
> empty file)
> This is what I did:
> grunt> a = load '.btschema';
> grunt> dump a;
> 2009-09-09 17:41:21,170 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>  - MR plan size before optimization: 1
> 2009-09-09 17:41:21,170 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>  - MR plan size after optimization: 1
> 2009-09-09 17:41:23,092 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
>  - Setting up single store job
> 2009-09-09 17:41:23,106 [main] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics 
> - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - 
> already initialized
> 2009-09-09 17:41:23,127 [Thread-4] WARN  org.apache.hadoop.mapred.JobClient - 
> Use GenericOptionsParser for parsing the arguments. Applications should 
> implement Tool for the same.
> 2009-09-09 17:41:23,623 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - 0% complete
> 2009-09-09 17:41:28,644 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - 100% complete
> 2009-09-09 17:41:28,644 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - Successfully stored result in: "file:/tmp/temp165972/tmp-527102439"
> 2009-09-09 17:41:28,645 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - Records written : 0
> 2009-09-09 17:41:28,645 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - Bytes written : 0
> 2009-09-09 17:41:28,645 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - Success!
> grunt> 
> =
> it dumps nothing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-891) Fixing dfs statement for Pig

2009-09-11 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754233#action_12754233
 ] 

Daniel Dai commented on PIG-891:


I also tested in local mode, and the patch works well there too. We discussed 
the issues raised in my previous comment; the suggestions are:

1. We can keep the existing file system commands for now
2. We should use "fs" instead of "dfs" to indicate a file system command, as 
the latest Hadoop does

Jeff, can you make this little change ("dfs"->"fs") and submit again? Thanks!

> Fixing dfs statement for Pig
> 
>
> Key: PIG-891
> URL: https://issues.apache.org/jira/browse/PIG-891
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: Daniel Dai
>Assignee: Jeff Zhang
>Priority: Minor
> Fix For: 0.4.0
>
> Attachments: Pig_891.patch
>
>
> Several hadoop dfs commands are not supported or are restricted in current Pig. We 
> need to fix that. These include:
> 1. Several commands are not supported: lsr, dus, count, rmr, expunge, put, 
> moveFromLocal, get, getmerge, text, moveToLocal, mkdir, touchz, test, stat, 
> tail, chmod, chown, chgrp. A reference for these commands can be found at 
> http://hadoop.apache.org/common/docs/current/hdfs_shell.html
> 2. None of the existing dfs commands support globbing.
> 3. Pig should provide a programmatic way to perform dfs commands. Several of 
> them exist in PigServer, but not all of them.
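
To make item 2 of the issue concrete, here is a rough sketch of what glob support for a file system command amounts to: the pattern is expanded against the known paths before the command runs. It is written in Python with the standard fnmatch module, matches one path component at a time so that "*" does not cross a "/", and uses invented paths and function names. It is an illustration only, not the patch attached to this issue.

```python
import fnmatch


def matches(pattern, path):
    """Match a path against a glob pattern one component at a time."""
    pattern_parts = pattern.strip("/").split("/")
    path_parts = path.strip("/").split("/")
    if len(pattern_parts) != len(path_parts):
        return False
    return all(
        fnmatch.fnmatchcase(part, pat)
        for part, pat in zip(path_parts, pattern_parts)
    )


def expand_glob(pattern, known_paths):
    """Return the sorted paths that the pattern expands to."""
    return sorted(p for p in known_paths if matches(pattern, p))


if __name__ == "__main__":
    # Hypothetical directory listing used only for this example.
    listing = [
        "/logs/2009-09-10/part-00000",
        "/logs/2009-09-11/part-00000",
        "/logs/2009-08-31/part-00000",
        "/logs/2009-09-11/sub/part-00000",
    ]
    for path in expand_glob("/logs/2009-09-*/part-*", listing):
        print(path)
```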




[jira] Commented: (PIG-882) log level not propogated to loggers

2009-09-11 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754219#action_12754219
 ] 

Olga Natkovich commented on PIG-882:


+1

> log level not propogated to loggers 
> 
>
> Key: PIG-882
> URL: https://issues.apache.org/jira/browse/PIG-882
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.3.0
>Reporter: Thejas M Nair
>Assignee: Daniel Dai
> Fix For: 0.4.0
>
> Attachments: duplicate_message.patch, PIG-882-1.patch, 
> PIG-882-2.patch, PIG-882-3.patch, PIG-882-4.patch, PIG-882-5.patch
>
>
> Pig accepts a log level as a parameter, but the level it captures is not 
> applied to the loggers in the different classes, so they do not log at the 
> specified level.
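
The behaviour the issue asks for can be sketched with Python's standard logging module, where setting the level on a parent logger determines the effective level of every child that has no level of its own. The logger names ("pigdemo", "pigdemo.parser", and so on) and the helper function are hypothetical, and Pig itself uses a Java logging stack, so this only illustrates the idea of propagating one configured level.

```python
import logging

# Accepted level names, mapped to the standard logging constants.
LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warn": logging.WARNING,
    "error": logging.ERROR,
}


def apply_log_level(level_name, parent_name="pigdemo"):
    """Set the level on the parent logger so child loggers inherit it."""
    level = LEVELS[level_name.lower()]
    logging.getLogger(parent_name).setLevel(level)
    return level


# Loggers as they might be created in different classes (names invented).
parser_log = logging.getLogger("pigdemo.parser")
launcher_log = logging.getLogger("pigdemo.backend.launcher")

if __name__ == "__main__":
    apply_log_level("debug")
    print(parser_log.isEnabledFor(logging.DEBUG))
    print(launcher_log.isEnabledFor(logging.DEBUG))
```

The design point is that the level is set once, at a common ancestor, rather than on each class's logger individually.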




[jira] Resolved: (PIG-929) Default value of memusage for skewed join is not correct

2009-09-11 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-929.


Resolution: Fixed

> Default value of memusage for skewed join is not correct
> 
>
> Key: PIG-929
> URL: https://issues.apache.org/jira/browse/PIG-929
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ying He
> Attachments: memusage.patch
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.
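
As a rough illustration of the idea in the description (a histogram over sampled keys, then more reducers for heavily skewed keys), here is a small Python sketch. The allocation rule, one reducer per "fair share" of the sampled records and capped at the total, is an assumption made for this example; it is not the algorithm Pig implements, and all names are invented.

```python
import math
from collections import Counter


def key_histogram(sampled_keys):
    """Count how often each join key appears in a sample of the input."""
    return Counter(sampled_keys)


def reducers_per_key(histogram, total_reducers):
    """Give each key at least one reducer, and skewed keys more.

    Assumed heuristic: a key gets one reducer per fair share of the
    sampled records, never more than total_reducers.
    """
    total = sum(histogram.values())
    if total == 0:
        return {}
    fair_share = total / total_reducers
    return {
        key: min(total_reducers, max(1, math.ceil(count / fair_share)))
        for key, count in histogram.items()
    }


if __name__ == "__main__":
    # Hypothetical sample: key "a" dominates the input.
    sample = ["a"] * 70 + ["b"] * 20 + ["c"] * 10
    print(reducers_per_key(key_histogram(sample), total_reducers=4))
```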




Double logs in grunt and ^d don't work ?

2009-09-11 Thread Vincent BARAT

Hello,

I'm new to Pig and I use it on Mac OS. I wonder if there is a way to 
avoid the double log traces in the grunt console, and if there is a 
way to make the ^D key work (the DEL key).

I think this is really inconvenient.

Thanks for your answer.