[jira] Created: (PIG-1464) Should clean the Graph when register another Pig Script

2010-06-25 Thread Jeff Zhang (JIRA)
Should clean the Graph when register another Pig Script
---

 Key: PIG-1464
 URL: https://issues.apache.org/jira/browse/PIG-1464
 Project: Pig
  Issue Type: Bug
  Components: grunt
Affects Versions: 0.8.0
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Fix For: 0.8.0


In the current implementation, all variable names in Pig scripts are global. 
This lets one Pig script see the variables defined in other scripts. In my 
opinion, this is not right. Every relation name in a Pig script should be a 
local variable; otherwise it can produce unexpected results. This issue relates 
to PIG-1423.

E.g., there are two Pig scripts as follows:

Test_1.pig
{code}
a = load 'data/b.txt' ;
{code}

Test_2.pig
{code}
b = foreach a generate $0;   -- a is recognized by Grunt although it is defined in Test_1.pig
{code}

And the following executes normally and does not throw any exception:

{code}
// Both scripts are registered on the same PigServer instance.
PigServer pig = new PigServer(ExecType.LOCAL);
pig.registerScript("Test_1.pig");
pig.registerScript("Test_2.pig");
{code}
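
To make the scoping difference concrete, here is a minimal, self-contained sketch. It does not use the Pig API at all: AliasScopeDemo, register and the alias table are hypothetical stand-ins that only model a shared alias table versus a fresh table per registered script. Each statement is reduced to an {alias, inputAlias} pair, which is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, NOT the Pig API: each "script" is a list of
// {alias, inputAlias} pairs; inputAlias is null for a load statement.
final class AliasScopeDemo {

    /** Registers the scripts in order and returns unresolved-alias errors. */
    static List<String> register(List<String[][]> scripts, boolean sharedScope) {
        List<String> errors = new ArrayList<>();
        Map<String, String> aliases = new HashMap<>();
        for (String[][] script : scripts) {
            if (!sharedScope) {
                aliases = new HashMap<>(); // fresh alias table for every script
            }
            for (String[] stmt : script) {
                String alias = stmt[0];
                String input = stmt[1];
                if (input != null && !aliases.containsKey(input)) {
                    errors.add("Unrecognized alias " + input);
                    continue;
                }
                aliases.put(alias, input == null ? "load" : input);
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        String[][] test1 = { {"a", null} }; // a = load 'data/b.txt';
        String[][] test2 = { {"b", "a"} };  // b = foreach a generate $0;
        List<String[][]> scripts = Arrays.asList(test1, test2);
        System.out.println("shared scope errors:     " + register(scripts, true));
        System.out.println("per-script scope errors: " + register(scripts, false));
    }
}
```

With the shared table the second script resolves a without complaint, which mirrors the behavior reported here; with a per-script table it is rejected, which is the behavior this issue asks for.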

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1464) Should clean the Graph when register another Pig Script

2010-06-25 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-1464:


Attachment: Pig-1406.patch

Attached the patch for this issue.




[jira] Updated: (PIG-1464) Should clean the Graph when register another Pig Script

2010-06-25 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-1464:


Attachment: PIG_1463.patch




[jira] Updated: (PIG-1464) Should clean the Graph when register another Pig Script

2010-06-25 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-1464:


Status: Patch Available  (was: Open)




[jira] Updated: (PIG-1464) Should clean the Graph when register another Pig Script

2010-06-25 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-1464:


Attachment: (was: Pig-1406.patch)




compile load to mr plan

2010-06-25 Thread Gang Luo
Hi,
Multiple load operators in a script start the same number of streams; some of 
them are merged later (e.g. by a join) and some are not. How do we know which 
MR operator each of these loads should be placed in? For example, given a 
script like this:
a = load file1
b = load file2
..
dump

If we join a and b between the loads and the dump, the two loads (a and b) 
should be placed in the same MR operator. If we sort a and b independently, the 
two loads should be placed in separate MR operators. How do we identify whether 
these two streams are correlated?

A further question: can we specify a directory so that load will read all 
the files in that directory? Since each reducer of an MR job produces a 
single file, what do we do when the subsequent MR job needs to read all of 
these files?

Thanks,
-Gang






[jira] Commented: (PIG-1464) Should clean the Graph when register another Pig Script

2010-06-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882583#action_12882583
 ] 

Hadoop QA commented on PIG-1464:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12448030/PIG_1463.patch
  against trunk revision 957753.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/350/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/350/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/350/console

This message is automatically generated.




[jira] Commented: (PIG-1454) Consider clean up backend code

2010-06-25 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882640#action_12882640
 ] 

Richard Ding commented on PIG-1454:
---

I've run the core tests manually and they passed.

 Consider clean up backend code
 --

 Key: PIG-1454
 URL: https://issues.apache.org/jira/browse/PIG-1454
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1454.patch


 Prior to 0.7, Pig had its own local execution mode, in addition to hadoop map 
 reduce execution mode. To support these two different execution modes, Pig 
 implemented an abstraction layer with a set of interfaces and abstract 
 classes.  Pig 0.7 replaced the local mode with hadoop local mode and made 
 this abstraction layer redundant.
 Our goal is to remove that extra code. But we also need to keep the code 
 backward compatible, since some interfaces are exposed by the top-level API.
 So we propose the first steps:
 * Deprecate methods on FileLocalizer that have DataStorage as parameter.
 * Remove ExecPhysicalOperator, ExecPhysicalPlan, ExecScopedLogicalOperator, 
 ExecutionEngine and util/ExecTools from 
 org.apache.pig.backend.executionengine package.




[jira] Commented: (PIG-1454) Consider clean up backend code

2010-06-25 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882655#action_12882655
 ] 

Olga Natkovich commented on PIG-1454:
-

+1




[jira] Commented: (PIG-809) number of input lines it processed, number of output lines it produced for PIG job

2010-06-25 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882667#action_12882667
 ] 

Richard Ding commented on PIG-809:
--

PIG-1299 and PIG-1389 address this requirement: the number of records read from 
each user input and written to each user output in a script will be written to 
the Pig log at the end of execution.

 number of input lines it processed, number of output lines it produced for 
 PIG job
 --

 Key: PIG-809
 URL: https://issues.apache.org/jira/browse/PIG-809
 Project: Pig
  Issue Type: Improvement
  Components: impl
 Environment: Linux
Reporter: Supreeth
Assignee: Richard Ding
 Fix For: 0.8.0


 Excerpt from the mail conversation.
 It will be a great addition to Pig. Hadoop currently provides all these
 counters. All Pig has to do is to add them up for all Hadoop jobs in the
 script, and emit them at the end of the script. File a jira ?
 - Milind
 On 5/13/09 8:16 AM, Supreeth Hosur Nagesh Rao supre...@yahoo-inc.com
 wrote:
   Hi Olga
   
   With every PIG job is there any way for us to trap into the operational
   stats of that job, like number of input lines it processed, number of
   output lines it produced?
   
   I don't want to have a separate PIG script to do the same, as it may mean
   additional parsing, so is there such a stat? If not, can that be
   provided, and exposed as a config parameter?
   
   -Supreeth
 This will be a great feature to have for our processing.




[jira] Created: (PIG-1465) Filter inside foreach is broken

2010-06-25 Thread hc busy (JIRA)
Filter inside foreach is broken
---

 Key: PIG-1465
 URL: https://issues.apache.org/jira/browse/PIG-1465
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: hc busy


{quote}
% cat data.txt
x,a,1,a
x,a,2,a
x,a,3,b
x,a,4,b
y,a,1,a
y,a,2,a
y,a,3,b
y,a,4,b
% cat script.pig
a = load 'data' as (ind:chararray, f1:chararray, num:int, f2:chararray);
b = group a by ind;
describe b;
f = foreach b{
all_total = SUM(a.num);
fed  = filter a by (f1==f2);
some_total = (int)SUM(fed.num);
generate group as ind, all_total, some_total;
}
describe f;
dump f;
% pig -f script.pig
(x,a,1,a,,)
(x,a,2,a,,)
(x,a,3,b,,)
(x,a,4,b,,)
(y,a,1,a,,)
(y,a,2,a,,)
(y,a,3,b,,)
(y,a,4,b,,)
% cat what_I_expected
(x,10,3)
(y,10,3)
{quote}
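
As a cross-check of the expected output, the same totals can be worked out by hand. The following is a minimal sketch in plain Java, not Pig: ExpectedTotals and totals are hypothetical names, and it assumes the rows are split on commas into (ind, f1, num, f2), as the load schema in the script intends.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, NOT Pig code: recomputes the totals the script is
// expected to produce so they can be compared with what_I_expected.
final class ExpectedTotals {

    /** Returns ind -> {all_total, some_total} for comma-separated rows. */
    static Map<String, long[]> totals(String[] rows) {
        Map<String, long[]> out = new LinkedHashMap<>();
        for (String row : rows) {
            String[] f = row.split(",");     // ind, f1, num, f2
            long num = Long.parseLong(f[2]);
            long[] t = out.computeIfAbsent(f[0], k -> new long[2]);
            t[0] += num;                     // all_total = SUM(a.num)
            if (f[1].equals(f[3])) {
                t[1] += num;                 // some_total = SUM(fed.num), where f1 == f2
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] rows = {
            "x,a,1,a", "x,a,2,a", "x,a,3,b", "x,a,4,b",
            "y,a,1,a", "y,a,2,a", "y,a,3,b", "y,a,4,b"
        };
        totals(rows).forEach((ind, t) ->
            System.out.println("(" + ind + "," + t[0] + "," + t[1] + ")"));
    }
}
```

For each group, 1+2+3+4 gives 10 and only the rows with num 1 and 2 have f1 equal to f2, giving 3, so the sketch prints (x,10,3) and (y,10,3), matching what_I_expected.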





[jira] Updated: (PIG-1465) Filter inside foreach is broken

2010-06-25 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1465:
-

Description: 
{quote}
% cat data.txt
x,a,1,a
x,a,2,a
x,a,3,b
x,a,4,b
y,a,1,a
y,a,2,a
y,a,3,b
y,a,4,b
% cat script.pig
a = load 'data' as (ind:chararray, f1:chararray, num:int, f2:chararray);
b = group a by ind;
describe b;
f = foreach b\{
all_total = SUM(a.num);
fed  = filter a by (f1==f2);
some_total = (int)SUM(fed.num);
generate group as ind, all_total, some_total;
\}
describe f;
dump f;
% pig -f script.pig
(x,a,1,a,,)
(x,a,2,a,,)
(x,a,3,b,,)
(x,a,4,b,,)
(y,a,1,a,,)
(y,a,2,a,,)
(y,a,3,b,,)
(y,a,4,b,,)
% cat what_I_expected
(x,10,3)
(y,10,3)
{quote}

  was:
{quote}
% cat data.txt
x,a,1,a
x,a,2,a
x,a,3,b
x,a,4,b
y,a,1,a
y,a,2,a
y,a,3,b
y,a,4,b
% cat script.pig
a = load 'data' as (ind:chararray, f1:chararray, num:int, f2:chararray);
b = group a by ind;
describe b;
f = foreach b{
all_total = SUM(a.num);
fed  = filter a by (f1==f2);
some_total = (int)SUM(fed.num);
generate group as ind, all_total, some_total;
}
describe f;
dump f;
% pig -f script.pig
(x,a,1,a,,)
(x,a,2,a,,)
(x,a,3,b,,)
(x,a,4,b,,)
(y,a,1,a,,)
(y,a,2,a,,)
(y,a,3,b,,)
(y,a,4,b,,)
% cat what_I_expected
(x,10,3)
(y,10,3)
{quote}






[jira] Commented: (PIG-1464) Should clean the Graph when register another Pig Script

2010-06-25 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882673#action_12882673
 ] 

Alan Gates commented on PIG-1464:
-

I agree this is weird from an interface viewpoint.  But I have a couple of 
concerns about changing it.  One, it isn't backward compatible.  After the mess 
we dragged users through in 0.7, we're really trying not to change anything for 
0.8.  The other concern is that the current behavior gives users a (very hacky) 
way to build Pig Latin modules and use them together.  We'd like to come up 
with a clean way to do this (see http://wiki.apache.org/pig/TuringCompletePig ).  
But until then I'm wondering if we should leave this as it is.




[jira] Assigned: (PIG-1435) make sure dependent jobs fail when a job in multiquery fails

2010-06-25 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding reassigned PIG-1435:
-

Assignee: niraj rai  (was: Richard Ding)

 make sure dependent jobs fail when a job in multiquery fails
 

 Key: PIG-1435
 URL: https://issues.apache.org/jira/browse/PIG-1435
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: niraj rai
 Fix For: 0.8.0


 Currently, if one of the MQ jobs fails, Pig tries to run all remaining jobs. 
 As a result, if data was partially generated by the failed job, you might get 
 incorrect results from dependent jobs. 




[jira] Created: (PIG-1466) Improve log messages for memory usage

2010-06-25 Thread Ashutosh Chauhan (JIRA)
Improve log messages for memory usage
-

 Key: PIG-1466
 URL: https://issues.apache.org/jira/browse/PIG-1466
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Priority: Minor


For anything more than a moderately sized dataset, Pig usually emits the 
following messages:
{code}
2010-05-27 18:28:31,659 INFO org.apache.pig.impl.util.SpillableMemoryManager: low memory handler called (Usage threshold exceeded) init = 4194304(4096K) used = 672012960(656262K) committed = 954466304(932096K) max = 954466304(932096K)

2010-05-27 18:10:52,653 INFO org.apache.pig.impl.util.SpillableMemoryManager: low memory handler called (Collection threshold exceeded) init = 4194304(4096K) used = 954466304(932096K) committed = 954466304(932096K) max = 954466304(932096K)
{code}

This seems to confuse users a lot. Once these messages are printed, users tend 
to believe that Pig is having a hard time with memory, is spilling to disk, 
etc., when in fact Pig might be cruising along at ease. We should be a little 
more careful about what we print in logs. Currently these are printed when a 
notification is sent by the JVM and some other conditions are met, which may 
not necessarily indicate a low-memory condition. Furthermore, with 
{{InternalCachedBag}} embraced everywhere in favor of {{DefaultBag}}, these 
messages have lost their usefulness. At the very least, we should lower the 
log level at which these are printed. 




[jira] Commented: (PIG-1434) Allow casting relations to scalars

2010-06-25 Thread Aniket Mokashi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882711#action_12882711
 ] 

Aniket Mokashi commented on PIG-1434:
-

The proposal for scalars is as follows -
{code}
A = load '1.txt' as (a1, a2);
B = group A all;
C = foreach B generate COUNT(A);
Y = foreach A generate C;
store Y into 'Ystore';
{code}
Based on the schema of C, we detect that Y means to use C as a scalar and 
internally track it as a scalar. Thus, operations like C * C are also allowed. 
The limitation is that C should have a long-convertible value (when stored into 
the file). Also, (int) C would be allowed and will succeed if the cast 
operation succeeds.

As mentioned by Daniel earlier, there are two challenges in introducing 
scalars:
1. Addition of an implicit store. We cannot do it too early (at parsing), as we 
would get a redundant (implicit) store operation for the rest of the commands 
in the script. If we do it too late, the merge algorithm doesn't find the store 
and discards the branch that compiles and executes the store.
To solve this, whenever we process a store plan after the parsing stage, we 
detect the existence of scalars in the plan and add the required branches that 
contain those scalars to the current plan. We also attach LOStores for the 
scalars and merge the required plan.
2. Tracking of the implicit dependency. The existence of scalar C needs to be 
converted into an implicit ReadScalar operation, but beyond this it also needs 
to add a dependency on the map-reduce job that generates this scalar value. We 
track this dependency by adding LOScalar and POScalar operators that carry a 
reference to the scalar they depend upon. When we compile the map-reduce plan, 
we replace POScalar with POUserFunc to load the scalar value and mark the 
dependency between the two map-reduce jobs.

I am attaching the patch with the above-mentioned changes.

A few known issues:
To track the dependencies of scalars, we need access to the map of operators 
from one type of plan to the other, but this map is generated by visitors. The 
same visitors are responsible for converting LOScalar -> POScalar -> 
POUserFunc. So, if a visitor visits LOScalar before the LO associated with the 
scalar (C in the example), we do not find the PO associated with C. 
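
The single-value rule for scalars can be illustrated outside Pig with a small sketch. Everything here is an assumption made for illustration: ScalarCastSketch and toLongScalar are hypothetical names, a relation is modeled as a list of tuples, and this is not the code in the attached patch.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical model, NOT the patch: a relation is a list of tuples and
// a tuple is a list of fields.
final class ScalarCastSketch {

    /** Converts a relation holding exactly one single-field tuple to a long. */
    static long toLongScalar(List<List<Object>> relation) {
        if (relation.size() != 1 || relation.get(0).size() != 1) {
            throw new IllegalArgumentException(
                "Scalar requires exactly one row with one field");
        }
        Object value = relation.get(0).get(0);
        if (!(value instanceof Number)) {
            throw new IllegalArgumentException(
                "Value is not convertible to long: " + value);
        }
        return ((Number) value).longValue();
    }

    public static void main(String[] args) {
        // C = foreach B generate COUNT(A);  -> one tuple holding one count
        List<List<Object>> c =
            Collections.singletonList(Collections.<Object>singletonList(4L));
        long scalar = toLongScalar(c);
        System.out.println(scalar * scalar); // models an expression such as C * C
    }
}
```

A relation with zero rows, several rows, or a non-numeric field is rejected up front, which corresponds to the long-convertible limitation described above.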

 Allow casting relations to scalars
 --

 Key: PIG-1434
 URL: https://issues.apache.org/jira/browse/PIG-1434
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Aniket Mokashi
 Fix For: 0.8.0

 Attachments: scalarImpl.patch


 This jira is to implement a simplified version of the functionality described 
 in https://issues.apache.org/jira/browse/PIG-801.
 The proposal is to allow casting relations to scalar types in foreach.
 Example:
 A = load 'data' as (x, y, z);
 B = group A all;
 C = foreach B generate COUNT(A);
 .
 X = 
 Y = foreach X generate $1/(long) C;
 Couple of additional comments:
 (1) You can only cast relations including a single value or an error will be 
 reported
 (2) Name resolution is needed since relation X might have field named C in 
 which case that field takes precedence.
 (3) Y will look for C closest to it.
 Implementation thoughts:
 The idea is to store C into a file and then convert it into scalar via a UDF. 
 I believe we already have a UDF that Ben Reed contributed for this purpose. 
 Most of the work would be to update the logical plan to
 (1) Store C
 (2) convert the cast to the UDF




[jira] Updated: (PIG-1434) Allow casting relations to scalars

2010-06-25 Thread Aniket Mokashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi updated PIG-1434:


Attachment: scalarImpl.patch

Initial implementation




[jira] Updated: (PIG-1434) Allow casting relations to scalars

2010-06-25 Thread Aniket Mokashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi updated PIG-1434:


Status: Patch Available  (was: Open)




[jira] Commented: (PIG-1434) Allow casting relations to scalars

2010-06-25 Thread Aniket Mokashi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882725#action_12882725
 ] 

Aniket Mokashi commented on PIG-1434:
-

Submitting to hudson to check for test failures




[jira] Commented: (PIG-1466) Improve log messages for memory usage

2010-06-25 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882731#action_12882731
 ] 

Alan Gates commented on PIG-1466:
-

Rather than change the log level can we change it to only print when we truly 
spill a {{DefaultBag}}?  It would be nice to know if there are any cases where 
we are still doing that.

 Improve log messages for memory usage
 -

 Key: PIG-1466
 URL: https://issues.apache.org/jira/browse/PIG-1466
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Priority: Minor

 For anything more than a moderately sized dataset, Pig usually prints the 
 following messages:
 {code}
 2010-05-27 18:28:31,659 INFO org.apache.pig.impl.util.SpillableMemoryManager: low memory handler called (Usage threshold exceeded) init = 4194304(4096K) used = 672012960(656262K) committed = 954466304(932096K) max = 954466304(932096K)
 2010-05-27 18:10:52,653 INFO org.apache.pig.impl.util.SpillableMemoryManager: low memory handler called (Collection threshold exceeded) init = 4194304(4096K) used = 954466304(932096K) committed = 954466304(932096K) max = 954466304(932096K)
 {code}
 This seems to confuse users a lot. Once these messages are printed, users 
 tend to believe that Pig is having a hard time with memory, is spilling to 
 disk, etc., but in fact Pig might be cruising along with ease. We should be a 
 little more careful about what we print in the logs. Currently these are 
 printed when the JVM sends a notification and some other conditions are met, 
 which may not necessarily indicate a low-memory condition. Furthermore, with 
 {{InternalCachedBag}} embraced everywhere in favor of {{DefaultBag}}, these 
 messages have lost their usefulness. At the very least, we should lower the 
 log level at which these are printed. 
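
For context, the two message variants quoted above correspond to the two kinds of memory-threshold notification that the standard java.lang.management API can emit. The sketch below shows how such notifications are registered and received with the JDK alone. It is an assumption-laden illustration, not the code of SpillableMemoryManager: the class name MemoryThresholdSketch, the describe helper, and the 0.7 threshold fraction are all hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryNotificationInfo;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import javax.management.Notification;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;

/** Hypothetical sketch of JVM memory-threshold notifications; not Pig code. */
public class MemoryThresholdSketch {

    /** Maps a JMX notification type to the wording seen in the log messages. */
    public static String describe(String notificationType) {
        if (MemoryNotificationInfo.MEMORY_THRESHOLD_EXCEEDED.equals(notificationType)) {
            return "Usage threshold exceeded";
        }
        if (MemoryNotificationInfo.MEMORY_COLLECTION_THRESHOLD_EXCEEDED.equals(notificationType)) {
            return "Collection threshold exceeded";
        }
        return "Unrelated notification";
    }

    public static void main(String[] args) {
        // Set both thresholds on every heap pool that supports them.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue;
            }
            long max = pool.getUsage().getMax();
            if (max <= 0) {
                continue; // the maximum is undefined (-1) for some pools
            }
            long threshold = (long) (max * 0.7); // assumed fraction, for illustration
            if (pool.isUsageThresholdSupported()) {
                pool.setUsageThreshold(threshold);
            }
            if (pool.isCollectionUsageThresholdSupported()) {
                pool.setCollectionUsageThreshold(threshold);
            }
        }

        // The JVM delivers a notification when a threshold is crossed.
        // That alone does not mean anything was spilled to disk.
        NotificationListener listener = (Notification n, Object handback) -> {
            MemoryNotificationInfo info =
                MemoryNotificationInfo.from((CompositeData) n.getUserData());
            System.out.println(describe(n.getType()) + ": " + info.getUsage());
        };
        NotificationEmitter emitter =
            (NotificationEmitter) ManagementFactory.getMemoryMXBean();
        emitter.addNotificationListener(listener, null, null);
    }
}
```

A listener like this fires whenever pool usage crosses the configured threshold, whether or not the application then frees or spills anything, which is consistent with the point above that the messages need not indicate a real low-memory condition.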




[jira] Commented: (PIG-1467) order by fail when set fs.file.impl.disable.cache to true

2010-06-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882813#action_12882813
 ] 

Hadoop QA commented on PIG-1467:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12448105/PIG-1467-2.patch
  against trunk revision 958053.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 145 javac compiler warnings (more 
than the trunk's current 140 warnings).

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/353/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/353/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/353/console

This message is automatically generated.

 order by fail when set fs.file.impl.disable.cache to true
 ---

 Key: PIG-1467
 URL: https://issues.apache.org/jira/browse/PIG-1467
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.7.0, 0.8.0

 Attachments: PIG-1467-1.patch, PIG-1467-2.patch


 Order by fails with the message:
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:135)
 at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
 at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
 at org.apache.hadoop.mapred.MapTask$NewOutputCollector.init(MapTask.java:551)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:630)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
 at org.apache.hadoop.mapred.Child.main(Child.java:211)
 This happens with the following hadoop settings:
 fs.file.impl.disable.cache=true
 fs.hdfs.impl.disable.cache=true
