[jira] Created: (PIG-1464) Should clean the Graph when register another Pig Script
Should clean the Graph when register another Pig Script
---
Key: PIG-1464
URL: https://issues.apache.org/jira/browse/PIG-1464
Project: Pig
Issue Type: Bug
Components: grunt
Affects Versions: 0.8.0
Reporter: Jeff Zhang
Assignee: Jeff Zhang
Fix For: 0.8.0

In the current implementation, the variable names (relation aliases) in a Pig script are all global. This lets one Pig script see the variables defined in other scripts registered on the same PigServer. In my opinion, this is not right. Every relation name in a Pig script should be a local variable; otherwise it can produce unexpected results. This issue relates to PIG-1423.

E.g., there are two Pig scripts as follows:

Test_1.pig
{code}
a = load 'data/b.txt';
{code}

Test_2.pig
{code}
b = foreach a generate $0; -- a is recognized by Grunt although it is defined in Test_1.pig
{code}

The following executes normally and does not throw any exception:
{code}
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

PigServer pig = new PigServer(ExecType.LOCAL);
pig.registerScript("Test_1.pig");
pig.registerScript("Test_2.pig");
{code}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
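The scoping behaviour described in this report can be modelled with a small sketch. This is only a toy illustration in Python, not Pig code: ToyServer, its share_aliases flag, and register_script are hypothetical names invented here, and a script is reduced to a list of (alias, referenced aliases) pairs. With a shared alias table the second script resolves `a`; with a table that is cleared per script it does not.

```python
class ToyServer:
    """Toy stand-in for a script host; NOT the real PigServer."""

    def __init__(self, share_aliases):
        self.share_aliases = share_aliases
        self.aliases = {}

    def register_script(self, statements):
        # statements: list of (alias, list of aliases it refers to)
        if not self.share_aliases:
            self.aliases = {}  # start every script with a clean alias table
        for alias, refs in statements:
            for ref in refs:
                if ref not in self.aliases:
                    raise NameError(f"undefined alias: {ref}")
            self.aliases[alias] = refs


test_1 = [("a", [])]       # a = load 'data/b.txt';
test_2 = [("b", ["a"])]    # b = foreach a generate $0;

# Behaviour reported in the issue: aliases are shared across scripts.
shared = ToyServer(share_aliases=True)
shared.register_script(test_1)
shared.register_script(test_2)  # succeeds, 'a' leaks in from Test_1.pig

# Behaviour proposed in the issue: each script gets its own alias table.
isolated = ToyServer(share_aliases=False)
isolated.register_script(test_1)
try:
    isolated.register_script(test_2)
    leaked = True
except NameError:
    leaked = False
```

In this model the proposed change turns a silent cross-script reference into an error, which is also why it would not be backward compatible for anyone relying on the shared table.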
[jira] Updated: (PIG-1464) Should clean the Graph when register another Pig Script
[ https://issues.apache.org/jira/browse/PIG-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated PIG-1464: Attachment: Pig-1406.patch Attach the patch for this issue
[jira] Updated: (PIG-1464) Should clean the Graph when register another Pig Script
[ https://issues.apache.org/jira/browse/PIG-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated PIG-1464: Attachment: PIG_1463.patch
[jira] Updated: (PIG-1464) Should clean the Graph when register another Pig Script
[ https://issues.apache.org/jira/browse/PIG-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated PIG-1464: Status: Patch Available (was: Open)
[jira] Updated: (PIG-1464) Should clean the Graph when register another Pig Script
[ https://issues.apache.org/jira/browse/PIG-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated PIG-1464: Attachment: (was: Pig-1406.patch)
compile load to mr plan
Hi,

Multiple load operators in a script start the same number of streams; some of them are merged later (e.g., by a join) and some are not. How do we know which MR operator each of these loads should be placed in? For example, given a script like this:

a = load file1
b = load file2
..
dump

If we join a and b between the loads and the dump, the two loads (a and b) should be placed in the same MR operator. If we sort a and b independently, the two loads should be placed in separate MR operators. How do we identify whether these two streams are correlated or not?

A further question: can we specify a directory so that load will read all the files in that directory? Each reducer of an MR job produces a single file, so when a subsequent MR job needs to read all of these files, what do we do?

Thanks,
-Gang
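One way to think about the "correlated or not" part of this question is as a connectivity check on the operator graph: two loads are related if their streams ever reach a common downstream operator. The sketch below is a plain Python illustration of that idea using union-find. It is not how Pig's compiler is implemented, the operator names are made up, and connectivity alone does not decide MR operator placement (a shuffle boundary such as a sort before a join would still split the work across jobs).

```python
from collections import defaultdict


def group_correlated_loads(edges, loads):
    """Group load operators whose streams reach a common downstream operator.

    edges: list of (upstream, downstream) operator names (hypothetical plan).
    loads: names of the load operators.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for src, dst in edges:
        union(src, dst)

    groups = defaultdict(list)
    for load in loads:
        groups[find(load)].append(load)
    return sorted(sorted(g) for g in groups.values())


# a and b are joined before the dump: one connected plan.
joined = [("a", "join"), ("b", "join"), ("join", "dump")]

# a and b are sorted and dumped independently: two disconnected plans.
independent = [("a", "sort_a"), ("sort_a", "dump_a"),
               ("b", "sort_b"), ("sort_b", "dump_b")]

same_plan = group_correlated_loads(joined, ["a", "b"])
separate_plans = group_correlated_loads(independent, ["a", "b"])
```

Here same_plan contains a single group with both loads, while separate_plans contains one group per load.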
[jira] Commented: (PIG-1464) Should clean the Graph when register another Pig Script
[ https://issues.apache.org/jira/browse/PIG-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882583#action_12882583 ]

Hadoop QA commented on PIG-1464:

-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12448030/PIG_1463.patch against trunk revision 957753.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/350/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/350/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/350/console

This message is automatically generated.
[jira] Commented: (PIG-1454) Consider clean up backend code
[ https://issues.apache.org/jira/browse/PIG-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882640#action_12882640 ]

Richard Ding commented on PIG-1454:
---
I've run the core tests manually and they passed.

Consider clean up backend code
--
Key: PIG-1454
URL: https://issues.apache.org/jira/browse/PIG-1454
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
Fix For: 0.8.0
Attachments: PIG-1454.patch

Prior to 0.7, Pig had its own local execution mode in addition to the Hadoop map-reduce execution mode. To support these two execution modes, Pig implemented an abstraction layer with a set of interfaces and abstract classes. Pig 0.7 replaced the local mode with Hadoop local mode, which made this abstraction layer redundant.

Our goal is to remove that extra code. But we also need to keep the code backward compatible, since some of the interfaces are exposed by the top-level API. So we propose these first steps:

* Deprecate the methods on FileLocalizer that take DataStorage as a parameter.
* Remove ExecPhysicalOperator, ExecPhysicalPlan, ExecScopedLogicalOperator, ExecutionEngine and util/ExecTools from the org.apache.pig.backend.executionengine package.
[jira] Commented: (PIG-1454) Consider clean up backend code
[ https://issues.apache.org/jira/browse/PIG-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882655#action_12882655 ] Olga Natkovich commented on PIG-1454: - +1
[jira] Commented: (PIG-809) number of input lines it processed, number of output lines it produced for PIG job
[ https://issues.apache.org/jira/browse/PIG-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882667#action_12882667 ]

Richard Ding commented on PIG-809:
--
PIG-1299 and PIG-1389 address this requirement: the number of records read from each user input and written to each user output in a script will be written to the Pig log at the end of execution.

number of input lines it processed, number of output lines it produced for PIG job
--
Key: PIG-809
URL: https://issues.apache.org/jira/browse/PIG-809
Project: Pig
Issue Type: Improvement
Components: impl
Environment: Linux
Reporter: Supreeth
Assignee: Richard Ding
Fix For: 0.8.0

Excerpt from the mail conversation.

It will be a great addition to Pig. Hadoop currently provides all these counters. All Pig has to do is to add them up for all Hadoop jobs in the script, and emit them at the end of the script. File a jira ?
- Milind

On 5/13/09 8:16 AM, Supreeth Hosur Nagesh Rao supre...@yahoo-inc.com wrote:

Hi Olga

With every PIG job, is there any way for us to tap into the operational stats of that job, like the number of input lines it processed and the number of output lines it produced? I don't want to have a separate PIG script to do the same, as it may mean additional parsing, so is there such a stat? If not, can that be provided and exposed as a config parameter?
-Supreeth

This will be a great feature to have for our processing.
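The comment above reports records per user input and per user output rather than a raw sum over all Hadoop jobs. A small Python sketch shows why that distinction matters. The job dictionaries, paths, and the summarize helper are hypothetical and are not Pig's or Hadoop's counter API; the point is only that intermediate files between jobs would be double counted by a naive sum.

```python
def summarize(jobs, user_inputs, user_outputs):
    """Sum per-path record counts, keeping only the paths the user named.

    jobs: list of dicts with 'inputs' and 'outputs' mapping path -> records.
    """
    read = {path: 0 for path in user_inputs}
    written = {path: 0 for path in user_outputs}
    for job in jobs:
        for path, records in job["inputs"].items():
            if path in read:
                read[path] += records
        for path, records in job["outputs"].items():
            if path in written:
                written[path] += records
    return read, written


# Two chained jobs: the second one reads the temporary output of the first.
jobs = [
    {"inputs": {"in.txt": 100}, "outputs": {"tmp/job1": 40}},
    {"inputs": {"tmp/job1": 40}, "outputs": {"out": 10}},
]

read, written = summarize(jobs, user_inputs=["in.txt"], user_outputs=["out"])

# A naive total over every job input also counts the temporary file.
naive_read = sum(n for job in jobs for n in job["inputs"].values())
```

In this made-up run the script read 100 records from its one user input and wrote 10 to its one user output, while the naive per-job total of input records is 140.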
[jira] Created: (PIG-1465) Filter inside foreach is broken
Filter inside foreach is broken
---
Key: PIG-1465
URL: https://issues.apache.org/jira/browse/PIG-1465
Project: Pig
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: hc busy

{quote}
% cat data.txt
x,a,1,a
x,a,2,a
x,a,3,b
x,a,4,b
y,a,1,a
y,a,2,a
y,a,3,b
y,a,4,b
% cat script.pig
a = load 'data' as (ind:chararray, f1:chararray, num:int, f2:chararray);
b = group a by ind;
describe b;
f = foreach b{
  all_total = SUM(a.num);
  fed = filter a by (f1==f2);
  some_total = (int)SUM(fed.num);
  generate group as ind, all_total, some_total;
}
describe f;
dump f;
% pig -f script.pig
(x,a,1,a,,)
(x,a,2,a,,)
(x,a,3,b,,)
(x,a,4,b,,)
(y,a,1,a,,)
(y,a,2,a,,)
(y,a,3,b,,)
(y,a,4,b,,)
% cat what_I_expected
(x,10,3)
(y,10,3)
{quote}
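The expected result in this report can be checked by hand with a short Python model of the group, sum, and nested filter. The totals helper below is hypothetical and only mimics the script's arithmetic; nulls are modelled as None. As an aside that is an inference and not something stated in the thread: the reported output has the same shape as what this model produces when the comma-separated lines are split on a tab, i.e. when the whole line ends up in the first field.

```python
from collections import defaultdict

DATA = """x,a,1,a
x,a,2,a
x,a,3,b
x,a,4,b
y,a,1,a
y,a,2,a
y,a,3,b
y,a,4,b""".splitlines()


def totals(lines, delimiter):
    """Model of: group by ind; SUM(num); SUM(num) over rows where f1 == f2."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split(delimiter)
        fields += [None] * (4 - len(fields))  # missing fields become null
        ind, f1, num, f2 = fields[:4]
        groups[ind].append((f1, int(num) if num is not None else None, f2))

    result = []
    for ind in sorted(groups):
        rows = groups[ind]
        nums = [n for _, n, _ in rows if n is not None]
        all_total = sum(nums) if nums else None
        # A comparison involving null does not pass the filter.
        fed = [n for f1, n, f2 in rows
               if n is not None and f1 is not None and f1 == f2]
        some_total = sum(fed) if fed else None
        result.append((ind, all_total, some_total))
    return result


expected = totals(DATA, ",")       # fields split on commas
tab_split = totals(DATA, "\t")     # whole line lands in the first field
```

With comma splitting the model yields one row per group with totals 10 and 3, matching what_I_expected; with tab splitting it yields eight rows whose first field is the entire input line and whose totals are both null.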
[jira] Updated: (PIG-1465) Filter inside foreach is broken
[ https://issues.apache.org/jira/browse/PIG-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hc busy updated PIG-1465:
-
Description:
{quote}
% cat data.txt
x,a,1,a
x,a,2,a
x,a,3,b
x,a,4,b
y,a,1,a
y,a,2,a
y,a,3,b
y,a,4,b
% cat script.pig
a = load 'data' as (ind:chararray, f1:chararray, num:int, f2:chararray);
b = group a by ind;
describe b;
f = foreach b\{
  all_total = SUM(a.num);
  fed = filter a by (f1==f2);
  some_total = (int)SUM(fed.num);
  generate group as ind, all_total, some_total;
\}
describe f;
dump f;
% pig -f script.pig
(x,a,1,a,,)
(x,a,2,a,,)
(x,a,3,b,,)
(x,a,4,b,,)
(y,a,1,a,,)
(y,a,2,a,,)
(y,a,3,b,,)
(y,a,4,b,,)
% cat what_I_expected
(x,10,3)
(y,10,3)
{quote}
[jira] Commented: (PIG-1464) Should clean the Graph when register another Pig Script
[ https://issues.apache.org/jira/browse/PIG-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882673#action_12882673 ]

Alan Gates commented on PIG-1464:
-
I agree this is weird from an interface viewpoint. But I have a couple of concerns about changing it. One, it isn't backward compatible. After the mess we dragged users through in 0.7, we're really trying not to change anything for 0.8. The other concern is that it gives users a very hacky way to build Pig Latin modules and use them together. We'd like to come up with a clean way to do this (see http://wiki.apache.org/pig/TuringCompletePig ). But until then I'm wondering if we should leave this there.
[jira] Assigned: (PIG-1435) make sure dependent jobs fail when a job in multiquery fails
[ https://issues.apache.org/jira/browse/PIG-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding reassigned PIG-1435:
-
Assignee: niraj rai (was: Richard Ding)

make sure dependent jobs fail when a job in multiquery fails
---
Key: PIG-1435
URL: https://issues.apache.org/jira/browse/PIG-1435
Project: Pig
Issue Type: Bug
Reporter: Olga Natkovich
Assignee: niraj rai
Fix For: 0.8.0

Currently, if one of the MQ jobs fails, Pig tries to run all remaining jobs. As a result, if data was partially generated by the failed job, you might get incorrect results from dependent jobs.
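The behaviour this issue asks for can be sketched as a walk over the job dependency graph in which a job only runs if all of its prerequisites succeeded. This is a minimal Python illustration under stated assumptions, not Pig's job control code: run_jobs, the job names, and the status strings are all hypothetical, and jobs are assumed to be listed in dependency order.

```python
def run_jobs(jobs, deps, run):
    """Run jobs in order, skipping any job whose prerequisite did not succeed.

    jobs: job names in topological order.
    deps: dict mapping a job to the set of jobs it depends on.
    run:  callable returning True on success, False on failure.
    """
    status = {}
    for job in jobs:
        if any(status[d] != "ok" for d in deps.get(job, ())):
            status[job] = "skipped"  # never consume partial output
        else:
            status[job] = "ok" if run(job) else "failed"
    return status


jobs = ["j1", "j2", "j3", "j4"]
deps = {"j2": {"j1"}, "j4": {"j2"}}  # j3 is independent of j1


def fake_run(job):
    # Pretend the first job fails and every other job would succeed.
    return job != "j1"


status = run_jobs(jobs, deps, fake_run)
```

In this example the failure of j1 prevents j2 and, transitively, j4 from running, while the unrelated job j3 still completes.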
[jira] Created: (PIG-1466) Improve log messages for memory usage
Improve log messages for memory usage
-
Key: PIG-1466
URL: https://issues.apache.org/jira/browse/PIG-1466
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Priority: Minor

For anything more than a moderately sized dataset, Pig usually emits the following messages:
{code}
2010-05-27 18:28:31,659 INFO org.apache.pig.impl.util.SpillableMemoryManager: low memory handler called (Usage threshold exceeded) init = 4194304(4096K) used = 672012960(656262K) committed = 954466304(932096K) max = 954466304(932096K)
2010-05-27 18:10:52,653 INFO org.apache.pig.impl.util.SpillableMemoryManager: low memory handler called (Collection threshold exceeded) init = 4194304(4096K) used = 954466304(932096K) committed = 954466304(932096K) max = 954466304(932096K)
{code}
This seems to confuse users a lot. Once these messages are printed, users tend to believe that Pig is having a hard time with memory, is spilling to disk, etc., when in fact Pig might be cruising along at ease. We should be a little more careful about what we print in the logs. Currently these messages are printed when a notification is sent by the JVM and some other conditions are met, which may not necessarily indicate a low-memory condition. Furthermore, with {{InternalCachedBag}} embraced everywhere in favor of {{DefaultBag}}, these messages have lost their usefulness. At the very least, we should lower the log level at which they are printed.
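To put the numbers in these log lines in perspective, the heap fields can be pulled out and turned into a usage fraction. The Python snippet below only parses the message text quoted in this issue; heap_usage is a hypothetical helper and the regular expression is an assumption based on that quoted format, not on any documented log layout.

```python
import re

USAGE_LINE = (
    "2010-05-27 18:28:31,659 INFO org.apache.pig.impl.util.SpillableMemoryManager: "
    "low memory handler called (Usage threshold exceeded) init = 4194304(4096K) "
    "used = 672012960(656262K) committed = 954466304(932096K) max = 954466304(932096K)"
)
COLLECTION_LINE = (
    "2010-05-27 18:10:52,653 INFO org.apache.pig.impl.util.SpillableMemoryManager: "
    "low memory handler called (Collection threshold exceeded) init = 4194304(4096K) "
    "used = 954466304(932096K) committed = 954466304(932096K) max = 954466304(932096K)"
)


def heap_usage(line):
    """Return used/max for a 'low memory handler called' log line."""
    fields = {
        name: int(value)
        for name, value in re.findall(r"(init|used|committed|max) = (\d+)", line)
    }
    return fields["used"] / fields["max"]


usage_fraction = heap_usage(USAGE_LINE)
collection_fraction = heap_usage(COLLECTION_LINE)
```

For the first message the heap is roughly 70% used when the handler fires, which supports the point that the message by itself does not mean the job is in trouble.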
[jira] Commented: (PIG-1434) Allow casting relations to scalars
[ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882711#action_12882711 ]

Aniket Mokashi commented on PIG-1434:
-
The proposal for scalars is as follows:
{code}
A = load '1.txt' as (a1, a2);
B = group A all;
C = foreach B generate COUNT(A);
Y = foreach A generate C;
store Y into 'Ystore';
{code}
Based on the schema of C, we detect that Y means to use C as a scalar, and internally we track it as a scalar. Thus, operations like C * C are also allowed. The limitation is that C should have a long-convertible value (when stored into the file). Also, (int) C would be allowed and will succeed if the cast operation succeeds.

As mentioned by Daniel earlier, there are two challenges in introducing scalars:

1. Addition of an implicit store. We cannot do it too early (at parsing), as we would get a redundant (implicit) store operation for the rest of the commands in the script. If we do it too late, the merge algorithm doesn't find the store and discards the branch that compiles and executes the store. To solve this, whenever we process a store plan after the parsing stage, we detect the existence of scalars in the plan and add the required branches that hold those scalars to the current plan. We also attach LOStores for the scalars and merge the required plan.

2. Tracking of the implicit dependency. The existence of scalar C needs to be converted into an implicit ReadScalar operation, but beyond that it also needs to add a dependency on the map-reduce job that generates this scalar value. We track this dependency by adding LOScalar and POScalar operators that carry a reference to the scalar they depend upon. When we compile the map-reduce plan, we replace POScalar with a POUserFunc that loads the scalar value, and we mark the dependency between the two map-reduce jobs.

I am attaching the patch with the above-mentioned changes.

A few known issues: to track the dependencies of scalars, we need access to the map of operators from one type of plan to the other, but this map is generated by visitors. The same visitors are responsible for converting LOScalar -> POScalar -> POUserFunc. So, if a visitor visits an LOScalar before the LO associated with the scalar (C in the example), we do not find the PO associated with C.

Allow casting relations to scalars
--
Key: PIG-1434
URL: https://issues.apache.org/jira/browse/PIG-1434
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Aniket Mokashi
Fix For: 0.8.0
Attachments: scalarImpl.patch

This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801. The proposal is to allow casting relations to scalar types in foreach.

Example:

A = load 'data' as (x, y, z);
B = group A all;
C = foreach B generate COUNT(A);
...
X = ...
Y = foreach X generate $1/(long) C;

A couple of additional comments:
(1) You can only cast relations that include a single value, or an error will be reported.
(2) Name resolution is needed, since relation X might have a field named C, in which case that field takes precedence.
(3) Y will look for the C closest to it.

Implementation thoughts: the idea is to store C into a file and then convert it into a scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to (1) store C and (2) convert the cast to the UDF.
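The "store C into a file, then read it back as a scalar" idea from the implementation notes can be illustrated with a short Python sketch. It is a toy model only: store and read_scalar are hypothetical helpers standing in for the implicit store and the scalar-reading UDF, the tab-separated file layout is an assumption, and relations are plain lists of tuples.

```python
import os
import tempfile


def store(relation, path):
    """Write a relation (list of tuples) as tab-separated lines."""
    with open(path, "w") as f:
        for row in relation:
            f.write("\t".join(str(v) for v in row) + "\n")


def read_scalar(path):
    """Read a stored relation back as a single long value.

    Mirrors the stated restriction: the relation must contain exactly
    one value, otherwise an error is reported.
    """
    with open(path) as f:
        rows = [line.rstrip("\n").split("\t") for line in f]
    if len(rows) != 1 or len(rows[0]) != 1:
        raise ValueError("scalar cast needs exactly one row with one field")
    return int(rows[0][0])


A = [(1, 2), (3, 4), (5, 6)]   # A = load '1.txt' as (a1, a2);
C = [(len(A),)]                # C = foreach (group A all) generate COUNT(A);

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "C")
    store(C, path)             # the implicit store of the scalar relation
    c = read_scalar(path)      # the job using C depends on that store
    Y = [(c,) for _ in A]      # Y = foreach A generate C;
```

The ordering inside the with block is the dependency the comment describes: the value can only be read once the step that produces C has finished writing it.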
[jira] Updated: (PIG-1434) Allow casting relations to scalars
[ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Mokashi updated PIG-1434: Attachment: scalarImpl.patch Initial implementation
[jira] Updated: (PIG-1434) Allow casting relations to scalars
[ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Mokashi updated PIG-1434: Status: Patch Available (was: Open)
[jira] Commented: (PIG-1434) Allow casting relations to scalars
[ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882725#action_12882725 ] Aniket Mokashi commented on PIG-1434: - Submitting to hudson to check for test failures
[jira] Commented: (PIG-1466) Improve log messages for memory usage
[ https://issues.apache.org/jira/browse/PIG-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882731#action_12882731 ]

Alan Gates commented on PIG-1466:
-
Rather than change the log level, can we change it to only print when we truly spill a {{DefaultBag}}? It would be nice to know if there are any cases where we are still doing that.
[jira] Commented: (PIG-1467) order by fail when set fs.file.impl.disable.cache to true
[ https://issues.apache.org/jira/browse/PIG-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882813#action_12882813 ]

Hadoop QA commented on PIG-1467:

-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12448105/PIG-1467-2.patch against trunk revision 958053.

+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
-1 javac. The applied patch generated 145 javac compiler warnings (more than the trunk's current 140 warnings).
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/353/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/353/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/353/console

This message is automatically generated.

order by fail when set fs.file.impl.disable.cache to true
---
Key: PIG-1467
URL: https://issues.apache.org/jira/browse/PIG-1467
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: 0.7.0, 0.8.0
Attachments: PIG-1467-1.patch, PIG-1467-2.patch

Order by fails with the message:

org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:135)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:551)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:630)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314)
at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
at org.apache.hadoop.mapred.Child.main(Child.java:211)

This happens with the following Hadoop settings:

fs.file.impl.disable.cache=true
fs.hdfs.impl.disable.cache=true