[jira] [Commented] (PIG-3581) Incorrect scope resolution with nested foreach
[ https://issues.apache.org/jira/browse/PIG-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855935#comment-13855935 ]

Cheolsoo Park commented on PIG-3581:

[~aniket486], I think this patch introduced a regression. Consider the following query:
{code}
a = LOAD 'foo' AS (x:int, y:chararray);
b = GROUP a BY x;
c = FOREACH b {
  expr = 'bar';
  filtered = FILTER a BY y == expr;
  GENERATE COUNT(filtered);
}
DESCRIBE c;
{code}
This used to work in 0.11 but no longer works in trunk. It looks like 'expr' used to be resolved to a scalar expression ('bar'), but that is no longer the case.

My questions are:
1. Is it supported to define a local scalar expression inside a nested foreach? e.g. expr = 'bar';
2. If so, can you fix the regression?

Incorrect scope resolution with nested foreach
----------------------------------------------

Key: PIG-3581
URL: https://issues.apache.org/jira/browse/PIG-3581
Project: Pig
Issue Type: Bug
Reporter: Venu Satuluri
Assignee: Aniket Mokashi
Attachments: PIG-3581-1.patch, PIG-3581-2.patch

Consider the following script:
{code}
A = LOAD 'test_data' AS (a: int, b: int);
C = FOREACH A GENERATE *;
B = FOREACH (GROUP A BY a) {
  C = FILTER A BY b % 2 == 0;
  D = FILTER A BY b % 2 == 1;
  GENERATE group AS a, A.b AS every, C.b AS even, D.b AS odd;
};
DESCRIBE B;
{code}
Notice that C is defined both inside the nested foreach and outside it. I would expect the GENERATE inside the nested FOREACH to use the C that is defined inside. If that is not so, I think at least a warning is due. However, Pig currently silently assumes that the C you mean is the one defined *outside* the nested FOREACH. Hence, the result of DESCRIBE B looks as follows:
{code}
B: { a: int, every: { ( b: int ) }, even: int, odd: { ( b: int ) } }
{code}
If I remove the definition of C that is outside the foreach, then I get the following for DESCRIBE B:
{code}
B: { a: int, every: { ( b: int ) }, even: { ( b: int ) }, odd: { ( b: int ) } }
{code}

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
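The behaviour the reporter expects is ordinary lexical shadowing: an alias defined inside the nested block should win over an alias of the same name defined outside. A minimal sketch of such a lookup follows. It is not Pig's actual alias resolver; the class `Scope` and its methods are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical scope chain: an inner scope shadows aliases of its parent. */
final class Scope {
    private final Scope parent;                                  // null for the outermost scope
    private final Map<String, String> aliases = new HashMap<>();

    Scope(Scope parent) {
        this.parent = parent;
    }

    void define(String alias, String definition) {
        aliases.put(alias, definition);
    }

    /** Looks in the current scope first, then walks outwards through the parents. */
    Optional<String> resolve(String alias) {
        for (Scope s = this; s != null; s = s.parent) {
            String definition = s.aliases.get(alias);
            if (definition != null) {
                return Optional.of(definition);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Scope script = new Scope(null);
        script.define("A", "LOAD 'test_data'");
        script.define("C", "FOREACH A GENERATE *");

        Scope nested = new Scope(script);
        nested.define("C", "FILTER A BY b % 2 == 0");

        // The inner definition of C shadows the outer one.
        System.out.println(nested.resolve("C").orElse("<undefined>")); // FILTER A BY b % 2 == 0
        // A is only defined outside, so the lookup falls through to the parent.
        System.out.println(nested.resolve("A").orElse("<undefined>")); // LOAD 'test_data'
    }
}
```

With this lookup order, `C.b` in the GENERATE would refer to the filtered bag, which matches the second DESCRIBE output in the report.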
[jira] [Updated] (PIG-3581) Incorrect scope resolution with nested foreach
[ https://issues.apache.org/jira/browse/PIG-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-3581:

Fix Version/s: 0.13.0
[jira] [Commented] (PIG-3581) Incorrect scope resolution with nested foreach
[ https://issues.apache.org/jira/browse/PIG-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855949#comment-13855949 ]

Aniket Mokashi commented on PIG-3581:

Let me look into this in detail. In general, expr = 'bar'; filtered = FILTER a BY y == expr; - using scalar is very bad (crazy). It is going to store the word 'bar' in a file on HDFS and read it back.
[jira] [Commented] (PIG-3581) Incorrect scope resolution with nested foreach
[ https://issues.apache.org/jira/browse/PIG-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855955#comment-13855955 ]

Cheolsoo Park commented on PIG-3581:

I am not saying it's a good thing to do. I was answering a user's question on the mailing list and was curious whether it's supported or not. For context, see here: http://www.mail-archive.com/user@pig.apache.org/msg08977.html
[jira] [Commented] (PIG-3581) Incorrect scope resolution with nested foreach
[ https://issues.apache.org/jira/browse/PIG-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855959#comment-13855959 ]

Cheolsoo Park commented on PIG-3581:

{quote}
It is going to store the word 'bar' in a file on HDFS and read it back.
{quote}
Plus, I don't think this is true. See the explain output in 0.11. There is no hdfs reading for 'bar'. It's a constant-
{code}
c: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-20
|
|---c: New For Each(false)[bag] - scope-19
    |   |
    |   POUserFunc(org.apache.pig.builtin.COUNT)[long] - scope-13
    |   |
    |   |---RelationToExpressionProject[bag][*] - scope-12
    |       |
    |       |---filtered: Filter[bag] - scope-15
    |           |   |
    |           |   Equal To[boolean] - scope-18
    |           |   |
    |           |   |---Project[chararray][1] - scope-16
    |           |   |
    |           |   |---Constant(bar) - scope-17
    |           |
    |           |---Project[bag][1] - scope-14
    |
    |---b: Package[tuple]{int} - scope-9
{code}
[jira] [Commented] (PIG-3581) Incorrect scope resolution with nested foreach
[ https://issues.apache.org/jira/browse/PIG-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855960#comment-13855960 ]

Aniket Mokashi commented on PIG-3581:

Understood. So, it's a problem with nested_command resolution and not scalar resolution. I will submit a patch for this.
[jira] [Commented] (PIG-3581) Incorrect scope resolution with nested foreach
[ https://issues.apache.org/jira/browse/PIG-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855961#comment-13855961 ]

Aniket Mokashi commented on PIG-3581:

bq. Is it supported to define a local scalar expression inside a nested foreach? e.g. expr = 'bar';
Yes. https://github.com/apache/pig/blob/branch-0.12/src/org/apache/pig/parser/QueryParser.g#L923

bq. If so, can you fix the regression?
Yes
[jira] [Created] (PIG-3641) Split otherwise producing incorrect output when combined with ColumnPruning
Koji Noguchi created PIG-3641:

Summary: Split otherwise producing incorrect output when combined with ColumnPruning
Key: PIG-3641
URL: https://issues.apache.org/jira/browse/PIG-3641
Project: Pig
Issue Type: Bug
Affects Versions: 0.11.1, 0.12.0, 0.10.0, 0.13.0
Reporter: Koji Noguchi
Assignee: Koji Noguchi

Our user was observing incorrect outputs depending on whether the query had intermediate output or not. Below is a simplified testcase I came up with.
{noformat}
knoguchi pig cat test.txt
9,1,ignored
9,1,ignored
9,1,ignored
knoguchi pig cat bz-6590644/test.pig
A = load 'test.txt' using PigStorage(',') as (a1:int, a2:int, a3:chararray);
B = foreach A generate a1,a2;
SPLIT B into C1 if a2 == 1, D1 otherwise;
C2 = foreach C1 generate a2;
store C2 into '/tmp/testC';
store D1 into '/tmp/testD';
knoguchi@nameother-lm pig
{noformat}
Incorrect output shown below. /tmp/testD should be empty but somehow has data in it.
{noformat}
knoguchi@nameother-lm pig cat /tmp/testC/part-m-0
1
1
1
knoguchi pig cat /tmp/testD/part-m-0
9 1
9 1
9 1
knoguchi pig
{noformat}
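For reference, the semantics the testcase relies on can be modelled in a few lines. This is an illustrative sketch only, not Pig code: the class `SplitOtherwise` is a hypothetical name, rows are modelled as `int[]` pairs of (a1, a2), and null handling is ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative model of "SPLIT B INTO C1 IF a2 == 1, D1 OTHERWISE"; not Pig code. */
final class SplitOtherwise {
    final List<int[]> c1 = new ArrayList<>();   // rows where a2 == 1
    final List<int[]> d1 = new ArrayList<>();   // every other row

    SplitOtherwise(List<int[]> rows) {
        for (int[] row : rows) {
            // row[0] is a1, row[1] is a2; in this model each row lands in exactly one branch.
            if (row[1] == 1) {
                c1.add(row);
            } else {
                d1.add(row);
            }
        }
    }

    public static void main(String[] args) {
        // Same data as test.txt after projecting a1 and a2.
        List<int[]> b = Arrays.asList(new int[] {9, 1}, new int[] {9, 1}, new int[] {9, 1});
        SplitOtherwise split = new SplitOtherwise(b);
        System.out.println("C1 rows: " + split.c1.size()); // C1 rows: 3
        System.out.println("D1 rows: " + split.d1.size()); // D1 rows: 0
    }
}
```

Under this model every input row satisfies a2 == 1, so D1 stays empty, which is why the three rows in /tmp/testD are reported as a bug.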
[jira] [Updated] (PIG-3641) Split otherwise producing incorrect output when combined with ColumnPruning
[ https://issues.apache.org/jira/browse/PIG-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Noguchi updated PIG-3641:

Attachment: pig-3641_v01.patch

This issue is similar to PIG-3051, where a projection expression was pointing to a different operator. In PIG-3051, it was about a copied LOSort. Here, it's about the LOSplitOutput for 'otherwise'. Attaching a preliminary patch. Tests still need to be added.
[jira] [Commented] (PIG-3609) ClassCastException when calling compareTo method on AvroBagWrapper
[ https://issues.apache.org/jira/browse/PIG-3609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856041#comment-13856041 ]

Richard Ding commented on PIG-3609:

[~cheolsoo], checking size is an optimization, this is also what DefaultAbstractBag implements. +1 on the patch.

ClassCastException when calling compareTo method on AvroBagWrapper
-------------------------------------------------------------------

Key: PIG-3609
URL: https://issues.apache.org/jira/browse/PIG-3609
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.12.0
Reporter: Richard Ding
Assignee: Richard Ding
Priority: Minor
Attachments: PIG-3609.patch, PIG-3609_2.patch, PIG-3609_3.patch

One got the following exception when calling compareTo method on AvroBagWrapper with an AvroBagWrapper object:
{code}
java.lang.ClassCastException: org.apache.pig.impl.util.avro.AvroBagWrapper incompatible with java.util.Collection
	at org.apache.avro.generic.GenericData.compare(GenericData.java:786)
	at org.apache.avro.generic.GenericData.compare(GenericData.java:760)
	at org.apache.pig.impl.util.avro.AvroBagWrapper.compareTo(AvroBagWrapper.java:78)
{code}
Looking at the code, it compares objects with different types:
{code}
return GenericData.get().compare(theArray, o, theArray.getSchema());
{code}
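The general shape of such a fix can be sketched without Avro: unwrap the other object before comparing, and check sizes first as a cheap shortcut. The class `BagWrapper` below is a hypothetical stand-in, not the real AvroBagWrapper and not the content of the attached patches; treating a size difference as the ordering criterion is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a bag wrapper; not the real AvroBagWrapper. */
final class BagWrapper implements Comparable<Object> {
    private final List<Integer> theArray;   // the wrapped collection

    BagWrapper(List<Integer> theArray) {
        this.theArray = theArray;
    }

    @Override
    public int compareTo(Object o) {
        if (this == o) {
            return 0;
        }
        if (!(o instanceof BagWrapper)) {
            // Assumption for this sketch: foreign types are rejected explicitly.
            // A null argument ends up here and raises NullPointerException via getClass().
            throw new ClassCastException("cannot compare BagWrapper with " + o.getClass().getName());
        }
        // Unwrap first, so the element-wise comparison sees two collections
        // instead of a collection and a wrapper.
        List<Integer> other = ((BagWrapper) o).theArray;

        // Cheap size check before walking the elements.
        if (theArray.size() != other.size()) {
            return Integer.compare(theArray.size(), other.size());
        }
        for (int i = 0; i < theArray.size(); i++) {
            int c = theArray.get(i).compareTo(other.get(i));
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        BagWrapper a = new BagWrapper(Arrays.asList(1, 2, 3));
        BagWrapper b = new BagWrapper(Arrays.asList(1, 2, 4));
        System.out.println(a.compareTo(b) < 0);   // true
        System.out.println(a.compareTo(a) == 0);  // true
    }
}
```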
[jira] [Commented] (PIG-3640) Retain intermediate files for debugging purpose in batch mode
[ https://issues.apache.org/jira/browse/PIG-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856043#comment-13856043 ]

Aniket Mokashi commented on PIG-3640:

+1

Retain intermediate files for debugging purpose in batch mode
-------------------------------------------------------------

Key: PIG-3640
URL: https://issues.apache.org/jira/browse/PIG-3640
Project: Pig
Issue Type: Bug
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
Fix For: 0.13.0
Attachments: PIG-3640-1.patch

PIG-3117 made it configurable whether to keep intermediate files between MR jobs during execution. So if we run queries in the Grunt shell, we can keep intermediate files. However, intermediate files are still deleted when Pig exits. It would be nice if we could retain intermediate files even in batch mode for debugging purposes.
[jira] [Commented] (PIG-3608) ClassCastException when looking up a value from AvroMapWrapper using a Utf8 key
[ https://issues.apache.org/jira/browse/PIG-3608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856047#comment-13856047 ]

Richard Ding commented on PIG-3608:

Thanks for reviewing the patch. Right now I don't have a Pig script to demonstrate this use case. I hit this problem while iterating over an instance of AvroMapWrapper and found that I can't look up a value from the map using a key just retrieved from the map. I think this breaks the basic contract of a map implementation.

I think the check
{code}
if (isUtf8key && !(key instanceof Utf8))
{code}
is more general. But I'm ok if it is restricted to String.

ClassCastException when looking up a value from AvroMapWrapper using a Utf8 key
--------------------------------------------------------------------------------

Key: PIG-3608
URL: https://issues.apache.org/jira/browse/PIG-3608
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.12.0
Reporter: Richard Ding
Assignee: Richard Ding
Priority: Minor
Attachments: PIG-3608.patch, PIG-3608_2.patch

One got the following exception:
{code}
java.lang.ClassCastException: org.apache.avro.util.Utf8 incompatible with java.lang.String
	at org.apache.pig.impl.util.avro.AvroMapWrapper.get(AvroMapWrapper.java:80)
{code}
This is related to the change by PIG-3420.
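The map contract mentioned above (a key obtained from the map must be usable for a lookup) can be illustrated with a small self-contained sketch. `MapWrapper` and its nested `Utf8Key` are hypothetical stand-ins for AvroMapWrapper and org.apache.avro.util.Utf8; the conversion through `toString()` is an assumption of this sketch, not necessarily what the patch does.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical map wrapper whose inner map is keyed by a UTF-8 string type. */
final class MapWrapper {

    /** Hypothetical stand-in for a UTF-8 key type (not org.apache.avro.util.Utf8). */
    static final class Utf8Key {
        private final String value;

        Utf8Key(String value) {
            this.value = value;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Utf8Key && ((Utf8Key) o).value.equals(value);
        }

        @Override
        public int hashCode() {
            return value.hashCode();
        }

        @Override
        public String toString() {
            return value;
        }
    }

    private final Map<Utf8Key, Object> innerMap = new HashMap<>();

    void put(String key, Object value) {
        innerMap.put(new Utf8Key(key), value);
    }

    Iterable<Utf8Key> keys() {
        return innerMap.keySet();
    }

    Object get(Object key) {
        if (key == null) {
            return null;
        }
        if (!(key instanceof Utf8Key)) {
            // Convert only when needed; casting blindly to String is what fails
            // when the caller already holds a key of the inner type.
            return innerMap.get(new Utf8Key(key.toString()));
        }
        return innerMap.get(key);
    }

    public static void main(String[] args) {
        MapWrapper map = new MapWrapper();
        map.put("name", "pig");
        for (Utf8Key key : map.keys()) {
            System.out.println(key + " -> " + map.get(key)); // name -> pig
        }
        System.out.println(map.get("name"));                 // pig
    }
}
```

Checking the key type before converting keeps both lookup paths working: keys taken from the map itself and plain String keys supplied by callers.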
[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (11 issues)
Subscriber: pigdaily

Key         Summary
PIG-3640    Retain intermediate files for debugging purpose in batch mode
            https://issues.apache.org/jira/browse/PIG-3640
PIG-3639    TestRegisteredJarVisibility is broken in trunk
            https://issues.apache.org/jira/browse/PIG-3639
PIG-3637    PigCombiner creating log spam
            https://issues.apache.org/jira/browse/PIG-3637
PIG-3635    Fix e2e tests for Hadoop 2.X on Windows
            https://issues.apache.org/jira/browse/PIG-3635
PIG-3632    Add option to configure cacheBlocks in HBaseStorage
            https://issues.apache.org/jira/browse/PIG-3632
PIG-3609    ClassCastException when calling compareTo method on AvroBagWrapper
            https://issues.apache.org/jira/browse/PIG-3609
PIG-3608    ClassCastException when looking up a value from AvroMapWrapper using a Utf8 key
            https://issues.apache.org/jira/browse/PIG-3608
PIG-3573    Provide StoreFunc and LoadFunc for Accumulo
            https://issues.apache.org/jira/browse/PIG-3573
PIG-3453    Implement a Storm backend to Pig
            https://issues.apache.org/jira/browse/PIG-3453
PIG-3441    Allow Pig to use default resources from Configuration objects
            https://issues.apache.org/jira/browse/PIG-3441
PIG-3347    Store invocation brings side effect
            https://issues.apache.org/jira/browse/PIG-3347

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384
Re: Review Request 16309: PIG-3629 Implement STREAM operator in Tez
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16309/
---

(Updated Dec. 23, 2013, 5:34 p.m.)

Review request for pig, Cheolsoo Park, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.

Changes
---
Updated with fix to race condition and e2e tests

Bugs: PIG-3629
    https://issues.apache.org/jira/browse/PIG-3629

Repository: pig-git

Description (updated)
---
Implement STREAM operator in Tez - https://issues.apache.org/jira/browse/PIG-3629

In this patch, I do not add resources to pig-misc.jar; I just add them individually. See my discussion post: https://groups.google.com/forum/#!topic/pig-on-tez/8S80GMKhMaU

Basic Changes:
-Run the PhyPlanSetter and EndOfAllInputSetter to set the parent plan and the end-of-all input flags necessary for STREAM, just like in MR Pig.
-Add a map to hold plan-specific extra local resources in TezOperPlan.java. These resources can either come from the user's directory (e.g. SHIP('/home/abain/foo')) or from HDFS (e.g. CACHE('/user/abain/bar') in HDFS).
-Add the new class TezPOStreamVisitor that assembles all the plan-specific local resources that get added in TezOperPlan.java.

Resource Manager Changes:
-TezResourceManager resources were previously a map of java.net.URL -> Path in HDFS. Previously, the URLs were all local files, e.g. file://home/abain/pig-withouthHadoop.jar. However, the CACHE statement requires that resources already present in HDFS be able to be added as local resources. Unfortunately java.net.URL does not support hdfs:// URLs, so I changed this data structure to be a YARN URL instead. I also added methods to the ResourceManager to distinguish whether you are adding a local resource or a resource already present in HDFS.
-CACHE also supports URLs with fragments at the end, which become a shortcut to the name, e.g. CACHE(/input/big-data-name.gz#data.gz). I changed the resource manager to look for a fragment and use that as the resource name (if the fragment exists). This results in the symlink to the resource being created with the fragment name, which is what we want.

Race condition:
-I found a race condition that resulted from reusing the Result object in POSimpleTezLoad. There are several possible solutions. After discussing in the newsgroup, we decided to change POSimpleTezLoad for now.
-I also made a small cleanup to PhysicalOperator.java.

Diffs (updated)
---
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 37566ab
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/PhysicalOperator.java 8487a0f
  src/org/apache/pig/backend/hadoop/executionengine/tez/POSimpleTezLoad.java d57aded
  src/org/apache/pig/backend/hadoop/executionengine/tez/PigProcessor.java 0ee7256
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 191563d
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java df9fea6
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezJobControlCompiler.java 135b933
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezOperPlan.java 0cc8e17
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezPOStreamVisitor.java PRE-CREATION
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezPlanContainer.java 673fd70
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezResourceManager.java 0fd7575
  src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java 862e637
  test/e2e/pig/tests/tez.conf 0e4ba4e
  test/org/apache/pig/test/data/GoldenFiles/TEZC12.gld PRE-CREATION
  test/org/apache/pig/tez/TestTezCompiler.java 8d5e5f2

Diff: https://reviews.apache.org/r/16309/diff/

Testing (updated)
---
Added a unit test to TestTezCompiler.java
Added a new e2e test to tez.conf with session reuse enabled
Ported three other e2e tests from streaming.conf to tez.conf to increase coverage
ant test-tez passes
ant test-e2e-tez passes
Manually tested with a large subset of tests from streaming.conf (the ones using features currently supported by Pig-on-Tez), they pass

Thanks,
Alex Bain
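The fragment handling described under "Resource Manager Changes" can be sketched with plain java.net.URI, which parses hdfs:// locations without needing a registered protocol handler. This is only an illustration and not the patch itself, which uses a YARN URL; the class `ResourceNames`, the method `resourceName`, and the host `namenode:8020` are hypothetical.

```java
import java.net.URI;

/** Hypothetical helper: derive a local resource name from a CACHE-style location. */
final class ResourceNames {

    /** Returns the fragment if one is present, otherwise the last path segment. */
    static String resourceName(String location) {
        URI uri = URI.create(location);
        String fragment = uri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;                 // e.g. "#data.gz" becomes "data.gz"
        }
        String path = uri.getPath();
        if (path == null) {
            return location;                 // opaque URI without a path; left unchanged in this sketch
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(resourceName("/input/big-data-name.gz#data.gz"));     // data.gz
        System.out.println(resourceName("hdfs://namenode:8020/user/abain/bar")); // bar
    }
}
```

Using the fragment as the name is what lets the symlink in the task's working directory differ from the file name in HDFS.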
[jira] [Commented] (PIG-3629) Implement STREAM in Tez
[ https://issues.apache.org/jira/browse/PIG-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856075#comment-13856075 ]

Alex Bain commented on PIG-3629:

Review at https://reviews.apache.org/r/16309/

Implement STREAM in Tez
-----------------------

Key: PIG-3629
URL: https://issues.apache.org/jira/browse/PIG-3629
Project: Pig
Issue Type: Sub-task
Components: tez
Affects Versions: tez-branch
Reporter: Alex Bain
Assignee: Alex Bain
Fix For: tez-branch

Implement the STREAM operator in Tez
[jira] [Updated] (PIG-3629) Implement STREAM in Tez
[ https://issues.apache.org/jira/browse/PIG-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Bain updated PIG-3629:

Attachment: PIG-3629-3.patch
[jira] [Updated] (PIG-3629) Implement STREAM in Tez
[ https://issues.apache.org/jira/browse/PIG-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Bain updated PIG-3629:

Status: Patch Available (was: Open)
Re: Review Request 16309: PIG-3629 Implement STREAM operator in Tez
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16309/#review30848
---

Ship it!

Looks good to me. I will commit it after running tests. I will also fix my minor comments below, as well as whitespace, when committing the patch. Thank you Alex!

src/org/apache/pig/backend/hadoop/executionengine/tez/PigProcessor.java
https://reviews.apache.org/r/16309/#comment59052
    Let's use Boolean.valueOf() instead of String.equals() here.

src/org/apache/pig/backend/hadoop/executionengine/tez/TezPOStreamVisitor.java
https://reviews.apache.org/r/16309/#comment59051
    Apache header is missing.

- Cheolsoo Park
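The difference behind the Boolean.valueOf() suggestion in the review above can be shown in isolation. The reviewed line in PigProcessor.java is not quoted in this thread, so the class `FlagParsing`, its method names, and the sample values below are assumptions made purely for illustration.

```java
/** Hypothetical comparison of two ways to read a boolean flag from a string. */
final class FlagParsing {

    /** Case-sensitive: only the exact string "true" counts. */
    static boolean viaEquals(String raw) {
        return "true".equals(raw);
    }

    /** Case-insensitive and null-safe: "true", "TRUE", "True" all count. */
    static boolean viaValueOf(String raw) {
        return Boolean.valueOf(raw);
    }

    public static void main(String[] args) {
        String[] samples = {"true", "TRUE", "yes", null};
        for (String raw : samples) {
            System.out.println(raw + ": equals=" + viaEquals(raw) + ", valueOf=" + viaValueOf(raw));
        }
        // true: equals=true, valueOf=true
        // TRUE: equals=false, valueOf=true
        // yes: equals=false, valueOf=false
        // null: equals=false, valueOf=false
    }
}
```

The two approaches only disagree on capitalisation, so the change mainly makes the flag more forgiving of how a configuration value was written.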
[jira] [Updated] (PIG-3609) ClassCastException when calling compareTo method on AvroBagWrapper
[ https://issues.apache.org/jira/browse/PIG-3609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-3609:

Resolution: Fixed
Fix Version/s: 0.13.0
Status: Resolved (was: Patch Available)

Committed to trunk. Thank you Richard!
[jira] [Updated] (PIG-3640) Retain intermediate files for debugging purpose in batch mode
[ https://issues.apache.org/jira/browse/PIG-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated PIG-3640: --- Resolution: Fixed Status: Resolved (was: Patch Available) Committed to trunk. Thank you Aniket for the review. P.S. Main.java had odd indentation. (Several hundred lines were shifted left by a tab.) I fixed it while touching the file. Retain intermediate files for debugging purpose in batch mode - Key: PIG-3640 URL: https://issues.apache.org/jira/browse/PIG-3640 Project: Pig Issue Type: Bug Reporter: Cheolsoo Park Assignee: Cheolsoo Park Fix For: 0.13.0 Attachments: PIG-3640-1.patch PIG-3117 made it configurable whether to keep intermediate files between MR jobs during execution. So if we run queries in the Grunt shell, we can keep intermediate files. However, intermediate files are still deleted when Pig exits. It would be nice if we could retain intermediate files even in batch mode for debugging purposes. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3629) Implement STREAM in Tez
[ https://issues.apache.org/jira/browse/PIG-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13856157#comment-13856157 ] Cheolsoo Park commented on PIG-3629: I noticed that you overwrote join e2e test cases. I am adding them back. Implement STREAM in Tez --- Key: PIG-3629 URL: https://issues.apache.org/jira/browse/PIG-3629 Project: Pig Issue Type: Sub-task Components: tez Affects Versions: tez-branch Reporter: Alex Bain Assignee: Alex Bain Fix For: tez-branch Attachments: PIG-3629-3.patch Implement the STREAM operator in Tez -- This message was sent by Atlassian JIRA (v6.1.5#6160)
Re: Review Request 16309: PIG-3629 Implement STREAM operator in Tez
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/16309/#review30849 --- test/e2e/pig/tests/tez.conf https://reviews.apache.org/r/16309/#comment59053 I believe you're overwriting join test cases by mistake. One more below. - Cheolsoo Park On Dec. 24, 2013, 1:34 a.m., Alex Bain wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/16309/ --- (Updated Dec. 24, 2013, 1:34 a.m.) Review request for pig, Cheolsoo Park, Daniel Dai, Mark Wagner, and Rohini Palaniswamy. Bugs: PIG-3629 https://issues.apache.org/jira/browse/PIG-3629 Repository: pig-git Description --- Implement STREAM operator in Tez - https://issues.apache.org/jira/browse/PIG-3629 In this patch, I do not add resources to pig-misc.jar; I just add them individually. See my discussion post: https://groups.google.com/forum/#!topic/pig-on-tez/8S80GMKhMaU Basic Changes: -Run the PhyPlanSetter and EndOfAllInputSetter to set the parent plan and the end-of-all input flags necessary for STREAM, just like in MR Pig. -Add a map to hold plan-specific extra local resources in TezOperPlan.java. These resources can come either from the user's directory (e.g. SHIP('/home/abain/foo')) or from HDFS (e.g. CACHE('/user/abain/bar') in HDFS). -Add the new class TezPOStreamVisitor that assembles all the plan-specific local resources that get added in TezOperPlan.java. Resource Manager Changes: -TezResourceManager resources were previously a map from java.net.URL to Path in HDFS. Previously, the URLs were all local files, e.g. file://home/abain/pig-withouthadoop.jar. However, the CACHE statement requires that resources already present in HDFS be able to be added as local resources. Unfortunately java.net.URL does not support hdfs:// URLs, so I changed this data structure to use a YARN URL instead. I also added methods to the ResourceManager to distinguish whether you are adding a local resource or a resource already present in HDFS.
-CACHE also supports URLs with fragments at the end, which become a shortcut to the name, e.g. CACHE(/input/big-data-name.gz#data.gz). I changed the resource manager to look for a fragment and use it as the resource name (if the fragment exists). This results in the symlink to the resource being created with the fragment name, which is what we want. Race condition: -I found a race condition that resulted from reusing the Result object in POSimpleTezLoad. There are several possible solutions. After discussing in the newsgroup, we decided to change POSimpleTezLoad for now. -I also made a small cleanup to PhysicalOperator.java. Diffs - src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 37566ab src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/PhysicalOperator.java 8487a0f src/org/apache/pig/backend/hadoop/executionengine/tez/POSimpleTezLoad.java d57aded src/org/apache/pig/backend/hadoop/executionengine/tez/PigProcessor.java 0ee7256 src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 191563d src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java df9fea6 src/org/apache/pig/backend/hadoop/executionengine/tez/TezJobControlCompiler.java 135b933 src/org/apache/pig/backend/hadoop/executionengine/tez/TezOperPlan.java 0cc8e17 src/org/apache/pig/backend/hadoop/executionengine/tez/TezPOStreamVisitor.java PRE-CREATION src/org/apache/pig/backend/hadoop/executionengine/tez/TezPlanContainer.java 673fd70 src/org/apache/pig/backend/hadoop/executionengine/tez/TezResourceManager.java 0fd7575 src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java 862e637 test/e2e/pig/tests/tez.conf 0e4ba4e test/org/apache/pig/test/data/GoldenFiles/TEZC12.gld PRE-CREATION test/org/apache/pig/tez/TestTezCompiler.java 8d5e5f2 Diff: https://reviews.apache.org/r/16309/diff/ Testing --- Added a unit test to TestTezCompiler.java Added a new unit test e2e test to tez.conf with session reuse
enabled Ported three other e2e tests from streaming.conf to tez.conf to increase coverage ant test-tez passes ant test-e2e-tez passes Manually tested with a large subset of tests from streaming.conf (the ones using features currently supported by Pig-on-Tez); they pass. Thanks, Alex Bain
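The fragment handling described in the review request above (a CACHE location such as /input/big-data-name.gz#data.gz yielding the resource name data.gz) might look roughly like the following. This is a minimal sketch using only java.net.URI from the JDK. The class ResourceNameSketch and the method resourceName are hypothetical names and do not reflect the actual TezResourceManager code. Falling back to the last path segment when there is no fragment is an assumption.

```java
import java.net.URI;

// Hypothetical helper, not part of Pig: derives the name under which a
// cached resource would be linked, preferring the URI fragment if present.
class ResourceNameSketch {
    static String resourceName(String location) {
        URI uri = URI.create(location);

        // "/input/big-data-name.gz#data.gz" has the fragment "data.gz".
        String fragment = uri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }

        // Assumed fallback: use the last path segment as the name.
        String path = uri.getPath();
        if (path == null) {
            return location; // opaque URI with no path component
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }
}
```

For the example in the description, resourceName("/input/big-data-name.gz#data.gz") yields data.gz, while the same location without a fragment yields big-data-name.gz.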
[jira] [Updated] (PIG-3629) Implement STREAM in Tez
[ https://issues.apache.org/jira/browse/PIG-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated PIG-3629: --- Attachment: PIG-3629-4.patch Attaching the final patch committed to tez branch. Implement STREAM in Tez --- Key: PIG-3629 URL: https://issues.apache.org/jira/browse/PIG-3629 Project: Pig Issue Type: Sub-task Components: tez Affects Versions: tez-branch Reporter: Alex Bain Assignee: Alex Bain Fix For: tez-branch Attachments: PIG-3629-3.patch, PIG-3629-4.patch Implement the STREAM operator in Tez -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (PIG-3629) Implement STREAM in Tez
[ https://issues.apache.org/jira/browse/PIG-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated PIG-3629: --- Resolution: Fixed Status: Resolved (was: Patch Available) Committed to tez branch. Thank you Alex! Implement STREAM in Tez --- Key: PIG-3629 URL: https://issues.apache.org/jira/browse/PIG-3629 Project: Pig Issue Type: Sub-task Components: tez Affects Versions: tez-branch Reporter: Alex Bain Assignee: Alex Bain Fix For: tez-branch Attachments: PIG-3629-3.patch, PIG-3629-4.patch Implement the STREAM operator in Tez -- This message was sent by Atlassian JIRA (v6.1.5#6160)