[jira] [Commented] (PIG-3634) Improve performance of order-by
[ https://issues.apache.org/jira/browse/PIG-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858526#comment-13858526 ] Daniel Dai commented on PIG-3634: - Thanks for clarification. PIG-3634-2.patch should work with top of tez branch now. > Improve performance of order-by > --- > > Key: PIG-3634 > URL: https://issues.apache.org/jira/browse/PIG-3634 > Project: Pig > Issue Type: Sub-task > Components: tez >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: tez-branch > > Attachments: PIG-3634-0.patch, PIG-3634-1.patch, PIG-3634-2.patch > > > This is a followup for PIG-3534. In PIG-3534, we use 5 vertexes (3 DAGs) to > implement an order-by. We can optimize to use 4 vertexes in 1 DAG: > vertex 1: close the current vertex, create input + samples input > vertex 2: aggregate samples to create quantiles > vertex 3: use quantiles to partition input > vertex 4: sort input after partition > The DAG is: > {code} > vertex 1 --> vertex 3 --> vertex 4 >\--> vertex 2 ---/ > {code} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
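The four-vertex plan quoted above (sample, build quantiles, partition, sort) can be illustrated with a small single-process sketch. This is not the Tez code from the patch: the class `OrderBySketch`, its method names, and the use of plain `int` keys are assumptions made for illustration, and it assumes at least one sample is available.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of sample-based range partitioning; not Pig or Tez code.
public class OrderBySketch {

    // Role of "vertex 2": derive (numPartitions - 1) boundary keys from the samples.
    static int[] quantiles(int[] samples, int numPartitions) {
        int[] sorted = samples.clone();
        Arrays.sort(sorted);
        int[] bounds = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            bounds[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return bounds;
    }

    // Role of "vertex 3": route a key to the first partition whose boundary exceeds it.
    static int partition(int key, int[] bounds) {
        int p = 0;
        while (p < bounds.length && key >= bounds[p]) {
            p++;
        }
        return p;
    }

    // Role of "vertex 4": sort each partition; concatenating them gives a total order.
    static List<Integer> orderBy(int[] input, int[] samples, int numPartitions) {
        int[] bounds = quantiles(samples, numPartitions);
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (int key : input) {
            parts.get(partition(key, bounds)).add(key);
        }
        List<Integer> out = new ArrayList<>();
        for (List<Integer> part : parts) {
            Collections.sort(part);
            out.addAll(part);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] input = {9, 3, 7, 1, 8, 2, 6, 4, 5, 0};
        int[] samples = {9, 1, 6, 4}; // what "vertex 1" would emit alongside the input
        System.out.println(orderBy(input, samples, 2)); // prints [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```

Because every key in partition i is smaller than every key in partition i+1, the per-partition sorts need no final merge, which is why the partitioner has to wait for the quantiles (the edge from vertex 2 into vertex 3 in the DAG).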
[jira] [Commented] (PIG-3634) Improve performance of order-by
[ https://issues.apache.org/jira/browse/PIG-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858518#comment-13858518 ] Cheolsoo Park commented on PIG-3634: To be clear, I see no e2e failures in the current tez branch either. > Improve performance of order-by > --- > > Key: PIG-3634 > URL: https://issues.apache.org/jira/browse/PIG-3634 > Project: Pig > Issue Type: Sub-task > Components: tez >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: tez-branch > > Attachments: PIG-3634-0.patch, PIG-3634-1.patch, PIG-3634-2.patch > > > This is a followup for PIG-3534. In PIG-3534, we use 5 vertexes (3 DAGs) to > implement an order-by. We can optimize to use 4 vertexes in 1 DAG: > vertex 1: close the current vertex, create input + samples input > vertex 2: aggregate samples to create quantiles > vertex 3: use quantiles to partition input > vertex 4: sort input after partition > The DAG is: > {code} > vertex 1 --> vertex 3 --> vertex 4 >\--> vertex 2 ---/ > {code} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3634) Improve performance of order-by
[ https://issues.apache.org/jira/browse/PIG-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858515#comment-13858515 ] Cheolsoo Park commented on PIG-3634: [~rohini], with the latest patch (PIG-3634-2.patch) that Daniel uploaded everything works for me. Do you still see any error? > Improve performance of order-by > --- > > Key: PIG-3634 > URL: https://issues.apache.org/jira/browse/PIG-3634 > Project: Pig > Issue Type: Sub-task > Components: tez >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: tez-branch > > Attachments: PIG-3634-0.patch, PIG-3634-1.patch, PIG-3634-2.patch > > > This is a followup for PIG-3534. In PIG-3534, we use 5 vertexes (3 DAGs) to > implement an order-by. We can optimize to use 4 vertexes in 1 DAG: > vertex 1: close the current vertex, create input + samples input > vertex 2: aggregate samples to create quantiles > vertex 3: use quantiles to partition input > vertex 4: sort input after partition > The DAG is: > {code} > vertex 1 --> vertex 3 --> vertex 4 >\--> vertex 2 ---/ > {code} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3634) Improve performance of order-by
[ https://issues.apache.org/jira/browse/PIG-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858511#comment-13858511 ] Rohini Palaniswamy commented on PIG-3634: - [~cheolsoo], java.lang.ClassCastException: org.apache.pig.impl.io.NullableBytesWritable cannot be cast to org.apache.pig.impl.io.NullableText happens after PIG-3636 as Daniel mentioned. Without that checkin e2e tests pass fine for me as well. Initially seeing your comment, I thought that the query failed because Daniel said that just load followed by order by will not work in this patch. > Improve performance of order-by > --- > > Key: PIG-3634 > URL: https://issues.apache.org/jira/browse/PIG-3634 > Project: Pig > Issue Type: Sub-task > Components: tez >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: tez-branch > > Attachments: PIG-3634-0.patch, PIG-3634-1.patch, PIG-3634-2.patch > > > This is a followup for PIG-3534. In PIG-3534, we use 5 vertexes (3 DAGs) to > implement an order-by. We can optimize to use 4 vertexes in 1 DAG: > vertex 1: close the current vertex, create input + samples input > vertex 2: aggregate samples to create quantiles > vertex 3: use quantiles to partition input > vertex 4: sort input after partition > The DAG is: > {code} > vertex 1 --> vertex 3 --> vertex 4 >\--> vertex 2 ---/ > {code} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3634) Improve performance of order-by
[ https://issues.apache.org/jira/browse/PIG-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858508#comment-13858508 ] Daniel Dai commented on PIG-3634: - Here is RB link: https://reviews.apache.org/r/16510/ > Improve performance of order-by > --- > > Key: PIG-3634 > URL: https://issues.apache.org/jira/browse/PIG-3634 > Project: Pig > Issue Type: Sub-task > Components: tez >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: tez-branch > > Attachments: PIG-3634-0.patch, PIG-3634-1.patch, PIG-3634-2.patch > > > This is a followup for PIG-3534. In PIG-3534, we use 5 vertexes (3 DAGs) to > implement an order-by. We can optimize to use 4 vertexes in 1 DAG: > vertex 1: close the current vertex, create input + samples input > vertex 2: aggregate samples to create quantiles > vertex 3: use quantiles to partition input > vertex 4: sort input after partition > The DAG is: > {code} > vertex 1 --> vertex 3 --> vertex 4 >\--> vertex 2 ---/ > {code} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (6 issues)
Subscriber: pigdaily

Key       Summary
PIG-3642  Direct HDFS access for small jobs (fetch)
          https://issues.apache.org/jira/browse/PIG-3642
PIG-3635  Fix e2e tests for Hadoop 2.X on Windows
          https://issues.apache.org/jira/browse/PIG-3635
PIG-3573  Provide StoreFunc and LoadFunc for Accumulo
          https://issues.apache.org/jira/browse/PIG-3573
PIG-3453  Implement a Storm backend to Pig
          https://issues.apache.org/jira/browse/PIG-3453
PIG-3441  Allow Pig to use default resources from Configuration objects
          https://issues.apache.org/jira/browse/PIG-3441
PIG-3347  Store invocation brings side effect
          https://issues.apache.org/jira/browse/PIG-3347

You may edit this subscription at: https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384
[jira] [Updated] (PIG-3642) Direct HDFS access for small jobs (fetch)
[ https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated PIG-3642: --- Status: Patch Available (was: Open) > Direct HDFS access for small jobs (fetch) > -- > > Key: PIG-3642 > URL: https://issues.apache.org/jira/browse/PIG-3642 > Project: Pig > Issue Type: Improvement >Reporter: Lorand Bendig >Assignee: Lorand Bendig > Fix For: 0.13.0 > > Attachments: PIG-3642.patch > > > With this patch I'd like to add the possibility to directly read data from > HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive > already has this feature (fetch). This patch shares some similarities with > the local mode of Pig 0.6. Here, fetching kicks off when the following holds > for a script: > * it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, > (nested) FOREACH with expression operators, custom UDFs..etc > * no scalar aliases > * no SampleLoader > * single leaf job > * DUMP (no STORE) > The feature is enabled by default and can be toggled with: > * -N or -no_fetch > * set opt.fetch true/false; > There's no STORE support because I wanted to make it explicit that this > "optimization" is for launching small/simple scripts during development, > rather than querying and filtering large number of rows on the client > machine. However, a threshold could be given on the input size (an > estimation) to determine whether to prefer fetch over MR jobs, similar to > what Hive's '{{hive.fetch.task.conversion.threshold}}' does. (through Pig's > LoadMetadata#getStatistic ?) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
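The eligibility rules listed in the PIG-3642 description can be read as a whitelist check over the operators of a script. The sketch below is only an illustration of that idea, not the `FetchOptimizer` from the patch: the `Op` enum, `canFetch`, and the flat list standing in for a plan are all hypothetical, `LOAD` is assumed to be allowed, and the remaining conditions (no scalar aliases, no SampleLoader, single leaf job) are left out.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of a "can this script be fetched directly?" check.
public class FetchEligibilitySketch {

    // Simplified operator kinds; a real plan is a graph, not a flat list.
    enum Op { LOAD, LIMIT, FILTER, UNION, STREAM, FOREACH, DUMP, STORE, GROUP, JOIN, ORDER }

    // Operators the description lists as fetchable (LOAD added as an assumption).
    static final Set<Op> ALLOWED = EnumSet.of(
            Op.LOAD, Op.LIMIT, Op.FILTER, Op.UNION, Op.STREAM, Op.FOREACH, Op.DUMP);

    // Fetch applies only if the feature is on, the script ends in DUMP,
    // and every operator is on the whitelist.
    static boolean canFetch(List<Op> plan, boolean fetchEnabled) {
        return fetchEnabled
                && !plan.isEmpty()
                && plan.get(plan.size() - 1) == Op.DUMP
                && ALLOWED.containsAll(plan);
    }

    public static void main(String[] args) {
        System.out.println(canFetch(Arrays.asList(Op.LOAD, Op.FILTER, Op.LIMIT, Op.DUMP), true)); // true
        System.out.println(canFetch(Arrays.asList(Op.LOAD, Op.GROUP, Op.DUMP), true));            // false
        System.out.println(canFetch(Arrays.asList(Op.LOAD, Op.FILTER, Op.STORE), true));          // false
    }
}
```

The `fetchEnabled` flag stands in for the `opt.fetch` property and the `-N` / `-no_fetch` switch mentioned in the description.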
[jira] [Commented] (PIG-3645) Move FileLocalizer.setR() calls to unit tests
[ https://issues.apache.org/jira/browse/PIG-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858477#comment-13858477 ] Cheolsoo Park commented on PIG-3645: I also merged trunk into tez branch, so the UUID stuff in tez branch is all overwritten now. > Move FileLocalizer.setR() calls to unit tests > - > > Key: PIG-3645 > URL: https://issues.apache.org/jira/browse/PIG-3645 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Cheolsoo Park >Assignee: Cheolsoo Park >Priority: Minor > Fix For: 0.13.0 > > Attachments: PIG-3645-1.patch, PIG-3645-2.patch, PIG-3645-3.patch, > PIG-3645-4.patch, TEST-org.apache.pig.test.TestMRCompiler.txt > > > Currently, temporary paths are generated by FileLocalizer using > Random.nextInt(). To provide strong randomness, MapReduceLauncher resets the > Random object every time when compiling physical plan to MR plan: > {code} > MRCompiler comp = new MRCompiler(php, pc); > comp.randomizeFileLocalizer(); // This in turn calls FileLocalizer.setR(new > Random()). > {code} > Besides, there are a couple of places calling FileLocalizer.setR() (e.g. > MRCompiler) with some random seed. > I think- > # Randomizing Random seed is unnecessary if we switch to UUID. > # Setting Random objects in code like this is error-prone because it can be > easily broken by having or missing a FileLocalizer.setR() somewhere else. See > an example [here|http://search-hadoop.com/m/2nxTzQXfHw1]. > So I propose that we remove all this "randomizing Random seed" code and use > UUID instead in temporary paths. > For unit tests that compare the results against gold files, we should still > allow to set Random seed through FileLocalizer.setR(). But this method will > be annotated as "VisibleForTesting" to ensure it is not used nowhere else > other than in unit tests. 
> Regarding the existing gold files, they can be easily regenerated by > TestMRCompiler as follows- > {code} > FileOutputStream fos = new FileOutputStream(expectedFile + "_new"); > PrintWriter pw = new PrintWriter(fos); > pw.write(compiledPlan); > {code} > I assume there won't be any kind of regressions due to this change. But > please let me know if I am wrong. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
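The proposal in PIG-3645 (UUID-based temporary paths in production, a seedable Random only for gold-file tests) might look roughly like the sketch below. It is an assumption-laden illustration rather than the committed change: `TempPathSketch` and the path layout are invented here, and the "VisibleForTesting" annotation mentioned in the issue is represented only by a comment so the example stays dependency-free.

```java
import java.util.Random;
import java.util.UUID;

// Hypothetical stand-in for the temporary-path logic discussed in the issue.
public class TempPathSketch {

    // Non-null only when a unit test has injected a seeded Random.
    private static Random testRandom = null;

    // Intended for unit tests only (the issue proposes marking it "VisibleForTesting").
    static void setR(Random r) {
        testRandom = r;
    }

    // Production callers get a UUID suffix; tests that called setR() get a
    // reproducible suffix so results can be compared against gold files.
    static String getTemporaryPath(String base) {
        String suffix = (testRandom != null)
                ? Integer.toString(testRandom.nextInt())
                : UUID.randomUUID().toString();
        return base + "/tmp" + suffix;
    }

    public static void main(String[] args) {
        System.out.println(getTemporaryPath("/scratch")); // UUID-based, differs per call
        setR(new Random(42L));
        System.out.println(getTemporaryPath("/scratch")); // deterministic for a fixed seed
        setR(null);
    }
}
```

With this shape, no production code path needs to reseed anything, which removes the class of bugs where a stray or missing `setR()` call changes the generated paths.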
[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)
[ https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858473#comment-13858473 ] Lorand Bendig commented on PIG-3642: Please find attached the review request at : https://reviews.apache.org/r/16507/ > Direct HDFS access for small jobs (fetch) > -- > > Key: PIG-3642 > URL: https://issues.apache.org/jira/browse/PIG-3642 > Project: Pig > Issue Type: Improvement >Reporter: Lorand Bendig >Assignee: Lorand Bendig > Fix For: 0.13.0 > > Attachments: PIG-3642.patch > > > With this patch I'd like to add the possibility to directly read data from > HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive > already has this feature (fetch). This patch shares some similarities with > the local mode of Pig 0.6. Here, fetching kicks off when the following holds > for a script: > * it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, > (nested) FOREACH with expression operators, custom UDFs..etc > * no scalar aliases > * no SampleLoader > * single leaf job > * DUMP (no STORE) > The feature is enabled by default and can be toggled with: > * -N or -no_fetch > * set opt.fetch true/false; > There's no STORE support because I wanted to make it explicit that this > "optimization" is for launching small/simple scripts during development, > rather than querying and filtering large number of rows on the client > machine. However, a threshold could be given on the input size (an > estimation) to determine whether to prefer fetch over MR jobs, similar to > what Hive's '{{hive.fetch.task.conversion.threshold}}' does. (through Pig's > LoadMetadata#getStatistic ?) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)
[ https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858471#comment-13858471 ] Lorand Bendig commented on PIG-3642: [~cheolsoo] Yes, I'd like to have it reviewed > Direct HDFS access for small jobs (fetch) > -- > > Key: PIG-3642 > URL: https://issues.apache.org/jira/browse/PIG-3642 > Project: Pig > Issue Type: Improvement >Reporter: Lorand Bendig >Assignee: Lorand Bendig > Fix For: 0.13.0 > > Attachments: PIG-3642.patch > > > With this patch I'd like to add the possibility to directly read data from > HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive > already has this feature (fetch). This patch shares some similarities with > the local mode of Pig 0.6. Here, fetching kicks off when the following holds > for a script: > * it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, > (nested) FOREACH with expression operators, custom UDFs..etc > * no scalar aliases > * no SampleLoader > * single leaf job > * DUMP (no STORE) > The feature is enabled by default and can be toggled with: > * -N or -no_fetch > * set opt.fetch true/false; > There's no STORE support because I wanted to make it explicit that this > "optimization" is for launching small/simple scripts during development, > rather than querying and filtering large number of rows on the client > machine. However, a threshold could be given on the input size (an > estimation) to determine whether to prefer fetch over MR jobs, similar to > what Hive's '{{hive.fetch.task.conversion.threshold}}' does. (through Pig's > LoadMetadata#getStatistic ?) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/16507/ --- Review request for pig. Bugs: PIG-3642 https://issues.apache.org/jira/browse/PIG-3642 Repository: pig Description --- With this patch I'd like to add the possibility to directly read data from HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive already has this feature (fetch). This patch shares some similarities with the local mode of Pig 0.6. Here, fetching kicks off when the following holds for a script: it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, (nested) FOREACH with expression operators, custom UDFs..etc no scalar aliases no SampleLoader single leaf job DUMP (no STORE) The feature is enabled by default and can be toggled with: -N or -no_fetch set opt.fetch true/false; There's no STORE support because I wanted to make it explicit that this "optimization" is for launching small/simple scripts during development, rather than querying and filtering large number of rows on the client machine. However, a threshold could be given on the input size (an estimation) to determine whether to prefer fetch over MR jobs, similar to what Hive's 'hive.fetch.task.conversion.threshold' does. (through Pig's LoadMetadata#getStatistic ?) 
Diffs - /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 1553596 /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java 1553596 /trunk/src/org/apache/pig/Main.java 1553596 /trunk/src/org/apache/pig/PigServer.java 1553596 /trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1553596 /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java PRE-CREATION /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java PRE-CREATION /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java PRE-CREATION /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java PRE-CREATION /trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1553596 /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1553596 /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java 1553596 /trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1553596 /trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1553596 /trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1553596 /trunk/src/org/apache/pig/tools/pigstats/SimpleFetchPigStats.java PRE-CREATION /trunk/test/org/apache/pig/test/TestAssert.java 1553596 /trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1553596 /trunk/test/org/apache/pig/test/TestFetch.java PRE-CREATION /trunk/test/org/apache/pig/test/TestPigRunner.java 1553596 Diff: https://reviews.apache.org/r/16507/diff/ Testing --- - new testcase added: TestFetch - the patch was checked against test-commit and test-core - Because opt.fetch is set by default, the testcases were using fetch instead of MR jobs wherever it was possible Thanks, Lorand Bendig
[jira] [Updated] (PIG-3645) Move FileLocalizer.setR() calls to unit tests
[ https://issues.apache.org/jira/browse/PIG-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated PIG-3645: --- Resolution: Fixed Status: Resolved (was: Patch Available) Committed to trunk. I removed FileLocalizer.getR(). Thank you Rohini for all the help with this jira! > Move FileLocalizer.setR() calls to unit tests > - > > Key: PIG-3645 > URL: https://issues.apache.org/jira/browse/PIG-3645 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Cheolsoo Park >Assignee: Cheolsoo Park >Priority: Minor > Fix For: 0.13.0 > > Attachments: PIG-3645-1.patch, PIG-3645-2.patch, PIG-3645-3.patch, > PIG-3645-4.patch, TEST-org.apache.pig.test.TestMRCompiler.txt > > > Currently, temporary paths are generated by FileLocalizer using > Random.nextInt(). To provide strong randomness, MapReduceLauncher resets the > Random object every time when compiling physical plan to MR plan: > {code} > MRCompiler comp = new MRCompiler(php, pc); > comp.randomizeFileLocalizer(); // This in turn calls FileLocalizer.setR(new > Random()). > {code} > Besides, there are a couple of places calling FileLocalizer.setR() (e.g. > MRCompiler) with some random seed. > I think- > # Randomizing Random seed is unnecessary if we switch to UUID. > # Setting Random objects in code like this is error-prone because it can be > easily broken by having or missing a FileLocalizer.setR() somewhere else. See > an example [here|http://search-hadoop.com/m/2nxTzQXfHw1]. > So I propose that we remove all this "randomizing Random seed" code and use > UUID instead in temporary paths. > For unit tests that compare the results against gold files, we should still > allow to set Random seed through FileLocalizer.setR(). But this method will > be annotated as "VisibleForTesting" to ensure it is not used nowhere else > other than in unit tests. 
> Regarding the existing gold files, they can be easily regenerated by > TestMRCompiler as follows- > {code} > FileOutputStream fos = new FileOutputStream(expectedFile + "_new"); > PrintWriter pw = new PrintWriter(fos); > pw.write(compiledPlan); > {code} > I assume there won't be any kind of regressions due to this change. But > please let me know if I am wrong. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
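When regenerating gold files with a snippet like the one quoted above, note that it never closes the `PrintWriter`, so buffered output may not reach disk. A minimal sketch using try-with-resources follows; `GoldFileSketch` and `writeGoldFile` are hypothetical names, and only the `expectedFile + "_new"` naming and the `compiledPlan` string are taken from the issue text.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for regenerating a gold file next to the existing one.
public class GoldFileSketch {

    // Writes compiledPlan to "<expectedFile>_new" and returns the new file's path.
    static Path writeGoldFile(Path expectedFile, String compiledPlan) throws IOException {
        Path target = expectedFile.resolveSibling(expectedFile.getFileName() + "_new");
        // try-with-resources closes (and therefore flushes) the writer even on failure.
        try (PrintWriter pw = new PrintWriter(
                Files.newBufferedWriter(target, StandardCharsets.UTF_8))) {
            pw.write(compiledPlan);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("gold");
        Path written = writeGoldFile(dir.resolve("MRC1.gld"), "plan text");
        System.out.println("wrote " + written);
    }
}
```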
[jira] [Commented] (PIG-3616) TestBuiltIn.testURIwithCurlyBrace() silently fails
[ https://issues.apache.org/jira/browse/PIG-3616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858467#comment-13858467 ] Lorand Bendig commented on PIG-3616: Sure, I had no objections. Thank you for the update! > TestBuiltIn.testURIwithCurlyBrace() silently fails > -- > > Key: PIG-3616 > URL: https://issues.apache.org/jira/browse/PIG-3616 > Project: Pig > Issue Type: Bug >Affects Versions: 0.12.0 >Reporter: Lorand Bendig >Assignee: Lorand Bendig >Priority: Minor > Labels: test > Fix For: 0.13.0 > > Attachments: PIG-3616-2.patch, PIG-3616.patch > > > This test runs against MiniCluster but takes the input from the local path. > The empty catch block swallows the exception ("input path does not exist") > thus making a false negative result. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
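The failure mode described in PIG-3616 (an empty catch block hiding "input path does not exist", so the test reports success) can be reproduced in miniature. The sketch below does not use the actual `TestBuiltIn` code: `runQuery` is a hypothetical stand-in for a job that fails on missing input, and the two "tests" are plain methods rather than JUnit cases.

```java
import java.io.FileNotFoundException;

// Hypothetical illustration of a test that passes only because it swallows an exception.
public class SwallowedExceptionSketch {

    // Stand-in for a query whose input was looked up on the wrong file system.
    static void runQuery(String path) throws FileNotFoundException {
        throw new FileNotFoundException("input path does not exist: " + path);
    }

    // Anti-pattern: the failure is caught and ignored, so the caller sees success.
    static boolean swallowingTest() {
        try {
            runQuery("local/input");
            // Assertions on the result would go here, but are never reached.
        } catch (FileNotFoundException e) {
            // Empty catch block: the real problem disappears.
        }
        return true;
    }

    // Fix: declare the exception and let the test framework report it.
    static boolean propagatingTest() throws FileNotFoundException {
        runQuery("local/input");
        return true;
    }

    public static void main(String[] args) {
        System.out.println("swallowing test passed: " + swallowingTest());
        try {
            propagatingTest();
        } catch (FileNotFoundException e) {
            System.out.println("propagating test failed: " + e.getMessage());
        }
    }
}
```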
[jira] [Updated] (PIG-3643) Nested Foreach with UDF and bincond is broken
[ https://issues.apache.org/jira/browse/PIG-3643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated PIG-3643: --- Resolution: Fixed Fix Version/s: 0.13.0 Status: Resolved (was: Patch Available) Committed to trunk. Thank you Rohini for the review! > Nested Foreach with UDF and bincond is broken > - > > Key: PIG-3643 > URL: https://issues.apache.org/jira/browse/PIG-3643 > Project: Pig > Issue Type: Bug >Affects Versions: 0.13.0 >Reporter: Rohini Palaniswamy >Assignee: Cheolsoo Park > Fix For: 0.13.0 > > Attachments: PIG-3643-1.patch > > > Was checking out PIG-3000. > A = load 'data' as (a:chararray); > B = foreach A { c = UPPER(a); generate ((c eq 'TEST') ? 1 : 0), ((c eq 'DEV') > ? 1 : 0); } > This now throws "Invalid field projection. Projected field [c] does not exist > in schema". Works fine in 0.11. Broken in trunk. Haven't checked 0.12. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (PIG-3616) TestBuiltIn.testURIwithCurlyBrace() silently fails
[ https://issues.apache.org/jira/browse/PIG-3616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated PIG-3616: --- Resolution: Fixed Status: Resolved (was: Patch Available) Committed to trunk. Thank you Lorand and Rohini! > TestBuiltIn.testURIwithCurlyBrace() silently fails > -- > > Key: PIG-3616 > URL: https://issues.apache.org/jira/browse/PIG-3616 > Project: Pig > Issue Type: Bug >Affects Versions: 0.12.0 >Reporter: Lorand Bendig >Assignee: Lorand Bendig >Priority: Minor > Labels: test > Fix For: 0.13.0 > > Attachments: PIG-3616-2.patch, PIG-3616.patch > > > This test runs against MiniCluster but takes the input from the local path. > The empty catch block swallows the exception ("input path does not exist") > thus making a false negative result. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3643) Nested Foreach with UDF and bincond is broken
[ https://issues.apache.org/jira/browse/PIG-3643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858418#comment-13858418 ] Rohini Palaniswamy commented on PIG-3643: - +1. This patch just adds back code that was removed by PIG-3581. > Nested Foreach with UDF and bincond is broken > - > > Key: PIG-3643 > URL: https://issues.apache.org/jira/browse/PIG-3643 > Project: Pig > Issue Type: Bug >Affects Versions: 0.13.0 >Reporter: Rohini Palaniswamy >Assignee: Cheolsoo Park > Attachments: PIG-3643-1.patch > > > Was checking out PIG-3000. > A = load 'data' as (a:chararray); > B = foreach A { c = UPPER(a); generate ((c eq 'TEST') ? 1 : 0), ((c eq 'DEV') > ? 1 : 0); } > This now throws "Invalid field projection. Projected field [c] does not exist > in schema". Works fine in 0.11. Broken in trunk. Haven't checked 0.12. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3645) Move FileLocalizer.setR() calls to unit tests
[ https://issues.apache.org/jira/browse/PIG-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858410#comment-13858410 ] Rohini Palaniswamy commented on PIG-3645: - +1. Can you remove the FileLocalizer.getR() method while committing? You had done that in tez branch. > Move FileLocalizer.setR() calls to unit tests > - > > Key: PIG-3645 > URL: https://issues.apache.org/jira/browse/PIG-3645 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Cheolsoo Park >Assignee: Cheolsoo Park >Priority: Minor > Fix For: 0.13.0 > > Attachments: PIG-3645-1.patch, PIG-3645-2.patch, PIG-3645-3.patch, > PIG-3645-4.patch, TEST-org.apache.pig.test.TestMRCompiler.txt > > > Currently, temporary paths are generated by FileLocalizer using > Random.nextInt(). To provide strong randomness, MapReduceLauncher resets the > Random object every time when compiling physical plan to MR plan: > {code} > MRCompiler comp = new MRCompiler(php, pc); > comp.randomizeFileLocalizer(); // This in turn calls FileLocalizer.setR(new > Random()). > {code} > Besides, there are a couple of places calling FileLocalizer.setR() (e.g. > MRCompiler) with some random seed. > I think- > # Randomizing Random seed is unnecessary if we switch to UUID. > # Setting Random objects in code like this is error-prone because it can be > easily broken by having or missing a FileLocalizer.setR() somewhere else. See > an example [here|http://search-hadoop.com/m/2nxTzQXfHw1]. > So I propose that we remove all this "randomizing Random seed" code and use > UUID instead in temporary paths. > For unit tests that compare the results against gold files, we should still > allow to set Random seed through FileLocalizer.setR(). But this method will > be annotated as "VisibleForTesting" to ensure it is not used nowhere else > other than in unit tests. 
> Regarding the existing gold files, they can be easily regenerated by > TestMRCompiler as follows- > {code} > FileOutputStream fos = new FileOutputStream(expectedFile + "_new"); > PrintWriter pw = new PrintWriter(fos); > pw.write(compiledPlan); > {code} > I assume there won't be any kind of regressions due to this change. But > please let me know if I am wrong. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3616) TestBuiltIn.testURIwithCurlyBrace() silently fails
[ https://issues.apache.org/jira/browse/PIG-3616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858409#comment-13858409 ] Rohini Palaniswamy commented on PIG-3616: - +1 > TestBuiltIn.testURIwithCurlyBrace() silently fails > -- > > Key: PIG-3616 > URL: https://issues.apache.org/jira/browse/PIG-3616 > Project: Pig > Issue Type: Bug >Affects Versions: 0.12.0 >Reporter: Lorand Bendig >Assignee: Lorand Bendig >Priority: Minor > Labels: test > Fix For: 0.13.0 > > Attachments: PIG-3616-2.patch, PIG-3616.patch > > > This test runs against MiniCluster but takes the input from the local path. > The empty catch block swallows the exception ("input path does not exist") > thus making a false negative result. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3634) Improve performance of order-by
[ https://issues.apache.org/jira/browse/PIG-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858407#comment-13858407 ] Rohini Palaniswamy commented on PIG-3634: - [~daijy], Is there a reviewboard link for this patch? > Improve performance of order-by > --- > > Key: PIG-3634 > URL: https://issues.apache.org/jira/browse/PIG-3634 > Project: Pig > Issue Type: Sub-task > Components: tez >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: tez-branch > > Attachments: PIG-3634-0.patch, PIG-3634-1.patch, PIG-3634-2.patch > > > This is a followup for PIG-3534. In PIG-3534, we use 5 vertexes (3 DAGs) to > implement an order-by. We can optimize to use 4 vertexes in 1 DAG: > vertex 1: close the current vertex, create input + samples input > vertex 2: aggregate samples to create quantiles > vertex 3: use quantiles to partition input > vertex 4: sort input after partition > The DAG is: > {code} > vertex 1 --> vertex 3 --> vertex 4 >\--> vertex 2 ---/ > {code} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3608) ClassCastException when looking up a value from AvroMapWrapper using a Utf8 key
[ https://issues.apache.org/jira/browse/PIG-3608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858404#comment-13858404 ] Richard Ding commented on PIG-3608: --- Thanks [~cheolsoo]. > ClassCastException when looking up a value from AvroMapWrapper using a Utf8 > key > --- > > Key: PIG-3608 > URL: https://issues.apache.org/jira/browse/PIG-3608 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.12.0 >Reporter: Richard Ding >Assignee: Richard Ding >Priority: Minor > Fix For: 0.13.0 > > Attachments: PIG-3608.patch, PIG-3608_2.patch > > > One got the following exception: > {code} > java.lang.ClassCastException: org.apache.avro.util.Utf8 incompatible with > java.lang.String > at > org.apache.pig.impl.util.avro.AvroMapWrapper.get(AvroMapWrapper.java:80) > {code} > This is related to the change by PIG-3420. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
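The stack trace in PIG-3608 shows a key of one character-sequence type being treated as a `java.lang.String`. The sketch below shows that general failure mode and one common way around it, normalizing keys with `toString()` instead of casting. It is not the attached patch and does not depend on Avro: `StringBuilder` is used as a stand-in for a non-String `CharSequence` such as `Utf8`, and both lookup methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a cast-based lookup versus a toString()-based lookup.
public class CharSequenceKeySketch {

    // Fails with ClassCastException when the key is not actually a String.
    static Object getByCast(Map<String, Object> map, Object key) {
        return map.get((String) key);
    }

    // Accepts any key type by comparing on its string form.
    static Object getByToString(Map<String, Object> map, Object key) {
        return map.get(key.toString());
    }

    public static void main(String[] args) {
        Map<String, Object> map = new HashMap<>();
        map.put("name", "pig");
        Object key = new StringBuilder("name"); // stand-in for a non-String key
        System.out.println(getByToString(map, key)); // prints pig
        try {
            getByCast(map, key);
        } catch (ClassCastException e) {
            System.out.println("cast-based lookup failed");
        }
    }
}
```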
[jira] [Updated] (PIG-3646) LoadFunc cannot get a hold of the associated user defined schema
[ https://issues.apache.org/jira/browse/PIG-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Costin Leau updated PIG-3646: - Description: Described on the mailing list here: http://www.mail-archive.com/user%40pig.apache.org/msg09009.html A Pig {{LoadFunc}} cannot get a hold of its associated schema. For example, in the following script: {code} A = LOAD 'pig/tupleartists' USING MyStorage() AS (name: chararray, links (url:chararray, picture:chararray)); B = FOREACH A GENERATE name, links.url; DUMP B; {code} {{MyStorage}} cannot get a hold of {{(name:chararray, links ...}} even when {{LoadPushDown#pushProjection()}} is implemented (which is called only when a transformation occurs - PlanOptimizer/ColumnMapKeyPrune). One can look into a {{POStore}} but even then the information obtained is incomplete: the schema is partial, and the fields mentioned in {{FOREACH}} are dereferenced ({{links.url}} is returned as {{url}}). The purpose of this issue is to allow a {{LoadFunc}} implementation to get access to its schema declaration as specified in the script. Thanks! was: Described on the mailing list here: http://www.mail-archive.com/user%40pig.apache.org/msg09009.html A Pig LoadFunc cannot get a hold of its associated schema. For example, in the following script: > LoadFunc cannot get a hold of the associated user defined schema > > > Key: PIG-3646 > URL: https://issues.apache.org/jira/browse/PIG-3646 > Project: Pig > Issue Type: Bug > Components: data >Affects Versions: 0.12.0 >Reporter: Costin Leau > > Described on the mailing list here: > http://www.mail-archive.com/user%40pig.apache.org/msg09009.html > A Pig {{LoadFunc}} cannot get a hold of its associated schema. 
For example, > in the following script: > {code} > A = LOAD 'pig/tupleartists' USING MyStorage() AS (name: chararray, links > (url:chararray, picture:chararray)); > B = FOREACH A GENERATE name, links.url; > DUMP B; > {code} > {{MyStorage}} cannot get a hold of {{(name:chararray, links ...}} even when > {{LoadPushDown#pushProjection()}} is implemented (which is called only when a > transformation occurs - PlanOptimizer/ColumnMapKeyPrune). > One can look into a {{POStore}} but even then the information obtained is > incomplete: the schema is partial, and the fields mentioned in > {{FOREACH}} are dereferenced ({{links.url}} is returned as {{url}}). > The purpose of this issue is to allow a {{LoadFunc}} implementation to get > access to its schema declaration as specified in the script. > Thanks! -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (PIG-3646) LoadFunc cannot get a hold of the associated user defined schema
Costin Leau created PIG-3646: Summary: LoadFunc cannot get a hold of the associated user defined schema Key: PIG-3646 URL: https://issues.apache.org/jira/browse/PIG-3646 Project: Pig Issue Type: Bug Components: data Affects Versions: 0.12.0 Reporter: Costin Leau Described on the mailing list here: http://www.mail-archive.com/user%40pig.apache.org/msg09009.html A Pig LoadFunc cannot get a hold of its associated schema. For example, in the following script: -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (PIG-3634) Improve performance of order-by
[ https://issues.apache.org/jira/browse/PIG-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3634: Attachment: PIG-3634-2.patch Not sure why it works before PIG-3636. Reattach patch. > Improve performance of order-by > --- > > Key: PIG-3634 > URL: https://issues.apache.org/jira/browse/PIG-3634 > Project: Pig > Issue Type: Sub-task > Components: tez >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: tez-branch > > Attachments: PIG-3634-0.patch, PIG-3634-1.patch, PIG-3634-2.patch > > > This is a followup for PIG-3534. In PIG-3534, we use 5 vertexes (3 DAGs) to > implement an order-by. We can optimize to use 4 vertexes in 1 DAG: > vertex 1: close the current vertex, create input + samples input > vertex 2: aggregate samples to create quantiles > vertex 3: use quantiles to partition input > vertex 4: sort input after partition > The DAG is: > {code} > vertex 1 --> vertex 3 --> vertex 4 >\--> vertex 2 ---/ > {code} -- This message was sent by Atlassian JIRA (v6.1.5#6160)