[jira] [Commented] (PIG-3634) Improve performance of order-by
[ https://issues.apache.org/jira/browse/PIG-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860079#comment-13860079 ]

Daniel Dai commented on PIG-3634:

Thanks Cheolsoo!

Improve performance of order-by

Key: PIG-3634
URL: https://issues.apache.org/jira/browse/PIG-3634
Project: Pig
Issue Type: Sub-task
Components: tez
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: tez-branch
Attachments: PIG-3634-0.patch, PIG-3634-1.patch, PIG-3634-2.patch, PIG-3634-3.patch

This is a followup for PIG-3534. In PIG-3534, we use 5 vertexes (3 DAGs) to implement an order-by. We can optimize to use 4 vertexes in 1 DAG:
vertex 1: close the current vertex, create input + samples input
vertex 2: aggregate samples to create quantiles
vertex 3: use quantiles to partition input
vertex 4: sort input after partition

The DAG is:
{code}
vertex 1 -- vertex 3 -- vertex 4
      \-- vertex 2 ---/
{code}

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
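The four vertexes above can be read as a sample, quantile, partition, sort pipeline. The following is only a minimal in-memory sketch of that idea in Python; it is not the code in the attached patches, and every function name and parameter here (sample_keys, build_quantiles, partition_by_quantiles, order_by, num_reducers, per_partition) is made up for illustration.

```python
import bisect
import random


def sample_keys(partitions, per_partition, seed=0):
    """Vertex 1 (sketch): leave the input as it is and draw a small key sample."""
    rng = random.Random(seed)
    samples = []
    for part in partitions:
        k = min(per_partition, len(part))
        samples.extend(rng.sample(part, k))
    return samples


def build_quantiles(samples, num_reducers):
    """Vertex 2 (sketch): aggregate the samples into num_reducers - 1 cut points."""
    ordered = sorted(samples)
    if not ordered or num_reducers <= 1:
        return []
    return [ordered[(i * len(ordered)) // num_reducers]
            for i in range(1, num_reducers)]


def partition_by_quantiles(partitions, quantiles):
    """Vertex 3 (sketch): route each key to the bucket whose range contains it."""
    buckets = [[] for _ in range(len(quantiles) + 1)]
    for part in partitions:
        for key in part:
            buckets[bisect.bisect_right(quantiles, key)].append(key)
    return buckets


def order_by(partitions, num_reducers=3, per_partition=4):
    """Run the four steps in sequence and return one globally sorted list."""
    samples = sample_keys(partitions, per_partition)
    quantiles = build_quantiles(samples, num_reducers)
    buckets = partition_by_quantiles(partitions, quantiles)
    # Vertex 4 (sketch): sort each bucket. Buckets cover disjoint, increasing
    # key ranges, so concatenating them yields a total order.
    return [key for bucket in buckets for key in sorted(bucket)]
```

Because the cut points are sorted, each bucket holds a contiguous key range, which is why a per-bucket sort is enough. The quality of the sample only affects how evenly the buckets are filled, not the correctness of the result.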
[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)
[ https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860156#comment-13860156 ]

Gianmarco De Francisci Morales commented on PIG-3642:

I am -0 on this idea. Skipping MR requires rewriting a good part of the execution logic, and might introduce weird optimization bugs. More importantly, the added advantage brought by this feature is small. Usually, if you want to test your program on a small input, you copy it locally and run Pig in local mode.

Direct HDFS access for small jobs (fetch)

Key: PIG-3642
URL: https://issues.apache.org/jira/browse/PIG-3642
Project: Pig
Issue Type: Improvement
Reporter: Lorand Bendig
Assignee: Lorand Bendig
Fix For: 0.13.0
Attachments: PIG-3642.patch

With this patch I'd like to add the possibility to directly read data from HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive already has this feature (fetch). This patch shares some similarities with the local mode of Pig 0.6. Here, fetching kicks off when the following holds for a script:
* it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, (nested) FOREACH with expression operators, custom UDFs, etc.
* no scalar aliases
* no SampleLoader
* single leaf job
* DUMP (no STORE)

The feature is enabled by default and can be toggled with:
* -N or -no_fetch
* set opt.fetch true/false;

There's no STORE support because I wanted to make it explicit that this optimization is for launching small/simple scripts during development, rather than querying and filtering a large number of rows on the client machine. However, a threshold could be given on the input size (an estimation) to determine whether to prefer fetch over MR jobs, similar to what Hive's '{{hive.fetch.task.conversion.threshold}}' does. (through Pig's LoadMetadata#getStatistic ?)
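The eligibility conditions listed in the description can be read as a single predicate over a script. The sketch below is written in Python purely for illustration and is not taken from the patch: ScriptSummary, its fields, FETCHABLE_OPS and is_fetchable are hypothetical names, and treating LOAD as an allowed operator is an assumption (the description does not list it, but a script has to read its input somehow).

```python
from dataclasses import dataclass

# Operators named in the description as fetch-compatible; LOAD is an assumption.
FETCHABLE_OPS = frozenset({"LOAD", "LIMIT", "FILTER", "UNION", "STREAM", "FOREACH"})


@dataclass
class ScriptSummary:
    """Hypothetical summary of a script; this is not a Pig class."""
    operators: list
    sink: str = "DUMP"                  # "DUMP" or "STORE"
    leaf_jobs: int = 1
    has_scalar_alias: bool = False
    uses_sample_loader: bool = False
    union_generates_split: bool = False


def is_fetchable(script, opt_fetch=True):
    """Return True only if every condition from the description holds."""
    if not opt_fetch:                   # toggled off via -N / -no_fetch / opt.fetch
        return False
    if script.sink != "DUMP" or script.leaf_jobs != 1:
        return False
    if script.has_scalar_alias or script.uses_sample_loader:
        return False
    if "UNION" in script.operators and script.union_generates_split:
        return False
    return all(op in FETCHABLE_OPS for op in script.operators)
```

For example, a LOAD, FILTER, LIMIT script ending in DUMP passes the check, while the same script with an ORDER operator or a STORE sink would fall back to a regular MR job.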
[jira] [Updated] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-3453:

Status: Open (was: Patch Available)

Canceling patch as it is not ready to be committed.

Implement a Storm backend to Pig

Key: PIG-3453
URL: https://issues.apache.org/jira/browse/PIG-3453
Project: Pig
Issue Type: New Feature
Affects Versions: 0.13.0
Reporter: Pradeep Gollakota
Assignee: Jacob Perkins
Labels: storm
Fix For: 0.13.0
Attachments: storm-integration.patch

There is a lot of interest around implementing a Storm backend to Pig for streaming processing. The proposal and initial discussions can be found at https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal
Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)
On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
> This is great work. Thank you very much! I have a few minor comments below, mostly about tests.

Cheolsoo, thanks for taking the time to review it! I fixed or commented on the issues.

On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
> /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java, line 74
> https://reviews.apache.org/r/16507/diff/1/?file=404117#file404117line74
>
> Can you move this to PigConfiguration?

PigConfiguration seems to be a better place to put OPT_FETCH.

On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
> /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java, line 23
> https://reviews.apache.org/r/16507/diff/1/?file=404122#file404122line23
>
> Do you mind removing unused imports?
> - import java.util.LinkedList;
> - import org.apache.pig.impl.util.IdentityHashSet;
> - import org.apache.pig.pen.util.LineageTracer;

Sure. Initially I didn't want to remove these leftovers from PIG-1712.

On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
> /trunk/test/org/apache/pig/test/TestAssert.java, lines 91-94
> https://reviews.apache.org/r/16507/diff/1/?file=404127#file404127line91
>
> Does this else block ever get executed given that we're running the test with opt.fetch on? I think you can do either:
> 1) explicitly set opt.fetch to true or false in setup(), or
> 2) change the test to run the query twice with opt.fetch on and off to ensure we're not breaking anything when opt.fetch is off.

Not really. I chose the second option.

On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
> /trunk/test/org/apache/pig/test/TestPigRunner.java, line 174
> https://reviews.apache.org/r/16507/diff/1/?file=404130#file404130line174
>
> Why is this changed? I think the default value for opt.multiquery is true.

I accidentally changed it.

- Lorand

---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/16507/#review30936
---

On Dec. 29, 2013, 11:19 p.m., Lorand Bendig wrote:
> ---
> This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/16507/
> ---
>
> (Updated Dec. 29, 2013, 11:19 p.m.)
>
> Review request for pig.
>
> Bugs: PIG-3642
> https://issues.apache.org/jira/browse/PIG-3642
>
> Repository: pig
Diffs
-----
/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 1553596
/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java 1553596
/trunk/src/org/apache/pig/Main.java 1553596
/trunk/src/org/apache/pig/PigServer.java 1553596
/trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1553596
/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java PRE-CREATION
/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java PRE-CREATION
/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java PRE-CREATION
/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java PRE-CREATION
/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1553596
/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1553596
Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)
---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/16507/
---

(Updated Jan. 2, 2014, 2:05 p.m.)

Review request for pig.

Changes
-------
Updated patch: PIG-3642-2.patch

Bugs: PIG-3642
https://issues.apache.org/jira/browse/PIG-3642

Repository: pig

Description
-----------
With this patch I'd like to add the possibility to directly read data from HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive already has this feature (fetch). This patch shares some similarities with the local mode of Pig 0.6. Here, fetching kicks off when the following holds for a script:
* it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, (nested) FOREACH with expression operators, custom UDFs, etc.
* no scalar aliases
* no SampleLoader
* single leaf job
* DUMP (no STORE)

The feature is enabled by default and can be toggled with:
* -N or -no_fetch
* set opt.fetch true/false;

There's no STORE support because I wanted to make it explicit that this optimization is for launching small/simple scripts during development, rather than querying and filtering a large number of rows on the client machine. However, a threshold could be given on the input size (an estimation) to determine whether to prefer fetch over MR jobs, similar to what Hive's 'hive.fetch.task.conversion.threshold' does. (through Pig's LoadMetadata#getStatistic ?)
Diffs (updated)
---------------
/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 1554785
/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java 1554785
/trunk/src/org/apache/pig/Main.java 1554785
/trunk/src/org/apache/pig/PigConfiguration.java 1554785
/trunk/src/org/apache/pig/PigServer.java 1554785
/trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1554785
/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java PRE-CREATION
/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java PRE-CREATION
/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java PRE-CREATION
/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java PRE-CREATION
/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1554785
/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1554785
/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java 1554785
/trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1554785
/trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1554785
/trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1554785
/trunk/src/org/apache/pig/tools/pigstats/SimpleFetchPigStats.java PRE-CREATION
/trunk/test/org/apache/pig/test/TestAssert.java 1554785
/trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1554785
/trunk/test/org/apache/pig/test/TestFetch.java PRE-CREATION
/trunk/test/org/apache/pig/test/TestPigRunner.java 1554785

Diff: https://reviews.apache.org/r/16507/diff/

Testing
-------
- new testcase added: TestFetch
- the patch was checked against test-commit and test-core
- Because opt.fetch is set by default, the testcases were using fetch instead of MR jobs wherever it was possible

Thanks,
Lorand Bendig
[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)
[ https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860224#comment-13860224 ]

Lorand Bendig commented on PIG-3642:

[~azaroth], I took the idea of this patch from HIVE-2925 and PIG-2864. I agree that the benefit is limited; however, simple scripts/queries will run significantly faster than in local MR mode. As far as I can judge, aside from some mocking and initialization, the execution logic literally follows Pig's pull-based model. What optimization bugs do you think can happen?
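To make "pull-based" concrete: in a pull model the leaf operator asks its predecessor for the next tuple, which asks its own predecessor, and so on back to the loader. The sketch below imitates that with Python generators. It is only an illustration of the general model, not Pig's operator code, and load, filter_rows, foreach, limit and dump are illustrative names, not Pig APIs.

```python
from itertools import islice


def load(records, pulled=None):
    """Source operator: yields one record each time the consumer pulls."""
    for record in records:
        if pulled is not None:
            pulled.append(record)       # remember what was actually read
        yield record


def filter_rows(rows, predicate):
    """FILTER: pulls from its input until a record passes the predicate."""
    return (row for row in rows if predicate(row))


def foreach(rows, fn):
    """FOREACH: transforms each record as it is pulled."""
    return (fn(row) for row in rows)


def limit(rows, n):
    """LIMIT: stops pulling once n records have been produced."""
    return islice(rows, n)


def dump(leaf):
    """DUMP: the leaf drives the whole pipeline by pulling from it."""
    return list(leaf)
```

A short usage example: dump(limit(foreach(filter_rows(load(range(100)), lambda x: x % 2 == 0), lambda x: x * 10), 3)) pulls only as many input records as it needs to produce three outputs, which is the property that makes this style attractive for small interactive queries.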
[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)
[ https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860228#comment-13860228 ]

Gianmarco De Francisci Morales commented on PIG-3642:

I haven't reviewed the patch thoroughly, so take my comments with due care. I am just afraid that we will repeat the mistake we made with the local mode execution of Pig that you mention in the ticket. That mode of execution was removed because it was a burden to maintain, and in the end the two implementations (MR and local mode) were out of sync, resulting in the same script doing different things. I just want to avoid the same thing happening again. If [~cheolsoo] has reviewed the patch, I would like to hear his comments on this issue.
[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)
[ https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860392#comment-13860392 ]

Cheolsoo Park commented on PIG-3642:

[~azaroth], thank you for raising a concern. But I still think we should commit this patch for the following reasons:
# Fetch optimization happens after the physical plan is fully built. If the plan is fetchable (i.e. meets all the conditions Lorand listed in the description), Pig will launch a job via FetchLauncher instead of via MapReduceLauncher. Given this code path, I think the possibility of introducing a weird optimization bug is minimal. In addition, the optimization is only applicable to fairly small queries.
# There are indeed changes to some backend operators such as POStream. This is because the logic about when to pull data from the pipeline is different in some cases. But these changes are fairly minimal too.
# IMO, the benefit of this optimization is big. I am constantly asked by users about this feature. It's true that it won't improve the performance of production ETL jobs, but it will shorten development iteration. In addition, launching a full MR job for a simple load/dump query definitely makes a bad impression on new users.
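The code path described in the first point can be summarized as: build the physical plan as usual, then decide at the very end which launcher receives it. The Python sketch below only illustrates that ordering. FetchLauncher and MapReduceLauncher are the names used in the comment above; choose_launcher, the plan representation and the toy predicate are hypothetical and do not reflect the real Pig classes.

```python
def choose_launcher(physical_plan, is_fetchable, opt_fetch=True):
    """Pick a launcher name for a plan that has already been fully built.

    is_fetchable is any callable that inspects the finished plan; the plan
    itself is never modified by this decision.
    """
    if opt_fetch and is_fetchable(physical_plan):
        return "FetchLauncher"
    return "MapReduceLauncher"


def only_simple_operators(physical_plan):
    """Toy predicate (an assumption): the plan is a list of operator names."""
    return set(physical_plan) <= {"LOAD", "FILTER", "LIMIT", "FOREACH"}
```

Since the decision is taken after planning and leaves the plan untouched, turning the feature off simply routes every plan to the existing launcher.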
[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)
[ https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860531#comment-13860531 ]

Alan Gates commented on PIG-3642:

I don't think this will result in the same local mode/MR mode problem that we had before. The issue there was that we tried (and failed) to have two modes where Pig provided all features. This is much more limited: it only does locally what can easily be done locally.
[jira] [Commented] (PIG-3615) Update the way that JsonLoader/JsonStorage deal with BigDecimal
[ https://issues.apache.org/jira/browse/PIG-3615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860699#comment-13860699 ]

Erik Selin commented on PIG-3615:

Could someone take a look at this tiny little PR? It would be great to get it merged or at least have a discussion about it :)

Update the way that JsonLoader/JsonStorage deal with BigDecimal

Key: PIG-3615
URL: https://issues.apache.org/jira/browse/PIG-3615
Project: Pig
Issue Type: Improvement
Affects Versions: 0.12.0
Reporter: Erik Selin
Priority: Minor
Attachments: bugPig-3615.patch

It's a common (and good) convention to quote fixed point numbers when storing them as JSON. The reason is that the majority of JSON libraries will implicitly load any number value as a floating point number, and if you care about data integrity this will make you very sad. This update makes JsonLoader able to load BigDecimal values from quoted values (the old Jackson library that we're using doesn't support this through the current approach) as well as making JsonStorage store BigDecimal values as quoted strings.
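The precision problem is easy to reproduce outside of Pig. The sketch below uses Python's standard json and decimal modules as a stand-in for JsonLoader/JsonStorage; it is not the patch itself, and to_json, from_json and the field name price are hypothetical. It shows that an unquoted number goes through a binary float and loses digits, while a quoted value round-trips exactly.

```python
import json
from decimal import Decimal


def to_json(record):
    """Store Decimal values as quoted strings so no float conversion occurs."""
    return json.dumps(
        {key: str(value) if isinstance(value, Decimal) else value
         for key, value in record.items()}
    )


def from_json(text, decimal_fields):
    """Load the named fields back as Decimal from their quoted form."""
    raw = json.loads(text)
    return {key: Decimal(value) if key in decimal_fields else value
            for key, value in raw.items()}


original = Decimal("12345678901234567890.123456789")

# Quoted: the exact digits survive the round trip.
quoted_round_trip = from_json(to_json({"price": original}), {"price"})["price"]

# Unquoted: the default parser turns the number into a float first.
unquoted_value = json.loads('{"price": 12345678901234567890.123456789}')["price"]
```

Here quoted_round_trip equals the original value, whereas unquoted_value is a float that can no longer represent all of the original digits.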
[jira] [Commented] (PIG-3615) Update the way that JsonLoader/JsonStorage deal with BigDecimal
[ https://issues.apache.org/jira/browse/PIG-3615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860715#comment-13860715 ]

Cheolsoo Park commented on PIG-3615:

[~tyro89], please hit the Submit Patch button. That will make this jira show up in the Patch Available list.
[jira] [Updated] (PIG-3615) Update the way that JsonLoader/JsonStorage deal with BigDecimal
[ https://issues.apache.org/jira/browse/PIG-3615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erik Selin updated PIG-3615:

Status: Patch Available (was: Open)
[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (8 issues)
Subscriber: pigdaily

Key         Summary
PIG-3642    Direct HDFS access for small jobs (fetch)
            https://issues.apache.org/jira/browse/PIG-3642
PIG-3635    Fix e2e tests for Hadoop 2.X on Windows
            https://issues.apache.org/jira/browse/PIG-3635
PIG-3615    Update the way that JsonLoader/JsonStorage deal with BigDecimal
            https://issues.apache.org/jira/browse/PIG-3615
PIG-3613    UDF for SimilarityMatching between strings with matching scores
            https://issues.apache.org/jira/browse/PIG-3613
PIG-3587    add functionality for rolling over dates
            https://issues.apache.org/jira/browse/PIG-3587
PIG-3573    Provide StoreFunc and LoadFunc for Accumulo
            https://issues.apache.org/jira/browse/PIG-3573
PIG-3441    Allow Pig to use default resources from Configuration objects
            https://issues.apache.org/jira/browse/PIG-3441
PIG-3347    Store invocation brings side effect
            https://issues.apache.org/jira/browse/PIG-3347

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384
Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)
---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/16507/#review31099
---

I have one last comment below. Other than that, everything looks good.

Also, can you document this? I think it's worth mentioning in the Performance and Efficiency section of the manual. You can post a doc patch in a separate jira if you'd like.

/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java
https://reviews.apache.org/r/16507/#comment59452

This won't work if the temporary file storage is not InterStorage. It can be one of Inter, TFile, and SequenceFile storages. See here:
https://github.com/apache/pig/blob/trunk/src/org/apache/pig/impl/util/Utils.java#L347

- Cheolsoo Park