-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16507/
-----------------------------------------------------------

(Updated Jan. 2, 2014, 2:05 p.m.)


Review request for pig.


Changes
-------

Updated patch: PIG-3642-2.patch


Bugs: PIG-3642
    https://issues.apache.org/jira/browse/PIG-3642


Repository: pig


Description
-------

With this patch I'd like to add the possibility to read data directly from HDFS instead of launching MR jobs for simple (map-only) tasks. Hive already has this feature (fetch). This patch shares some similarities with the local mode of Pig 0.6.

Here, fetching kicks in when all of the following hold for a script:
- it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, (nested) FOREACH with expression operators, custom UDFs, etc.
- no scalar aliases
- no SampleLoader
- single leaf job
- DUMP (no STORE)

The feature is enabled by default and can be toggled with:
- -N or -no_fetch
- set opt.fetch true/false;

There's no STORE support because I wanted to make it explicit that this "optimization" is for launching small/simple scripts during development, rather than for querying and filtering a large number of rows on the client machine. However, a threshold could be placed on the (estimated) input size to determine whether to prefer fetch over MR jobs, similar to what Hive's 'hive.fetch.task.conversion.threshold' does (through Pig's LoadMetadata#getStatistics?).

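For illustration, the eligibility conditions listed in the description can be sketched as a small predicate. This is not code from PIG-3642-2.patch: the class name, the Op enum, and the isFetchable signature are all hypothetical, and the real check presumably works on Pig's physical plan rather than on a flat list of operator kinds.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch only; none of these names come from the patch. */
final class FetchEligibilitySketch {

    /** Simplified operator kinds, a hypothetical stand-in for Pig's physical operators. */
    enum Op { LOAD, LIMIT, FILTER, UNION, STREAM, FOREACH, SPLIT, JOIN, GROUP, ORDER, STORE, DUMP }

    /** Operators the description names as fetchable, plus the LOAD and DUMP such a script needs. */
    private static final Set<Op> FETCHABLE = EnumSet.of(
            Op.LOAD, Op.LIMIT, Op.FILTER, Op.UNION, Op.STREAM, Op.FOREACH, Op.DUMP);

    /**
     * Returns true when a script, described by the operator kinds it uses,
     * meets the conditions from the description.
     */
    static boolean isFetchable(boolean fetchEnabled, List<Op> ops, int leafJobs,
                               boolean usesScalarAlias, boolean usesSampleLoader) {
        if (!fetchEnabled) {
            return false;                      // turned off via -N / -no_fetch or opt.fetch=false
        }
        if (leafJobs != 1) {
            return false;                      // single leaf job only
        }
        if (usesScalarAlias || usesSampleLoader) {
            return false;                      // no scalar aliases, no SampleLoader
        }
        if (!ops.contains(Op.DUMP)) {
            return false;                      // the script must end in DUMP
        }
        // STORE, SPLIT and anything that needs a reduce phase are not in FETCHABLE.
        return FETCHABLE.containsAll(ops);
    }
}
```

Under these assumptions, a LOAD, FILTER, LIMIT, DUMP script with a single leaf job passes the check, while the same script ending in STORE, or run with -N / -no_fetch, falls back to MR jobs.
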
Diffs (updated)
-----

  /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 1554785
  /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java 1554785
  /trunk/src/org/apache/pig/Main.java 1554785
  /trunk/src/org/apache/pig/PigConfiguration.java 1554785
  /trunk/src/org/apache/pig/PigServer.java 1554785
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1554785
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java PRE-CREATION
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java PRE-CREATION
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java PRE-CREATION
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java PRE-CREATION
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1554785
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1554785
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java 1554785
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1554785
  /trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1554785
  /trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1554785
  /trunk/src/org/apache/pig/tools/pigstats/SimpleFetchPigStats.java PRE-CREATION
  /trunk/test/org/apache/pig/test/TestAssert.java 1554785
  /trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1554785
  /trunk/test/org/apache/pig/test/TestFetch.java PRE-CREATION
  /trunk/test/org/apache/pig/test/TestPigRunner.java 1554785

Diff: https://reviews.apache.org/r/16507/diff/


Testing
-------

- new test case added: TestFetch
- the patch was checked against test-commit and test-core
- because opt.fetch is enabled by default, the test cases used fetch instead of MR jobs wherever possible


Thanks,

Lorand Bendig