-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16507/
-----------------------------------------------------------
(Updated Jan. 2, 2014, 2:05 p.m.)
Review request for pig.
Changes
-------
Updated patch: PIG-3642-2.patch
Bugs: PIG-3642
https://issues.apache.org/jira/browse/PIG-3642
Repository: pig
Description
-------
This patch adds the ability to read data directly from HDFS instead of
launching MR jobs for simple (map-only) tasks. Hive already has this feature
(fetch). The patch shares some similarities with the local mode of Pig 0.6.
Fetching kicks in when all of the following hold for a script:
- it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM,
  (nested) FOREACH with expression operators, custom UDFs, etc.
- no scalar aliases
- no SampleLoader
- single leaf job
- DUMP (no STORE)
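As a rough illustration, the conditions listed above can be read as one conjunctive check. The sketch below is not the FetchOptimizer code from the patch: the class, the enum, and the ScriptInfo fields are hypothetical stand-ins for what a plan visitor would find, and treating LOAD as an allowed operator is an assumption.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these names come from the patch.
final class FetchEligibilitySketch {

    // Simplified operator kinds that a script's plan may contain.
    enum Op { LOAD, LIMIT, FILTER, UNION, STREAM, FOREACH, GROUP, JOIN, ORDER }

    // Operators the description lists as fetch-friendly (LOAD added by assumption).
    static final Set<Op> ALLOWED =
            EnumSet.of(Op.LOAD, Op.LIMIT, Op.FILTER, Op.UNION, Op.STREAM, Op.FOREACH);

    // Flattened stand-in for what a real plan visitor would discover.
    static final class ScriptInfo {
        final List<Op> ops;
        final boolean generatesSplit;
        final boolean hasScalarAlias;
        final boolean usesSampleLoader;
        final int leafJobs;
        final boolean endsWithDump;

        ScriptInfo(List<Op> ops, boolean generatesSplit, boolean hasScalarAlias,
                   boolean usesSampleLoader, int leafJobs, boolean endsWithDump) {
            this.ops = ops;
            this.generatesSplit = generatesSplit;
            this.hasScalarAlias = hasScalarAlias;
            this.usesSampleLoader = usesSampleLoader;
            this.leafJobs = leafJobs;
            this.endsWithDump = endsWithDump;
        }
    }

    // True only when every condition from the list holds.
    static boolean isFetchable(ScriptInfo s) {
        return ALLOWED.containsAll(s.ops)
                && !s.generatesSplit
                && !s.hasScalarAlias
                && !s.usesSampleLoader
                && s.leafJobs == 1
                && s.endsWithDump;
    }

    public static void main(String[] args) {
        ScriptInfo simple = new ScriptInfo(
                Arrays.asList(Op.LOAD, Op.FILTER, Op.LIMIT), false, false, false, 1, true);
        ScriptInfo grouped = new ScriptInfo(
                Arrays.asList(Op.LOAD, Op.GROUP, Op.FOREACH), false, false, false, 1, true);
        System.out.println(isFetchable(simple));   // true
        System.out.println(isFetchable(grouped));  // false: GROUP needs a reduce phase
    }
}
```

Any single failing condition sends the script down the regular MR path, so the check is deliberately conservative.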
The feature is enabled by default and can be toggled with:
- -N or -no_fetch on the command line
- set opt.fetch true/false; in a script
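To make the interplay of the two switches concrete, here is a minimal sketch of how the effective setting could be resolved. Only the flag names (-N, -no_fetch) and the property name (opt.fetch) come from the description above. The class, the method, and the precedence (the command-line flag always wins) are assumptions for illustration and may not match what Main.java and PigServer.java actually do.

```java
import java.util.Properties;

// Hypothetical sketch: resolves whether fetch is on from flags and properties.
final class FetchToggleSketch {

    static final String OPT_FETCH = "opt.fetch";

    // Assumed precedence: -N / -no_fetch disables fetch regardless of the property.
    static boolean fetchEnabled(String[] args, Properties props) {
        for (String arg : args) {
            if ("-N".equals(arg) || "-no_fetch".equals(arg)) {
                return false;
            }
        }
        // Fetch is on by default, so a missing property counts as "true".
        return Boolean.parseBoolean(props.getProperty(OPT_FETCH, "true"));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(fetchEnabled(new String[] {}, props));      // true
        System.out.println(fetchEnabled(new String[] {"-N"}, props));  // false
        props.setProperty(OPT_FETCH, "false");
        System.out.println(fetchEnabled(new String[] {}, props));      // false
    }
}
```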
There is no STORE support because I wanted to make it explicit that this
"optimization" is for running small/simple scripts during development, rather
than for querying and filtering a large number of rows on the client machine.
However, a threshold on the (estimated) input size could be used to decide
whether to prefer fetch over MR jobs, similar to what Hive's
'hive.fetch.task.conversion.threshold' does (perhaps through Pig's
LoadMetadata#getStatistics?).
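The threshold idea is not part of this patch. If it were added, the decision could look roughly like the sketch below, where the class and method are hypothetical and the size estimate is passed in as a plain value rather than obtained from a loader. Falling back to MR when no estimate is available is also an assumption.

```java
// Hypothetical sketch of a size-based fetch decision; not implemented in the patch.
final class FetchThresholdSketch {

    // estimatedInputBytes may be null or negative when no estimate is available.
    static boolean preferFetch(Long estimatedInputBytes, long thresholdBytes) {
        if (estimatedInputBytes == null || estimatedInputBytes < 0) {
            // Unknown size: stay on the safe side and run MR jobs.
            return false;
        }
        return estimatedInputBytes <= thresholdBytes;
    }

    public static void main(String[] args) {
        long threshold = 64L * 1024 * 1024; // 64 MB, an arbitrary example value
        System.out.println(preferFetch(10L * 1024 * 1024, threshold));   // true
        System.out.println(preferFetch(512L * 1024 * 1024, threshold));  // false
        System.out.println(preferFetch(null, threshold));                // false
    }
}
```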
Diffs (updated)
-----
/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 1554785
/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java 1554785
/trunk/src/org/apache/pig/Main.java 1554785
/trunk/src/org/apache/pig/PigConfiguration.java 1554785
/trunk/src/org/apache/pig/PigServer.java 1554785
/trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1554785
/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java PRE-CREATION
/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java PRE-CREATION
/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java PRE-CREATION
/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java PRE-CREATION
/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1554785
/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1554785
/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java 1554785
/trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1554785
/trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1554785
/trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1554785
/trunk/src/org/apache/pig/tools/pigstats/SimpleFetchPigStats.java PRE-CREATION
/trunk/test/org/apache/pig/test/TestAssert.java 1554785
/trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1554785
/trunk/test/org/apache/pig/test/TestFetch.java PRE-CREATION
/trunk/test/org/apache/pig/test/TestPigRunner.java 1554785
Diff: https://reviews.apache.org/r/16507/diff/
Testing
-------
- new testcase added: TestFetch
- the patch was checked against test-commit and test-core
- because opt.fetch is enabled by default, the test cases used fetch instead
  of MR jobs wherever possible
Thanks,
Lorand Bendig