Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)

Lorand Bendig Sun, 29 Dec 2013 15:20:06 -0800

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16507/
-----------------------------------------------------------


Review request for pig.


Bugs: PIG-3642
    https://issues.apache.org/jira/browse/PIG-3642


Repository: pig


Description
-------

With this patch I'd like to add the possibility to directly read data from HDFS 
instead of launching MR jobs in case of simple (map-only) tasks. Hive already 
has this feature (fetch). This patch shares some similarities with the local 
mode of Pig 0.6. Here, fetching kicks off when the following holds for a script:

    it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, 
(nested) FOREACH with expression operators, custom UDFs..etc
    no scalar aliases
    no SampleLoader
    single leaf job
    DUMP (no STORE)

The feature is enabled by default and can be toggled with:

    -N or -no_fetch
    set opt.fetch true/false;

There's no STORE support because I wanted to make it explicit that this 
"optimization" is for launching small/simple scripts during development, rather 
than querying and filtering large number of rows on the client machine. 
However, a threshold could be given on the input size (an estimation) to 
determine whether to prefer fetch over MR jobs, similar to what Hive's 
'hive.fetch.task.conversion.threshold' does. (through Pig's 
LoadMetadata#getStatistic ?)


Diffs
-----

  
/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
 1553596 
  
/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java
 1553596 
  /trunk/src/org/apache/pig/Main.java 1553596 
  /trunk/src/org/apache/pig/PigServer.java 1553596 
  
/trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 
1553596 
  
/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java
 PRE-CREATION 
  
/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java
 PRE-CREATION 
  
/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java
 PRE-CREATION 
  
/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java
 PRE-CREATION 
  
/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
 1553596 
  
/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
 1553596 
  
/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java
 1553596 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 
1553596 
  /trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1553596 
  
/trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java
 1553596 
  /trunk/src/org/apache/pig/tools/pigstats/SimpleFetchPigStats.java 
PRE-CREATION 
  /trunk/test/org/apache/pig/test/TestAssert.java 1553596 
  /trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1553596 
  /trunk/test/org/apache/pig/test/TestFetch.java PRE-CREATION 
  /trunk/test/org/apache/pig/test/TestPigRunner.java 1553596 

Diff: https://reviews.apache.org/r/16507/diff/


Testing
-------

- new testcase added:  TestFetch
- the patch was checked against test-commit and test-core
- Because opt.fetch is set by default, the testcases were using fetch instead 
of MR jobs wherever it was possible


Thanks,

Lorand Bendig

Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)

Reply via email to