Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)

Cheolsoo Park Thu, 02 Jan 2014 17:19:26 -0800

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16507/#review31099
-----------------------------------------------------------



I have one last comment below. Other than that, everything looks good.

Also, can you document this? I think it's worth mentioning in the "Performance 
and Efficiency" section of the manual. You can post a doc patch in a separate 
jira if you'd like.


/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java
<https://reviews.apache.org/r/16507/#comment59452>

    This won't work if the temporary file storage is not InterStorage. It can
    be any of the Inter, TFile, or SequenceFile storages.
    
    See:
    
https://github.com/apache/pig/blob/trunk/src/org/apache/pig/impl/util/Utils.java#L347
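
As a rough illustration of the point above (a sketch, not the code in the patch), a check that accepts any of the three temporary storages rather than only InterStorage might look like the following. The class `TmpStorageCheck`, the method `isTmpStorage`, and the simple class names in the set are assumptions made for this example and should be verified against Utils.java.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TmpStorageCheck {
    // Assumed simple class names for the Inter, TFile, and SequenceFile
    // temporary storages; verify against Utils.java before relying on them.
    private static final Set<String> TMP_STORAGES = new HashSet<>(
            Arrays.asList("InterStorage", "TFileStorage", "SequenceFileInterStorage"));

    /** Returns true if the loader class name matches any known temporary storage. */
    static boolean isTmpStorage(String loaderClassName) {
        if (loaderClassName == null) {
            return false;
        }
        // Drop the package prefix, if any, and compare on the simple name.
        String simple = loaderClassName.substring(loaderClassName.lastIndexOf('.') + 1);
        return TMP_STORAGES.contains(simple);
    }

    public static void main(String[] args) {
        System.out.println(isTmpStorage("org.apache.pig.impl.io.TFileStorage"));
        System.out.println(isTmpStorage("org.apache.pig.builtin.PigStorage"));
    }
}
```

Comparing against a set keeps the check independent of which temporary storage is configured, so it does not silently fail when TFile or SequenceFile is selected.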
    


- Cheolsoo Park


On Jan. 2, 2014, 2:05 p.m., Lorand Bendig wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16507/
> -----------------------------------------------------------
> 
> (Updated Jan. 2, 2014, 2:05 p.m.)
> 
> 
> Review request for pig.
> 
> 
> Bugs: PIG-3642
>     https://issues.apache.org/jira/browse/PIG-3642
> 
> 
> Repository: pig
> 
> 
> Description
> -------
> 
> With this patch I'd like to add the possibility of reading data directly 
> from HDFS instead of launching MR jobs for simple (map-only) tasks. Hive 
> already has this feature (fetch). This patch shares some similarities with 
> the local mode of Pig 0.6. Here, fetching kicks in when all of the following 
> hold for a script:
> 
>     - it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM,
>       (nested) FOREACH with expression operators, custom UDFs, etc.
>     - no scalar aliases
>     - no SampleLoader
>     - single leaf job
>     - DUMP (no STORE)
> 
> The feature is enabled by default and can be toggled with:
> 
>     -N or -no_fetch
>     set opt.fetch true/false;
> 
> There's no STORE support because I wanted to make it explicit that this 
> "optimization" is for launching small/simple scripts during development, 
> rather than querying and filtering a large number of rows on the client 
> machine. However, a threshold on the estimated input size could be used 
> to decide whether to prefer fetch over MR jobs, similar to what Hive's 
> 'hive.fetch.task.conversion.threshold' does (perhaps through Pig's 
> LoadMetadata#getStatistics?).
> 
> 
> Diffs
> -----
> 
>   /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 1554785
>   /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java 1554785
>   /trunk/src/org/apache/pig/Main.java 1554785
>   /trunk/src/org/apache/pig/PigConfiguration.java 1554785
>   /trunk/src/org/apache/pig/PigServer.java 1554785
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1554785
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java PRE-CREATION
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java PRE-CREATION
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java PRE-CREATION
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java PRE-CREATION
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1554785
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1554785
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java 1554785
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1554785
>   /trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1554785
>   /trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1554785
>   /trunk/src/org/apache/pig/tools/pigstats/SimpleFetchPigStats.java PRE-CREATION
>   /trunk/test/org/apache/pig/test/TestAssert.java 1554785
>   /trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1554785
>   /trunk/test/org/apache/pig/test/TestFetch.java PRE-CREATION
>   /trunk/test/org/apache/pig/test/TestPigRunner.java 1554785
> 
> Diff: https://reviews.apache.org/r/16507/diff/
> 
> 
> Testing
> -------
> 
> - new testcase added:  TestFetch
> - the patch was checked against test-commit and test-core
> - because opt.fetch is enabled by default, the existing test cases used 
> fetch instead of MR jobs wherever possible
> 
> 
> Thanks,
> 
> Lorand Bendig
> 
>
