Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)

Lorand Bendig Thu, 02 Jan 2014 06:05:13 -0800


> On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
> > This a great work. Thank you very much!
> > 
> > I have few minor comments below mostly about tests.


Cheolsoo, thanks for taking your time to review it!
I fixed/commented the issues.


> On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
> > /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java,
> >  line 74
> > <https://reviews.apache.org/r/16507/diff/1/?file=404117#file404117line74>
> >
> >     Can you move this to PigConfiguration?

PigConfiguration seems to be a better place to put OPT_FETCH


> On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
> > /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java,
> >  line 23
> > <https://reviews.apache.org/r/16507/diff/1/?file=404122#file404122line23>
> >
> >     Do you mind removing unused imports?
> >     - import java.util.LinkedList;
> >     - import org.apache.pig.impl.util.IdentityHashSet;
> >     - import org.apache.pig.pen.util.LineageTracer;

Sure. Intially didn't want to remove these leftovers from PIG-1712 


> On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
> > /trunk/test/org/apache/pig/test/TestAssert.java, lines 91-94
> > <https://reviews.apache.org/r/16507/diff/1/?file=404127#file404127line91>
> >
> >     Does this else block ever get executed given the we're running the test 
> > with opt.fetch on?
> >     
> >     I think you can do either-
> >     
> >     1) explicitly set opt.fetch to true or false in setup(),
> >     
> >     or
> >     
> >     2) change the test to run the query twice with opt.fetch on and off to 
> > ensure we're not breaking anything when opt.fetch is off.

Not really. I chose the second option


> On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
> > /trunk/test/org/apache/pig/test/TestPigRunner.java, line 174
> > <https://reviews.apache.org/r/16507/diff/1/?file=404130#file404130line174>
> >
> >     Why is this changed? I think the default value for opt.multiquery is 
> > true.

I accidentally changed it


- Lorand


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16507/#review30936
-----------------------------------------------------------


On Dec. 29, 2013, 11:19 p.m., Lorand Bendig wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16507/
> -----------------------------------------------------------
> 
> (Updated Dec. 29, 2013, 11:19 p.m.)
> 
> 
> Review request for pig.
> 
> 
> Bugs: PIG-3642
>     https://issues.apache.org/jira/browse/PIG-3642
> 
> 
> Repository: pig
> 
> 
> Description
> -------
> 
> With this patch I'd like to add the possibility to directly read data from 
> HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive 
> already has this feature (fetch). This patch shares some similarities with 
> the local mode of Pig 0.6. Here, fetching kicks off when the following holds 
> for a script:
> 
>     it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, 
> (nested) FOREACH with expression operators, custom UDFs..etc
>     no scalar aliases
>     no SampleLoader
>     single leaf job
>     DUMP (no STORE)
> 
> The feature is enabled by default and can be toggled with:
> 
>     -N or -no_fetch
>     set opt.fetch true/false;
> 
> There's no STORE support because I wanted to make it explicit that this 
> "optimization" is for launching small/simple scripts during development, 
> rather than querying and filtering large number of rows on the client 
> machine. However, a threshold could be given on the input size (an 
> estimation) to determine whether to prefer fetch over MR jobs, similar to 
> what Hive's 'hive.fetch.task.conversion.threshold' does. (through Pig's 
> LoadMetadata#getStatistic ?)
> 
> 
> Diffs
> -----
> 
>   
> /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
>  1553596 
>   
> /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java
>  1553596 
>   /trunk/src/org/apache/pig/Main.java 1553596 
>   /trunk/src/org/apache/pig/PigServer.java 1553596 
>   
> /trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java
>  1553596 
>   
> /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java
>  PRE-CREATION 
>   
> /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java
>  PRE-CREATION 
>   
> /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java
>  PRE-CREATION 
>   
> /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java
>  PRE-CREATION 
>   
> /trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
>  1553596 
>   
> /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
>  1553596 
>   
> /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java
>  1553596 
>   
> /trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 
> 1553596 
>   /trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1553596 
>   
> /trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java
>  1553596 
>   /trunk/src/org/apache/pig/tools/pigstats/SimpleFetchPigStats.java 
> PRE-CREATION 
>   /trunk/test/org/apache/pig/test/TestAssert.java 1553596 
>   /trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1553596 
>   /trunk/test/org/apache/pig/test/TestFetch.java PRE-CREATION 
>   /trunk/test/org/apache/pig/test/TestPigRunner.java 1553596 
> 
> Diff: https://reviews.apache.org/r/16507/diff/
> 
> 
> Testing
> -------
> 
> - new testcase added:  TestFetch
> - the patch was checked against test-commit and test-core
> - Because opt.fetch is set by default, the testcases were using fetch instead 
> of MR jobs wherever it was possible
> 
> 
> Thanks,
> 
> Lorand Bendig
> 
>

Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)

Reply via email to