Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)

Lorand Bendig Fri, 03 Jan 2014 15:01:16 -0800


> On Jan. 3, 2014, 1:18 a.m., Cheolsoo Park wrote:
> > I have one last comment below. Other than that, everything looks good.
> > 
> > Also, can you document this? It think it's worth to mention in the 
> > "Performance and Efficiency" section in the manual. You can post a doc 
> > patch in a separate jira if you'd like.


Great, thank you. Sure, I'll create a separate jira for this.


> On Jan. 3, 2014, 1:18 a.m., Cheolsoo Park wrote:
> > /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java,
> >  line 178
> > <https://reviews.apache.org/r/16507/diff/2/?file=413423#file413423line178>
> >
> >     This won't work if the temporary file storage is not InterStorage. It 
> > can be one of Inter, TFile, and SequenceFile storages.
> >     
> >     See here-
> >     
> > https://github.com/apache/pig/blob/trunk/src/org/apache/pig/impl/util/Utils.java#L347
> >

Great observation! I wasn't aware of this. Now I the verification is done in 
two steps:
- check whether the storage class is the same as the configured one
- check if the target path equals to the temporary path (it's unlikely that 
someone will use the generated temp.
  path of the current session) 
What do you think?


- Lorand


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16507/#review31099
-----------------------------------------------------------


On Jan. 3, 2014, 10:57 p.m., Lorand Bendig wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16507/
> -----------------------------------------------------------
> 
> (Updated Jan. 3, 2014, 10:57 p.m.)
> 
> 
> Review request for pig.
> 
> 
> Bugs: PIG-3642
>     https://issues.apache.org/jira/browse/PIG-3642
> 
> 
> Repository: pig
> 
> 
> Description
> -------
> 
> With this patch I'd like to add the possibility to directly read data from 
> HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive 
> already has this feature (fetch). This patch shares some similarities with 
> the local mode of Pig 0.6. Here, fetching kicks off when the following holds 
> for a script:
> 
>     it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, 
> (nested) FOREACH with expression operators, custom UDFs..etc
>     no scalar aliases
>     no SampleLoader
>     single leaf job
>     DUMP (no STORE)
> 
> The feature is enabled by default and can be toggled with:
> 
>     -N or -no_fetch
>     set opt.fetch true/false;
> 
> There's no STORE support because I wanted to make it explicit that this 
> "optimization" is for launching small/simple scripts during development, 
> rather than querying and filtering large number of rows on the client 
> machine. However, a threshold could be given on the input size (an 
> estimation) to determine whether to prefer fetch over MR jobs, similar to 
> what Hive's 'hive.fetch.task.conversion.threshold' does. (through Pig's 
> LoadMetadata#getStatistic ?)
> 
> 
> Diffs
> -----
> 
>   
> /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
>  1555255 
>   
> /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java
>  1555255 
>   /trunk/src/org/apache/pig/Main.java 1555255 
>   /trunk/src/org/apache/pig/PigConfiguration.java 1555255 
>   /trunk/src/org/apache/pig/PigServer.java 1555255 
>   
> /trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java
>  1555255 
>   
> /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java
>  PRE-CREATION 
>   
> /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java
>  PRE-CREATION 
>   
> /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java
>  PRE-CREATION 
>   
> /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java
>  PRE-CREATION 
>   
> /trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
>  1555255 
>   
> /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
>  1555255 
>   
> /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java
>  1555255 
>   
> /trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 
> 1555255 
>   /trunk/src/org/apache/pig/impl/io/FileLocalizer.java 1555255 
>   /trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1555255 
>   /trunk/src/org/apache/pig/impl/util/Utils.java 1555255 
>   
> /trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java
>  1555255 
>   /trunk/src/org/apache/pig/tools/pigstats/SimpleFetchPigStats.java 
> PRE-CREATION 
>   /trunk/test/org/apache/pig/test/TestAssert.java 1555255 
>   /trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1555255 
>   /trunk/test/org/apache/pig/test/TestFetch.java PRE-CREATION 
>   /trunk/test/org/apache/pig/test/TestPigRunner.java 1555255 
> 
> Diff: https://reviews.apache.org/r/16507/diff/
> 
> 
> Testing
> -------
> 
> - new testcase added:  TestFetch
> - the patch was checked against test-commit and test-core
> - Because opt.fetch is set by default, the testcases were using fetch instead 
> of MR jobs wherever it was possible
> 
> 
> Thanks,
> 
> Lorand Bendig
> 
>

Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)

Reply via email to