> On Jan. 3, 2014, 1:18 a.m., Cheolsoo Park wrote: > > I have one last comment below. Other than that, everything looks good. > > > > Also, can you document this? It think it's worth to mention in the > > "Performance and Efficiency" section in the manual. You can post a doc > > patch in a separate jira if you'd like.
Great, thank you. Sure, I'll create a separate jira for this. > On Jan. 3, 2014, 1:18 a.m., Cheolsoo Park wrote: > > /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java, > > line 178 > > <https://reviews.apache.org/r/16507/diff/2/?file=413423#file413423line178> > > > > This won't work if the temporary file storage is not InterStorage. It > > can be one of Inter, TFile, and SequenceFile storages. > > > > See here- > > > > https://github.com/apache/pig/blob/trunk/src/org/apache/pig/impl/util/Utils.java#L347 > > Great observation! I wasn't aware of this. Now I the verification is done in two steps: - check whether the storage class is the same as the configured one - check if the target path equals to the temporary path (it's unlikely that someone will use the generated temp. path of the current session) What do you think? - Lorand ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/16507/#review31099 ----------------------------------------------------------- On Jan. 3, 2014, 10:57 p.m., Lorand Bendig wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/16507/ > ----------------------------------------------------------- > > (Updated Jan. 3, 2014, 10:57 p.m.) > > > Review request for pig. > > > Bugs: PIG-3642 > https://issues.apache.org/jira/browse/PIG-3642 > > > Repository: pig > > > Description > ------- > > With this patch I'd like to add the possibility to directly read data from > HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive > already has this feature (fetch). This patch shares some similarities with > the local mode of Pig 0.6. Here, fetching kicks off when the following holds > for a script: > > it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, > (nested) FOREACH with expression operators, custom UDFs..etc > no scalar aliases > no SampleLoader > single leaf job > DUMP (no STORE) > > The feature is enabled by default and can be toggled with: > > -N or -no_fetch > set opt.fetch true/false; > > There's no STORE support because I wanted to make it explicit that this > "optimization" is for launching small/simple scripts during development, > rather than querying and filtering large number of rows on the client > machine. However, a threshold could be given on the input size (an > estimation) to determine whether to prefer fetch over MR jobs, similar to > what Hive's 'hive.fetch.task.conversion.threshold' does. (through Pig's > LoadMetadata#getStatistic ?) > > > Diffs > ----- > > > /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java > 1555255 > > /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java > 1555255 > /trunk/src/org/apache/pig/Main.java 1555255 > /trunk/src/org/apache/pig/PigConfiguration.java 1555255 > /trunk/src/org/apache/pig/PigServer.java 1555255 > > /trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java > 1555255 > > /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java > PRE-CREATION > > /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java > PRE-CREATION > > /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java > PRE-CREATION > > /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java > PRE-CREATION > > /trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java > 1555255 > > /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java > 1555255 > > /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java > 1555255 > > /trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java > 1555255 > /trunk/src/org/apache/pig/impl/io/FileLocalizer.java 1555255 > /trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1555255 > /trunk/src/org/apache/pig/impl/util/Utils.java 1555255 > > /trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java > 1555255 > /trunk/src/org/apache/pig/tools/pigstats/SimpleFetchPigStats.java > PRE-CREATION > /trunk/test/org/apache/pig/test/TestAssert.java 1555255 > /trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1555255 > /trunk/test/org/apache/pig/test/TestFetch.java PRE-CREATION > /trunk/test/org/apache/pig/test/TestPigRunner.java 1555255 > > Diff: https://reviews.apache.org/r/16507/diff/ > > > Testing > ------- > > - new testcase added: TestFetch > - the patch was checked against test-commit and test-core > - Because opt.fetch is set by default, the testcases were using fetch instead > of MR jobs wherever it was possible > > > Thanks, > > Lorand Bendig > >