----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/16717/#review31403 -----------------------------------------------------------
Ship it! I will commit it after running tests. - Cheolsoo Park On Jan. 8, 2014, 1:47 a.m., Alex Bain wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/16717/ > ----------------------------------------------------------- > > (Updated Jan. 8, 2014, 1:47 a.m.) > > > Review request for pig, Cheolsoo Park, Daniel Dai, Mark Wagner, and Rohini > Palaniswamy. > > > Bugs: PIG-3562 > https://issues.apache.org/jira/browse/PIG-3562 > > > Repository: pig-git > > > Description > ------- > > Implement DISTINCT combiner optimizations in Tez > > 1. Use a combiner with normal uses of DISTINCT. In MR Pig, there are some > global variables and a special DistinctCombiner class that throws away the > duplicate tuples. We could hack this into Pig-on-Tez, but instead I just > reused the reduce plan as the combiner plan, which does the same thing > (through a POPackage->POProject->POForEach with the setDistinct property set > to true). > > I'm a little bit concerned that this combiner plan could somehow be slower > than the special DistinctCombiner class, but I don't see how. > > There is also a special CombinerPackager packager that I did NOT use for > this. I think that packager is really intended for use with the algebraic UDF > combiner optimizations only. > > 2. I carefully verified that DISTINCT nested inside a FOREACH code block is > optimized by the CombinerOptimizer into an algebraic UDF version of DISTINCT. > I added TestTezCompiler and e2e tests for this. Cheolsoo already made all the > combiner changes for this to work correctly - I didn't make any code changes > here. > > > Diffs > ----- > > src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java > a7de3a7 > test/e2e/pig/tests/tez.conf 71cdcbc > test/org/apache/pig/test/data/GoldenFiles/TEZC13.gld PRE-CREATION > test/org/apache/pig/test/data/GoldenFiles/TEZC5.gld 35d9313 > test/org/apache/pig/tez/TestTezCompiler.java 2252531 > > Diff: https://reviews.apache.org/r/16717/diff/ > > > Testing > ------- > > Updated golden file for existing TestTezCompiler DISTINCT test to include > combiner plan > Added TestTezCompiler test and golden file for DISTINCT algebraic udf combiner > Added e2e test that runs DISTINCT with algebraic udf combiner > I am getting some test-e2e-tez failures in ORDER BY tests, but I am also > getting these in a clean Tez branch. My new e2e test passes. > > > Thanks, > > Alex Bain > >