-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16717/
-----------------------------------------------------------

Review request for pig, Cheolsoo Park, Daniel Dai, Mark Wagner, and Rohini 
Palaniswamy.


Bugs: PIG-3562
    https://issues.apache.org/jira/browse/PIG-3562


Repository: pig-git


Description
-------

Implement DISTINCT combiner optimizations in Tez

1. Use a combiner with normal uses of DISTINCT. In MR Pig, there are some 
global variables and a special DistinctCombiner class that throws away the 
duplicate tuples. We could hack this into Pig-on-Tez, but instead I just reused 
the reduce plan as the combiner plan, which does the same thing (through a 
POPackage->POProject->POForEach with the setDistinct property set to true).

I'm a little bit concerned that this combiner plan could somehow be slower than 
the special DistinctCombiner class, but I don't see how.

There is also a special CombinerPackager packager that I did NOT use for this. 
I think that packager is really intended for use with the algebraic UDF 
combiner optimizations only.

2. I carefully verified that DISTINCT nested inside a FOREACH code block is 
optimized by the CombinerOptimizer into an algebraic UDF version of DISTINCT. I 
added TestTezCompiler and e2e tests for this. Cheolsoo already made all the 
combiner changes for this to work correctly - I didn't make any code changes 
here.


Diffs
-----

  src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 
a7de3a7 
  test/e2e/pig/tests/tez.conf 71cdcbc 
  test/org/apache/pig/test/data/GoldenFiles/TEZC13.gld PRE-CREATION 
  test/org/apache/pig/test/data/GoldenFiles/TEZC5.gld 35d9313 
  test/org/apache/pig/tez/TestTezCompiler.java 2252531 

Diff: https://reviews.apache.org/r/16717/diff/


Testing
-------

Updated golden file for existing TestTezCompiler DISTINCT test to include 
combiner plan
Added TestTezCompiler test and golden file for DISTINCT algebraic udf combiner
Added e2e test that runs DISTINCT with algebraic udf combiner
I am getting some test-e2e-tez failures in ORDER BY tests, but I am also 
getting these in a clean Tez branch. My new e2e test passes.


Thanks,

Alex Bain

Reply via email to