Re: Review Request 40743: PIG-4709 Improve performance of GROUPBY operator on Spark

Pallavi Rao Wed, 27 Jan 2016 00:13:57 -0800


> On Jan. 16, 2016, 2 a.m., Mohit Sabharwal wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POForEach.java,
> >  line 139
> > <https://reviews.apache.org/r/40743/diff/3/?file=1156747#file1156747line139>
> >
> >     Let's please not make any changes in core-Pig unless absolutely 
> > necessary. This will make merging harder. Any reason this is unavoidable ?
> >     
> >     Same for removal of getFlatStr() below.


Not absolutely necessary. It just causes duplication of code. I will copy-paste 
the method to the child-class (POReduceBySpark) and will revert changes to core.


> On Jan. 16, 2016, 2 a.m., Mohit Sabharwal wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/spark/SparkLauncher.java, 
> > line 234
> > <https://reviews.apache.org/r/40743/diff/3/?file=1156749#file1156749line234>
> >
> >     LOG.debug

Just keeping it consistent with rest of the explains in the method. Explain 
outputs to System.out


> On Jan. 16, 2016, 2 a.m., Mohit Sabharwal wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/spark/converter/GlobalRearrangeConverter.java,
> >  line 177
> > <https://reviews.apache.org/r/40743/diff/3/?file=1156750#file1156750line177>
> >
> >     any particular reason you're moving this to it's own file ?

Yes. This is being used by 2 classes now (GlobalRearrangeConverter and 
ReduceByConverter). Hence moved it up.


> On Jan. 16, 2016, 2 a.m., Mohit Sabharwal wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/spark/SparkLauncher.java, 
> > line 253
> > <https://reviews.apache.org/r/40743/diff/3/?file=1156749#file1156749line253>
> >
> >     Don't remember if this needs to be earlier. Probably safer to keep this 
> > right after the combiner optimization.

Done. Moved it up.


- Pallavi


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40743/#review111113
-----------------------------------------------------------


On Dec. 18, 2015, 6:47 a.m., Pallavi Rao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/40743/
> -----------------------------------------------------------
> 
> (Updated Dec. 18, 2015, 6:47 a.m.)
> 
> 
> Review request for pig, Mohit Sabharwal and Xuefu Zhang.
> 
> 
> Bugs: PIG-4709
>     https://issues.apache.org/jira/browse/PIG-4709
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> Currently, the GROUPBY operator of PIG is mapped by Spark's CoGroup. When the 
> grouped data is consumed by subsequent operations to perform algebraic 
> operations, this is sub-optimal as there is lot of shuffle traffic.
> The Spark Plan must be optimized to use reduceBy, where possible, so that a 
> combiner is used.
> 
> Introduced a combiner optimizer that does the following:
>     // Checks for algebraic operations and if they exist.
>     // Replaces global rearrange (cogroup) with reduceBy as follows:
>     // Input:
>     // foreach (using algebraicOp)
>     //   -> packager
>     //      -> globalRearrange
>     //          -> localRearrange
>     // Output:
>     // foreach (using algebraicOp.Final)
>     //   -> reduceBy (uses algebraicOp.Intermediate)
>     //      -> foreach (using algebraicOp.Initial)
>     //          -> localRearrange
> 
> 
> Diffs
> -----
> 
>   
> src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POForEach.java
>  f8c1658 
>   
> src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/PORollupHIIForEach.java
>  aca347d 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/SparkLauncher.java 
> a4dbadd 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/GlobalRearrangeConverter.java
>  5f74992 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/LocalRearrangeConverter.java
>  9ce0492 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/PigSecondaryKeyComparatorSpark.java
>  PRE-CREATION 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/ReduceByConverter.java
>  PRE-CREATION 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/operator/POReduceBySpark.java
>  PRE-CREATION 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/optimizer/CombinerOptimizer.java
>  PRE-CREATION 
>   
> src/org/apache/pig/backend/hadoop/executionengine/util/CombinerOptimizerUtil.java
>  6b66ca1 
>   
> src/org/apache/pig/backend/hadoop/executionengine/util/SecondaryKeyOptimizerUtil.java
>  546d91e 
>   test/org/apache/pig/test/TestCombiner.java df44293 
> 
> Diff: https://reviews.apache.org/r/40743/diff/
> 
> 
> Testing
> -------
> 
> The patch unblocked one UT in TestCombiner. Added another UT in the same 
> class. Also did some manual testing.
> 
> 
> Thanks,
> 
> Pallavi Rao
> 
>

Re: Review Request 40743: PIG-4709 Improve performance of GROUPBY operator on Spark

Reply via email to