Re: Review Request: PIG-2831: MR-Cube implementation (Distributed cubing for holistic measures)

j . prasanth . j Tue, 21 Aug 2012 03:43:54 -0700


> On Aug. 17, 2012, 8:43 p.m., Dmitriy Ryaboy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java,
> >  line 699
> > <https://reviews.apache.org/r/6651/diff/3/?file=141662#file141662line699>
> >
> >     does this break if we have 2 cube operators? should we add some 
> > identifier to the filename?


mro.isFullCube() is true only for holistic cube job. If 2 cube operators are 
used with the following combinations
1) Both cube operators with algebraic measure : No issue
2) One cube operator with algebraic measure and another with holistic measure : 
No issue
3) Two cube operators with holistic measure and both operators have different 
load functions/input aliases : No issue because when dumping/storing the result 
alias its predecessors will have a single path to the load function.
4) Two cube operators with holistic measure and output of 1st cube operator 
being the direct/transformed input of 2nd cube operator: Not an ideal case. 
Will fail in compilation (in MRCompiler). Output of 1st cube operator will not 
be materialized during compilation and so statistics gathering (required while 
compiling 2nd cube operator) from the loader like raw tuple size, overall file 
size, in-memory tuple size etc. will throw IOException.

Can you please let me know if the scenario you thought of 2 cube operators is 
different from these? 


> On Aug. 17, 2012, 8:43 p.m., Dmitriy Ryaboy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java,
> >  line 701
> > <https://reviews.apache.org/r/6651/diff/3/?file=141662#file141662line701>
> >
> >     no, let's rely on the general estimator framework rather than 
> > special-casing. Having separate chunks of code dealing with this through 
> > the codebase will lead to confusion.  Now that we allow passing a full plan 
> > into the reducer estimator, it will be possible to write an estimator that 
> > walks the plan and uses factors based on knowledge of inputs and operators 
> > that work on the input to adjust the naive estimation; we should encode any 
> > cube operator-specific logic into such an estimator.
> >     
> >     I think you can go ahead and remove this comment block.

Dropped the code.


> On Aug. 17, 2012, 8:43 p.m., Dmitriy Ryaboy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java,
> >  line 790
> > <https://reviews.apache.org/r/6651/diff/3/?file=141662#file141662line790>
> >
> >     kill dead code.
> >     
> >     let's note that the minimal number of reducers should be the max 
> > partition value, and how to get it, in POCube documentation. That way, 
> > whoever implements an improved estimator can follow this advice when they 
> > encounter POCube.

Removed all the estimator code.


> On Aug. 17, 2012, 8:43 p.m., Dmitriy Ryaboy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java,
> >  line 51
> > <https://reviews.apache.org/r/6651/diff/3/?file=141663#file141663line51>
> >
> >     you are adding a lot of code to this class. At 2800 lines, it's already 
> > too big a piece of code.
> >     
> >     perhaps some of it can be factored out into CubeCompiler or something 
> > similar?

Refactored.


> On Aug. 17, 2012, 8:43 p.m., Dmitriy Ryaboy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java,
> >  line 1104
> > <https://reviews.apache.org/r/6651/diff/3/?file=141663#file141663line1104>
> >
> >     break this code out into its own function.

Refactored.


> On Aug. 17, 2012, 8:43 p.m., Dmitriy Ryaboy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java,
> >  line 1131
> > <https://reviews.apache.org/r/6651/diff/3/?file=141663#file141663line1131>
> >
> >     can you add a comment describing what the plan looks like before and 
> > after this is applied?

Updated the comment to include abstract plan.


> On Aug. 17, 2012, 8:43 p.m., Dmitriy Ryaboy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java,
> >  line 1190
> > <https://reviews.apache.org/r/6651/diff/3/?file=141663#file141663line1190>
> >
> >     COUNT.class.getName(), COUNT_STAR.class.getName() 
> >     
> >

Updated


> On Aug. 17, 2012, 8:43 p.m., Dmitriy Ryaboy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java,
> >  line 1221
> > <https://reviews.apache.org/r/6651/diff/3/?file=141663#file141663line1221>
> >
> >     why 0? "input" might be projecting something else, right? Can you reuse 
> > "input"?

Good catch. Using the attributes from input.


> On Aug. 17, 2012, 8:43 p.m., Dmitriy Ryaboy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java,
> >  line 1944
> > <https://reviews.apache.org/r/6651/diff/3/?file=141663#file141663line1944>
> >
> >     @Override

Added.


> On Aug. 17, 2012, 8:43 p.m., Dmitriy Ryaboy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java,
> >  line 1948
> > <https://reviews.apache.org/r/6651/diff/3/?file=141663#file141663line1948>
> >
> >     drop "== true", it's already a boolean.

Added this check everywhere just for readability.
Removed everywhere.


> On Aug. 17, 2012, 8:43 p.m., Dmitriy Ryaboy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java,
> >  line 2078
> > <https://reviews.apache.org/r/6651/diff/3/?file=141663#file141663line2078>
> >
> >     as discussed, getting the total number of records is problematic.
> >     
> >     What if we write a sampler that does something along the lines of 
> > opening all input splits and reading from each until 2M are reached or 
> > input is exhausted?
> >     
> >     This doesn't let us use the 100K for 2m-2b heuristic, but it's 
> > something.
> >     
> >     this is also where we can look at LoadFunc implementations and see if 
> > we can just get an estimated total # of records from it -- if we can, we 
> > don't have to do the above compromise, and can apply all the heuristics 
> > described by Arnab.

Working on this. Will update the code once completed. 


> On Aug. 17, 2012, 8:43 p.m., Dmitriy Ryaboy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java,
> >  line 2102
> > <https://reviews.apache.org/r/6651/diff/3/?file=141663#file141663line2102>
> >
> >     go ahead and write a test.

This scenario typically will not occur. I don't see a use case where output of 
1 cube operator is used by another cube operator and both using holistic 
measure. The isHolistic() conditional check at the beginning of visitCube will 
eliminate other cases where cube operator uses algebraic measure. So this 
should work without any issue. Added a testcase as well. 


> On Aug. 17, 2012, 8:43 p.m., Dmitriy Ryaboy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java,
> >  line 2150
> > <https://reviews.apache.org/r/6651/diff/3/?file=141663#file141663line2150>
> >
> >     don't compare booleans to true. 
> >     
> >     just do
> >     
> >     if (className.equals(...)) {
> >     
> >     }

Removed.


> On Aug. 17, 2012, 8:43 p.m., Dmitriy Ryaboy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java,
> >  line 2242
> > <https://reviews.apache.org/r/6651/diff/3/?file=141663#file141663line2242>
> >
> >     How come? That sounds a little scary. Let's understand this.

The issue was with iterating over the map. It is not possible to iterate 
through a map and modify it at the same time. It can be done only using 
ListIterators and not map iterators.  So modified the way of iterating (walking 
from root of the plan to leaf and replacing the current node with new node). 
Fixed.


> On Aug. 17, 2012, 8:43 p.m., Dmitriy Ryaboy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceOper.java,
> >  line 101
> > <https://reviews.apache.org/r/6651/diff/3/?file=141664#file141664line101>
> >
> >     can these be added to POCube and extracted from within the contained 
> > plans instead?

The annotated lattice file is required only for actual full materialization 
MRJob. We need to know the name of the annotated lattice file for it to be 
added to distributed cache. Since adding files to distributed cache happens in 
MRCompiler (where we will not have reference to POCube object), I have added 
these variables in MapReduceOper. 


> On Aug. 17, 2012, 8:43 p.m., Dmitriy Ryaboy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java,
> >  line 1214
> > <https://reviews.apache.org/r/6651/diff/3/?file=141663#file141663line1214>
> >
> >     The trick of turning COUNTs into SUMs of COUNTS is already implicitly 
> > encoded in COUNT's implementation of Algebraic. Could we use the algebraic 
> > interface to do this generically, rather than hardcoding individual 
> > functions in the compiler?

Nice idea! I updated the code to check if the UserFunc implements algebraic, if 
so we can use its UserFunc.Final.class in the UDF for post aggregation. This 
makes it generic but we may lose combiner optimization which is available 
otherwise. 


- Prasanth_J


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6651/#review10473
-----------------------------------------------------------


On Aug. 16, 2012, 10:19 p.m., Prasanth_J wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6651/
> -----------------------------------------------------------
> 
> (Updated Aug. 16, 2012, 10:19 p.m.)
> 
> 
> Review request for pig and Dmitriy Ryaboy.
> 
> 
> Description
> -------
> 
> This is a review board request for 
> https://issues.apache.org/jira/browse/PIG-2831
> 
> 
> This addresses bug PIG-2831.
>     https://issues.apache.org/jira/browse/PIG-2831
> 
> 
> Diffs
> -----
> 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
>  8029dec 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
>  1d05a20 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceOper.java
>  b87c209 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRPrinter.java
>  157caad 
>   
> src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POCast.java
>  cde340c 
>   
> src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java
>  ff65146 
>   
> src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POCube.java
>  PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 
> 0502917 
>   src/org/apache/pig/builtin/CubeDimensions.java 5652029 
>   src/org/apache/pig/builtin/PigStorage.java 21e835f 
>   src/org/apache/pig/builtin/RollupDimensions.java f6c26e4 
>   src/org/apache/pig/impl/builtin/HolisticCube.java PRE-CREATION 
>   src/org/apache/pig/impl/builtin/HolisticCubeCompoundKey.java PRE-CREATION 
>   src/org/apache/pig/impl/builtin/PartitionMaxGroup.java PRE-CREATION 
>   src/org/apache/pig/impl/builtin/PostProcessCube.java PRE-CREATION 
>   src/org/apache/pig/impl/io/ReadSingleLoader.java PRE-CREATION 
>   src/org/apache/pig/impl/util/Utils.java 270cb6a 
>   src/org/apache/pig/newplan/logical/optimizer/LogicalPlanPrinter.java 
> 13439c6 
>   src/org/apache/pig/newplan/logical/relational/LOCube.java b262efb 
>   
> src/org/apache/pig/newplan/logical/relational/LogToPhyTranslationVisitor.java 
> 127ab7a 
>   src/org/apache/pig/newplan/logical/rules/ColumnPruneHelper.java 369f5c2 
>   src/org/apache/pig/parser/LogicalPlanBuilder.java 289a76f 
>   src/org/apache/pig/pen/EquivalenceClasses.java 194f8cb 
>   src/org/apache/pig/pen/LineageTrimmingVisitor.java 917073c 
>   src/org/apache/pig/pen/util/DisplayExamples.java 265f8f7 
>   test/org/apache/pig/impl/builtin/TestHolisticCubeCompundKey.java 
> PRE-CREATION 
>   test/org/apache/pig/impl/builtin/TestPartitionMaxGroup.java PRE-CREATION 
>   test/org/apache/pig/impl/builtin/TestPostProcessCube.java PRE-CREATION 
>   test/org/apache/pig/test/TestCubeOperator.java 65d56a6 
> 
> Diff: https://reviews.apache.org/r/6651/diff/
> 
> 
> Testing
> -------
> 
> Unit tests: All passed
> 
> Pre-commit tests: All passed
> ant clean test-commit
> 
> 
> Thanks,
> 
> Prasanth_J
> 
>

Re: Review Request: PIG-2831: MR-Cube implementation (Distributed cubing for holistic measures)

Reply via email to