Re: Review Request: PIG-2831: MR-Cube implementation (Distributed cubing for holistic measures)

Dmitriy Ryaboy Fri, 17 Aug 2012 13:44:00 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6651/#review10473
-----------------------------------------------------------



Publishing the first set of comments (got through to MapReduceOper). More later.


src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
<https://reviews.apache.org/r/6651/#comment22332>

    does this break if we have 2 cube operators? should we add some identifier 
to the filename?



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
<https://reviews.apache.org/r/6651/#comment22333>

    no, let's rely on the general estimator framework rather than 
special-casing. Having separate chunks of code dealing with this through the 
codebase will lead to confusion.  Now that we allow passing a full plan into 
the reducer estimator, it will be possible to write an estimator that walks the 
plan and uses factors based on knowledge of inputs and operators that work on 
the input to adjust the naive estimation; we should encode any cube 
operator-specific logic into such an estimator.
    
    I think you can go ahead and remove this comment block.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
<https://reviews.apache.org/r/6651/#comment22334>

    kill dead code.
    
    let's note that the minimal number of reducers should be the max partition 
value, and how to get it, in POCube documentation. That way, whoever implements 
an improved estimator can follow this advice when they encounter POCube.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22351>

    you are adding a lot of code to this class. At 2800 lines, it's already too 
big a piece of code.
    
    perhaps some of it can be factored out into CubeCompiler or something 
similar? 



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22335>

    break this code out into its own function.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22337>

    can you add a comment describing what the plan looks like before and after 
this is applied?



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22336>

    COUNT.class.getName(), COUNT_STAR.class.getName() 
    
    



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22338>

    The trick of turning COUNTs into SUMs of COUNTS is already implicitly 
encoded in COUNT's implementation of Algebraic. Could we use the algebraic 
interface to do this generically, rather than hardcoding individual functions 
in the compiler?



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22339>

    why 0? "input" might be projecting something else, right? Can you reuse 
"input"?



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22340>

    @Override



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22341>

    drop "== true", it's already a boolean.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22342>

    as discussed, getting the total number of records is problematic.
    
    What if we write a sampler that does something along the lines of opening 
all input splits and reading from each until 2M are reached or input is 
exhausted?
    
    This doesn't let us use the 100K for 2m-2b heuristic, but it's something.
    
    this is also where we can look at LoadFunc implementations and see if we 
can just get an estimated total # of records from it -- if we can, we don't 
have to do the above compromise, and can apply all the heuristics described by 
Arnab.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22344>

    go ahead and write a test.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22346>

    don't compare booleans to true. 
    
    just do
    
    if (className.equals(...)) {
    
    }



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22347>

    same



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22349>

    How come? That sounds a little scary. Let's understand this.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceOper.java
<https://reviews.apache.org/r/6651/#comment22353>

    can these be added to POCube and extracted from within the contained plans 
instead?


- Dmitriy Ryaboy


On Aug. 16, 2012, 10:19 p.m., Prasanth_J wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6651/
> -----------------------------------------------------------
> 
> (Updated Aug. 16, 2012, 10:19 p.m.)
> 
> 
> Review request for pig and Dmitriy Ryaboy.
> 
> 
> Description
> -------
> 
> This is a review board request for 
> https://issues.apache.org/jira/browse/PIG-2831
> 
> 
> This addresses bug PIG-2831.
>     https://issues.apache.org/jira/browse/PIG-2831
> 
> 
> Diffs
> -----
> 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
>  8029dec 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
>  1d05a20 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceOper.java
>  b87c209 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRPrinter.java
>  157caad 
>   
> src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POCast.java
>  cde340c 
>   
> src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java
>  ff65146 
>   
> src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POCube.java
>  PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 
> 0502917 
>   src/org/apache/pig/builtin/CubeDimensions.java 5652029 
>   src/org/apache/pig/builtin/PigStorage.java 21e835f 
>   src/org/apache/pig/builtin/RollupDimensions.java f6c26e4 
>   src/org/apache/pig/impl/builtin/HolisticCube.java PRE-CREATION 
>   src/org/apache/pig/impl/builtin/HolisticCubeCompoundKey.java PRE-CREATION 
>   src/org/apache/pig/impl/builtin/PartitionMaxGroup.java PRE-CREATION 
>   src/org/apache/pig/impl/builtin/PostProcessCube.java PRE-CREATION 
>   src/org/apache/pig/impl/io/ReadSingleLoader.java PRE-CREATION 
>   src/org/apache/pig/impl/util/Utils.java 270cb6a 
>   src/org/apache/pig/newplan/logical/optimizer/LogicalPlanPrinter.java 
> 13439c6 
>   src/org/apache/pig/newplan/logical/relational/LOCube.java b262efb 
>   
> src/org/apache/pig/newplan/logical/relational/LogToPhyTranslationVisitor.java 
> 127ab7a 
>   src/org/apache/pig/newplan/logical/rules/ColumnPruneHelper.java 369f5c2 
>   src/org/apache/pig/parser/LogicalPlanBuilder.java 289a76f 
>   src/org/apache/pig/pen/EquivalenceClasses.java 194f8cb 
>   src/org/apache/pig/pen/LineageTrimmingVisitor.java 917073c 
>   src/org/apache/pig/pen/util/DisplayExamples.java 265f8f7 
>   test/org/apache/pig/impl/builtin/TestHolisticCubeCompundKey.java 
> PRE-CREATION 
>   test/org/apache/pig/impl/builtin/TestPartitionMaxGroup.java PRE-CREATION 
>   test/org/apache/pig/impl/builtin/TestPostProcessCube.java PRE-CREATION 
>   test/org/apache/pig/test/TestCubeOperator.java 65d56a6 
> 
> Diff: https://reviews.apache.org/r/6651/diff/
> 
> 
> Testing
> -------
> 
> Unit tests: All passed
> 
> Pre-commit tests: All passed
> ant clean test-commit
> 
> 
> Thanks,
> 
> Prasanth_J
> 
>

Re: Review Request: PIG-2831: MR-Cube implementation (Distributed cubing for holistic measures)

Reply via email to