[
https://issues.apache.org/jira/browse/PIG-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13103483#comment-13103483
]
[email protected] commented on PIG-2228:
----------------------------------------------------
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1817/#review1868
-----------------------------------------------------------
trunk/src/org/apache/pig/Main.java
<https://reviews.apache.org/r/1817/#comment4282>
I would go 1 further and add this property to the default pig.properties,
along with the rest listed here.
trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java
<https://reviews.apache.org/r/1817/#comment4284>
there's a lot of errant whitespace around.. you can set your IDE to clean
that up
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java
<https://reviews.apache.org/r/1817/#comment4285>
Not sure about the value of this comment :)
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4286>
Leaves, technically :)
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4287>
This is an extremely long function. Please refactor into more manageable
chunks. That truly helps others to read and follow the code.
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4288>
It's worth noting how mapDumpIterator be not null, and disableMapAgg be
true (I think this is the case when you auto-disable in-map aggregation having
processed some of the input, correct?)
It might be more readable to just handle the disableMapAgg case at the very
top, and not have to check it afterwards:
if (disableMapAgg) {
if (mapDumpIterator != null) {
// deal with iterator
} else {
return processInput();
}
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4290>
extra newline
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4291>
extra newlines
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4292>
return POStatus.STATUS_EOP would be clearer and not require a comment
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4293>
this should be a separate function
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4294>
what's at valueTuple[0]?
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4296>
catch a class cast exception and rethrow with a meaningful error?
("Intermediate Algebraic functions must implement EvalFunc<Tuple>") .
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4297>
newline
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4298>
is ignoring nulls really the right thing to do? what if my UDF is
"COUNT_NULLS"? Do we need a magical NULL object for this, or does this not work
elsewhere already and we don't need to worry about supporting it?
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4299>
shouldn't this be output.returnStatus?
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4300>
extra newline
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4302>
kinda crammed in here :)
if (aggMap.size() >= maxHashMapSize) {
aaah room to breathe
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4301>
this is quite trivial using LinkedHashMap initialized with the three-arg
constructor; a one-line change.
Worth a benchmark run.
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4303>
} else {
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4304>
Hortonworks must buy you guys really large screens, I am jealous.
I'll stop mentioning the newlines now, this is probably starting to annoy
you :-).
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4305>
get().getInt("pig.exec.mapPartAgg.minReducton", 0);
please make the property public static or something else that's
referenceable / discoverable
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4306>
default should be a constant at the top of the file
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4307>
catch the classcastexception, rethrow with something more meaningful about
Inermediate EvalFuncs being required to be EvalFunc<Tuple>
trunk/src/org/apache/pig/data/SelfSpillBag.java
<https://reviews.apache.org/r/1817/#comment4310>
nice refactor
trunk/src/org/apache/pig/data/SelfSpillBag.java
<https://reviews.apache.org/r/1817/#comment4311>
pig.cachedbag.memusage should be public static final
trunk/test/org/apache/pig/test/TestPOPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4314>
public static final reference...
also, use setInt
trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java
<https://reviews.apache.org/r/1817/#comment4312>
this should be a public static final from the main body of code that you
are referencing here
setBoolean
trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java
<https://reviews.apache.org/r/1817/#comment4313>
p-s-f
- Dmitriy
On 2011-09-12 23:55:12, Thejas Nair wrote:
bq.
bq. -----------------------------------------------------------
bq. This is an automatically generated e-mail. To reply, visit:
bq. https://reviews.apache.org/r/1817/
bq. -----------------------------------------------------------
bq.
bq. (Updated 2011-09-12 23:55:12)
bq.
bq.
bq. Review request for pig, Daniel Dai and Dmitriy Ryaboy.
bq.
bq.
bq. Summary
bq. -------
bq.
bq. See PIG-2228
bq.
bq.
bq. This addresses bug PIG-2228.
bq. https://issues.apache.org/jira/browse/PIG-2228
bq.
bq.
bq. Diffs
bq. -----
bq.
bq. trunk/src/org/apache/pig/Algebraic.java 1164722
bq. trunk/src/org/apache/pig/Main.java 1164722
bq.
trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java
1164722
bq.
trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java
1164722
bq.
trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PhyPlanSetter.java
1164722
bq.
trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java
1164722
bq.
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java
1164722
bq.
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PlanPrinter.java
1164722
bq.
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java
1164722
bq.
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
PRE-CREATION
bq. trunk/src/org/apache/pig/data/DefaultTuple.java 1164722
bq. trunk/src/org/apache/pig/data/InternalCachedBag.java 1164722
bq. trunk/src/org/apache/pig/data/InternalDistinctBag.java 1164722
bq. trunk/src/org/apache/pig/data/InternalSortedBag.java 1164722
bq. trunk/src/org/apache/pig/data/SelfSpillBag.java PRE-CREATION
bq. trunk/src/org/apache/pig/data/SizeUtil.java PRE-CREATION
bq. trunk/src/org/apache/pig/data/SortedSpillBag.java 1164722
bq. trunk/test/e2e/pig/tests/nightly.conf 1164722
bq. trunk/test/org/apache/pig/test/TestDataBag.java 1164722
bq. trunk/test/org/apache/pig/test/TestPOPartialAgg.java PRE-CREATION
bq. trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java PRE-CREATION
bq. trunk/test/org/apache/pig/test/Util.java 1164722
bq. trunk/test/org/apache/pig/test/utils/GenPhyOp.java 1164722
bq.
bq. Diff: https://reviews.apache.org/r/1817/diff
bq.
bq.
bq. Testing
bq. -------
bq.
bq. test-patch
bq. [exec] -1 overall.
bq. [exec]
bq. [exec] +1 @author. The patch does not contain any @author tags.
bq. [exec]
bq. [exec] +1 tests included. The patch appears to include 21 new or
modified tests.
bq. [exec]
bq. [exec] +1 javadoc. The javadoc tool did not generate any warning
messages.
bq. [exec]
bq. [exec] +1 javac. The applied patch does not increase the total
number of javac compiler warnings.
bq. [exec]
bq. [exec] +1 findbugs. The patch does not introduce any new
Findbugs warnings.
bq. [exec]
bq. [exec] -1 release audit. The applied patch generated 461 release
audit warnings (more than the trunk's current 455 warnings).
bq. release audit failures are because of jdiff changes
bq.
bq. All unit tests pass, new e2e tests added .
bq.
bq.
bq. Thanks,
bq.
bq. Thejas
bq.
bq.
> support partial aggregation in map task
> ---------------------------------------
>
> Key: PIG-2228
> URL: https://issues.apache.org/jira/browse/PIG-2228
> Project: Pig
> Issue Type: Bug
> Reporter: Thejas M Nair
> Assignee: Thejas M Nair
> Fix For: 0.10
>
> Attachments: PIG-2228.1.patch, PIG-2228.2.patch, PIG-2228.3.patch,
> PIG-2228.4.patch, PIG-2228.5.patch
>
>
> h3. Introduction
> Pig does (sort based) partial aggregation in map side through the use of
> combiner. MR serializes the output of map to a buffer, sorts it on the keys,
> deserializes and passes the values grouped on the keys to combiner phase. The
> same work of combiner can be done in the map phase itself by using a hash-map
> on the keys. This hash based (partial) aggregation can be done with or
> without a combiner phase.
> h3. Benefits
> It will send fewer records to combiner and thereby -
> * Save on cost of serializing and de-serializing
> * Save on cost of lock calls on the combiner input buffer. (I have found
> this to be a significant cost for a query that was doing multiple group-by's
> in a single MR job. -Thejas)
> * The problem of running out of memory in reduce side, for queries like
> COUNT(distinct col) can be avoided. The OOM issue happens because very large
> records get created after the combiner run on merged reduce input. In case of
> combiner, you have no way of telling MR not to combine records in reduce
> side. The workaround is to disable combiner completely, and the opportunity
> to reduce map output size is lost.
> * When the foreach after group-by has both algebraic and non-algebraic
> functions, or if a bag is being projected, the combiner is not used. This is
> because the data size reduction in typical cases are not significant enough
> to justify the additional (de)serialization costs. But hash based aggregation
> can be used in such cases as well.
> * It is possible to turn off the in-map combine automatically if there is
> not enough 'combination' that is taking place to justify the overhead of the
> in-map combiner. (Idea borrowed from Hive jira.)
> * If input data is sorted, it is possible to do efficient map side
> (partial) aggregation with in-map combiner.
> Design proposal is here -
> https://cwiki.apache.org/confluence/display/PIG/PigInMapCombinerProposal
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira