[
https://issues.apache.org/jira/browse/PIG-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424428#comment-13424428
]
Jie Li commented on PIG-2829:
-----------------------------
Generated two 100G TPC-H datasets, with reduction rates of 2 and 3
respectively. For each dataset, ran two queries: a group-by with 8
aggregations and a group-by with 1 aggregation:
||query||combiner off, partial-agg off||combiner off, partial-agg on||
|g-by with reduction by 3 and 8 aggregations|47m59s|47m46s|
|g-by with reduction by 2 and 8 aggregations|48m39s|57m3s|
|g-by with reduction by 3 and 1 aggregation|23m37s|20m52s|
|g-by with reduction by 2 and 1 aggregation|24m11s|24m36s|
From the results we can see that the minimum reduction rate for partial-agg is
not trivial to decide: it depends on the cost of performing the reduction
(number of aggregations, cost of each aggregation, etc.) and the cost of
transferring the data (the amount of data to transfer, the network traffic,
etc.). It's like compression: the performance is a trade-off between CPU and
I/O, and is application-dependent. For the default value, 3 will give a more
significant improvement, while 2 will save more network traffic. Any comments?
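
To make the comparison easier to read, the timings in the table can be turned into relative changes. The following is a minimal Python sketch, not part of the original benchmark; the helper names `to_seconds` and `relative_change` are hypothetical, and the only data used is the four rows reported above.

```python
def to_seconds(duration: str) -> int:
    """Convert an 'XmYs' duration such as '47m59s' to seconds."""
    minutes, seconds = duration.rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)


def relative_change(off: str, on: str) -> float:
    """Percent change in run time with partial-agg on (negative means faster)."""
    base = to_seconds(off)
    return (to_seconds(on) - base) / base * 100.0


# (reduction rate, number of aggregations) -> (partial-agg off, partial-agg on),
# copied from the table in this comment.
RESULTS = {
    (3, 8): ("47m59s", "47m46s"),
    (2, 8): ("48m39s", "57m3s"),
    (3, 1): ("23m37s", "20m52s"),
    (2, 1): ("24m11s", "24m36s"),
}

for (rate, aggs), (off, on) in RESULTS.items():
    change = relative_change(off, on)
    print(f"reduction {rate}, {aggs} aggregation(s): {change:+.1f}%")
```

With these numbers, partial-agg is about 12% faster for reduction 3 with one aggregation and about 17% slower for reduction 2 with eight aggregations; the other two cases change by less than 2%.
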
> Use partial aggregation more aggressively
> -----------------------------------------
>
> Key: PIG-2829
> URL: https://issues.apache.org/jira/browse/PIG-2829
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.10.0
> Reporter: Jie Li
> Attachments: 2829.1.patch, 2829.2.patch, 2829.separate.options.patch,
> pigmix-10G.png, tpch-10G.png
>
>
> Partial aggregation (hash aggregation, a.k.a. in-map combiner) is a new
> feature in Pig 0.10 that performs aggregation within the map function. Its
> main advantage over the combiner is that it avoids serializing,
> deserializing, and sorting the data, and it can automatically disable itself
> if the data reduction rate is low. Currently it's disabled by default.
> To leverage the power of PartialAgg more aggressively, several things need to
> be revisited:
> 1. The threshold for auto-disabling. Currently each mapper looks at the first
> 1k (hard-coded) records to see if there's enough data size reduction
> (defaults to 10x, configurable). The check happens earlier if the hash table
> gets full before the 1k records are processed (hash table size is controlled
> by pig.cachedbag.memusage). We might want to relax these thresholds.
> 2. Dependency on the combiner. Currently PartialAgg won't work without a
> combiner following it, so we need to provide separate options to enable each
> independently.
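
The auto-disabling check in item 1 can be illustrated with a small Python sketch. This is not Pig's implementation: the function name `partial_agg_sum` and its parameters are hypothetical, it measures reduction as input records per hash-table entry rather than data size, it only sums values, and it omits the earlier check that is triggered when the hash table fills up.

```python
from collections import defaultdict


def partial_agg_sum(records, check_after=1000, min_reduction=10.0):
    """Sum values per key in a map-side hash table, with an auto-disable check.

    records: iterable of (key, value) pairs.
    Returns (emitted, enabled): the pairs handed to the next stage, and
    whether aggregation was still enabled at the end of the input.
    """
    table = defaultdict(int)
    emitted = []
    enabled = True
    seen = 0
    for key, value in records:
        if not enabled:
            emitted.append((key, value))  # pass through unaggregated
            continue
        table[key] += value
        seen += 1
        if seen == check_after:
            # Assumed metric: input records per distinct key seen so far.
            if seen / len(table) < min_reduction:
                # Too little reduction: flush the table and stop aggregating.
                emitted.extend(table.items())
                table.clear()
                enabled = False
    emitted.extend(table.items())  # flush whatever is left at end of input
    return emitted, enabled
```

In this sketch, an input with 10 distinct keys stays aggregated (1000 records collapse to 10 entries, a 100x reduction), while an input with 500 distinct keys in the first 1000 records (2x) flips the mapper to pass-through mode. Relaxing the thresholds corresponds to lowering `min_reduction` or changing `check_after`.
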