[ https://issues.apache.org/jira/browse/PIG-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jie Li updated PIG-2829:
------------------------
Attachment: tpch-10G.png
            pigmix-10G.png
Attached some benchmark results of a group-by aggregation on different datasets
(TPC-H and Pigmix) with different selectivity, and with the combiner and
PartialAgg each turned on or off.
'none' means using neither the combiner nor PartialAgg. 'combiner' means using
only the combiner. 'hash' means using only PartialAgg (with some hack).
'hash+combiner' means enabling PartialAgg as it works today, followed by the
combiner. For the latter two we configured the minimum reduction to 1 so that
PartialAgg is never auto-disabled (otherwise it would currently be auto-disabled
in all cases). For TPC-H we also ran Hive with default settings for reference.
The title above each chart shows the number of input records and output records
of the query; for example, "60M -> 4 reduction" means there are 60 million
input records and four output records. For both datasets we used 10GB of data
and ran on a single machine, which is acceptable here because we are comparing
PartialAgg with the combiner, so the network doesn't matter much.
From the results we can observe:
1) PartialAgg is more efficient than the combiner, which is as expected and
should be leveraged;
2) the combiner is unnecessary when PartialAgg is used;
3) the PartialAgg/combiner overhead can be significant if the data reduction
rate is low.
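To make the 'hash' configuration and the "N -> M reduction" chart titles concrete, here is a minimal Python sketch of map-side hash aggregation. It is not Pig's implementation: the function names `map_with_partial_agg` and `reduction_factor` and the sample records are hypothetical, and a SUM aggregate is assumed.

```python
from collections import defaultdict


def map_with_partial_agg(records):
    """Sum values per key in an in-memory hash table inside the map task.

    Hypothetical sketch: the records are never serialized or sorted before
    being aggregated, which is the saving compared with a combiner.
    """
    table = defaultdict(int)
    for key, value in records:
        table[key] += value
    # One output record per distinct key instead of one per input record.
    return list(table.items())


def reduction_factor(n_input, n_output):
    """Input records divided by output records, e.g. 60M / 4 for "60M -> 4"."""
    return n_input / n_output


records = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5), ("c", 6)]
partial = map_with_partial_agg(records)
factor = reduction_factor(len(records), len(partial))
```

With few distinct keys the table stays small and almost every input record is absorbed; with nearly unique keys the table only adds overhead, which is the situation described in observation 3.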
> Use partial aggregation more aggressively
> -----------------------------------------
>
> Key: PIG-2829
> URL: https://issues.apache.org/jira/browse/PIG-2829
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.10.0
> Reporter: Jie Li
> Attachments: pigmix-10G.png, tpch-10G.png
>
>
> Partial aggregation (Hash Aggregation, aka in-map combiner) is a new feature
> in Pig 0.10 that performs aggregation within the map function. Its main
> advantage over the combiner is that it avoids de/serializing and sorting the
> data, and it can disable itself automatically if the data reduction rate is
> low. Currently it's disabled by default.
> To leverage the power of PartialAgg more aggressively, several things need to
> be revisited:
> 1. The threshold for auto-disabling. Currently each mapper looks at the first
> 1k (hard-coded) records to see if there's enough data size reduction (defaults
> to 10x, configurable). The check happens earlier if the hash table fills up
> before 1k records have been processed (hash table size is controlled by
> pig.cachedbag.memusage). We might want to relax these thresholds.
> 2. Dependency on the combiner. Currently PartialAgg won't work without a
> combiner following it, so we need to provide separate options to enable each
> independently.
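The auto-disable check described in item 1 can be sketched roughly as follows. This is a simplified illustration, not Pig's code: the function and parameter names are hypothetical, the reduction is measured in record counts rather than data size, the hash table limit is a plain entry count rather than a memory fraction, and spilling of a full table is not modeled.

```python
def partial_agg_with_auto_disable(records, check_after=1000,
                                  min_reduction=10.0, max_table_entries=None):
    """Map-side SUM aggregation that turns itself off when it does not pay.

    Returns (output_records, still_enabled). All names are assumptions made
    for this sketch.
    """
    table = {}
    output = []
    seen = 0
    enabled = True
    checked = False
    for key, value in records:
        if not enabled:
            # Aggregation was disabled: pass records through unchanged.
            output.append((key, value))
            continue
        table[key] = table.get(key, 0) + value
        seen += 1
        table_full = (max_table_entries is not None
                      and len(table) >= max_table_entries)
        # Check once, after check_after records or when the table is full.
        if not checked and (seen >= check_after or table_full):
            checked = True
            if seen / len(table) < min_reduction:
                enabled = False
                output.extend(table.items())  # flush what was aggregated
                table.clear()
    output.extend(table.items())
    return output, enabled


low_card = [(i % 4, 1) for i in range(2000)]   # 4 distinct keys
high_card = [(i, 1) for i in range(2000)]      # every key distinct
out_low, on_low = partial_agg_with_auto_disable(low_card)
out_high, on_high = partial_agg_with_auto_disable(high_card)
out_forced, on_forced = partial_agg_with_auto_disable(high_card, min_reduction=1.0)
```

In this sketch, setting `min_reduction` to 1 keeps aggregation enabled even when every key is distinct, which mirrors how the benchmark runs prevented auto-disabling.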