[ https://issues.apache.org/jira/browse/PIG-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Li updated PIG-2829:
------------------------

    Attachment: 2829.2.patch

Updated the patch with unit test fixes and new unit tests verifying default 
configurations.

Below are the benchmark results on a 4-slave cluster with 100GB of TPC-H 
data. TPC-H Query 1 and some synthetic queries were used. Each query uses 
300 map tasks and 79 reduce tasks, and each map task processes 2 million 
records:

|| query || trunk || patch || comment ||
| TPCH Q1 | 58 min | 34 min | Q1's group-by has four different keys and eight aggregations. |
| S-600x | 35 min | 30 min | The reduction rate of output/input records is 600. |
| S-4x | 31 min | 21 min | The reduction rate of output/input records is 4. |
| S-1x | 59 min | 44 min | The reduction rate of output/input records is 1. Every group-by key is different. |
| S-high memory | map task 5 min ~ 6 min | map task 2 min ~ 3 min | Reduction rate is 1 (no reduction). 16 aggregations in the same group. |

In every query tested, the new default settings in this patch perform 
better than the old default settings in trunk.
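
For reference, the relative improvements implied by the table above can be 
computed directly from the reported times. The snippet below is only an 
illustrative calculation and is not part of the patch. It uses the four 
whole-job timings from the table; the S-high memory row reports per-map-task 
ranges and is left out.

```python
# Job running times in minutes (trunk, patch), copied from the table above.
timings = {
    "TPCH Q1": (58, 34),
    "S-600x": (35, 30),
    "S-4x": (31, 21),
    "S-1x": (59, 44),
}


def improvement(trunk_min, patch_min):
    """Return (speedup factor, percent of running time saved)."""
    speedup = trunk_min / patch_min
    saved_pct = 100.0 * (trunk_min - patch_min) / trunk_min
    return speedup, saved_pct


results = {query: improvement(t, p) for query, (t, p) in timings.items()}

for query, (speedup, saved_pct) in results.items():
    print(f"{query}: {speedup:.2f}x faster, {saved_pct:.1f}% less time")

# Each query runs 300 map tasks, each processing 2 million records.
records_per_query = 300 * 2_000_000
print(f"records per query: {records_per_query:,}")
```

By this calculation the saving ranges from roughly 14% of running time 
(S-600x) to roughly 41% (TPCH Q1).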

I also tested the effect of disabling MapAgg later rather than earlier by 
varying the check interval, using the query S-1x (no reduction). There is 
almost no difference:
|| pig.exec.mapPartAgg.reduction.checkinterval || job running time ||
| 1000 | 43 min 54 sec |
| 100000 | 43 min 46 sec |
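
To put "almost no difference" in numbers, the two running times above can be 
compared directly. This is just arithmetic on the reported values, not an 
additional measurement.

```python
def to_seconds(minutes, seconds):
    """Convert a 'M min S sec' running time to seconds."""
    return minutes * 60 + seconds


# Running times reported above for the two check intervals.
interval_1000 = to_seconds(43, 54)
interval_100000 = to_seconds(43, 46)

diff_seconds = interval_1000 - interval_100000
relative_pct = 100.0 * diff_seconds / interval_1000

print(f"difference: {diff_seconds} s ({relative_pct:.2f}% of the run)")
```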


                
> Use partial aggregation more aggressively
> -----------------------------------------
>
>                 Key: PIG-2829
>                 URL: https://issues.apache.org/jira/browse/PIG-2829
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.10.0
>            Reporter: Jie Li
>         Attachments: 2829.1.patch, 2829.2.patch, 2829.separate.options.patch, 
> pigmix-10G.png, tpch-10G.png
>
>
> Partial aggregation (Hash Aggregation, aka in-map combiner) is a new feature 
> in Pig 0.10 that performs aggregation within the map function. Its main 
> advantage over the combiner is that it avoids serializing, deserializing, 
> and sorting the data, and it can automatically disable itself if the data 
> reduction rate is low. It is currently disabled by default.
> To leverage the power of PartialAgg more aggressively, several things need to 
> be revisited:
> 1. The threshold for auto-disabling. Currently each mapper looks at the 
> first 1k (hard-coded) records to see if there's enough data size reduction 
> (defaults to 10x, configurable). The check happens earlier if the hash 
> table gets full before 1k records have been processed (hash table size is 
> controlled by pig.cachedbag.memusage). We might want to relax these 
> thresholds.
> 2. Dependency on the combiner. Currently PartialAgg won't work without a 
> combiner following it, so we need to provide separate options to enable each 
> independently.


        
