[jira] [Commented] (PIG-2829) Use partial aggregation more aggresively

2012-09-05 Thread Jie Li (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13449424#comment-13449424
 ] 

Jie Li commented on PIG-2829:
-

No problem Dmitriy. I'll see if I can find some time over the weekend.

> Use partial aggregation more aggresively
> 
>
> Key: PIG-2829
> URL: https://issues.apache.org/jira/browse/PIG-2829
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.10.0
>Reporter: Jie Li
> Attachments: 2829.1.patch, 2829.2.patch, 2829.separate.options.patch, 
> pigmix-10G.png, tpch-10G.png
>
>
> Partial aggregation (Hash Aggregation, aka in-map combiner) is a new feature 
> in Pig 0.10 that will perform aggregation within map function. The main 
> advantage against combiner is it avoids de/serializing and sorting the data, 
> and it can auto disable itself if the data reduction rate is low. Currently 
> it's disabled by default.
> To leverage the power of PartialAgg more aggressively, several things need to 
> be revisited:
> 1. The threshold of auto-disabling. Currently each mapper looks at first 1k 
> (hard-coded) records to see if there's enough data size reduction (defaults 
> to 10x, configurable). The check would happen earlier if the hash table gets 
> full before processing the 1k records (hash table size is controlled by 
> pig.cachedbag.memusage). We might want to relax these thresholds.
> 2. Dependency on the combiner. Currently the PartialAgg won't work without a 
> combiner following it, so we need to provide separate options to enable each 
> independently. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-2829) Use partial aggregation more aggresively

2012-09-05 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448528#comment-13448528
 ] 

Dmitriy V. Ryaboy commented on PIG-2829:


Jie, sorry I missed this ticket before. As you may have seen, I completely 
reimplemented this whole chunk of code in PIG-2888. Can you rerun your 
benchmarks and see if some of the improvements you propose here should be 
applied to the code developed in that ticket?

> Use partial aggregation more aggresively
> 
>
> Key: PIG-2829
> URL: https://issues.apache.org/jira/browse/PIG-2829
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.10.0
>Reporter: Jie Li
> Attachments: 2829.1.patch, 2829.2.patch, 2829.separate.options.patch, 
> pigmix-10G.png, tpch-10G.png
>
>
> Partial aggregation (Hash Aggregation, aka in-map combiner) is a new feature 
> in Pig 0.10 that will perform aggregation within map function. The main 
> advantage against combiner is it avoids de/serializing and sorting the data, 
> and it can auto disable itself if the data reduction rate is low. Currently 
> it's disabled by default.
> To leverage the power of PartialAgg more aggressively, several things need to 
> be revisited:
> 1. The threshold of auto-disabling. Currently each mapper looks at first 1k 
> (hard-coded) records to see if there's enough data size reduction (defaults 
> to 10x, configurable). The check would happen earlier if the hash table gets 
> full before processing the 1k records (hash table size is controlled by 
> pig.cachedbag.memusage). We might want to relax these thresholds.
> 2. Dependency on the combiner. Currently the PartialAgg won't work without a 
> combiner following it, so we need to provide separate options to enable each 
> independently. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-2829) Use partial aggregation more aggresively

2012-07-28 Thread Jie Li (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424428#comment-13424428
 ] 

Jie Li commented on PIG-2829:
-

Generated 100G tpc-h data with reduction rate 2 and 3 respectively. For each 
dataset, ran two queries: group-by with 8 aggregations and group-by with 1 
aggregation:

||query||combiner off, partial-agg off || combiner off, partial-agg on ||
|g-by with reduction by 3 and 8 aggregations| 47m59s|47m46s|
 | g-by with reduction by 2 and 8 aggregations| 48m39s | 57m3s |
|g-by with reduction by 3 and 1 aggregations| 23m37s| 20m52s |
| g-by with reduction by 2 and 1 aggregations | 24m11s | 24m36s|

>From the result we can see the minimum reduction rate for partial-agg is not 
>trivial to decide: it depends on the cost of performing the reduction (number 
>of aggregations, cost of aggregations, etc), and the cost to transfer the data 
>( the amount of data to transfer, and the network traffic, etc). It's like 
>compression: the performance is a trade-off between cpu and io, and is 
>application-dependent. For the default value, 3 will give more significant 
>improvement while 2 will save more traffic data. Any comment?


> Use partial aggregation more aggresively
> 
>
> Key: PIG-2829
> URL: https://issues.apache.org/jira/browse/PIG-2829
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.10.0
>Reporter: Jie Li
> Attachments: 2829.1.patch, 2829.2.patch, 2829.separate.options.patch, 
> pigmix-10G.png, tpch-10G.png
>
>
> Partial aggregation (Hash Aggregation, aka in-map combiner) is a new feature 
> in Pig 0.10 that will perform aggregation within map function. The main 
> advantage against combiner is it avoids de/serializing and sorting the data, 
> and it can auto disable itself if the data reduction rate is low. Currently 
> it's disabled by default.
> To leverage the power of PartialAgg more aggressively, several things need to 
> be revisited:
> 1. The threshold of auto-disabling. Currently each mapper looks at first 1k 
> (hard-coded) records to see if there's enough data size reduction (defaults 
> to 10x, configurable). The check would happen earlier if the hash table gets 
> full before processing the 1k records (hash table size is controlled by 
> pig.cachedbag.memusage). We might want to relax these thresholds.
> 2. Dependency on the combiner. Currently the PartialAgg won't work without a 
> combiner following it, so we need to provide separate options to enable each 
> independently. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (PIG-2829) Use partial aggregation more aggresively

2012-07-27 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424140#comment-13424140
 ] 

Thejas M Nair commented on PIG-2829:


Thanks for the benchmark Jie. Clearly, partial-agg is working better than 
combiner. 
Can you also run some benchmarks with combiner turned off, so that we can 
verify the appropriate value for pig.exec.mapPartAgg.minReduction - 

||query || combiner off, partial-agg off || combiner off, partial-agg on ||
|g-by with reduction by 3 | | |
|g-by with reduction by 2| | |


> Use partial aggregation more aggresively
> 
>
> Key: PIG-2829
> URL: https://issues.apache.org/jira/browse/PIG-2829
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.10.0
>Reporter: Jie Li
> Attachments: 2829.1.patch, 2829.2.patch, 2829.separate.options.patch, 
> pigmix-10G.png, tpch-10G.png
>
>
> Partial aggregation (Hash Aggregation, aka in-map combiner) is a new feature 
> in Pig 0.10 that will perform aggregation within map function. The main 
> advantage against combiner is it avoids de/serializing and sorting the data, 
> and it can auto disable itself if the data reduction rate is low. Currently 
> it's disabled by default.
> To leverage the power of PartialAgg more aggressively, several things need to 
> be revisited:
> 1. The threshold of auto-disabling. Currently each mapper looks at first 1k 
> (hard-coded) records to see if there's enough data size reduction (defaults 
> to 10x, configurable). The check would happen earlier if the hash table gets 
> full before processing the 1k records (hash table size is controlled by 
> pig.cachedbag.memusage). We might want to relax these thresholds.
> 2. Dependency on the combiner. Currently the PartialAgg won't work without a 
> combiner following it, so we need to provide separate options to enable each 
> independently. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (PIG-2829) Use partial aggregation more aggresively

2012-07-26 Thread Jie Li (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423613#comment-13423613
 ] 

Jie Li commented on PIG-2829:
-

bq.Can you try these settings queries where there are around 10+ group+agg that 
get combined into single MR job ?

Sure the tpc-h Q1 I ran before has 8 aggregations. I'll further double the 
number of aggregations and also change the group-by key so that every hash map 
will get full, so we can identify if there's any memory issue.

bq. Can you do some benchmarks to see if there is any noticeable difference in 
runtime because of the delay in turning mapPartAgg off ? 

Yeah will compare these two settings.

> Use partial aggregation more aggresively
> 
>
> Key: PIG-2829
> URL: https://issues.apache.org/jira/browse/PIG-2829
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.10.0
>Reporter: Jie Li
> Attachments: 2829.1.patch, 2829.separate.options.patch, 
> pigmix-10G.png, tpch-10G.png
>
>
> Partial aggregation (Hash Aggregation, aka in-map combiner) is a new feature 
> in Pig 0.10 that will perform aggregation within map function. The main 
> advantage against combiner is it avoids de/serializing and sorting the data, 
> and it can auto disable itself if the data reduction rate is low. Currently 
> it's disabled by default.
> To leverage the power of PartialAgg more aggressively, several things need to 
> be revisited:
> 1. The threshold of auto-disabling. Currently each mapper looks at first 1k 
> (hard-coded) records to see if there's enough data size reduction (defaults 
> to 10x, configurable). The check would happen earlier if the hash table gets 
> full before processing the 1k records (hash table size is controlled by 
> pig.cachedbag.memusage). We might want to relax these thresholds.
> 2. Dependency on the combiner. Currently the PartialAgg won't work without a 
> combiner following it, so we need to provide separate options to enable each 
> independently. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (PIG-2829) Use partial aggregation more aggresively

2012-07-26 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423602#comment-13423602
 ] 

Thejas M Nair commented on PIG-2829:


I will review the patch soon. Some comments regarding the default configuration 
- 

bq. 2: changes existing default values: 
After thinking of the multi-query use case, where you can have multiple 
POPartialAgg operators in a map task, I am having second thoughts on turning 
partial agg on by default. Can you try these settings queries where there are 
around 10+ group+agg that get combined into single MR job ? Maybe we should 
address the potential OOM issues for this use case before we change the 
defaults. This is likely to be become a bigger issue when we use 100k records 
to decide to turn on/off the partial aggregation.

bq. 3: adds a property pig.exec.mapPartAgg.reduction.checkinterval which 
defaults to 100k, so after processing every 100k records mapagg will check the 
reduction rate to see if it should be disabled. Previously we only look at 
first 1000 records.
Can you do some benchmarks to see if there is any noticeable difference in 
runtime because of the delay in turning mapPartAgg off ? 

> Use partial aggregation more aggresively
> 
>
> Key: PIG-2829
> URL: https://issues.apache.org/jira/browse/PIG-2829
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.10.0
>Reporter: Jie Li
> Attachments: 2829.1.patch, 2829.separate.options.patch, 
> pigmix-10G.png, tpch-10G.png
>
>
> Partial aggregation (Hash Aggregation, aka in-map combiner) is a new feature 
> in Pig 0.10 that will perform aggregation within map function. The main 
> advantage against combiner is it avoids de/serializing and sorting the data, 
> and it can auto disable itself if the data reduction rate is low. Currently 
> it's disabled by default.
> To leverage the power of PartialAgg more aggressively, several things need to 
> be revisited:
> 1. The threshold of auto-disabling. Currently each mapper looks at first 1k 
> (hard-coded) records to see if there's enough data size reduction (defaults 
> to 10x, configurable). The check would happen earlier if the hash table gets 
> full before processing the 1k records (hash table size is controlled by 
> pig.cachedbag.memusage). We might want to relax these thresholds.
> 2. Dependency on the combiner. Currently the PartialAgg won't work without a 
> combiner following it, so we need to provide separate options to enable each 
> independently. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira