[
https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442974#comment-13442974
]
Julien Le Dem commented on PIG-2888:
------------------------------------
Awesome. Stop now or it will start to be negative soon.
Comments:
* There's a "pig.exec.nocombiner" that was not replaced by a constant.
* It would be nice to have a consistent way of getting booleans (and floats)
from the conf. Something like:
{noformat}
PigConfiguration.getBoolean(Properties p, key) {
return "true".equals(p.getProperty(key, "false"));
}
{noformat}
* some of the class description was still applicable
{noformat}
/**
* Do partial aggregation in map plan. It uses a hash-map to aggregate.
* ...
*/
public class POPartialAgg extends PhysicalOperator {
{noformat}
* what is the reason for this particular value?
{noformat}
private static final int MAX_LIST_SIZE = 1 << 13 - 1;
{noformat}
* It looks like this could be a HashSet as the value never gets used (but
there's no WeakHashSet so I gues I got my answer). It could be as well
WeakHashMap<POPartialAgg, ?>. Don't you want a visitor to just list them all
once and set the count? That way you would not have to worry about keeping a
reference on them.
{noformat}
private static final WeakHashMap<POPartialAgg, Byte> ALL_POPARTS = new
WeakHashMap<POPartialAgg, Byte>();
{noformat}
* +0.5 so that it is never 0 ? Math.min(1, ...) is more readable.
{noformat}
firstTierThreshold = (int) (0.5 + totalTuples * (1f - (1f / sizeReduction)));
secondTierThreshold = (int) (0.5 + totalTuples * (1f / sizeReduction));
{noformat}
* LOG.info() should be wrapped in if (LOG.isInfoEnabled()) { ... } for perf
* in aggregateSecondLevel() can't the processedInputMap be reused?
* in getMinOutputReductionFromProp(), if minReduction <= 0 it should throw an
exception.
> Improve performance of POPartialAgg
> -----------------------------------
>
> Key: PIG-2888
> URL: https://issues.apache.org/jira/browse/PIG-2888
> Project: Pig
> Issue Type: Improvement
> Reporter: Dmitriy V. Ryaboy
> Assignee: Dmitriy V. Ryaboy
> Attachments: partialagg_patch_1.patch, partialagg_patch_2.patch,
> partialagg_patch_3.patch, partialagg_patch_4.patch
>
>
> During performance testing, we found that POPartialAgg can cause performance
> degradation for Pig jobs when the Algebraic UDFs it's being applied to aren't
> well suited to the operator's assumptions. Changing the implementation to a
> more flexible hash-based model can provide significant performance
> improvements.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira