[ https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442974#comment-13442974 ]
Julien Le Dem commented on PIG-2888: ------------------------------------ Awesome. Stop now or it will start to be negative soon. Comments: * There's a "pig.exec.nocombiner" that was not replaced by a constant. * It would be nice to have a consistent way of getting booleans (and floats) from the conf. Something like: {noformat} PigConfiguration.getBoolean(Properties p, key) { return "true".equals(p.getProperty(key, "false")); } {noformat} * some of the class description was still applicable {noformat} /** * Do partial aggregation in map plan. It uses a hash-map to aggregate. * ... */ public class POPartialAgg extends PhysicalOperator { {noformat} * what is the reason for this particular value? {noformat} private static final int MAX_LIST_SIZE = 1 << 13 - 1; {noformat} * It looks like this could be a HashSet as the value never gets used (but there's no WeakHashSet so I gues I got my answer). It could be as well WeakHashMap<POPartialAgg, ?>. Don't you want a visitor to just list them all once and set the count? That way you would not have to worry about keeping a reference on them. {noformat} private static final WeakHashMap<POPartialAgg, Byte> ALL_POPARTS = new WeakHashMap<POPartialAgg, Byte>(); {noformat} * +0.5 so that it is never 0 ? Math.min(1, ...) is more readable. {noformat} firstTierThreshold = (int) (0.5 + totalTuples * (1f - (1f / sizeReduction))); secondTierThreshold = (int) (0.5 + totalTuples * (1f / sizeReduction)); {noformat} * LOG.info() should be wrapped in if (LOG.isInfoEnabled()) { ... } for perf * in aggregateSecondLevel() can't the processedInputMap be reused? * in getMinOutputReductionFromProp(), if minReduction <= 0 it should throw an exception. > Improve performance of POPartialAgg > ----------------------------------- > > Key: PIG-2888 > URL: https://issues.apache.org/jira/browse/PIG-2888 > Project: Pig > Issue Type: Improvement > Reporter: Dmitriy V. Ryaboy > Assignee: Dmitriy V. Ryaboy > Attachments: partialagg_patch_1.patch, partialagg_patch_2.patch, > partialagg_patch_3.patch, partialagg_patch_4.patch > > > During performance testing, we found that POPartialAgg can cause performance > degradation for Pig jobs when the Algebraic UDFs it's being applied to aren't > well suited to the operator's assumptions. Changing the implementation to a > more flexible hash-based model can provide significant performance > improvements. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira