[ https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dmitriy V. Ryaboy updated PIG-2888:
-----------------------------------
    Attachment: partialagg_patch_2.patch

Attaching a second version. It's ready for review.

This version takes care of memory estimation (it actually looks at the number of operators rather than hardcoding a magic "3") and turns itself off if the reduction is insufficient.

I would love to get third-party verification of the speed improvements. Maybe someone who has recent PigMix results can rerun with this patch?

One of the test cases (TestPOPartialAgg.testPartialMultiInput1HashMemEmpty) still fails because it assumes that even if no memory is allocated to internal cached bags, consecutive keys still get aggregated. That assumption is pretty specific to the old implementation. Does anyone think that feature is critical? If not, I would like to remove the test.

> Improve performance of POPartialAgg
> -----------------------------------
>
>                 Key: PIG-2888
>                 URL: https://issues.apache.org/jira/browse/PIG-2888
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>         Attachments: partialagg_patch_1.patch, partialagg_patch_2.patch
>
>
> During performance testing, we found that POPartialAgg can cause performance
> degradation for Pig jobs when the Algebraic UDFs it's being applied to aren't
> well suited to the operator's assumptions. Changing the implementation to a
> more flexible hash-based model can provide significant performance
> improvements.
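
For readers unfamiliar with the idea, the behaviour described in the comment above (hash-based partial aggregation that flushes when its budget is used up and turns itself off when the reduction is insufficient) can be illustrated roughly as follows. This is a minimal sketch and not the code in the attached patch: the class name PartialAggSketch, the entry-count budget (a stand-in for real memory estimation), and the parameters minReduction and checkAfter are all hypothetical, and summing long values stands in for an arbitrary Algebraic UDF.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Pig or the patch.
class PartialAggSketch {
    private final int maxEntries;      // assumed stand-in for a memory budget
    private final double minReduction; // required ratio of inputs to outputs
    private final int checkAfter;      // inputs to observe before judging the ratio
    private final Map<String, Long> partial = new HashMap<>();
    private final List<Map.Entry<String, Long>> output = new ArrayList<>();
    private long seen = 0;
    private boolean disabled = false;

    PartialAggSketch(int maxEntries, double minReduction, int checkAfter) {
        this.maxEntries = maxEntries;
        this.minReduction = minReduction;
        this.checkAfter = checkAfter;
    }

    void add(String key, long value) {
        if (disabled) {
            // Aggregation is off: pass the record straight through.
            output.add(new AbstractMap.SimpleImmutableEntry<>(key, value));
            return;
        }
        partial.merge(key, value, Long::sum); // combine values for the same key
        seen++;
        if (seen == checkAfter) {
            // Records out so far = already flushed + currently held in the map.
            double reduction = (double) seen / (output.size() + partial.size());
            if (reduction < minReduction) {
                disabled = true; // not enough reduction to justify the overhead
                flush();
                return;
            }
        }
        if (partial.size() >= maxEntries) {
            flush(); // budget exhausted: emit partial results and start over
        }
    }

    private void flush() {
        for (Map.Entry<String, Long> e : partial.entrySet()) {
            output.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
        }
        partial.clear();
    }

    boolean isDisabled() {
        return disabled;
    }

    // Emits whatever is still buffered and returns everything produced so far.
    List<Map.Entry<String, Long>> finish() {
        flush();
        return output;
    }

    public static void main(String[] args) {
        PartialAggSketch agg = new PartialAggSketch(100, 2.0, 6);
        for (int i = 0; i < 6; i++) {
            agg.add(i % 2 == 0 ? "a" : "b", 1);
        }
        System.out.println(agg.finish() + " disabled=" + agg.isDisabled());
    }
}
```

In this sketch, input with many repeated keys keeps the aggregation on, while input whose keys are mostly distinct trips the reduction check and falls back to passing records through unchanged.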