[ https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dmitriy V. Ryaboy updated PIG-2888:
-----------------------------------
    Attachment: partialagg_patch_1.patch

The attached patch is an initial pass at this implementation. Reading it as a diff may be hard, since about 70% of the code in POPartialAgg changed; I recommend applying it to a git branch and looking at the class directly.

I have not implemented memory-based triggering yet; for now the patch relies on hardcoded limits on the number of tuples in the caches. I have also not implemented the functionality to automatically turn off hash-based aggregation.

Tests (except the tests related to memory settings) pass. Test runs on synthetic data, both in local mode and on a cluster, produced correct data. Cluster runs indicate a significant improvement in overall execution speed when using this approach.

> Improve performance of POPartialAgg
> -----------------------------------
>
>                 Key: PIG-2888
>                 URL: https://issues.apache.org/jira/browse/PIG-2888
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>         Attachments: partialagg_patch_1.patch
>
>
> During performance testing, we found that POPartialAgg can cause performance
> degradation for Pig jobs when the Algebraic UDFs it's being applied to aren't
> well suited to the operator's assumptions. Changing the implementation to a
> more flexible hash-based model can provide significant performance
> improvements.
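
For readers unfamiliar with the approach, hash-based partial aggregation with a hardcoded cache limit can be sketched roughly as below. This is not taken from the attached patch: the class name PartialSumSketch, the maxCachedKeys parameter, and the choice of a running sum as the aggregate are all assumptions made purely for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of hash-based partial aggregation; NOT the code in the patch. */
class PartialSumSketch {
    // Hardcoded cap on the number of cached keys (an assumed stand-in for
    // the tuple-count limits mentioned in the comment above).
    private final int maxCachedKeys;

    // In-memory cache: one partially aggregated value per key.
    private final Map<String, Long> cache = new HashMap<>();

    // Partial results that have been pushed downstream so far.
    private final List<Map.Entry<String, Long>> emitted = new ArrayList<>();

    PartialSumSketch(int maxCachedKeys) {
        this.maxCachedKeys = maxCachedKeys;
    }

    /** Folds one input value into the cached partial sum for its key. */
    void add(String key, long value) {
        cache.merge(key, value, Long::sum);
        // Once the cache holds more keys than the limit, emit everything.
        if (cache.size() > maxCachedKeys) {
            flush();
        }
    }

    /** Emits all cached partial sums and empties the cache. */
    void flush() {
        for (Map.Entry<String, Long> e : cache.entrySet()) {
            emitted.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
        }
        cache.clear();
    }

    /** Returns the partial results emitted so far. */
    List<Map.Entry<String, Long>> emitted() {
        return emitted;
    }
}
```

Because the same key may be emitted more than once across flushes, a downstream step still has to combine the partial results. A caller would also invoke flush() once at the end of input to drain whatever remains in the cache.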