[
https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitriy V. Ryaboy updated PIG-2888:
-----------------------------------
Attachment: partialagg_patch_1.patch
The attached patch is a first pass at this implementation. It may be hard to
read as a diff, since about 70% of the code in POPartialAgg changed; I
recommend applying it to a git branch and looking at the class directly.
I have not implemented memory-based triggering yet; for now the patch relies
on hardcoded limits on the number of tuples in the caches.
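To illustrate the general idea of count-limited, hash-based partial
aggregation, here is a minimal sketch. It is not the code from the patch: the
class name, the single String-keyed map, the summing combiner, and the
maxKeys limit are all assumptions made for the example, and POPartialAgg
itself works on Pig tuples and Algebraic UDFs.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only; not taken from partialagg_patch_1.patch. */
class PartialAggSketch {
    // Hypothetical hardcoded limit on the number of distinct keys kept in memory.
    private final int maxKeys;
    // In-memory cache of partial aggregates, keyed by group key.
    private final Map<String, Long> cache = new HashMap<>();
    // Partial results already handed "downstream" (a list here, for simplicity).
    private final List<Map.Entry<String, Long>> emitted = new ArrayList<>();

    PartialAggSketch(int maxKeys) {
        this.maxKeys = maxKeys;
    }

    /** Combine a value into the partial aggregate for its key. */
    void add(String key, long value) {
        cache.merge(key, value, Long::sum);
        // Count-based trigger: spill everything once the cache grows past the limit.
        if (cache.size() > maxKeys) {
            flush();
        }
    }

    /** Emit all cached partial aggregates and clear the cache. */
    void flush() {
        for (Map.Entry<String, Long> e : cache.entrySet()) {
            emitted.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
        }
        cache.clear();
    }

    List<Map.Entry<String, Long>> emitted() {
        return emitted;
    }
}
```

Because the emitted values are only partial aggregates, the same key can be
emitted more than once; a later stage still has to combine them into the
final result.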
I have also not implemented the functionality to automatically turn off
hash-based aggregation.
All tests pass except those related to the memory settings.
Test runs on synthetic data, both in local mode and on a cluster, produced
correct results.
Cluster runs indicate a significant improvement in overall execution speed
with this approach.
> Improve performance of POPartialAgg
> -----------------------------------
>
> Key: PIG-2888
> URL: https://issues.apache.org/jira/browse/PIG-2888
> Project: Pig
> Issue Type: Improvement
> Reporter: Dmitriy V. Ryaboy
> Assignee: Dmitriy V. Ryaboy
> Attachments: partialagg_patch_1.patch
>
>
> During performance testing, we found that POPartialAgg can cause performance
> degradation for Pig jobs when the Algebraic UDFs it's being applied to aren't
> well suited to the operator's assumptions. Changing the implementation to a
> more flexible hash-based model can provide significant performance
> improvements.