[ 
https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-2888:
-----------------------------------

    Attachment: partialagg_patch_1.patch

The attached patch is an initial pass at this implementation. It may be hard 
to read as a diff, since about 70% of the code in POPartialAgg changed; I 
recommend applying it to a git branch and reading the class directly.

I have not implemented memory-based triggering yet; for now the patch relies 
on hardcoded limits on the number of tuples in the caches.
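
To illustrate the general idea (this is a sketch and not code from the 
patch): hash-based partial aggregation keeps one running aggregate per key in 
a map and flushes the map once it grows past a fixed number of entries. The 
class name PartialAggSketch, the maxCachedKeys parameter, and the use of a 
plain sum are all assumptions made for this example; POPartialAgg itself 
operates on Pig tuples and Algebraic UDFs.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not taken from partialagg_patch_1.patch.
class PartialAggSketch {
    // Hypothetical fixed cap on cached keys, standing in for a hardcoded limit.
    private final int maxCachedKeys;
    // One running partial sum per key.
    private final Map<String, Long> cache = new HashMap<>();
    // Partial results handed downstream; a key may appear more than once.
    private final List<Map.Entry<String, Long>> emitted = new ArrayList<>();

    PartialAggSketch(int maxCachedKeys) {
        this.maxCachedKeys = maxCachedKeys;
    }

    // Fold one value into the sum for its key; flush if the cache exceeds the cap.
    void add(String key, long value) {
        cache.merge(key, value, Long::sum);
        if (cache.size() > maxCachedKeys) {
            flush();
        }
    }

    // Emit every cached partial sum and empty the cache.
    void flush() {
        for (Map.Entry<String, Long> e : cache.entrySet()) {
            emitted.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
        }
        cache.clear();
    }

    List<Map.Entry<String, Long>> emitted() {
        return emitted;
    }

    // Combine all emitted partial sums for one key, as a later stage would.
    long emittedTotal(String key) {
        long total = 0;
        for (Map.Entry<String, Long> e : emitted) {
            if (e.getKey().equals(key)) {
                total += e.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        PartialAggSketch agg = new PartialAggSketch(2);
        agg.add("a", 1);
        agg.add("a", 2);
        agg.add("b", 5);
        agg.add("c", 7);   // third distinct key exceeds the cap and triggers a flush
        agg.add("a", 10);
        agg.flush();       // end of input: emit whatever is still cached
        System.out.println("partial results emitted: " + agg.emitted().size());
        System.out.println("combined total for a: " + agg.emittedTotal("a"));
    }
}
```

Because a flush can happen in the middle of the input, the same key may be 
emitted several times with different partial values. That is only safe when a 
later stage combines the partial results, which is the property Algebraic 
UDFs provide.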

I have also not implemented the functionality to automatically turn off 
hash-based aggregation.

All tests pass except those related to the memory settings.

Test runs on synthetic data, both in local mode and on a cluster, produced 
correct results.

Cluster runs indicate significant improvement in overall speed of execution 
when using this approach.
                
> Improve performance of POPartialAgg
> -----------------------------------
>
>                 Key: PIG-2888
>                 URL: https://issues.apache.org/jira/browse/PIG-2888
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>         Attachments: partialagg_patch_1.patch
>
>
> During performance testing, we found that POPartialAgg can cause performance 
> degradation for Pig jobs when the Algebraic UDFs it's being applied to aren't 
> well suited to the operator's assumptions. Changing the implementation to a 
> more flexible hash-based model can provide significant performance 
> improvements.


        
