[
https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439904#comment-13439904
]
Dmitriy V. Ryaboy commented on PIG-2888:
----------------------------------------
The current implementation makes a two key assumptions that are frequently
violated in real-life datasets and scripts:
1) The intermediate UDF is cheap to invoke
2) Records come in mostly-grouped order (records with the same key tend to
follow each other).
When condition 2 is not satisfied, POPartialAgg winds up calling the
intermediate UDF on all accumulated values so far for a given key, plus a new
tuple, for every single tuple it sees. This causes a significant performance
degradation.
Instead, we propose accumulating tuples across the board until a memory
threshold is reached. Once this threshold is reached, all keys and tuples are
fed into the intermediate UDF and the results put into a second-level map
(presumably, having been significantly shrunk by the intermediate UDF). This
repeats until the second-level map hits its threshold, at which point *it* is
summarized and its values replaced with the aggregated ones. If after such a
reduction the memory occupied by the hashmap is still near the threshold, the
results are returned to the regular MR pipeline.
> Improve performance of POPartialAgg
> -----------------------------------
>
> Key: PIG-2888
> URL: https://issues.apache.org/jira/browse/PIG-2888
> Project: Pig
> Issue Type: Improvement
> Reporter: Dmitriy V. Ryaboy
> Assignee: Dmitriy V. Ryaboy
>
> During performance testing, we found that POPartialAgg can cause performance
> degradation for Pig jobs when the Algebraic UDFs it's being applied to aren't
> well suited to the operator's assumptions. Changing the implementation to a
> more flexible hash-based model can provide significant performance
> improvements.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira