[
https://issues.apache.org/jira/browse/PIG-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018164#comment-16018164
]
Rohini Palaniswamy commented on PIG-4449:
-----------------------------------------
PIG-5211 has addressed this problem. The current implementation in PIG-5211
adds to PriorityQueue and then removes an element if it exceeds size limit.
Leaving this jira open to address further optimizations that I had thought of
for this issue.
1) Change to using
https://google.github.io/guava/releases/11.0/api/docs/com/google/common/collect/TreeMultiset.html
. Internally it is Map<E, AtomicInteger> which keeps count of duplicate
entries. That should save space. Also it allows peeking of first and last
entry. So after reaching the limit we can check if the new element to be added
is greater than the last entry in case of ascending sort and or lesser than the
smaller entry in case of descending sort and avoid adding in the first place.
2) Use https://docs.oracle.com/javase/7/docs/api/java/util/TreeSet.html in case
it is a DISTINCT + ORDER BY +LIMIT
3) Add support for spill. Have seen cases where people do LIMIT 100000.
> Optimize the case of Order by + Limit in nested foreach
> -------------------------------------------------------
>
> Key: PIG-4449
> URL: https://issues.apache.org/jira/browse/PIG-4449
> Project: Pig
> Issue Type: Improvement
> Reporter: Rohini Palaniswamy
> Labels: Performance
>
> This is one of the very frequently used patterns
> {code}
> grouped_data_set = group data_set by id;
> capped_data_set = foreach grouped_data_set
> {
> ordered = order joined_data_set by timestamp desc;
> capped = limit ordered $num;
> generate flatten(capped);
> };
> {code}
> But this performs very poorly when there are millions of rows for a key in
> the groupby with lot of spills. This can be easily optimized by pushing the
> limit into the InternalSortedBag and maintain only $num records any time and
> avoid memory pressure.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)