[ 
https://issues.apache.org/jira/browse/MAHOUT-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149106#comment-13149106
 ] 

Yonik Seeley commented on MAHOUT-881:
-------------------------------------

Grant's patch seems to half the number of PQ operations while collecting the 
top N.  I don't know enough about possible Mahout usage patterns, or the 
relative cost to this against everything else to know if it matters, but it 
does for Lucene.  One lucene example is sorting by date descending when they 
have been added in roughly ascending order - the number of PQ operations is 
very high, so cutting it down by half is a nice win.

As far as the sort, the patch essentially implements a heap sort, taking 
advantage of the fact that we have already done the first half of that 
(building the heap).  Not sure how to compare that to other sort algorithms...

Memory usage also seems improved.  This existing code
{code}
    List<RecommendedItem> result = Lists.newArrayListWithCapacity(size);
    result.addAll(topItems);
    Collections.sort(result, ByValueRecommendedItemComparator.getInstance());
{code}

- Object[size] allocated for the ArrayList
- result.addAll calls toArray which creates another Object[size] and then 
copies it to the first
- Collections.sort creates another Object[size] while doing the mergesort, then 
iterates the original collection, setting every item

I think Grant's might be improved by using Arrays.asList which would avoid an 
extra Object[size] allocation + copy.

So the patch also improves on the memory use - but again, I don't know enough 
about Mahout to know if it matters in this context.

                
> Refactor TopItems to use Lucene's PriorityQueue and remove excessive sorting
> ----------------------------------------------------------------------------
>
>                 Key: MAHOUT-881
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-881
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.6
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: MAHOUT-881.patch
>
>
> TopItems.getTop*() all do a fair number of excessive operations that can be 
> replaced by switching to using Lucene's PriorityQueue implementation, which 
> is more efficient and faster than Java's built in PQ implementation.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to