hashed vector

Jake Mannix (JIRA) Fri, 29 Apr 2011 20:37:51 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027278#comment-13027278
 ]


Jake Mannix commented on MAHOUT-639:
------------------------------------

Ok, Tim I think the idea of using auxiliary structs is fine.  It's not 
perfectly optimal, but the time complexity is the same big-O as the optimal 
solution (different constant, probably, and requires extra short-lived objects, 
true).  I was able to get it about 10% faster by not using the set() method on 
OrderedIntDoubleMapping, and instead making the constructor for that class 
package private and using it with the already-ordered int[] and double[] which 
can be created directly in copySortedRandomAccessSparseVector().

I don't think we want a timeout based nondeterministic unit test for this, 
however.  That's just asking for trouble.  I think verifying all other tests 
work, and then actually checking this actually improves performance is enough. 

I'll clean this up and get it committed this weekend.

> Need special case to handle creating a new SequentialAccessSparseVector from 
> a large (> 1M dims) random/hashed vector
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-639
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-639
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.4, 0.5
>            Reporter: Timothy Potter
>            Assignee: Jake Mannix
>            Priority: Critical
>              Labels: matrix, scalability, svd, transpose
>             Fix For: 0.6
>
>         Attachments: MAHOUT-639.patch
>
>
> When trying to transpose a matrix of tfidf vectors created from text 
> documents (ASF mail archives in this case), there is a bottleneck in the 
> TransposeJob's reducer when Mahout creates a new SequentialAccessSparseVector 
> from a RandomAccessSparseVector after the while loop completes:
>       SequentialAccessSparseVector outVector = new 
> SequentialAccessSparseVector(tmp);
> For high-frequency terms (some of which occur over ~1M times in my data), the 
> code to create a SequentialAccessSparseVector from a RandomAccessSparseVector 
> bogs down completely .... 
> From Jake Mannix:
> "Suspicion confirmed:
>   public SequentialAccessSparseVector(Vector other) {
>     this(other.size(), other.getNumNondefaultElements());
>     Iterator<Element> it = other.iterateNonZero();
>     Element e;
>     while (it.hasNext() && (e = it.next()) != null) {
>       set(e.index(), e.get());
>     }
>   }
> we iterate over the other vector (which is in random/hashed order), adding it 
> to the sequential access vector (which always tries to stay in sequential 
> order).  So actually, this may be *worse* than O(n^2), but I'd prefer to just 
> not know how much worse, and instead we should fix it.
> Should be fairly straightforward: make an array of structs (essentially) with 
> the index and the double, of size other.getNumNonDefaultElements() (what a 
> horrible method name), fill it up on one iteration over the other vector, 
> sort it in place, then make your new OrderedIntDoubleMapping out of the 
> indexes and values (unless someone has a cleverer idea to sort a pair of two 
> arrays at the same time, shuffling one based on the ordering criterion of the 
> other)."

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-639) Need special case to handle creating a new SequentialAccessSparseVector from a large (> 1M dims) random/hashed vector

Reply via email to