[
https://issues.apache.org/jira/browse/MAHOUT-639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027278#comment-13027278
]
Jake Mannix commented on MAHOUT-639:
------------------------------------
Ok, Tim I think the idea of using auxiliary structs is fine. It's not
perfectly optimal, but the time complexity is the same big-O as the optimal
solution (different constant, probably, and requires extra short-lived objects,
true). I was able to get it about 10% faster by not using the set() method on
OrderedIntDoubleMapping, and instead making the constructor for that class
package private and using it with the already-ordered int[] and double[] which
can be created directly in copySortedRandomAccessSparseVector().
I don't think we want a timeout based nondeterministic unit test for this,
however. That's just asking for trouble. I think verifying all other tests
work, and then actually checking this actually improves performance is enough.
I'll clean this up and get it committed this weekend.
> Need special case to handle creating a new SequentialAccessSparseVector from
> a large (> 1M dims) random/hashed vector
> ---------------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-639
> URL: https://issues.apache.org/jira/browse/MAHOUT-639
> Project: Mahout
> Issue Type: Improvement
> Components: Math
> Affects Versions: 0.4, 0.5
> Reporter: Timothy Potter
> Assignee: Jake Mannix
> Priority: Critical
> Labels: matrix, scalability, svd, transpose
> Fix For: 0.6
>
> Attachments: MAHOUT-639.patch
>
>
> When trying to transpose a matrix of tfidf vectors created from text
> documents (ASF mail archives in this case), there is a bottleneck in the
> TransposeJob's reducer when Mahout creates a new SequentialAccessSparseVector
> from a RandomAccessSparseVector after the while loop completes:
> SequentialAccessSparseVector outVector = new
> SequentialAccessSparseVector(tmp);
> For high-frequency terms (some of which occur over ~1M times in my data), the
> code to create a SequentialAccessSparseVector from a RandomAccessSparseVector
> bogs down completely ....
> From Jake Mannix:
> "Suspicion confirmed:
> public SequentialAccessSparseVector(Vector other) {
> this(other.size(), other.getNumNondefaultElements());
> Iterator<Element> it = other.iterateNonZero();
> Element e;
> while (it.hasNext() && (e = it.next()) != null) {
> set(e.index(), e.get());
> }
> }
> we iterate over the other vector (which is in random/hashed order), adding it
> to the sequential access vector (which always tries to stay in sequential
> order). So actually, this may be *worse* than O(n^2), but I'd prefer to just
> not know how much worse, and instead we should fix it.
> Should be fairly straightforward: make an array of structs (essentially) with
> the index and the double, of size other.getNumNonDefaultElements() (what a
> horrible method name), fill it up on one iteration over the other vector,
> sort it in place, then make your new OrderedIntDoubleMapping out of the
> indexes and values (unless someone has a cleverer idea to sort a pair of two
> arrays at the same time, shuffling one based on the ordering criterion of the
> other)."
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira