[
https://issues.apache.org/jira/browse/MAHOUT-639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027399#comment-13027399
]
Timothy Potter commented on MAHOUT-639:
---------------------------------------
Hi Jake,
Agreed on the unit test timeout not being a good approach. Thanks for cleaning
up the code I submitted.
> Need special case to handle creating a new SequentialAccessSparseVector from
> a large (> 1M dims) random/hashed vector
> ---------------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-639
> URL: https://issues.apache.org/jira/browse/MAHOUT-639
> Project: Mahout
> Issue Type: Improvement
> Components: Math
> Affects Versions: 0.4, 0.5
> Reporter: Timothy Potter
> Assignee: Jake Mannix
> Priority: Critical
> Labels: matrix, scalability, svd, transpose
> Fix For: 0.6
>
> Attachments: MAHOUT-639.patch
>
>
> When trying to transpose a matrix of tfidf vectors created from text
> documents (ASF mail archives in this case), there is a bottleneck in the
> TransposeJob's reducer when Mahout creates a new SequentialAccessSparseVector
> from a RandomAccessSparseVector after the while loop completes:
> SequentialAccessSparseVector outVector = new
> SequentialAccessSparseVector(tmp);
> For high-frequency terms (some of which occur over ~1M times in my data), the
> code to create a SequentialAccessSparseVector from a RandomAccessSparseVector
> bogs down completely ....
> From Jake Mannix:
> "Suspicion confirmed:
> public SequentialAccessSparseVector(Vector other) {
> this(other.size(), other.getNumNondefaultElements());
> Iterator<Element> it = other.iterateNonZero();
> Element e;
> while (it.hasNext() && (e = it.next()) != null) {
> set(e.index(), e.get());
> }
> }
> we iterate over the other vector (which is in random/hashed order), adding it
> to the sequential access vector (which always tries to stay in sequential
> order). So actually, this may be *worse* than O(n^2), but I'd prefer to just
> not know how much worse, and instead we should fix it.
> Should be fairly straightforward: make an array of structs (essentially) with
> the index and the double, of size other.getNumNonDefaultElements() (what a
> horrible method name), fill it up on one iteration over the other vector,
> sort it in place, then make your new OrderedIntDoubleMapping out of the
> indexes and values (unless someone has a cleverer idea to sort a pair of two
> arrays at the same time, shuffling one based on the ordering criterion of the
> other)."
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira