Need special case to handle creating a new SequentialAccessSparseVector from a
large (> 1M dims) random/hashed vector
---------------------------------------------------------------------------------------------------------------------
Key: MAHOUT-639
URL: https://issues.apache.org/jira/browse/MAHOUT-639
Project: Mahout
Issue Type: Improvement
Components: collections
Affects Versions: 0.4, 0.5
Reporter: Timothy Potter
Assignee: Benson Margulies
When trying to transpose a matrix of tfidf vectors created from text documents
(ASF mail archives in this case), there is a bottleneck in the TransposeJob's
reducer when Mahout creates a new SequentialAccessSparseVector from a
RandomAccessSparseVector after the while loop completes:
SequentialAccessSparseVector outVector = new
SequentialAccessSparseVector(tmp);
For high-frequency terms (some of which occur over ~1M times in my data), the
code to create a SequentialAccessSparseVector from a RandomAccessSparseVector
bogs down completely ....
>From Jake Mannix:
"Suspicion confirmed:
public SequentialAccessSparseVector(Vector other) {
this(other.size(), other.getNumNondefaultElements());
Iterator<Element> it = other.iterateNonZero();
Element e;
while (it.hasNext() && (e = it.next()) != null) {
set(e.index(), e.get());
}
}
we iterate over the other vector (which is in random/hashed order), adding it
to the sequential access vector (which always tries to stay in sequential
order). So actually, this may be *worse* than O(n^2), but I'd prefer to just
not know how much worse, and instead we should fix it.
Should be fairly straightforward: make an array of structs (essentially) with
the index and the double, of size other.getNumNonDefaultElements() (what a
horrible method name), fill it up on one iteration over the other vector, sort
it in place, then make your new OrderedIntDoubleMapping out of the indexes and
values (unless someone has a cleverer idea to sort a pair of two arrays at the
same time, shuffling one based on the ordering criterion of the other)."
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira