[ 
https://issues.apache.org/jira/browse/MAHOUT-639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Potter updated MAHOUT-639:
----------------------------------

    Status: Patch Available  (was: Open)

Not the "optimal" way suggested by Jake, but definitely gets passed the problem 
by using a new internal class OrderedElement which sorts Elements by their 
index, thus allowing us to use Arrays.sort(). However, it does mean that we 
create many temporary OrderedElement objects during the copy process. Also, I'm 
not entirely sure my test case is not going to cause the community issues since 
it's based on a JUnit timeout.

Lastly, I also changed a few things around to make the matrixmult job more 
efficient which also bogged down in multiplying these large vectors in the 
mappers after I got passed the transpose issue.

> Need special case to handle creating a new SequentialAccessSparseVector from 
> a large (> 1M dims) random/hashed vector
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-639
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-639
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.4, 0.5
>            Reporter: Timothy Potter
>            Assignee: Jake Mannix
>              Labels: matrix, scalability, svd, transpose
>
> When trying to transpose a matrix of tfidf vectors created from text 
> documents (ASF mail archives in this case), there is a bottleneck in the 
> TransposeJob's reducer when Mahout creates a new SequentialAccessSparseVector 
> from a RandomAccessSparseVector after the while loop completes:
>       SequentialAccessSparseVector outVector = new 
> SequentialAccessSparseVector(tmp);
> For high-frequency terms (some of which occur over ~1M times in my data), the 
> code to create a SequentialAccessSparseVector from a RandomAccessSparseVector 
> bogs down completely .... 
> From Jake Mannix:
> "Suspicion confirmed:
>   public SequentialAccessSparseVector(Vector other) {
>     this(other.size(), other.getNumNondefaultElements());
>     Iterator<Element> it = other.iterateNonZero();
>     Element e;
>     while (it.hasNext() && (e = it.next()) != null) {
>       set(e.index(), e.get());
>     }
>   }
> we iterate over the other vector (which is in random/hashed order), adding it 
> to the sequential access vector (which always tries to stay in sequential 
> order).  So actually, this may be *worse* than O(n^2), but I'd prefer to just 
> not know how much worse, and instead we should fix it.
> Should be fairly straightforward: make an array of structs (essentially) with 
> the index and the double, of size other.getNumNonDefaultElements() (what a 
> horrible method name), fill it up on one iteration over the other vector, 
> sort it in place, then make your new OrderedIntDoubleMapping out of the 
> indexes and values (unless someone has a cleverer idea to sort a pair of two 
> arrays at the same time, shuffling one based on the ordering criterion of the 
> other)."

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to