hashed vector

Timothy Potter (JIRA) Sat, 30 Apr 2011 15:50:45 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027399#comment-13027399
 ]


Timothy Potter commented on MAHOUT-639:
---------------------------------------

Hi Jake,

Agreed on the unit test timeout not being a good approach. Thanks for cleaning 
up the code I submitted. 

> Need special case to handle creating a new SequentialAccessSparseVector from 
> a large (> 1M dims) random/hashed vector
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-639
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-639
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.4, 0.5
>            Reporter: Timothy Potter
>            Assignee: Jake Mannix
>            Priority: Critical
>              Labels: matrix, scalability, svd, transpose
>             Fix For: 0.6
>
>         Attachments: MAHOUT-639.patch
>
>
> When trying to transpose a matrix of tfidf vectors created from text 
> documents (ASF mail archives in this case), there is a bottleneck in the 
> TransposeJob's reducer when Mahout creates a new SequentialAccessSparseVector 
> from a RandomAccessSparseVector after the while loop completes:
>       SequentialAccessSparseVector outVector = new 
> SequentialAccessSparseVector(tmp);
> For high-frequency terms (some of which occur over ~1M times in my data), the 
> code to create a SequentialAccessSparseVector from a RandomAccessSparseVector 
> bogs down completely .... 
> From Jake Mannix:
> "Suspicion confirmed:
>   public SequentialAccessSparseVector(Vector other) {
>     this(other.size(), other.getNumNondefaultElements());
>     Iterator<Element> it = other.iterateNonZero();
>     Element e;
>     while (it.hasNext() && (e = it.next()) != null) {
>       set(e.index(), e.get());
>     }
>   }
> we iterate over the other vector (which is in random/hashed order), adding it 
> to the sequential access vector (which always tries to stay in sequential 
> order).  So actually, this may be *worse* than O(n^2), but I'd prefer to just 
> not know how much worse, and instead we should fix it.
> Should be fairly straightforward: make an array of structs (essentially) with 
> the index and the double, of size other.getNumNonDefaultElements() (what a 
> horrible method name), fill it up on one iteration over the other vector, 
> sort it in place, then make your new OrderedIntDoubleMapping out of the 
> indexes and values (unless someone has a cleverer idea to sort a pair of two 
> arrays at the same time, shuffling one based on the ordering criterion of the 
> other)."

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-639) Need special case to handle creating a new SequentialAccessSparseVector from a large (> 1M dims) random/hashed vector

Reply via email to