OrderedIntDoubleMapping / SparseVector is unnecessarily slow
------------------------------------------------------------
                 Key: MAHOUT-201
                 URL: https://issues.apache.org/jira/browse/MAHOUT-201
             Project: Mahout
          Issue Type: Improvement
          Components: Matrix
    Affects Versions: 0.2
            Reporter: Jake Mannix
             Fix For: 0.3


In the work on MAHOUT-165, I find that while their sparse vector implementation is great from a hashing standpoint (it's memory efficient and fast for random access), they don't provide anything like the OrderedIntDoublePair - i.e. a vector implementation which is *not* fast for random access or out-of-order modification, but is minimally sized memory-wise and blazingly fast for read-only dot products and vector sums (where the latter is read-only on its inputs and creates a new output), both with each other and with DenseVectors.

This line of thinking got me looking back at the current SparseVector implementation we have in Mahout, because it *is* based on an int[] and a double[]. Unfortunately, it's not at all optimized for the cases where it can outperform all other sparse impls:

* it should override dot(Vector) and plus(Vector) to check whether the input is a DenseVector or a SparseVector (or, once we have an OpenIntDoubleMap implementation of SparseVector, that case as well), and do specialized operations there (see the dot(Vector) sketch at the end of this description);
* even when those particular methods aren't being used, the AllIterator and NonZeroIterator inner classes are very inefficient:
** minor things, like not caching values.numMappings() and values.getIndices() in final instance variables in the Iterators;
** the huge performance drain of Element.get():
{code}
  public double get() {
    return values.get(ind);
  }
{code}
which is implemented as a binary search on the index array (to find an offset the iterator already knew!) followed by the array lookup (a sketch of a fixed iterator is also below).

This last point is probably the entire reason we've seen performance problems with SparseVector: it's hit in both the NonZeroIterator and the AllIterator, and so it turns any O(numNonZeroElements) operation into an O(numNonZeroElements * log(numNonZeroElements)) one (with some additional constant factors for too much indirection thrown in for good measure).

Unless there is another JIRA ticket with a patch fixing this that I didn't notice, I can whip up a patch (I've got a similar implementation over in decomposer I can pull stuff out of, although mine is simpler because it is immutable, so it's not just a copy and paste).

We don't have any benchmarking code anywhere yet, do we? Is there a JIRA ticket open for that already?
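To make the first bullet concrete, here is roughly the shape of the specialized dot(Vector) I have in mind. This is only a sketch against what I'm guessing the current classes look like: the "values" field, the getValues() accessor, and getQuick() on the dense side are assumptions, and cardinality checks are omitted.

{code}
  @Override
  public double dot(Vector x) {
    // Assumed: SparseVector holds an OrderedIntDoubleMapping in a field named
    // "values", exposing getIndices(), getValues() and numMappings().
    final int[] myIdx = values.getIndices();
    final double[] myVal = values.getValues();
    final int mySize = values.numMappings();

    if (x instanceof SparseVector) {
      OrderedIntDoubleMapping otherValues = ((SparseVector) x).values;
      final int[] otherIdx = otherValues.getIndices();
      final double[] otherVal = otherValues.getValues();
      final int otherSize = otherValues.numMappings();

      // One linear merge over the two sorted index arrays:
      // O(nnz(this) + nnz(other)), no binary searches, no Element objects.
      double result = 0.0;
      int i = 0;
      int j = 0;
      while (i < mySize && j < otherSize) {
        if (myIdx[i] == otherIdx[j]) {
          result += myVal[i] * otherVal[j];
          i++;
          j++;
        } else if (myIdx[i] < otherIdx[j]) {
          i++;
        } else {
          j++;
        }
      }
      return result;
    }

    // DenseVector (or anything else with cheap random access): walk only our
    // non-zeros and index into the other vector directly, O(nnz(this)).
    double result = 0.0;
    for (int i = 0; i < mySize; i++) {
      result += myVal[i] * x.getQuick(myIdx[i]);
    }
    return result;
  }
{code}

plus(Vector) would be the same two-pointer merge, just producing a new vector instead of accumulating a scalar.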
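And for the iterator side, the Element.get() fix amounts to reading straight out of the backing arrays at an offset the iterator already tracks. Again just a sketch: getValues() and the exact Vector.Element contract (index()/get()/set()) are assumptions here.

{code}
  private final class NonZeroIterator implements java.util.Iterator<Vector.Element> {
    // Cache the backing arrays and the mapping count once, in final fields,
    // instead of re-fetching them on every call.
    private final int[] idx = values.getIndices();
    private final double[] vals = values.getValues();
    private final int size = values.numMappings();
    private int offset;

    public boolean hasNext() {
      return offset < size;
    }

    public Vector.Element next() {
      final int current = offset++;
      return new Vector.Element() {
        public int index() {
          return idx[current];
        }

        public double get() {
          // Direct array read at a known offset: no binary search over the
          // index array, unlike values.get(ind).
          return vals[current];
        }

        public void set(double value) {
          vals[current] = value;
        }
      };
    }

    public void remove() {
      throw new UnsupportedOperationException();
    }
  }
{code}

A real patch would probably reuse a single mutable Element rather than allocating an anonymous one per call, but the point is the direct vals[current] read; the AllIterator needs the same treatment.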