RandomAccessSparseVector.dot with another non-sequential vector can be 
extremely non-symmetric in its performance
-----------------------------------------------------------------------------------------------------------------

                 Key: MAHOUT-823
                 URL: https://issues.apache.org/jira/browse/MAHOUT-823
             Project: Mahout
          Issue Type: Improvement
          Components: Math
            Reporter: Eugene Kirpichov


http://codesearch.google.com/#6LK_nEANBKE/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java&l=172

The complexity of the algorithm is O(num nondefault elements in this), while it 
could clearly be O(min(num nondefault in this, num nondefault in x)).

This can be fixed by adding this code before line 189.

{code}
if(x.getNumNondefaultElements() < this.getNumNondefaultElements()) {
  return x.dot(this);
}
{code}

An easy case where this asymmetry is very apparent and makes a huge difference 
in performance is K-Means clustering.

In K-Means for high-dimensional points (e.g. those that arise in text retrieval 
problems), the centroids often have a huge number of non-zero components, 
whereas points have a small number of them.

So, if you make a mistake and use centroid.dot(point) in your code for 
computing the distance, instead of point.dot(centroid), you end up with orders 
of magnitude worse performance (which is what we actually observed - the 
clustering time was a couple of minutes with this fix and over an hour without 
it).

So, perhaps, if you make this fix, quite a few people who had a similar case 
but didn't notice it will suddenly have a dramatic performance increase :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to