[ 
https://issues.apache.org/jira/browse/MAHOUT-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118015#comment-13118015
 ] 

Sean Owen commented on MAHOUT-823:
----------------------------------

I think it's a good idea, and can be expanded. Am I right that the goal is to 
compute the dot in the way that will avoid more lookups (in the other vector), 
and/or avoid looking up in the vector with the slower get() operation?

I imagine that avoiding lookups is higher priority, so the rule you suggest is 
something we always want to do. So, for example, let's say you're dotting a 
SequentialAccessSparseVector with 100,000 non-default elements, with a 
RandomAccessSparseVector of 1,000 non-default elements. You'd want 
random-dot-sequential, right? Even though it'll take longer to look up in the 
SequentialAccessSparseVector with binary search, it's saving 100x lookups.

But we now have a rule in the code that prioritizes dotting 
sequential-dot-random. I think we'd just remove that then right?

And then in SequentialAccessSparseVector... I can imagine keeping the 
special-case of sequential-dot-sequential... but somehow it also seems 
overwhelmed by avoiding lookups. And, this special-case ignores DenseVectors, 
which should be similarly treated. So, can it go too?

And then DenseVector's special case for DenseVector seems not that useful.

And then we get to the point where all dot() implementations are the same, and 
include the helpful rule above. This seems happy and tidy -- see attached 
counter-proposal patch, just for discussion. Am I missing something?
                
> RandomAccessSparseVector.dot with another non-sequential vector can be 
> extremely non-symmetric in its performance
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-823
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-823
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Eugene Kirpichov
>              Labels: dot, dot-product, vector
>             Fix For: 0.6
>
>         Attachments: MAHOUT-823.patch
>
>
> http://codesearch.google.com/#6LK_nEANBKE/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java&l=172
> The complexity of the algorithm is O(num nondefault elements in this), while 
> it could clearly be O(min(num nondefault in this, num nondefault in x)).
> This can be fixed by adding this code before line 189.
> {code}
> if(x.getNumNondefaultElements() < this.getNumNondefaultElements()) {
>   return x.dot(this);
> }
> {code}
> An easy case where this asymmetry is very apparent and makes a huge 
> difference in performance is K-Means clustering.
> In K-Means for high-dimensional points (e.g. those that arise in text 
> retrieval problems), the centroids often have a huge number of non-zero 
> components, whereas points have a small number of them.
> So, if you make a mistake and use centroid.dot(point) in your code for 
> computing the distance, instead of point.dot(centroid), you end up with 
> orders of magnitude worse performance (which is what we actually observed - 
> the clustering time was a couple of minutes with this fix and over an hour 
> without it).
> So, perhaps, if you make this fix, quite a few people who had a similar case 
> but didn't notice it will suddenly have a dramatic performance increase :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to