[
https://issues.apache.org/jira/browse/MAHOUT-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118721#comment-13118721
]
Sean Owen commented on MAHOUT-823:
----------------------------------
I did the homework, and came up with the following result. The substance of the
patch is now this (as well as removing other dot() implementations):
{code}
// Crude rule of thumb: when a sequential-access vector, with O(log n)
lookups, has about
// 2^n elements, its lookups take longer than a dense / random access
vector (with O(1) lookups) by
// about a factor of (0.71n - 12.3). This holds pretty well from n=19 up to
at least n=23 according to my tests;
// below that lookups are so fast that this difference is near zero.
int thisNumNonDefault = getNumNondefaultElements();
int thatNumNonDefault = x.getNumNondefaultElements();
// Default: dot from smaller vector to larger vector
boolean reverseDot = thatNumNonDefault < thisNumNonDefault;
// But, see if we should override that -- is exactly one of them sequential
access and so slower to lookup in?
if (isSequentialAccess() != x.isSequentialAccess()) {
double log2ThisSize = Math.log(thisNumNonDefault) / LOG2;
double log2ThatSize = Math.log(thatNumNonDefault) / LOG2;
// Only override when the O(log n) factor seems big enough to care about:
if (log2ThisSize >= 19.0 && log2ThatSize >= 19.0) {
double dotCost = thisNumNonDefault;
if (x.isSequentialAccess()) {
dotCost *= 0.71 * log2ThatSize - 12.3;
}
double reverseDotCost = thatNumNonDefault;
if (isSequentialAccess()) {
reverseDotCost *= 0.71 * log2ThisSize - 12.3;
}
reverseDot = reverseDotCost < dotCost;
}
}
if (reverseDot) {
return x.dot(this);
}
{code}
Thoughts?
> RandomAccessSparseVector.dot with another non-sequential vector can be
> extremely non-symmetric in its performance
> -----------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-823
> URL: https://issues.apache.org/jira/browse/MAHOUT-823
> Project: Mahout
> Issue Type: Improvement
> Components: Math
> Affects Versions: 0.5
> Reporter: Eugene Kirpichov
> Assignee: Sean Owen
> Labels: dot, dot-product, vector
> Fix For: 0.6
>
> Attachments: MAHOUT-823.patch
>
>
> http://codesearch.google.com/#6LK_nEANBKE/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java&l=172
> The complexity of the algorithm is O(num nondefault elements in this), while
> it could clearly be O(min(num nondefault in this, num nondefault in x)).
> This can be fixed by adding this code before line 189.
> {code}
> if(x.getNumNondefaultElements() < this.getNumNondefaultElements()) {
> return x.dot(this);
> }
> {code}
> An easy case where this asymmetry is very apparent and makes a huge
> difference in performance is K-Means clustering.
> In K-Means for high-dimensional points (e.g. those that arise in text
> retrieval problems), the centroids often have a huge number of non-zero
> components, whereas points have a small number of them.
> So, if you make a mistake and use centroid.dot(point) in your code for
> computing the distance, instead of point.dot(centroid), you end up with
> orders of magnitude worse performance (which is what we actually observed -
> the clustering time was a couple of minutes with this fix and over an hour
> without it).
> So, perhaps, if you make this fix, quite a few people who had a similar case
> but didn't notice it will suddenly have a dramatic performance increase :)
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira