[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-19 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744973#action_12744973
 ] 

Shashikant Kore commented on MAHOUT-121:


OrderedIntDoubleMapping, the primitive map used in SparseVector, is not 
fast enough. Compared to Trove, it is an order of magnitude slower. 

How can we improve the performance? 
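
For context on the gap, here is a hedged sketch of the open-addressing, primitive-keyed approach a Trove-style map takes (illustrative only; this is not Trove's code, and it omits resizing and deletion). The point is that gets avoid boxing and binary search entirely: an expected O(1) probe sequence over primitive arrays.

```java
// A minimal primitive int -> double hash map using linear probing.
// Assumptions (not from the issue): keys are non-negative, capacity is a
// power of two, and the table never fills up (no resize logic).
final class IntDoubleHashMap {
  private static final int EMPTY = -1; // sentinel for an unused slot
  private final int[] keys;
  private final double[] vals;
  private final int mask;

  IntDoubleHashMap(int capacityPow2) {
    keys = new int[capacityPow2];
    vals = new double[capacityPow2];
    java.util.Arrays.fill(keys, EMPTY);
    mask = capacityPow2 - 1;
  }

  void put(int key, double value) {
    int slot = key & mask;
    // linear probing: walk forward until we find the key or an empty slot
    while (keys[slot] != EMPTY && keys[slot] != key) {
      slot = (slot + 1) & mask;
    }
    keys[slot] = key;
    vals[slot] = value;
  }

  double get(int key) {
    int slot = key & mask;
    while (keys[slot] != EMPTY) {
      if (keys[slot] == key) {
        return vals[slot];
      }
      slot = (slot + 1) & mask;
    }
    return 0.0; // absent index of a sparse vector reads as zero
  }
}
```

No objects are allocated per get or put, which is exactly what the boxed `HashMap<Integer, Double>` alternative cannot avoid.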

 Speed up distance calculations for sparse vectors
 -

 Key: MAHOUT-121
 URL: https://issues.apache.org/jira/browse/MAHOUT-121
 Project: Mahout
  Issue Type: Improvement
  Components: Matrix
Reporter: Shashikant Kore
Assignee: Grant Ingersoll
 Attachments: Canopy_Wiki_1000-2009-06-24.snapshot, doc-vector-4k, 
 MAHOUT-121-cluster-distance.patch, MAHOUT-121-distance-optimization.patch, 
 MAHOUT-121-new-distance-optimization.patch, mahout-121.patch, 
 MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, 
 MAHOUT-121.patch, mahout-121.patch, MAHOUT-121jfe.patch, Mahout1211.patch


 From my mail to the Mahout mailing list.
 I am working on clustering a dataset which has thousands of sparse vectors. 
 The complete dataset has a few tens of thousands of feature items, but each 
 vector has only a couple of hundred feature items. For this, there is an 
 optimization in distance calculation, a link to which I found in the 
 archives of the Mahout mailing list.
 http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
 I tried out this optimization. The test setup had 2000 document vectors 
 with a few hundred items each. I ran canopy generation with Euclidean 
 distance and t1, t2 values of 250 and 200.
  
 Current Canopy Generation: 28 min 15 sec.
 Canopy Generation with distance optimization: 1 min 38 sec.
 I know from experience that using Integer and Double objects instead of 
 primitives is computationally expensive. I changed the sparse vector 
 implementation to use primitive collections from Trove [
 http://trove4j.sourceforge.net/ ].
 Distance optimization with Trove: 59 sec
 Current canopy generation with Trove: 21 min 55 sec
 To sum up, these two optimizations reduced cluster generation time by 97%.
 Currently, I have made the changes for Euclidean Distance, Canopy and KMeans. 
  
 Licensing of Trove seems to be an issue which needs to be addressed.
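
The distance optimization referenced above can be sketched as follows. This is a minimal illustration under the assumption that sparse vectors are stored as sorted parallel index/value arrays (the names and layout are hypothetical, not Mahout's actual classes): since d(a,b)^2 = |a|^2 + |b|^2 - 2*(a.b), precomputing each vector's squared norm once means every distance call only walks the few hundred non-zero entries.

```java
// Sketch of sparse squared Euclidean distance via precomputed norms.
// Vectors are (sorted index array, value array) pairs; absent indices are 0.
final class SparseDistance {
  // Merge-walk two sorted index lists; only shared indices contribute.
  static double dot(int[] ia, double[] va, int[] ib, double[] vb) {
    double sum = 0.0;
    int i = 0, j = 0;
    while (i < ia.length && j < ib.length) {
      if (ia[i] == ib[j]) {
        sum += va[i++] * vb[j++];
      } else if (ia[i] < ib[j]) {
        i++;
      } else {
        j++;
      }
    }
    return sum;
  }

  // Computed once per vector and cached by the caller.
  static double normSquared(double[] v) {
    double sum = 0.0;
    for (double x : v) {
      sum += x * x;
    }
    return sum;
  }

  // |a|^2 + |b|^2 - 2*(a.b), touching only non-zero entries.
  static double distanceSquared(int[] ia, double[] va, double normA2,
                                int[] ib, double[] vb, double normB2) {
    return normA2 + normB2 - 2.0 * dot(ia, va, ib, vb);
  }
}
```

With centroids' norms cached across a whole canopy or k-means pass, the per-pair cost drops from the full dimensionality to the document's non-zero count, which matches the order-of-magnitude speedup reported above.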

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744980#action_12744980
 ] 

Sean Owen commented on MAHOUT-121:
--

These two implementations use different algorithms. This implementation is the 
analog of TreeSet in Java Collections, whereas you are talking about an analog 
of HashMap.

A hash is likely going to be faster for gets. I imagine it's slower when 
required to produce the keys/values in order. Which did you benchmark? Let's 
make sure we are asking the right question first.

Of course, if we find that in fact we far more often need fast gets than fast 
iteration, we do want a hash-based implementation instead. Is that the 
conclusion? (And could we make a new issue to continue that!)

We can't use Trove of course. The parts of colt we need appear to be 
Apache-compatible. That seems like a fine thing to try next.

If it's not suitable, we can easily roll our own primitive-based hash. We 
already have a long-Object map implementation.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-19 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744991#action_12744991
 ] 

Shashikant Kore commented on MAHOUT-121:



We need a mix of iteration and map get/set, but the map operations are 
dominant. Hence I tested with get/set operations. Getting an iterator in the 
case of Colt is less than 2x slower. 

1. Vector addition: 
Get iterator on one vector.  
Iterate on the indices to get values from that vector.   
For the same indices, get values from current vector.
Set the result in the current vector. 

In the current implementation, get is an O(log n) operation, and set involves 
an array copy, which makes repeated insertion quadratic (I think).

2. SparseVector getDistanceSquared()
Operational complexity is similar to that of vector addition, except there is 
no set operation.
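
The cost profile described above can be seen in a minimal sketch of a sorted parallel-array map (illustrative; not Mahout's actual OrderedIntDoubleMapping code): get is a binary search, O(log n), and setting a previously absent index must copy both arrays, so building a vector by repeated insertion is O(n) per set and quadratic overall.

```java
// Hypothetical sorted parallel-array sparse map, kept deliberately minimal.
final class SortedIntDoubleMap {
  private int[] indices = new int[0];
  private double[] values = new double[0];

  double get(int index) {
    // O(log n) binary search over the sorted index array
    int pos = java.util.Arrays.binarySearch(indices, index);
    return pos >= 0 ? values[pos] : 0.0;
  }

  void set(int index, double value) {
    int pos = java.util.Arrays.binarySearch(indices, index);
    if (pos >= 0) {
      values[pos] = value; // existing index: cheap overwrite
      return;
    }
    // new index: allocate and copy both arrays to splice it in, O(n)
    int ins = -(pos + 1); // insertion point from binarySearch's contract
    int[] ni = new int[indices.length + 1];
    double[] nv = new double[values.length + 1];
    System.arraycopy(indices, 0, ni, 0, ins);
    System.arraycopy(values, 0, nv, 0, ins);
    ni[ins] = index;
    nv[ins] = value;
    System.arraycopy(indices, ins, ni, ins + 1, indices.length - ins);
    System.arraycopy(values, ins, nv, ins + 1, values.length - ins);
    indices = ni;
    values = nv;
  }
}
```

This layout is excellent for in-order iteration (the merge-walk in distance calculations) but pays for it on random get/set, which is why the hash-based alternative wins the benchmark above.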






[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-19 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744995#action_12744995
 ] 

Shashikant Kore commented on MAHOUT-121:


Trying it out...





[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-19 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745015#action_12745015
 ] 

Grant Ingersoll commented on MAHOUT-121:


Let's open a new issue for this one, as this issue is really long and 
overloaded at this point. 

Solr uses Colt via a download mechanism, so we could probably use it via the 
POM, but I'm not sure if we can distribute it.  

FWIW, in my profiling, I think find() was the major operation.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-19 Thread Rob Eden (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745082#action_12745082
 ] 

Rob Eden commented on MAHOUT-121:
-

Hi guys. I'm the lead developer for the Trove project. Shashi mentioned the 
problem you're having with Trove's license. I appreciate the interest and would 
like to accommodate your usage. I'm going to speak to the original developer 
about dual-licensing.

My specific question is, would the MPL be an acceptable license for usage or 
does it have to be APL? (Not implying that I necessarily have a problem with 
APL... just checking options.)




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-19 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745097#action_12745097
 ] 

Grant Ingersoll commented on MAHOUT-121:


Hi Rob,

Thanks!  Here's the ASF's stance on licenses:  
http://apache.org/legal/resolved.html.   I think for our use case, MPL would be 
fine, since we are just using it as a library and not extending/deriving from 
it, even if ASL would be better.

-Grant




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-19 Thread Benson Margulies (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745101#action_12745101
 ] 

Benson Margulies commented on MAHOUT-121:
-

Trove is LGPL. If it were published to Maven, which it currently is not in any 
reliable way, I think Mahout could have an optional dependency. As per the 
referenced page, LGPL is precluded as a full dependency.





[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12741068#action_12741068
 ] 

Grant Ingersoll commented on MAHOUT-121:


bq. The floating point comparison (lengthSquared != -1) is not correct. It 
should be (lengthSquared > 0), if at all we want to remove the Double object.

Actually, this should be >= 0, since we are returning lengthSquared in that 
case (before, we were returning it if it was not null).
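
The caching pattern under discussion can be sketched like this (a hypothetical stand-in, not the actual SparseVector code): the boxed Double field with a null check is replaced by a primitive double and a negative sentinel. Since a squared length can legitimately be 0.0 but can never be negative, the "already computed" guard needs to be `lengthSquared >= 0`, not an exact comparison against the sentinel.

```java
// Lazily computed, primitively cached squared length.
final class CachedLength {
  private final double[] values;
  private double lengthSquared = -1.0; // sentinel: not yet computed

  CachedLength(double[] values) {
    this.values = values;
  }

  double getLengthSquared() {
    if (lengthSquared >= 0.0) {
      return lengthSquared; // cached; also correct for the 0.0 case
    }
    double sum = 0.0;
    for (double v : values) {
      sum += v * v;
    }
    lengthSquared = sum;
    return sum;
  }
}
```

With `!= -1` the zero-length vector would be recomputed on every call only if the comparison happened to fail, and any recomputation path that stores 0.0 would still re-enter the slow path; `>= 0` avoids both problems.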




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740854#action_12740854
 ] 

Sean Owen commented on MAHOUT-121:
--

Nope, you have not avoided an object allocation in either case. You are simply 
moving around declarations of object references. (Does it help to note that 
objects are never allocated on the stack as in C++?) Therefore, I think the 
prior arguments apply. This very slightly hurts performance and readability and 
should be reversed.
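
Sean's point can be demonstrated directly with a small hypothetical example (not from any patch) that counts constructions: the allocation happens at the `new` expression, not at the declaration site, so hoisting the declaration out of the loop changes nothing.

```java
// Counting allocations to show declaration placement is irrelevant.
final class AllocationDemo {
  static int constructed = 0;

  static final class Point {
    Point() {
      constructed++;
    }
  }

  static int declaredInside(int n) {
    constructed = 0;
    for (int i = 0; i < n; i++) {
      Point p = new Point(); // one allocation per iteration
    }
    return constructed;
  }

  static int declaredOutside(int n) {
    constructed = 0;
    Point p; // hoisted declaration: still one allocation per iteration
    for (int i = 0; i < n; i++) {
      p = new Point();
    }
    return constructed;
  }
}
```

Both variants construct exactly n objects; the only difference is the scope of the reference variable.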




Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-08 Thread Jeff Eastman

Grant Ingersoll (JIRA) wrote:
[ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740573#action_12740573 ] 


Grant Ingersoll commented on MAHOUT-121:


bq. Please, feel free to contradict me here - That was the whole point: 
getStd() is NEVER called.

Ah, I see now.  We were indeed calculating it in computeCentroid lots of 
times, but it is only ever used for DisplayKMeans.  You are correct.


As for loop unrolling, etc., it is usually best to let the compiler take care 
of that.  As for strings, you should never, ever use String concatenation in a 
loop.  It is a horrible performance drain; StringBuilder is definitely the way 
to go.  Especially be on the lookout for String concats in logging statements 
that aren't guarded by if (log.isDebugEnabled()).

I will fix the lengthSquared comparison and commit.

Thanks!
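
The logging advice above can be illustrated with a stripped-down hypothetical sketch (no real logging framework; a boolean parameter stands in for log.isDebugEnabled()): the expensive string is only ever built when the guard passes.

```java
// Demonstrates why debug-log string building should be guarded.
final class LoggingGuard {
  static int buildCount = 0;

  // Stands in for an expensive message; counts how often it actually runs.
  static String expensiveDescription(Object o) {
    buildCount++;
    return "state=" + o; // concat here compiles to a StringBuilder anyway
  }

  // debugEnabled stands in for log.isDebugEnabled()
  static void logDebug(boolean debugEnabled, Object o) {
    if (debugEnabled) { // guard: skip message construction entirely
      System.out.println("DEBUG " + expensiveDescription(o));
    }
  }
}
```

Without the guard, an unconditional `log.debug("state=" + o)` call would build the string on every invocation even when debug output is discarded, which in an inner clustering loop adds up fast.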

  

I added the getStd() method for use in the display examples to show a 
size metric for the clusters consistent with the Dirichlet models. The 
std calculation should, at least, be refactored out of computeCentroid 
so it is only done if needed. If calculating the pointSquaredTotal is a 
source of unacceptable performance overhead, then by all means remove it 
and I will figure out how to display something reasonable about cluster 
size.







[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740519#action_12740519
 ] 

Grant Ingersoll commented on MAHOUT-121:


bq. Declaring variables out of the loop looks like an optimization, but I 
suppose the compiler will do that automatically

This appears to be moving it right out of the computation altogether, though, 
as getStd (we need a better name for that!) isn't even in the computation 
anymore, AFAICT.

 Speed up distance calculations for sparse vectors
 -

 Key: MAHOUT-121
 URL: https://issues.apache.org/jira/browse/MAHOUT-121
 Project: Mahout
  Issue Type: Improvement
  Components: Matrix
Reporter: Shashikant Kore
Assignee: Grant Ingersoll
 Attachments: Canopy_Wiki_1000-2009-06-24.snapshot, doc-vector-4k, 
 MAHOUT-121-cluster-distance.patch, MAHOUT-121-distance-optimization.patch, 
 MAHOUT-121-new-distance-optimization.patch, mahout-121.patch, 
 MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, 
 MAHOUT-121.patch, mahout-121.patch, MAHOUT-121jfe.patch, Mahout1211.patch


 From my mail to the Mahout mailing list.
 I am working on clustering a dataset which has thousands of sparse vectors. 
 The complete dataset has a few tens of thousands of feature items, but each 
 vector has only a couple of hundred feature items. For this, there is an 
 optimization in distance calculation, a link to which I found in the archives 
 of the Mahout mailing list.
 http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
 I tried out this optimization.  The test setup had 2000 document vectors 
 with a few hundred items each.  I ran canopy generation with Euclidean distance 
 and t1, t2 values of 250 and 200.
  
 Current Canopy Generation: 28 min 15 sec.
 Canopy Generation with distance optimization: 1 min 38 sec.
 I know from experience that using Integer and Double objects instead of 
 primitives is computationally expensive. I changed the sparse vector 
 implementation to use primitive collections from Trove [
 http://trove4j.sourceforge.net/ ].
 Distance optimization with Trove: 59 sec
 Current canopy generation with Trove: 21 min 55 sec
 To sum up, these two optimizations reduced cluster generation time by 97%.
 Currently, I have made the changes for Euclidean Distance, Canopy and KMeans. 
  
 Licensing of Trove seems to be an issue which needs to be addressed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
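The distance optimization from the LingPipe post linked in the issue description can be sketched as follows. This is a minimal illustration only, not Mahout's actual SparseVector API: the map-based representation and method names here are assumptions. The idea is that ||x - c||^2 = ||x||^2 - 2(x . c) + ||c||^2, so if ||c||^2 is precomputed once per centroid, each distance call costs one sparse dot product plus two cached norms, instead of a pass over the full dense dimensionality.

```java
import java.util.Map;

public class SparseDistanceSketch {
  static double normSquared(Map<Integer, Double> v) {
    double s = 0.0;
    for (double x : v.values()) {
      s += x * x;
    }
    return s;
  }

  static double dot(Map<Integer, Double> sparse, Map<Integer, Double> other) {
    // Iterate only the nonzeros of the first (sparser) argument.
    double s = 0.0;
    for (Map.Entry<Integer, Double> e : sparse.entrySet()) {
      s += e.getValue() * other.getOrDefault(e.getKey(), 0.0);
    }
    return s;
  }

  public static void main(String[] args) {
    Map<Integer, Double> point = Map.of(0, 3.0, 2, 4.0);
    Map<Integer, Double> centroid = Map.of(0, 1.0, 1, 2.0);

    double cached = normSquared(centroid); // precomputed once per centroid
    double dsq = normSquared(point) - 2.0 * dot(point, centroid) + cached;
    System.out.println(dsq); // (3-1)^2 + (0-2)^2 + (4-0)^2 = 24.0
  }
}
```

Because clustering compares many points against the same few centroids, caching the centroid norms is where the reported 28-minute-to-98-second speedup comes from.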



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740527#action_12740527
 ] 

Sean Owen commented on MAHOUT-121:
--

On the point of variable declaration -- the location of the declaration has no 
impact on performance. A variable is not somehow 'declared' multiple times if it 
sits within a loop. Pulling it out of the loop could even be very slightly 
slower, if it forces you to assign a dummy value at initialization which is 
never used. There is one more consideration when referencing objects: a 
reference that exists, say, only inside a loop goes out of scope when the loop 
finishes, and its referent is immediately GC-able. If it is declared outside 
the loop, the reference exists until the larger containing block finishes -- 
unless you set it to null, but that's more maintenance. That's a waste if the 
referent is large and not actually used anymore. This has actually bitten me 
in the past.

So in general, in Java, declare variables as deep and late as possible. 

Agree with Shashikant that while using double instead of Double is nice, using 
-1 (really, use -1.0 -- double literals belong with double values) as a signal 
value probably isn't right here. Just test < 0.0.
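The sentinel pattern under discussion can be sketched as below. The class and field names here are hypothetical, not Mahout's actual code: cache lengthSquared lazily in a primitive double, and test `< 0.0` rather than an exact floating-point equality against -1.

```java
public class SparseVectorSketch {
  private final double[] values;
  private double lengthSquared = -1.0; // negative means "not computed yet"

  public SparseVectorSketch(double[] values) {
    this.values = values;
  }

  public double getLengthSquared() {
    if (lengthSquared < 0.0) { // sentinel check, not equality against -1
      double sum = 0.0;
      for (double v : values) {
        sum += v * v;
      }
      lengthSquared = sum;     // computed once, reused afterwards
    }
    return lengthSquared;
  }

  public static void main(String[] args) {
    SparseVectorSketch v = new SparseVectorSketch(new double[] {3.0, 4.0});
    System.out.println(v.getLengthSquared()); // 3*3 + 4*4 = 25.0
  }
}
```

A `< 0.0` test is safe because a true squared length is never negative, so no legitimate cached value can be mistaken for the sentinel.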




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740568#action_12740568
 ] 

Nicolás Fantone commented on MAHOUT-121:


Thanks for the feedback, people. There are some things I'd like to point out:

{quote}
??I don't get the getStd() method now. You moved the functionality from 
computeCentroid, but now getStd is never called.??
{quote}

Please, feel free to contradict me here - that was the whole point: getStd() is 
NEVER called. I searched through the entire mahout-core project and no call was 
found. So, why add functionality where it does not belong and, moreover, is not 
needed? If there's a computeCentroid() method, then let it compute a 
centroid... not estimate some deviation (which isn't actually used, anyway) 
hundreds of times iteratively. If there's a real need for that particular 
value, then create a method that calculates it. As there was no such method, 
I thought it was appropriate to use getStd().

{quote}
??I also am not following the commenting out of the check to see if the two 
vectors are the same size??
{quote}

Feel free to uncomment those lines and leave them as they once were. I realize 
that a size check is a needed and healthy thing, but then again, if the purpose 
of the optimized version of distance() is performance gain, perhaps we should 
consider dropping it (there are a couple of methods in the project that 
already do that - check getDistanceSquared() in SparseVector).

{quote}
??Declaring variables out of the loop looks like an optimization, but I 
suppose, compiler will do that automatically. Even if it doesn't, the 
performance penalty is not so significant to sacrifice the code readability 
(which is arguable.)??
{quote}

I have had this discussion dozens of times with my colleagues at the University 
and partners at work. Though I can accept that the root of the issue at hand is 
kind of a myth (whether declaration inside a loop impacts performance 
negatively or not), instantiation of thousands of unnecessary objects IS costly 
in Java - not to mention garbage-accumulative. At the end of the day, avoiding 
this kind of code is, more often than not, a healthy thing to do - even if it 
means a decrease in readability.

{quote}
??I ran a simple test with million loops over float multiplication  runs in 
10ms. Moving the declarations outside the loop didn't change the results.??
{quote}

floats are a primitive type. Want a simple test? Use String, which also happens 
to be immutable. Something as straightforward as,

{{String s = new String("sample-string");}}
{{for (int i = 0; i < 25; i++) { s += anotherString; }}}

will take several minutes to complete thanks to the cost of instantiating a new 
String each iteration with a new value. Now, replace String with StringBuilder, 
+= with append, and voilà: no more instantiations, same result, a few seconds.
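The String-vs-StringBuilder contrast can be made concrete with a small self-contained sketch (the iteration count here is chosen just to keep it quick to run; the asymptotics are what matter):

```java
public class ConcatDemo {
  public static void main(String[] args) {
    int n = 20_000;

    // String is immutable: each += allocates a brand-new String, copying
    // everything accumulated so far -- O(n^2) work and n temporary objects.
    String s = "";
    for (int i = 0; i < n; i++) {
      s += "x";
    }

    // StringBuilder mutates one internal buffer: no per-iteration String
    // allocation, O(n) total work.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < n; i++) {
      sb.append("x");
    }

    System.out.println(s.length() == sb.toString().length()); // true
  }
}
```

Both loops produce the same result; the difference is the number of intermediate objects created along the way.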

{quote}
??The floating point comparison (lengthSquared != -1) is not correct. It 
should be (lengthSquared < 0), if at all we want to remove the Double object.??
{quote}

Agree. You are completely right.


[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740573#action_12740573
 ] 

Grant Ingersoll commented on MAHOUT-121:


bq. Please, feel free to contradict me here - That was the whole point: 
getStd() is NEVER called.

Ah, I see now.  We were indeed calculating it in computeCentroid lots of 
times, but it is only ever used for DisplayKMeans.  You are correct.


As for loop unrolling, etc., it is usually best to let the compiler take care 
of that.  As for strings, you should never, ever use String for repeated 
concatenation.  It is a horrible performance drain.  StringBuilder is 
definitely the way to go.  Especially be on the lookout for String concats in 
logging statements that aren't guarded by if (log.isDebugEnabled())
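The guard pattern mentioned above can be sketched with java.util.logging, whose equivalent of isDebugEnabled() is isLoggable(Level.FINE) (the message contents here are made up for illustration):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class GuardedLogging {
  private static final Logger log = Logger.getLogger("demo");

  public static void main(String[] args) {
    double distance = 42.0;

    // Without the guard, the String concatenation below would run (and
    // allocate) even when FINE logging is disabled. With the guard, the
    // level check skips the concatenation entirely.
    if (log.isLoggable(Level.FINE)) {
      log.fine("distance = " + distance);
    }

    System.out.println("done");
  }
}
```

In a hot loop such as distance computation, an unguarded debug concat pays the full String-building cost on every iteration even when the message is discarded.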

I will fix the lengthSquared comparison and commit.

Thanks!




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740576#action_12740576
 ] 

Grant Ingersoll commented on MAHOUT-121:


What about the commenting out of:
{code}
if (centroid.size() != v.size()) {
{code}

It seems like if we are going to do this in this distance measure, we should do 
it in the other distance measures.  We could push this out a layer via 
documentation and require the caller to do the check.  Otherwise, do we have 
some guarantee that the centroid and the vector have the same cardinality in 
all cases of calls to this method?  I think we do explicitly for the Cluster 
calls, but that isn't the same.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740577#action_12740577
 ] 

Sean Owen commented on MAHOUT-121:
--

Still picking on this loops / declaration issue:

Instantiating unneeded objects is indeed, obviously, a waste. But the changes 
in the last patch (which I think we're discussing?) do not save 
instantiations -- nor do they really add any. I don't follow this point then.

The point illustrated by the String loop example has nothing to do with how 
variables are declared, and everything to do with the difference between String 
and StringBuilder. It doesn't seem to address the point previously raised.

Here is a difference that doesn't matter:

{code}
String s = null;
for (int i = 0; i < 25; i++) { s = "anotherString: " + i; }

for (int i = 0; i < 25; i++) { String s = "anotherString: " + i; }
{code}

In fact the first is ever so slightly worse since it sets s to null, but the 
value is unused. But it is worse for another reason: s continues to point to 
"anotherString: 24" after the loop terminates, which is also pointless.

Hence I would undo that part of the patch unless there is another purpose to it 
I missed.





[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740580#action_12740580
 ] 

Sean Owen commented on MAHOUT-121:
--

And finally, to Grant's point about compilers and loop unrolling (and I may be 
stating things people already know):

- This isn't an example of unrolling, is it?
- javac will never perform optimizations like this, or really of any kind, 
given it is bound to output certain bytecode for a given input program. So I 
tend to code efficiently from the get-go in Java.
- (But ProGuard can, I love ProGuard.)
- But the JIT compiler might at runtime. The theory, and I like it, is that 
this half of the compilation is better done by a process on the platform where 
the target code will run. For instance, it can inline some method calls even 
from non-final classes, even though Java method invocation is always 
polymorphic, if it detects that there don't happen to be any subclasses 
loaded in the JVM at runtime. That would not be possible at compile time.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740587#action_12740587
 ] 

Nicolás Fantone commented on MAHOUT-121:


{quote}
The point illustrated by the String loop example has nothing to do with how 
variables declared, and everything to do with the difference between String and 
StringBuilder. It doesn't seem to address the point previously raised.
{quote}

Not quite right. The difference between String and StringBuilder IS EXACTLY the 
difference between instantiating thousands of objects and re-using just one, 
which is, I believe, the matter at hand here.

{quote}
In fact the first is ever so slightly worse since it sets s to null, but the 
value is unused. But it is worse for another reason: s continues to point to 
anotherString: 24 after the loop terminates, which is also pointless.
{quote}

If you create new Strings in a loop, then you'll have as many objects as 
iterations pointing to "anotherString: 0", "anotherString: 1", ..., 
"anotherString: 121410", and so on, waiting to be garbage-collected - which may 
not even happen in the short term. Even more pointless, following your logic.

{quote}
Hence I would undo that part of the patch unless there is another purpose to it 
I missed.
{quote}

Perhaps someone could run a profiler with and without the latest patch? I tend 
to think the gain in execution speed would not be significant if any at all, as 
some of you have stated. However, unless code readability is a priority, I see 
no harm in changing something that can only help performance.

{quote}
This isn't an example of unrolling is it?
{quote}

That's right. It is not. It is about the cost of instantiation vs. reusability 
of short-lived objects.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740589#action_12740589
 ] 

Sean Owen commented on MAHOUT-121:
--

Are we talking about the same patch? Sections like this?

{code}
+double distance = 0.0;
+Vector clusterCenter = null;
 for (Cluster cluster : clusters) {
-  double distance = measure.distance(point, cluster.getCenter());
-  if (nearestCluster == null || distance < nearestDistance) {
+  clusterCenter = cluster.getCenter();
+  distance = measure.distance(clusterCenter.getLengthSquared(), clusterCenter, point);
+  if (distance < nearestDistance) {
     nearestCluster = cluster;
     nearestDistance = distance;
   }
{code}

The question here is indeed whether to declare distance and clusterCenter 
inside or outside the loop.

While I completely agree with your statements, they do not seem to pertain to 
this.

I think we're not talking about the same thing, since you pick up on the point 
of creating Strings in a loop -- which both my examples did. That is not a 
difference and not germane to the point I have in mind.

I am further making the point that there is not a tradeoff between readability 
and speed here -- putting the declaration outside the loop actually makes both 
worse. I am quite sure this should be reversed.
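The declare-inside-the-loop form being argued for here can be sketched as a self-contained example. Cluster, the distance function, and the field names below are simplified stand-ins for Mahout's types, not the real API:

```java
import java.util.List;

public class NearestClusterSketch {
  // Minimal stand-in so the sketch compiles on its own.
  static final class Cluster {
    final double center;
    Cluster(double center) { this.center = center; }
  }

  static double distance(double a, double b) {
    return Math.abs(a - b);
  }

  static Cluster findNearest(List<Cluster> clusters, double point) {
    Cluster nearestCluster = null;
    double nearestDistance = Double.MAX_VALUE;
    for (Cluster cluster : clusters) {
      // Declared inside the loop: narrowest scope, no dummy initial value,
      // and the reference is GC-able as soon as the iteration ends.
      double d = distance(point, cluster.center);
      if (d < nearestDistance) {
        nearestCluster = cluster;
        nearestDistance = d;
      }
    }
    return nearestCluster;
  }

  public static void main(String[] args) {
    List<Cluster> clusters = List.of(new Cluster(0.0), new Cluster(10.0));
    System.out.println(findNearest(clusters, 8.0).center); // 10.0
  }
}
```

Note that no object is instantiated per iteration in either form; moving the declarations outside the loop widens scope without saving any allocation.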




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740646#action_12740646
 ] 

Nicolás Fantone commented on MAHOUT-121:


Jira is hardly the place to discuss these (almost moral) issues, as they tend 
to be a matter of personal choice most of the time. But I feel the urge to 
reply: some of you have really helped me out here at work with my project. 

{quote}
Are we talking about the same patch? sections like this? 
{quote}

Yes, we are. And while we are taking a look at that quoted piece of code, I 
shall mention that the declaration of distance and clusterCenter outside the 
loop isn't the only thing being changed in the patch. There's also the 
elimination of an unnecessary boolean evaluation in the if statement. 

{quote}
I think we're not talking about the same thing, since you pick up on the point 
of creating Strings in a loop - which both my examples did. 
{quote}

We may not. Perhaps my String example wasn't the happiest of them all - but it 
certainly does show how costly it is in Java just to instantiate and 
initialize an object. Furthermore, I thought I gave a counterargument to what 
you were implying with those examples.

{quote}
I am further making the point that there is not a tradeoff between readability 
and speed here - putting the declaration outside the loop actually makes both 
worse.
{quote}

I can tell none of my propositions were convincing enough. Maybe someone 
else's? 
http://weblogs.java.net/blog/ddevore/archive/2006/08/declare_variabl_1.html




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740654#action_12740654
 ] 

Sean Owen commented on MAHOUT-121:
--

My comments -- which narrowly concern where to declare variables -- are 
relevant to the patch. It is not a matter of choice or style, or morality or 
whatever.

Your point about Strings is undeniable, but is irrelevant to this question. See 
my counter-example for an example which would be relevant to the point I am 
trying to make.

This blog post corroborates exactly what I am saying. Declaring outside the 
loop only incurs an extra initialization. It also makes the point that it is 
not the same thing to *allocate an Object* outside the loop, versus inside. Of 
course -- it's the difference between allocating one object and many. But, that 
is not what is happening in your patch.

I think, then, all evidence suggests the variable declaration changes should be 
reverted. I am making no other claims. For instance, removing the unnecessary 
boolean condition remains a positive change I am sure.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740666#action_12740666
 ] 

Grant Ingersoll commented on MAHOUT-121:


The only thing I'm missing at this point is the question from earlier about 
the removal of the size comparison; then I am ready to commit.
{quote}
What about the commenting out of:
{code}
if (centroid.size() != v.size()) {
{code}
It seems like if we are going to do this in this distance measure, we should do 
it in the other distance measure. We could push this out a layer via 
documentation and require the caller to do the check, if required. Otherwise, 
do we have some guarantee that the centroid and the vector are the same 
cardinality in all cases of calls to this method? I think we do explicitly for 
the Cluster calls, but that isn't the same.
{quote}
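For illustration, the "push it out a layer" option could look something like the following caller-side check (a hypothetical helper, not Mahout API):

```java
public class CardinalityCheck {
  // Hypothetical helper: callers verify cardinality before invoking the
  // optimized distance method, instead of the method checking internally.
  static void requireSameCardinality(int centroidSize, int vectorSize) {
    if (centroidSize != vectorSize) {
      throw new IllegalArgumentException(
          "cardinality mismatch: " + centroidSize + " != " + vectorSize);
    }
  }

  public static void main(String[] args) {
    requireSameCardinality(100, 100); // same cardinality: no exception
    try {
      requireSameCardinality(100, 99);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage()); // prints "cardinality mismatch: 100 != 99"
    }
  }
}
```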






[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740681#action_12740681
 ] 

Nicolás Fantone commented on MAHOUT-121:


Sean, we are definitely not following each other. Probably due to my lack of 
communication skills.

{quote}
Your point about Strings is undeniable, but is irrelevant to this question. See 
my counter-example for an example which would be relevant to the point I am 
trying to make.
{quote}

How could it be irrelevant when it's the exact same point I tried to make in 
my patch? There's a Vector being instantiated and allocated in every loop 
iteration, in every task, in every reduce job, on every node of the cluster. 
And it is not necessary. Every time. The very same thing goes for my String 
example... except with Strings.

{quote}
Declaring outside the loop only incurs an extra initialization.
{quote}

Extra? There's only ONE initialization. If the declaration is done inside the 
loop, thousands of initializations are going to be done. That's thousands minus 
one extra initializations.

Grant, maybe you should leave the size comparison for now. It won't impact 
speed noticeably and, as of now, KMeans is only using the optimized distance 
calculation for both computing convergence and emitting points. Is there 
anywhere else a size check is done between input vectors? I believe there isn't.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740684#action_12740684
 ] 

Sean Owen commented on MAHOUT-121:
--

Where in the patch do you move an object allocation from inside the loop to 
outside? I do not see this. If I did, indeed, you would have a point. That's 
why we may not be talking about the same thing. All I see is moving 
declarations.

Yes, the one extra initialization is trivial. The fact that a reference lives 
on outside the loop unnecessarily is also tiny, but not as tiny. But they are 
both negatives. I think the consensus was that it is also very slightly less 
readable. So we have a couple of small negatives -- how can that add up to 
support for the change?




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740694#action_12740694
 ] 

Nicolás Fantone commented on MAHOUT-121:


{quote}
Where in the patch do you move an object allocation from inside the loop to 
outside? I do not see this. 
{quote}

Here:
{code:title=getDistanceSquared() in SparseVector.java - Original 
|borderStyle=solid}
double result = 0.0;
Iterator<Vector.Element> iter = iterateNonZero();
while (iter.hasNext()) {
  Vector.Element elt = iter.next();
  double centroidValue = v.getQuick(elt.index());
  double delta = elt.get() - centroidValue;
  result += (delta * delta) - (centroidValue * centroidValue);
}
return result;
{code}

{code:title=getDistanceSquared() in SparseVector.java - Patched 
|borderStyle=solid}
  ... 
double result = 0.0;
Iterator<Vector.Element> iter = iterateNonZero();
Vector.Element elt = null;
double centroidValue = 0.0;
double delta = 0.0;
while (iter.hasNext()) {
  elt = iter.next();
  centroidValue = v.getQuick(elt.index());
  delta = elt.get() - centroidValue;
  result += (delta * delta) - (centroidValue * centroidValue);
}
...
{code}

And here:

{code:title=emitPointToNearestCluster() in Cluster.java - Patched 
|borderStyle=solid}
 ...
double distance = 0.0;
Vector clusterCenter = null;
 for (Cluster cluster : clusters) {
  clusterCenter = cluster.getCenter();
  distance = measure.distance(clusterCenter.getLengthSquared(), 
clusterCenter, point);
  if (distance < nearestDistance) {
 nearestCluster = cluster;
 nearestDistance = distance;
   }
...
{code}

{code:title=emitPointToNearestCluster() in Cluster.java - From Grant patch 
|borderStyle=solid}
 Cluster nearestCluster = null;
 double nearestDistance = Double.MAX_VALUE;
 for (Cluster cluster : clusters) {
  Vector clusterCenter = cluster.getCenter();
  double distance = measure.distance(clusterCenter.getLengthSquared(), 
clusterCenter, point);
   if (nearestCluster == null || distance < nearestDistance) {
 nearestCluster = cluster;
 nearestDistance = distance;
{code}

And in outputPointWithClusterInfo(), which is actually the same bit of code as 
this last one.
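For context, the getDistanceSquared() snippets above implement the sparse-distance identity from the LingPipe post: ||x - c||^2 = ||c||^2 + sum over the non-zero x_i of ((x_i - c_i)^2 - c_i^2), so only the non-zero entries of the sparse vector are visited. A self-contained sketch of the idea (dense centroid, map-backed sparse point; these are illustrative types, not the Mahout classes):

```java
import java.util.Map;
import java.util.TreeMap;

public class SparseDistanceSketch {
  // Dense centroid, sparse point represented as index -> non-zero value.
  static double distanceSquared(double[] centroid, Map<Integer, Double> point) {
    // ||c||^2 would be cached per centroid in the real optimization.
    double centroidLengthSquared = 0.0;
    for (double c : centroid) {
      centroidLengthSquared += c * c;
    }
    // Correction term: iterate only the non-zero entries of the point.
    double result = 0.0;
    for (Map.Entry<Integer, Double> e : point.entrySet()) {
      double centroidValue = centroid[e.getKey()];
      double delta = e.getValue() - centroidValue;
      result += (delta * delta) - (centroidValue * centroidValue);
    }
    return centroidLengthSquared + result;
  }

  public static void main(String[] args) {
    double[] centroid = {1.0, 2.0, 0.0};
    Map<Integer, Double> point = new TreeMap<>();
    point.put(0, 3.0); // point = (3, 0, 0)
    // ||x - c||^2 = (3-1)^2 + (0-2)^2 + 0 = 8
    System.out.println(distanceSquared(centroid, point)); // prints 8.0
  }
}
```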




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740814#action_12740814
 ] 

Grant Ingersoll commented on MAHOUT-121:


I committed an in-between version of MAHOUT-121-new-distance-optimization and 
MAHOUT-121-cluster-distance.patch. The former didn't pass tests due to not 
checking to see if the nearestCluster was null.  I did, however, move that 
check for null to the second part of the || clause, as it is the less frequent 
case and is likely to be short-circuited by the distance check.

Committed revision 802283.
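The reordering described above can be sketched as follows (a hypothetical extract, not the committed code): with the common case first, the null check is evaluated only when the distance comparison fails.

```java
public class ShortCircuitSketch {
  // Common case (a strictly closer cluster) first; the rare nearestCluster
  // null check is short-circuited whenever the distance test succeeds.
  static boolean isNewNearest(double distance, double nearestDistance,
                              Object nearestCluster) {
    return distance < nearestDistance || nearestCluster == null;
  }

  public static void main(String[] args) {
    // First iteration: nearestCluster is null, so the point is always taken.
    System.out.println(isNewNearest(3.0, 2.0, null)); // prints "true"
    // Later iterations: only a strictly smaller distance wins.
    System.out.println(isNewNearest(3.0, 2.0, new Object())); // prints "false"
  }
}
```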




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-06 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740382#action_12740382
 ] 

Grant Ingersoll commented on MAHOUT-121:


Nicolas,

I don't get the getStd() method now.  You moved the functionality from 
computeCentroid, but now getStd is never called, except in DisplayKMeans, which 
isn't part of the main execution path.  I also am not following the commenting 
out of the check to see if the two vectors are the same size, but then again, 
my eyes are glazing over in need of sleep.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-06 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740420#action_12740420
 ] 

Shashikant Kore commented on MAHOUT-121:


Declaring variables outside the loop looks like an optimization, but I suppose 
the compiler will do that automatically. Even if it doesn't, the performance 
penalty is not significant enough to sacrifice code readability (which is 
arguable). I ran a simple test: a million loop iterations of float 
multiplication run in about 10 ms. Moving the declarations outside the loop 
didn't change the results.
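A rough sketch of such a microbenchmark (illustrative only: it ignores JIT warm-up, and the timing is machine-dependent):

```java
public class LoopBenchSketch {
  // Multiply a float n times with the loop variable declared inside the loop;
  // a variant with the declaration hoisted compiles to effectively the same code.
  static float multiplyLoop(int n) {
    float product = 1.0f;
    for (int i = 0; i < n; i++) {
      float f = 1.0000001f; // declared each iteration
      product *= f;
    }
    return product;
  }

  public static void main(String[] args) {
    long start = System.nanoTime();
    float result = multiplyLoop(1_000_000);
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    System.out.println("result=" + result + " in " + elapsedMs + " ms");
  }
}
```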

The floating point comparison (lengthSquared != -1) is not correct. It should 
be (lengthSquared < 0), if we want to remove the Double object at all. 

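The suggested comparison amounts to caching the squared length in a primitive field with a negative sentinel. A hypothetical sketch (not the SparseVector code; it assumes a squared length is never negative):

```java
public class LengthCache {
  private final double[] values;
  private double lengthSquared = -1.0; // sentinel: not yet computed

  public LengthCache(double[] values) {
    this.values = values;
  }

  public double getLengthSquared() {
    if (lengthSquared < 0.0) { // safer than != -1.0 for floating point
      double sum = 0.0;
      for (double v : values) {
        sum += v * v;
      }
      lengthSquared = sum; // computed once, reused on later calls
    }
    return lengthSquared;
  }

  public static void main(String[] args) {
    LengthCache cache = new LengthCache(new double[]{3.0, 4.0});
    System.out.println(cache.getLengthSquared()); // prints 25.0
  }
}
```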




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-05 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12739712#action_12739712
 ] 

Grant Ingersoll commented on MAHOUT-121:


Nicolas, any word on your update?  Otherwise I will just apply my patch




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-01 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12737875#action_12737875
 ] 

Grant Ingersoll commented on MAHOUT-121:


Hi Nicolas,

I'm getting an error applying your patch:
{quote}
 patch -p 0 -i ../patches/MAHOUT-121-distance-optimization.patch --dry-run
patching file 
core/src/main/java/org/apache/mahout/clustering/kmeans/Cluster.java
patch:  malformed patch at line 46: @@ -323,6 +327,10 @@
{quote}

I create my patches by going to the top directory of mahout and doing:
svn diff > ../mypatch.patch




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-01 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12737956#action_12737956
 ] 

Nicolás Fantone commented on MAHOUT-121:


Um... weird. That's the way I created the patch, too. I'm going to revise it on 
Monday at work and upload a new one. (Maybe it's just an empty line or a 
newline that I erased from the original .java and that isn't translating 
correctly into the patch?)




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-07-29 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12736627#action_12736627
 ] 

Nicolás Fantone commented on MAHOUT-121:


Hi all - I've just signed up to Jira!

Before committing this patch, could any of you take a look at my latest mail in 
the Mahout mailing list (it's a bit too long to copy/paste here)? With some 
enlightenment, I could create a patch from my work and we may be able to make 
bigger improvements for KMeans.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-07-28 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12736036#action_12736036
 ] 

Grant Ingersoll commented on MAHOUT-121:


Looks like we missed some opportunities for using our new distance 
capabilities.  Will try to get a patch out soon.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-07-28 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12736460#action_12736460
 ] 

Shashikant Kore commented on MAHOUT-121:


Grant,

The latest patch brings back the changes from my first patch. This can go in.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-26 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724548#action_12724548
 ] 

Grant Ingersoll commented on MAHOUT-121:


bq. I see, EuclideanDistanceMeasure still returns the sqrt value and other 
calculations are moved to SquaredEuclideanDistanceMeasure. If someone is 
interested in only relative measure, using SquaredEuclideanDistanceMeasure is a 
faster solution. The distance value returned by EuclideanDistanceMeasure still 
continues to be the same. Am I missing something here?

Nope, you're not missing anything.  I added the SquaredEDM class yesterday per 
this discussion.




Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-26 Thread Ted Dunning
No.  You didn't miss anything.

The point of EDM is to be strictly correct relative to the definition, to
not confuse people and yet still provide a pointer to SEDM.  The point of
SEDM is for speed.  This is all just as you said.

On Fri, Jun 26, 2009 at 6:29 AM, Shashikant Kore (JIRA) j...@apache.org wrote:

 Didn't get your question. I see, EuclideanDistanceMeasure still returns the
 sqrt value and other calculations are moved to
 SquaredEuclideanDistanceMeasure. If someone is interested in only relative
 measure, using SquaredEuclideanDistanceMeasure is a faster solution.  The
 distance value returned by EuclideanDistanceMeasure still continues to be
 the same. Am I missing something here?
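The split described here — a squared measure that does the real work, with the true Euclidean measure layered on top — might look roughly like this. These are simplified stand-ins for Mahout's DistanceMeasure classes, not the actual implementations.

```java
public class DistanceSketch {

  interface DistanceMeasure {
    double distance(double[] a, double[] b);
  }

  // SEDM: fast, no square root. Suitable when only the relative
  // ordering of distances matters (e.g. nearest-centroid assignment).
  static class SquaredEuclideanDistanceMeasure implements DistanceMeasure {
    @Override
    public double distance(double[] a, double[] b) {
      double sum = 0.0;
      for (int i = 0; i < a.length; i++) {
        double d = a[i] - b[i];
        sum += d * d;
      }
      return sum;
    }
  }

  // EDM: strictly correct relative to the definition. It merely wraps
  // the squared measure and applies sqrt at the end.
  static class EuclideanDistanceMeasure extends SquaredEuclideanDistanceMeasure {
    @Override
    public double distance(double[] a, double[] b) {
      return Math.sqrt(super.distance(a, b));
    }
  }

  public static void main(String[] args) {
    double[] a = {0.0, 0.0};
    double[] b = {3.0, 4.0};
    System.out.println(new SquaredEuclideanDistanceMeasure().distance(a, b)); // 25.0
    System.out.println(new EuclideanDistanceMeasure().distance(a, b));        // 5.0
  }
}
```

Since sqrt is monotonic, any "find the closest" comparison gives the same answer with either measure; only absolute distance values differ.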



Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-26 Thread Ted Dunning
Not irrelevant at all.  It is a very good point.

It means that the strict triangle optimization cannot be applied with the
no-square-root optimization.

There is a stricter bound that you can use with squared distances (r_12^2 -
3 r_x1^2 as opposed to r_12 - r_x1) but using that spoils the generality of
the k-means algorithm.  We could also push a method into some distances
using an additional interface something like HasTriangleBound and allow the
distance to compute the bound, but this is getting pretty far out in the
weeds.

On Fri, Jun 26, 2009 at 6:35 AM, Sean Owen (JIRA) j...@apache.org wrote:


[
 https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724520#action_12724520]

 Sean Owen commented on MAHOUT-121:
 --

 This may be irrelevant -- haven't thought it through -- since someone
 mentioned using the triangle inequality to optimize some stuff earlier, I
 wonder if it is a problem that a squared-distance measure no longer
 satisfies this inequality? That is, it is not true that the square of one
 side is less than the sum of squares of the other two sides.
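The point above is easy to check numerically: true Euclidean distance obeys the triangle inequality, but squared distance does not. A tiny 1-D sketch with three collinear points 0, 1, 2 makes the violation concrete.

```java
public class TriangleCheck {
  // 1-D Euclidean distance and its square.
  static double d(double a, double b)  { return Math.abs(a - b); }
  static double d2(double a, double b) { double x = a - b; return x * x; }

  public static void main(String[] args) {
    // Euclidean: d(0,2) = 2 <= d(0,1) + d(1,2) = 1 + 1  (holds, with equality)
    System.out.println(d(0, 2) <= d(0, 1) + d(1, 2));     // true
    // Squared:   d2(0,2) = 4 >  d2(0,1) + d2(1,2) = 2   (inequality fails)
    System.out.println(d2(0, 2) <= d2(0, 1) + d2(1, 2));  // false
  }
}
```

So any pruning that relies on the triangle inequality must either work with true distances or use a bound adapted to squared distances, as the follow-up discusses.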




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-25 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12723980#action_12723980
 ] 

Shashikant Kore commented on MAHOUT-121:


Grant, 

I was trying to verify the patch, but for some reason my hadoop setup refuses 
to run, failing with an "Error reading task output" message. I couldn't get it 
working despite a lot of effort, and will try again. If you could post the 
performance numbers before and after the distance calculation improvement, it 
would help to verify that they are in line with expectations.





[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-25 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724075#action_12724075
 ] 

Grant Ingersoll commented on MAHOUT-121:


I haven't verified this, but intuitively, it doesn't seem all that 
unreasonable.  Presumably your vectors are a lot more sparse now.

BTW, I'll add the min #docs, % as options to the Driver.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-25 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724228#action_12724228
 ] 

Grant Ingersoll commented on MAHOUT-121:


BTW, any objection to me removing the sqrt calculation from 
EuclideanDistanceMeasure?




Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-25 Thread Ted Dunning
That is one option.

Another is to have a separate distance measure that doesn't have the square
root.

On Thu, Jun 25, 2009 at 12:40 PM, Grant Ingersoll (JIRA) j...@apache.org wrote:


[
 https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724228#action_12724228]

 Grant Ingersoll commented on MAHOUT-121:
 

 BTW, any objection to me removing the sqrt calculation from
 EuclideanDistanceMeasure?




Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-25 Thread Ted Dunning
That is what I meant.

As for name, how about SquaredEuclideanDistance?

On Thu, Jun 25, 2009 at 1:10 PM, Grant Ingersoll gsing...@apache.org wrote:

 I suppose we could have a PseudoEuclideanDM class that doesn't do the
 square root from which the Euclidean merely wraps and gives back the sqrt.
  Actually, I like that solution, except the name part of it


Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-25 Thread Grant Ingersoll


On Jun 25, 2009, at 4:10 PM, Grant Ingersoll wrote:

On Jun 25, 2009, at 3:41 PM, Ted Dunning wrote:

That is one option.

Another is to have a separate distance measure that doesn't have the square
root.

Yeah, but that implies that we have one interface for all the other
DistanceMeasures and one for Euclidean, or, I suppose we could have a
PseudoEuclideanDM class that doesn't do the square root, which the Euclidean
merely wraps and gives back the sqrt.  Actually, I like that solution, except
the name part of it...

I'm going to go with the base class approach.  Seems like that should work.
Committing shortly.


Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-25 Thread Jeff Eastman

+1

Ted Dunning wrote:

That is what I meant.

As for name, how about SquaredEuclideanDistance?

On Thu, Jun 25, 2009 at 1:10 PM, Grant Ingersoll gsing...@apache.org wrote:

I suppose we could have a PseudoEuclideanDM class that doesn't do the
square root from which the Euclidean merely wraps and gives back the sqrt.
Actually, I like that solution, except the name part of it




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-25 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724278#action_12724278
 ] 

Grant Ingersoll commented on MAHOUT-121:


Committed revision 788504.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-24 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12723732#action_12723732
 ] 

Grant Ingersoll commented on MAHOUT-121:


Note also, on that snapshot that I uploaded, I was running w/ MAHOUT-139.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721646#action_12721646
 ] 

Sean Owen commented on MAHOUT-121:
--

Since I am not hearing objections, and cognizant that people are waiting on 
this, I am going to commit. If there are issues we can roll back or tweak from 
there.




Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-17 Thread Ted Dunning
Shashikant,

How does k-means treat your data?

On Tue, Jun 16, 2009 at 10:12 PM, Shashikant Kore (JIRA) j...@apache.org wrote:

 This is voodoo for me. The dataset I am working with has a window of
 0.05 in which the result changes from 0 canopies to 3,000 canopies.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-17 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720627#action_12720627
 ] 

Grant Ingersoll commented on MAHOUT-121:


After the patch, it was pretty fast, but like I said, I haven't evaluated 
correctness just yet.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-17 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720693#action_12720693
 ] 

Grant Ingersoll commented on MAHOUT-121:


FYI, Sean, can you generate the patch from within the trunk directory? That 
will make applying it easier for everyone else.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-17 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720694#action_12720694
 ] 

Grant Ingersoll commented on MAHOUT-121:


I'm getting a lot of failures when applying this patch.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-17 Thread Jeff Eastman (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720764#action_12720764
 ] 

Jeff Eastman commented on MAHOUT-121:
-

A bit premature perhaps. There is still an unresolved cardinality() reference 
in 
/Mahout/utils/src/test/java/org/apache/mahout/utils/vectors/lucene/LuceneIterableTest.java
and the VectorTest testSparseVectorTimesX fails intermittently. AFAICT, none 
of the patches include /utils, and its compile fails after the unit tests and 
packaging have succeeded.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-17 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720810#action_12720810
 ] 

Grant Ingersoll commented on MAHOUT-121:


bq. I see, i did not even know this utils/ directory was added. I'll get it 
into my client and update it accordingly. 

Just this morning...  Can you tell I've got some time for Mahout lately?  ;-)

bq. To avoid confusion... I suppose let me keep the 'official' version of the 
patch for now since I need to make some fixes. It is still substantially 
identical to the last couple that have been posted, so, reviewing those is 
valid for anyone who cares.

All yours, but just so you know, this is a really hot topic for me and I've got 
a deadline associated with it, so I can help if need be.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-16 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720123#action_12720123
 ] 

Shashikant Kore commented on MAHOUT-121:


Apologies for posting incorrect results in previous comments. I had applied 
Sean's patch to code which already had the distance-calculation optimization.

When I applied this patch to the trunk, it took 2 hr 20 min to execute. I 
couldn't complete the run of the unpatched trunk code as there was trouble 
with my machine, but it still wasn't finished after 4 hours.

I will re-run and post the correct results. 





[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720159#action_12720159
 ] 

Sean Owen commented on MAHOUT-121:
--

While you are at it, what are you running to load test? That way I can focus 
on optimizing construction of the vectors in that path to achieve a hopefully 
better result.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-16 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720246#action_12720246
 ] 

Shashikant Kore commented on MAHOUT-121:


I have document vectors created from some internal text.  I create canopies 
from those vectors.  Centroid calculation is a suspect area. 




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-16 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720498#action_12720498
 ] 

Shashikant Kore commented on MAHOUT-121:


Grant, 

I am trying out the wikipedia vectors with the parameters you gave. It is 
running slowly: only 8% map completion after 18 minutes. Can you please post 
the performance before and after applying the OrderedIntDoubleMapping patch?

[OT] Also, I was wondering how you came up with the t1 and t2 values of 1.3 
and 1.0. This is voodoo for me. The dataset I am working with has a window of 
0.05 in which the result changes from 0 canopies to 3,000 canopies.
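For context on why t1 and t2 are so sensitive: in canopy clustering, points within t2 of a chosen center are removed from the candidate pool, while the looser t1 (> t2) governs which points get assigned to each canopy. A toy sketch of the center-selection loop (not Mahout's implementation; t1-based membership assignment is omitted):

```java
import java.util.ArrayList;
import java.util.List;

// Toy canopy center selection. Each pass removes every candidate within
// t2 of the new center, so a small change in t2 relative to the typical
// inter-point distance can swing the canopy count dramatically.
public class CanopySketch {

  static List<double[]> centers(List<double[]> points, double t2) {
    List<double[]> candidates = new ArrayList<>(points);
    List<double[]> centers = new ArrayList<>();
    while (!candidates.isEmpty()) {
      double[] center = candidates.remove(0);  // pick the next candidate
      centers.add(center);
      candidates.removeIf(p -> dist(p, center) < t2);  // prune nearby points
    }
    return centers;
  }

  static double dist(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }
}
```

If t2 is below nearly all pairwise distances, almost every point becomes a center; once t2 crosses the typical distance, the count collapses, which matches the narrow 0.05 window described above.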






[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-15 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719462#action_12719462
 ] 

Shashikant Kore commented on MAHOUT-121:


+1 to Grant's suggestion that we split this into two issues.

Sean, doubling the vector size looks aggressive. In a recent experiment, I ran 
into trouble due to such aggressive expansion. We could also optimize memory 
usage and SparseVector initialization for input document vectors by exposing 
some low-level APIs. When a document vector is read, we already know the 
number of elements in it, so it need not go through the iterative set() 
method; that avoids shifting array elements on every insertion. FastIntDouble 
could be initialized from two parallel arrays created by parsing the 
sparse-vector-formatted string.

I will try out your patch and report performance. 
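The bulk-initialization idea could look roughly like this (FastIntDouble is the name used in this thread, but the constructor and the "index:value" parser shown here are hypothetical):

```java
// Sketch of bulk initialization for an ordered int -> double mapping.
// Building from pre-sorted parallel arrays avoids the repeated element
// shifting that one-at-a-time set() calls would incur.
public class FastIntDouble {
  private final int[] indices;
  private final double[] values;

  // Caller guarantees indices are sorted ascending and both arrays have
  // the same length; no copying or shifting is needed.
  public FastIntDouble(int[] indices, double[] values) {
    if (indices.length != values.length) {
      throw new IllegalArgumentException("length mismatch");
    }
    this.indices = indices;
    this.values = values;
  }

  // Parse an "index:value index:value ..." formatted string directly
  // into the two parallel arrays (assumes indices appear in order).
  public static FastIntDouble parse(String formatted) {
    String[] pairs = formatted.trim().split("\\s+");
    int[] idx = new int[pairs.length];
    double[] val = new double[pairs.length];
    for (int i = 0; i < pairs.length; i++) {
      String[] parts = pairs[i].split(":");
      idx[i] = Integer.parseInt(parts[0]);
      val[i] = Double.parseDouble(parts[1]);
    }
    return new FastIntDouble(idx, val);
  }

  public double get(int index) {
    int pos = java.util.Arrays.binarySearch(indices, index);
    return pos >= 0 ? values[pos] : 0.0;  // absent entries are zero
  }
}
```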




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-15 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719546#action_12719546
 ] 

Shashikant Kore commented on MAHOUT-121:


Sean, 

Your patch has definitely improved the performance. For my test data set, it 
took 8 min 34 sec to generate canopies; the trunk code has been running for 40 
minutes and the Mapper is still only 85% complete. I increased the default 
size to 1024 and it took 5 min 21 sec.

But this is still higher than Trove's 3 min 34 sec. The array copy on 
addition of a new element to a vector seems to be the culprit; consider 
centroid calculation, where a centroid has thousands of feature items. This 
could be solved only with a hash implementation like Trove, which has 
(amortized) O(1) insertion and lookup time.
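The array-copy cost being described can be seen in a minimal sketch of an ordered parallel-array mapping (illustrative, not Mahout's actual OrderedIntDoubleMapping, which also grows capacity in chunks):

```java
import java.util.Arrays;

// Every insert of a new key into a sorted parallel-array mapping must
// shift the tail of both arrays, making each insert O(n). A primitive
// hash map (e.g. Trove's TIntDoubleHashMap) makes this amortized O(1),
// which matters for centroids accumulating thousands of terms.
public class OrderedInsert {
  int[] indices = new int[0];
  double[] values = new double[0];

  void set(int index, double value) {
    int pos = Arrays.binarySearch(indices, index);
    if (pos >= 0) {
      values[pos] = value;  // existing key: overwrite, no shifting
      return;
    }
    int insertAt = -(pos + 1);  // binarySearch encodes the insertion point
    int[] newIdx = new int[indices.length + 1];
    double[] newVal = new double[values.length + 1];
    System.arraycopy(indices, 0, newIdx, 0, insertAt);
    System.arraycopy(values, 0, newVal, 0, insertAt);
    newIdx[insertAt] = index;
    newVal[insertAt] = value;
    // The tail copy below is the O(n) cost paid for every new element.
    System.arraycopy(indices, insertAt, newIdx, insertAt + 1, indices.length - insertAt);
    System.arraycopy(values, insertAt, newVal, insertAt + 1, values.length - insertAt);
    indices = newIdx;
    values = newVal;
  }
}
```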






[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719632#action_12719632
 ] 

Sean Owen commented on MAHOUT-121:
--

That's good. I expect that, normally, vector updates are relatively 
infrequent; it should not take many such edits to build a vector during 
construction, for instance. What code does the work of creating the vectors, 
so I can add and integrate a faster mechanism there?

We could switch to a hashtable implementation later if that proves necessary.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719094#action_12719094
 ] 

Sean Owen commented on MAHOUT-121:
--

Good point on the overflow, even if it would only happen after the vector had 
over a billion non-zero elements, taking 12GB+ of memory! Arrays.binarySearch() 
has this issue, so it's perhaps best to just keep a fixed version of binary 
search.

Yes, there needs to be a way to iterate over the elements directly and 
quickly. I'm thinking it best to just return the arrays, if we're going for 
speed.

Would it be useful to take a shot at rewriting SparseVector to use this?
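The overflow in question is the classic midpoint computation: `int mid = (low + high) / 2` goes negative once low + high exceeds Integer.MAX_VALUE. A fixed binary search, sketched here, computes the midpoint with an unsigned right shift instead:

```java
// Overflow-safe binary search over a sorted int array. The return-value
// encoding (-(insertionPoint) - 1 for misses) mirrors Arrays.binarySearch.
public class SafeBinarySearch {

  static int search(int[] sorted, int key) {
    int low = 0;
    int high = sorted.length - 1;
    while (low <= high) {
      // (low + high) >>> 1 stays correct even when low + high wraps
      // negative, because the unsigned shift halves the 32-bit sum.
      int mid = (low + high) >>> 1;
      if (sorted[mid] < key) {
        low = mid + 1;
      } else if (sorted[mid] > key) {
        high = mid - 1;
      } else {
        return mid;
      }
    }
    return -(low + 1);  // key not found: encode the insertion point
  }
}
```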




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-13 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719117#action_12719117
 ] 

Grant Ingersoll commented on MAHOUT-121:


bq. Would it be useful to take a shot at rewriting SparseVector to use this?

You could do that, or an alternate implementation.  Is there any case where one 
wouldn't want this?  Also, I wouldn't mind a little better name than 
FastIntDouble.  ;-)




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-13 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719118#action_12719118
 ] 

Grant Ingersoll commented on MAHOUT-121:


Also, seems like we could split out the original two issues that Shashikant 
brought up, right?




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718966#action_12718966
 ] 

Sean Owen commented on MAHOUT-121:
--

Here's a start at that last suggestion. Anyone care to see it finished? I can 
hack it out.

public final class FastIntDouble {

  private static final double DEFAULT_VALUE = 0.0;

  private int[] indices;
  private double[] values;
  private int size;

  public FastIntDouble(int capacity) {
    indices = new int[capacity];
    values = new double[capacity];
    size = 0;
  }

  private void growTo(int newCapacity) {
    if (newCapacity > indices.length) {
      int[] newIndices = new int[newCapacity];
      System.arraycopy(indices, 0, newIndices, 0, size);
      indices = newIndices;
      double[] newValues = new double[newCapacity];
      System.arraycopy(values, 0, newValues, 0, size);
      values = newValues;
    }
  }

  private int find(int index) {
    int low = 0;
    int high = size - 1;
    while (low <= high) {
      // unsigned shift avoids int overflow of (low + high) for huge arrays
      int mid = (low + high) >>> 1;
      int midVal = indices[mid];
      if (midVal < index) {
        low = mid + 1;
      } else if (midVal > index) {
        high = mid - 1;
      } else {
        return mid;
      }
    }
    return -(low + 1);
  }

  public double get(int index) {
    int offset = find(index);
    return offset >= 0 ? values[offset] : DEFAULT_VALUE;
  }

  public void set(int index, double value) {
    int offset = find(index);
    if (offset >= 0) {
      if (value == DEFAULT_VALUE) {
        // remove the entry at offset by shifting the tail left
        System.arraycopy(indices, offset + 1, indices, offset, size - offset - 1);
        System.arraycopy(values, offset + 1, values, offset, size - offset - 1);
        size--;
      } else {
        values[offset] = value;
      }
    } else {
      if (value != DEFAULT_VALUE) {
        if (size >= indices.length) {
          growTo(Math.max(1, size << 1)); // guard against zero initial capacity
        }
        int at = -offset - 1;
        if (size > at) {
          System.arraycopy(indices, at, indices, at + 1, size - at);
          System.arraycopy(values, at, values, at + 1, size - at);
        }
        indices[at] = index;
        values[at] = value;
        size++;
      }
    }
  }

}




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-12 Thread Stephen Green (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719016#action_12719016
 ] 

Stephen Green commented on MAHOUT-121:
--

Don't forget that there is java.util.Arrays.binarySearch (and it looks like 
they've improved it since the last time I looked).

Also, don't forget Josh Bloch's point about binary search: for sufficiently 
large arrays, low + high might overflow an int, which will make your binary 
search unhappy.

Also, also, if you're going to do a lot of finds in a row, you might want to 
provide an iterator-returning method that yields the positions in the array 
one by one, since iterating through the array once will be faster than many 
binary searches.
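
That merge-style iteration can be sketched as a dot product over two 
index-sorted sparse vectors. This is only an illustration of the idea, not 
code from any attached patch; the class and method names are made up:

```java
// Merge-style dot product over two index-sorted sparse vectors, as an
// alternative to one binary search per element. Illustrative sketch only.
public final class SparseDot {

  // indicesA/indicesB must be sorted ascending; parallel to valuesA/valuesB.
  public static double dot(int[] indicesA, double[] valuesA,
                           int[] indicesB, double[] valuesB) {
    double sum = 0.0;
    int i = 0;
    int j = 0;
    // Advance the pointer with the smaller index; multiply on a match.
    while (i < indicesA.length && j < indicesB.length) {
      if (indicesA[i] < indicesB[j]) {
        i++;
      } else if (indicesA[i] > indicesB[j]) {
        j++;
      } else {
        sum += valuesA[i++] * valuesB[j++];
      }
    }
    return sum;
  }
}
```

This runs in O(nA + nB) per pair of vectors, versus O(nA log nB) for repeated 
binary searches.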





Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-11 Thread Ted Dunning
I just asked Luc M.   Should have an answer shortly.

On Wed, Jun 10, 2009 at 5:30 AM, Grant Ingersoll gsing...@apache.orgwrote:


 On Jun 9, 2009, at 5:46 PM, Benson Margulies (JIRA) wrote:


 -

 I could make you a fast sparse vector, but I thought you wanted to wait
 for MTJ?


 I guess the question becomes when is that going to happen?  I don't
 follow Commons Math, so does anyone else know?



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-10 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718026#action_12718026
 ] 

Shashikant Kore commented on MAHOUT-121:


That's right, Grant. Some simple tests showed that autoboxing can take 20x CPU 
and 5x memory compared to operations on primitives.

For sparse vectors, even Trove leaves some room for improvement. The input 
document vectors are read-only for the clustering code, in which case a sparse 
vector could simply be two parallel arrays. Only the sparse vectors used to 
generate centroids need to be expandable.
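
The read-only case described above could look something like the following. 
This is a hypothetical sketch of the two-array idea, not code from the patch; 
the class name is invented for illustration:

```java
import java.util.Arrays;

// Hypothetical read-only sparse vector as two parallel arrays: document
// vectors never change during clustering, so no resizing logic is needed.
public final class ImmutableSparseVector {

  private final int[] indices;   // sorted ascending
  private final double[] values; // parallel to indices

  public ImmutableSparseVector(int[] indices, double[] values) {
    this.indices = indices;
    this.values = values;
  }

  public double get(int index) {
    // Arrays.binarySearch returns a negative insertion point on a miss.
    int offset = Arrays.binarySearch(indices, index);
    return offset >= 0 ? values[offset] : 0.0;
  }
}
```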






Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-10 Thread Grant Ingersoll


On Jun 9, 2009, at 5:46 PM, Benson Margulies (JIRA) wrote:



-

I could make you a fast sparse vector, but I thought you wanted to  
wait for MTJ?


I guess the question becomes when is that going to happen?  I don't  
follow Commons Math, so does anyone else know?


[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-09 Thread Benson Margulies (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717832#action_12717832
 ] 

Benson Margulies commented on MAHOUT-121:
-

I could make you a fast sparse vector, but I thought you wanted to wait for MTJ?




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-05-22 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712193#action_12712193
 ] 

Ted Dunning commented on MAHOUT-121:


 Licensing of Trove seems to be an issue which needs to be addressed.

The licensing of Trove is a non-starter.

Typically, however, you can limit the use to a few cases like Map&lt;Integer, 
Double&gt; and then simply re-implement, or find an Apache-compatible 
implementation.

There is no magic involved in an open hash map in this kind of case.  Even just 
using two arrays with binary search could be acceptable.
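
The two-array layout also combines well with the distance optimization from 
the LingPipe post referenced in the issue: precompute each centroid's squared 
norm once, then each distance needs only a single pass over the document's 
nonzeros. A rough sketch under those assumptions (names are illustrative, not 
from any patch):

```java
// Squared Euclidean distance between a sparse document and a dense centroid,
// using |x - c|^2 = |x|^2 - 2 x.c + |c|^2 with |c|^2 precomputed per centroid.
public final class SparseEuclidean {

  public static double squaredDistance(int[] docIndices, double[] docValues,
                                       double[] centroid, double centroidNormSq) {
    double dot = 0.0;
    double docNormSq = 0.0;
    // One pass over the document's nonzero entries only.
    for (int i = 0; i < docIndices.length; i++) {
      double v = docValues[i];
      docNormSq += v * v;
      dot += v * centroid[docIndices[i]];
    }
    return docNormSq - 2.0 * dot + centroidNormSq;
  }
}
```

The cost per distance becomes proportional to the document's nonzero count 
rather than the full dimensionality, which matches the kind of speedup 
reported in the issue description.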
