[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-03-18 Thread Pallavi Palleti (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846785#action_12846785
 ] 

Pallavi Palleti commented on MAHOUT-153:


Forgot to mention: in this patch, I made the lengthsquared instance variable in 
AbstractVector transient.

 Implement kmeans++ for initial cluster selection in kmeans
 --

 Key: MAHOUT-153
 URL: https://issues.apache.org/jira/browse/MAHOUT-153
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Affects Versions: 0.2
 Environment: OS Independent
Reporter: Panagiotis Papadimitriou
Assignee: Ted Dunning
 Fix For: 0.4

 Attachments: Mahout-153.patch, Mahout-153.patch, 
 MAHOUT-153_RandomFarthest.patch

   Original Estimate: 336h
  Remaining Estimate: 336h

 The current implementation of k-means includes the following algorithms for 
 initial cluster selection (seed selection): 1) random selection of k points, 
 2) use of canopy clusters.
 I plan to implement k-means++. The details of the algorithm are available 
 here: http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf.
 Design Outline: I will create an abstract class SeedGenerator and a subclass 
 KMeansPlusPlusSeedGenerator. The existing class RandomSeedGenerator will 
 become a subclass of SeedGenerator.
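
 A minimal sketch of the design outlined above, assuming points are dense
 double[] arrays compared by squared Euclidean distance. The class shapes and
 the chooseSeeds signature are illustrative assumptions, not Mahout's actual
 API; the selection step uses the D^2 weighting described in the k-means++
 paper linked above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class SeedGeneratorSketch {

  /** Hypothetical base class from the design outline; the signature is an assumption. */
  abstract static class SeedGenerator {
    abstract List<double[]> chooseSeeds(List<double[]> points, int k, Random rng);

    static double squaredDistance(double[] a, double[] b) {
      double sum = 0.0;
      for (int i = 0; i < a.length; i++) {
        double d = a[i] - b[i];
        sum += d * d;
      }
      return sum;
    }
  }

  /** Existing strategy: k points drawn uniformly at random without replacement. */
  static final class RandomSeedGenerator extends SeedGenerator {
    @Override
    List<double[]> chooseSeeds(List<double[]> points, int k, Random rng) {
      List<double[]> shuffled = new ArrayList<>(points);
      Collections.shuffle(shuffled, rng);
      return new ArrayList<>(shuffled.subList(0, k));
    }
  }

  /** k-means++: each new seed is drawn with probability proportional to D(x)^2. */
  static final class KMeansPlusPlusSeedGenerator extends SeedGenerator {
    @Override
    List<double[]> chooseSeeds(List<double[]> points, int k, Random rng) {
      if (k < 1 || k > points.size()) {
        throw new IllegalArgumentException("k must be between 1 and the number of points");
      }
      List<double[]> seeds = new ArrayList<>();
      seeds.add(points.get(rng.nextInt(points.size())));
      // minSq[i] is the squared distance from point i to its nearest seed so far.
      double[] minSq = new double[points.size()];
      Arrays.fill(minSq, Double.POSITIVE_INFINITY);
      while (seeds.size() < k) {
        double[] latest = seeds.get(seeds.size() - 1);
        double total = 0.0;
        for (int i = 0; i < points.size(); i++) {
          minSq[i] = Math.min(minSq[i], squaredDistance(points.get(i), latest));
          total += minSq[i];
        }
        // Uniform fallback, used only when every point coincides with a seed.
        int chosen = rng.nextInt(points.size());
        if (total > 0.0) {
          double target = rng.nextDouble() * total;
          double cumulative = 0.0;
          for (int i = 0; i < points.size(); i++) {
            cumulative += minSq[i];
            if (cumulative > target) {
              chosen = i;
              break;
            }
          }
        }
        seeds.add(points.get(chosen));
      }
      return seeds;
    }
  }

  public static void main(String[] args) {
    List<double[]> points = Arrays.asList(
        new double[] {0, 0}, new double[] {0, 1}, new double[] {10, 10}, new double[] {10, 11});
    for (double[] seed : new KMeansPlusPlusSeedGenerator().chooseSeeds(points, 2, new Random(42))) {
      System.out.println(Arrays.toString(seed));
    }
  }
}
```

 Because a point that coincides with an existing seed has weight zero, it is
 never drawn again while any other point remains at a positive distance.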

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-03-03 Thread Rohini Uppuluri (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840574#action_12840574
 ] 

Rohini Uppuluri commented on MAHOUT-153:



 Hi, 

Please find a brief description of the input and output formats below.
Hope this helps.

--

Input Format:
documentId\tdocument vector


Example line:

338  [s1682, 275:5.0, 478:3.0, 479:5.0, 1:3.0, 474:4.0, 143:2.0,
197:5.0, 196:2.0, 286:4.0, 135:5.0,
86:4.0, 216:4.0, 83:2.0, 213:5.0, 215:3.0, 208:3.0, 269:4.0, 517:5.0, 169:5.0,
654:5.0, 443:5.0,
990:4.0, 175:4.0, 513:5.0, 514:5.0, 650:5.0, 525:4.0, 1124:4.0, 382:5.0,
708:5.0, 497:3.0, 498:4.0,
523:3.0, 427:4.0, 488:5.0, 490:5.0, 189:4.0, 52:5.0, 301:4.0, 607:4.0, 180:4.0, ]


Output Format:
ClusterIdentifier\tClusterIdentifier: clusterCenterVector

Example line:
C0  C0: [s1682, 275:3.0, 1:4.0, 273:5.0, 272:2.0, 3:1.0, 546:4.0, 277:3.0,
276:3.0, 7:5.0, 283:3.0,
282:4.0, 9:1.0, 281:4.0, 12:5.0, 1089:2.0, 13:1.0, 286:1.0, 14:1.0, 15:2.0,
284:3.0, 258:4.0, 17:3.0,
257:5.0, 23:5.0, 25:2.0, 264:4.0, 270:5.0, 271:3.0, 31:5.0, 305:1.0, 1405:3.0,
307:4.0, 39:5.0,
311:3.0, 310:3.0, 515:5.0, 313:5.0, 525:5.0, 315:3.0, 316:4.0, 288:3.0, 50:5.0,
532:4.0, 291:5.0,
292:4.0, 55:5.0, 293:4.0, 294:3.0, 295:5.0, 298:4.0, 56:5.0, 300:5.0, 539:2.0,
302:4.0, 343:4.0,
882:4.0, 340:1.0, 887:4.0, 1025:4.0, 619:3.0, 79:5.0, 347:2.0, 346:4.0,
345:1.0, 344:1.0, 326:4.0,
327:3.0, 1051:4.0, 322:4.0, 323:4.0, 628:2.0, 333:4.0, 331:4.0, 1047:4.0,
328:4.0, 636:4.0, 100:5.0,
98:5.0, 581:4.0, 370:3.0, 591:3.0, 118:5.0, 595:4.0, 117:4.0, 358:2.0, 597:4.0,
127:5.0, 1073:4.0,
603:5.0, 121:5.0, 683:4.0, 413:3.0, 678:4.0, 950:4.0, 405:4.0, 156:5.0,
696:4.0, 1244:4.0, 147:5.0,
690:3.0, 928:3.0, 151:1.0, 924:3.0, 443:4.0, 654:5.0, 925:2.0, 649:4.0,
164:5.0, 642:4.0, 185:5.0,
431:5.0, 905:4.0, 1278:4.0, 176:4.0, 183:5.0, 657:5.0, 898:1.0, 181:4.0,
659:4.0, 1016:4.0, 477:1.0,
751:4.0, 475:4.0, 750:4.0, 203:5.0, 472:2.0, 748:3.0, 471:5.0, 1011:2.0,
466:5.0, 742:5.0, 1013:3.0,
1014:4.0, 762:4.0, 222:5.0, 760:3.0, 460:4.0, 458:3.0, 218:4.0, 237:4.0,
235:3.0, 504:5.0, 717:4.0,
234:4.0, 991:1.0, 233:5.0, 978:2.0, 229:5.0, 226:5.0, 254:1.0, 255:4.0,
252:3.0, 250:5.0, 248:4.0,
245:4.0, ]
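
To make the two formats concrete, here is a small sketch of a parser for such
lines. It assumes that the leading sN token gives the vector cardinality and
that each remaining token is index:value; both readings are inferred from the
example lines, not from the patch itself. The class and field names are
hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class VectorLineParser {

  /** Parsed form of one line; the field names are illustrative. */
  static final class ParsedLine {
    final String id;
    final int cardinality;
    final Map<Integer, Double> entries;

    ParsedLine(String id, int cardinality, Map<Integer, Double> entries) {
      this.id = id;
      this.cardinality = cardinality;
      this.entries = entries;
    }
  }

  /** Parses "id<TAB>[sN, i:v, ...]"; any text between the tab and '[' is ignored. */
  static ParsedLine parse(String line) {
    int tab = line.indexOf('\t');
    int open = line.indexOf('[');
    int close = line.lastIndexOf(']');
    if (tab < 0 || open < tab || close < open) {
      throw new IllegalArgumentException("Expected id<TAB>[...] but got: " + line);
    }
    String id = line.substring(0, tab).trim();
    int cardinality = -1; // stays -1 if no sN token is present
    Map<Integer, Double> entries = new LinkedHashMap<>();
    for (String raw : line.substring(open + 1, close).split(",")) {
      String token = raw.trim();
      if (token.isEmpty()) {
        continue; // the examples end with a trailing ", ]"
      }
      if (token.charAt(0) == 's') {
        cardinality = Integer.parseInt(token.substring(1));
      } else {
        int colon = token.indexOf(':');
        entries.put(Integer.parseInt(token.substring(0, colon)),
            Double.parseDouble(token.substring(colon + 1)));
      }
    }
    return new ParsedLine(id, cardinality, entries);
  }

  public static void main(String[] args) {
    ParsedLine in = parse("338\t[s1682, 275:5.0, 478:3.0, ]");
    System.out.println(in.id + " " + in.cardinality + " " + in.entries);
    ParsedLine out = parse("C0\tC0: [s1682, 275:3.0, 1:4.0, ]");
    System.out.println(out.id + " " + out.cardinality + " " + out.entries);
  }
}
```

The same method handles output lines because it skips the repeated
"ClusterIdentifier: " prefix that precedes the opening bracket.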





 Implement kmeans++ for initial cluster selection in kmeans
 --

 Key: MAHOUT-153
 URL: https://issues.apache.org/jira/browse/MAHOUT-153
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Affects Versions: 0.2
 Environment: OS Independent
Reporter: Panagiotis Papadimitriou
Assignee: Ted Dunning
 Fix For: 0.4

 Attachments: Mahout-153.patch, MAHOUT-153_RandomFarthest.patch

   Original Estimate: 336h
  Remaining Estimate: 336h




[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-02-27 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839263#action_12839263
 ] 

Robin Anil commented on MAHOUT-153:
---

Hi Rohini, do you have a patch ready so that the community can review it?

 Implement kmeans++ for initial cluster selection in kmeans
 --

 Key: MAHOUT-153
 URL: https://issues.apache.org/jira/browse/MAHOUT-153
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Affects Versions: 0.2
 Environment: OS Independent
Reporter: Panagiotis Papadimitriou
Assignee: Ted Dunning
 Fix For: 0.4

 Attachments: Mahout-153.patch

   Original Estimate: 336h
  Remaining Estimate: 336h




[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-02-10 Thread Rohini Uppuluri (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832380#action_12832380
 ] 

Rohini Uppuluri commented on MAHOUT-153:


Hi all,


I have implemented an extension to the algorithm Pallavi mentioned. 
The extension adds some randomness to picking the farthest point. This gives
a chance of overcoming the problem of picking noise points that are very far
away as centroids.

Summary:

 
1. Pick the first centroid randomly.
2. For each of the remaining centroids:
   do
     - Compute a few candidate centroids that are far from the current centroids.
       Candidate centroid computation:
         Divide the data into a few parts.
         For each part, compute the point that is farthest from
         the current list of centroids.
     - Select one of the candidate centroids randomly.
   done
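
The steps above can be sketched as follows. This is only an illustration under
stated assumptions, not the attached patch: points are dense double[] arrays,
the "parts" are contiguous chunks of the input list, and parts that contain
only already-chosen points contribute no candidate. The class and method names
are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

final class RandomFarthestSeeds {

  /** Euclidean distance from p to the closest of the given seeds. */
  static double distanceToNearest(double[] p, List<double[]> seeds) {
    double best = Double.POSITIVE_INFINITY;
    for (double[] s : seeds) {
      double sum = 0.0;
      for (int i = 0; i < p.length; i++) {
        double d = p[i] - s[i];
        sum += d * d;
      }
      best = Math.min(best, Math.sqrt(sum));
    }
    return best;
  }

  static List<double[]> chooseSeeds(List<double[]> points, int k, int parts, Random rng) {
    if (parts < 1 || k < 1 || points.isEmpty()) {
      throw new IllegalArgumentException("need at least one point, one part and k >= 1");
    }
    List<double[]> seeds = new ArrayList<>();
    seeds.add(points.get(rng.nextInt(points.size()))); // step 1: random first centroid
    int chunk = Math.max(1, (points.size() + parts - 1) / parts);
    while (seeds.size() < k) {
      // Step 2a: one candidate per part, the point farthest from the current seeds.
      List<double[]> candidates = new ArrayList<>();
      for (int start = 0; start < points.size(); start += chunk) {
        int end = Math.min(start + chunk, points.size());
        double[] farthest = null;
        double best = 0.0;
        for (int i = start; i < end; i++) {
          double d = distanceToNearest(points.get(i), seeds);
          if (d > best) {
            best = d;
            farthest = points.get(i);
          }
        }
        if (farthest != null) {
          candidates.add(farthest); // parts at distance zero from the seeds are skipped
        }
      }
      if (candidates.isEmpty()) {
        throw new IllegalStateException("fewer than k distinct points in the input");
      }
      // Step 2b: pick one of the candidates at random.
      seeds.add(candidates.get(rng.nextInt(candidates.size())));
    }
    return seeds;
  }

  public static void main(String[] args) {
    List<double[]> points = Arrays.asList(
        new double[] {0, 0}, new double[] {1, 0}, new double[] {9, 9}, new double[] {50, 50});
    for (double[] seed : chooseSeeds(points, 3, 2, new Random(7))) {
      System.out.println(Arrays.toString(seed));
    }
  }
}
```

With more parts there are more candidates per round, so a single distant noise
point is less likely to be the one selected.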

I will soon submit a patch for this. Please let me know your feedback.





Re: [jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-02-09 Thread Pallavi Palleti
I will add my patch within 3 to 4 days. I am done with everything
except that I need to write some test classes.


Thanks
Pallavi

Robin Anil (JIRA) wrote:
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830056#action_12830056 ] 


Robin Anil commented on MAHOUT-153:
---

Any progress on this? Will it be ready soon, or should it be pushed to the 0.4 
release?

  




  


[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-02-09 Thread Pallavi Palleti (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831391#action_12831391
 ] 

Pallavi Palleti commented on MAHOUT-153:


Forgot to mention. The above patch doesn't include test cases. Kindly review.




[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-02-09 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831622#action_12831622
 ] 

Ted Dunning commented on MAHOUT-153:



I have been thinking about this problem a bit, particularly with respect to the 
obvious problems of parallelizing this very sequential algorithm in the context 
of the large sparse data that is common with scalable data mining.

What I have come up with is a map-reduce approximate algorithm that might 
provide good speedup and scalability.   

The idea is that each mapper is essentially doing the k-means++ starting point 
selection on the data split it sees.  The output from each mapper is a set of 
potential starting centroids and the combined output is a (too large) set of 
centroid candidates.  The k-means++ starting point selection algorithm can then 
be applied to this set of candidates to get the final set of initial centroids.
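
The two-phase structure described above can be sketched in a few lines. This is
a single-process illustration only: each inner list stands in for one mapper's
split, and the seeder argument stands in for whatever k-means++ selection
routine is used. The class name, method name and the seeder signature are all
assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

final class TwoPhaseSeeding {

  /**
   * Runs the seeder on every split, pools the results, then runs it once more
   * on the pooled candidates to obtain the final k starting points.
   */
  static <P> List<P> select(List<List<P>> splits, int k,
                            BiFunction<List<P>, Integer, List<P>> seeder) {
    List<P> candidates = new ArrayList<>();
    for (List<P> split : splits) {
      if (!split.isEmpty()) {
        // "Map" phase: up to k candidates from the data this mapper sees.
        candidates.addAll(seeder.apply(split, Math.min(k, split.size())));
      }
    }
    // "Reduce" phase: the candidate set is too large, so select from it again.
    return seeder.apply(candidates, Math.min(k, candidates.size()));
  }

  public static void main(String[] args) {
    // Placeholder seeder that takes the first n points; a real run would
    // plug in a k-means++ selection routine here.
    BiFunction<List<Integer>, Integer, List<Integer>> firstN =
        (pts, n) -> new ArrayList<>(pts.subList(0, n));
    List<List<Integer>> splits = Arrays.asList(Arrays.asList(1, 2, 3), Arrays.asList(4, 5, 6));
    System.out.println(select(splits, 2, firstN));
  }
}
```

Keeping the seeder as a parameter means the same routine is reused in both
phases, which matches the description of applying the selection algorithm to
the pooled candidates.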

Moreover, for sparse data, it is fairly common to find points that have very 
low similarity to previous points.  This suggests that the mappers can apply a 
one-pass version of the starting point selection algorithm to their data.

This one-pass algorithm would keep the set of starting points selected so far 
as well as an on-line estimate of the distribution of distances from points to 
the set of already selected centroids.  The distribution estimate should take 
the form of an estimate of the 1 - k / n quantile of the distance distribution, 
where k is the desired number of starting points and n is the best estimate of 
the number of points that will be seen by the mapper.  Initially, n can be the 
number of points seen so far, but after just a few points have been seen, this 
estimate can be refined based on the average size of points and the total size 
of the split being examined by the mapper.

The one pass algorithm is thus:

{noformat}
centroids = set.of(first data point)
for each input point p
   update quantile_estimate using distance of p to all centroids
   if min_i(distance(centroids_i, p)) > quantile_estimate
      centroids.add(p)
{noformat}

This algorithm will be too lax at the beginning, but poor starting points will 
be pruned in the second phase when all starting points will be re-examined to 
select a final set.
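
A rough Java sketch of the one-pass selection follows. It makes several
assumptions that go beyond the comment: points are dense double[] arrays,
the quantile is tracked over each point's distance to its nearest centroid,
n is passed in as a fixed expected count, and the "on-line estimate" is
replaced by an exact empirical quantile over a sorted list. A real mapper
would want a bounded-memory estimator instead. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class OnePassSeedSelector {
  private final List<double[]> centroids = new ArrayList<>();
  private final List<Double> sortedDistances = new ArrayList<>();
  private final double quantile; // the 1 - k / n quantile to track

  OnePassSeedSelector(int k, long expectedPoints) {
    this.quantile = Math.max(0.0, 1.0 - (double) k / expectedPoints);
  }

  /** Feeds one input point; it becomes a centroid if it is unusually far away. */
  void observe(double[] p) {
    if (centroids.isEmpty()) {
      centroids.add(p); // the first data point always starts the set
      return;
    }
    double nearest = Double.POSITIVE_INFINITY;
    for (double[] c : centroids) {
      nearest = Math.min(nearest, distance(c, p));
    }
    // Update the quantile estimate with this point's nearest-centroid distance.
    int pos = Collections.binarySearch(sortedDistances, nearest);
    sortedDistances.add(pos >= 0 ? pos : -pos - 1, nearest);
    int idx = (int) Math.floor(quantile * (sortedDistances.size() - 1));
    double estimate = sortedDistances.get(idx);
    if (nearest > estimate) {
      centroids.add(p);
    }
  }

  List<double[]> centroids() {
    return centroids;
  }

  private static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  public static void main(String[] args) {
    OnePassSeedSelector selector = new OnePassSeedSelector(2, 10);
    for (double x : new double[] {0, 1, 1, 1, 100}) {
      selector.observe(new double[] {x});
    }
    System.out.println(selector.centroids().size() + " centroids selected");
  }
}
```

With only a handful of distances recorded, the empirical quantile sits low, so
early points are accepted easily; this is the laxness that the second phase is
meant to correct.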




[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-02-05 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830056#action_12830056
 ] 

Robin Anil commented on MAHOUT-153:
---

Any progress on this? Will it be ready soon, or should it be pushed to the 0.4 
release?

 Implement kmeans++ for initial cluster selection in kmeans
 --

 Key: MAHOUT-153
 URL: https://issues.apache.org/jira/browse/MAHOUT-153
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Affects Versions: 0.2
 Environment: OS Independent
Reporter: Panagiotis Papadimitriou
 Fix For: 0.3

   Original Estimate: 336h
  Remaining Estimate: 336h




[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-01-18 Thread Pallavi Palleti (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801716#action_12801716
 ] 

Pallavi Palleti commented on MAHOUT-153:


Hi all,

I am ready with my patch. However, I was checking whether any further 
optimizations could be made. I will share the patch and seek further 
optimization suggestions from the group. Since David might be working on this 
JIRA issue and submitting his own patch to it, should I open a separate issue? 
Kindly suggest.





[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-01-18 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801755#action_12801755
 ] 

Grant Ingersoll commented on MAHOUT-153:


Please keep the same issue.  That way the two of you can compare, extend, etc.




[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-01-18 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801960#action_12801960
 ] 

Ted Dunning commented on MAHOUT-153:



+1 to what Grant said.  Go ahead and post a patch here.  David can either 
update that patch or provide a comparable one for comparison.

Having a good discussion is ideal for getting a really good implementation.




[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-01-16 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801280#action_12801280
 ] 

Isabel Drost commented on MAHOUT-153:
-

Welcome to Mahout. Thanks for stepping up and volunteering to take over the 
work for this issue.




[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-01-15 Thread David Tran (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800812#action_12800812
 ] 

David Tran commented on MAHOUT-153:
---

Hi,

My name is David. I am new to the Mahout community, but Panagiotis has brought 
me up to speed and familiarized me with some of his earlier work on this issue. 
I will be taking over the work for this issue and hope to have a patch soon. 
I will give a more exact estimate of the timeline soon.




[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-01-04 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796323#action_12796323
 ] 

Ted Dunning commented on MAHOUT-153:


{quote}
On Mon, Jan 4, 2010 at 4:03 AM, Palleti, Pallavi pallavi.pall...@corp.aol.com 
wrote:

Initially, I used canopy clustering seeds as initial seeds but the results 
weren't good and the number of clusters depends on the distance thresholds we 
give as input. Later, I have considered randomly selecting some points from the 
input dataset and consider them as initial seeds. Again, the results were not 
good. Now, I have chosen initial seeds from input set in such a way that the 
points are far from each other and I have observed better clustering using 
Fuzzy Kmeans. I have not implemented a map-reducable version for this seed 
selection. I will soon implement a map-reducable version and submit a patch.
{quote}

I encouraged Pallavi on the mailing list to submit his patches here on this 
issue.  Hopefully he will be able to drive the process forward.  




[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2009-12-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789719#action_12789719
 ] 

Sean Owen commented on MAHOUT-153:
--

Just pinging this issue -- still interested in working on it?

 Implement kmeans++ for initial cluster selection in kmeans
 --

 Key: MAHOUT-153
 URL: https://issues.apache.org/jira/browse/MAHOUT-153
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Affects Versions: 0.2
 Environment: OS Independent
Reporter: Panagiotis Papadimitriou
   Original Estimate: 336h
  Remaining Estimate: 336h
