[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846785#action_12846785 ]

Pallavi Palleti commented on MAHOUT-153:

Forgot to mention: in this patch, I made the lengthsquared instance variable in AbstractVector transient.

Implement kmeans++ for initial cluster selection in kmeans
--
Key: MAHOUT-153
URL: https://issues.apache.org/jira/browse/MAHOUT-153
Project: Mahout
Issue Type: New Feature
Components: Clustering
Affects Versions: 0.2
Environment: OS Independent
Reporter: Panagiotis Papadimitriou
Assignee: Ted Dunning
Fix For: 0.4
Attachments: Mahout-153.patch, Mahout-153.patch, MAHOUT-153_RandomFarthest.patch
Original Estimate: 336h
Remaining Estimate: 336h

The current implementation of k-means includes the following algorithms for initial cluster selection (seed selection): 1) random selection of k points, 2) use of canopy clusters. I plan to implement k-means++. The details of the algorithm are available here: http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf.

Design Outline: I will create an abstract class SeedGenerator and a subclass KMeansPlusPlusSeedGenerator. The existing class RandomSeedGenerator will become a subclass of SeedGenerator.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
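For readers following the design outline in the issue description, here is a minimal, language-neutral sketch of that class layout in Python. It is not taken from any attached patch. The class names come from the issue text; the method name choose_seeds, the tuple-based point representation, and the squared Euclidean distance are assumptions made for illustration. The k-means++ subclass uses the standard rule of picking each new seed with probability proportional to its squared distance from the nearest seed chosen so far.

```python
import random
from abc import ABC, abstractmethod


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class SeedGenerator(ABC):
    """Abstract base named in the design outline; choose_seeds is a hypothetical method name."""

    @abstractmethod
    def choose_seeds(self, points, k, rng):
        """Return k initial centroids drawn from points."""


class RandomSeedGenerator(SeedGenerator):
    """Existing strategy from the issue text: k points chosen uniformly at random."""

    def choose_seeds(self, points, k, rng):
        return rng.sample(points, k)


class KMeansPlusPlusSeedGenerator(SeedGenerator):
    """k-means++ seeding: sample each new seed with weight D(x)^2."""

    def choose_seeds(self, points, k, rng):
        seeds = [rng.choice(points)]
        # d2[i] is the squared distance from points[i] to its nearest seed.
        d2 = [squared_distance(p, seeds[0]) for p in points]
        while len(seeds) < k:
            if sum(d2) == 0:
                raise ValueError("fewer than k distinct points")
            chosen = rng.choices(points, weights=d2, k=1)[0]
            seeds.append(chosen)
            # Only the newly added seed can shrink a point's nearest-seed distance.
            d2 = [min(d, squared_distance(p, chosen)) for d, p in zip(d2, points)]
        return seeds
```

A caller would pick a generator and pass it a list of points, k, and a random.Random instance, for example KMeansPlusPlusSeedGenerator().choose_seeds(points, 3, random.Random(42)). Keeping the strategy behind one abstract method is what lets the random and k-means++ variants be swapped without touching the clustering driver.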
[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840574#action_12840574 ]

Rohini Uppuluri commented on MAHOUT-153:

Hi,

Here is a brief description of the input and output formats. Hope this helps.

Input Format: documentId\tdocument vector

Example line:
338 [s1682, 275:5.0, 478:3.0, 479:5.0, 1:3.0, 474:4.0, 143:2.0, 197:5.0, 196:2.0, 286:4.0, 135:5.0, 86:4.0, 216:4.0, 83:2.0, 213:5.0, 215:3.0, 208:3.0, 269:4.0, 517:5.0, 169:5.0, 654:5.0, 443:5.0, 990:4.0, 175:4.0, 513:5.0, 514:5.0, 650:5.0, 525:4.0, 1124:4.0, 382:5.0, 708:5.0, 497:3.0, 498:4.0, 523:3.0, 427:4.0, 488:5.0, 490:5.0, 189:4.0, 52:5.0, 301:4.0, 607:4.0, 180:4.0, ]

Output Format: ClusterIdentifier\tClusterIdentifier: clusterCenterVector

Example line:
C0 C0: [s1682, 275:3.0, 1:4.0, 273:5.0, 272:2.0, 3:1.0, 546:4.0, 277:3.0, 276:3.0, 7:5.0, 283:3.0, 282:4.0, 9:1.0, 281:4.0, 12:5.0, 1089:2.0, 13:1.0, 286:1.0, 14:1.0, 15:2.0, 284:3.0, 258:4.0, 17:3.0, 257:5.0, 23:5.0, 25:2.0, 264:4.0, 270:5.0, 271:3.0, 31:5.0, 305:1.0, 1405:3.0, 307:4.0, 39:5.0, 311:3.0, 310:3.0, 515:5.0, 313:5.0, 525:5.0, 315:3.0, 316:4.0, 288:3.0, 50:5.0, 532:4.0, 291:5.0, 292:4.0, 55:5.0, 293:4.0, 294:3.0, 295:5.0, 298:4.0, 56:5.0, 300:5.0, 539:2.0, 302:4.0, 343:4.0, 882:4.0, 340:1.0, 887:4.0, 1025:4.0, 619:3.0, 79:5.0, 347:2.0, 346:4.0, 345:1.0, 344:1.0, 326:4.0, 327:3.0, 1051:4.0, 322:4.0, 323:4.0, 628:2.0, 333:4.0, 331:4.0, 1047:4.0, 328:4.0, 636:4.0, 100:5.0, 98:5.0, 581:4.0, 370:3.0, 591:3.0, 118:5.0, 595:4.0, 117:4.0, 358:2.0, 597:4.0, 127:5.0, 1073:4.0, 603:5.0, 121:5.0, 683:4.0, 413:3.0, 678:4.0, 950:4.0, 405:4.0, 156:5.0, 696:4.0, 1244:4.0, 147:5.0, 690:3.0, 928:3.0, 151:1.0, 924:3.0, 443:4.0, 654:5.0, 925:2.0, 649:4.0, 164:5.0, 642:4.0, 185:5.0, 431:5.0, 905:4.0, 1278:4.0, 176:4.0, 183:5.0, 657:5.0, 898:1.0, 181:4.0, 659:4.0, 1016:4.0, 477:1.0, 751:4.0, 475:4.0, 750:4.0, 203:5.0, 472:2.0, 748:3.0, 471:5.0, 1011:2.0, 466:5.0, 742:5.0, 1013:3.0, 1014:4.0, 762:4.0, 222:5.0, 760:3.0, 460:4.0, 458:3.0, 218:4.0, 237:4.0, 235:3.0, 504:5.0, 717:4.0, 234:4.0, 991:1.0, 233:5.0, 978:2.0, 229:5.0, 226:5.0, 254:1.0, 255:4.0, 252:3.0, 250:5.0, 248:4.0, 245:4.0, ]

Key: MAHOUT-153
Assignee: Ted Dunning
Fix For: 0.4
Attachments: Mahout-153.patch, MAHOUT-153_RandomFarthest.patch
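The input and output line formats described above can be read with a few lines of code. The following Python sketch is an illustration only and is not part of any attached patch. It assumes that the two fields are separated by a tab (as the \t in the format description indicates), that the leading s1682 token marks a sparse vector of cardinality 1682 (this reading of the token is a guess), and that the remaining entries are index:value pairs. All function names are hypothetical.

```python
def parse_sparse_vector(text):
    """Parse '[s1682, 275:5.0, 478:3.0, ]' into (cardinality, {index: value})."""
    body = text.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("vector must be enclosed in square brackets")
    # Split on commas and drop the empty field left by the trailing ', ]'.
    fields = [f.strip() for f in body[1:-1].split(",") if f.strip()]
    header, entries = fields[0], fields[1:]
    # Assumption: the header is 's' followed by the vector cardinality.
    cardinality = int(header[1:])
    values = {}
    for entry in entries:
        index, value = entry.split(":")
        values[int(index)] = float(value)
    return cardinality, values


def parse_input_line(line):
    """Parse 'documentId<TAB>document vector'."""
    doc_id, vector_text = line.rstrip("\n").split("\t", 1)
    cardinality, values = parse_sparse_vector(vector_text)
    return doc_id, cardinality, values


def parse_output_line(line):
    """Parse 'ClusterIdentifier<TAB>ClusterIdentifier: clusterCenterVector'."""
    cluster_id, rest = line.rstrip("\n").split("\t", 1)
    # The first colon separates the repeated identifier from the vector.
    label, vector_text = rest.split(":", 1)
    cardinality, values = parse_sparse_vector(vector_text)
    return cluster_id, label.strip(), cardinality, values
```

For example, parse_input_line("338\t[s1682, 275:5.0, 478:3.0, ]") yields the document id "338", the cardinality 1682, and a dictionary with two entries.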
[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12839263#action_12839263 ]

Robin Anil commented on MAHOUT-153:

Hi Rohini, do you have a patch ready so that the community can review it?

Key: MAHOUT-153
Assignee: Ted Dunning
Fix For: 0.4
Attachments: Mahout-153.patch
[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832380#action_12832380 ]

Rohini Uppuluri commented on MAHOUT-153:

Hi all,

I have implemented an extension to the algorithm Pallavi mentioned. The extension adds some randomness to picking the farthest point. This gives a chance of overcoming the problem of picking noise points that are very far away as centroids.

Summary:
1. Pick the first centroid randomly.
2. For each of the remaining centroids:
   - Compute a few candidate centroids that are far off. Candidate centroid computation: divide the data into a few parts, and for each part compute the point that is farthest from the current list of centroids.
   - Select one of the candidate centroids randomly.

I will submit a patch for this soon. Please let me know your feedback.

Key: MAHOUT-153
Assignee: Ted Dunning
Fix For: 0.4
Attachments: Mahout-153.patch
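The randomized farthest-point selection summarized in this comment can be sketched as follows. This is a Python illustration written from the summary alone, not the code in MAHOUT-153_RandomFarthest.patch. Several details are assumptions: "farthest from the current list of centroids" is read as the largest distance to the nearest centroid, the data is divided by simple striding, and the function name and the num_parts parameter are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid_distance(point, centroids):
    """Distance from a point to the closest already-selected centroid."""
    return min(squared_distance(point, c) for c in centroids)


def random_farthest_seeds(points, k, num_parts=4, rng=None):
    """Pick k seeds: the first at random, the rest from per-part farthest points."""
    rng = rng or random.Random()
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Step 1: the first centroid is chosen uniformly at random.
    centroids = [rng.choice(points)]
    # Assumption: the parts are formed by striding through the input list.
    parts = [points[i::num_parts] for i in range(num_parts)]
    parts = [part for part in parts if part]
    while len(centroids) < k:
        # Step 2a: one candidate per part, the point farthest from the centroids.
        candidates = [
            max(part, key=lambda p: nearest_centroid_distance(p, centroids))
            for part in parts
        ]
        # Drop candidates that coincide with an existing centroid.
        candidates = [
            c for c in candidates if nearest_centroid_distance(c, centroids) > 0
        ]
        if not candidates:
            raise ValueError("fewer than k distinct points")
        # Step 2b: choose one candidate at random instead of the global farthest.
        centroids.append(rng.choice(candidates))
    return centroids
```

Choosing among several per-part candidates, rather than always taking the single farthest point, is what gives an isolated noise point a chance of being skipped.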
Re: [jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
I will add my patch within 3 to 4 days. Everything is done except for some test classes, which I still need to write.

Thanks
Pallavi

Robin Anil (JIRA) wrote:
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12830056#action_12830056 ]
Robin Anil commented on MAHOUT-153:
Any progress on this? Will it be ready soon or should it be pushed to 0.4 release ?

Key: MAHOUT-153
Fix For: 0.3
[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831391#action_12831391 ]

Pallavi Palleti commented on MAHOUT-153:

Forgot to mention. The above patch doesn't include test cases. Kindly review.

Key: MAHOUT-153
Assignee: Ted Dunning
Fix For: 0.4
Attachments: Mahout-153.patch
[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831622#action_12831622 ]

Ted Dunning commented on MAHOUT-153:

I have been thinking about this problem a bit, particularly with respect to the obvious problems of parallelizing this very sequential algorithm in the context of the large sparse data that is common with scalable data mining. What I have come up with is an approximate map-reduce algorithm that might provide good speedup and scalability.

The idea is that each mapper essentially does the k-means++ starting point selection on the data split it sees. The output from each mapper is a set of potential starting centroids, and the combined output is a (too large) set of centroid candidates. The k-means++ starting point selection algorithm can then be applied to this set of candidates to get the final set of initial centroids.

Moreover, for sparse data, it is fairly common to find points that have very low similarity to previous points. This suggests that the mappers can apply a one-pass version of the starting point selection algorithm to their data. This one-pass algorithm would keep the set of starting points selected so far as well as an on-line estimate of the distribution of distances of points to the set of already selected centroids. The distribution estimate should take the form of an estimate of the 1 - k / n quantile of the distance distribution, where k is the desired number of starting points and n is the best estimate of the number of points that will be seen by the mapper. Initially, n can be the number of points seen so far, but after just a few points have been seen, this estimate can be refined based on the average size of points and the total size of the split being examined by the mapper.
The one-pass algorithm is thus:

{noformat}
centroids = set.of(first data point)
for each input point p
  update quantile_estimate using distance of p to all centroids
  if min(distance(centroids_i, p)) > quantile_estimate
    centroids.add(p)
{noformat}

This algorithm will be too lax at the beginning, but poor starting points will be pruned in the second phase, when all starting points are re-examined to select a final set.

Key: MAHOUT-153
Assignee: Ted Dunning
Fix For: 0.4
Attachments: Mahout-153.patch
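As a rough illustration of the mapper-side one-pass selection described in this comment, here is a small Python sketch. It is not from any patch and simplifies several things. The quantile is computed exactly from a sorted list of all distances seen so far, where a real mapper would use a bounded-memory streaming estimate. Each point is compared against the estimate built from earlier points before its own distance is recorded, which differs slightly from the pseudocode's ordering. The estimate of n is passed in as a fixed number. The names RunningQuantile, one_pass_candidates, and n_estimate are hypothetical.

```python
import bisect
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class RunningQuantile:
    """Exact running quantile over all values seen (a stand-in for an on-line estimate)."""

    def __init__(self, q):
        self.q = min(max(q, 0.0), 1.0)
        self.values = []

    def update(self, x):
        bisect.insort(self.values, x)

    def estimate(self):
        """Return the current quantile, or None before any value has been seen."""
        if not self.values:
            return None
        index = min(len(self.values) - 1, int(self.q * len(self.values)))
        return self.values[index]


def one_pass_candidates(points, k, n_estimate):
    """Single pass over one split, returning candidate starting centroids."""
    iterator = iter(points)
    centroids = [next(iterator)]
    # Track the 1 - k/n quantile of nearest-centroid distances.
    quantile = RunningQuantile(1.0 - k / n_estimate)
    for p in iterator:
        d = min(euclidean(c, p) for c in centroids)
        threshold = quantile.estimate()
        # With no estimate yet, every point passes: lax at the beginning.
        if threshold is None or d > threshold:
            centroids.append(p)
        quantile.update(d)
    return centroids
```

Each mapper would emit its candidates, and a second phase would then run the starting point selection again over the combined candidate set to prune it down to k centroids.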
[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12830056#action_12830056 ]

Robin Anil commented on MAHOUT-153:

Any progress on this? Will it be ready soon or should it be pushed to 0.4 release ?

Key: MAHOUT-153
Fix For: 0.3
[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801716#action_12801716 ]

Pallavi Palleti commented on MAHOUT-153:

Hi all,

I am ready with my patch. However, I was trying to see if there are any possible optimizations that can be made. I will share the patch and seek further optimization suggestions from the group. Should I open another JIRA issue, since David might be working on this one, or submit my patch to this JIRA issue? Kindly suggest.

Key: MAHOUT-153
Fix For: 0.3
[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801755#action_12801755 ]

Grant Ingersoll commented on MAHOUT-153:

Please keep the same issue. That way the two of you can compare, extend, etc.

Key: MAHOUT-153
Fix For: 0.3
[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801960#action_12801960 ]

Ted Dunning commented on MAHOUT-153:

+1 to what Grant said. Go ahead and post a patch here. David can either update that patch or provide a comparable one for comparison. Having a good discussion is ideal for getting a really good implementation.

Key: MAHOUT-153
Fix For: 0.3
[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801280#action_12801280 ]

Isabel Drost commented on MAHOUT-153:

Welcome to Mahout. Thanks for stepping up and volunteering to take over the work for this issue.

Key: MAHOUT-153
Fix For: 0.3
[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12800812#action_12800812 ]

David Tran commented on MAHOUT-153:

Hi,

My name is David. I am new to the Mahout community, but Panagiotis has brought me up to speed and familiarized me with some of his earlier work on this issue. I will be taking over the work for this issue and hope to have a patch soon. I will give a more exact estimate of the timeline soon.

Key: MAHOUT-153
Fix For: 0.3
[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796323#action_12796323 ]

Ted Dunning commented on MAHOUT-153:

{quote}
On Mon, Jan 4, 2010 at 4:03 AM, Palleti, Pallavi pallavi.pall...@corp.aol.com wrote:

Initially, I used canopy clustering seeds as initial seeds, but the results weren't good and the number of clusters depends on the distance thresholds we give as input. Later, I considered randomly selecting some points from the input dataset and using them as initial seeds. Again, the results were not good. Now, I have chosen initial seeds from the input set in such a way that the points are far from each other, and I have observed better clustering using Fuzzy Kmeans. I have not implemented a map-reducible version of this seed selection. I will soon implement a map-reducible version and submit a patch.
{quote}

I encouraged Pallavi on the mailing list to submit his patches here on this issue. Hopefully he will be able to drive the process forward.

Key: MAHOUT-153
Fix For: 0.3
[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789719#action_12789719 ]

Sean Owen commented on MAHOUT-153:

Just pinging this issue -- still interested in working on it?