[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pallavi Palleti updated MAHOUT-153:
-----------------------------------

    Attachment: Mahout-153.patch

Here is a patch for selecting the initial clusters for a clustering algorithm. The idea is taken from the paper "Farthest Point Heuristic Based Initialization Methods for K-Modes Clustering" (http://arxiv.org/pdf/cs/0610043).

The attached patch follows the steps below:

1. The farthest-point heuristic starts with an arbitrary point s1.
2. Pick a point s2 that is as far from s1 as possible.
3. Pick each subsequent point si to maximize the distance to the nearest of all centroids picked so far. That is, maximize min {dist(si, s1), dist(si, s2), ...}.
4. After all k representatives are chosen, we can define the partition of D: cluster Cj consists of all points closer to sj than to any other representative.

> Implement kmeans++ for initial cluster selection in kmeans
> ----------------------------------------------------------
>
>                 Key: MAHOUT-153
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-153
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.2
>         Environment: OS Independent
>            Reporter: Panagiotis Papadimitriou
>            Assignee: Ted Dunning
>             Fix For: 0.4
>
>         Attachments: Mahout-153.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> The current implementation of k-means includes the following algorithms for
> initial cluster selection (seed selection): 1) random selection of k points,
> 2) use of canopy clusters.
> I plan to implement k-means++. The details of the algorithm are available
> here: http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf.
> Design Outline: I will create an abstract class SeedGenerator and a subclass
> KMeansPlusPlusSeedGenerator. The existing class RandomSeedGenerator will
> become a subclass of SeedGenerator.
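
For readers who want to see the farthest-point steps from the comment above in concrete form, here is a minimal in-memory sketch. It is not taken from Mahout-153.patch and does not use any Mahout API: the class name FarthestPointSeeds, the methods select and assign, the use of plain double[] arrays, and the Euclidean distance are all assumptions made for illustration. Ties are broken in favour of the lowest index, which the description above does not specify.

```java
import java.util.Arrays;

// Hypothetical illustration only; names and signatures are not from the patch.
final class FarthestPointSeeds {

    private FarthestPointSeeds() {}

    // Euclidean distance (an assumption; any distance measure could be used).
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the indices of k seeds, starting from the point at index 'first'.
    static int[] select(double[][] points, int k, int first) {
        int n = points.length;
        if (k < 1 || k > n || first < 0 || first >= n) {
            throw new IllegalArgumentException("need 1 <= k <= n and a valid first index");
        }
        int[] seeds = new int[k];
        // nearest[p] = distance from point p to the closest seed chosen so far
        double[] nearest = new double[n];
        Arrays.fill(nearest, Double.POSITIVE_INFINITY);

        int current = first;
        for (int chosen = 0; chosen < k; chosen++) {
            seeds[chosen] = current;
            int farthest = -1;
            double best = -1.0;
            for (int p = 0; p < n; p++) {
                // Update the min over seeds, then track the point maximizing it.
                nearest[p] = Math.min(nearest[p], distance(points[p], points[current]));
                if (nearest[p] > best) {
                    best = nearest[p];
                    farthest = p;
                }
            }
            current = farthest; // next seed: farthest from all seeds so far
        }
        return seeds;
    }

    // Assigns each point to the position (in 'seeds') of its nearest seed.
    static int[] assign(double[][] points, int[] seeds) {
        int[] cluster = new int[points.length];
        for (int p = 0; p < points.length; p++) {
            double best = Double.POSITIVE_INFINITY;
            for (int j = 0; j < seeds.length; j++) {
                double d = distance(points[p], points[seeds[j]]);
                if (d < best) {
                    best = d;
                    cluster[p] = j;
                }
            }
        }
        return cluster;
    }
}
```

Each round costs one pass over the data because the running minimum in nearest[] is updated incrementally instead of being recomputed against every seed. If the input contains duplicate points, a duplicate of an existing seed can be selected once all remaining distances are zero; a real implementation would need to decide how to handle that case.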