Pallavi,

This is very useful feedback.
What you have done is very similar to the k-means++ algorithm and it is clearly a very good thing.

There is already an issue for tracking a k-means++ implementation:

http://issues.apache.org/jira/browse/MAHOUT-153

Could you post your patch there?

On Mon, Jan 4, 2010 at 4:03 AM, Palleti, Pallavi <[email protected]> wrote:

> Initially, I used canopy clustering seeds as initial seeds but the results
> weren't good and the number of clusters depends on the distance thresholds
> we give as input. Later, I have considered randomly selecting some points
> from the input dataset and consider them as initial seeds. Again, the
> results were not good. Now, I have chosen initial seeds from input set in
> such a way that the points are far from each other and I have observed
> better clustering using Fuzzy Kmeans. I have not implemented a map-reducable
> version for this seed selection. I will soon implement a map-reducable
> version and submit a patch.

--
Ted Dunning, CTO
DeepDyve
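For readers of the thread who have not seen k-means++ seeding, here is a rough single-machine sketch in Python. It is not the Mahout code and not the patch discussed above; the names kmeans_pp_seeds and squared_distance are made up for illustration, and it assumes the points are tuples of numbers that fit in memory. The first seed is picked uniformly at random, and each later seed is picked with probability proportional to its squared distance from the nearest seed chosen so far, which tends to spread the seeds apart.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seeds(points, k, rng=None):
    """Pick k initial seeds from points using k-means++ style D^2 sampling.

    Hypothetical helper for illustration only; not a Mahout API.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    rng = rng or random.Random()

    # First seed: uniform random choice from the input.
    seeds = [points[rng.randrange(len(points))]]
    # d2[i] is the squared distance from points[i] to its nearest seed so far.
    d2 = [squared_distance(p, seeds[0]) for p in points]

    while len(seeds) < k:
        total = sum(d2)
        if total == 0:
            # Every point coincides with a seed; fall back to a uniform pick.
            idx = rng.randrange(len(points))
        else:
            # Sample an index with probability d2[i] / total.
            r = rng.random() * total
            cumulative = 0.0
            idx = len(points) - 1
            for i, weight in enumerate(d2):
                cumulative += weight
                if cumulative > r:
                    idx = i
                    break
        seeds.append(points[idx])
        # Update nearest-seed distances to account for the new seed.
        d2 = [min(old, squared_distance(p, points[idx]))
              for old, p in zip(d2, points)]
    return seeds
```

Calling kmeans_pp_seeds(points, 3, random.Random(42)) on a small list of 2-D tuples returns three of the input points. Picking the point with the largest d2 at every step instead of sampling would give a deterministic "farthest point" variant, which may be closer to what was described above, though it is more sensitive to outliers.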
