On Wed, Jan 1, 2014 at 6:23 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > On the other hand, if we use 7 clusters,
> >
> > > k = kmeans(iris[,1:4], centers=7, nstart=10)
> > > table(iris$Species, k$cluster)
> >
> >              cluster
> >               1  2  3  4  5  6  7
> >   setosa      0  0 28  0 22  0  0
> >   versicolor  0  7  0 20  0  0 23
> >   virginica  12  0  0  1  0 24 13
> >
> > Each cluster is now composed of almost exactly one species. Only cluster
> > 4 has any impurity, and it is 95% composed of just versicolor samples.
>
> @Ted,
>
> How about cluster 7? It seems it is not as demonstrable an improvement, or I
> don't get something.

That is a big shock. I don't remember cluster 7 looking that way. I wonder
if I re-ran the numbers one last time just before putting them in the email
(kmeans is non-deterministic) and got a different clustering.

Here is another run that shows the desired effect:

> table(iris$Species, k$cluster)

               1  2  3  4  5  6  7
  setosa      28  0  0 22  0  0  0
  versicolor   0  0  4  0  0 27 19
  virginica    0 12 15  0 22  1  0

Looking back at my transcript of what I did, I only see the version that I
sent out earlier. I have used this example several times, so I now think that
I must have been seeing what I expected to see rather than what was actually
there.

In further experiments, I find that with 50 restarts of k-means the
sub-optimal solution comes up very rarely. With 2 restarts, it comes up much
more frequently.

This non-determinism is an excellent motivation for cross validation, and a
suggestion to the wise that you re-run your k-means algorithm many times
(which makes streaming k-means look even better, since it makes up for
restarts with more clusters).

Thanks for the eagle eyes!
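To make the restart point concrete, here is a minimal pure-Python sketch (not the R `kmeans` call above, and not Mahout's streaming k-means) of what R's `nstart` argument does: run Lloyd's algorithm from several random initializations and keep the solution with the lowest total within-cluster sum of squares. The toy 1-D data and all names here are illustrative assumptions, not anything from the thread.

```python
import random

def kmeans(points, k, n_starts=10, n_iters=100, rng=None):
    """Lloyd's algorithm with multiple random restarts.

    Each restart can converge to a different local optimum; we keep the
    run with the lowest total within-cluster sum of squares (WCSS),
    which is what R's kmeans(..., nstart=N) does.
    """
    rng = rng or random.Random()
    best_cost, best_centers = float("inf"), None
    for _ in range(n_starts):
        centers = rng.sample(points, k)  # random initial centers
        for _ in range(n_iters):
            # Assignment step: each point goes to its nearest center.
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda j: (p - centers[j]) ** 2)
                clusters[j].append(p)
            # Update step: move each center to its cluster mean
            # (keep the old center if a cluster went empty).
            new_centers = [sum(c) / len(c) if c else centers[j]
                           for j, c in enumerate(clusters)]
            if new_centers == centers:  # converged
                break
            centers = new_centers
        cost = sum(min((p - c) ** 2 for c in centers) for p in points)
        if cost < best_cost:
            best_cost, best_centers = cost, centers
    return best_centers, best_cost

# Three well-separated 1-D groups. A single unlucky initialization can
# merge two groups and split the third; with enough restarts, at least
# one run almost always recovers the intended three clusters.
data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]
centers, cost = kmeans(data, k=3, n_starts=20, rng=random.Random(42))
print(sorted(round(c, 1) for c in centers), round(cost, 2))
```

With `n_starts=1` or `2` the suboptimal merged-cluster solution shows up noticeably often on this toy data, mirroring the 2-restart vs. 50-restart observation above.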
