On Wed, Jan 1, 2014 at 6:23 PM, Dmitriy Lyubimov <[email protected]> wrote:

> > On the other hand, if we use 7 clusters,
> >
> >
> > > k = kmeans(iris[,1:4], centers=7, nstart=10)
> >
> > > table(iris$Species, k$cluster)
> >
> >                    cluster
> >               1  2  3  4  5  6  7
> >   setosa      0  0 28  0 22  0  0
> >   versicolor  0  7  0 20  0  0 23
> >   virginica  12  0  0  1  0 24 13
> >
> > Each cluster is now composed of almost exactly one species.  Only cluster
> > 4 has any impurity and it is 95% composed of just versicolor samples.
> >
> @Ted,
>
> How about cluster 7? It seems not as demonstrable an improvement, or am
> I missing something?
>

That is a big shock.  I don't remember seeing cluster 7 that way.  I wonder
if I re-ran the numbers one last time before putting them into the email
(k-means is non-deterministic).

Here is another run that shows the desired effect:

> table(iris$Species, k$cluster)
>
>               1  2  3  4  5  6  7
>   setosa     28  0  0 22  0  0  0
>   versicolor  0  0  4  0  0 27 19
>   virginica   0 12 15  0 22  1  0

In looking back at my transcript of what I did, I only see the version that
I sent out earlier.  I have used this example several times so I now think
that I must have been seeing what I expected to see rather than what was
there.

In further experiments I find that with 50 restarts of k-means, the
sub-optimal solution comes up very rarely.  With 2 restarts, it comes up
much more frequently.

This non-determinism is an excellent motivation for cross-validation, and a
hint to the wise that you should re-run your k-means algorithm many times
(which makes streaming k-means look even better, since it compensates for
restarts by using more clusters).

Thanks for the eagle-eyes!
