Since I posted last night I've been exploring how the two implementations
could differ. The underlying algorithm appears to be slightly different, but
I think the main difference between the two is how the initial seeds are
chosen. In FASTCLUS I believe it's some sort of random selection of values,
while in kmeans() it's a random selection of actual observations. The big
difference in results I was getting with a particular data set was that
FASTCLUS was producing one huge, dense cluster (about 85% of obs) and a
bunch of very small ones, while kmeans() was producing a much more even
distribution of cluster membership.

Because the data I was using actually seems to contain a very large,
homogeneous group that occupies a very small space, a random selection of
seed values (FASTCLUS) is very unlikely to plant more than one seed inside
the large (in #obs), dense cluster, so nothing breaks it apart during the
iterations. kmeans(), on the other hand, uses a random selection of
observations, each of which has a high probability of coming from the large,
dense cluster. Multiple seeds will therefore most likely be planted in that
area, causing it to break up during the iterations.

At least that's my take on it. Does anyone see anything wrong with this line
of reasoning?

Andy

On Fri, Dec 3, 2010 at 10:15 AM, Georg Ruß <resea...@georgruss.de> wrote:

> On 02/12/10 17:49:37, Andrew Agrimson wrote:
> > I've been comparing results from kmeans() in R to PROC FASTCLUS in SAS
> > and I'm getting drastically different results with a real life data set.
> > [...] Has anybody looked into the differences in the implementations or
> > have any thoughts on the matter?
>
> Hi Andrew,
>
> as per the website below, it looks as if PROC FASTCLUS is implementing a
> certain flavor of k-Means:
>
> http://www.technion.ac.il/docs/sas/stat/chap27/sect2.htm
>
> As per the manpage ?kmeans, the R implementation of k-Means has the option
> to set one of the algorithms explicitly:
>
> algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")
>
> I don't know whether you've tried that, but you may start by setting these
> algorithm variants explicitly and see what the outcome is.
>
> Regards,
> Georg.
> --
> Research Assistant
> Otto-von-Guericke-Universität Magdeburg
> resea...@georgruss.de
> http://research.georgruss.de
>


