Clustering is harder than you appear to think:

http://www.imsc.res.in/~meena/papers/kmeans.pdf

https://en.wikipedia.org/wiki/K-means_clustering

NP-hard problems are typically solved by approximation, and k-means is a
great example.  Only a few relatively unrealistic examples have solutions
apparent enough to be found reliably by diverse algorithms.  For instance,
something as easy as Gaussian clusters with sd = 1e-3 situated on 10 random
corners of a unit hypercube in 10-dimensional space will be clustered
differently by many algorithms unless multiple starts are used.

For instance, see https://gist.github.com/tdunning/e1575ad2043af732c219 for
an R script demonstrating that R's standard k-means algorithms fail over
95% of the time on this trivial input, occasionally splitting a single
cluster into three parts.  Restarting multiple times doesn't fix the
problem; it only makes it a bit more tolerable.  The example shows how even
90 restarts can fail on this particular problem.
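To make the failure mode concrete, here is a minimal numpy sketch of the setup described above. The constants and the `lloyd` helper are my own, not taken from the gist (which uses R's kmeans); it just shows how much the cost of a single random start varies on this kind of data, which is why one start is unreliable and only the best of many starts is repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical re-creation of the setup described above (names are mine):
# k tight Gaussian clusters (sd = 1e-3) centered on k distinct random
# corners of the unit hypercube in d = 10 dimensions.
d, k, n_per = 10, 10, 50
corner_ids = rng.choice(2 ** d, size=k, replace=False)   # distinct corner indices
corners = np.array([[(c >> b) & 1 for b in range(d)] for c in corner_ids], float)
data = np.vstack([c + rng.normal(0.0, 1e-3, size=(n_per, d)) for c in corners])

def lloyd(points, k, rng, iters=25):
    """Plain Lloyd's k-means from one random initialization; returns final cost."""
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign every point to its nearest center
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # move each center to the mean of its points (leave empty clusters alone)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return ((points - centers[labels]) ** 2).sum()

# A single random start often lands in a poor local optimum on this data;
# comparing the best and worst of several starts makes the spread visible.
costs = [lloyd(data, k, rng) for _ in range(20)]
print("best cost:", min(costs), " worst cost:", max(costs))
```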





On Mon, Jan 5, 2015 at 11:03 PM, Lee S <sle...@gmail.com> wrote:

> But parameters and distance measure are the same. Only difference: Mahout
> kmeans convergence is based on whether every cluster has converged;
> scikit-learn is based on a within-cluster sum-of-squares criterion.
>
> 2015-01-06 14:15 GMT+08:00 Ted Dunning <ted.dunn...@gmail.com>:
>
> > I don't think that data is sufficiently clusterable to expect a unique
> > solution.
> >
> > Mean squared error would be a better measure of quality.
> >
> >
> >
> > On Mon, Jan 5, 2015 at 10:07 PM, Lee S <sle...@gmail.com> wrote:
> >
> > > Data in this link:
> > > http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
> > > I converted it to a sequencefile with InputDriver.
> > >
> > > 2015-01-06 14:04 GMT+08:00 Ted Dunning <ted.dunn...@gmail.com>:
> > >
> > > > What kind of synthetic data did you use?
> > > >
> > > >
> > > >
> > > > On Mon, Jan 5, 2015 at 8:29 PM, Lee S <sle...@gmail.com> wrote:
> > > >
> > > > > Hi, I used the synthetic data to test the kmeans method.
> > > > > And I wrote my own code to convert center points to sequencefiles.
> > > > > Then I ran kmeans with parameters (-i input -o output -c center -x 3
> > > > > -cd 1 -cl).
> > > > > I compared the dumped clusteredPoints with the result of scikit-learn
> > > > > kmeans; it's totally different. I'm very confused.
> > > > >
> > > > > Has anybody ever run kmeans with center points provided and compared
> > > > > the result with another ml-library?
> > > > >
> > > >
> > >
> >
>
