Btw, this gist can be run with the following two lines of R:

library(devtools)
source_url("
https://gist.githubusercontent.com/tdunning/e1575ad2043af732c219/raw/444514454a6f3b5fcbbcaa3f8a919b1965e07f16/Clustering%20is%20hard
")

You should see something like this as output:

SHA-1 hash of file is 2bc9bf7677d6d5b8b7aa1b1d49749574f5bd942e
$fail
[1] 96

$success
[1] 4

counts
 1  2  3  4
 4 71 22  3


On Mon, Jan 5, 2015 at 11:50 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Clustering is harder than you appear to think:
>
> http://www.imsc.res.in/~meena/papers/kmeans.pdf
>
> https://en.wikipedia.org/wiki/K-means_clustering
>
> NP-hard problems are typically solved by approximation.  K-means is a
> great example.  Only a few, relatively unrealistic, examples have solutions
> apparent enough to be found reliably by diverse algorithms.  For instance,
> something as easy as Gaussian clusters with sd=1e-3 situated on 10 random
> corners of a unit hypercube in 10-dimensional space will be clustered
> differently by many algorithms unless multiple starts are used.
>
> For instance, see https://gist.github.com/tdunning/e1575ad2043af732c219
> for an R script that demonstrates that R's standard k-means algorithms fail
> over 95% of the time for this trivial input, occasionally splitting a
> single cluster into three parts.  Restarting multiple times doesn't fix the
> problem ... it only makes it a bit more tolerable.  This example shows how
> even 90 restarts could fail for this particular problem.
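>
> Here's a rough sketch of that setup in R, for anyone who wants to play
> with it without reading the gist (the seed, cluster sizes, and run
> counts here are illustrative, not the gist's exact parameters):
>
> set.seed(1)
> d <- 10                                  # dimensions
> k <- 10                                  # true clusters
> corners <- matrix(rbinom(k * d, 1, 0.5), k, d)  # random hypercube corners
> x <- corners[rep(1:k, each = 100), ] +
>   matrix(rnorm(k * 100 * d, sd = 1e-3), k * 100, d)
>
> # a single start typically merges some true clusters and splits others
> fit1 <- kmeans(x, centers = k)
> table(true = rep(1:k, each = 100), found = fit1$cluster)
>
> # nstart keeps the best of many starts; better, but still no guarantee
> fit90 <- kmeans(x, centers = k, nstart = 90)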
>
>
>
>
>
> On Mon, Jan 5, 2015 at 11:03 PM, Lee S <sle...@gmail.com> wrote:
>
>> But the parameters and distance measure are the same. The only
>> difference: Mahout's k-means convergence is based on whether every
>> cluster has converged, while scikit-learn's is based on the
>> within-cluster sum-of-squares criterion.
>>
>> 2015-01-06 14:15 GMT+08:00 Ted Dunning <ted.dunn...@gmail.com>:
>>
>> > I don't think that data is sufficiently clusterable to expect a unique
>> > solution.
>> >
>> > Mean squared error would be a better measure of quality.
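>> >
>> > In R, for instance, kmeans() reports the within-cluster sum of squares
>> > directly, so the mean squared error falls out in one line (a sketch;
>> > x is your data matrix and k the number of centers):
>> >
>> > fit <- kmeans(x, centers = k)
>> > mse <- fit$tot.withinss / nrow(x)  # mean squared distance to centers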
>> >
>> >
>> >
>> > On Mon, Jan 5, 2015 at 10:07 PM, Lee S <sle...@gmail.com> wrote:
>> >
>> > > Data is in this link:
>> > > http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
>> > > I converted it to a sequence file with InputDriver.
>> > >
>> > > 2015-01-06 14:04 GMT+08:00 Ted Dunning <ted.dunn...@gmail.com>:
>> > >
>> > > > What kind of synthetic data did you use?
>> > > >
>> > > >
>> > > >
>> > > > On Mon, Jan 5, 2015 at 8:29 PM, Lee S <sle...@gmail.com> wrote:
>> > > >
>> > > > > Hi, I used the synthetic data to test the k-means method,
>> > > > > and I wrote my own code to convert the center points to
>> > > > > sequence files. Then I ran k-means with the parameters
>> > > > > (-i input -o output -c center -x 3 -cd 1 -cl).
>> > > > > I compared the dumped clusteredPoints with the result of
>> > > > > scikit-learn's k-means, and it's totally different. I'm very
>> > > > > confused.
>> > > > >
>> > > > > Has anybody ever run k-means with the center points provided
>> > > > > and compared the result with another ML library?
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
