Running this gist can be done using the following two lines of R, btw:

library(devtools)
source_url("https://gist.githubusercontent.com/tdunning/e1575ad2043af732c219/raw/444514454a6f3b5fcbbcaa3f8a919b1965e07f16/Clustering%20is%20hard")
You should see something like this as output:

SHA-1 hash of file is 2bc9bf7677d6d5b8b7aa1b1d49749574f5bd942e
$fail
[1] 96

$success
[1] 4

counts
 1  2  3  4
 4 71 22  3

On Mon, Jan 5, 2015 at 11:50 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Clustering is harder than you appear to think:
>
> http://www.imsc.res.in/~meena/papers/kmeans.pdf
>
> https://en.wikipedia.org/wiki/K-means_clustering
>
> NP-hard problems are typically solved by approximation. K-means is a
> great example. Only a few, relatively unrealistic, examples have solutions
> apparent enough to be found reliably by diverse algorithms. For instance,
> something as easy as Gaussian clusters with sd=1e-3 situated on 10 random
> corners of a unit hypercube in 10-dimensional space will be clustered
> differently by many algorithms unless multiple starts are used.
>
> See https://gist.github.com/tdunning/e1575ad2043af732c219 for an R script
> that demonstrates that R's standard k-means algorithms fail over 95% of
> the time on this trivial input, occasionally splitting a single cluster
> into three parts. Restarting multiple times doesn't fix the problem ...
> it only makes it a bit more tolerable. This example shows how even 90
> restarts can fail on this particular problem.
>
> On Mon, Jan 5, 2015 at 11:03 PM, Lee S <sle...@gmail.com> wrote:
>
>> But the parameters and the distance measure are the same. The only
>> difference is the convergence test: Mahout's k-means checks whether
>> every cluster has converged, while scikit-learn uses a within-cluster
>> sum-of-squares criterion.
>>
>> 2015-01-06 14:15 GMT+08:00 Ted Dunning <ted.dunn...@gmail.com>:
>>
>> > I don't think that data is sufficiently clusterable to expect a unique
>> > solution.
>> >
>> > Mean squared error would be a better measure of quality.
>> >
>> > On Mon, Jan 5, 2015 at 10:07 PM, Lee S <sle...@gmail.com> wrote:
>> >
>> > > The data is at this link:
>> > >
>> > > http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
>> > >
>> > > I converted it to a sequence file with InputDriver.
>> > >
>> > > 2015-01-06 14:04 GMT+08:00 Ted Dunning <ted.dunn...@gmail.com>:
>> > >
>> > > > What kind of synthetic data did you use?
>> > > >
>> > > > On Mon, Jan 5, 2015 at 8:29 PM, Lee S <sle...@gmail.com> wrote:
>> > > >
>> > > > > Hi, I used the synthetic data to test the k-means method, and I
>> > > > > wrote my own code to convert the center points to sequence files.
>> > > > > Then I ran k-means with the parameters (-i input -o output
>> > > > > -c center -x 3 -cd 1 -cl) and compared the dumped clusteredPoints
>> > > > > with the result of scikit-learn's k-means. The results are
>> > > > > totally different, which has me very confused.
>> > > > >
>> > > > > Has anybody ever run k-means with center points provided and
>> > > > > compared the result with another ML library?
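
For anyone who wants to try this without fetching the gist, here is a
minimal sketch of the kind of experiment described above. To be clear,
this is not the gist's code: the helper names (make.data, run.once), the
points-per-cluster count, and the success threshold on tot.withinss are
all assumptions chosen for illustration.

# Sketch: 10 tight Gaussian clusters (sd = 1e-3) placed on random corners
# of the unit hypercube in 10 dimensions, then clustered with kmeans().
# Helper names and thresholds are made up for illustration.

make.data <- function(d = 10, k = 10, n.per = 50, sd = 1e-3) {
  # k random corners of the d-dimensional unit hypercube
  corners <- matrix(rbinom(k * d, 1, 0.5), nrow = k)
  # n.per noisy points around each corner
  corners[rep(1:k, each = n.per), ] + rnorm(k * n.per * d, sd = sd)
}

run.once <- function(x, k = 10, nstart = 1) {
  km <- kmeans(x, centers = k, nstart = nstart)
  # with sd = 1e-3 the true clusters are nearly points, so a correct
  # solution has total within-cluster sum of squares close to zero
  km$tot.withinss < 1e-2
}

set.seed(1)
trials <- replicate(100, run.once(make.data()))
cat(sprintf("success: %d / 100\n", sum(trials)))

Raising nstart makes the failures rarer, but, as noted above, it doesn't
eliminate them.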