Clustering is harder than you appear to think: http://www.imsc.res.in/~meena/papers/kmeans.pdf
https://en.wikipedia.org/wiki/K-means_clustering

NP-hard problems are typically solved by approximation, and k-means is a
great example. Only a few, relatively unrealistic, examples have solutions
apparent enough to be found reliably by diverse algorithms. For instance,
something as easy as Gaussian clusters with sd = 1e-3 situated on 10 random
corners of a unit hypercube in 10-dimensional space will be clustered
differently by many algorithms unless multiple starts are used.

See https://gist.github.com/tdunning/e1575ad2043af732c219 for an R script
demonstrating that R's standard k-means algorithms fail over 95% of the
time on this trivial input, occasionally splitting a single cluster into
three parts. Restarting multiple times doesn't fix the problem ... it only
makes it a bit more tolerable. This example shows how even 90 restarts can
fail for this particular problem.

On Mon, Jan 5, 2015 at 11:03 PM, Lee S <sle...@gmail.com> wrote:

> But the parameters and distance measure are the same. The only difference:
> Mahout kmeans convergence is based on whether every cluster has converged;
> scikit-learn is based on a within-cluster sum-of-squares criterion.
>
> 2015-01-06 14:15 GMT+08:00 Ted Dunning <ted.dunn...@gmail.com>:
>
> > I don't think that data is sufficiently clusterable to expect a unique
> > solution.
> >
> > Mean squared error would be a better measure of quality.
> >
> > On Mon, Jan 5, 2015 at 10:07 PM, Lee S <sle...@gmail.com> wrote:
> >
> > > Data is in this link:
> > > http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
> > > I converted it to a sequence file with InputDriver.
> > >
> > > 2015-01-06 14:04 GMT+08:00 Ted Dunning <ted.dunn...@gmail.com>:
> > >
> > > > What kind of synthetic data did you use?
> > > >
> > > > On Mon, Jan 5, 2015 at 8:29 PM, Lee S <sle...@gmail.com> wrote:
> > > >
> > > > > Hi, I used the synthetic data to test the kmeans method.
> > > > > I wrote my own code to convert the center points to sequence files.
> > > > > Then I ran kmeans with the parameters (-i input -o output -c center
> > > > > -x 3 -cd 1 -cl) and compared the dumped clusteredPoints with the
> > > > > result of scikit-learn's kmeans; the results are totally different.
> > > > > I'm very confused.
> > > > >
> > > > > Has anybody ever run kmeans with center points provided and
> > > > > compared the result with another ML library?
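
The hypercube example described at the top of the thread can be sketched
without any ML library. This is a minimal pure-Python illustration, not the
R script from the gist: 10 tight Gaussian blobs (sd = 1e-3) on distinct
random corners of the 10-dimensional unit hypercube, clustered by plain
Lloyd's k-means with random (Forgy) initialization. The constants (60 points
per cluster, 25 Lloyd iterations, 30 restarts) are arbitrary choices for
illustration.

```python
import random

random.seed(0)
DIM, K, PER_CLUSTER, SD = 10, 10, 60, 1e-3

# Distinct random corners of the unit hypercube serve as the true centers.
corners = set()
while len(corners) < K:
    corners.add(tuple(random.randint(0, 1) for _ in range(DIM)))
centers_true = [list(c) for c in corners]

# 60 points per cluster, each coordinate perturbed by Gaussian noise.
data = [[c[j] + random.gauss(0, SD) for j in range(DIM)]
        for c in centers_true for _ in range(PER_CLUSTER)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd_sse(points, k, iters=25):
    """One run of Lloyd's algorithm with Forgy (random-point) init.

    Returns the within-cluster sum of squared errors (SSE)."""
    centers = [list(p) for p in random.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # keep the old center if a cluster went empty
                centers[i] = [sum(col) / len(cl) for col in zip(*cl)]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

single = lloyd_sse(data, K)
best_of_30 = min(lloyd_sse(data, K) for _ in range(30))
print(f"single run SSE: {single:.4f}, best of 30 restarts: {best_of_30:.4f}")
```

The ideal SSE here is roughly 600 * 10 * (1e-3)^2 = 0.006; a run that
misses a corner and merges two blobs pays on the order of 15 or more, so a
printout far above 0.006 means that run mis-clustered. Random init picks all
10 clusters with probability about 10!/10^10 (roughly 0.04%), which is why
even a few dozen restarts often fail to find the obvious answer.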