I ran the example with "./bin/run-example SparkKMeans file.txt 2 0.001" and got this response:

Finished iteration (delta = 0.0)
Final centers: DenseVector(2.8571428571428568, 2.0) DenseVector(5.6000000000000005, 2.0)
The start point is not random. It uses the first K points from the given set.


On Thursday, July 10, 2014 11:57 AM, Sean Owen <so...@cloudera.com> wrote:

I ran it, and your answer is exactly what I got.

import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.clustering._

val vectors = sc.parallelize(Array(
  (2,1),(1,2),(3,2),(2,3),(4,1),(5,1),
  (6,1),(4,2),(6,2),(4,3),(5,3),(6,3)
).map(p => Vectors.dense(Array[Double](p._1, p._2))))

val kmeans = new KMeans()
kmeans.setK(2)
val model = kmeans.run(vectors)

model.clusterCenters
res10: Array[org.apache.spark.mllib.linalg.Vector] = Array([5.0,2.0], [2.0,2.0])

You may be aware that k-means starts from a random set of centroids. It's possible that your run picked one that leads to a suboptimal clustering. This is all the easier on a toy example like this, and you can find examples of it on the internet. That said, I never saw any other answer.

The standard approach is to run many times. Call kmeans.setRuns(10) or something similar to try 10 times instead of once.

On Thu, Jul 10, 2014 at 9:44 AM, Wanda Hawk <wanda_haw...@yahoo.com> wrote:
> Can someone please run the standard kMeans code on this input with 2 centers?:
> 2 1
> 1 2
> 3 2
> 2 3
> 4 1
> 5 1
> 6 1
> 4 2
> 6 2
> 4 3
> 5 3
> 6 3
>
> The obvious result should be (2,2) and (5,2) ... (you can draw them if you
> don't believe me ...)
>
> Thanks,
> Wanda
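[Editor's note] One way to explore the initialization question is a small, standalone reimplementation of Lloyd's k-means on the toy data, run locally on plain Scala collections with no Spark at all. This is only a sketch for experimenting with different starting centers, not the actual SparkKMeans example program; here it is seeded with the first two points (the initialization Wanda describes), and in this simplified version that seed happens to reach the "obvious" answer, so the example program's initialization or iteration details evidently differ.

```scala
// Toy data from the thread, as (x, y) pairs.
type Point = (Double, Double)

val data: Seq[Point] = Seq(
  (2, 1), (1, 2), (3, 2), (2, 3), (4, 1), (5, 1),
  (6, 1), (4, 2), (6, 2), (4, 3), (5, 3), (6, 3)
).map { case (x, y) => (x.toDouble, y.toDouble) }

def sqDist(a: Point, b: Point): Double = {
  val dx = a._1 - b._1; val dy = a._2 - b._2
  dx * dx + dy * dy
}

def mean(ps: Seq[Point]): Point =
  (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)

// One Lloyd's iteration: assign each point to its nearest center,
// then move each center to the mean of its cluster; repeat until
// the centers stop moving. A center that captures no points stays put.
def lloyd(points: Seq[Point], centers: Seq[Point]): Seq[Point] = {
  val clusters = points.groupBy(p => centers.minBy(c => sqDist(p, c)))
  val next = centers.map(c => clusters.get(c).map(mean).getOrElse(c))
  if (next == centers) centers else lloyd(points, next)
}

val k = 2
// Seed with the first k data points instead of a random sample.
val finalCenters = lloyd(data, data.take(k))
println(finalCenters) // → List((5.0,2.0), (2.0,2.0))
```

Swapping `data.take(k)` for other seeds (e.g. two points from the same half of the plane) is an easy way to see how different initializations can land in different local optima, which is the effect Sean's multiple-runs advice guards against.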