I ran the example with "./bin/run-example SparkKMeans file.txt 2 0.001" and got this response:

Finished iteration (delta = 0.0)
Final centers: DenseVector(2.8571428571428568, 2.0) DenseVector(5.6000000000000005, 2.0)
The start point is not random. It uses the first K points from the given set.


On Thursday, July 10, 2014 11:57 AM, Sean Owen <so...@cloudera.com> wrote:

I ran it, and your answer is exactly what I got.

import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.clustering._

val vectors = sc.parallelize(Array(
  (2,1),(1,2),(3,2),(2,3),(4,1),(5,1),
  (6,1),(4,2),(6,2),(4,3),(5,3),(6,3)
).map(p => Vectors.dense(Array[Double](p._1, p._2))))

val kmeans = new KMeans()
kmeans.setK(2)
val model = kmeans.run(vectors)

model.clusterCenters
res10: Array[org.apache.spark.mllib.linalg.Vector] = Array([5.0,2.0], [2.0,2.0])

You may be aware that k-means starts from a random set of centroids. It's possible that your run picked one that leads to a suboptimal clustering. This is all the easier on a toy example like this, and you can find examples of it on the internet. That said, I never saw any other answer.

The standard approach is to run many times. Call kmeans.setRuns(10) or something similar to try 10 times instead of once.

On Thu, Jul 10, 2014 at 9:44 AM, Wanda Hawk <wanda_haw...@yahoo.com> wrote:
> Can someone please run the standard kMeans code on this input with 2 centers?:
> 2 1
> 1 2
> 3 2
> 2 3
> 4 1
> 5 1
> 6 1
> 4 2
> 6 2
> 4 3
> 5 3
> 6 3
>
> The obvious result should be (2,2) and (5,2) ... (you can draw them if you
> don't believe me ...)
>
> Thanks,
> Wanda
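[Editor's note] One way to explore the initialization question is a small, standalone reimplementation of Lloyd's k-means on the toy data, run locally on plain Scala collections with no Spark at all. This is only a sketch for experimenting with different starting centers, not the actual SparkKMeans example program; here it is seeded with the first two points (the initialization Wanda describes), and in this simplified version that seed happens to reach the "obvious" answer, so the example program's initialization or iteration details evidently differ.

```scala
// Toy data from the thread, as (x, y) pairs.
type Point = (Double, Double)

val data: Seq[Point] = Seq(
  (2, 1), (1, 2), (3, 2), (2, 3), (4, 1), (5, 1),
  (6, 1), (4, 2), (6, 2), (4, 3), (5, 3), (6, 3)
).map { case (x, y) => (x.toDouble, y.toDouble) }

def sqDist(a: Point, b: Point): Double = {
  val dx = a._1 - b._1; val dy = a._2 - b._2
  dx * dx + dy * dy
}

def mean(ps: Seq[Point]): Point =
  (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)

// One Lloyd's iteration: assign each point to its nearest center,
// then move each center to the mean of its cluster; repeat until
// the centers stop moving. A center that captures no points stays put.
def lloyd(points: Seq[Point], centers: Seq[Point]): Seq[Point] = {
  val clusters = points.groupBy(p => centers.minBy(c => sqDist(p, c)))
  val next = centers.map(c => clusters.get(c).map(mean).getOrElse(c))
  if (next == centers) centers else lloyd(points, next)
}

val k = 2
// Seed with the first k data points instead of a random sample.
val finalCenters = lloyd(data, data.take(k))
println(finalCenters) // → List((5.0,2.0), (2.0,2.0))
```

Swapping `data.take(k)` for other seeds (e.g. two points from the same half of the plane) is an easy way to see how different initializations can land in different local optima, which is the effect Sean's multiple-runs advice guards against.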