Re: KMeans code is rubbish

2014-07-13 Thread Wanda Hawk
The problem is that I get the same results every time.

Re: KMeans code is rubbish

2014-07-11 Thread Ameet Talwalkar
Hi Wanda, As Sean mentioned, K-means is not guaranteed to find an optimal answer, even for seemingly simple toy examples. A common heuristic to deal with this issue is to run kmeans multiple times and choose the best answer. You can do this by changing the runs parameter from the default value (1).
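
[Editor's note: a minimal sketch of the multiple-runs heuristic Ameet describes, in plain Python rather than Spark, on the 12-point dataset from this thread. Instead of random restarts it deterministically tries every pair of data points as a starting configuration — the same idea, easier to reproduce. The function names are illustrative, not Spark API.]

```python
from itertools import combinations

# The 12-point dataset from this thread: two obvious clusters
# centred at (2, 2) and (5, 2).
POINTS = [(2, 1), (1, 2), (3, 2), (2, 3), (4, 1), (5, 1),
          (6, 1), (4, 2), (6, 2), (4, 3), (5, 3), (6, 3)]

def sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def lloyd(points, centers, iters=100):
    """Plain Lloyd iteration; returns (centers, total squared error)."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[min(range(len(centers)),
                         key=lambda i: sq_dist(p, centers[i]))].append(p)
        centers = [(sum(x for x, _ in c) / len(c),
                    sum(y for _, y in c) / len(c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    cost = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, cost

# "Run kmeans multiple times and choose the best answer": try every
# pair of data points as an initialisation and keep the lowest cost.
best_centers, best_cost = min(
    (lloyd(POINTS, [tuple(map(float, a)), tuple(map(float, b))])
     for a, b in combinations(POINTS, 2)),
    key=lambda r: r[1])

print(sorted(best_centers), best_cost)   # -> [(2.0, 2.0), (5.0, 2.0)] 16.0
```

With enough restarts at least one initialisation lands in the basin of the intuitive answer, which is why the runs parameter fixes exactly the behaviour Wanda is seeing.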

Re: KMeans code is rubbish

2014-07-11 Thread Wanda Hawk
I also took a look at spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/mllib/DenseKMeans.scala and ran the code in a shell. There is an issue here:

    val initMode = params.initializationMode match {
      case Random => KMeans.RANDOM
      case Parallel => KMeans.K_MEANS_PARALLEL
    }

Re: KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
I am running spark-1.0.0 with java 1.8:

    java version "1.8.0_05"
    Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
    Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)

    which spark-shell
    ~/bench/spark-1.0.0/bin/spark-shell
    which scala
    ~/bench/scala-2.10.4/bin/scala

Re: KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
I ran the example with "./bin/run-example SparkKMeans file.txt 2 0.001" and I get this response:

    Finished iteration (delta = 0.0)
    Final centers:
    DenseVector(2.8571428571428568, 2.0)
    DenseVector(5.6005, 2.0)

The start point is not random. It uses the first K points from the given set.
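
[Editor's note: a plain-Python check, not Spark code, that the centers Wanda reports are a genuine local optimum. (2.857…, 2) is the mean of the seven points with x <= 4 and (5.6, 2) the mean of the five points with x >= 5; one Lloyd step maps this configuration to itself, so k-means started nearby stays stuck there.]

```python
POINTS = [(2, 1), (1, 2), (3, 2), (2, 3), (4, 1), (5, 1),
          (6, 1), (4, 2), (6, 2), (4, 3), (5, 3), (6, 3)]

def sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def lloyd_step(centers):
    """One assign-then-recompute step of k-means."""
    clusters = [[] for _ in centers]
    for p in POINTS:
        clusters[min(range(len(centers)),
                     key=lambda i: sq_dist(p, centers[i]))].append(p)
    return [(sum(x for x, _ in c) / len(c),
             sum(y for _, y in c) / len(c)) for c in clusters]

def cost(centers):
    return sum(min(sq_dist(p, c) for c in centers) for p in POINTS)

bad  = [(20 / 7, 2.0), (5.6, 2.0)]   # the centers Wanda reports
good = [(2.0, 2.0), (5.0, 2.0)]      # the intuitive answer

print(lloyd_step(bad) == bad)   # -> True: a fixed point, iterating further changes nothing
print(cost(bad), cost(good))    # ~18.06 vs 16.0: both are local optima, only one is best
```

So the answer is not a bug in the iteration itself: it is a worse local optimum that a single deterministic run cannot escape.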

Re: KMeans code is rubbish

2014-07-10 Thread Xiangrui Meng
SparkKMeans is a naive implementation. Please use mllib.clustering.KMeans in practice. I created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-2434 -Xiangrui

Re: KMeans code is rubbish

2014-07-10 Thread Tathagata Das
I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your dataset as well, and I got the expected answer. And I believe that even though initialization is done using sampling, the example actually sets the seed to a constant 42, so the result should always be the same no matter how many times it is run.
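
[Editor's note: a one-line illustration of TD's point, in Python as a stand-in for the Scala example. Seeded sampling is deterministic, which is exactly why Wanda sees the same "random" result on every run.]

```python
import random

points = [(2, 1), (1, 2), (3, 2), (2, 3), (4, 1), (5, 1),
          (6, 1), (4, 2), (6, 2), (4, 3), (5, 3), (6, 3)]

# With a constant seed, the "random" initial centers come out identical
# on every run, so the whole algorithm is deterministic end to end.
first  = random.Random(42).sample(points, 2)
second = random.Random(42).sample(points, 2)
print(first == second)   # -> True
```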

Re: KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
so this is what I am running: "./bin/run-example SparkKMeans ~/Documents/2dim2.txt 2 0.001" And this is the input file (one point per line):

    $ cat ~/Documents/2dim2.txt
    2 1
    1 2
    3 2
    2 3
    4 1
    5 1
    6 1
    4 2
    6 2
    4 3
    5 3
    6 3

This is the final output from spark: "14/07/1

Re: KMeans code is rubbish

2014-07-10 Thread Bertrand Dechoux
A picture is worth a thousand... Well, a picture with this dataset, what you are expecting and what you get, would help answer your initial question. Bertrand On Thu, Jul 10, 2014 at 10:44 AM, Wanda Hawk wrote: > Can someone please run the standard kMeans code on this input with 2 centers

Re: KMeans code is rubbish

2014-07-10 Thread Sean Owen
I ran it, and your answer is exactly what I got.

    import org.apache.spark.mllib.linalg._
    import org.apache.spark.mllib.clustering._
    val vectors = sc.parallelize(Array((2,1),(1,2),(3,2),(2,3),(4,1),(5,1),(6,1),(4,2),(6,2),(4,3),(5,3),(6,3)).map(p => Vectors.dense(Array[Double](p._1, p._2))))
    val