The problem is that I get the same results every time
On Friday, July 11, 2014 7:22 PM, Ameet Talwalkar wrote:
Hi Wanda,
As Sean mentioned, K-means is not guaranteed to find an optimal answer,
even for seemingly simple toy examples. A common heuristic to deal with
this issue is to run kmeans multiple times and choose the best answer. You
can do this by changing the runs parameter from its default value of 1.
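In MLlib 1.0 that heuristic is exposed through one of the `KMeans.train` overloads. A minimal spark-shell sketch (assumes `sc` from the shell; the dataset and parameter values here are illustrative, not from the thread):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// A tiny in-memory dataset, just to show the call shape.
val data = sc.parallelize(Seq(
  Vectors.dense(2.0, 1.0), Vectors.dense(1.0, 2.0),
  Vectors.dense(5.0, 3.0), Vectors.dense(6.0, 3.0)))

// k = 2, up to 20 iterations, 5 restarts with random initialization;
// the restart with the lowest cost is the model that gets returned.
val model = KMeans.train(data, 2, 20, 5, KMeans.RANDOM)
```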
I also took a look at
spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/mllib/DenseKMeans.scala
and ran the code in a shell.
There is an issue here:
"    val initMode = params.initializationMode match {
      case Random => KMeans.RANDOM
      case Parallel => KMeans.K_MEANS_PARALLEL
    }"
I am running spark-1.0.0 with java 1.8
"java version "1.8.0_05"
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)"
"which spark-shell
~/bench/spark-1.0.0/bin/spark-shell"
"which scala
~/bench/scala-2.10.4/bin/scala"
On Thursday,
I ran the example with "./bin/run-example SparkKMeans file.txt 2 0.001"
I get this response:
"Finished iteration (delta = 0.0)
Final centers:
DenseVector(2.8571428571428568, 2.0)
DenseVector(5.6005, 2.0)
"
The start point is not random. It uses the first K points from the given set.
SparkKMeans is a naive implementation. Please use
mllib.clustering.KMeans in practice. I created a JIRA for this:
https://issues.apache.org/jira/browse/SPARK-2434
-Xiangrui
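Xiangrui's suggestion, sketched as a spark-shell session (assumes `sc` from the shell; the file path and parameter values are illustrative stand-ins for the ones used in the thread):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse a whitespace-separated points file like the one in this thread.
val data = sc.textFile("2dim2.txt").map(line =>
  Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))

// Train with k = 2 and up to 20 iterations, then print the centers.
val model = KMeans.train(data, 2, 20)
model.clusterCenters.foreach(println)
```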
On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das
wrote:
I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with
your dataset as well, and I got the expected answer. And I believe that even
though initialization is done using sampling, the example actually sets the
seed to a constant 42, so the result should always be the same no matter
how many times it is run.
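The determinism described here is just a property of seeded pseudo-randomness. A plain-Scala sketch (not the example's actual code) makes the point:

```scala
import scala.util.Random

// Sampling the initial centers with a fixed seed (SparkKMeans hard-codes 42)
// yields the same "random" starting points on every run.
val points = Vector((2, 1), (1, 2), (3, 2), (2, 3), (4, 1), (5, 1))

def initialCenters(seed: Long): Seq[(Int, Int)] =
  new Random(seed).shuffle(points).take(2)

// Every invocation with the same seed picks the same two points,
// so the whole clustering run is repeatable.
assert(initialCenters(42) == initialCenters(42))
```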
so this is what I am running:
"./bin/run-example SparkKMeans ~/Documents/2dim2.txt 2 0.001"
And this is the input file:"
┌───[spark2013@SparkOne]──[~/spark-1.0.0].$
└───#!cat ~/Documents/2dim2.txt
2 1
1 2
3 2
2 3
4 1
5 1
6 1
4 2
6 2
4 3
5 3
6 3
"
This is the final output from spark:
"14/07/1
A picture is worth a thousand... Well, a picture with this dataset, what
you are expecting, and what you get, would help answer your initial
question.
Bertrand
On Thu, Jul 10, 2014 at 10:44 AM, Wanda Hawk wrote:
> Can someone please run the standard kMeans code on this input with 2
> centers
I ran it, and your answer is exactly what I got.
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.clustering._
val vectors = sc.parallelize(
  Array((2,1),(1,2),(3,2),(2,3),(4,1),(5,1),(6,1),(4,2),(6,2),(4,3),(5,3),(6,3))
    .map(p => Vectors.dense(Array[Double](p._1, p._2))))
val model = KMeans.train(vectors, 2, 20)