Re: KMeans code is rubbish

2014-07-14 Thread Wanda Hawk
The problem is that I get the same results every time


On Friday, July 11, 2014 7:22 PM, Ameet Talwalkar atalwal...@gmail.com wrote:
 


Hi Wanda,

As Sean mentioned, K-means is not guaranteed to find an optimal answer, even 
for seemingly simple toy examples. A common heuristic to deal with this issue 
is to run kmeans multiple times and choose the best answer.  You can do this by 
changing the runs parameter from the default value (1) to something larger (say 
10).

-Ameet



On Fri, Jul 11, 2014 at 1:20 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:

I also took a look at 
spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/mllib/DenseKMeans.scala
 and ran the code in a shell.


There is an issue here:
    val initMode = params.initializationMode match {
      case Random => KMeans.RANDOM
      case Parallel => KMeans.K_MEANS_PARALLEL
    }



If I use initMode=KMeans.RANDOM everything is ok.
If I use initMode=KMeans.K_MEANS_PARALLEL I get a wrong result. I do not know 
why. The example proposed is a really simple one that should not accept 
multiple solutions and always converge to the correct one.


Now what can be altered in the original SparkKMeans.scala (the seed or 
something else?) to get the correct results each and every single time?
On Thursday, July 10, 2014 7:58 PM, Xiangrui Meng men...@gmail.com wrote:
 


SparkKMeans is a naive implementation. Please use
mllib.clustering.KMeans in practice. I created a JIRA for this:
https://issues.apache.org/jira/browse/SPARK-2434 -Xiangrui


On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
 I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your
 dataset as well, I got the expected answer. And I believe that even though
 initialization is done using sampling, the example actually sets the seed to
 a constant 42, so the result should always be the same no matter how many
 times you run it. So I am not really sure what's going on here.

 Can you tell us more about which version of Spark you are running? Which
 Java version?


 ==

 [tdas @ Xion spark2] cat input
 2 1
 1 2
 3 2
 2 3
 4 1
 5 1
 6 1
 4 2
 6 2
 4 3
 5 3
 6 3
 [tdas @ Xion spark2] ./bin/run-example SparkKMeans input 2 0.001
 2014-07-10 02:45:06.764 java[45244:d17] Unable to load realm info
 from
 SCDynamicStore
 14/07/10 02:45:07 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 14/07/10 02:45:07 WARN LoadSnappy:
 Snappy native library not loaded
 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from:
 com.github.fommil.netlib.NativeSystemBLAS
 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from:
 com.github.fommil.netlib.NativeRefBLAS
 Finished iteration (delta = 3.0)
 Finished iteration (delta = 0.0)
 Final centers:
 DenseVector(5.0, 2.0)
 DenseVector(2.0, 2.0)



 On Thu, Jul 10, 2014 at 2:17 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:

 so this is what I am running:
 ./bin/run-example SparkKMeans
 ~/Documents/2dim2.txt 2 0.001

 And this is the input file:
 ┌───[spark2013@SparkOne]──[~/spark-1.0.0].$
 └───#!cat ~/Documents/2dim2.txt
 2 1
 1 2
 3 2
 2 3
 4 1
 5 1
 6 1
 4 2
 6 2
 4 3
 5 3
 6 3
 

 This is the final output from spark:
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Getting 2 non-empty blocks
 out of 2 blocks
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Started 0 remote fetches in 0 ms
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 maxBytesInFlight: 50331648, targetRequestSize: 10066329
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Getting 2 non-empty blocks out of 2 blocks
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Started 0 remote fetches in 0 ms
 14/07/10 20:05:12 INFO Executor: Serialized size of result for 14 is 1433
 14/07/10 20:05:12 INFO Executor: Sending result for 14 directly to driver
 14/07/10 20:05:12 INFO Executor: Finished task ID 14

 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 0)
 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 14 in 5 ms on
 localhost (progress: 1/2)
 14/07/10 20:05:12 INFO Executor: Serialized size of result for 15 is 1433
 14/07/10 20:05:12 INFO Executor: Sending result for 15 directly to driver
 14/07/10 20:05:12 INFO Executor: Finished task ID 15
 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 1)
 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 15 in 7 ms on
 localhost (progress: 2/2)
 14/07/10 20:05:12 INFO DAGScheduler: Stage 6 (collectAsMap at
 SparkKMeans.scala:75) finished in 0.008 s
 14/07/10 20:05:12 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose
 tasks
 have all completed, from pool
 14/07/10 20:05:12 INFO SparkContext: Job finished: collectAsMap at
 SparkKMeans.scala:75, took 0.02472681 s
 Finished iteration (delta = 0.0)
 Final centers:
 

Re: KMeans code is rubbish

2014-07-11 Thread Wanda Hawk
I also took a look at 
spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/mllib/DenseKMeans.scala
 and ran the code in a shell.

There is an issue here:
    val initMode = params.initializationMode match {
      case Random => KMeans.RANDOM
      case Parallel => KMeans.K_MEANS_PARALLEL
    }


If I use initMode=KMeans.RANDOM everything is ok.
If I use initMode=KMeans.K_MEANS_PARALLEL I get a wrong result. I do not know 
why. The example proposed is a really simple one that should not accept 
multiple solutions and always converge to the correct one.
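
In case it helps to compare the two modes directly, here is a minimal sketch against the 1.0.0 MLlib API (assumes a spark-shell session where sc is already defined; the file name is only a placeholder for the 12 points in this thread):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("2dim2.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// same data, two initialization modes, so the centers can be compared side by side
val randomModel   = new KMeans().setK(2).setInitializationMode(KMeans.RANDOM).run(points)
val parallelModel = new KMeans().setK(2).setInitializationMode(KMeans.K_MEANS_PARALLEL).run(points)

println(randomModel.clusterCenters.mkString(", "))
println(parallelModel.clusterCenters.mkString(", "))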

Now what can be altered in the original SparkKMeans.scala (the seed or 
something else?) to get the correct results each and every single time?
On Thursday, July 10, 2014 7:58 PM, Xiangrui Meng men...@gmail.com wrote:
 


SparkKMeans is a naive implementation. Please use
mllib.clustering.KMeans in practice. I created a JIRA for this:
https://issues.apache.org/jira/browse/SPARK-2434 -Xiangrui


On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
 I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your
 dataset as well, I got the expected answer. And I believe that even though
 initialization is done using sampling, the example actually sets the seed to
 a constant 42, so the result should always be the same no matter how many
 times you run it. So I am not really sure what's going on here.

 Can you tell us more about which version of Spark you are running? Which
 Java version?


 ==

 [tdas @ Xion spark2] cat input
 2 1
 1 2
 3 2
 2 3
 4 1
 5 1
 6 1
 4 2
 6 2
 4 3
 5 3
 6 3
 [tdas @ Xion spark2] ./bin/run-example SparkKMeans input 2 0.001
 2014-07-10 02:45:06.764 java[45244:d17] Unable to load realm info from
 SCDynamicStore
 14/07/10 02:45:07 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 14/07/10 02:45:07 WARN LoadSnappy:
 Snappy native library not loaded
 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from:
 com.github.fommil.netlib.NativeSystemBLAS
 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from:
 com.github.fommil.netlib.NativeRefBLAS
 Finished iteration (delta = 3.0)
 Finished iteration (delta = 0.0)
 Final centers:
 DenseVector(5.0, 2.0)
 DenseVector(2.0, 2.0)



 On Thu, Jul 10, 2014 at 2:17 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:

 so this is what I am running:
 ./bin/run-example SparkKMeans
 ~/Documents/2dim2.txt 2 0.001

 And this is the input file:
 ┌───[spark2013@SparkOne]──[~/spark-1.0.0].$
 └───#!cat ~/Documents/2dim2.txt
 2 1
 1 2
 3 2
 2 3
 4 1
 5 1
 6 1
 4 2
 6 2
 4 3
 5 3
 6 3
 

 This is the final output from spark:
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Getting 2 non-empty blocks
 out of 2 blocks
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Started 0 remote fetches in 0 ms
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 maxBytesInFlight: 50331648, targetRequestSize: 10066329
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Getting 2 non-empty blocks out of 2 blocks
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Started 0 remote fetches in 0 ms
 14/07/10 20:05:12 INFO Executor: Serialized size of result for 14 is 1433
 14/07/10 20:05:12 INFO Executor: Sending result for 14 directly to driver
 14/07/10 20:05:12 INFO Executor: Finished task ID 14

 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 0)
 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 14 in 5 ms on
 localhost (progress: 1/2)
 14/07/10 20:05:12 INFO Executor: Serialized size of result for 15 is 1433
 14/07/10 20:05:12 INFO Executor: Sending result for 15 directly to driver
 14/07/10 20:05:12 INFO Executor: Finished task ID 15
 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 1)
 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 15 in 7 ms on
 localhost (progress: 2/2)
 14/07/10 20:05:12 INFO DAGScheduler: Stage 6 (collectAsMap at
 SparkKMeans.scala:75) finished in 0.008 s
 14/07/10 20:05:12 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose
 tasks
 have all completed, from pool
 14/07/10 20:05:12 INFO SparkContext: Job finished: collectAsMap at
 SparkKMeans.scala:75, took 0.02472681 s
 Finished iteration (delta = 0.0)
 Final centers:
 DenseVector(2.8571428571428568, 2.0)
 DenseVector(5.6005, 2.0)
 




 On Thursday, July 10, 2014 12:02 PM, Bertrand Dechoux decho...@gmail.com
 wrote:


 A picture is worth a thousand... Well, a picture with this dataset, what

 you are expecting and what you get, would help answering your initial
 question.

 Bertrand


 On Thu, Jul 10, 2014 at 10:44 AM, Wanda Hawk wanda_haw...@yahoo.com
 wrote:

 Can someone please run the standard kMeans code on this input with 2
 centers ?:
 2 1
 1 2
 3 2
 2 3
 4 1
 5 1
 6 1
 4 2
 6 2
 4 3
 5 3
 6 3

 The obvious result 

Re: KMeans code is rubbish

2014-07-11 Thread Ameet Talwalkar
Hi Wanda,

As Sean mentioned, K-means is not guaranteed to find an optimal answer,
even for seemingly simple toy examples. A common heuristic to deal with
this issue is to run kmeans multiple times and choose the best answer.  You
can do this by changing the runs parameter from the default value (1) to
something larger (say 10).
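
As a concrete sketch of that (spark-shell against 1.0.0; the file name is only a placeholder for the dataset in this thread):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("2dim2.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// 10 runs from different random starts; the clustering with the lowest cost is kept
val model = new KMeans().setK(2).setRuns(10).run(points)
model.clusterCenters.foreach(println)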

-Ameet


On Fri, Jul 11, 2014 at 1:20 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:

 I also took a look
 at 
 spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/mllib/DenseKMeans.scala
 and ran the code in a shell.

 There is an issue here:
 val initMode = params.initializationMode match {
   case Random => KMeans.RANDOM
   case Parallel => KMeans.K_MEANS_PARALLEL
 }
 

 If I use initMode=KMeans.RANDOM everything is ok.
 If I use initMode=KMeans.K_MEANS_PARALLEL I get a wrong result. I do not
 know why. The example proposed is a really simple one that should not
 accept multiple solutions and always converge to the correct one.

 Now what can be altered in the original SparkKMeans.scala (the seed or
  something else?) to get the correct results each and every single time?
On Thursday, July 10, 2014 7:58 PM, Xiangrui Meng men...@gmail.com
 wrote:


 SparkKMeans is a naive implementation. Please use
 mllib.clustering.KMeans in practice. I created a JIRA for this:
 https://issues.apache.org/jira/browse/SPARK-2434 -Xiangrui

 On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das
 tathagata.das1...@gmail.com wrote:
  I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with
 your
  dataset as well, I got the expected answer. And I believe that even
 though
  initialization is done using sampling, the example actually sets the
 seed to
  a constant 42, so the result should always be the same no matter how many
  times you run it. So I am not really sure what's going on here.
 
  Can you tell us more about which version of Spark you are running? Which
  Java version?
 
 
  ==
 
  [tdas @ Xion spark2] cat input
  2 1
  1 2
  3 2
  2 3
  4 1
  5 1
  6 1
  4 2
  6 2
  4 3
  5 3
  6 3
  [tdas @ Xion spark2] ./bin/run-example SparkKMeans input 2 0.001
  2014-07-10 02:45:06.764 java[45244:d17] Unable to load realm info from
  SCDynamicStore
  14/07/10 02:45:07 WARN NativeCodeLoader: Unable to load native-hadoop
  library for your platform... using builtin-java classes where applicable
  14/07/10 02:45:07 WARN LoadSnappy: Snappy native library not loaded
  14/07/10 02:45:08 WARN BLAS: Failed to load implementation from:
  com.github.fommil.netlib.NativeSystemBLAS
  14/07/10 02:45:08 WARN BLAS: Failed to load implementation from:
  com.github.fommil.netlib.NativeRefBLAS
  Finished iteration (delta = 3.0)
  Finished iteration (delta = 0.0)
  Final centers:
  DenseVector(5.0, 2.0)
  DenseVector(2.0, 2.0)
 
 
 
  On Thu, Jul 10, 2014 at 2:17 AM, Wanda Hawk wanda_haw...@yahoo.com
 wrote:
 
  so this is what I am running:
  ./bin/run-example SparkKMeans ~/Documents/2dim2.txt 2 0.001
 
  And this is the input file:
  ┌───[spark2013@SparkOne]──[~/spark-1.0.0].$
  └───#!cat ~/Documents/2dim2.txt
  2 1
  1 2
  3 2
  2 3
  4 1
  5 1
  6 1
  4 2
  6 2
  4 3
  5 3
  6 3
  
 
  This is the final output from spark:
  14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
  Getting 2 non-empty blocks out of 2 blocks
  14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
  Started 0 remote fetches in 0 ms
  14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
  maxBytesInFlight: 50331648, targetRequestSize: 10066329
  14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
  Getting 2 non-empty blocks out of 2 blocks
  14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
  Started 0 remote fetches in 0 ms
  14/07/10 20:05:12 INFO Executor: Serialized size of result for 14 is
 1433
  14/07/10 20:05:12 INFO Executor: Sending result for 14 directly to
 driver
  14/07/10 20:05:12 INFO Executor: Finished task ID 14
  14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 0)
  14/07/10 20:05:12 INFO TaskSetManager: Finished TID 14 in 5 ms on
  localhost (progress: 1/2)
  14/07/10 20:05:12 INFO Executor: Serialized size of result for 15 is
 1433
  14/07/10 20:05:12 INFO Executor: Sending result for 15 directly to
 driver
  14/07/10 20:05:12 INFO Executor: Finished task ID 15
  14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 1)
  14/07/10 20:05:12 INFO TaskSetManager: Finished TID 15 in 7 ms on
  localhost (progress: 2/2)
  14/07/10 20:05:12 INFO DAGScheduler: Stage 6 (collectAsMap at
  SparkKMeans.scala:75) finished in 0.008 s
  14/07/10 20:05:12 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose
 tasks
  have all completed, from pool
  14/07/10 20:05:12 INFO SparkContext: Job finished: collectAsMap at
  SparkKMeans.scala:75, took 0.02472681 s
  Finished iteration (delta = 0.0)
  Final centers:
  DenseVector(2.8571428571428568, 2.0)
  

KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
Can someone please run the standard kMeans code on this input with 2 centers ?:
2 1
1 2
3 2
2 3
4 1
5 1
6 1
4 2
6 2
4 3
5 3
6 3

The obvious result should be (2,2) and (5,2) ... (you can draw them if you 
don't believe me ...)

Thanks, 
Wanda

Re: KMeans code is rubbish

2014-07-10 Thread Sean Owen
I ran it, and your answer is exactly what I got.

import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.clustering._

val vectors = sc.parallelize(
  Array((2,1),(1,2),(3,2),(2,3),(4,1),(5,1),(6,1),(4,2),(6,2),(4,3),(5,3),(6,3))
    .map(p => Vectors.dense(Array[Double](p._1, p._2))))

val kmeans = new KMeans()
kmeans.setK(2)
val model = kmeans.run(vectors)

model.clusterCenters

res10: Array[org.apache.spark.mllib.linalg.Vector] = Array([5.0,2.0], [2.0,2.0])

You may be aware that k-means starts from a random set of centroids.
It's possible that your run picked one that leads to a suboptimal
clustering. This is all the easier on a toy example like this and you
can find examples on the internet. That said, I never saw any other
answer.

The standard approach is to run many times. Call kmeans.setRuns(10) or
something to try 10 times instead of once.
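
Continuing the snippet above, that would be something like:

kmeans.setRuns(10)  // keep the best of 10 random initializations
val model = kmeans.run(vectors)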

On Thu, Jul 10, 2014 at 9:44 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:
 Can someone please run the standard kMeans code on this input with 2 centers
 ?:
 2 1
 1 2
 3 2
 2 3
 4 1
 5 1
 6 1
 4 2
 6 2
 4 3
 5 3
 6 3

 The obvious result should be (2,2) and (5,2) ... (you can draw them if you
 don't believe me ...)

 Thanks,
 Wanda


Re: KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
so this is what I am running: 
./bin/run-example SparkKMeans ~/Documents/2dim2.txt 2 0.001

And this is the input file:
┌───[spark2013@SparkOne]──[~/spark-1.0.0].$
└───#!cat ~/Documents/2dim2.txt
2 1
1 2
3 2
2 3
4 1
5 1
6 1
4 2
6 2
4 3
5 3
6 3


This is the final output from spark:
14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 
2 non-empty blocks out of 2 blocks
14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 
0 remote fetches in 0 ms
14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: 
maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 
2 non-empty blocks out of 2 blocks
14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 
0 remote fetches in 0 ms
14/07/10 20:05:12 INFO Executor: Serialized size of result for 14 is 1433
14/07/10 20:05:12 INFO Executor: Sending result for 14 directly to driver
14/07/10 20:05:12 INFO Executor: Finished task ID 14
14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 0)
14/07/10 20:05:12 INFO TaskSetManager: Finished TID 14 in 5 ms on localhost 
(progress: 1/2)
14/07/10 20:05:12 INFO Executor: Serialized size of result for 15 is 1433
14/07/10 20:05:12 INFO Executor: Sending result for 15 directly to driver
14/07/10 20:05:12 INFO Executor: Finished task ID 15
14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 1)
14/07/10 20:05:12 INFO TaskSetManager: Finished TID 15 in 7 ms on localhost 
(progress: 2/2)
14/07/10 20:05:12 INFO DAGScheduler: Stage 6 (collectAsMap at 
SparkKMeans.scala:75) finished in 0.008 s
14/07/10 20:05:12 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have 
all completed, from pool
14/07/10 20:05:12 INFO SparkContext: Job finished: collectAsMap at 
SparkKMeans.scala:75, took 0.02472681 s
Finished iteration (delta = 0.0)
Final centers:
DenseVector(2.8571428571428568, 2.0)
DenseVector(5.6005, 2.0)





On Thursday, July 10, 2014 12:02 PM, Bertrand Dechoux decho...@gmail.com 
wrote:
 


A picture is worth a thousand... Well, a picture with this dataset, what you 
are expecting and what you get, would help answering your initial question.


Bertrand


On Thu, Jul 10, 2014 at 10:44 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:

Can someone please run the standard kMeans code on this input with 2 centers ?:
2 1
1 2
3 2
2 3
4 1
5 1
6 1
4 2
6 2
4 3
5 3
6 3


The obvious result should be (2,2) and (5,2) ... (you can draw them if you 
don't believe me ...)


Thanks, 
Wanda

Re: KMeans code is rubbish

2014-07-10 Thread Tathagata Das
I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with
your dataset as well, I got the expected answer. And I believe that even
though initialization is done using sampling, the example actually sets the
seed to a constant 42, so the result should always be the same no matter
how many times you run it. So I am not really sure what's going on here.
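
For reference, the relevant line in the bundled example is roughly this (from memory of the 1.0.0 examples source, so treat the exact line as approximate):

// org.apache.spark.examples.SparkKMeans: the initial centers are a seeded sample,
// so repeated runs over the same data should start from the same points
val kPoints = data.takeSample(withReplacement = false, K, 42).toArray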

Can you tell us more about which version of Spark you are running? Which
Java version?


==

[tdas @ Xion spark2] cat input
2 1
1 2
3 2
2 3
4 1
5 1
6 1
4 2
6 2
4 3
5 3
6 3
[tdas @ Xion spark2] ./bin/run-example SparkKMeans input 2 0.001
2014-07-10 02:45:06.764 java[45244:d17] Unable to load realm info from
SCDynamicStore
14/07/10 02:45:07 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
14/07/10 02:45:07 WARN LoadSnappy: Snappy native library not loaded
14/07/10 02:45:08 WARN BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeSystemBLAS
14/07/10 02:45:08 WARN BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeRefBLAS
Finished iteration (delta = 3.0)
Finished iteration (delta = 0.0)
Final centers:
DenseVector(5.0, 2.0)
DenseVector(2.0, 2.0)



On Thu, Jul 10, 2014 at 2:17 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:

 so this is what I am running:
 ./bin/run-example SparkKMeans ~/Documents/2dim2.txt 2 0.001

 And this is the input file:
 ┌───[spark2013@SparkOne]──[~/spark-1.0.0].$
 └───#!cat ~/Documents/2dim2.txt
 2 1
 1 2
 3 2
 2 3
 4 1
 5 1
 6 1
 4 2
 6 2
 4 3
 5 3
 6 3
 

 This is the final output from spark:
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Getting 2 non-empty blocks out of 2 blocks
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Started 0 remote fetches in 0 ms
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 maxBytesInFlight: 50331648, targetRequestSize: 10066329
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Getting 2 non-empty blocks out of 2 blocks
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Started 0 remote fetches in 0 ms
 14/07/10 20:05:12 INFO Executor: Serialized size of result for 14 is 1433
 14/07/10 20:05:12 INFO Executor: Sending result for 14 directly to driver
 14/07/10 20:05:12 INFO Executor: Finished task ID 14
 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 0)
 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 14 in 5 ms on
 localhost (progress: 1/2)
 14/07/10 20:05:12 INFO Executor: Serialized size of result for 15 is 1433
 14/07/10 20:05:12 INFO Executor: Sending result for 15 directly to driver
 14/07/10 20:05:12 INFO Executor: Finished task ID 15
 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 1)
 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 15 in 7 ms on
 localhost (progress: 2/2)
 14/07/10 20:05:12 INFO DAGScheduler: Stage 6 (collectAsMap at
 SparkKMeans.scala:75) finished in 0.008 s
 14/07/10 20:05:12 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks
 have all completed, from pool
 14/07/10 20:05:12 INFO SparkContext: Job finished: collectAsMap at
 SparkKMeans.scala:75, took 0.02472681 s
 Finished iteration (delta = 0.0)
 Final centers:
 DenseVector(2.8571428571428568, 2.0)
 DenseVector(5.6005, 2.0)
 




   On Thursday, July 10, 2014 12:02 PM, Bertrand Dechoux 
 decho...@gmail.com wrote:


 A picture is worth a thousand... Well, a picture with this dataset, what
 you are expecting and what you get, would help answering your initial
 question.

 Bertrand


 On Thu, Jul 10, 2014 at 10:44 AM, Wanda Hawk wanda_haw...@yahoo.com
 wrote:

 Can someone please run the standard kMeans code on this input with 2
 centers ?:
 2 1
 1 2
 3 2
 2 3
 4 1
 5 1
 6 1
 4 2
 6 2
 4 3
 5 3
 6 3

 The obvious result should be (2,2) and (5,2) ... (you can draw them if you
 don't believe me ...)

 Thanks,
  Wanda







Re: KMeans code is rubbish

2014-07-10 Thread Xiangrui Meng
SparkKMeans is a naive implementation. Please use
mllib.clustering.KMeans in practice. I created a JIRA for this:
https://issues.apache.org/jira/browse/SPARK-2434 -Xiangrui
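
For anyone following along, a rough sketch of the drop-in replacement (argument values are only examples):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("2dim2.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// k = 2, at most 20 iterations, best of 10 runs, k-means|| initialization
val model = KMeans.train(data, 2, 20, 10, KMeans.K_MEANS_PARALLEL)
model.clusterCenters.foreach(println)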

On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
 I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your
 dataset as well, I got the expected answer. And I believe that even though
 initialization is done using sampling, the example actually sets the seed to
 a constant 42, so the result should always be the same no matter how many
 times you run it. So I am not really sure what's going on here.

 Can you tell us more about which version of Spark you are running? Which
 Java version?


 ==

 [tdas @ Xion spark2] cat input
 2 1
 1 2
 3 2
 2 3
 4 1
 5 1
 6 1
 4 2
 6 2
 4 3
 5 3
 6 3
 [tdas @ Xion spark2] ./bin/run-example SparkKMeans input 2 0.001
 2014-07-10 02:45:06.764 java[45244:d17] Unable to load realm info from
 SCDynamicStore
 14/07/10 02:45:07 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 14/07/10 02:45:07 WARN LoadSnappy: Snappy native library not loaded
 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from:
 com.github.fommil.netlib.NativeSystemBLAS
 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from:
 com.github.fommil.netlib.NativeRefBLAS
 Finished iteration (delta = 3.0)
 Finished iteration (delta = 0.0)
 Final centers:
 DenseVector(5.0, 2.0)
 DenseVector(2.0, 2.0)



 On Thu, Jul 10, 2014 at 2:17 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:

 so this is what I am running:
 ./bin/run-example SparkKMeans ~/Documents/2dim2.txt 2 0.001

 And this is the input file:
 ┌───[spark2013@SparkOne]──[~/spark-1.0.0].$
 └───#!cat ~/Documents/2dim2.txt
 2 1
 1 2
 3 2
 2 3
 4 1
 5 1
 6 1
 4 2
 6 2
 4 3
 5 3
 6 3
 

 This is the final output from spark:
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Getting 2 non-empty blocks out of 2 blocks
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Started 0 remote fetches in 0 ms
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 maxBytesInFlight: 50331648, targetRequestSize: 10066329
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Getting 2 non-empty blocks out of 2 blocks
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Started 0 remote fetches in 0 ms
 14/07/10 20:05:12 INFO Executor: Serialized size of result for 14 is 1433
 14/07/10 20:05:12 INFO Executor: Sending result for 14 directly to driver
 14/07/10 20:05:12 INFO Executor: Finished task ID 14
 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 0)
 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 14 in 5 ms on
 localhost (progress: 1/2)
 14/07/10 20:05:12 INFO Executor: Serialized size of result for 15 is 1433
 14/07/10 20:05:12 INFO Executor: Sending result for 15 directly to driver
 14/07/10 20:05:12 INFO Executor: Finished task ID 15
 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 1)
 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 15 in 7 ms on
 localhost (progress: 2/2)
 14/07/10 20:05:12 INFO DAGScheduler: Stage 6 (collectAsMap at
 SparkKMeans.scala:75) finished in 0.008 s
 14/07/10 20:05:12 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks
 have all completed, from pool
 14/07/10 20:05:12 INFO SparkContext: Job finished: collectAsMap at
 SparkKMeans.scala:75, took 0.02472681 s
 Finished iteration (delta = 0.0)
 Final centers:
 DenseVector(2.8571428571428568, 2.0)
 DenseVector(5.6005, 2.0)
 




 On Thursday, July 10, 2014 12:02 PM, Bertrand Dechoux decho...@gmail.com
 wrote:


 A picture is worth a thousand... Well, a picture with this dataset, what
 you are expecting and what you get, would help answering your initial
 question.

 Bertrand


 On Thu, Jul 10, 2014 at 10:44 AM, Wanda Hawk wanda_haw...@yahoo.com
 wrote:

 Can someone please run the standard kMeans code on this input with 2
 centers ?:
 2 1
 1 2
 3 2
 2 3
 4 1
 5 1
 6 1
 4 2
 6 2
 4 3
 5 3
 6 3

 The obvious result should be (2,2) and (5,2) ... (you can draw them if you
 don't believe me ...)

 Thanks,
 Wanda







Re: KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
I ran the example with ./bin/run-example SparkKMeans file.txt 2 0.001
I get this response:
Finished iteration (delta = 0.0)
Final centers:
DenseVector(2.8571428571428568, 2.0)
DenseVector(5.6005, 2.0)


The start point is not random. It uses the first K points from the given set


On Thursday, July 10, 2014 11:57 AM, Sean Owen so...@cloudera.com wrote:
 


I ran it, and your answer is exactly what I got.

import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.clustering._

val vectors = sc.parallelize(
  Array((2,1),(1,2),(3,2),(2,3),(4,1),(5,1),(6,1),(4,2),(6,2),(4,3),(5,3),(6,3))
    .map(p => Vectors.dense(Array[Double](p._1, p._2))))

val kmeans = new KMeans()
kmeans.setK(2)
val model = kmeans.run(vectors)

model.clusterCenters

res10: Array[org.apache.spark.mllib.linalg.Vector] = Array([5.0,2.0], [2.0,2.0])

You may be aware that k-means starts from a random set of centroids.
It's possible that your run picked one that leads to a suboptimal
clustering. This is all the easier on a toy example like this and you
can find examples on the internet. That said, I never saw any other
answer.

The standard approach is to run many times. Call kmeans.setRuns(10) or
something to try 10 times instead of once.


On Thu, Jul 10, 2014 at 9:44 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:
 Can someone please run the standard kMeans code on this input with 2 centers
 ?:
 2 1
 1 2
 3 2
 2 3
 4 1
 5 1
 6 1
 4 2
 6 2
 4 3
 5 3
 6 3

 The obvious result should be (2,2) and (5,2) ... (you can draw them if you
 don't believe me ...)

 Thanks,
 Wanda

Re: KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
I am running spark-1.0.0 with java 1.8

java version "1.8.0_05"
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)

which spark-shell
~/bench/spark-1.0.0/bin/spark-shell

which scala
~/bench/scala-2.10.4/bin/scala


On Thursday, July 10, 2014 12:46 PM, Tathagata Das 
tathagata.das1...@gmail.com wrote:
 


I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your 
dataset as well, I got the expected answer. And I believe that even though 
initialization is done using sampling, the example actually sets the seed to a 
constant 42, so the result should always be the same no matter how many times 
you run it. So I am not really sure what's going on here.

Can you tell us more about which version of Spark you are running? Which Java 
version? 


==

[tdas @ Xion spark2] cat input
2 1
1 2
3 2
2 3
4 1
5 1
6 1
4 2
6 2
4 3
5 3
6 3
[tdas @ Xion spark2] ./bin/run-example SparkKMeans input 2 0.001
2014-07-10 02:45:06.764 java[45244:d17] Unable to load realm info from 
SCDynamicStore
14/07/10 02:45:07 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
14/07/10 02:45:07 WARN LoadSnappy: Snappy native library not loaded
14/07/10 02:45:08 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeSystemBLAS
14/07/10 02:45:08 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeRefBLAS
Finished iteration (delta = 3.0)
Finished iteration (delta = 0.0)
Final centers:
DenseVector(5.0, 2.0)
DenseVector(2.0, 2.0)




On Thu, Jul 10, 2014 at 2:17 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:

so this is what I am running: 
./bin/run-example SparkKMeans ~/Documents/2dim2.txt 2 0.001


And this is the input file:
┌───[spark2013@SparkOne]──[~/spark-1.0.0].$
└───#!cat ~/Documents/2dim2.txt
2 1
1 2
3 2
2 3
4 1
5 1
6 1
4 2
6 2
4 3
5 3
6 3



This is the final output from spark:
14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: 
Getting 2 non-empty blocks out of 2 blocks
14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 
0 remote fetches in 0 ms
14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: 
maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 
2 non-empty blocks out of 2 blocks
14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 
0 remote fetches in 0 ms
14/07/10 20:05:12 INFO Executor: Serialized size of result for 14 is 1433
14/07/10 20:05:12 INFO Executor: Sending result for 14 directly to driver
14/07/10 20:05:12 INFO Executor: Finished task ID 14
14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 0)
14/07/10 20:05:12 INFO TaskSetManager: Finished TID 14 in 5 ms on localhost 
(progress: 1/2)
14/07/10 20:05:12 INFO Executor: Serialized size of result for 15 is 1433
14/07/10 20:05:12 INFO Executor: Sending result for 15 directly to driver
14/07/10 20:05:12 INFO Executor: Finished task ID 15
14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 1)
14/07/10 20:05:12 INFO TaskSetManager: Finished TID 15 in 7 ms on localhost 
(progress: 2/2)
14/07/10 20:05:12 INFO DAGScheduler: Stage 6 (collectAsMap at 
SparkKMeans.scala:75) finished in 0.008 s
14/07/10 20:05:12 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks 
have all completed, from pool
14/07/10 20:05:12 INFO SparkContext: Job finished: collectAsMap at 
SparkKMeans.scala:75, took 0.02472681 s
Finished iteration (delta = 0.0)
Final centers:
DenseVector(2.8571428571428568, 2.0)
DenseVector(5.6005, 2.0)








On Thursday, July 10, 2014 12:02 PM, Bertrand Dechoux decho...@gmail.com 
wrote:
 


A picture is worth a thousand... Well, a picture with this dataset, what you 
are expecting and what you get, would help answering your initial question.


Bertrand


On Thu, Jul 10, 2014 at 10:44 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:

Can someone please run the standard kMeans code on this input with 2 centers ?:
2 1
1 2
3 2
2 3
4 1
5 1
6 1
4 2
6 2
4 3
5 3
6 3


The obvious result should be (2,2) and (5,2) ... (you can draw them if you 
don't believe me ...)


Thanks, 
Wanda