Re: KMeans code is rubbish
The problem is that I get the same results every time.

On Friday, July 11, 2014 7:22 PM, Ameet Talwalkar <atalwal...@gmail.com> wrote:
> [...]
Re: KMeans code is rubbish
I also took a look at spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/mllib/DenseKMeans.scala and ran the code in a shell. There is an issue here:

    val initMode = params.initializationMode match {
      case Random => KMeans.RANDOM
      case Parallel => KMeans.K_MEANS_PARALLEL
    }

If I use initMode = KMeans.RANDOM, everything is OK. If I use initMode = KMeans.K_MEANS_PARALLEL, I get a wrong result, and I do not know why. The example proposed is a really simple one that should not admit multiple solutions and should always converge to the correct one. Now, what can be altered in the original SparkKMeans.scala (the seed, or something else?) to get the correct result every single time?

On Thursday, July 10, 2014 7:58 PM, Xiangrui Meng <men...@gmail.com> wrote:
> [...]
Re: KMeans code is rubbish
Hi Wanda,

As Sean mentioned, k-means is not guaranteed to find an optimal answer, even for seemingly simple toy examples. A common heuristic for dealing with this is to run k-means multiple times and choose the best answer. You can do this by changing the runs parameter from the default value (1) to something larger (say 10).

-Ameet

On Fri, Jul 11, 2014 at 1:20 AM, Wanda Hawk <wanda_haw...@yahoo.com> wrote:
> [...]
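The multiple-runs heuristic Ameet describes can be sketched as follows. This is a hypothetical spark-shell session (Spark 1.x MLlib, `sc` already defined) using the 12-point dataset from this thread; `setRuns` is the setter Sean mentioned earlier.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.KMeans

// The 12 points from this thread, as MLlib dense vectors.
val points = Array((2,1),(1,2),(3,2),(2,3),(4,1),(5,1),(6,1),
                   (4,2),(6,2),(4,3),(5,3),(6,3))
val vectors = sc.parallelize(points.map(p => Vectors.dense(p._1.toDouble, p._2.toDouble)))

val kmeans = new KMeans()
kmeans.setK(2)
kmeans.setRuns(10)  // try 10 initializations instead of 1, keep the best
val model = kmeans.run(vectors)
model.clusterCenters.foreach(println)
```

With runs > 1, MLlib keeps the clustering with the lowest within-cluster cost among the runs, which makes ending up in a suboptimal local minimum much less likely on a dataset this small.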
KMeans code is rubbish
Can someone please run the standard k-means code on this input with 2 centers?

    2 1
    1 2
    3 2
    2 3
    4 1
    5 1
    6 1
    4 2
    6 2
    4 3
    5 3
    6 3

The obvious result should be (2,2) and (5,2) ... (you can draw them if you don't believe me ...)

Thanks,
Wanda
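For reference, the expected centers can be checked with a few lines of plain Scala. The split below (the left four points vs. the right eight) is a hand-drawn grouping for illustration, not anything the clustering code produces:

```scala
// The two clusters one would draw by eye for the 12 points above.
val left  = Seq((2,1),(1,2),(3,2),(2,3))
val right = Seq((4,1),(5,1),(6,1),(4,2),(6,2),(4,3),(5,3),(6,3))

// Centroid of a set of 2-D integer points.
def centroid(ps: Seq[(Int, Int)]): (Double, Double) =
  (ps.map(_._1).sum.toDouble / ps.size, ps.map(_._2).sum.toDouble / ps.size)

println(centroid(left))   // (2.0,2.0)
println(centroid(right))  // (5.0,2.0)
```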
Re: KMeans code is rubbish
I ran it, and your answer is exactly what I got.

    import org.apache.spark.mllib.linalg._
    import org.apache.spark.mllib.clustering._

    val vectors = sc.parallelize(
      Array((2,1),(1,2),(3,2),(2,3),(4,1),(5,1),(6,1),(4,2),(6,2),(4,3),(5,3),(6,3))
        .map(p => Vectors.dense(Array[Double](p._1, p._2))))

    val kmeans = new KMeans()
    kmeans.setK(2)
    val model = kmeans.run(vectors)
    model.clusterCenters

    res10: Array[org.apache.spark.mllib.linalg.Vector] = Array([5.0,2.0], [2.0,2.0])

You may be aware that k-means starts from a random set of centroids. It's possible that your run picked one that leads to a suboptimal clustering. This is all the easier on a toy example like this, and you can find examples of it on the internet. That said, I never saw any other answer. The standard approach is to run many times: call kmeans.setRuns(10) or something similar to try 10 times instead of once.

On Thu, Jul 10, 2014 at 9:44 AM, Wanda Hawk <wanda_haw...@yahoo.com> wrote:
> [...]
Re: KMeans code is rubbish
So this is what I am running:

    ./bin/run-example SparkKMeans ~/Documents/2dim2.txt 2 0.001

And this is the input file:

    $ cat ~/Documents/2dim2.txt
    2 1
    1 2
    3 2
    2 3
    4 1
    5 1
    6 1
    4 2
    6 2
    4 3
    5 3
    6 3

This is the final output from Spark:

    14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
    14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms
    14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
    14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
    14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms
    14/07/10 20:05:12 INFO Executor: Serialized size of result for 14 is 1433
    14/07/10 20:05:12 INFO Executor: Sending result for 14 directly to driver
    14/07/10 20:05:12 INFO Executor: Finished task ID 14
    14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 0)
    14/07/10 20:05:12 INFO TaskSetManager: Finished TID 14 in 5 ms on localhost (progress: 1/2)
    14/07/10 20:05:12 INFO Executor: Serialized size of result for 15 is 1433
    14/07/10 20:05:12 INFO Executor: Sending result for 15 directly to driver
    14/07/10 20:05:12 INFO Executor: Finished task ID 15
    14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 1)
    14/07/10 20:05:12 INFO TaskSetManager: Finished TID 15 in 7 ms on localhost (progress: 2/2)
    14/07/10 20:05:12 INFO DAGScheduler: Stage 6 (collectAsMap at SparkKMeans.scala:75) finished in 0.008 s
    14/07/10 20:05:12 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool
    14/07/10 20:05:12 INFO SparkContext: Job finished: collectAsMap at SparkKMeans.scala:75, took 0.02472681 s
    Finished iteration (delta = 0.0)
    Final centers:
    DenseVector(2.8571428571428568, 2.0)
    DenseVector(5.6005, 2.0)

On Thursday, July 10, 2014 12:02 PM, Bertrand Dechoux <decho...@gmail.com> wrote:
> A picture is worth a thousand... Well, a picture with this dataset, what you are expecting and what you get, would help answering your initial question.
>
> Bertrand
>
> On Thu, Jul 10, 2014 at 10:44 AM, Wanda Hawk <wanda_haw...@yahoo.com> wrote:
> > [...]
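The two centers Wanda reports (roughly (2.857, 2.0) and (5.6, 2.0)) are in fact a stable fixed point of Lloyd's algorithm: one more assignment/update step from those centers reproduces them unchanged, which is consistent with the run reporting "Finished iteration (delta = 0.0)". A plain-Scala check (no Spark needed; the single Lloyd step here is my own minimal reimplementation, not the SparkKMeans code):

```scala
// Wanda's 12 points, as doubles.
val points = Seq((2.0,1.0),(1.0,2.0),(3.0,2.0),(2.0,3.0),(4.0,1.0),(5.0,1.0),
                 (6.0,1.0),(4.0,2.0),(6.0,2.0),(4.0,3.0),(5.0,3.0),(6.0,3.0))

// Squared Euclidean distance between two 2-D points.
def dist2(a: (Double, Double), b: (Double, Double)): Double = {
  val (dx, dy) = (a._1 - b._1, a._2 - b._2)
  dx * dx + dy * dy
}

// One Lloyd iteration: assign each point to its nearest center, recompute means.
def step(centers: Seq[(Double, Double)]): Seq[(Double, Double)] =
  points.groupBy(p => centers.minBy(c => dist2(p, c)))
        .values.map { ps =>
          (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)
        }.toSeq

// The suboptimal centers from the bad run: (20/7, 2.0) and (5.6, 2.0).
val bad = Seq((20.0 / 7, 2.0), (5.6, 2.0))
println(step(bad).toSet == bad.toSet)  // true: these centers do not move
```

So once the iteration reaches this split (the three x=4 points grouped with the left cluster), it stays there; only a different initialization escapes it.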
Re: KMeans code is rubbish
I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your dataset as well, and I got the expected answer. And I believe that even though initialization is done using sampling, the example actually sets the seed to a constant 42, so the result should always be the same no matter how many times you run it. So I am not really sure what's going on here. Can you tell us more about which version of Spark you are running? Which Java version?

==

    [tdas @ Xion spark2] cat input
    2 1
    1 2
    3 2
    2 3
    4 1
    5 1
    6 1
    4 2
    6 2
    4 3
    5 3
    6 3
    [tdas @ Xion spark2] ./bin/run-example SparkKMeans input 2 0.001
    2014-07-10 02:45:06.764 java[45244:d17] Unable to load realm info from SCDynamicStore
    14/07/10 02:45:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    14/07/10 02:45:07 WARN LoadSnappy: Snappy native library not loaded
    14/07/10 02:45:08 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
    14/07/10 02:45:08 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
    Finished iteration (delta = 3.0)
    Finished iteration (delta = 0.0)
    Final centers:
    DenseVector(5.0, 2.0)
    DenseVector(2.0, 2.0)

On Thu, Jul 10, 2014 at 2:17 AM, Wanda Hawk <wanda_haw...@yahoo.com> wrote:
> [...]
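TDas's point about the constant seed can be illustrated with plain Scala: two `scala.util.Random` instances built with the same seed produce identical sequences, so sampling initial centers with a fixed seed (42) should pick the same starting centers on every run. This is an illustrative sketch, not the SparkKMeans code itself:

```scala
import scala.util.Random

// Two generators with the same fixed seed yield identical samples.
val r1 = new Random(42)
val r2 = new Random(42)
val a = Seq.fill(5)(r1.nextInt(100))
val b = Seq.fill(5)(r2.nextInt(100))
println(a == b)  // true: seeded sampling is reproducible across runs
```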
Re: KMeans code is rubbish
SparkKMeans is a naive implementation. Please use mllib.clustering.KMeans in practice. I created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-2434

-Xiangrui

On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
> [...]
Re: KMeans code is rubbish
I ran the example with:

    ./bin/run-example SparkKMeans file.txt 2 0.001

I get this response:

    Finished iteration (delta = 0.0)
    Final centers:
    DenseVector(2.8571428571428568, 2.0)
    DenseVector(5.6005, 2.0)

The start point is not random. It uses the first K points from the given set.

On Thursday, July 10, 2014 11:57 AM, Sean Owen <so...@cloudera.com> wrote:
> [...]
Re: KMeans code is rubbish
I am running spark-1.0.0 with Java 1.8:

    java version "1.8.0_05"
    Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
    Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)

    $ which spark-shell
    ~/bench/spark-1.0.0/bin/spark-shell
    $ which scala
    ~/bench/scala-2.10.4/bin/scala

On Thursday, July 10, 2014 12:46 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
> [...]