I also took a look at
spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/mllib/DenseKMeans.scala
and ran the code in a shell.

There is an issue here:
"
    val initMode = params.initializationMode match {
      case Random => KMeans.RANDOM
      case Parallel => KMeans.K_MEANS_PARALLEL
    }
"

If I use initMode=KMeans.RANDOM everything is fine.
If I use initMode=KMeans.K_MEANS_PARALLEL I get a wrong result, and I do not
know why. The proposed example is a really simple one that should not admit
multiple solutions and should always converge to the correct one.

Now, what can be altered in the original SparkKMeans.scala (the seed or
something else?) to get the correct result every single time?
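One way to get a stable answer regardless of the initialization draw is to let mllib keep the best of several runs; a sketch assuming the Spark 1.0 API, where points is an RDD[Vector] of the parsed input:

    // Run k-means 10 times from different random starts and keep the
    // clustering with the lowest cost, instead of trusting a single draw.
    val model = new KMeans()
      .setK(2)
      .setRuns(10)
      .setMaxIterations(20)
      .run(points)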
On Thursday, July 10, 2014 7:58 PM, Xiangrui Meng <men...@gmail.com> wrote:

SparkKMeans is a naive implementation. Please use
mllib.clustering.KMeans in practice. I created a JIRA for this:
https://issues.apache.org/jira/browse/SPARK-2434 -Xiangrui
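A minimal invocation of the mllib version on the input from this thread (a sketch assuming the Spark 1.0.0 API; the file name and iteration count are placeholders):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Parse "x y" lines into dense vectors, then cluster with k = 2.
    val points = sc.textFile("input")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()
    val model = KMeans.train(points, 2, 20)   // k = 2, up to 20 iterations
    model.clusterCenters.foreach(println)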


On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das
<tathagata.das1...@gmail.com> wrote:
> I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your
> dataset as well, and I got the expected answer. And I believe that even though
> initialization is done using sampling, the example actually sets the seed to
> a constant 42, so the result should always be the same no matter how many
> times you run it. So I am not really sure what's going on here.
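>
> (The line in question, paraphrased from SparkKMeans.scala as a sketch rather
> than verbatim, samples the K starting centers with the constant seed 42, so
> repeated runs on the same input should start from the same points:
>     val kPoints = data.takeSample(withReplacement = false, K, 42)
> )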
>
> Can you tell us more about which version of Spark you are running? Which
> Java version?
>
>
> ======================================
>
> [tdas @ Xion spark2] cat input
> 2 1
> 1 2
> 3 2
> 2 3
> 4 1
> 5 1
> 6 1
> 4 2
> 6 2
> 4 3
> 5 3
> 6 3
> [tdas @ Xion spark2] ./bin/run-example SparkKMeans input 2 0.001
> 2014-07-10 02:45:06.764 java[45244:d17] Unable to load realm info from
> SCDynamicStore
> 14/07/10 02:45:07 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 14/07/10 02:45:07 WARN LoadSnappy: Snappy native library not loaded
> 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from:
> com.github.fommil.netlib.NativeSystemBLAS
> 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from:
> com.github.fommil.netlib.NativeRefBLAS
> Finished iteration (delta = 3.0)
> Finished iteration (delta = 0.0)
> Final centers:
> DenseVector(5.0, 2.0)
> DenseVector(2.0, 2.0)
>
> On Thu, Jul 10, 2014 at 2:17 AM, Wanda Hawk <wanda_haw...@yahoo.com> wrote:
>>
>> so this is what I am running:
>> "./bin/run-example SparkKMeans
 ~/Documents/2dim2.txt 2 0.001"
>>
>> And this is the input file:"
>> ┌───[spark2013@SparkOne]──────[~/spark-1.0.0].$
>> └───#!cat ~/Documents/2dim2.txt
>> 2 1
>> 1 2
>> 3 2
>> 2 3
>> 4 1
>> 5 1
>> 6 1
>> 4 2
>> 6 2
>> 4 3
>> 5 3
>> 6 3
>> "
>>
>> This is the final output from spark:
>> "14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>> Getting 2 non-empty blocks out of 2 blocks
>> 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>> Started 0 remote fetches in 0 ms
>> 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>> maxBytesInFlight: 50331648, targetRequestSize: 10066329
>> 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>> Getting 2 non-empty blocks out of 2 blocks
>> 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>> Started 0 remote fetches in 0 ms
>> 14/07/10 20:05:12 INFO Executor: Serialized size of result for 14 is 1433
>> 14/07/10 20:05:12 INFO Executor: Sending result for 14 directly to driver
>> 14/07/10 20:05:12 INFO Executor: Finished task ID 14
>> 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 0)
>> 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 14 in 5 ms on
>> localhost (progress: 1/2)
>> 14/07/10 20:05:12 INFO Executor: Serialized size of result for 15 is 1433
>> 14/07/10 20:05:12 INFO Executor: Sending result for 15 directly to driver
>> 14/07/10 20:05:12 INFO Executor: Finished task ID 15
>> 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 1)
>> 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 15 in 7 ms on
>> localhost (progress: 2/2)
>> 14/07/10 20:05:12 INFO DAGScheduler: Stage 6 (collectAsMap at
>> SparkKMeans.scala:75) finished in 0.008 s
>> 14/07/10 20:05:12 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks
>> have all completed, from pool
>> 14/07/10 20:05:12 INFO SparkContext: Job finished: collectAsMap at
>> SparkKMeans.scala:75, took 0.02472681 s
>> Finished iteration (delta = 0.0)
>> Final centers:
>> DenseVector(2.8571428571428568, 2.0)
>> DenseVector(5.6000000000000005, 2.0)
>> "
>>
>> On Thursday, July 10, 2014 12:02 PM, Bertrand Dechoux <decho...@gmail.com>
>> wrote:
>>
>>
>> A picture is worth a thousand... Well, a picture with this dataset, what
>> you are expecting, and what you get would help answer your initial
>> question.
>>
>> Bertrand
>>
>>
>> On Thu, Jul 10, 2014 at 10:44 AM, Wanda Hawk <wanda_haw...@yahoo.com>
>> wrote:
>>
>> Can someone please run the standard kMeans code on this input with 2
>> centers ?:
>> 2 1
>> 1 2
>> 3 2
>> 2 3
>> 4 1
>> 5 1
>> 6 1
>> 4 2
>> 6 2
>> 4 3
>> 5 3
>> 6 3
>>
>> The obvious result should be (2,2) and (5,2) ... (you can draw them if you
>> don't believe me ...)
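>>
>> (Check: the four left points average to ((2+1+3+2)/4, (1+2+2+3)/4) = (2, 2),
>> and the eight right points average to
>> ((4+5+6+4+6+4+5+6)/8, (1+1+1+2+2+3+3+3)/8) = (5, 2).)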
>>
>> Thanks,
>> Wanda
>>
>
