Hi Wanda,

As Sean mentioned, K-means is not guaranteed to find an optimal answer,
even for seemingly simple toy examples. A common heuristic for dealing with
this issue is to run K-means multiple times and choose the best answer. You
can do this by changing the runs parameter from its default value (1) to
something larger (say, 10).
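
For example, with mllib.clustering.KMeans (a minimal sketch; it assumes `sc`
is an existing SparkContext and that "input" holds one whitespace-separated
point per line; adjust the path and parameters to your data):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.textFile("input")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()

    val model = new KMeans()
      .setK(2)
      .setMaxIterations(20)
      .setRuns(10)  // run k-means 10 times, keep the lowest-cost clustering
      .run(data)

    model.clusterCenters.foreach(println)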

-Ameet


On Fri, Jul 11, 2014 at 1:20 AM, Wanda Hawk <wanda_haw...@yahoo.com> wrote:

> I also took a look at
> spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/mllib/DenseKMeans.scala
> and ran the code in a shell.
>
> There is an issue here:
> "    val initMode = params.initializationMode match {
>       case Random => KMeans.RANDOM
>       case Parallel => KMeans.K_MEANS_PARALLEL
>     }
> "
>
> If I use initMode=KMeans.RANDOM, everything is OK.
> If I use initMode=KMeans.K_MEANS_PARALLEL, I get a wrong result, and I do
> not know why. The proposed example is a really simple one that should not
> admit multiple solutions, and should always converge to the correct one.
>
> Now, what can be altered in the original SparkKMeans.scala (the seed, or
> something else?) to get the correct result every single time?
>    On Thursday, July 10, 2014 7:58 PM, Xiangrui Meng <men...@gmail.com>
> wrote:
>
>
> SparkKMeans is a naive implementation. Please use
> mllib.clustering.KMeans in practice. I created a JIRA for this:
> https://issues.apache.org/jira/browse/SPARK-2434 -Xiangrui
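>
> A minimal sketch of driving the mllib version, in case it helps (this
> assumes `data` is an RDD[Vector], e.g. built the way DenseKMeans builds it;
> the k and iteration values are illustrative):
>
>     import org.apache.spark.mllib.clustering.KMeans
>     // train with k = 2 clusters and at most 20 iterations
>     val model = KMeans.train(data, 2, 20)
>     model.clusterCenters.foreach(println)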
>
> On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das
> <tathagata.das1...@gmail.com> wrote:
> > I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with
> > your dataset as well, and I got the expected answer. And I believe that
> > even though initialization is done using sampling, the example actually
> > sets the seed to a constant 42, so the result should always be the same no
> > matter how many times you run it. So I am not really sure what's going on
> > here.
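> >
> > (For reference, the sampling line in SparkKMeans.scala reads roughly like
> > this; I am quoting from memory, so double-check against your copy:
> >
> >     val kPoints = data.takeSample(withReplacement = false, K, 42).toArray
> >
> > The trailing 42 is the fixed seed.)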
> >
> > Can you tell us more about which version of Spark you are running? Which
> > Java version?
> >
> >
> > ======================================
> >
> > [tdas @ Xion spark2] cat input
> > 2 1
> > 1 2
> > 3 2
> > 2 3
> > 4 1
> > 5 1
> > 6 1
> > 4 2
> > 6 2
> > 4 3
> > 5 3
> > 6 3
> > [tdas @ Xion spark2] ./bin/run-example SparkKMeans input 2 0.001
> > 2014-07-10 02:45:06.764 java[45244:d17] Unable to load realm info from
> > SCDynamicStore
> > 14/07/10 02:45:07 WARN NativeCodeLoader: Unable to load native-hadoop
> > library for your platform... using builtin-java classes where applicable
> > 14/07/10 02:45:07 WARN LoadSnappy: Snappy native library not loaded
> > 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from:
> > com.github.fommil.netlib.NativeSystemBLAS
> > 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from:
> > com.github.fommil.netlib.NativeRefBLAS
> > Finished iteration (delta = 3.0)
> > Finished iteration (delta = 0.0)
> > Final centers:
> > DenseVector(5.0, 2.0)
> > DenseVector(2.0, 2.0)
> >
> >
> >
> > On Thu, Jul 10, 2014 at 2:17 AM, Wanda Hawk <wanda_haw...@yahoo.com>
> > wrote:
> >>
> >> so this is what I am running:
> >> "./bin/run-example SparkKMeans ~/Documents/2dim2.txt 2 0.001"
> >>
> >> And this is the input file:"
> >> ┌───[spark2013@SparkOne]──────[~/spark-1.0.0].$
> >> └───#!cat ~/Documents/2dim2.txt
> >> 2 1
> >> 1 2
> >> 3 2
> >> 2 3
> >> 4 1
> >> 5 1
> >> 6 1
> >> 4 2
> >> 6 2
> >> 4 3
> >> 5 3
> >> 6 3
> >> "
> >>
> >> This is the final output from spark:
> >> "14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
> >> Getting 2 non-empty blocks out of 2 blocks
> >> 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
> >> Started 0 remote fetches in 0 ms
> >> 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
> >> maxBytesInFlight: 50331648, targetRequestSize: 10066329
> >> 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
> >> Getting 2 non-empty blocks out of 2 blocks
> >> 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
> >> Started 0 remote fetches in 0 ms
> >> 14/07/10 20:05:12 INFO Executor: Serialized size of result for 14 is 1433
> >> 14/07/10 20:05:12 INFO Executor: Sending result for 14 directly to driver
> >> 14/07/10 20:05:12 INFO Executor: Finished task ID 14
> >> 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 0)
> >> 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 14 in 5 ms on
> >> localhost (progress: 1/2)
> >> 14/07/10 20:05:12 INFO Executor: Serialized size of result for 15 is 1433
> >> 14/07/10 20:05:12 INFO Executor: Sending result for 15 directly to driver
> >> 14/07/10 20:05:12 INFO Executor: Finished task ID 15
> >> 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 1)
> >> 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 15 in 7 ms on
> >> localhost (progress: 2/2)
> >> 14/07/10 20:05:12 INFO DAGScheduler: Stage 6 (collectAsMap at
> >> SparkKMeans.scala:75) finished in 0.008 s
> >> 14/07/10 20:05:12 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose
> >> tasks have all completed, from pool
> >> 14/07/10 20:05:12 INFO SparkContext: Job finished: collectAsMap at
> >> SparkKMeans.scala:75, took 0.02472681 s
> >> Finished iteration (delta = 0.0)
> >> Final centers:
> >> DenseVector(2.8571428571428568, 2.0)
> >> DenseVector(5.6000000000000005, 2.0)
> >> "
> >>
> >>
> >>
> >>
> >> On Thursday, July 10, 2014 12:02 PM, Bertrand Dechoux
> >> <decho...@gmail.com> wrote:
> >>
> >>
> >> A picture is worth a thousand... Well, a picture with this dataset,
> >> showing what you are expecting and what you get, would help answer your
> >> initial question.
> >>
> >> Bertrand
> >>
> >>
> >> On Thu, Jul 10, 2014 at 10:44 AM, Wanda Hawk <wanda_haw...@yahoo.com>
> >> wrote:
> >>
> >> Can someone please run the standard k-means code, with 2 centers, on
> >> this input:
> >> 2 1
> >> 1 2
> >> 3 2
> >> 2 3
> >> 4 1
> >> 5 1
> >> 6 1
> >> 4 2
> >> 6 2
> >> 4 3
> >> 5 3
> >> 6 3
> >>
> >> The obvious result should be (2,2) and (5,2) ... (you can draw them if
> >> you don't believe me ...)
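> >>
> >> If you want to sanity-check that by hand, the two obvious clusters
> >> average out exactly (a quick plain-Scala check, no Spark needed):
> >>
> >>     val left  = Seq((2,1), (1,2), (3,2), (2,3))
> >>     val right = Seq((4,1), (5,1), (6,1), (4,2), (6,2), (4,3), (5,3), (6,3))
> >>     // mean of a list of integer points, as a pair of doubles
> >>     def mean(ps: Seq[(Int, Int)]) =
> >>       (ps.map(_._1).sum.toDouble / ps.size, ps.map(_._2).sum.toDouble / ps.size)
> >>     println(mean(left))   // (2.0,2.0)
> >>     println(mean(right))  // (5.0,2.0)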
> >>
> >> Thanks,
> >> Wanda
> >>
> >>
> >>
> >>
> >
>
>
>
