The code is really simple:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object TestKMeans {

  def main(args: Array[String]) {

    val conf = new SparkConf()
      .setAppName("Test KMeans")
      .setMaster("local[8]")
      .set("spark.executor.memory", "8g")

    val sc = new SparkContext(conf)

    val numClusters = 500
    val numIterations = 2

    // Each line of sample.csv is a comma-separated vector of doubles.
    val data = sc.textFile("sample.csv")
      .map(x => Vectors.dense(x.split(',').map(_.toDouble)))
    data.cache()

    val clusters = KMeans.train(data, numClusters, numIterations)

    // Number of cluster centers found.
    println(clusters.clusterCenters.size)

    // Within-set sum of squared errors.
    val wssse = clusters.computeCost(data)
    println(s"error : $wssse")
  }
}
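
A variant that might help isolate where the time goes (a sketch against the MLlib builder API, reusing the same data and parameters): switching the initialization mode from the default k-means|| to random initialization.

import org.apache.spark.mllib.clustering.KMeans

// Same RDD and parameters as above; only the initialization mode changes.
val model = new KMeans()
  .setK(numClusters)
  .setMaxIterations(numIterations)
  .setInitializationMode(KMeans.RANDOM) // default is KMeans.K_MEANS_PARALLEL
  .run(data)

println(s"error : ${model.computeCost(data)}")

If this runs much faster, most of the cost is in the k-means|| seeding step rather than in the clustering iterations themselves.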


For testing purposes, I generated sample random data with Julia and stored
it in a comma-delimited CSV file. The dimensions are 248000 x 384.
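
The generation itself was just a quick Julia script. A roughly equivalent sketch in Scala (the file name and uniform random values are assumptions for illustration):

import java.io.PrintWriter
import scala.util.Random

// Write 248000 rows of 384 random doubles as a comma-delimited CSV.
val out = new PrintWriter("sample.csv")
try {
  for (_ <- 0 until 248000) {
    out.println(Array.fill(384)(Random.nextDouble()).mkString(","))
  }
} finally {
  out.close()
}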

In the target application, I will have more than 248k vectors to cluster.


On Fri, Dec 5, 2014 at 6:03 PM, Davies Liu <dav...@databricks.com> wrote:

> Could you post your script to reproduce the results (also how to
> generate the dataset)? That will help us to investigate it.
>
> On Fri, Dec 5, 2014 at 8:40 AM, Jaonary Rabarisoa <jaon...@gmail.com>
> wrote:
> > Hmm, here I use Spark in local mode on my laptop with 8 cores. The data
> > is on my local filesystem. Even though there is an overhead due to the
> > distributed computation, I found the difference between the runtimes of
> > the two implementations really, really huge. Is there a benchmark on how
> > well the algorithm implemented in MLlib performs?
> >
> > On Fri, Dec 5, 2014 at 4:56 PM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> Spark has much more overhead, since it's set up to distribute the
> >> computation. Julia isn't distributed, and so has no such overhead in a
> >> completely in-core implementation. You generally use Spark when you
> >> have a problem large enough to warrant distributing, or your data
> >> already lives in a distributed store like HDFS.
> >>
> >> But it's also possible you're not configuring the implementations the
> >> same way, yes. There's not enough info here really to say.
> >>
> >> On Fri, Dec 5, 2014 at 9:50 AM, Jaonary Rabarisoa <jaon...@gmail.com>
> >> wrote:
> >> > Hi all,
> >> >
> >> > I'm trying to run clustering with the kmeans algorithm. The size of my
> >> > data set is about 240k vectors of dimension 384.
> >> >
> >> > Solving the problem with the kmeans available in Julia (kmeans++)
> >> >
> >> > http://clusteringjl.readthedocs.org/en/latest/kmeans.html
> >> >
> >> > takes about 8 minutes on a single core.
> >> >
> >> > Solving the same problem with Spark's kmeans|| takes more than 1.5
> >> > hours with 8 cores!
> >> >
> >> > Either they don't implement the same algorithm, or I don't understand
> >> > how the kmeans in Spark works. Is my data not big enough to take full
> >> > advantage of Spark? At least, I expected the same runtime.
> >> >
> >> >
> >> > Cheers,
> >> >
> >> >
> >> > Jao
> >
> >
>
