The code is really simple:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object TestKMeans {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("Test KMeans")
      .setMaster("local[8]")
      .set("spark.executor.memory", "8g")
    val sc = new SparkContext(conf)

    val numClusters = 500
    val numIterations = 2

    // Each CSV line is one 384-dimensional point.
    val data = sc.textFile("sample.csv")
      .map(x => Vectors.dense(x.split(',').map(_.toDouble)))
    data.cache()

    val clusters = KMeans.train(data, numClusters, numIterations)
    println(clusters.clusterCenters.size)

    // WSSSE: within-set sum of squared errors.
    val wssse = clusters.computeCost(data)
    println(s"error : $wssse")
  }
}
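Note that the three-argument KMeans.train above leaves everything else at MLlib's defaults, including k-means|| initialization. If the initialization step is suspected, here is a sketch of the same training call with the initialization mode set explicitly (assuming the Spark 1.x builder-style KMeans API):

import org.apache.spark.mllib.clustering.KMeans

// Sketch: same data as above, but with random initialization instead of
// the default k-means|| (KMeans.K_MEANS_PARALLEL), to check how much of
// the runtime goes into initialization rather than the iterations.
val model = new KMeans()
  .setK(numClusters)
  .setMaxIterations(numIterations)
  .setInitializationMode(KMeans.RANDOM)
  .setRuns(1)
  .run(data)
println(s"error : ${model.computeCost(data)}")

With numIterations = 2, initialization can easily dominate the total runtime, so comparing the two modes may be telling.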
For testing purposes, I generated sample random data with Julia and stored it in a comma-delimited CSV file. The dimensions are 248000 x 384. In the target application, I will have more than 248k data points to cluster.

On Fri, Dec 5, 2014 at 6:03 PM, Davies Liu <dav...@databricks.com> wrote:
> Could you post your script to reproduce the results (and also how to
> generate the dataset)? That will help us investigate it.
>
> On Fri, Dec 5, 2014 at 8:40 AM, Jaonary Rabarisoa <jaon...@gmail.com>
> wrote:
> > Hmm, here I use Spark in local mode on my laptop with 8 cores. The data
> > is on my local filesystem. Even though there is an overhead due to the
> > distributed computation, I found the difference between the runtimes of
> > the two implementations really, really huge. Is there a benchmark on how
> > well the algorithm implemented in MLlib performs?
> >
> > On Fri, Dec 5, 2014 at 4:56 PM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> Spark has much more overhead, since it's set up to distribute the
> >> computation. Julia isn't distributed, and so has no such overhead in a
> >> completely in-core implementation. You generally use Spark when you
> >> have a problem large enough to warrant distributing, or when your data
> >> already lives in a distributed store like HDFS.
> >>
> >> But it's also possible you're not configuring the implementations the
> >> same way, yes. There's not enough info here really to say.
> >>
> >> On Fri, Dec 5, 2014 at 9:50 AM, Jaonary Rabarisoa <jaon...@gmail.com>
> >> wrote:
> >> > Hi all,
> >> >
> >> > I'm trying to run a clustering with the k-means algorithm. The size
> >> > of my data set is about 240k vectors of dimension 384.
> >> >
> >> > Solving the problem with the k-means available in Julia (kmeans++),
> >> >
> >> > http://clusteringjl.readthedocs.org/en/latest/kmeans.html
> >> >
> >> > takes about 8 minutes on a single core.
> >> >
> >> > Solving the same problem with Spark's kmeans|| takes more than 1.5
> >> > hours with 8 cores!
> >> >
> >> > Either they don't implement the same algorithm or I don't understand
> >> > how the k-means in Spark works. Is my data not big enough to take
> >> > full advantage of Spark? At the least, I expected the same runtime.
> >> >
> >> > Cheers,
> >> >
> >> > Jao
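Since the Julia generator script itself wasn't posted, here is a hypothetical stand-in in Scala that produces a file matching the description above (248000 rows of 384 comma-separated random doubles, written to sample.csv); it is a sketch for reproducing the setup, not the script actually used:

import java.io.PrintWriter
import scala.util.Random

object GenSample {
  def main(args: Array[String]) {
    // Write 248000 rows x 384 columns of uniform random doubles,
    // comma-delimited, to sample.csv (the path the test job reads).
    val out = new PrintWriter("sample.csv")
    val rng = new Random(0)
    for (_ <- 0 until 248000) {
      out.println(Array.fill(384)(rng.nextDouble).mkString(","))
    }
    out.close()
  }
}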