You just need to use the latest master code, without any configuration, to get the performance improvement from my PR.
Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, Dec 8, 2014 at 7:53 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
> After some investigation, I learned that I can't compare kmeans in MLlib
> with another kmeans implementation directly. The kmeans|| initialization
> step takes more time than the whole algorithm implemented in Julia, for
> example. There is also the ability to run multiple runs of the kmeans
> algorithm in MLlib, even though by default the number of runs is 1.
>
> DB Tsai, can you please tell me the configuration you used for the
> improvement you mention in your pull request? I'd like to run the same
> benchmark on mnist8m on my computer.
>
> Cheers,
>
> On Fri, Dec 5, 2014 at 10:34 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>> Also, are you using the latest master in this experiment? A PR merged
>> into master a couple of days ago speeds up k-means about three times.
>> See
>>
>> https://github.com/apache/spark/commit/7fc49ed91168999d24ae7b4cc46fbb4ec87febc1
>>
>> Sincerely,
>>
>> DB Tsai
>> -------------------------------------------------------
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>> On Fri, Dec 5, 2014 at 9:36 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
>>> The code is really simple:
>>>
>>> import org.apache.spark.{SparkConf, SparkContext}
>>> import org.apache.spark.mllib.clustering.KMeans
>>> import org.apache.spark.mllib.linalg.Vectors
>>>
>>> object TestKMeans {
>>>
>>>   def main(args: Array[String]) {
>>>
>>>     val conf = new SparkConf()
>>>       .setAppName("Test KMeans")
>>>       .setMaster("local[8]")
>>>       .set("spark.executor.memory", "8g")
>>>
>>>     val sc = new SparkContext(conf)
>>>
>>>     val numClusters = 500
>>>     val numIterations = 2
>>>
>>>     // Parse each comma-delimited line into a dense vector and cache it.
>>>     val data = sc.textFile("sample.csv")
>>>       .map(x => Vectors.dense(x.split(',').map(_.toDouble)))
>>>     data.cache()
>>>
>>>     val clusters = KMeans.train(data, numClusters, numIterations)
>>>
>>>     println(clusters.clusterCenters.size)
>>>
>>>     // Within-set sum of squared errors.
>>>     val wssse = clusters.computeCost(data)
>>>     println(s"error : $wssse")
>>>   }
>>> }
>>>
>>> For testing purposes, I generated sample random data with Julia and
>>> stored it in a comma-delimited CSV file. The dimensions are 248000 x 384.
>>>
>>> In the target application, I will have more than 248k vectors to cluster.
>>>
>>> On Fri, Dec 5, 2014 at 6:03 PM, Davies Liu <dav...@databricks.com> wrote:
>>>> Could you post your script to reproduce the results (and also how you
>>>> generate the dataset)? That will help us investigate it.
>>>>
>>>> On Fri, Dec 5, 2014 at 8:40 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
>>>>> Hmm, here I use Spark in local mode on my laptop with 8 cores. The
>>>>> data is on my local filesystem. Even though there is an overhead due
>>>>> to the distributed computation, I find the difference between the
>>>>> runtimes of the two implementations really, really huge. Is there a
>>>>> benchmark on how well the algorithm implemented in MLlib performs?
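To compare like with like, the initialization mode and the number of runs discussed above can be set explicitly on the MLlib side. Below is a minimal sketch against the Spark 1.2-era MLlib builder API; the helper name trainSingleStart is made up for illustration, and note that MLlib has no exact kmeans++ mode, so KMeans.RANDOM is used here simply to take the kmeans|| initialization passes out of the timing:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Single run with plain random initialization, so the measured time
    // covers Lloyd's iterations rather than the kmeans|| init step.
    def trainSingleStart(data: RDD[Vector], k: Int, maxIter: Int) =
      new KMeans()
        .setK(k)
        .setMaxIterations(maxIter)
        .setRuns(1)                            // the default, made explicit
        .setInitializationMode(KMeans.RANDOM)  // vs. KMeans.K_MEANS_PARALLEL
        .run(data)

Timing this against the default KMeans.train call should show how much of the reported runtime is initialization versus iteration.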
>>>>> On Fri, Dec 5, 2014 at 4:56 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>>> Spark has much more overhead, since it's set up to distribute the
>>>>>> computation. Julia isn't distributed, and so has no such overhead in
>>>>>> a completely in-core implementation. You generally use Spark when you
>>>>>> have a problem large enough to warrant distributing, or your data
>>>>>> already lives in a distributed store like HDFS.
>>>>>>
>>>>>> But it's also possible you're not configuring the two implementations
>>>>>> the same way, yes. There's not enough info here really to say.
>>>>>>
>>>>>> On Fri, Dec 5, 2014 at 9:50 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'm trying to run clustering with the kmeans algorithm. My data set
>>>>>>> is about 240k vectors of dimension 384.
>>>>>>>
>>>>>>> Solving the problem with the kmeans available in Julia (kmeans++),
>>>>>>>
>>>>>>> http://clusteringjl.readthedocs.org/en/latest/kmeans.html
>>>>>>>
>>>>>>> takes about 8 minutes on a single core.
>>>>>>>
>>>>>>> Solving the same problem with Spark's kmeans|| takes more than 1.5
>>>>>>> hours with 8 cores!
>>>>>>>
>>>>>>> Either they don't implement the same algorithm, or I don't
>>>>>>> understand how the kmeans in Spark works. Is my data not big enough
>>>>>>> to take full advantage of Spark? At the least, I expected the same
>>>>>>> runtime.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Jao
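Davies's request above, how the dataset is generated, can be answered with a small generator. The original data was produced in Julia; the sketch below is an assumed Scala equivalent that writes uniform random values in the 248000 x 384 shape described in the thread (the file name sample.csv matches the test program; the seed and the uniform distribution are arbitrary choices):

    import java.io.PrintWriter
    import scala.util.Random

    // Writes an nRows x nCols matrix of uniform doubles in [0, 1) as a
    // comma-delimited CSV, one vector per line.
    object GenerateSample {
      def main(args: Array[String]): Unit = {
        val nRows = 248000
        val nCols = 384
        val rng   = new Random(42)  // fixed seed for reproducible runs
        val out   = new PrintWriter("sample.csv")
        try {
          for (_ <- 0 until nRows)
            out.println(Array.fill(nCols)(rng.nextDouble()).mkString(","))
        } finally {
          out.close()
        }
      }
    }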