Please check the number of partitions after sc.textFile. Use sc.textFile('...', 8) to have at least 8 partitions. -Xiangrui
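Xiangrui's tip can be checked directly: the second argument to `sc.textFile` is `minPartitions`, and `rdd.partitions.length` reports the actual partition count. A minimal sketch, not from the original messages, assuming the same `local[8]` setup and `sample.csv` file used later in this thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("PartitionCheck").setMaster("local[8]"))

    // Default: the partition count follows the input splits, which can
    // be as low as 1-2 for a small local file, leaving most cores idle.
    val data = sc.textFile("sample.csv")
    println(s"default partitions: ${data.partitions.length}")

    // Passing minPartitions = 8 gives at least 8 partitions, so all
    // 8 local cores can share the per-iteration k-means work.
    val data8 = sc.textFile("sample.csv", 8)
    println(s"with minPartitions = 8: ${data8.partitions.length}")

    sc.stop()
  }
}
```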
On Tue, Dec 9, 2014 at 4:58 AM, DB Tsai <dbt...@dbtsai.com> wrote:
> You just need to use the latest master code, without any configuration,
> to get the performance improvement from my PR.
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Mon, Dec 8, 2014 at 7:53 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
>> After some investigation, I learned that I can't compare the k-means in
>> MLlib with another k-means implementation directly. The kmeans||
>> initialization step takes more time than the algorithm implemented in
>> Julia, for example. MLlib can also perform multiple runs of the k-means
>> algorithm, although by default the number of runs is 1.
>>
>> DB Tsai, can you please tell me the configuration you used for the
>> improvement you mention in your pull request? I'd like to run the same
>> benchmark on mnist8m on my computer.
>>
>> Cheers,
>>
>> On Fri, Dec 5, 2014 at 10:34 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>>
>>> Also, are you using the latest master in this experiment? A PR merged
>>> into master a couple of days ago will speed up the k-means three times.
>>> See
>>>
>>> https://github.com/apache/spark/commit/7fc49ed91168999d24ae7b4cc46fbb4ec87febc1
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> -------------------------------------------------------
>>> My Blog: https://www.dbtsai.com
>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>
>>>
>>> On Fri, Dec 5, 2014 at 9:36 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
>>> > The code is really simple (imports added for completeness):
>>> >
>>> > import org.apache.spark.{SparkConf, SparkContext}
>>> > import org.apache.spark.mllib.clustering.KMeans
>>> > import org.apache.spark.mllib.linalg.Vectors
>>> >
>>> > object TestKMeans {
>>> >
>>> >   def main(args: Array[String]) {
>>> >
>>> >     val conf = new SparkConf()
>>> >       .setAppName("Test KMeans")
>>> >       .setMaster("local[8]")
>>> >       .set("spark.executor.memory", "8g")
>>> >
>>> >     val sc = new SparkContext(conf)
>>> >
>>> >     val numClusters = 500
>>> >     val numIterations = 2
>>> >
>>> >     val data = sc.textFile("sample.csv").map(x =>
>>> >       Vectors.dense(x.split(',').map(_.toDouble)))
>>> >     data.cache()
>>> >
>>> >     val clusters = KMeans.train(data, numClusters, numIterations)
>>> >
>>> >     println(clusters.clusterCenters.size)
>>> >
>>> >     val wssse = clusters.computeCost(data)
>>> >     println(s"error : $wssse")
>>> >   }
>>> > }
>>> >
>>> > For testing purposes, I generated random sample data with Julia and
>>> > stored it in a comma-delimited CSV file. The dimensions are
>>> > 248000 x 384.
>>> >
>>> > In the target application, I will have more than 248k data points to
>>> > cluster.
>>> >
>>> > On Fri, Dec 5, 2014 at 6:03 PM, Davies Liu <dav...@databricks.com> wrote:
>>> >>
>>> >> Could you post your script to reproduce the results (and also how
>>> >> you generate the dataset)? That will help us investigate it.
>>> >>
>>> >> On Fri, Dec 5, 2014 at 8:40 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
>>> >> > Hmm, here I use Spark in local mode on my laptop with 8 cores.
>>> >> > The data is on my local filesystem.
>>> >> > Even though there is an overhead due to the distributed
>>> >> > computation, I found the difference between the runtimes of the
>>> >> > two implementations really, really huge. Is there a benchmark of
>>> >> > how well the algorithm implemented in MLlib performs?
>>> >> >
>>> >> > On Fri, Dec 5, 2014 at 4:56 PM, Sean Owen <so...@cloudera.com> wrote:
>>> >> >>
>>> >> >> Spark has much more overhead, since it's set up to distribute the
>>> >> >> computation. Julia isn't distributed, and so has no such overhead
>>> >> >> in a completely in-core implementation. You generally use Spark
>>> >> >> when you have a problem large enough to warrant distributing, or
>>> >> >> when your data already lives in a distributed store like HDFS.
>>> >> >>
>>> >> >> But it's also possible you're not configuring the implementations
>>> >> >> the same way, yes. There's not enough info here really to say.
>>> >> >>
>>> >> >> On Fri, Dec 5, 2014 at 9:50 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
>>> >> >> > Hi all,
>>> >> >> >
>>> >> >> > I'm trying to run clustering with the k-means algorithm. My
>>> >> >> > data set is about 240k vectors of dimension 384.
>>> >> >> >
>>> >> >> > Solving the problem with the k-means available in Julia
>>> >> >> > (kmeans++),
>>> >> >> >
>>> >> >> > http://clusteringjl.readthedocs.org/en/latest/kmeans.html
>>> >> >> >
>>> >> >> > takes about 8 minutes on a single core.
>>> >> >> >
>>> >> >> > Solving the same problem with Spark's kmeans|| takes more than
>>> >> >> > 1.5 hours with 8 cores!
>>> >> >> >
>>> >> >> > Either they don't implement the same algorithm, or I don't
>>> >> >> > understand how the k-means in Spark works. Is my data not big
>>> >> >> > enough to take full advantage of Spark? At least, I expected
>>> >> >> > the same runtime.
>>> >> >> >
>>> >> >> > Cheers,
>>> >> >> >
>>> >> >> > Jao

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
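As a footnote to the thread: Jaonary's finding that the kmeans|| initialization step dominates the runtime can be tested by building the model through the `KMeans` class instead of `KMeans.train`, since the builder exposes the initialization mode and the number of runs that the thread discusses. A sketch against the MLlib 1.x API, not from the original messages (the k, iteration count, and file name mirror the code above; whether `KMeans.RANDOM` makes the benchmark truly comparable to Julia's kmeans++ is an assumption to verify):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansInitBenchmark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("KMeansInitBenchmark").setMaster("local[8]"))

    val data = sc.textFile("sample.csv", 8)
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache()

    // KMeans.train defaults to kmeans|| initialization; constructing the
    // model explicitly lets us switch to random initialization and skip
    // the expensive kmeans|| passes that the thread identifies as the
    // bottleneck, isolating the cost of the Lloyd iterations themselves.
    val model = new KMeans()
      .setK(500)
      .setMaxIterations(2)
      .setRuns(1)                              // the default, made explicit
      .setInitializationMode(KMeans.RANDOM)    // vs. KMeans.K_MEANS_PARALLEL
      .run(data)

    println(s"cost with random init: ${model.computeCost(data)}")
    sc.stop()
  }
}
```

Timing the two initialization modes separately on the same cached RDD would show how much of the 1.5 hours is initialization rather than iteration.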