I've tried some additional experiments with kmeans and I finally got it working as I expected. In fact, the number of partitions is critical. I had a data set of 240000x784 with 12 partitions, and in this case the kmeans algorithm took a very long time (hours) to converge. When I changed the partitioning to 32, the same kmeans (runs = 10, k = 10, iterations = 300, init = kmeans||) converged in 4 min with 8 cores! As a comparison, the same problem solved with python scikit-learn takes 21 min on a single core. So spark wins :)
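For reference, a minimal sketch of the fast configuration described above (runs = 10, k = 10, iterations = 300, kmeans|| init, 32 partitions). The file name and app name are assumptions for illustration; this is not the exact program I ran:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansPartitioned {
  def main(args: Array[String]) {
    val sc = new SparkContext(
      new SparkConf().setAppName("KMeans partitions").setMaster("local[8]"))

    // Ask for at least 32 partitions up front instead of the default
    // (which was only 12 for my file, and gave the hours-long run).
    val data = sc.textFile("sample.csv", 32)
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache()

    // Same settings as the 4-minute run quoted above.
    val model = new KMeans()
      .setK(10)
      .setMaxIterations(300)
      .setRuns(10)
      .setInitializationMode(KMeans.K_MEANS_PARALLEL)
      .run(data)

    println(s"cost: ${model.computeCost(data)}")
  }
}
```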
In conclusion, setting the number of partitions correctly is essential. Is there a rule of thumb for that?

On Mon, Dec 15, 2014 at 8:55 PM, Xiangrui Meng <men...@gmail.com> wrote:
>
> Please check the number of partitions after sc.textFile. Use
> sc.textFile('...', 8) to have at least 8 partitions. -Xiangrui
>
> On Tue, Dec 9, 2014 at 4:58 AM, DB Tsai <dbt...@dbtsai.com> wrote:
> > You just need to use the latest master code without any configuration
> > to get the performance improvement from my PR.
> >
> > Sincerely,
> >
> > DB Tsai
> > -------------------------------------------------------
> > My Blog: https://www.dbtsai.com
> > LinkedIn: https://www.linkedin.com/in/dbtsai
> >
> > On Mon, Dec 8, 2014 at 7:53 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
> >> After some investigation, I learned that I can't compare kmeans in mllib
> >> with another kmeans implementation directly. The kmeans|| initialization
> >> step takes more time than the algorithm implemented in julia, for example.
> >> There is also the ability to run multiple runs of the kmeans algorithm in
> >> mllib, even though by default the number of runs is 1.
> >>
> >> DB Tsai, can you please tell me the configuration you used for the
> >> improvement you mention in your pull request? I'd like to run the same
> >> benchmark on mnist8m on my computer.
> >>
> >> Cheers,
> >>
> >> On Fri, Dec 5, 2014 at 10:34 PM, DB Tsai <dbt...@dbtsai.com> wrote:
> >>>
> >>> Also, are you using the latest master in this experiment? A PR merged
> >>> into master a couple of days ago will speed up k-means three times.
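On the rule-of-thumb question: the Spark tuning guide's commonly cited guidance is roughly 2-4 tasks (partitions) per CPU core. A quick sketch of checking and fixing the partition count, following Xiangrui's suggestion; the file name is the one from this thread and the numbers are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("check partitions").setMaster("local[8]"))

// Check how many partitions sc.textFile actually gave you; with a local
// file the default can be far below the number of cores.
val raw = sc.textFile("sample.csv")
println(s"partitions: ${raw.partitions.size}")

// Roughly 2-4 partitions per core is a common starting point; for 8 cores:
val target = 8 * 4

// Either request a minimum number of splits at load time...
val data = sc.textFile("sample.csv", target)

// ...or repartition an RDD you already have (note this triggers a shuffle):
val reshaped = raw.repartition(target)
```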
> >>> See
> >>> https://github.com/apache/spark/commit/7fc49ed91168999d24ae7b4cc46fbb4ec87febc1
> >>>
> >>> Sincerely,
> >>>
> >>> DB Tsai
> >>> -------------------------------------------------------
> >>> My Blog: https://www.dbtsai.com
> >>> LinkedIn: https://www.linkedin.com/in/dbtsai
> >>>
> >>> On Fri, Dec 5, 2014 at 9:36 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
> >>> > The code is really simple:
> >>> >
> >>> > object TestKMeans {
> >>> >   def main(args: Array[String]) {
> >>> >     val conf = new SparkConf()
> >>> >       .setAppName("Test KMeans")
> >>> >       .setMaster("local[8]")
> >>> >       .set("spark.executor.memory", "8g")
> >>> >
> >>> >     val sc = new SparkContext(conf)
> >>> >
> >>> >     val numClusters = 500
> >>> >     val numIterations = 2
> >>> >
> >>> >     val data = sc.textFile("sample.csv").map(x =>
> >>> >       Vectors.dense(x.split(',').map(_.toDouble)))
> >>> >     data.cache()
> >>> >
> >>> >     val clusters = KMeans.train(data, numClusters, numIterations)
> >>> >
> >>> >     println(clusters.clusterCenters.size)
> >>> >
> >>> >     val wssse = clusters.computeCost(data)
> >>> >     println(s"error : $wssse")
> >>> >   }
> >>> > }
> >>> >
> >>> > For testing purposes, I generated sample random data with julia and
> >>> > stored it in a comma-delimited csv file. The dimensions are 248000 x 384.
> >>> >
> >>> > In the target application, I will have more than 248k data points to cluster.
> >>> >
> >>> > On Fri, Dec 5, 2014 at 6:03 PM, Davies Liu <dav...@databricks.com> wrote:
> >>> >>
> >>> >> Could you post your script to reproduce the results (also how to
> >>> >> generate the dataset)? That will help us to investigate it.
> >>> >>
> >>> >> On Fri, Dec 5, 2014 at 8:40 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
> >>> >> > Hmm, here I use spark in local mode on my laptop with 8 cores.
> >>> >> > The data is on my local filesystem. Even though there is an overhead
> >>> >> > due to the distributed computation, I found the difference between the
> >>> >> > runtimes of the two implementations really, really huge. Is there a
> >>> >> > benchmark on how well the algorithm implemented in mllib performs?
> >>> >> >
> >>> >> > On Fri, Dec 5, 2014 at 4:56 PM, Sean Owen <so...@cloudera.com> wrote:
> >>> >> >>
> >>> >> >> Spark has much more overhead, since it's set up to distribute the
> >>> >> >> computation. Julia isn't distributed, and so has no such overhead in a
> >>> >> >> completely in-core implementation. You generally use Spark when you
> >>> >> >> have a problem large enough to warrant distributing, or your data
> >>> >> >> already lives in a distributed store like HDFS.
> >>> >> >>
> >>> >> >> But it's also possible you're not configuring the implementations the
> >>> >> >> same way, yes. There's not enough info here really to say.
> >>> >> >>
> >>> >> >> On Fri, Dec 5, 2014 at 9:50 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
> >>> >> >> > Hi all,
> >>> >> >> >
> >>> >> >> > I'm trying to run clustering with the kmeans algorithm. The size of
> >>> >> >> > my data set is about 240k vectors of dimension 384.
> >>> >> >> >
> >>> >> >> > Solving the problem with the kmeans available in julia (kmeans++)
> >>> >> >> >
> >>> >> >> > http://clusteringjl.readthedocs.org/en/latest/kmeans.html
> >>> >> >> >
> >>> >> >> > takes about 8 minutes on a single core.
> >>> >> >> >
> >>> >> >> > Solving the same problem with spark kmeans|| takes more than 1.5
> >>> >> >> > hours with 8 cores!
> >>> >> >> >
> >>> >> >> > Either they don't implement the same algorithm, or I don't
> >>> >> >> > understand how the kmeans in spark works.
> >>> >> >> > Is my data not big enough to take full advantage of spark? At the
> >>> >> >> > least, I expected the same runtime.
> >>> >> >> >
> >>> >> >> > Cheers,
> >>> >> >> >
> >>> >> >> > Jao

> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
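A side note for anyone reproducing the TestKMeans benchmark quoted earlier in the thread: in local mode `spark.executor.memory` has no effect (the driver JVM heap is what matters), and `cache()` is lazy, so the first k-means pass also pays the CSV parsing cost. A sketch with those two points addressed; the file name, partition count, and k-means parameters are taken from the thread, and this is illustrative rather than the poster's actual code:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object TestKMeansBench {
  def main(args: Array[String]) {
    // spark.executor.memory is ignored in local mode; size the driver
    // JVM instead, e.g. with: spark-submit --driver-memory 8g
    val conf = new SparkConf().setAppName("Test KMeans").setMaster("local[8]")
    val sc = new SparkContext(conf)

    // Ask for enough partitions up front (32 was the fast setting above).
    val data = sc.textFile("sample.csv", 32)
      .map(x => Vectors.dense(x.split(',').map(_.toDouble)))
      .cache()
    data.count() // force materialization so timing excludes the CSV parse

    val start = System.nanoTime()
    val clusters = KMeans.train(data, 500, 2)
    println(s"trained in ${(System.nanoTime() - start) / 1e9} s")
    println(s"error : ${clusters.computeCost(data)}")
  }
}
```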