Re: Why KMeans with mllib is so slow ?

2016-03-14 Thread Priya Ch
...set is about 270k vectors, each has about 350 dimensions. If I set k=500, the job takes about 3 hrs on my cluster. The cluster has 7 executors, each has 8 cores... If I set k=5000, which is the required value for my task, the job goes on forever...

Re: Why KMeans with mllib is so slow ?

2016-03-12 Thread Xi Shen

Re: Why KMeans with mllib is so slow ?

2016-03-12 Thread Chitturi Padma

Re: Why KMeans with mllib is so slow ?

2015-03-29 Thread Xi Shen

Re: Why KMeans with mllib is so slow ?

2015-03-28 Thread davidshen84
...cluster. The cluster has 7 executors, each has 8 cores... If I set k=5000, which is the required value for my task, the job goes on forever... Thanks, David

Re: Why KMeans with mllib is so slow ?

2015-03-28 Thread Burak Yavuz

Re: Why KMeans with mllib is so slow ?

2014-12-15 Thread Xiangrui Meng
Please check the number of partitions after sc.textFile. Use sc.textFile('...', 8) to have at least 8 partitions. -Xiangrui
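
A minimal sketch of this suggestion (not code from the thread): the second argument to sc.textFile sets a minimum number of partitions so the k-means stages can keep all cores busy. The file path and the parsing step are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors

    val sc = new SparkContext(new SparkConf().setAppName("KMeansPartitions"))
    // Ask sc.textFile for at least 8 partitions when reading the input
    // (placeholder path and comma-separated format, not from the thread).
    val data = sc.textFile("hdfs:///path/to/vectors.txt", 8)
      .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
      .cache()
    // Check how many partitions the RDD actually got.
    println(s"partitions: ${data.partitions.length}")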

Re: Why KMeans with mllib is so slow ?

2014-12-15 Thread Jaonary Rabarisoa
I've tried some additional experiments with kmeans and I finally got it working as I expected. In fact, the number of partitions is critical. I had a data set of 24x784 with 12 partitions. In this case the kmeans algorithm took a very long time (hours) to converge. When I changed the...
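
A hedged sketch of the repartitioning step described above, assuming an already parsed RDD[Vector] named parsed (for example the one sketched after Xiangrui's reply); the figure of a couple of partitions per core is a rule of thumb, not a value taken from the thread.

    // Spread the parsed vectors over enough partitions to keep all cores busy
    // (8 cores and a factor of 2 are assumptions, not from the thread).
    val numCores = 8
    val balanced =
      if (parsed.partitions.length < numCores) parsed.repartition(numCores * 2)
      else parsed
    balanced.cache()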

Re: Why KMeans with mllib is so slow ?

2014-12-08 Thread DB Tsai
You just need to use the latest master code without any configuration to get the performance improvement from my PR. Sincerely, DB Tsai -- My Blog: https://www.dbtsai.com | LinkedIn: https://www.linkedin.com/in/dbtsai

Why KMeans with mllib is so slow ?

2014-12-05 Thread Jaonary Rabarisoa
Hi all, I'm trying to run clustering with the kmeans algorithm. The size of my data set is about 240k vectors of dimension 384. Solving the problem with the kmeans available in Julia (kmeans++, http://clusteringjl.readthedocs.org/en/latest/kmeans.html) takes about 8 minutes on a single core.

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread Sean Owen
Spark has much more overhead, since it's set up to distribute the computation. Julia isn't distributed, and so has no such overhead in a completely in-core implementation. You generally use Spark when you have a problem large enough to warrant distributing, or your data already lives in a...

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread Davies Liu
Could you post your script to reproduce the results (and also how to generate the dataset)? That will help us to investigate it.

On Fri, Dec 5, 2014 at 8:40 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
> Hmm, here I use Spark in local mode on my laptop with 8 cores. The data is on my local...
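
Purely as an illustration of what such a reproduction script might contain (none of this code is from the thread), a synthetic dataset of roughly the size discussed (240k vectors of dimension 384) can be generated with MLlib's RandomRDDs:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.random.RandomRDDs

    val conf = new SparkConf().setAppName("KMeansRepro").setMaster("local[8]")
    val sc = new SparkContext(conf)
    // 240k random Gaussian vectors of dimension 384, spread over 8 partitions
    // (partition count and seed are arbitrary choices for the sketch).
    val data = RandomRDDs.normalVectorRDD(sc, numRows = 240000L, numCols = 384,
      numPartitions = 8, seed = 42L).cache()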

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread Jaonary Rabarisoa
The code is really simple:

    import org.apache.spark.{SparkConf, SparkContext}

    object TestKMeans {
      def main(args: Array[String]) {
        val conf = new SparkConf()
          .setAppName("Test KMeans")
          .setMaster("local[8]")
          .set("spark.executor.memory", "8g")
        val sc = new SparkContext(conf)
        val numClusters = 500
        ...
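
The snippet above is cut off; as a non-authoritative illustration, the remaining steps of such a script usually look roughly like the following, where data is an RDD[Vector] (for example one parsed from a text file, or the synthetic one sketched after Davies's message) and the iteration count is an assumption rather than a value from the thread.

    import org.apache.spark.mllib.clustering.KMeans

    // Train with the k chosen above and a fixed iteration cap
    // (20 iterations is an assumed value, not from the thread).
    val numIterations = 20
    val model = KMeans.train(data, numClusters, numIterations)
    // Within-set sum of squared errors, a rough quality measure.
    println("WSSSE: " + model.computeCost(data))
    sc.stop()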

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread DB Tsai
Also, are you using the latest master in this experiment? A PR merged into master a couple of days ago speeds up k-means by about three times. See https://github.com/apache/spark/commit/7fc49ed91168999d24ae7b4cc46fbb4ec87febc1 Sincerely, DB Tsai