[MLLib][Kmeans] KMeansModel.computeCost takes a lot of time

2015-07-13 Thread Nirmal Fernando
Hi, For a fairly large dataset (30MB), KMeansModel.computeCost takes a lot of time (16+ minutes). Most of that time is spent in this task: org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33) org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70) Can this be
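For context on why this step is expensive: computeCost returns the within-set sum of squared errors (WSSSE), i.e. the sum of squared Euclidean distances from each point to its nearest cluster center, which Spark evaluates by mapping over the whole RDD and then summing via DoubleRDDFunctions.sum. A minimal single-machine sketch of that computation in plain Java (the class and method names here are illustrative, not Spark's API):

```java
// Conceptual sketch of KMeansModel.computeCost (WSSSE), not Spark code.
public class KMeansCost {

    // Squared Euclidean distance between two vectors of equal length.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Cost = sum over all points of the squared distance to the nearest
    // center. Spark does this as a distributed map + sum, so an uncached
    // or poorly partitioned RDD makes this single pass slow.
    static double computeCost(double[][] points, double[][] centers) {
        double cost = 0.0;
        for (double[] p : points) {
            double best = Double.MAX_VALUE;
            for (double[] c : centers) {
                best = Math.min(best, squaredDistance(p, c));
            }
            cost += best;
        }
        return cost;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {1, 0}, {10, 10}};
        double[][] centers = {{0.5, 0}, {10, 10}};
        System.out.println(computeCost(points, centers)); // prints 0.5
    }
}
```

The work is a single linear pass over the data (O(n * k * d) distance computations), so the later advice in this thread, repartitioning for parallelism and caching the input, targets exactly this pass.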

Re: [MLLib][Kmeans] KMeansModel.computeCost takes a lot of time

2015-07-13 Thread Nirmal Fernando
Thanks Burak. Now it takes minutes to repartition; the Spark UI shows one active stage: stage 42, "repartition at UnsupervisedSparkModelBuilder.java:120".

Re: [MLLib][Kmeans] KMeansModel.computeCost takes a lot of time

2015-07-13 Thread Nirmal Fernando
Could the limited memory be causing this slowness? On Tue, Jul 14, 2015 at 9:00 AM, Nirmal Fernando nir...@wso2.com wrote: Thanks Burak. Now it takes minutes to repartition; the Spark UI shows one active stage

Re: [MLLib][Kmeans] KMeansModel.computeCost takes a lot of time

2015-07-13 Thread Burak Yavuz
What are the other parameters? Are you just setting k=3? What about the number of runs? How many partitions do you have? How many cores does your machine have? Thanks, Burak On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando nir...@wso2.com wrote: Hi Burak, k = 3, dimension = 785 features, Spark 1.4

Re: [MLLib][Kmeans] KMeansModel.computeCost takes a lot of time

2015-07-13 Thread Nirmal Fernando
I'm using: org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20); CPU cores: 8 (using the default Spark conf though). On partitions, I'm not sure how to find that. On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz brk...@gmail.com wrote: What are the other parameters? Are you just

Re: [MLLib][Kmeans] KMeansModel.computeCost takes a lot of time

2015-07-13 Thread Burak Yavuz
Can you call repartition(8) (or 16) on data.rdd() before KMeans, and also .cache()? Something like (I'm assuming you are using Java):

```
JavaRDD<Vector> input = data.repartition(8).cache();
org.apache.spark.mllib.clustering.KMeans.train(input.rdd(), 3, 20);
```

On Mon, Jul 13, 2015 at 11:10 AM,

Re: [MLLib][Kmeans] KMeansModel.computeCost takes a lot of time

2015-07-13 Thread Nirmal Fernando
Hi Burak, k = 3, dimension = 785 features, Spark 1.4. On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz brk...@gmail.com wrote: Hi, How are you running K-Means? What is your k? What is the dimension of your dataset (columns)? Which Spark version are you using? Thanks, Burak On Mon, Jul 13,

Re: [MLLib][Kmeans] KMeansModel.computeCost takes a lot of time

2015-07-13 Thread Burak Yavuz
Hi, How are you running K-Means? What is your k? What is the dimension of your dataset (columns)? Which Spark version are you using? Thanks, Burak On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando nir...@wso2.com wrote: Hi, For a fairly large dataset (30MB), KMeansModel.computeCost takes a lot