Hi,
For a fairly large dataset (30MB), KMeansModel.computeCost takes a lot of
time (16+ minutes).
Most of the time is spent in this task:
org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
Can this be
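As an aside, computeCost re-scans the input RDD (it sums the squared distances of points to their nearest center), so if the input is not cached the whole lineage gets recomputed. A minimal sketch, assuming Java and that `data` is a `JavaRDD<Vector>` of the features (the variable name is just for illustration):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;

// Cache once so training and computeCost both reuse the materialized data.
JavaRDD<Vector> cached = data.cache();
KMeansModel model = KMeans.train(cached.rdd(), 3, 20);
// WSSSE: sum of squared distances of points to their nearest center.
double cost = model.computeCost(cached.rdd());
```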
Thanks Burak.
Now the repartition itself takes minutes:
[Spark UI, Active Stages (1): stage 42 (kill), repartition at UnsupervisedSparkModelBuilder.java:120, kill link: http://localhost:4040/stages/stage/kill/?id=42&terminate=true]
Could the limited memory be causing this slowness?
On Tue, Jul 14, 2015 at 9:00 AM, Nirmal Fernando nir...@wso2.com wrote:
What are the other parameters? Are you just setting k=3? What about # of
runs? How many partitions do you have? How many cores does your machine
have?
Thanks,
Burak
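For reference, the number of runs can be passed explicitly through a train overload in the MLlib 1.x line; a hedged sketch (again assuming `data` is a `JavaRDD<Vector>`; runs defaults to 1 when omitted):

```java
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;

// train(rdd, k, maxIterations, runs): more runs means more restarts, more work.
KMeansModel model = KMeans.train(data.rdd(), 3, 20, 4);
```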
On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando nir...@wso2.com wrote:
Hi Burak,
k = 3
dimension = 785 features
Spark 1.4
I'm using:
org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20);
CPU cores: 8 (using the default Spark conf, though)
As for partitions, I'm not sure how to find that out.
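One way to check the partition count from the Java API (a sketch; `partitions()` comes from JavaRDDLike, and `data` is assumed to be the JavaRDD holding the features):

```java
// Each partition becomes one task, so this bounds a stage's parallelism.
int numPartitions = data.partitions().size();
System.out.println("partitions: " + numPartitions);
```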
On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz brk...@gmail.com wrote:
Can you call repartition(8) (or 16) on data.rdd() before KMeans, and also
.cache() it?
Something like this (I'm assuming you are using Java):
```
JavaRDD<Vector> input = data.repartition(8).cache();
org.apache.spark.mllib.clustering.KMeans.train(input.rdd(), 3, 20);
```
On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz brk...@gmail.com wrote:
Hi,
How are you running K-Means? What is your k? What is the dimension of your
dataset (columns)? Which Spark version are you using?
Thanks,
Burak