Re: [MLLib][Kmeans] KMeansModel.computeCost takes a lot of time

2015-07-13 Thread Nirmal Fernando
Thanks Burak.

Now it takes minutes to repartition:

Active Stages (1):

  Stage 42: repartition at UnsupervisedSparkModelBuilder.java:120
  Submitted: 2015/07/14 08:59:30 | Duration: 3.6 min | Tasks: 0/3 | Input: 14.6 MB
  Kill: http://localhost:4040/stages/stage/kill/?id=42&terminate=true
  Details: http://localhost:4040/stages/stage?id=42&attempt=0

  org.apache.spark.api.java.JavaRDD.repartition(JavaRDD.scala:100)
  org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:120)
  org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84)
  org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576)
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  java.lang.Thread.run(Thread.java:745)

Pending Stages (1):

  Stage 43: sum at KMeansModel.scala:70
  Submitted: Unknown | Duration: Unknown | Tasks: 0/8
  Details: http://localhost:4040/stages/stage?id=43&attempt=0

  org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
  org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
  org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:121)
  org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84)
  org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576)
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  java.lang.Thread.run(Thread.java:745)
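
A minimal sketch of the full pattern being tried here (assuming data is the
JavaRDD<Vector> being clustered, per the suggestion quoted below; the count()
call is just one illustrative way to materialize the cache):

```
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;

// Repartition once and cache, so the shuffle cost is paid a single time
// (8 partitions to match the 8 cores mentioned in this thread).
JavaRDD<Vector> input = data.repartition(8).cache();
input.count(); // forces evaluation so the cached partitions are built

// k = 3, maxIterations = 20, as used elsewhere in this thread.
KMeansModel model = KMeans.train(input.rdd(), 3, 20);

// computeCost (the "sum at KMeansModel.scala:70" stage above) then runs
// over the cached partitions instead of recomputing the whole lineage.
double cost = model.computeCost(input.rdd());
```
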
On Mon, Jul 13, 2015 at 11:44 PM, Burak Yavuz brk...@gmail.com wrote:

 Can you call repartition(8) (or 16) on data.rdd() before KMeans, and also
 .cache() it?

 Something like this (I'm assuming you are using Java):
 ```
 JavaRDD<Vector> input = data.repartition(8).cache();
 org.apache.spark.mllib.clustering.KMeans.train(input.rdd(), 3, 20);
 ```

 On Mon, Jul 13, 2015 at 11:10 AM, Nirmal Fernando nir...@wso2.com wrote:

 I'm using:

 org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20);

 CPU cores: 8 (using the default Spark conf, though)

 On partitions, I'm not sure how to find that.

 On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz brk...@gmail.com wrote:

 What are the other parameters? Are you just setting k=3? What about # of
 runs? How many partitions do you have? How many cores does your machine
 have?

 Thanks,
 Burak

 On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando nir...@wso2.com
 wrote:

 Hi Burak,

 k = 3
 dimension = 785 features
 Spark 1.4

 On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz brk...@gmail.com wrote:

 Hi,

 How are you running K-Means? What is your k? What is the dimension of
 your dataset (columns)? Which Spark version are you using?

 Thanks,
 Burak

 On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando nir...@wso2.com
 wrote:

 Hi,

 For a fairly large dataset (30 MB), KMeansModel.computeCost takes a lot
 of time (16+ minutes).

 It takes a lot of time at this task:

 org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
 org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)

 Can this be improved?

 --

 Thanks & regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/






 --

 Thanks & regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/






 --

 Thanks & regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/






-- 

Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: [MLLib][Kmeans] KMeansModel.computeCost takes a lot of time

2015-07-13 Thread Nirmal Fernando
Could limited memory be causing this slowness?

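For context: in local mode the executors run inside the driver JVM, so driver
memory is the relevant limit, and the default is fairly small. A minimal
sketch of the knobs involved; the 4g value is an illustrative guess, not a
measured requirement:

```
import org.apache.spark.SparkConf;

// Illustrative only. Note that spark.driver.memory must be set before
// the driver JVM starts (e.g. spark-submit --driver-memory 4g); setting
// it on a SparkConf inside an already-running driver has no effect.
SparkConf conf = new SparkConf()
    .setAppName("kmeans")
    .set("spark.executor.memory", "4g"); // guessed value
```
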
On Tue, Jul 14, 2015 at 9:00 AM, Nirmal Fernando nir...@wso2.com wrote:

 Thanks Burak.

 Now it takes minutes to repartition:

 Active Stages (1):

   Stage 42: repartition at UnsupervisedSparkModelBuilder.java:120
   Submitted: 2015/07/14 08:59:30 | Duration: 3.6 min | Tasks: 0/3 | Input: 14.6 MB
   Kill: http://localhost:4040/stages/stage/kill/?id=42&terminate=true
   Details: http://localhost:4040/stages/stage?id=42&attempt=0

   org.apache.spark.api.java.JavaRDD.repartition(JavaRDD.scala:100)
   org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:120)
   org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84)
   org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   java.lang.Thread.run(Thread.java:745)

 Pending Stages (1):

   Stage 43: sum at KMeansModel.scala:70
   Submitted: Unknown | Duration: Unknown | Tasks: 0/8
   Details: http://localhost:4040/stages/stage?id=43&attempt=0

   org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
   org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
   org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:121)
   org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84)
   org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   java.lang.Thread.run(Thread.java:745)

 On Mon, Jul 13, 2015 at 11:44 PM, Burak Yavuz brk...@gmail.com wrote:

 Can you call repartition(8) (or 16) on data.rdd() before KMeans, and also
 .cache() it?

 Something like this (I'm assuming you are using Java):
 ```
 JavaRDD<Vector> input = data.repartition(8).cache();
 org.apache.spark.mllib.clustering.KMeans.train(input.rdd(), 3, 20);
 ```

 On Mon, Jul 13, 2015 at 11:10 AM, Nirmal Fernando nir...@wso2.com
 wrote:

 I'm using:

 org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20);

 CPU cores: 8 (using the default Spark conf, though)

 On partitions, I'm not sure how to find that.

 On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz brk...@gmail.com wrote:

 What are the other parameters? Are you just setting k=3? What about #
 of runs? How many partitions do you have? How many cores does your machine
 have?

 Thanks,
 Burak

 On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando nir...@wso2.com
 wrote:

 Hi Burak,

 k = 3
 dimension = 785 features
 Spark 1.4

 On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz brk...@gmail.com
 wrote:

 Hi,

 How are you running K-Means? What is your k? What is the dimension of
 your dataset (columns)? Which Spark version are you using?

 Thanks,
 Burak

 On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando nir...@wso2.com
 wrote:

 Hi,

 For a fairly large dataset (30 MB), KMeansModel.computeCost takes a lot
 of time (16+ minutes).

 It takes a lot of time at this task:

 org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
 org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)

 Can this be improved?

 --

 Thanks & regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/






 --

 Thanks & regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/






 --

 Thanks & regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/






 --

 Thanks & regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/





-- 

Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: [MLLib][Kmeans] KMeansModel.computeCost takes a lot of time

2015-07-13 Thread Burak Yavuz
What are the other parameters? Are you just setting k=3? What about # of
runs? How many partitions do you have? How many cores does your machine
have?

Thanks,
Burak

On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando nir...@wso2.com wrote:

 Hi Burak,

 k = 3
 dimension = 785 features
 Spark 1.4

 On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz brk...@gmail.com wrote:

 Hi,

 How are you running K-Means? What is your k? What is the dimension of
 your dataset (columns)? Which Spark version are you using?

 Thanks,
 Burak

 On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando nir...@wso2.com wrote:

 Hi,

 For a fairly large dataset (30 MB), KMeansModel.computeCost takes a lot
 of time (16+ minutes).

 It takes a lot of time at this task:

 org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
 org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)

 Can this be improved?

 --

 Thanks & regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/






 --

 Thanks & regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/





Re: [MLLib][Kmeans] KMeansModel.computeCost takes a lot of time

2015-07-13 Thread Nirmal Fernando
I'm using:

org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20);

CPU cores: 8 (using the default Spark conf, though)

On partitions, I'm not sure how to find that.
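
For reference, the partition count can be read off the RDD directly (a
minimal sketch; data is the JavaRDD from the snippet above):

```
// Number of partitions backing the RDD; this also shows up per stage
// in the Spark UI (http://localhost:4040) as Tasks: Succeeded/Total.
int numPartitions = data.rdd().partitions().length;
System.out.println("partitions: " + numPartitions);
```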

On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz brk...@gmail.com wrote:

 What are the other parameters? Are you just setting k=3? What about # of
 runs? How many partitions do you have? How many cores does your machine
 have?

 Thanks,
 Burak

 On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando nir...@wso2.com wrote:

 Hi Burak,

 k = 3
 dimension = 785 features
 Spark 1.4

 On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz brk...@gmail.com wrote:

 Hi,

 How are you running K-Means? What is your k? What is the dimension of
 your dataset (columns)? Which Spark version are you using?

 Thanks,
 Burak

 On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando nir...@wso2.com
 wrote:

 Hi,

 For a fairly large dataset (30 MB), KMeansModel.computeCost takes a lot
 of time (16+ minutes).

 It takes a lot of time at this task:

 org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
 org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)

 Can this be improved?

 --

 Thanks & regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/






 --

 Thanks & regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/






-- 

Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: [MLLib][Kmeans] KMeansModel.computeCost takes a lot of time

2015-07-13 Thread Burak Yavuz
Can you call repartition(8) (or 16) on data.rdd() before KMeans, and also
.cache() it?

Something like this (I'm assuming you are using Java):
```
JavaRDD<Vector> input = data.repartition(8).cache();
org.apache.spark.mllib.clustering.KMeans.train(input.rdd(), 3, 20);
```

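(The .cache() matters because KMeans is iterative: with maxIterations = 20,
each pass re-reads the input, so an uncached RDD is recomputed from its
source on every iteration.)
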
On Mon, Jul 13, 2015 at 11:10 AM, Nirmal Fernando nir...@wso2.com wrote:

 I'm using:

 org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20);

 CPU cores: 8 (using the default Spark conf, though)

 On partitions, I'm not sure how to find that.

 On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz brk...@gmail.com wrote:

 What are the other parameters? Are you just setting k=3? What about # of
 runs? How many partitions do you have? How many cores does your machine
 have?

 Thanks,
 Burak

 On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando nir...@wso2.com
 wrote:

 Hi Burak,

 k = 3
 dimension = 785 features
 Spark 1.4

 On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz brk...@gmail.com wrote:

 Hi,

 How are you running K-Means? What is your k? What is the dimension of
 your dataset (columns)? Which Spark version are you using?

 Thanks,
 Burak

 On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando nir...@wso2.com
 wrote:

 Hi,

 For a fairly large dataset (30 MB), KMeansModel.computeCost takes a lot
 of time (16+ minutes).

 It takes a lot of time at this task:

 org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
 org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)

 Can this be improved?

 --

 Thanks & regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/






 --

 Thanks & regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/






 --

 Thanks & regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/





Re: [MLLib][Kmeans] KMeansModel.computeCost takes a lot of time

2015-07-13 Thread Nirmal Fernando
Hi Burak,

k = 3
dimension = 785 features
Spark 1.4

On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz brk...@gmail.com wrote:

 Hi,

 How are you running K-Means? What is your k? What is the dimension of your
 dataset (columns)? Which Spark version are you using?

 Thanks,
 Burak

 On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando nir...@wso2.com wrote:

 Hi,

 For a fairly large dataset (30 MB), KMeansModel.computeCost takes a lot
 of time (16+ minutes).

 It takes a lot of time at this task:

 org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
 org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)

 Can this be improved?

 --

 Thanks & regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/






-- 

Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: [MLLib][Kmeans] KMeansModel.computeCost takes a lot of time

2015-07-13 Thread Burak Yavuz
Hi,

How are you running K-Means? What is your k? What is the dimension of your
dataset (columns)? Which Spark version are you using?

Thanks,
Burak

On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando nir...@wso2.com wrote:

 Hi,

 For a fairly large dataset (30 MB), KMeansModel.computeCost takes a lot
 of time (16+ minutes).

 It takes a lot of time at this task:

 org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
 org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)

 Can this be improved?

 --

 Thanks & regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/