Thanks Burak. Now the repartition step itself takes minutes:

Active Stages (1)

Stage Id: 42
Description: repartition at UnsupervisedSparkModelBuilder.java:120
Submitted: 2015/07/14 08:59:30
Duration: 3.6 min
Tasks (Succeeded/Total): 0/3
Size (one value shown across Input/Output/Shuffle Read/Shuffle Write): 14.6 MB
Details:
  org.apache.spark.api.java.JavaRDD.repartition(JavaRDD.scala:100)
  org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:120)
  org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84)
  org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576)
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  java.lang.Thread.run(Thread.java:745)

Pending Stages (1)

Stage Id: 43
Description: sum at KMeansModel.scala:70
Submitted: Unknown
Duration: Unknown
Tasks (Succeeded/Total): 0/8
Details:
  org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
  org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
  org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:121)
  org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84)
  org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576)
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  java.lang.Thread.run(Thread.java:745)

On Mon, Jul 13, 2015 at 11:44 PM, Burak Yavuz <brk...@gmail.com> wrote:

> Can you call repartition(8) or repartition(16) on data.rdd() before KMeans,
> and also .cache()?
>
> Something like (I'm assuming you are using Java):
> ```
> JavaRDD<Vector> input = data.repartition(8).cache();
> org.apache.spark.mllib.clustering.KMeans.train(input.rdd(), 3, 20);
> ```
>
> On Mon, Jul 13, 2015 at 11:10 AM, Nirmal Fernando <nir...@wso2.com> wrote:
>
>> I'm using:
>>
>> org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20);
>>
>> CPU cores: 8 (using the default Spark conf, though)
>>
>> On partitions, I'm not sure how to find that.
>>
>> On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz <brk...@gmail.com> wrote:
>>
>>> What are the other parameters? Are you just setting k=3? What about # of
>>> runs? How many partitions do you have? How many cores does your machine
>>> have?
>>>
>>> Thanks,
>>> Burak
>>>
>>> On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando <nir...@wso2.com>
>>> wrote:
>>>
>>>> Hi Burak,
>>>>
>>>> k = 3
>>>> dimension = 785 features
>>>> Spark 1.4
>>>>
>>>> On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz <brk...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> How are you running K-Means? What is your k? What is the dimension of
>>>>> your dataset (columns)? Which Spark version are you using?
>>>>>
>>>>> Thanks,
>>>>> Burak
>>>>>
>>>>> On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando <nir...@wso2.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> For a fairly large dataset, 30MB, KMeansModel.computeCost takes a lot
>>>>>> of time (16+ minutes).
>>>>>>
>>>>>> Most of the time is spent in this task:
>>>>>>
>>>>>> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
>>>>>> org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
>>>>>>
>>>>>> Can this be improved?
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Thanks & regards,
>>>>>> Nirmal
>>>>>>
>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>> Mobile: +94715779733
>>>>>> Blog: http://nirmalfdo.blogspot.com/
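
For context on the stage that is slow in this thread: KMeansModel.computeCost reports the K-Means cost, that is, the sum of squared distances from each point to its nearest cluster center. Below is a minimal plain-Java sketch of that calculation, without Spark. The class and method names (KMeansCostSketch, squaredDistance, computeCost) and the sample points are made up for illustration; this is not the MLlib implementation.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: a single-threaded version of the K-Means cost,
// i.e. the sum of squared distances from each point to its nearest center.
// KMeansCostSketch and its methods are hypothetical names, not MLlib code.
public class KMeansCostSketch {

    // Squared Euclidean distance between two vectors of equal length.
    public static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Cost = sum over all points of the squared distance to the closest center.
    public static double computeCost(List<double[]> points, List<double[]> centers) {
        double cost = 0.0;
        for (double[] p : points) {
            double best = Double.POSITIVE_INFINITY;
            for (double[] c : centers) {
                best = Math.min(best, squaredDistance(p, c));
            }
            cost += best;
        }
        return cost;
    }

    public static void main(String[] args) {
        // Made-up sample data: two centers and three points in 2 dimensions.
        List<double[]> centers = Arrays.asList(
                new double[]{0.0, 0.0},
                new double[]{10.0, 10.0});
        List<double[]> points = Arrays.asList(
                new double[]{1.0, 0.0},   // nearest center (0,0), squared distance 1
                new double[]{10.0, 8.0},  // nearest center (10,10), squared distance 4
                new double[]{0.0, 0.0});  // sits on a center, squared distance 0
        System.out.println(computeCost(points, centers)); // prints 5.0
    }
}
```

The arithmetic is one pass over the data, proportional to points x k x dimensions. With k = 3 and 785 features that is a modest amount of work, so when the sum stage runs for many minutes the time is probably going into recomputing or shuffling the input RDD rather than into the distance calculation itself, which would be consistent with the repartition(...).cache() suggestion quoted above.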