Could limited memory be what is causing this slowness?

On Tue, Jul 14, 2015 at 9:00 AM, Nirmal Fernando <nir...@wso2.com> wrote:
> Thanks Burak.
>
> Now it takes minutes to repartition;
>
> Active Stages (1)
>
> Stage Id: 42
> Description: repartition at UnsupervisedSparkModelBuilder.java:120
> <http://localhost:4040/stages/stage?id=42&attempt=0>
>
> org.apache.spark.api.java.JavaRDD.repartition(JavaRDD.scala:100)
> org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:120)
> org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84)
> org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
>
> Submitted: 2015/07/14 08:59:30
> Duration: 3.6 min
> Tasks (Succeeded/Total): 0/3
> Input / Output / Shuffle Read / Shuffle Write: 14.6 MB
>
> Pending Stages (1)
>
> Stage Id: 43
> Description: sum at KMeansModel.scala:70
> <http://localhost:4040/stages/stage?id=43&attempt=0>
>
> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
> org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
> org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:121)
> org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84)
> org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
>
> Submitted: Unknown
> Duration: Unknown
> Tasks (Succeeded/Total): 0/8
>
> On Mon, Jul 13, 2015 at 11:44 PM, Burak Yavuz <brk...@gmail.com> wrote:
>
>> Can you call repartition(8) or 16 on data.rdd(), before KMeans, and also,
>> .cache()?
>>
>> something like, (I'm assuming you are using Java):
>> ```
>> JavaRDD<Vector> input = data.repartition(8).cache();
>> org.apache.spark.mllib.clustering.KMeans.train(input.rdd(), 3, 20);
>> ```
>>
>> On Mon, Jul 13, 2015 at 11:10 AM, Nirmal Fernando <nir...@wso2.com>
>> wrote:
>>
>>> I'm using;
>>>
>>> org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20);
>>>
>>> CPU cores: 8 (using the default Spark conf though)
>>>
>>> On partitions, I'm not sure how to find that.
>>>
>>> On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz <brk...@gmail.com> wrote:
>>>
>>>> What are the other parameters? Are you just setting k=3? What about #
>>>> of runs? How many partitions do you have? How many cores does your machine
>>>> have?
>>>>
>>>> Thanks,
>>>> Burak
>>>>
>>>> On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando <nir...@wso2.com>
>>>> wrote:
>>>>
>>>>> Hi Burak,
>>>>>
>>>>> k = 3
>>>>> dimension = 785 features
>>>>> Spark 1.4
>>>>>
>>>>> On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz <brk...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> How are you running K-Means? What is your k? What is the dimension of
>>>>>> your dataset (columns)? Which Spark version are you using?
>>>>>>
>>>>>> Thanks,
>>>>>> Burak
>>>>>>
>>>>>> On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando <nir...@wso2.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> For a fairly large dataset, 30MB, KMeansModel.computeCost takes a lot
>>>>>>> of time (16+ minutes).
>>>>>>>
>>>>>>> It takes a lot of time at this task;
>>>>>>>
>>>>>>> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
>>>>>>> org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
>>>>>>>
>>>>>>> Can this be improved?

--
Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
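
P.S. On the memory question at the top of this message: one quick thing that might be worth ruling out is the heap limit of the JVM that hosts the job. The sketch below assumes Spark is running in local mode inside the same JVM as the application, which the thread does not confirm. `HeapCheck` and its methods are hypothetical names for illustration, not part of Spark or WSO2 ML; only the standard `java.lang.Runtime` calls are real.

```java
// Hypothetical helper (not part of Spark or WSO2 ML): reports the heap
// limits of the JVM it runs in. Assumes Spark runs in local mode in the
// same JVM, so this heap is also what cached partitions have to fit in.
final class HeapCheck {

    // Convert a byte count to whole mebibytes.
    static long toMb(long bytes) {
        return bytes / (1024L * 1024L);
    }

    // Summarise the maximum heap and the portion currently in use.
    static String describeHeap() {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();                      // upper bound, e.g. from -Xmx
        long used = rt.totalMemory() - rt.freeMemory(); // heap currently occupied
        return "max=" + toMb(max) + "MB used=" + toMb(used) + "MB";
    }

    public static void main(String[] args) {
        System.out.println(describeHeap());
    }
}
```

Calling `HeapCheck.describeHeap()` just before the `repartition(...)` call would show how much headroom is left at that point. If the used figure sits close to the maximum, memory pressure is a plausible suspect; if not, the cause is more likely elsewhere.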