I'm using:

    org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20);

Cpu cores: 8 (using the default Spark conf, though). On partitions, I'm not sure how to find that.
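(Aside: assuming `data` is a JavaRDD<Vector>, as the train() call above suggests but the thread never confirms, the partition count can be read off the RDD itself; a minimal sketch:)

    // Sketch, assuming `data` is a JavaRDD<Vector> (not confirmed in the
    // thread). A JavaRDD exposes its partition list, so the count is its size.
    int numPartitions = data.partitions().size();
    System.out.println("Partitions: " + numPartitions);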
On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz <brk...@gmail.com> wrote:

> What are the other parameters? Are you just setting k=3? What about # of
> runs? How many partitions do you have? How many cores does your machine
> have?
>
> Thanks,
> Burak
>
> On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando <nir...@wso2.com> wrote:
>
>> Hi Burak,
>>
>> k = 3
>> dimension = 785 features
>> Spark 1.4
>>
>> On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz <brk...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> How are you running K-Means? What is your k? What is the dimension of
>>> your dataset (columns)? Which Spark version are you using?
>>>
>>> Thanks,
>>> Burak
>>>
>>> On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando <nir...@wso2.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> For a fairly large dataset (30MB), KMeansModel.computeCost takes a lot
>>>> of time (16+ minutes).
>>>>
>>>> It spends most of that time in this task:
>>>>
>>>> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
>>>> org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
>>>>
>>>> Can this be improved?

--
Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
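On the original question (computeCost taking 16+ minutes on 30MB): one common cause is an input RDD that is neither cached nor spread across the available cores, so the input lineage gets recomputed for each of the 20 training iterations and once more inside computeCost. A minimal sketch of that mitigation, assuming `data` is a JavaRDD<Vector> built from the dataset:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.mllib.clustering.KMeans;
    import org.apache.spark.mllib.clustering.KMeansModel;
    import org.apache.spark.mllib.linalg.Vector;

    // Spread the vectors across the 8 cores and keep them in memory so the
    // 20 training iterations and computeCost() reuse the cached data instead
    // of recomputing the input each time.
    JavaRDD<Vector> vectors = data.repartition(8).cache();
    vectors.count(); // force materialization before timing anything

    KMeansModel model = KMeans.train(vectors.rdd(), 3, 20);
    double cost = model.computeCost(vectors.rdd());
    System.out.println("Sum of squared distances: " + cost);

Whether missing caching is the actual culprit here isn't confirmed in the thread; with a default Spark conf it is simply the first thing to rule out.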