Is it v0.9? Did you run in local mode? Try setting --driver-memory 4g and repartitioning your data to match the number of CPU cores, so that the data is evenly distributed. You need about 1M * 50 * 8 bytes ~ 400 MB to store the data (1 million rows, 50 doubles per row, 8 bytes per double). Make sure there is enough memory for caching.

-Xiangrui
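
For reference, the suggestions above could be sketched roughly as follows. This is only an illustration, not a tested recipe: it assumes the v0.9 MLlib API, where KMeans.train accepts an RDD[Array[Double]] (on 1.0 and later each row would need to be wrapped with Vectors.dense), and it needs Spark on the classpath. The names denseDoubleBytes, trainCached, path and numCores are hypothetical and do not appear in the original thread.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.storage.StorageLevel

// Rough size of the parsed data: rows * columns * 8 bytes per Double.
// This ignores JVM object overhead, so the real footprint will be larger.
def denseDoubleBytes(rows: Long, cols: Long): Long = rows * cols * 8L

// Hypothetical helper: parse, repartition to the core count, cache, then train.
// Assumes the MLlib 0.9 signature KMeans.train(RDD[Array[Double]], k, maxIterations).
def trainCached(sc: SparkContext, path: String, numCores: Int): KMeansModel = {
  val parsed = sc.textFile(path)
    .map(_.split('\t').map(_.toDouble))
    .repartition(numCores)             // spread rows evenly, one partition per core
    .persist(StorageLevel.MEMORY_ONLY) // StorageLevel must be imported for this to compile

  parsed.count()                       // force one pass so the cache is filled before training

  KMeans.train(parsed, 200, 10)        // numClusters = 200, numIterations = 10, as in the question
}
```

With 1 million rows and 50 columns, denseDoubleBytes gives 400,000,000 bytes, which is where the ~400 MB figure comes from. Whether the cached partitions actually fit depends on the memory given to the JVM, hence the --driver-memory suggestion for local mode.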

On Thu, Jul 17, 2014 at 1:48 AM, Ravishankar Rajagopalan <viora...@gmail.com> wrote:
> I am trying to use MLlib for K-Means clustering on a data set with 1 million
> rows and 50 columns (all columns have double values) which is on HDFS (raw
> txt file is 28 MB)
>
> I initially tried the following:
>
> val data3 = sc.textFile("hdfs://...inputData.txt")
> val parsedData3 = data3.map( _.split('\t').map(_.toDouble))
> val numIterations = 10
> val numClusters = 200
> val clusters = KMeans.train(parsedData3, numClusters, numIterations)
>
> This took me nearly 850 seconds.
>
> I tried using persist with MEMORY_ONLY option hoping that this would
> significantly speed up the algorithm:
>
> val data3 = sc.textFile("hdfs://...inputData.txt")
> val parsedData3 = data3.map( _.split('\t').map(_.toDouble))
> parsedData3.persist(MEMORY_ONLY)
> val numIterations = 10
> val numClusters = 200
> val clusters = KMeans.train(parsedData3, numClusters, numIterations)
>
> This resulted in only a marginal improvement and took around 720 seconds.
>
> Is there any other way to speed up the algorithm further?
>
> Thank you.
>
> Regards,
> Ravi