I am trying to use MLlib for K-Means clustering on a data set with 1 million rows and 50 columns (all columns hold double values), stored on HDFS (the raw text file is 28 MB).
I initially tried the following:

    val data3 = sc.textFile("hdfs://...inputData.txt")
    val parsedData3 = data3.map(_.split('\t').map(_.toDouble))
    val numIterations = 10
    val numClusters = 200
    val clusters = KMeans.train(parsedData3, numClusters, numIterations)

This took nearly 850 seconds. I then tried persisting the parsed RDD with the MEMORY_ONLY storage level, hoping this would significantly speed up the algorithm:

    import org.apache.spark.storage.StorageLevel

    val data3 = sc.textFile("hdfs://...inputData.txt")
    val parsedData3 = data3.map(_.split('\t').map(_.toDouble))
    parsedData3.persist(StorageLevel.MEMORY_ONLY)
    val numIterations = 10
    val numClusters = 200
    val clusters = KMeans.train(parsedData3, numClusters, numIterations)

This resulted in only a marginal improvement and took around 720 seconds. Is there any other way to speed up the algorithm further?

Thank you.

Regards,
Ravi
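For reference, on more recent MLlib versions (roughly Spark 1.0 and later) `KMeans.train` expects an `RDD[Vector]` built with `Vectors.dense` rather than a raw `RDD[Array[Double]]`, and it can help to materialize the cached RDD once before training so the iterations read from memory instead of re-parsing the text file. A minimal sketch under those assumptions (same elided HDFS path as above, run from a shell where `sc` is already defined):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.storage.StorageLevel

// Parse each tab-separated line into an MLlib dense vector.
val data = sc.textFile("hdfs://...inputData.txt")
val parsed = data
  .map(line => Vectors.dense(line.split('\t').map(_.toDouble)))
  .persist(StorageLevel.MEMORY_ONLY)

// Force the cache to be populated up front, so the first KMeans
// iteration is not charged for reading and parsing the input.
parsed.count()

val numClusters = 200
val numIterations = 10
val clusters = KMeans.train(parsed, numClusters, numIterations)
```

Whether this applies depends on the Spark version in use; on the older API the `Array[Double]` form above compiles as written.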