I am trying to use MLlib for K-Means clustering on a data set with 1
million rows and 50 columns (all columns hold double values), stored on
HDFS (the raw text file is 28 MB).
I initially tried the following:
import org.apache.spark.mllib.clustering.KMeans
val data3 = sc.textFile("hdfs://...inputData.txt")
val parsedData3 = data3.map( _.split('\t').map(_.toDouble))
val numIterations = 10
val numClusters = 200
val clusters = KMeans.train(parsedData3, numClusters, numIterations)
This took nearly 850 seconds.
I then tried persist with the MEMORY_ONLY storage level, hoping that this
would significantly speed up the algorithm:
val data3 = sc.textFile("hdfs://...inputData.txt")
val parsedData3 = data3.map( _.split('\t').map(_.toDouble))
import org.apache.spark.storage.StorageLevel
parsedData3.persist(StorageLevel.MEMORY_ONLY)
val numIterations = 10
val numClusters = 200
val clusters = KMeans.train(parsedData3, numClusters, numIterations)
This resulted in only a marginal improvement and took around 720 seconds.
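Since persist is lazy, the 720 seconds still include reading and parsing the file during the first pass over the data. Below is a rough sketch of how load time and training time could be measured separately. It assumes the spark-shell sc and the same (elided) path as above; the timer variables t0, t1, t2 and numRows are only for illustration. The KMeans.train call is the same as in my snippets; on MLlib versions where train expects RDD[Vector], each row would need to be wrapped with Vectors.dense first.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.storage.StorageLevel

val data3 = sc.textFile("hdfs://...inputData.txt")
val parsedData3 = data3
  .map(_.split('\t').map(_.toDouble))
  .persist(StorageLevel.MEMORY_ONLY)

// persist only marks the RDD; count() forces one full read and parse,
// which fills the cache before training starts
val t0 = System.nanoTime()
val numRows = parsedData3.count()
val t1 = System.nanoTime()
println(s"load+parse: ${(t1 - t0) / 1e9} s, rows: $numRows, " +
  s"partitions: ${parsedData3.partitions.length}")

// Training now runs against the cached data only
val clusters = KMeans.train(parsedData3, 200, 10)
val t2 = System.nanoTime()
println(s"training: ${(t2 - t1) / 1e9} s")
```

The partition count is printed because a small input file may be split into very few partitions, which would limit how many cores take part in each iteration.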
Is there any other way to speed up the algorithm further?
Thank you.
Regards,
Ravi