I am trying to use MLlib for K-Means clustering on a data set with 1
million rows and 50 columns (all columns hold double values), stored on
HDFS (the raw text file is 28 MB).

I initially tried the following:

import org.apache.spark.mllib.clustering.KMeans

val data3 = sc.textFile("hdfs://...inputData.txt")
// parse each tab-separated line into an Array[Double]
val parsedData3 = data3.map(_.split('\t').map(_.toDouble))
val numIterations = 10
val numClusters = 200
val clusters = KMeans.train(parsedData3, numClusters, numIterations)

This took me nearly 850 seconds.

I then tried persisting the parsed RDD with the MEMORY_ONLY storage
level, hoping this would significantly speed up the algorithm:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.storage.StorageLevel

val data3 = sc.textFile("hdfs://...inputData.txt")
val parsedData3 = data3.map(_.split('\t').map(_.toDouble))
parsedData3.persist(StorageLevel.MEMORY_ONLY)
val numIterations = 10
val numClusters = 200
val clusters = KMeans.train(parsedData3, numClusters, numIterations)

This resulted in only a marginal improvement and took around 720 seconds.
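
For reference, here is the same pipeline written against the more recent
RDD-based MLlib API, which expects an RDD[Vector] rather than an
RDD[Array[Double]], with the parsed RDD cached before training so the 10
iterations do not re-read and re-parse the file from HDFS. This is only a
sketch; Vectors.dense and cache() are the standard API calls I am assuming
here:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// parse each tab-separated line into a dense Vector and cache the
// parsed RDD so the iterative training reuses it from memory
val data3 = sc.textFile("hdfs://...inputData.txt")
val parsedData3 = data3
  .map(line => Vectors.dense(line.split('\t').map(_.toDouble)))
  .cache()

val numIterations = 10
val numClusters = 200
val clusters = KMeans.train(parsedData3, numClusters, numIterations)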

Is there any other way to speed up the algorithm further?

Thank you.

Regards,
Ravi
