Re: Speeding up K-Means Clustering

2014-07-17 Thread Ravishankar Rajagopalan

> and repartition your data to match the number of CPU cores such that the
> data is evenly distributed. You need about 1M * 50 * 8 bytes ~ 400 MB to
> store the data. Make sure there is enough memory for caching. -Xiangrui
>
> On Thu, Jul 17, 2014 at 1:48 AM, Ravishankar Rajagopalan wrote:
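A minimal sketch of how the advice above might be applied, not the code from the thread. It assumes tab-separated input lines of doubles and uses `sc.defaultParallelism` as a stand-in for the total number of CPU cores. The object name `KMeansCachingSketch`, the helper `denseBytes`, and the three command-line arguments are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansCachingSketch {
  // Rough size of the dense data: rows * columns * 8 bytes per double.
  // This ignores JVM object overhead, so the real cached size is larger.
  def denseBytes(rows: Long, cols: Long): Long = rows * cols * 8L

  def main(args: Array[String]): Unit = {
    // Hypothetical arguments: input path, number of clusters, max iterations.
    val Array(inputPath, k, maxIterations) = args

    // The master URL is expected to be supplied by spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("KMeansCachingSketch"))

    // 1,000,000 rows * 50 columns * 8 bytes = 400,000,000 bytes (~400 MB).
    println(s"Estimated raw data size: ${denseBytes(1000000L, 50L)} bytes")

    val points = sc.textFile(inputPath)
      // Assumption: each line holds tab-separated double values.
      .map(line => Vectors.dense(line.split('\t').map(_.toDouble)))
      // Assumption: defaultParallelism approximates the total core count.
      .repartition(sc.defaultParallelism)
      // Keep the parsed vectors in memory across K-Means iterations.
      .cache()

    val model = KMeans.train(points, k.toInt, maxIterations.toInt)
    println(s"Found ${model.clusterCenters.length} cluster centers")

    sc.stop()
  }
}
```

Caching matters here because K-Means scans the data once per iteration. Without `cache()`, each pass would re-read and re-parse the text file.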

Speeding up K-Means Clustering

2014-07-17 Thread Ravishankar Rajagopalan

I am trying to use MLlib for K-Means clustering on a data set with 1 million rows and 50 columns (all columns have double values), which is stored on HDFS (the raw text file is 28 MB). I initially tried the following:

val data3 = sc.textFile("hdfs://...inputData.txt")
val parsedData3 = data3.map( _.split('\t')