and repartition your data to match the number of CPU cores so that the
> data is evenly distributed. You need 1M rows * 50 columns * 8 bytes ~
> 400 MB to store the data. Make sure there is enough memory for caching.
> -Xiangrui
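
For reference, a minimal sketch of what the advice above might look like in code. It assumes the Spark 1.x MLlib API and tab-separated input. The object name, the estimateBytes helper, the command-line arguments, maxIterations = 20, and the use of sc.defaultParallelism as a stand-in for the number of cores are all illustrative assumptions, not something taken from this thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansRepartitionSketch {
  // Rough in-memory size of dense double data: rows * cols * 8 bytes per double.
  // (Hypothetical helper; ignores JVM object overhead.)
  def estimateBytes(rows: Long, cols: Int): Long = rows * cols * 8L

  def main(args: Array[String]): Unit = {
    // Assumption: args(0) is the HDFS input path and args(1) is k.
    val inputPath = args(0)
    val k = args(1).toInt
    val maxIterations = 20 // illustrative value, not from the thread

    val sc = new SparkContext(new SparkConf().setAppName("KMeansRepartitionSketch"))

    // 1M rows * 50 columns * 8 bytes, reported in MB.
    println(s"Estimated data size: ${estimateBytes(1000000L, 50) / 1e6} MB")

    // Assumption: defaultParallelism approximates the total number of cores.
    val numPartitions = sc.defaultParallelism

    val parsed = sc.textFile(inputPath)
      .map(line => Vectors.dense(line.split('\t').map(_.toDouble)))
      .repartition(numPartitions) // spread rows evenly across partitions
      .cache()                    // keep parsed vectors in memory across iterations

    val model = KMeans.train(parsed, k, maxIterations)
    model.clusterCenters.foreach(println)

    sc.stop()
  }
}
```

Caching matters here because K-Means makes repeated passes over the same RDD. Without it, each iteration would re-read and re-parse the text file.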
>
> On Thu, Jul 17, 2014 at 1:48 AM, Ravishankar Rajagopalan
> wrote:
>
I am trying to use MLlib for K-Means clustering on a data set with 1
million rows and 50 columns (all columns have double values), which is
stored on HDFS (the raw text file is 28 MB).
I initially tried the following:
val data3 = sc.textFile("hdfs://...inputData.txt")
val parsedData3 = data3.map( _.split('\t')