Please try
val parsedData3 =
  data3.repartition(12).map(_.split("\t").map(_.toDouble)).cache()
and check the Storage tab and the driver/executor memory in the web UI.
Make sure the data is fully cached.
-Xiangrui
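The suggestion above can be sketched end to end. This is a minimal sketch against the Spark/MLlib v0.9 API discussed in the thread; the master URL, app name, and the values of k and maxIterations are placeholder assumptions, not from the thread, and it needs a running Spark cluster to execute.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans

// Placeholder master URL and app name (assumptions, not from the thread).
val sc = new SparkContext("spark://master:7077", "kmeans-example")

val data3 = sc.textFile("hdfs://...inputData.txt")

// Repartition to match the 12 cores, parse each tab-separated row into an
// Array[Double] (the input type KMeans expects in v0.9), and cache it.
val parsedData3 =
  data3.repartition(12).map(_.split("\t").map(_.toDouble)).cache()

// Force materialization so the Storage tab in the web UI shows the cached
// fraction before training starts.
parsedData3.count()

// Placeholder parameters: k = 10 clusters, 20 iterations.
val model = KMeans.train(parsedData3, 10, 20)
```

Forcing an action such as `count()` before training makes it easy to confirm in the Storage tab that the RDD is 100% cached, which is what the advice above asks you to check.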
On Thu, Jul 17, 2014 at 5:09 AM, Ravishankar Rajagopalan wrote:
Hi Xiangrui,
Yes I am using Spark v0.9 and am not running it in local mode.
I set the memory using "export SPARK_MEM=4G" before starting the Spark
instance.
Also, I was previously starting it with -c 1 but changed it to -c 12 since
it is a 12-core machine. That did bring down the time taken.
Is it v0.9? Did you run in local mode? Try to set --driver-memory 4g
and repartition your data to match the number of CPU cores so that the
data is evenly distributed. You need 1m * 50 * 8 ~ 400MB to store
the data. Make sure there is enough memory for caching. -Xiangrui
On Thu, Jul 17, 2014 at 1
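The back-of-envelope estimate above (1 million rows, 50 doubles per row, 8 bytes per double) works out as follows; a minimal arithmetic check:

```scala
// Raw payload size of the parsed data set:
// 1,000,000 rows x 50 columns x 8 bytes per Double.
val rows = 1000000L
val cols = 50L
val bytesPerDouble = 8L

val totalBytes = rows * cols * bytesPerDouble  // 400,000,000 bytes
val totalMB = totalBytes / (1000L * 1000L)     // 400 MB
```

Note this counts only the raw double payload; rows cached as JVM `Array[Double]` objects carry extra per-object overhead, so the actual cached size in the Storage tab will be somewhat larger.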
I am trying to use MLlib for K-Means clustering on a data set with 1
million rows and 50 columns (all columns have double values), which is on
HDFS (the raw txt file is 28 MB).
I initially tried the following:
val data3 = sc.textFile("hdfs://...inputData.txt")
val parsedData3 = data3.map(_.split('\t').map(_.toDouble))