Re: Speeding up K-Means Clustering

2014-07-17 Thread Xiangrui Meng
Is it v0.9? Did you run in local mode? Try setting --driver-memory 4g
and repartitioning your data to match the number of CPU cores so that
the data is evenly distributed. You need 1m * 50 * 8 ~ 400MB to store
the data. Make sure there is enough memory for caching. -Xiangrui

On Thu, Jul 17, 2014 at 1:48 AM, Ravishankar Rajagopalan
viora...@gmail.com wrote:
 I am trying to use MLlib for K-Means clustering on a data set with 1 million
 rows and 50 columns (all columns have double values), which is on HDFS (the
 raw txt file is 28 MB).

 I initially tried the following:

 import org.apache.spark.mllib.clustering.KMeans

 val data3 = sc.textFile("hdfs://...inputData.txt")
 val parsedData3 = data3.map(_.split('\t').map(_.toDouble))
 val numIterations = 10
 val numClusters = 200
 val clusters = KMeans.train(parsedData3, numClusters, numIterations)

 This took me nearly 850 seconds.

 I tried using persist with the MEMORY_ONLY option, hoping that this would
 significantly speed up the algorithm:

 import org.apache.spark.mllib.clustering.KMeans
 import org.apache.spark.storage.StorageLevel._

 val data3 = sc.textFile("hdfs://...inputData.txt")
 val parsedData3 = data3.map(_.split('\t').map(_.toDouble))
 parsedData3.persist(MEMORY_ONLY)
 val numIterations = 10
 val numClusters = 200
 val clusters = KMeans.train(parsedData3, numClusters, numIterations)

 This resulted in only a marginal improvement and took around 720 seconds.

 Is there any other way to speed up the algorithm further?

 Thank you.

 Regards,
 Ravi


Re: Speeding up K-Means Clustering

2014-07-17 Thread Ravishankar Rajagopalan
Hi Xiangrui,

Yes, I am using Spark v0.9, and I am not running it in local mode.

I set the memory using export SPARK_MEM=4G before starting the Spark
instance.

Also, I was previously starting it with -c 1 but changed it to -c 12, since
it is a 12-core machine. That brought the time down from over 700 seconds to
less than 200 seconds.
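
For reference, the launch looks roughly like this (assuming the standard
Spark 0.9 layout, with spark-shell under bin/):

export SPARK_MEM=4G        # pre-1.0 knob for sizing the JVM heap
./bin/spark-shell -c 12    # -c sets the number of cores the shell may use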

I am not sure how to repartition the data to match the CPU cores. How do I
do it?

Thank you.

Ravi


On Thu, Jul 17, 2014 at 3:17 PM, Xiangrui Meng men...@gmail.com wrote:

 Is it v0.9? Did you run in local mode? Try setting --driver-memory 4g
 and repartitioning your data to match the number of CPU cores so that
 the data is evenly distributed. You need 1m * 50 * 8 ~ 400MB to store
 the data. Make sure there is enough memory for caching. -Xiangrui

 On Thu, Jul 17, 2014 at 1:48 AM, Ravishankar Rajagopalan
 viora...@gmail.com wrote:
  I am trying to use MLlib for K-Means clustering on a data set with 1 million
  rows and 50 columns (all columns have double values), which is on HDFS (the
  raw txt file is 28 MB).
 
  I initially tried the following:
 
  import org.apache.spark.mllib.clustering.KMeans

  val data3 = sc.textFile("hdfs://...inputData.txt")
  val parsedData3 = data3.map(_.split('\t').map(_.toDouble))
  val numIterations = 10
  val numClusters = 200
  val clusters = KMeans.train(parsedData3, numClusters, numIterations)
 
  This took me nearly 850 seconds.
 
  I tried using persist with the MEMORY_ONLY option, hoping that this would
  significantly speed up the algorithm:
 
  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.storage.StorageLevel._

  val data3 = sc.textFile("hdfs://...inputData.txt")
  val parsedData3 = data3.map(_.split('\t').map(_.toDouble))
  parsedData3.persist(MEMORY_ONLY)
  val numIterations = 10
  val numClusters = 200
  val clusters = KMeans.train(parsedData3, numClusters, numIterations)
 
  This resulted in only a marginal improvement and took around 720 seconds.
 
  Is there any other way to speed up the algorithm further?
 
  Thank you.
 
  Regards,
  Ravi



Re: Speeding up K-Means Clustering

2014-07-17 Thread Xiangrui Meng
Please try

val parsedData3 =
  data3.repartition(12).map(_.split("\t").map(_.toDouble)).cache()

and check the storage and driver/executor memory in the WebUI. Make
sure the data is fully cached.
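
For example, to make sure the cache actually fills before you time the
training run, follow the parse with an action, e.g.:

import org.apache.spark.mllib.clustering.KMeans

parsedData3.count()  // first action materializes the cached partitions
val clusters = KMeans.train(parsedData3, 200, 10)  // k = 200, 10 iterations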

-Xiangrui


On Thu, Jul 17, 2014 at 5:09 AM, Ravishankar Rajagopalan
viora...@gmail.com wrote:
 Hi Xiangrui,

 Yes, I am using Spark v0.9, and I am not running it in local mode.

 I set the memory using export SPARK_MEM=4G before starting the Spark
 instance.

 Also, I was previously starting it with -c 1 but changed it to -c 12, since
 it is a 12-core machine. That brought the time down from over 700 seconds to
 less than 200 seconds.

 I am not sure how to repartition the data to match the CPU cores. How do I
 do it?

 Thank you.

 Ravi


 On Thu, Jul 17, 2014 at 3:17 PM, Xiangrui Meng men...@gmail.com wrote:

 Is it v0.9? Did you run in local mode? Try setting --driver-memory 4g
 and repartitioning your data to match the number of CPU cores so that
 the data is evenly distributed. You need 1m * 50 * 8 ~ 400MB to store
 the data. Make sure there is enough memory for caching. -Xiangrui

 On Thu, Jul 17, 2014 at 1:48 AM, Ravishankar Rajagopalan
 viora...@gmail.com wrote:
  I am trying to use MLlib for K-Means clustering on a data set with 1 million
  rows and 50 columns (all columns have double values), which is on HDFS (the
  raw txt file is 28 MB).
 
  I initially tried the following:
 
  import org.apache.spark.mllib.clustering.KMeans

  val data3 = sc.textFile("hdfs://...inputData.txt")
  val parsedData3 = data3.map(_.split('\t').map(_.toDouble))
  val numIterations = 10
  val numClusters = 200
  val clusters = KMeans.train(parsedData3, numClusters, numIterations)
 
  This took me nearly 850 seconds.
 
  I tried using persist with the MEMORY_ONLY option, hoping that this would
  significantly speed up the algorithm:
 
  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.storage.StorageLevel._

  val data3 = sc.textFile("hdfs://...inputData.txt")
  val parsedData3 = data3.map(_.split('\t').map(_.toDouble))
  parsedData3.persist(MEMORY_ONLY)
  val numIterations = 10
  val numClusters = 200
  val clusters = KMeans.train(parsedData3, numClusters, numIterations)
 
  This resulted in only a marginal improvement and took around 720 seconds.
 
  Is there any other way to speed up the algorithm further?
 
  Thank you.
 
  Regards,
  Ravi