Re: Speeding up K-Means Clustering
Is it v0.9? Did you run in local mode? Try setting --driver-memory 4g and repartitioning your data to match the number of CPU cores, so that the data is evenly distributed. You need about 1m * 50 * 8 bytes ~ 400 MB to store the data. Make sure there is enough memory for caching.

-Xiangrui

On Thu, Jul 17, 2014 at 1:48 AM, Ravishankar Rajagopalan viora...@gmail.com wrote:
> I am trying to use MLlib for K-Means clustering on a data set with 1 million rows and 50 columns (all columns have double values), which is on HDFS (the raw txt file is 28 MB).
>
> I initially tried the following:
>
>     val data3 = sc.textFile("hdfs://...inputData.txt")
>     val parsedData3 = data3.map(_.split('\t').map(_.toDouble))
>     val numIterations = 10
>     val numClusters = 200
>     val clusters = KMeans.train(parsedData3, numClusters, numIterations)
>
> This took nearly 850 seconds.
>
> I then tried persist with the MEMORY_ONLY option, hoping that this would significantly speed up the algorithm:
>
>     val data3 = sc.textFile("hdfs://...inputData.txt")
>     val parsedData3 = data3.map(_.split('\t').map(_.toDouble))
>     parsedData3.persist(MEMORY_ONLY)
>     val numIterations = 10
>     val numClusters = 200
>     val clusters = KMeans.train(parsedData3, numClusters, numIterations)
>
> This resulted in only a marginal improvement and took around 720 seconds.
>
> Is there any other way to speed up the algorithm further?
>
> Thank you.
>
> Regards,
> Ravi
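As a quick check of the 1m * 50 * 8 estimate in the reply above, here is a minimal sketch of the arithmetic. The object and method names (`MemoryEstimate`, `denseBytes`) are made up for illustration, and the figure counts only the raw 8-byte double values; JVM object and array overhead come on top, so it is best read as a lower bound on what caching needs.

```scala
object MemoryEstimate {
  // Bytes needed for the raw double values alone (8 bytes each).
  // Hypothetical helper: it ignores JVM object/array overhead, so the
  // real cached size will be somewhat larger.
  def denseBytes(rows: Long, cols: Int): Long = rows * cols * 8L

  def main(args: Array[String]): Unit = {
    val bytes = denseBytes(1000000L, 50) // 1 million rows, 50 columns
    println(f"${bytes / 1e6}%.0f MB")    // prints "400 MB"
  }
}
```

If the memory available for caching is below this figure, some partitions will not be cached and will be recomputed on every K-Means iteration.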
Re: Speeding up K-Means Clustering
Hi Xiangrui,

Yes, I am using Spark v0.9 and am not running it in local mode.

I set the memory using export SPARK_MEM=4G before starting the Spark instance.

Also, I was previously starting it with -c 1 but changed it to -c 12, since it is a 12-core machine. That brought the time down from over 700 seconds to less than 200 seconds.

I am not sure how to repartition the data to match the CPU cores. How do I do it?

Thank you.

Ravi

On Thu, Jul 17, 2014 at 3:17 PM, Xiangrui Meng men...@gmail.com wrote:
> Is it v0.9? Did you run in local mode? Try setting --driver-memory 4g and repartitioning your data to match the number of CPU cores, so that the data is evenly distributed. You need about 1m * 50 * 8 bytes ~ 400 MB to store the data. Make sure there is enough memory for caching.
>
> -Xiangrui
Re: Speeding up K-Means Clustering
Please try

    val parsedData3 = data3.repartition(12).map(_.split("\t").map(_.toDouble)).cache()

and check the storage and driver/executor memory in the WebUI. Make sure the data is fully cached.

-Xiangrui

On Thu, Jul 17, 2014 at 5:09 AM, Ravishankar Rajagopalan viora...@gmail.com wrote:
> Hi Xiangrui,
>
> Yes, I am using Spark v0.9 and am not running it in local mode.
>
> I set the memory using export SPARK_MEM=4G before starting the Spark instance.
>
> Also, I was previously starting it with -c 1 but changed it to -c 12, since it is a 12-core machine. That brought the time down from over 700 seconds to less than 200 seconds.
>
> I am not sure how to repartition the data to match the CPU cores. How do I do it?
>
> Thank you.
>
> Ravi
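The repartition-and-cache suggestion in this reply can be sketched as a small helper. This is only an illustration under stated assumptions: `CachedInput`, `parseLine`, and `load` are hypothetical names, the input is assumed to be tab-separated doubles as in the original post, and the `count()` call is one possible way to force the cache to fill before training (`cache()` by itself is lazy).

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object CachedInput {
  // Parse one tab-separated line into an array of doubles.
  def parseLine(line: String): Array[Double] =
    line.split("\t").map(_.toDouble)

  // Read the file, spread it over `numPartitions` partitions, parse, and cache.
  // numPartitions is assumed to match the number of available cores (12 in this thread).
  def load(sc: SparkContext, path: String, numPartitions: Int): RDD[Array[Double]] = {
    val parsed = sc.textFile(path)
      .repartition(numPartitions)
      .map(parseLine)
      .cache()
    // cache() only marks the RDD; running an action makes Spark
    // actually compute and store the partitions.
    parsed.count()
    parsed
  }
}
```

With an existing SparkContext `sc`, `CachedInput.load(sc, "hdfs://...inputData.txt", 12)` would take the place of the `data3` / `parsedData3` lines from the original post, and the Storage tab of the WebUI should then show how much of the RDD is cached. On Spark 1.0 and later, `KMeans.train` expects an `RDD[Vector]`, so each array would need wrapping with `Vectors.dense` first.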