issue on applying SVM to 5 million examples.
Hi,

Previously, we applied the SVM algorithm in MLlib to 5 million records (600 MB), and it takes more than 25 minutes to finish. The Spark version we are using is 1.0, and we were running this program on a 4-node cluster. Each node has 4 CPU cores and 11 GB of RAM. The 5 million records contain only two distinct records (one positive and one negative); the rest are all duplicates. Does anyone have any idea why it takes so long on such small data?

Thanks,
Best,
Peng
Re: issue on applying SVM to 5 million examples.
Watch the app manager; it should tell you what's running and what's taking a while... My guess is that it's a distinct function on the data.

J
Re: issue on applying SVM to 5 million examples.
Did you cache the data and check the load balancing? How many features? Which API are you using: Scala, Java, or Python?

-Xiangrui
Re: issue on applying SVM to 5 million examples.
Thanks for all your help. I think I didn't cache the data. My previous cluster has expired, so I don't have a chance to check the load balancing or the app manager. Below is my code. There are 18 features for each record, and I am using the Scala API.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import java.util.Calendar

object BenchmarkClassification {
  def main(args: Array[String]) {
    // Load and parse the data file
    val conf = new SparkConf()
      .setAppName("SVM")
      .set("spark.executor.memory", "8g")
      // .set("spark.executor.extraJavaOptions", "-Xms8g -Xmx8g")
    val sc = new SparkContext(conf)
    val data = sc.textFile(args(0))
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x => x.toDouble)))
    }
    val testData = sc.textFile(args(1))
    val testParsedData = testData.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x => x.toDouble)))
    }

    // Run training algorithm to build the model
    val numIterations = 20
    val model = SVMWithSGD.train(parsedData, numIterations)

    // Evaluate model on training examples and compute training error
    // val labelAndPreds = testParsedData.map { point =>
    //   val prediction = model.predict(point.features)
    //   (point.label, prediction)
    // }
    // val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testParsedData.count
    // println("Training Error = " + trainErr)
    println(Calendar.getInstance().getTime())
  }
}

Thanks,
Best,
Peng
Re: issue on applying SVM to 5 million examples.
Then caching should solve the problem. Otherwise, it is just loading and parsing data from disk for each iteration.

-Xiangrui
Re: issue on applying SVM to 5 million examples.
Hi Xiangrui,

Can you give me a code example of caching? I am new to Spark.

Thanks,
Best,
Peng
Re: issue on applying SVM to 5 million examples.
sampleRDD.cache()
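
For reference, sampleRDD above is a placeholder name rather than a variable from the posted program. A minimal sketch of where the call fits in Peng's code, assuming the training RDD is parsedData:

    // cache() marks the RDD to be kept in executor memory after it is first
    // computed, so each of the 20 SGD iterations reuses the parsed
    // LabeledPoints instead of re-reading and re-parsing the text file.
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x => x.toDouble)))
    }.cache()

    val model = SVMWithSGD.train(parsedData, numIterations)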
Re: issue on applying SVM to 5 million examples.
Thanks, Jimmy. I will give it a try. Thanks very much for all your help.

Best,
Peng
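
One usage note on the suggestion above: cache() is lazy, so the RDD is only materialized the first time an action runs over it. A short sketch, assuming one wants to pay the parsing cost up front before timing the training, and to free the memory afterwards (the count() and unpersist() calls are optional additions, not part of the code in this thread):

    parsedData.cache()
    parsedData.count()  // action: forces the RDD to be computed and cached now

    val model = SVMWithSGD.train(parsedData, numIterations)

    parsedData.unpersist()  // release the cached partitions once training is done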