issue on applying SVM to 5 million examples.

2014-10-30 Thread peng xia
Hi,



Previously we applied the SVM algorithm in MLlib to 5 million records (600 MB), and it took more than 25 minutes to finish.
The Spark version we are using is 1.0, and we ran this program on a 4-node cluster. Each node has 4 CPU cores and 11 GB of RAM.

The 5 million records contain only two distinct rows (one positive and one negative); all the others are duplicates.

Does anyone have an idea why it takes so long on such small data?



Thanks,
Best,

Peng


Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread Jimmy
Watch the app manager; it should tell you what's running and taking a while... My guess is it's a distinct function on the data.
J

Sent from my iPhone
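
A brief aside for readers new to Spark: the "app manager" mentioned here is presumably the Spark application web UI, which the driver serves on port 4040 by default; its Stages tab shows which stage is running and how long its tasks take. As an illustration only (assuming the SparkConf from Peng's code later in the thread), the port can be pinned explicitly:

val conf = new SparkConf()
  .setAppName("SVM")
  .set("spark.ui.port", "4040")  // default port; the UI is then at http://<driver-host>:4040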



Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread Xiangrui Meng
Did you cache the data and check the load balancing? How many
features? Which API are you using: Scala, Java, or Python? -Xiangrui




Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread peng xia
Thanks for all your help.
I think I didn't cache the data. My previous cluster has expired, so I don't have a chance to check the load balancing or the app manager.
Below is my code.
There are 18 features in each record, and I am using the Scala API.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import java.util.Calendar

object BenchmarkClassification {
  def main(args: Array[String]) {
    // Load and parse the data file
    val conf = new SparkConf()
      .setAppName("SVM")
      .set("spark.executor.memory", "8g")
      // .set("spark.executor.extraJavaOptions", "-Xms8g -Xmx8g")
    val sc = new SparkContext(conf)
    val data = sc.textFile(args(0))
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x => x.toDouble)))
    }
    val testData = sc.textFile(args(1))
    val testParsedData = testData.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x => x.toDouble)))
    }

    // Run training algorithm to build the model
    val numIterations = 20
    val model = SVMWithSGD.train(parsedData, numIterations)

    // Evaluate model on training examples and compute training error
    // val labelAndPreds = testParsedData.map { point =>
    //   val prediction = model.predict(point.features)
    //   (point.label, prediction)
    // }
    // val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testParsedData.count
    // println("Training Error = " + trainErr)
    println(Calendar.getInstance().getTime())
  }
}




Thanks,
Best,
Peng




Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread Xiangrui Meng
Then caching should solve the problem. Otherwise, it is just loading
and parsing data from disk for each iteration. -Xiangrui
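
For anyone following along, here is a minimal sketch (not from the original messages) of how this suggestion could be applied to the parsedData RDD in Peng's code above: marking the training RDD as cached before SVMWithSGD.train, so that each SGD iteration reuses the in-memory partitions instead of re-reading and re-parsing the text file.

val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
}.cache()  // lazily marks the RDD for in-memory caching; it is materialized on the first pass over the data

val numIterations = 20
val model = SVMWithSGD.train(parsedData, numIterations)  // subsequent iterations read the cached partitions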




Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread peng xia
Hi Xiangrui,

Can you give me a code example of caching, as I am new to Spark?

Thanks,
Best,
Peng




Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread Jimmy
sampleRDD.cache()

Sent from my iPhone
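
Mapped onto the variable names from Peng's code, this one-liner would presumably translate to something like the sketch below. It is illustrative only; the persist and unpersist lines show options, not what was actually run in the thread.

import org.apache.spark.storage.StorageLevel

parsedData.cache()  // shorthand for persist(StorageLevel.MEMORY_ONLY)
// parsedData.persist(StorageLevel.MEMORY_AND_DISK)  // alternative if the executors run short of memory

val model = SVMWithSGD.train(parsedData, numIterations)

parsedData.unpersist()  // optionally release the cached blocks once training is done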



Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread peng xia
Thanks, Jimmy.
I will give it a try.
Thanks very much for all your help.

Best,
Peng
