Thanks for all your help.
I think I didn't cache the data. My previous cluster has expired, so I
don't have a chance to check the load balancing or the app manager.
Below is my code.
There are 18 features in each record, and I am using the Scala API.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import java.util.Calendar

object BenchmarkClassification {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("SVM")
      .set("spark.executor.memory", "8g")
      // .set("spark.executor.extraJavaOptions", "-Xms8g -Xmx8g")
    val sc = new SparkContext(conf)

    // Load and parse the training data file
    val data = sc.textFile(args(0))
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
    }

    // Load and parse the test data file
    val testData = sc.textFile(args(1))
    val testParsedData = testData.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
    }

    // Run the training algorithm to build the model
    val numIterations = 20
    val model = SVMWithSGD.train(parsedData, numIterations)

    // Evaluate the model on the test set and compute the test error
    // val labelAndPreds = testParsedData.map { point =>
    //   val prediction = model.predict(point.features)
    //   (point.label, prediction)
    // }
    // val testErr =
    //   labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testParsedData.count
    // println("Test Error = " + testErr)

    println(Calendar.getInstance().getTime())
  }
}
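
Also, since SVMWithSGD runs 20 iterations over parsedData, I guess that
without cache() Spark re-reads and re-parses the input file on every
iteration, which may explain much of the 25 minutes. Below is a rough
sketch of what I plan to try on the next cluster; the repartition count
of 16 is only my guess for 4 nodes x 4 cores, and the per-partition
count is just a crude load-balance check.

// Same parsing as above, but repartitioned and cached so the 20 SGD
// iterations reuse the in-memory points instead of re-reading the file.
val parsedData = sc.textFile(args(0)).map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
}.repartition(16).cache() // 16 partitions is my guess for 4 nodes x 4 cores
parsedData.count() // force materialization before training starts

// Crude load-balance check: print the record count per partition
parsedData
  .mapPartitionsWithIndex((i, it) => Iterator((i, it.size)))
  .collect()
  .foreach { case (i, n) => println("partition " + i + ": " + n + " records") }

SVMWithSGD.train(parsedData, numIterations) would then iterate over the
cached copy. Does that look right?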

Thanks,
Best,
Peng

On Thu, Oct 30, 2014 at 1:23 PM, Xiangrui Meng <men...@gmail.com> wrote:

> Did you cache the data and check the load balancing? How many
> features? Which API are you using, Scala, Java, or Python? -Xiangrui
>
> On Thu, Oct 30, 2014 at 9:13 AM, Jimmy <ji...@sellpoints.com> wrote:
> > Watch the app manager; it should tell you what's running and taking a
> > while... My guess is it's a "distinct" function on the data.
> > J
> >
> > Sent from my iPhone
> >
> > On Oct 30, 2014, at 8:22 AM, peng xia <toxiap...@gmail.com> wrote:
> >
> > Hi,
> >
> >
> >
> > Previously, we applied the SVM algorithm in MLlib to 5 million records
> > (600 MB), and it took more than 25 minutes to finish.
> > The Spark version we are using is 1.0, and we were running this program
> > on a 4-node cluster. Each node has 4 CPU cores and 11 GB of RAM.
> >
> > The 5 million records contain only two distinct records (one positive
> > and one negative); the rest are all duplicates.
> >
> > Does anyone have an idea why it takes so long on such small data?
> >
> >
> >
> > Thanks,
> > Best,
> >
> > Peng
>
