Thanks for all your help. I think I didn't cache the data. My previous cluster has expired, so I don't have a chance to check the load balancing or the app manager. Below is my code. There are 18 features for each record, and I am using the Scala API.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import java.util.Calendar

object BenchmarkClassification {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("SVM")
      .set("spark.executor.memory", "8g")
      // .set("spark.executor.extraJavaOptions", "-Xms8g -Xmx8g")
    val sc = new SparkContext(conf)

    // Load and parse the training data file
    val data = sc.textFile(args(0))
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x => x.toDouble)))
    }

    // Load and parse the test data file
    val testData = sc.textFile(args(1))
    val testParsedData = testData.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x => x.toDouble)))
    }

    // Run the training algorithm to build the model
    val numIterations = 20
    val model = SVMWithSGD.train(parsedData, numIterations)

    // Evaluate the model on the test examples and compute the test error
    // val labelAndPreds = testParsedData.map { point =>
    //   val prediction = model.predict(point.features)
    //   (point.label, prediction)
    // }
    // val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testParsedData.count
    // println("Training Error = " + trainErr)

    println(Calendar.getInstance().getTime())
  }
}

Thanks,
Best,
Peng

On Thu, Oct 30, 2014 at 1:23 PM, Xiangrui Meng <men...@gmail.com> wrote:

> Did you cache the data and check the load balancing? How many
> features? Which API are you using: Scala, Java, or Python? -Xiangrui
>
> On Thu, Oct 30, 2014 at 9:13 AM, Jimmy <ji...@sellpoints.com> wrote:
> > Watch the app manager; it should tell you what's running and taking
> > a while... My guess is that it's a "distinct" function on the data.
> > J
> >
> > Sent from my iPhone
> >
> > On Oct 30, 2014, at 8:22 AM, peng xia <toxiap...@gmail.com> wrote:
> >
> > Hi,
> >
> > Previously we applied the SVM algorithm in MLlib to 5 million records
> > (600 MB), and it took more than 25 minutes to finish. The Spark version
> > we are using is 1.0, and we were running this program on a 4-node
> > cluster. Each node has 4 CPU cores and 11 GB of RAM.
> >
> > The 5 million records contain only two distinct records (one positive
> > and one negative); the rest are all duplicates.
> >
> > Does anyone have any idea why it takes so long on this small data set?
> >
> > Thanks,
> > Best,
> >
> > Peng
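P.S. A minimal sketch of what caching the training data before SVMWithSGD.train might look like. This is not code I have run: parsedData and numIterations are from the program above, and the partition count of 16 is only a guess based on the 4-node x 4-core cluster.

import org.apache.spark.storage.StorageLevel

// Keep the parsed RDD in memory so each of the 20 SGD passes reads cached
// vectors instead of re-reading and re-parsing the 600 MB text file.
// repartition(16) is an assumption: one partition per core on this cluster,
// to spread the load across the executors.
val cachedData = parsedData
  .repartition(16)
  .persist(StorageLevel.MEMORY_ONLY)
val model = SVMWithSGD.train(cachedData, numIterations)

Without the cache, each iteration triggers a re-read and re-parse of the input file, which may account for much of the 25 minutes.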