Then caching should solve the problem. Otherwise, each iteration just reloads and re-parses the data from disk. -Xiangrui
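For example, here is a minimal sketch based on the parsing code you posted below; the only change is the .cache() call (you could also use .persist()), so MLlib reuses the in-memory RDD across all 20 SGD iterations instead of re-reading and re-parsing the text file each time:

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Parse once, then keep the LabeledPoints in executor memory.
    val parsedData = sc.textFile(args(0)).map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
    }.cache()

    val numIterations = 20
    val model = SVMWithSGD.train(parsedData, numIterations)

Once the job is running, the Storage tab of the web UI should show the RDD as cached, and later iterations should run noticeably faster than the first.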
On Thu, Oct 30, 2014 at 11:44 AM, peng xia <toxiap...@gmail.com> wrote:
> Thanks for all your help.
> I think I didn't cache the data. My previous cluster has expired, so I
> didn't have a chance to check the load balancing or the app manager.
> Below is my code.
> There are 18 features for each record, and I am using the Scala API.
>
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext._
> import org.apache.spark.rdd._
> import org.apache.spark.mllib.classification.SVMWithSGD
> import org.apache.spark.mllib.regression.LabeledPoint
> import org.apache.spark.mllib.linalg.Vectors
> import java.util.Calendar
>
> object BenchmarkClassification {
>   def main(args: Array[String]) {
>     // Load and parse the data file
>     val conf = new SparkConf()
>       .setAppName("SVM")
>       .set("spark.executor.memory", "8g")
>       // .set("spark.executor.extraJavaOptions", "-Xms8g -Xmx8g")
>     val sc = new SparkContext(conf)
>     val data = sc.textFile(args(0))
>     val parsedData = data.map { line =>
>       val parts = line.split(',')
>       LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x => x.toDouble)))
>     }
>     val testData = sc.textFile(args(1))
>     val testParsedData = testData.map { line =>
>       val parts = line.split(',')
>       LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x => x.toDouble)))
>     }
>
>     // Run training algorithm to build the model
>     val numIterations = 20
>     val model = SVMWithSGD.train(parsedData, numIterations)
>
>     // Evaluate the model on the test data and compute the error
>     // val labelAndPreds = testParsedData.map { point =>
>     //   val prediction = model.predict(point.features)
>     //   (point.label, prediction)
>     // }
>     // val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testParsedData.count
>     // println("Training Error = " + trainErr)
>     println(Calendar.getInstance().getTime())
>   }
> }
>
> Thanks,
> Best,
> Peng
>
> On Thu, Oct 30, 2014 at 1:23 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> Did you cache the data and check the load balancing? How many
>> features? Which API are you using: Scala, Java, or Python? -Xiangrui
>>
>> On Thu, Oct 30, 2014 at 9:13 AM, Jimmy <ji...@sellpoints.com> wrote:
>> > Watch the app manager; it should tell you what's running and taking
>> > a while.
>> > My guess is that it's a "distinct" function on the data.
>> > J
>> >
>> > Sent from my iPhone
>> >
>> > On Oct 30, 2014, at 8:22 AM, peng xia <toxiap...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > Previously we applied the SVM algorithm in MLlib to 5 million records
>> > (600 MB), and it took more than 25 minutes to finish.
>> > The Spark version we are using is 1.0, and we were running this program
>> > on a 4-node cluster. Each node has 4 CPU cores and 11 GB of RAM.
>> >
>> > The 5 million records contain only two distinct records (one positive
>> > and one negative); the others are all duplicates.
>> >
>> > Does anyone have any idea why it takes so long on such small data?
>> >
>> > Thanks,
>> > Best,
>> >
>> > Peng