Hi Xiangrui,

Can you give me a code example of caching? I am new to Spark.
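Based on the RDD persistence section of the programming guide, my guess is that it just means calling cache() on the parsed RDD before training starts, so it is materialized once instead of being re-read and re-parsed from disk on every SGD iteration. Is a minimal sketch like the one below what you mean? (The object name is a placeholder, and the CSV layout is copied from my code further down; please correct me if I am misreading it.)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Sketch only: my BenchmarkClassification code with cache() added.
object CachingSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("SVMCached"))

    // Parse the CSV input: first column is the label, the rest are features.
    val parsedData = sc.textFile(args(0)).map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
    }.cache()  // keep the parsed RDD in memory across SGD iterations

    // Each of the 20 SGD iterations now reads the cached partitions
    // instead of re-reading and re-parsing the text file from disk.
    val model = SVMWithSGD.train(parsedData, 20)

    parsedData.unpersist()  // free the cached blocks when training is done
    sc.stop()
  }
}

Also, would the default MEMORY_ONLY level be enough here, or should I use persist(StorageLevel.MEMORY_AND_DISK) given the 11 GB of RAM per node?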
Thanks,
Best,
Peng

On Thu, Oct 30, 2014 at 6:57 PM, Xiangrui Meng <men...@gmail.com> wrote:
> Then caching should solve the problem. Otherwise, it is just loading
> and parsing data from disk for each iteration. -Xiangrui
>
> On Thu, Oct 30, 2014 at 11:44 AM, peng xia <toxiap...@gmail.com> wrote:
> > Thanks for all your help.
> > I think I didn't cache the data. My previous cluster has expired, so I
> > don't have a chance to check the load balancing or the app manager.
> > Below is my code.
> > There are 18 features for each record, and I am using the Scala API.
> >
> > import org.apache.spark.SparkConf
> > import org.apache.spark.SparkContext
> > import org.apache.spark.SparkContext._
> > import org.apache.spark.rdd._
> > import org.apache.spark.mllib.classification.SVMWithSGD
> > import org.apache.spark.mllib.regression.LabeledPoint
> > import org.apache.spark.mllib.linalg.Vectors
> > import java.util.Calendar
> >
> > object BenchmarkClassification {
> >   def main(args: Array[String]) {
> >     // Load and parse the data file
> >     val conf = new SparkConf()
> >       .setAppName("SVM")
> >       .set("spark.executor.memory", "8g")
> >       // .set("spark.executor.extraJavaOptions", "-Xms8g -Xmx8g")
> >     val sc = new SparkContext(conf)
> >     val data = sc.textFile(args(0))
> >     val parsedData = data.map { line =>
> >       val parts = line.split(',')
> >       LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x => x.toDouble)))
> >     }
> >     val testData = sc.textFile(args(1))
> >     val testParsedData = testData.map { line =>
> >       val parts = line.split(',')
> >       LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x => x.toDouble)))
> >     }
> >
> >     // Run the training algorithm to build the model
> >     val numIterations = 20
> >     val model = SVMWithSGD.train(parsedData, numIterations)
> >
> >     // Evaluate the model on the test set and compute the test error
> >     // val labelAndPreds = testParsedData.map { point =>
> >     //   val prediction = model.predict(point.features)
> >     //   (point.label, prediction)
> >     // }
> >     // val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testParsedData.count
> >     // println("Test Error = " + testErr)
> >     println(Calendar.getInstance().getTime())
> >   }
> > }
> >
> > Thanks,
> > Best,
> > Peng
> >
> > On Thu, Oct 30, 2014 at 1:23 PM, Xiangrui Meng <men...@gmail.com> wrote:
> >> Did you cache the data and check the load balancing? How many
> >> features? Which API are you using: Scala, Java, or Python? -Xiangrui
> >>
> >> On Thu, Oct 30, 2014 at 9:13 AM, Jimmy <ji...@sellpoints.com> wrote:
> >> > Watch the app manager; it should tell you what's running and taking
> >> > a while...
> >> > My guess is it's a "distinct" function on the data.
> >> > J
> >> >
> >> > Sent from my iPhone
> >> >
> >> > On Oct 30, 2014, at 8:22 AM, peng xia <toxiap...@gmail.com> wrote:
> >> >
> >> > Hi,
> >> >
> >> > Previously we applied the SVM algorithm in MLlib to 5 million records
> >> > (600 MB), and it took more than 25 minutes to finish.
> >> > The Spark version we are using is 1.0, and we were running this program
> >> > on a 4-node cluster. Each node has 4 CPU cores and 11 GB of RAM.
> >> >
> >> > The 5 million records contain only two distinct records (one positive
> >> > and one negative); the rest are all duplicates.
> >> >
> >> > Does anyone have any idea why it takes so long on such small data?
> >> >
> >> > Thanks,
> >> > Best,
> >> >
> >> > Peng