Thanks Jimmy, I will have a try. Thanks very much for all your help.

Best,
Peng
On Thu, Oct 30, 2014 at 8:19 PM, Jimmy <ji...@sellpoints.com> wrote:
> sampleRDD.cache()
>
> Sent from my iPhone
>
> On Oct 30, 2014, at 5:01 PM, peng xia <toxiap...@gmail.com> wrote:
>
> Hi Xiangrui,
>
> Can you give me a code example of caching? I am new to Spark.
>
> Thanks,
> Best,
> Peng
>
> On Thu, Oct 30, 2014 at 6:57 PM, Xiangrui Meng <men...@gmail.com> wrote:
>
>> Then caching should solve the problem. Otherwise, it is just loading
>> and parsing the data from disk on each iteration. -Xiangrui
>>
>> On Thu, Oct 30, 2014 at 11:44 AM, peng xia <toxiap...@gmail.com> wrote:
>> > Thanks for all your help.
>> > I think I didn't cache the data. My previous cluster has expired, so I
>> > don't have a chance to check the load balancing or the app manager.
>> > Below is my code.
>> > There are 18 features for each record, and I am using the Scala API.
>> >
>> > import org.apache.spark.SparkConf
>> > import org.apache.spark.SparkContext
>> > import org.apache.spark.SparkContext._
>> > import org.apache.spark.rdd._
>> > import org.apache.spark.mllib.classification.SVMWithSGD
>> > import org.apache.spark.mllib.regression.LabeledPoint
>> > import org.apache.spark.mllib.linalg.Vectors
>> > import java.util.Calendar
>> >
>> > object BenchmarkClassification {
>> >   def main(args: Array[String]) {
>> >     // Load and parse the data file
>> >     val conf = new SparkConf()
>> >       .setAppName("SVM")
>> >       .set("spark.executor.memory", "8g")
>> >       // .set("spark.executor.extraJavaOptions", "-Xms8g -Xmx8g")
>> >     val sc = new SparkContext(conf)
>> >     val data = sc.textFile(args(0))
>> >     val parsedData = data.map { line =>
>> >       val parts = line.split(',')
>> >       LabeledPoint(parts(0).toDouble,
>> >         Vectors.dense(parts.tail.map(x => x.toDouble)))
>> >     }
>> >     val testData = sc.textFile(args(1))
>> >     val testParsedData = testData.map { line =>
>> >       val parts = line.split(',')
>> >       LabeledPoint(parts(0).toDouble,
>> >         Vectors.dense(parts.tail.map(x => x.toDouble)))
>> >     }
>> >
>> >     // Run the training algorithm to build the model
>> >     val numIterations = 20
>> >     val model = SVMWithSGD.train(parsedData, numIterations)
>> >
>> >     // Evaluate the model on the test examples and compute the error
>> >     // val labelAndPreds = testParsedData.map { point =>
>> >     //   val prediction = model.predict(point.features)
>> >     //   (point.label, prediction)
>> >     // }
>> >     // val trainErr = labelAndPreds.filter(r => r._1 != r._2)
>> >     //   .count.toDouble / testParsedData.count
>> >     // println("Training Error = " + trainErr)
>> >     println(Calendar.getInstance().getTime())
>> >   }
>> > }
>> >
>> > Thanks,
>> > Best,
>> > Peng
>> >
>> > On Thu, Oct 30, 2014 at 1:23 PM, Xiangrui Meng <men...@gmail.com> wrote:
>> >>
>> >> Did you cache the data and check the load balancing? How many
>> >> features? Which API are you using: Scala, Java, or Python? -Xiangrui
>> >>
>> >> On Thu, Oct 30, 2014 at 9:13 AM, Jimmy <ji...@sellpoints.com> wrote:
>> >> > Watch the app manager; it should tell you what's running and taking
>> >> > a while... My guess is that it's a "distinct" function on the data.
>> >> > J
>> >> >
>> >> > Sent from my iPhone
>> >> >
>> >> > On Oct 30, 2014, at 8:22 AM, peng xia <toxiap...@gmail.com> wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > Previously we applied the SVM algorithm in MLlib to 5 million records
>> >> > (600 MB), and it took more than 25 minutes to finish.
>> >> > The Spark version we are using is 1.0, and we were running this
>> >> > program on a 4-node cluster. Each node has 4 CPU cores and 11 GB of RAM.
>> >> >
>> >> > The 5 million records contain only two distinct records (one positive
>> >> > and one negative); the others are all duplicates.
>> >> >
>> >> > Does anyone have any idea why it takes so long on such a small dataset?
>> >> >
>> >> > Thanks,
>> >> > Best,
>> >> >
>> >> > Peng
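
[Editor's note: for readers who, like Peng, want a concrete example of the caching advice, here is a minimal sketch of Jimmy's one-liner applied to the code posted above. It assumes the same parsedData, numIterations, and imports from that program; sampleRDD in Jimmy's reply is taken to stand for the training RDD. SVMWithSGD.train makes one pass over the training data per iteration, so without caching, the text file is re-read and re-parsed 20 times.]

    // Minimal sketch (assumes the parsedData definition from the thread).
    // cache() marks the parsed LabeledPoints to be kept in executor memory
    // after the first pass; it is shorthand for persist(StorageLevel.MEMORY_ONLY).
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble,
        Vectors.dense(parts.tail.map(_.toDouble)))
    }.cache()

    // All 20 SGD iterations now reuse the in-memory RDD instead of
    // re-reading and re-parsing the input file from disk each time.
    val model = SVMWithSGD.train(parsedData, numIterations)

    // Optionally free the cached blocks once training is done.
    parsedData.unpersist()

Note that cache() is lazy: nothing is stored until the first action (here, the first training pass) materializes the RDD, so the first iteration still pays the parsing cost while the later ones do not.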