Hi Xiangrui,

Can you give me a code example of caching? I am new to Spark.
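Based on the RDD persistence section of the programming guide, my guess is that it just means calling cache() on the parsed RDD before training starts, so it is materialized once instead of being re-read and re-parsed from disk on every SGD iteration. Is a minimal sketch like the one below what you mean? (The object name is a placeholder, and the CSV layout is copied from my code further down; please correct me if I am misreading it.)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Sketch only: my BenchmarkClassification code with cache() added.
object CachingSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("SVMCached"))

    // Parse the CSV input: first column is the label, the rest are features.
    val parsedData = sc.textFile(args(0)).map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
    }.cache()  // keep the parsed RDD in memory across SGD iterations

    // Each of the 20 SGD iterations now reads the cached partitions
    // instead of re-reading and re-parsing the text file from disk.
    val model = SVMWithSGD.train(parsedData, 20)

    parsedData.unpersist()  // free the cached blocks when training is done
    sc.stop()
  }
}

Also, would the default MEMORY_ONLY level be enough here, or should I use persist(StorageLevel.MEMORY_AND_DISK) given the 11 GB of RAM per node?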
Thanks,
Best,
Peng

On Thu, Oct 30, 2014 at 6:57 PM, Xiangrui Meng <men...@gmail.com> wrote:
> Then caching should solve the problem. Otherwise, it is just loading
> and parsing data from disk for each iteration. -Xiangrui
>
> On Thu, Oct 30, 2014 at 11:44 AM, peng xia <toxiap...@gmail.com> wrote:
> > Thanks for all your help.
> > I think I didn't cache the data. My previous cluster has expired, so I
> > don't have a chance to check the load balancing or the app manager.
> > Below is my code.
> > There are 18 features for each record, and I am using the Scala API.
> >
> > import org.apache.spark.SparkConf
> > import org.apache.spark.SparkContext
> > import org.apache.spark.SparkContext._
> > import org.apache.spark.rdd._
> > import org.apache.spark.mllib.classification.SVMWithSGD
> > import org.apache.spark.mllib.regression.LabeledPoint
> > import org.apache.spark.mllib.linalg.Vectors
> > import java.util.Calendar
> >
> > object BenchmarkClassification {
> >   def main(args: Array[String]) {
> >     // Load and parse the data file
> >     val conf = new SparkConf()
> >       .setAppName("SVM")
> >       .set("spark.executor.memory", "8g")
> >       // .set("spark.executor.extraJavaOptions", "-Xms8g -Xmx8g")
> >     val sc = new SparkContext(conf)
> >     val data = sc.textFile(args(0))
> >     val parsedData = data.map { line =>
> >       val parts = line.split(',')
> >       LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x => x.toDouble)))
> >     }
> >     val testData = sc.textFile(args(1))
> >     val testParsedData = testData.map { line =>
> >       val parts = line.split(',')
> >       LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x => x.toDouble)))
> >     }
> >
> >     // Run the training algorithm to build the model
> >     val numIterations = 20
> >     val model = SVMWithSGD.train(parsedData, numIterations)
> >
> >     // Evaluate the model on the test set and compute the test error
> >     // val labelAndPreds = testParsedData.map { point =>
> >     //   val prediction = model.predict(point.features)
> >     //   (point.label, prediction)
> >     // }
> >     // val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testParsedData.count
> >     // println("Test Error = " + testErr)
> >     println(Calendar.getInstance().getTime())
> >   }
> > }
> >
> > Thanks,
> > Best,
> > Peng
> >
> > On Thu, Oct 30, 2014 at 1:23 PM, Xiangrui Meng <men...@gmail.com> wrote:
> >> Did you cache the data and check the load balancing? How many
> >> features? Which API are you using: Scala, Java, or Python? -Xiangrui
> >>
> >> On Thu, Oct 30, 2014 at 9:13 AM, Jimmy <ji...@sellpoints.com> wrote:
> >> > Watch the app manager; it should tell you what's running and taking
> >> > a while...
> >> > My guess is it's a "distinct" function on the data.
> >> > J
> >> >
> >> > Sent from my iPhone
> >> >
> >> > On Oct 30, 2014, at 8:22 AM, peng xia <toxiap...@gmail.com> wrote:
> >> >
> >> > Hi,
> >> >
> >> > Previously we applied the SVM algorithm in MLlib to 5 million records
> >> > (600 MB), and it took more than 25 minutes to finish.
> >> > The Spark version we are using is 1.0, and we were running this program
> >> > on a 4-node cluster. Each node has 4 CPU cores and 11 GB of RAM.
> >> >
> >> > The 5 million records contain only two distinct records (one positive
> >> > and one negative); the rest are all duplicates.
> >> >
> >> > Does anyone have any idea why it takes so long on such small data?
> >> >
> >> > Thanks,
> >> > Best,
> >> >
> >> > Peng