Then caching should solve the problem. Otherwise, each iteration just reloads and re-parses the data from disk. -Xiangrui
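For example, here is a minimal sketch based on the parsing code you posted below; the only change is the .cache() call (you could also use .persist()), so MLlib reuses the in-memory RDD across all 20 SGD iterations instead of re-reading and re-parsing the text file each time:

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Parse once, then keep the LabeledPoints in executor memory.
    val parsedData = sc.textFile(args(0)).map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
    }.cache()

    val numIterations = 20
    val model = SVMWithSGD.train(parsedData, numIterations)

Once the job is running, the Storage tab of the web UI should show the RDD as cached, and later iterations should run noticeably faster than the first.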
On Thu, Oct 30, 2014 at 11:44 AM, peng xia <toxiap...@gmail.com> wrote:
> Thanks for all your help.
> I think I didn't cache the data. My previous cluster has expired, so I
> didn't have a chance to check the load balancing or the app manager.
> Below is my code.
> There are 18 features for each record, and I am using the Scala API.
>
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext._
> import org.apache.spark.rdd._
> import org.apache.spark.mllib.classification.SVMWithSGD
> import org.apache.spark.mllib.regression.LabeledPoint
> import org.apache.spark.mllib.linalg.Vectors
> import java.util.Calendar
>
> object BenchmarkClassification {
>   def main(args: Array[String]) {
>     // Load and parse the data file
>     val conf = new SparkConf()
>       .setAppName("SVM")
>       .set("spark.executor.memory", "8g")
>       // .set("spark.executor.extraJavaOptions", "-Xms8g -Xmx8g")
>     val sc = new SparkContext(conf)
>     val data = sc.textFile(args(0))
>     val parsedData = data.map { line =>
>       val parts = line.split(',')
>       LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x => x.toDouble)))
>     }
>     val testData = sc.textFile(args(1))
>     val testParsedData = testData.map { line =>
>       val parts = line.split(',')
>       LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x => x.toDouble)))
>     }
>
>     // Run training algorithm to build the model
>     val numIterations = 20
>     val model = SVMWithSGD.train(parsedData, numIterations)
>
>     // Evaluate the model on the test data and compute the error
>     // val labelAndPreds = testParsedData.map { point =>
>     //   val prediction = model.predict(point.features)
>     //   (point.label, prediction)
>     // }
>     // val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testParsedData.count
>     // println("Training Error = " + trainErr)
>     println(Calendar.getInstance().getTime())
>   }
> }
>
> Thanks,
> Best,
> Peng
>
> On Thu, Oct 30, 2014 at 1:23 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> Did you cache the data and check the load balancing? How many
>> features? Which API are you using: Scala, Java, or Python? -Xiangrui
>>
>> On Thu, Oct 30, 2014 at 9:13 AM, Jimmy <ji...@sellpoints.com> wrote:
>> > Watch the app manager; it should tell you what's running and taking
>> > a while.
>> > My guess is that it's a "distinct" function on the data.
>> > J
>> >
>> > Sent from my iPhone
>> >
>> > On Oct 30, 2014, at 8:22 AM, peng xia <toxiap...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > Previously we applied the SVM algorithm in MLlib to 5 million records
>> > (600 MB), and it took more than 25 minutes to finish.
>> > The Spark version we are using is 1.0, and we were running this program
>> > on a 4-node cluster. Each node has 4 CPU cores and 11 GB of RAM.
>> >
>> > The 5 million records contain only two distinct records (one positive
>> > and one negative); the others are all duplicates.
>> >
>> > Does anyone have any idea why it takes so long on such small data?
>> >
>> > Thanks,
>> > Best,
>> >
>> > Peng