Thanks, Jimmy.
I will give it a try.
Thanks very much for all your help.

Best,
Peng

On Thu, Oct 30, 2014 at 8:19 PM, Jimmy <ji...@sellpoints.com> wrote:

> sampleRDD.cache()
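>
> Applied to the code further down in this thread, that would look roughly like
> this (a sketch; the RDD names come from that snippet):
>
> parsedData.cache()        // call before SVMWithSGD.train so the iterations reuse the in-memory data
> testParsedData.cache()    // likewise, if the test RDD gets reused for evaluation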
>
> Sent from my iPhone
>
> On Oct 30, 2014, at 5:01 PM, peng xia <toxiap...@gmail.com> wrote:
>
> Hi Xiangrui,
>
> Could you give me a code example of caching? I am new to Spark.
>
> Thanks,
> Best,
> Peng
>
> On Thu, Oct 30, 2014 at 6:57 PM, Xiangrui Meng <men...@gmail.com> wrote:
>
>> Then caching should solve the problem. Otherwise, it is just loading
>> and parsing data from disk for each iteration. -Xiangrui
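>>
>> A minimal sketch of the difference, using the RDD names from the code below
>> (the count() is optional and just materializes the cache before training):
>>
>> // uncached: the textFile read and the parsing re-run on every training iteration
>> // cached:   the parsed RDD is computed once and then reused from memory
>> parsedData.cache()   // equivalent to persist(StorageLevel.MEMORY_ONLY)
>> parsedData.count()   // optional action that forces the cache to be populated
>> val model = SVMWithSGD.train(parsedData, numIterations)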
>>
>> On Thu, Oct 30, 2014 at 11:44 AM, peng xia <toxiap...@gmail.com> wrote:
>> > Thanks for all your help.
>> > I think I didn't cache the data. My previous cluster has expired, so I
>> > don't have a chance to check the load balancing or the app manager.
>> > Below is my code.
>> > There are 18 features for each record and I am using the Scala API.
>> >
>> > import org.apache.spark.SparkConf
>> > import org.apache.spark.SparkContext
>> > import org.apache.spark.SparkContext._
>> > import org.apache.spark.rdd._
>> > import org.apache.spark.mllib.classification.SVMWithSGD
>> > import org.apache.spark.mllib.regression.LabeledPoint
>> > import org.apache.spark.mllib.linalg.Vectors
>> > import java.util.Calendar
>> >
>> > object BenchmarkClassification {
>> >   def main(args: Array[String]) {
>> >     // Load and parse the data file
>> >     val conf = new SparkConf()
>> >       .setAppName("SVM")
>> >       .set("spark.executor.memory", "8g")
>> >       // .set("spark.executor.extraJavaOptions", "-Xms8g -Xmx8g")
>> >     val sc = new SparkContext(conf)
>> >     val data = sc.textFile(args(0))
>> >     val parsedData = data.map { line =>
>> >       val parts = line.split(',')
>> >       LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x => x.toDouble)))
>> >     }
>> >     val testData = sc.textFile(args(1))
>> >     val testParsedData = testData.map { line =>
>> >       val parts = line.split(',')
>> >       LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x => x.toDouble)))
>> >     }
>> >
>> >     // Run training algorithm to build the model
>> >     val numIterations = 20
>> >     val model = SVMWithSGD.train(parsedData, numIterations)
>> >
>> >     // Evaluate model on training examples and compute training error
>> >     // val labelAndPreds = testParsedData.map { point =>
>> >     //   val prediction = model.predict(point.features)
>> >     //   (point.label, prediction)
>> >     // }
>> >     // val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testParsedData.count
>> >     // println("Training Error = " + trainErr)
>> >     println(Calendar.getInstance().getTime())
>> >   }
>> > }
>> >
>> >
>> >
>> >
>> > Thanks,
>> > Best,
>> > Peng
>> >
>> > On Thu, Oct 30, 2014 at 1:23 PM, Xiangrui Meng <men...@gmail.com> wrote:
>> >>
>> >> Did you cache the data and check the load balancing? How many
>> >> features? Which API are you using, Scala, Java, or Python? -Xiangrui
>> >>
>> >> On Thu, Oct 30, 2014 at 9:13 AM, Jimmy <ji...@sellpoints.com> wrote:
>> >> > Watch the app manager; it should tell you what's running and taking
>> >> > a while...
>> >> > My guess is it's a "distinct" function on the data.
>> >> > J
>> >> >
>> >> > Sent from my iPhone
>> >> >
>> >> > On Oct 30, 2014, at 8:22 AM, peng xia <toxiap...@gmail.com> wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> >
>> >> >
>> >> > Previously, we applied the SVM algorithm in MLlib to 5 million records
>> >> > (600 MB), and it took more than 25 minutes to finish.
>> >> > The Spark version we are using is 1.0, and we were running this program
>> >> > on a 4-node cluster. Each node has 4 CPU cores and 11 GB RAM.
>> >> >
>> >> > The 5 million records contain only two distinct records (one positive
>> >> > and one negative); the rest are all duplicates.
>> >> >
>> >> > Does anyone have any idea why it takes so long on such small data?
>> >> >
>> >> >
>> >> >
>> >> > Thanks,
>> >> > Best,
>> >> >
>> >> > Peng
>> >
>> >
>>
>
>
