Begin forwarded message:

> From: Robin East <robin.e...@xense.co.uk>
> Date: 16 January 2015 11:35:23 GMT
> To: Joseph Bradley <jos...@databricks.com>
> Cc: Yana Kadiyska <yana.kadiy...@gmail.com>, Devl Devel 
> <devl.developm...@gmail.com>
> Subject: Re: LinearRegressionWithSGD accuracy
> 
> Yes, with scaled data the intercept would be 5000.5, but the code as it
> stands fits a model whose intercept is fixed at 0.0. You need to call
> setIntercept(true) to include the intercept in the model.
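> 
> For example, a minimal sketch (reusing the lr/scaledData setup, with the
> working step size and iteration count quoted further down the thread):
> 
> val lr = new LinearRegressionWithSGD().setIntercept(true)
> lr.optimizer.setStepSize(0.1)
> lr.optimizer.setNumIterations(1000)
> val model = lr.run(scaledData)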
> 
> Robin
> 
>> On 16 Jan 2015, at 02:01, Joseph Bradley <jos...@databricks.com> wrote:
>> 
>> Good point about using the intercept.  When scaling uses the mean (shifting
>> the feature values), the "true" model for the scaled data has an intercept
>> of 5000.5, whereas the original data's "true" model has an intercept of 0.
>> I think that's the issue.
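>> 
>> Concretely (a sketch of the arithmetic, plain Scala): an OLS fit with an
>> intercept always passes through (mean(x), mean(y)), so once x is
>> mean-centered the intercept has to equal mean(y):
>> 
>> val ys = (1 to 10000).map(_.toDouble)
>> val meanY = ys.sum / ys.size   // 5000.5 -- the intercept on centered x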
>> 
>>> On Thu, Jan 15, 2015 at 5:16 PM, Yana Kadiyska <yana.kadiy...@gmail.com> 
>>> wrote:
>>> I can actually reproduce his MSE -- with the scaled data only (non-scaled 
>>> works out just fine)
>>> 
>>> import org.apache.spark.mllib.regression._
>>> import org.apache.spark.mllib.linalg.{Vector, Vectors}
>>> import org.apache.spark.mllib.feature.StandardScaler
>>> 
>>> val t = (1 to 10000).map(x => (x, x))
>>> val rdd = sc.parallelize(t)
>>> val parsedData = rdd.map(q => LabeledPoint(q._1.toDouble, Vectors.dense(q._2.toDouble)))
>>> 
>>> val lr = new LinearRegressionWithSGD()
>>> lr.optimizer.setStepSize(0.00000001)
>>> lr.optimizer.setNumIterations(100)
>>> 
>>> // fit the scaler on the features, then standardize them
>>> val scaler = new StandardScaler(withMean = true, withStd = true)
>>>   .fit(parsedData.map(x => x.features))
>>> val scaledData = parsedData.map(x =>
>>>   LabeledPoint(x.label, scaler.transform(Vectors.dense(x.features.toArray))))
>>> val model = lr.run(scaledData)
>>> 
>>> val valuesAndPreds = scaledData.map { point =>
>>>   val prediction = model.predict(point.features)
>>>   (prediction, point.label)
>>> }
>>> val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
>>> The last few lines read:
>>> 
>>> 15/01/15 16:16:40 INFO GradientDescent: GradientDescent.runMiniBatchSGD 
>>> finished. Last 10 stochastic losses 3.3338313007386144E7, 
>>> 3.333831299679853E7, 3.333831298621632E7, 3.333831297563938E7, 
>>> 3.3338312965067785E7, 3.3338312954501465E7, 3.333831294394051E7, 
>>> 3.3338312933384743E7, 3.33383129228344E7, 3.3338312912289333E7
>>> 15/01/15 16:16:40 WARN LinearRegressionWithSGD: The input data was not 
>>> directly cached, which may hurt performance if its parent RDDs are also 
>>> uncached.
>>> model: org.apache.spark.mllib.regression.LinearRegressionModel = 
>>> (weights=[0.003567902277776811], intercept=0.0)
>>> 
>>> So I am a bit puzzled, as I was under the impression that scaling would
>>> only make the model converge faster. The non-scaled version produced
>>> near-perfect results at alpha=0.00000001, numIterations=100.
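>>> 
>>> A rough sketch of why (plain Scala, no Spark; constant step for simplicity,
>>> whereas MLlib decays the step per iteration, so treat the result as
>>> order-of-magnitude only):
>>> 
>>> val n = 10000
>>> val ys = (1 to n).map(_.toDouble)
>>> val mean = ys.sum / n
>>> val sd = math.sqrt(ys.map(v => (v - mean) * (v - mean)).sum / (n - 1))
>>> val xs = ys.map(v => (v - mean) / sd)           // standardized feature
>>> var w = 0.0
>>> for (_ <- 1 to 100) {
>>>   // gradient of 1/2 * squared error for the no-intercept model w * x
>>>   val grad = xs.zip(ys).map { case (x, t) => (w * x - t) * x }.sum / n
>>>   w -= 0.00000001 * grad
>>> }
>>> println(w)   // ~0.003: after 100 tiny steps, w is nowhere near ~2886.9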
>>> 
>>> According to R the weights should be a lot higher:
>>> 
>>> y = seq(1, 10000)
>>> x = seq(1, 10000)
>>> X = scale(x, center = TRUE, scale = TRUE)
>>> dt = data.frame(y, X)
>>> names(dt) = c("y", "x")
>>> model = lm(y ~ x, data = dt)
>>> # intercept: 5000.5, slope: 2886.896
>>> new <- data.frame(x = dt$x)
>>> preds = predict(model, new)
>>> mean((preds - dt$y)^2, na.rm = TRUE)
>>> 
>>> Coefficients:
>>> (Intercept)            x
>>>      5000.5     2886.896
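>>> 
>>> (That slope is no accident: with y = x, the OLS slope on standardized x is
>>> just the sample standard deviation of the data. A quick check in Scala:)
>>> 
>>> val ys = (1 to 10000).map(_.toDouble)
>>> val m = ys.sum / ys.size
>>> math.sqrt(ys.map(v => (v - m) * (v - m)).sum / (ys.size - 1))  // ≈ 2886.896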
>>> 
>>> I did have success with the following model and scaled features as shown in 
>>> the original code block:
>>> 
>>> val lr = new LinearRegressionWithSGD().setIntercept(true)
>>> lr.optimizer.setStepSize(0.1)
>>> lr.optimizer.setNumIterations(1000)
>>> 
>>> scala> model
>>> res12: org.apache.spark.mllib.regression.LinearRegressionModel = 
>>> (weights=[2886.885094323781], intercept=5000.48169121784)
>>> MSE: Double = 4.472548743491049E-4
>>> 
>>> Not sure this is a question for the dev list so much as for someone who
>>> understands ML well -- I'd appreciate any insight into why the small
>>> alpha/numIterations did so poorly on the scaled data (I've removed the dev
>>> list).
>>> 
>>>> On Thu, Jan 15, 2015 at 3:23 PM, Joseph Bradley <jos...@databricks.com> 
>>>> wrote:
>>> 
It looks like you're training on the non-scaled data but testing on the
scaled data.  Have you tried training & testing on only the scaled data?
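>>>> 
>>>> For example (a sketch, assuming the lr and scaledData defined in the
>>>> snippets above):
>>>> 
>>>> val model = lr.run(scaledData)
>>>> val mse = scaledData.map { p =>
>>>>   val err = model.predict(p.features) - p.label
>>>>   err * err
>>>> }.mean()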
>>>> 
>>>> On Thu, Jan 15, 2015 at 10:42 AM, Devl Devel <devl.developm...@gmail.com>
>>>> wrote:
>>>> 
> Thanks, that helps a bit, at least with the NaN, but the MSE is still
> very high even with that step size and 10k iterations:
>
> training Mean Squared Error = 3.3322561285919316E7
>
> Does this method need, say, 100k iterations?
>
>>>> > On Thu, Jan 15, 2015 at 5:42 PM, Robin East <robin.e...@xense.co.uk>
>>>> > wrote:
>>>> >
> > -dev, +user
> >
> > You’ll need to set the gradient descent step size to something small - a
> > bit of trial and error shows that 0.00000001 works.
> >
> > Create a LinearRegressionWithSGD instance and set the step size
> > explicitly:
>>>> > >
>>>> > > val lr = new LinearRegressionWithSGD()
>>>> > > lr.optimizer.setStepSize(0.00000001)
>>>> > > lr.optimizer.setNumIterations(100)
>>>> > > val model = lr.run(parsedData)
>>>> > >
>>>> > > On 15 Jan 2015, at 16:46, devl.development <devl.developm...@gmail.com>
>>>> > > wrote:
>>>> > >
> > From what I gather, you use LinearRegressionWithSGD to predict y, the
> > response variable, given a feature vector x.
> >
> > In a simple example I used a perfectly linear dataset such that x = y:
> >
> > y,x
> > 1,1
> > 2,2
> > ...
> > 10000,10000
> >
> > Using the out-of-the-box example from the website (with and without
> > scaling):
>>>> > >
> > import org.apache.spark.mllib.regression._
> > import org.apache.spark.mllib.linalg.Vectors
> > import org.apache.spark.mllib.feature.StandardScaler
> >
> > val data = sc.textFile(file)
> >
> > val parsedData = data.map { line =>
> >   val parts = line.split(',')
> >   LabeledPoint(parts(1).toDouble, Vectors.dense(parts(0).toDouble)) // y and x
> > }
> > val scaler = new StandardScaler(withMean = true, withStd = true)
> >   .fit(parsedData.map(x => x.features))
> > val scaledData = parsedData.map(x =>
> >   LabeledPoint(x.label, scaler.transform(Vectors.dense(x.features.toArray))))
> >
> > // Building the model
> > val numIterations = 100
> > val model = LinearRegressionWithSGD.train(parsedData, numIterations)
> >
> > // Evaluate model on training examples and compute training error
> > // (tried using both scaledData and parsedData)
> > val valuesAndPreds = scaledData.map { point =>
> >   val prediction = model.predict(point.features)
> >   (point.label, prediction)
> > }
> > val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
> > println("training Mean Squared Error = " + MSE)
>>>> > >
> > Both scaled and unscaled attempts give:
> >
> > training Mean Squared Error = NaN
> >
> > I've even tried x, y + (sample noise from a normal with mean 0 and stddev
> > 1); it still comes up with the same thing.
> >
> > Is this not supposed to work for x and y, i.e. simple two-dimensional
> > data? Is there something I'm missing or wrong in the code above? Or is
> > there a limitation in the method?
> >
> > Thanks for any advice.
>>>> > >