Sent from my iPhone
Begin forwarded message:

> From: Robin East <robin.e...@xense.co.uk>
> Date: 16 January 2015 11:35:23 GMT
> To: Joseph Bradley <jos...@databricks.com>
> Cc: Yana Kadiyska <yana.kadiy...@gmail.com>, Devl Devel <devl.developm...@gmail.com>
> Subject: Re: LinearRegressionWithSGD accuracy
>
> Yes, with scaled data the intercept would be 5000, but the code as it stands
> is running a model where the intercept will be 0.00. You need to call
> setIntercept(true) to include the intercept in the model.
>
> Robin
>
> Sent from my iPhone
>
>> On 16 Jan 2015, at 02:01, Joseph Bradley <jos...@databricks.com> wrote:
>>
>> Good point about using the intercept. When scaling uses the mean (shifting
>> the feature values), the "true" model now has an intercept of 5000.5,
>> whereas the original data's "true" model has an intercept of 0. I think
>> that's the issue.
>>
>>> On Thu, Jan 15, 2015 at 5:16 PM, Yana Kadiyska <yana.kadiy...@gmail.com> wrote:
>>>
>>> I can actually reproduce his MSE -- with the scaled data only (non-scaled
>>> works out just fine):
>>>
>>> import org.apache.spark.mllib.feature.StandardScaler
>>> import org.apache.spark.mllib.regression._
>>> import org.apache.spark.mllib.linalg.{Vector, Vectors}
>>>
>>> val t = (1 to 10000).map(x => (x, x))
>>> val rdd = sc.parallelize(t)
>>> val parsedData = rdd.map(q => LabeledPoint(q._1.toDouble, Vectors.dense(q._2.toDouble)))
>>>
>>> val lr = new LinearRegressionWithSGD()
>>> lr.optimizer.setStepSize(0.00000001)
>>> lr.optimizer.setNumIterations(100)
>>>
>>> // scaler fitted as in the original post:
>>> val scaler = new StandardScaler(withMean = true, withStd = true)
>>>   .fit(parsedData.map(x => x.features))
>>> val scaledData = parsedData.map(x =>
>>>   LabeledPoint(x.label, scaler.transform(Vectors.dense(x.features.toArray))))
>>> val model = lr.run(scaledData)
>>>
>>> val valuesAndPreds = scaledData.map { point =>
>>>   val prediction = model.predict(point.features)
>>>   (prediction, point.label)
>>> }
>>> val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
>>>
>>> The last few lines read as:
>>>
>>> 15/01/15 16:16:40 INFO GradientDescent: GradientDescent.runMiniBatchSGD finished.
>>> Last 10 stochastic losses 3.3338313007386144E7, 3.333831299679853E7,
>>> 3.333831298621632E7, 3.333831297563938E7, 3.3338312965067785E7,
>>> 3.3338312954501465E7, 3.333831294394051E7, 3.3338312933384743E7,
>>> 3.33383129228344E7, 3.3338312912289333E7
>>> 15/01/15 16:16:40 WARN LinearRegressionWithSGD: The input data was not
>>> directly cached, which may hurt performance if its parent RDDs are also
>>> uncached.
>>> model: org.apache.spark.mllib.regression.LinearRegressionModel =
>>> (weights=[0.003567902277776811], intercept=0.0)
>>>
>>> So I am a bit puzzled, as I was under the impression that a scaled model
>>> would only converge faster. The non-scaled version produced near-perfect
>>> results at alpha = 0.00000001, numIterations = 100.
>>>
>>> According to R, the weights should be a lot higher:
>>>
>>> y = seq(1, 10000)
>>> X = scale(seq(1, 10000), center = TRUE, scale = TRUE)
>>> dt = data.frame(y, X)
>>> names(dt) = c("y", "x")
>>> model = lm(y ~ x, data = dt)
>>> new <- data.frame(x = dt$x)
>>> preds = predict(model, new)
>>> mean((preds - dt$y)^2, na.rm = TRUE)
>>>
>>> Coefficients:
>>> (Intercept)            x
>>>      5000.5     2886.896
>>>
>>> I did have success with the following model and the scaled features as
>>> shown in the original code block:
>>>
>>> val lr = new LinearRegressionWithSGD().setIntercept(true)
>>> lr.optimizer.setStepSize(0.1)
>>> lr.optimizer.setNumIterations(1000)
>>>
>>> scala> model
>>> res12: org.apache.spark.mllib.regression.LinearRegressionModel =
>>> (weights=[2886.885094323781], intercept=5000.48169121784)
>>> MSE: Double = 4.472548743491049E-4
>>>
>>> Not sure that it's a question for the dev list as much as for someone who
>>> understands ML well -- I'd appreciate it if you have any insight on why
>>> the small alpha/numIters did so poorly on the scaled data (I've removed
>>> the dev list).
>>>
>>>> On Thu, Jan 15, 2015 at 3:23 PM, Joseph Bradley <jos...@databricks.com> wrote:
>>>>
>>>> It looks like you're training on the non-scaled data but
>>>> testing on the scaled data. Have you tried training & testing on only
>>>> the scaled data?
>>>>
>>>>> On Thu, Jan 15, 2015 at 10:42 AM, Devl Devel <devl.developm...@gmail.com> wrote:
>>>>>
>>>>> Thanks, that helps a bit, at least with the NaN, but the MSE is still
>>>>> very high even with that step size and 10k iterations:
>>>>>
>>>>> training Mean Squared Error = 3.3322561285919316E7
>>>>>
>>>>> Does this method need, say, 100k iterations?
>>>>>
>>>>>> On Thu, Jan 15, 2015 at 5:42 PM, Robin East <robin.e...@xense.co.uk> wrote:
>>>>>>
>>>>>> -dev, +user
>>>>>>
>>>>>> You'll need to set the gradient descent step size to something small --
>>>>>> a bit of trial and error shows that 0.00000001 works.
>>>>>>
>>>>>> Create a LinearRegressionWithSGD instance and set the step size
>>>>>> explicitly:
>>>>>>
>>>>>> val lr = new LinearRegressionWithSGD()
>>>>>> lr.optimizer.setStepSize(0.00000001)
>>>>>> lr.optimizer.setNumIterations(100)
>>>>>> val model = lr.run(parsedData)
>>>>>>
>>>>>> On 15 Jan 2015, at 16:46, devl.development <devl.developm...@gmail.com> wrote:
>>>>>>
>>>>>> From what I gather, you use LinearRegressionWithSGD to predict y, the
>>>>>> response variable, given a feature vector x.
>>>>>>
>>>>>> In a simple example I used a perfectly linear dataset such that x = y:
>>>>>>
>>>>>> y,x
>>>>>> 1,1
>>>>>> 2,2
>>>>>> ...
>>>>>> 10000,10000
>>>>>>
>>>>>> Using the out-of-the-box example from the website (with and without
>>>>>> scaling):
>>>>>>
>>>>>> val data = sc.textFile(file)
>>>>>>
>>>>>> val parsedData = data.map { line =>
>>>>>>   val parts = line.split(',')
>>>>>>   LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).toDouble)) // label y, feature x
>>>>>> }
>>>>>> val scaler = new StandardScaler(withMean = true, withStd = true)
>>>>>>   .fit(parsedData.map(x => x.features))
>>>>>> val scaledData = parsedData.map(x =>
>>>>>>   LabeledPoint(x.label, scaler.transform(Vectors.dense(x.features.toArray))))
>>>>>>
>>>>>> // Building the model
>>>>>> val numIterations = 100
>>>>>> val model = LinearRegressionWithSGD.train(parsedData, numIterations)
>>>>>>
>>>>>> // Evaluate the model on training examples and compute the training error
>>>>>> // (tried using both scaledData and parsedData)
>>>>>> val valuesAndPreds = scaledData.map { point =>
>>>>>>   val prediction = model.predict(point.features)
>>>>>>   (point.label, prediction)
>>>>>> }
>>>>>> val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
>>>>>> println("training Mean Squared Error = " + MSE)
>>>>>>
>>>>>> Both scaled and unscaled attempts give:
>>>>>>
>>>>>> training Mean Squared Error = NaN
>>>>>>
>>>>>> I've even tried x, y + (sample noise from a normal with mean 0 and
>>>>>> stddev 1) and it still comes up with the same thing.
>>>>>>
>>>>>> Is this not supposed to work for x and y, or for 2-dimensional plots?
>>>>>> Is there something I'm missing or wrong in the code above? Or is there
>>>>>> a limitation in the method?
>>>>>>
>>>>>> Thanks for any advice.
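The NaN reported above is consistent with gradient descent diverging at the default step size (1.0 in Spark 1.x's LinearRegressionWithSGD) on features as large as 10000. Below is a dependency-free sketch in plain Scala, using the textbook least-squares gradient rather than MLlib's exact update rule, so treat it as an illustration of the mechanism, not of MLlib internals: with step size 1.0 the weight overflows to NaN within a few dozen iterations, while 0.00000001 converges toward the true weight of 1.0 on the unscaled data.

```scala
// Why the default step size yields NaN on y = x, x = 1..10000.
// Least-squares gradient: grad = (2/n) * sum((w*x - y) * x).
// With x as large as 1e4, the gradient is ~6.7e7 per unit of weight
// error, so step 1.0 overshoots catastrophically while 1e-8 is stable.
object StepSizeDemo {
  val xs: IndexedSeq[Double] = (1 to 10000).map(_.toDouble)
  val ys: IndexedSeq[Double] = xs // perfectly linear dataset: y = x

  def descend(step: Double, iters: Int): Double = {
    var w = 0.0
    for (_ <- 1 to iters) {
      val grad = xs.zip(ys).map { case (x, y) => (w * x - y) * x }.sum * 2.0 / xs.size
      w -= step * grad // plain batch gradient-descent update
    }
    w
  }

  def main(args: Array[String]): Unit = {
    println(s"step 1.0  -> w = ${descend(1.0, 50)}")   // overflows to NaN
    println(s"step 1e-8 -> w = ${descend(1e-8, 100)}") // close to the true weight 1.0
  }
}
```

Each iteration multiplies the weight error by roughly 1 - step * (2/n) * sum(x^2), which is about -6.7e7 at step 1.0 (geometric blow-up) but about 0.33 at step 1e-8 (geometric decay), which is why only the tiny step size works without feature scaling.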
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/LinearRegressionWithSGD-accuracy-tp10127.html
>>>>>> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
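The numbers quoted in the thread (slope 2886.896, intercept 5000.5, and stochastic losses around 3.3338e7) can be cross-checked in closed form. The sketch below is plain Scala with no Spark dependency, assuming the same synthetic dataset y = x for x = 1..10000: standardizing x shifts the "true" model to y = sd(x) * x_scaled + mean(y), so a no-intercept model cannot fit the scaled data, and a model whose weight barely moved from 0 has MSE = mean(y^2), essentially the loss GradientDescent reported.

```scala
// Closed-form least squares on the standardized feature, checking the
// figures R and MLlib produced in the thread above.
object ScaledInterceptCheck {
  val xs: IndexedSeq[Double] = (1 to 10000).map(_.toDouble)
  val n: Int = xs.size
  val meanX: Double = xs.sum / n // 5000.5; equals mean(y) since y = x
  val sampleSd: Double =
    math.sqrt(xs.map(x => (x - meanX) * (x - meanX)).sum / (n - 1))
  val scaled: IndexedSeq[Double] = xs.map(x => (x - meanX) / sampleSd)
  val ys: IndexedSeq[Double] = xs

  // OLS slope on the centered feature; intercept is just mean(y)
  // because the standardized feature has mean 0.
  val slope: Double =
    scaled.zip(ys).map { case (x, y) => x * (y - meanX) }.sum /
      scaled.map(x => x * x).sum
  val intercept: Double = meanX

  // MSE of a no-intercept model whose weight stayed near 0:
  val mseOfZeroModel: Double = ys.map(y => y * y).sum / n

  def main(args: Array[String]): Unit = {
    println(f"slope = $slope%.3f, intercept = $intercept%.1f") // ~2886.896 and 5000.5
    println(f"MSE of the near-zero model = $mseOfZeroModel%.4e") // ~3.3338e7
  }
}
```

The slope works out to exactly the sample standard deviation of x (2886.896), matching R's coefficient, and mean(y^2) = (n+1)(2n+1)/6 = 33,338,333.5 matches the stochastic losses, confirming that with a step size of 1e-8 on the scaled data the weight had effectively not moved and only setIntercept(true) with a larger step recovers the fit.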