It was a bug in the code; adding the step parameter got the results to work:

Mean Squared Error = 2.610379825794694E-5
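For anyone curious why the step size matters so much here, the divergence can be reproduced without Spark at all. Below is a minimal, hypothetical full-batch gradient-descent loop on the same perfectly linear y = x data, in plain Scala — a sketch of the mechanism only, not the MLlib implementation (the object and method names are made up):

```scala
object StepSizeDemo {
  // Toy dataset mirroring the thread's example: y = x.
  val data: Seq[(Double, Double)] = (1 to 100).map(i => (i.toDouble, i.toDouble))

  // Gradient descent on a single-weight linear model (no intercept),
  // standing in for what each SGD step does.
  def train(step: Double, iters: Int): Double = {
    var w = 0.0
    for (_ <- 1 to iters) {
      // Gradient of mean squared error w.r.t. w: 2/n * sum((w*x - y) * x)
      val grad = data.map { case (x, y) => 2.0 * (w * x - y) * x }.sum / data.size
      w -= step * grad
    }
    w
  }

  def mse(w: Double): Double =
    data.map { case (x, y) => math.pow(w * x - y, 2) }.sum / data.size

  def main(args: Array[String]): Unit = {
    val small = train(1e-4, 100)
    println(s"small step: w = $small, MSE = ${mse(small)}")
    // With a large step the updates overshoot, the weight blows up past
    // Double range, and the result ends in NaN -- the thread's symptom.
    val large = train(1.0, 100)
    println(s"large step: w = $large")
  }
}
```

Because the features are unscaled (x up to 10000 in the thread's data), the gradient magnitudes are huge, which is why such a tiny step size is needed — and why scaling the features is the more principled fix.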
I've also opened a JIRA to add the step parameter to the examples, so that people new to MLlib have a way to improve the MSE: https://issues.apache.org/jira/browse/SPARK-5273

On Thu, Jan 15, 2015 at 8:23 PM, Joseph Bradley <jos...@databricks.com> wrote:

> It looks like you're training on the non-scaled data but testing on the
> scaled data. Have you tried training & testing on only the scaled data?
>
> On Thu, Jan 15, 2015 at 10:42 AM, Devl Devel <devl.developm...@gmail.com> wrote:
>
>> Thanks, that helps a bit, at least with the NaN, but the MSE is still very
>> high even with that step size and 10k iterations:
>>
>> training Mean Squared Error = 3.3322561285919316E7
>>
>> Does this method need, say, 100k iterations?
>>
>> On Thu, Jan 15, 2015 at 5:42 PM, Robin East <robin.e...@xense.co.uk> wrote:
>>
>>> -dev, +user
>>>
>>> You'll need to set the gradient descent step size to something small - a
>>> bit of trial and error shows that 0.00000001 works.
>>>
>>> You'll need to create a LinearRegressionWithSGD instance and set the step
>>> size explicitly:
>>>
>>> val lr = new LinearRegressionWithSGD()
>>> lr.optimizer.setStepSize(0.00000001)
>>> lr.optimizer.setNumIterations(100)
>>> val model = lr.run(parsedData)
>>>
>>> On 15 Jan 2015, at 16:46, devl.development <devl.developm...@gmail.com> wrote:
>>>
>>> From what I gather, you use LinearRegressionWithSGD to predict y, the
>>> response variable, given a feature vector x.
>>>
>>> In a simple example I used a perfectly linear dataset such that x=y:
>>>
>>> y,x
>>> 1,1
>>> 2,2
>>> ...
>>> 10000,10000
>>>
>>> Using the out-of-the-box example from the website (with and without scaling):
>>>
>>> val data = sc.textFile(file)
>>>
>>> val parsedData = data.map { line =>
>>>   val parts = line.split(',')
>>>   LabeledPoint(parts(1).toDouble, Vectors.dense(parts(0).toDouble)) // y and x
>>> }
>>> val scaler = new StandardScaler(withMean = true, withStd = true)
>>>   .fit(parsedData.map(x => x.features))
>>> val scaledData = parsedData
>>>   .map(x => LabeledPoint(x.label,
>>>     scaler.transform(Vectors.dense(x.features.toArray))))
>>>
>>> // Building the model
>>> val numIterations = 100
>>> val model = LinearRegressionWithSGD.train(parsedData, numIterations)
>>>
>>> // Evaluate model on training examples and compute training error
>>> // * tried using both scaledData and parsedData
>>> val valuesAndPreds = scaledData.map { point =>
>>>   val prediction = model.predict(point.features)
>>>   (point.label, prediction)
>>> }
>>> val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
>>> println("training Mean Squared Error = " + MSE)
>>>
>>> Both scaled and unscaled attempts give:
>>>
>>> training Mean Squared Error = NaN
>>>
>>> I've even tried x, y + (sample noise from a normal with mean 0 and stddev 1);
>>> it still comes up with the same thing.
>>>
>>> Is this not supposed to work for x and y, or 2-dimensional plots? Is there
>>> something I'm missing or wrong in the code above? Or is there a limitation
>>> in the method?
>>>
>>> Thanks for any advice.
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/LinearRegressionWithSGD-accuracy-tp10127.html
>>> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
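Joseph's point about the train/test scaling mismatch is also easy to reproduce outside Spark. Here is a hypothetical plain-Scala sketch (all names are made up; a closed-form least-squares fit stands in for the trained model, not the MLlib API): fitting on raw features but evaluating on standardized ones gives a huge MSE, while fitting and evaluating on the same representation does not.

```scala
object ScalingDemo {
  // Perfectly linear y = x data, as in the thread's example.
  val data: Seq[(Double, Double)] = (1 to 100).map(i => (i.toDouble, i.toDouble))

  private val xs    = data.map(_._1)
  private val xMean = xs.sum / xs.size
  private val xStd  = math.sqrt(xs.map(x => math.pow(x - xMean, 2)).sum / xs.size)

  // Standardize a feature value, analogous to StandardScaler(withMean, withStd).
  def scale(x: Double): Double = (x - xMean) / xStd

  // Closed-form simple linear regression with intercept: y ~ w*x + b.
  def fit(points: Seq[(Double, Double)]): (Double, Double) = {
    val n  = points.size
    val mx = points.map(_._1).sum / n
    val my = points.map(_._2).sum / n
    val w  = points.map { case (x, y) => (x - mx) * (y - my) }.sum /
             points.map { case (x, _) => (x - mx) * (x - mx) }.sum
    (w, my - w * mx)
  }

  def mse(model: (Double, Double), points: Seq[(Double, Double)]): Double = {
    val (w, b) = model
    points.map { case (x, y) => math.pow(w * x + b - y, 2) }.sum / points.size
  }

  def main(args: Array[String]): Unit = {
    val scaled = data.map { case (x, y) => (scale(x), y) }
    // Mismatch: fit on raw features, evaluate on scaled features -> large error.
    println("mismatched MSE = " + mse(fit(data), scaled))
    // Consistent: fit and evaluate on the same (scaled) features -> near zero.
    println("consistent MSE = " + mse(fit(scaled), scaled))
  }
}
```

The same rule applies to the original Spark snippet: whichever of parsedData or scaledData the model is trained on, predictions must be made on features from the same pipeline.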