Re: LinearRegressionWithSGD accuracy
Hi Robin,

You can try this PR out. It has built-in feature scaling and ElasticNet regularization (an L1/L2 mix). This implementation converges stably to the same model that R's glmnet package produces.

https://github.com/apache/spark/pull/4259

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Thu, Jan 15, 2015 at 9:42 AM, Robin East robin.e...@xense.co.uk wrote:

> -dev, +user
>
> You'll need to set the gradient descent step size to something small - a bit
> of trial and error shows that 0.0001 works. You'll need to create a
> LinearRegressionWithSGD instance and set the step size explicitly:
>
>     val lr = new LinearRegressionWithSGD()
>     lr.optimizer.setStepSize(0.0001)
>     lr.optimizer.setNumIterations(100)
>     val model = lr.run(parsedData)
>
> On 15 Jan 2015, at 16:46, devl.development devl.developm...@gmail.com wrote:
>
>> From what I gather, you use LinearRegressionWithSGD to predict y, the
>> response variable, given a feature vector x. In a simple example I used a
>> perfectly linear dataset such that x = y:
>>
>>     y,x
>>     1,1
>>     2,2
>>     ...
>>
>> Using the out-of-the-box example from the website (with and without
>> scaling):
>>
>>     val data = sc.textFile(file)
>>     val parsedData = data.map { line =>
>>       val parts = line.split(',')
>>       LabeledPoint(parts(1).toDouble, Vectors.dense(parts(0).toDouble)) // y and x
>>     }
>>     val scaler = new StandardScaler(withMean = true, withStd = true)
>>       .fit(parsedData.map(x => x.features))
>>     val scaledData = parsedData
>>       .map(x => LabeledPoint(x.label, scaler.transform(Vectors.dense(x.features.toArray))))
>>
>>     // Building the model
>>     val numIterations = 100
>>     val model = LinearRegressionWithSGD.train(parsedData, numIterations)
>>
>>     // Evaluate the model on training examples and compute the training error
>>     // (tried using both scaledData and parsedData)
>>     val valuesAndPreds = scaledData.map { point =>
>>       val prediction = model.predict(point.features)
>>       (point.label, prediction)
>>     }
>>     val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
>>     println("training Mean Squared Error = " + MSE)
>>
>> Both scaled and unscaled attempts give:
>>
>>     training Mean Squared Error = NaN
>>
>> I've even tried x, y + (sample noise from a normal with mean 0 and stddev 1)
>> and still get the same thing. Is this not supposed to work for x and y,
>> i.e. two-dimensional data? Is there something I'm missing or wrong in the
>> code above? Or is there a limitation in the method? Thanks for any advice.
>>
>> --
>> View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/LinearRegressionWithSGD-accuracy-tp10127.html
>> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
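The NaN is what gradient descent produces when the step size is too large for the scale of the features. Here is a minimal sketch of the effect in plain Scala — batch gradient descent on hypothetical data x = 1..20 with a no-intercept model y = w*x, not MLlib's actual minibatch SGD:

```scala
object StepSizeDemo {
  // Perfectly linear toy data, as in the original example: y = x.
  val xs: Seq[Double] = (1 to 20).map(_.toDouble)
  val ys: Seq[Double] = xs

  // Plain batch gradient descent for the model y = w * x (no intercept),
  // minimizing L(w) = (1/n) * sum((w*x - y)^2).
  def fit(stepSize: Double, numIterations: Int): Double = {
    var w = 0.0
    for (_ <- 1 to numIterations) {
      val grad = xs.zip(ys).map { case (x, y) => 2.0 * (w * x - y) * x }.sum / xs.length
      w -= stepSize * grad
    }
    w
  }

  def main(args: Array[String]): Unit = {
    // A large step overshoots further on every iteration until the weight
    // overflows and the update produces NaN.
    println(s"stepSize = 1.0    -> w = ${fit(1.0, 200)}")     // NaN
    // Robin's small step contracts the error each iteration instead.
    println(s"stepSize = 0.0001 -> w = ${fit(0.0001, 1000)}") // ~1.0
  }
}
```

Each update multiplies the error (w - 1) by (1 - stepSize * 2 * sum(x^2)/n); at stepSize 1.0 that factor is about -286, so the weight overflows within a couple hundred iterations, while at 0.0001 the factor is about 0.97 and the weight contracts toward 1.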
Re: LinearRegressionWithSGD accuracy
I'm working on LinearRegressionWithElasticNet using OWLQN now. It will do the data standardization internally, so it's transparent to users. With OWLQN, you don't have to choose stepSize manually. I will send out a PR next week.

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Thu, Jan 15, 2015 at 8:46 AM, devl.development devl.developm...@gmail.com wrote:

> [original question trimmed; it is quoted in full earlier in the thread]
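For context, the elastic-net term being mixed here is, in the usual glmnet-style parameterization, lambda * (alpha * ||w||_1 + (1 - alpha)/2 * ||w||_2^2). A small plain-Scala sketch of that penalty (illustrative names, not code from the PR):

```scala
object ElasticNetPenalty {
  // glmnet-style elastic-net term:
  //   lambda * (alpha * ||w||_1 + (1 - alpha)/2 * ||w||_2^2)
  // alpha = 1 is pure L1 (lasso); alpha = 0 is pure L2 (ridge).
  def penalty(w: Array[Double], lambda: Double, alpha: Double): Double = {
    val l1   = w.map(math.abs).sum
    val l2sq = w.map(v => v * v).sum
    lambda * (alpha * l1 + (1 - alpha) / 2.0 * l2sq)
  }

  def main(args: Array[String]): Unit = {
    val w = Array(3.0, -4.0) // ||w||_1 = 7, ||w||_2^2 = 25
    println(penalty(w, lambda = 0.1, alpha = 1.0)) // ~0.7  (L1 only)
    println(penalty(w, lambda = 0.1, alpha = 0.0)) // ~1.25 (L2 only)
    println(penalty(w, lambda = 0.1, alpha = 0.5)) // a mix of both
  }
}
```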
Re: LinearRegressionWithSGD accuracy
Thanks, that helps a bit at least with the NaN, but the MSE is still very high even with that step size and 10k iterations:

    training Mean Squared Error = 3.3322561285919316E7

Does this method need, say, 100k iterations?

On Thu, Jan 15, 2015 at 5:42 PM, Robin East robin.e...@xense.co.uk wrote:

> -dev, +user
>
> You'll need to set the gradient descent step size to something small - a bit
> of trial and error shows that 0.0001 works. You'll need to create a
> LinearRegressionWithSGD instance and set the step size explicitly:
>
>     val lr = new LinearRegressionWithSGD()
>     lr.optimizer.setStepSize(0.0001)
>     lr.optimizer.setNumIterations(100)
>     val model = lr.run(parsedData)
>
> [original question trimmed; it is quoted in full earlier in the thread]
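More iterations are unlikely to be the fix: on a problem this small, each gradient-descent step at stepSize 0.0001 shrinks the remaining error by a constant factor, so the training MSE should already be essentially zero well before 10k iterations. A plain-Scala sketch of how the MSE falls with the iteration count (batch gradient descent on the y = x toy data, not MLlib itself):

```scala
object IterationCountDemo {
  // Toy data as in the original question: y = x for x in 1..20.
  val xs: Seq[Double] = (1 to 20).map(_.toDouble)

  // Plain batch gradient descent for y = w * x with squared loss.
  def fit(stepSize: Double, numIterations: Int): Double = {
    var w = 0.0
    for (_ <- 1 to numIterations) {
      val grad = xs.map(x => 2.0 * (w * x - x) * x).sum / xs.length
      w -= stepSize * grad
    }
    w
  }

  // Training MSE of the model y = w * x against the true labels y = x.
  def mse(w: Double): Double = xs.map(x => math.pow(w * x - x, 2)).sum / xs.length

  def main(args: Array[String]): Unit = {
    // The MSE falls geometrically with the iteration count; by 10k
    // iterations it is effectively zero, so a training MSE of ~3.3E7
    // points at a different problem than too few iterations.
    for (iters <- Seq(100, 1000, 10000))
      println(s"$iters iterations -> MSE = ${mse(fit(0.0001, iters))}")
  }
}
```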
Re: LinearRegressionWithSGD accuracy
-dev, +user

You'll need to set the gradient descent step size to something small - a bit of trial and error shows that 0.0001 works. You'll need to create a LinearRegressionWithSGD instance and set the step size explicitly:

    val lr = new LinearRegressionWithSGD()
    lr.optimizer.setStepSize(0.0001)
    lr.optimizer.setNumIterations(100)
    val model = lr.run(parsedData)

On 15 Jan 2015, at 16:46, devl.development devl.developm...@gmail.com wrote:

> [original question trimmed; it is quoted in full earlier in the thread]
Re: LinearRegressionWithSGD accuracy
It looks like you're training on the non-scaled data but testing on the scaled data. Have you tried training and testing on only the scaled data?

On Thu, Jan 15, 2015 at 10:42 AM, Devl Devel devl.developm...@gmail.com wrote:

> Thanks, that helps a bit at least with the NaN, but the MSE is still very
> high even with that step size and 10k iterations:
>
>     training Mean Squared Error = 3.3322561285919316E7
>
> Does this method need, say, 100k iterations?
>
> [earlier messages trimmed; they are quoted in full earlier in the thread]
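This mismatch by itself is enough to produce a huge MSE: a weight fit on raw features gets applied to standardized features while the labels stay on the original scale. A plain-Scala sketch of the effect on hypothetical data 1..20 (standardization done by hand to mimic StandardScaler; whether the scaler divides by n or n-1 doesn't change the point):

```scala
object ScalingMismatchDemo {
  // Toy data as in the thread: y = x for x in 1..20.
  val xs: Seq[Double] = (1 to 20).map(_.toDouble)
  val ys: Seq[Double] = xs

  // Standardize features the way StandardScaler(withMean = true,
  // withStd = true) would: subtract the mean, divide by the std deviation.
  val mean: Double = xs.sum / xs.length
  val std: Double  = math.sqrt(xs.map(x => math.pow(x - mean, 2)).sum / xs.length)
  val scaledXs: Seq[Double] = xs.map(x => (x - mean) / std)

  def mse(preds: Seq[Double]): Double =
    preds.zip(ys).map { case (p, y) => math.pow(p - y, 2) }.sum / ys.length

  def main(args: Array[String]): Unit = {
    val w = 1.0 // the weight a model trained on the *raw* features learns for y = x
    // Consistent train/test data: near-perfect fit.
    println(s"predict on raw features:    MSE = ${mse(xs.map(w * _))}")
    // Train on raw, predict on scaled: predictions collapse toward zero
    // while the labels stay large, so the MSE blows up.
    println(s"predict on scaled features: MSE = ${mse(scaledXs.map(w * _))}")
  }
}
```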