spark git commit: [SPARK-6496] [MLLIB] GeneralizedLinearAlgorithm.run(input, initialWeights) should initialize numFeatures

2015-03-25 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 64262ed99 -> 10c78607b


[SPARK-6496] [MLLIB] GeneralizedLinearAlgorithm.run(input, initialWeights) 
should initialize numFeatures

In GeneralizedLinearAlgorithm, ```numFeatures``` defaults to -1, so it must be 
set to the correct value when run() is called to train a model.
```LogisticRegressionWithLBFGS.run(input)``` works, but calling 
```LogisticRegressionWithLBFGS.run(input, initialWeights)``` to train a 
multiclass classification model throws an exception because numFeatures 
is never updated on that path.
This PR updates numFeatures at the beginning of 
GeneralizedLinearAlgorithm.run(input, initialWeights) and adds a test case.
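
The failure mode described above can be sketched without Spark. The names below (NumFeaturesSketch, Point, SketchTrainer, featureCount) are hypothetical stand-ins invented for illustration and are not part of MLlib; the sketch only mirrors the numFeatures bookkeeping, assuming a non-empty in-memory Seq in place of an RDD.

```scala
object NumFeaturesSketch {
  // Hypothetical stand-in for LabeledPoint; not the Spark class.
  final case class Point(label: Double, features: Vector[Double])

  // Hypothetical trainer that mirrors only the numFeatures handling.
  class SketchTrainer {
    // -1 means "not yet known", matching the default described above.
    private var numFeatures: Int = -1

    def featureCount: Int = numFeatures

    // Overload without initial weights: infers numFeatures, then delegates.
    def run(input: Seq[Point]): Array[Double] = {
      require(input.nonEmpty, "input must not be empty")
      if (numFeatures < 0) {
        numFeatures = input.head.features.size
      }
      run(input, new Array[Double](numFeatures))
    }

    // Overload with initial weights: the guard below is the analogue of the
    // fix. Without it, a direct call would leave numFeatures at -1.
    def run(input: Seq[Point], initialWeights: Array[Double]): Array[Double] = {
      require(input.nonEmpty, "input must not be empty")
      if (numFeatures < 0) {
        numFeatures = input.head.features.size
      }
      require(initialWeights.length == numFeatures,
        s"expected $numFeatures weights, got ${initialWeights.length}")
      initialWeights // a real trainer would run the optimizer here
    }
  }
}
```

Calling the two-argument overload on a fresh SketchTrainer now leaves featureCount equal to the feature vector size instead of -1.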

Author: Yanbo Liang 

Closes #5167 from yanboliang/spark-6496 and squashes the following commits:

8131c48 [Yanbo Liang] LogisticRegressionWithLBFGS.run(input, initialWeights) 
should initialize numFeatures


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/10c78607
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/10c78607
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/10c78607

Branch: refs/heads/master
Commit: 10c78607b2724f5a64b0cdb966e9c5805f23919b
Parents: 64262ed
Author: Yanbo Liang 
Authored: Wed Mar 25 17:05:56 2015 +
Committer: Sean Owen 
Committed: Wed Mar 25 17:05:56 2015 +

--
 .../spark/mllib/regression/GeneralizedLinearAlgorithm.scala    | 4 ++++
 .../spark/mllib/classification/LogisticRegressionSuite.scala   | 6 ++++++
 2 files changed, 10 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/10c78607/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala b/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
index 45b9ebb..9fd60ff 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
@@ -211,6 +211,10 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
    */
   def run(input: RDD[LabeledPoint], initialWeights: Vector): M = {
 
+    if (numFeatures < 0) {
+      numFeatures = input.map(_.features.size).first()
+    }
+
     if (input.getStorageLevel == StorageLevel.NONE) {
       logWarning("The input data is not directly cached, which may hurt performance if its"
         + " parent RDDs are also uncached.")

http://git-wip-us.apache.org/repos/asf/spark/blob/10c78607/mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala
--
diff --git a/mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala b/mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala
index aaa81da..a26c528 100644
--- a/mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala
@@ -425,6 +425,12 @@ class LogisticRegressionSuite extends FunSuite with MLlibTestSparkContext with M
 
     val model = lr.run(testRDD)
 
+    val numFeatures = testRDD.map(_.features.size).first()
+    val initialWeights = Vectors.dense(new Array[Double]((numFeatures + 1) * 2))
+    val model2 = lr.run(testRDD, initialWeights)
+
+    LogisticRegressionSuite.checkModelsEqual(model, model2)
+
     /**
      * The following is the instruction to reproduce the model using R's glmnet package.
      *
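
The test sizes initialWeights as (numFeatures + 1) * 2. One plausible reading, which the diff itself does not confirm, is that the model has three classes and an intercept, so the flattened weight vector holds one block of (numFeatures + 1) values for each of the numClasses - 1 linear predictors. A minimal sketch of that arithmetic follows; initialWeightsLength and its parameters are hypothetical names, not an MLlib API, and the sketch covers the multiclass case only.

```scala
object InitialWeightsSketch {
  // Assumed layout (not confirmed by the diff): a K-class model uses K - 1
  // linear predictors, each with numFeatures weights plus an optional
  // intercept term, all flattened into one vector.
  def initialWeightsLength(numFeatures: Int, numClasses: Int, addIntercept: Boolean): Int = {
    require(numFeatures > 0, "numFeatures must be positive")
    require(numClasses >= 3, "this sketch covers the multiclass case only")
    val numLinearPredictors = numClasses - 1
    val perPredictor = if (addIntercept) numFeatures + 1 else numFeatures
    perPredictor * numLinearPredictors
  }
}
```

Under these assumptions, a three-class model with an intercept gives (numFeatures + 1) * 2, matching the expression in the test.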


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-6496] [MLLIB] GeneralizedLinearAlgorithm.run(input, initialWeights) should initialize numFeatures

2015-03-25 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-1.3 8e4e2e3f8 -> 2be4255a0


[SPARK-6496] [MLLIB] GeneralizedLinearAlgorithm.run(input, initialWeights) 
should initialize numFeatures

In GeneralizedLinearAlgorithm, ```numFeatures``` defaults to -1, so it must be 
set to the correct value when run() is called to train a model.
```LogisticRegressionWithLBFGS.run(input)``` works, but calling 
```LogisticRegressionWithLBFGS.run(input, initialWeights)``` to train a 
multiclass classification model throws an exception because numFeatures 
is never updated on that path.
This PR updates numFeatures at the beginning of 
GeneralizedLinearAlgorithm.run(input, initialWeights) and adds a test case.

Author: Yanbo Liang 

Closes #5167 from yanboliang/spark-6496 and squashes the following commits:

8131c48 [Yanbo Liang] LogisticRegressionWithLBFGS.run(input, initialWeights) 
should initialize numFeatures

(cherry picked from commit 10c78607b2724f5a64b0cdb966e9c5805f23919b)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2be4255a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2be4255a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2be4255a

Branch: refs/heads/branch-1.3
Commit: 2be4255a05e7a1548f51b02f6bf62507f1c3414b
Parents: 8e4e2e3
Author: Yanbo Liang 
Authored: Wed Mar 25 17:05:56 2015 +
Committer: Sean Owen 
Committed: Wed Mar 25 17:06:04 2015 +

--
 .../spark/mllib/regression/GeneralizedLinearAlgorithm.scala    | 4 ++++
 .../spark/mllib/classification/LogisticRegressionSuite.scala   | 6 ++++++
 2 files changed, 10 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/2be4255a/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala b/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
index 7c66e8c..9a2751a 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
@@ -196,6 +196,10 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
    */
   def run(input: RDD[LabeledPoint], initialWeights: Vector): M = {
 
+    if (numFeatures < 0) {
+      numFeatures = input.map(_.features.size).first()
+    }
+
     if (input.getStorageLevel == StorageLevel.NONE) {
       logWarning("The input data is not directly cached, which may hurt performance if its"
         + " parent RDDs are also uncached.")

http://git-wip-us.apache.org/repos/asf/spark/blob/2be4255a/mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala
--
diff --git a/mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala b/mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala
index aaa81da..a26c528 100644
--- a/mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala
@@ -425,6 +425,12 @@ class LogisticRegressionSuite extends FunSuite with MLlibTestSparkContext with M
 
     val model = lr.run(testRDD)
 
+    val numFeatures = testRDD.map(_.features.size).first()
+    val initialWeights = Vectors.dense(new Array[Double]((numFeatures + 1) * 2))
+    val model2 = lr.run(testRDD, initialWeights)
+
+    LogisticRegressionSuite.checkModelsEqual(model, model2)
+
     /**
      * The following is the instruction to reproduce the model using R's glmnet package.
      *

