spark git commit: [SPARK-13590][ML][DOC] Document spark.ml LiR, LoR and AFTSurvivalRegression behavior difference

2016-06-07 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 890baaca5 -> 6ecedf39b


[SPARK-13590][ML][DOC] Document spark.ml LiR, LoR and AFTSurvivalRegression 
behavior difference

## What changes were proposed in this pull request?
When fitting a ```LinearRegressionModel``` (with the "l-bfgs" solver) or a 
```LogisticRegressionModel``` without an intercept on a dataset with a constant 
nonzero column, spark.ml produces the same model as R glmnet but a different one 
from LIBSVM.

When fitting an ```AFTSurvivalRegressionModel``` without an intercept on a dataset 
with a constant nonzero column, spark.ml produces a different model from R 
survival::survreg.

We should log a warning message and clarify this behavior in the documentation.
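
For reference, a minimal spark-shell-style sketch that reproduces the condition described above (data and column names are made up for illustration; only ```setFitIntercept(false)``` and the constant feature column matter):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("constant-column-demo").getOrCreate()

// A tiny dataset whose second feature is the constant 1.0.
val df = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.1, 1.0)),
  (1.0, Vectors.dense(2.3, 1.0)),
  (0.0, Vectors.dense(0.4, 1.0)),
  (1.0, Vectors.dense(1.9, 1.0))
)).toDF("label", "features")

// Fitting without an intercept hits the documented condition: the coefficient
// for the constant column comes out as zero (same as R glmnet, unlike LIBSVM),
// and with this patch a warning is logged.
val model = new LogisticRegression().setFitIntercept(false).fit(df)
println(model.coefficients)
```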

## How was this patch tested?
Document change, no unit test.

cc mengxr

Author: Yanbo Liang 

Closes #12731 from yanboliang/spark-13590.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6ecedf39
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6ecedf39
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6ecedf39

Branch: refs/heads/master
Commit: 6ecedf39b44c9acd58cdddf1a31cf11e8e24428c
Parents: 890baac
Author: Yanbo Liang 
Authored: Tue Jun 7 15:25:36 2016 -0700
Committer: Yanbo Liang 
Committed: Tue Jun 7 15:25:36 2016 -0700

--
 docs/ml-classification-regression.md| 6 ++
 .../apache/spark/ml/classification/LogisticRegression.scala | 7 +++
 .../apache/spark/ml/regression/AFTSurvivalRegression.scala  | 9 -
 .../org/apache/spark/ml/regression/LinearRegression.scala   | 7 +++
 4 files changed, 28 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6ecedf39/docs/ml-classification-regression.md
--
diff --git a/docs/ml-classification-regression.md 
b/docs/ml-classification-regression.md
index ff8dec6..88457d4 100644
--- a/docs/ml-classification-regression.md
+++ b/docs/ml-classification-regression.md
@@ -62,6 +62,8 @@ For more background and more details about the 
implementation, refer to the docu
 
   > The current implementation of logistic regression in `spark.ml` only 
supports binary classes. Support for multiclass regression will be added in the 
future.
 
+  > When fitting LogisticRegressionModel without intercept on dataset with 
constant nonzero column, Spark MLlib outputs zero coefficients for constant 
nonzero columns. This behavior is the same as R glmnet but different from 
LIBSVM.
+
 **Example**
 
 The following example shows how to train a logistic regression model
@@ -351,6 +353,8 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.classificat
 The interface for working with linear regression models and model
 summaries is similar to the logistic regression case.
 
+  > When fitting LinearRegressionModel without intercept on dataset with 
constant nonzero column by "l-bfgs" solver, Spark MLlib outputs zero 
coefficients for constant nonzero columns. This behavior is the same as R 
glmnet but different from LIBSVM.
+
 **Example**
 
 The following
@@ -666,6 +670,8 @@ The optimization algorithm underlying the implementation is 
L-BFGS.
 The implementation matches the result from R's survival function 
 
[survreg](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html)
 
+  > When fitting AFTSurvivalRegressionModel without intercept on dataset with 
constant nonzero column, Spark MLlib outputs zero coefficients for constant 
nonzero columns. This behavior is different from R survival::survreg.
+
 **Example**
 
 

http://git-wip-us.apache.org/repos/asf/spark/blob/6ecedf39/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
index 1ea4d90..51ede15 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
@@ -333,6 +333,13 @@ class LogisticRegression @Since("1.2.0") (
 val featuresMean = summarizer.mean.toArray
 val featuresStd = summarizer.variance.toArray.map(math.sqrt)
 
+if (!$(fitIntercept) && (0 until numFeatures).exists { i =>
+  featuresStd(i) == 0.0 && featuresMean(i) != 0.0 }) {
+  logWarning("Fitting LogisticRegressionModel without intercept on 
dataset with " +
+"constant nonzero column, Spark MLlib outputs zero coefficients 
for constant " +
+"nonzero columns. This behavior is the same as R glmnet but 
different from LIBSVM.")
+}
+

spark git commit: [SPARK-13590][ML][DOC] Document spark.ml LiR, LoR and AFTSurvivalRegression behavior difference

2016-06-07 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 9e16f23e7 -> e21a9ddef


[SPARK-13590][ML][DOC] Document spark.ml LiR, LoR and AFTSurvivalRegression 
behavior difference

## What changes were proposed in this pull request?
When fitting a ```LinearRegressionModel``` (with the "l-bfgs" solver) or a 
```LogisticRegressionModel``` without an intercept on a dataset with a constant 
nonzero column, spark.ml produces the same model as R glmnet but a different one 
from LIBSVM.

When fitting an ```AFTSurvivalRegressionModel``` without an intercept on a dataset 
with a constant nonzero column, spark.ml produces a different model from R 
survival::survreg.

We should log a warning message and clarify this behavior in the documentation.

## How was this patch tested?
Document change, no unit test.

cc mengxr

Author: Yanbo Liang 

Closes #12731 from yanboliang/spark-13590.

(cherry picked from commit 6ecedf39b44c9acd58cdddf1a31cf11e8e24428c)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e21a9dde
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e21a9dde
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e21a9dde

Branch: refs/heads/branch-2.0
Commit: e21a9ddefed074de84d3b3bb0f347d64b82696c6
Parents: 9e16f23
Author: Yanbo Liang 
Authored: Tue Jun 7 15:25:36 2016 -0700
Committer: Yanbo Liang 
Committed: Tue Jun 7 15:26:14 2016 -0700

--
 docs/ml-classification-regression.md| 6 ++
 .../apache/spark/ml/classification/LogisticRegression.scala | 7 +++
 .../apache/spark/ml/regression/AFTSurvivalRegression.scala  | 9 -
 .../org/apache/spark/ml/regression/LinearRegression.scala   | 7 +++
 4 files changed, 28 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e21a9dde/docs/ml-classification-regression.md
--
diff --git a/docs/ml-classification-regression.md 
b/docs/ml-classification-regression.md
index ff8dec6..88457d4 100644
--- a/docs/ml-classification-regression.md
+++ b/docs/ml-classification-regression.md
@@ -62,6 +62,8 @@ For more background and more details about the 
implementation, refer to the docu
 
   > The current implementation of logistic regression in `spark.ml` only 
supports binary classes. Support for multiclass regression will be added in the 
future.
 
+  > When fitting LogisticRegressionModel without intercept on dataset with 
constant nonzero column, Spark MLlib outputs zero coefficients for constant 
nonzero columns. This behavior is the same as R glmnet but different from 
LIBSVM.
+
 **Example**
 
 The following example shows how to train a logistic regression model
@@ -351,6 +353,8 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.classificat
 The interface for working with linear regression models and model
 summaries is similar to the logistic regression case.
 
+  > When fitting LinearRegressionModel without intercept on dataset with 
constant nonzero column by "l-bfgs" solver, Spark MLlib outputs zero 
coefficients for constant nonzero columns. This behavior is the same as R 
glmnet but different from LIBSVM.
+
 **Example**
 
 The following
@@ -666,6 +670,8 @@ The optimization algorithm underlying the implementation is 
L-BFGS.
 The implementation matches the result from R's survival function 
 
[survreg](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html)
 
+  > When fitting AFTSurvivalRegressionModel without intercept on dataset with 
constant nonzero column, Spark MLlib outputs zero coefficients for constant 
nonzero columns. This behavior is different from R survival::survreg.
+
 **Example**
 
 

http://git-wip-us.apache.org/repos/asf/spark/blob/e21a9dde/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
index 1ea4d90..51ede15 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
@@ -333,6 +333,13 @@ class LogisticRegression @Since("1.2.0") (
 val featuresMean = summarizer.mean.toArray
 val featuresStd = summarizer.variance.toArray.map(math.sqrt)
 
+if (!$(fitIntercept) && (0 until numFeatures).exists { i =>
+  featuresStd(i) == 0.0 && featuresMean(i) != 0.0 }) {
+  logWarning("Fitting LogisticRegressionModel without intercept on 
dataset with " +
+"constant nonzero column, Spark MLlib outputs zero coefficients 
for constant " +
+ 

spark git commit: [SPARK-15738][PYSPARK][ML] Adding Pyspark ml RFormula __str__ method similar to Scala API

2016-06-10 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 254bc8c34 -> 7d7a0a5e0


[SPARK-15738][PYSPARK][ML] Adding Pyspark ml RFormula __str__ method similar to 
Scala API

## What changes were proposed in this pull request?
Add __str__ to RFormula and RFormulaModel so that they show the set formula param 
and the resolved formula. This is already present in the Scala API and was found 
missing in PySpark during the Spark 2.0 coverage review.
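
For context, the Scala side already prints this form; a minimal sketch of the expected string output (the uid suffix is elided, as in the doctest):

```scala
import org.apache.spark.ml.feature.RFormula

val rf = new RFormula()
  .setFormula("y ~ x + s")
  .setFeaturesCol("features")
  .setLabelCol("label")

// Prints something like: RFormula(y ~ x + s) (uid=rFormula_...)
println(rf.toString)
```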

## How was this patch tested?
run pyspark-ml tests locally

Author: Bryan Cutler 

Closes #13481 from BryanCutler/pyspark-ml-rformula_str-SPARK-15738.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7d7a0a5e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7d7a0a5e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7d7a0a5e

Branch: refs/heads/master
Commit: 7d7a0a5e0749909e97d90188707cc9220a1bb73a
Parents: 254bc8c
Author: Bryan Cutler 
Authored: Fri Jun 10 11:27:30 2016 -0700
Committer: Yanbo Liang 
Committed: Fri Jun 10 11:27:30 2016 -0700

--
 .../scala/org/apache/spark/ml/feature/RFormula.scala  |  2 +-
 .../org/apache/spark/ml/feature/RFormulaParser.scala  | 14 +-
 python/pyspark/ml/feature.py  | 12 
 3 files changed, 26 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7d7a0a5e/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
index 2916b6d..a7ca0fe 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
@@ -182,7 +182,7 @@ class RFormula(override val uid: String)
 
   override def copy(extra: ParamMap): RFormula = defaultCopy(extra)
 
-  override def toString: String = s"RFormula(${get(formula)}) (uid=$uid)"
+  override def toString: String = s"RFormula(${get(formula).getOrElse("")}) 
(uid=$uid)"
 }
 
 @Since("2.0.0")

http://git-wip-us.apache.org/repos/asf/spark/blob/7d7a0a5e/mllib/src/main/scala/org/apache/spark/ml/feature/RFormulaParser.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormulaParser.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormulaParser.scala
index 19aecff..2dd565a 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormulaParser.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormulaParser.scala
@@ -126,7 +126,19 @@ private[ml] case class ParsedRFormula(label: ColumnRef, 
terms: Seq[Term]) {
  * @param hasIntercept whether the formula specifies fitting with an intercept.
  */
 private[ml] case class ResolvedRFormula(
-  label: String, terms: Seq[Seq[String]], hasIntercept: Boolean)
+  label: String, terms: Seq[Seq[String]], hasIntercept: Boolean) {
+
+  override def toString: String = {
+val ts = terms.map {
+  case t if t.length > 1 =>
+s"${t.mkString("{", ",", "}")}"
+  case t =>
+t.mkString
+}
+val termStr = ts.mkString("[", ",", "]")
+s"ResolvedRFormula(label=$label, terms=$termStr, 
hasIntercept=$hasIntercept)"
+  }
+}
 
 /**
  * R formula terms. See the R formula docs here for more information:

http://git-wip-us.apache.org/repos/asf/spark/blob/7d7a0a5e/python/pyspark/ml/feature.py
--
diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index bfb2fb7..ca77ac3 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -2528,6 +2528,8 @@ class RFormula(JavaEstimator, HasFeaturesCol, 
HasLabelCol, JavaMLReadable, JavaM
 True
 >>> loadedRF.getLabelCol() == rf.getLabelCol()
 True
+>>> str(loadedRF)
+'RFormula(y ~ x + s) (uid=...)'
 >>> modelPath = temp_path + "/rFormulaModel"
 >>> model.save(modelPath)
 >>> loadedModel = RFormulaModel.load(modelPath)
@@ -2542,6 +2544,8 @@ class RFormula(JavaEstimator, HasFeaturesCol, 
HasLabelCol, JavaMLReadable, JavaM
 |0.0|0.0|  a|[0.0,1.0]|  0.0|
 +---+---+---+-+-+
 ...
+>>> str(loadedModel)
+'RFormulaModel(ResolvedRFormula(label=y, terms=[x,s], hasIntercept=true)) 
(uid=...)'
 
 .. versionadded:: 1.5.0
 """
@@ -2586,6 +2590,10 @@ class RFormula(JavaEstimator, HasFeaturesCol, 
HasLabelCol, JavaMLReadable, JavaM
 def _create_model(self, java_model):
 return RFormulaModel(java_model)
 
+def __str__(self):
+formulaStr = self.getFormula() if self.isDefined(self.formula) else ""
+return "RFormula(%s) (uid=%s)" % (formulaStr, self.uid)

spark git commit: [SPARK-15738][PYSPARK][ML] Adding Pyspark ml RFormula __str__ method similar to Scala API

2016-06-10 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 8b6742a37 -> 80b8711b3


[SPARK-15738][PYSPARK][ML] Adding Pyspark ml RFormula __str__ method similar to 
Scala API

## What changes were proposed in this pull request?
Add __str__ to RFormula and RFormulaModel so that they show the set formula param 
and the resolved formula. This is already present in the Scala API and was found 
missing in PySpark during the Spark 2.0 coverage review.

## How was this patch tested?
run pyspark-ml tests locally

Author: Bryan Cutler 

Closes #13481 from BryanCutler/pyspark-ml-rformula_str-SPARK-15738.

(cherry picked from commit 7d7a0a5e0749909e97d90188707cc9220a1bb73a)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/80b8711b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/80b8711b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/80b8711b

Branch: refs/heads/branch-2.0
Commit: 80b8711b342c5a569fe89d7ffbdd552653b9b6ec
Parents: 8b6742a
Author: Bryan Cutler 
Authored: Fri Jun 10 11:27:30 2016 -0700
Committer: Yanbo Liang 
Committed: Fri Jun 10 14:01:55 2016 -0700

--
 .../scala/org/apache/spark/ml/feature/RFormula.scala  |  2 +-
 .../org/apache/spark/ml/feature/RFormulaParser.scala  | 14 +-
 python/pyspark/ml/feature.py  | 12 
 3 files changed, 26 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/80b8711b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
index 2916b6d..a7ca0fe 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
@@ -182,7 +182,7 @@ class RFormula(override val uid: String)
 
   override def copy(extra: ParamMap): RFormula = defaultCopy(extra)
 
-  override def toString: String = s"RFormula(${get(formula)}) (uid=$uid)"
+  override def toString: String = s"RFormula(${get(formula).getOrElse("")}) 
(uid=$uid)"
 }
 
 @Since("2.0.0")

http://git-wip-us.apache.org/repos/asf/spark/blob/80b8711b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormulaParser.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormulaParser.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormulaParser.scala
index 19aecff..2dd565a 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormulaParser.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormulaParser.scala
@@ -126,7 +126,19 @@ private[ml] case class ParsedRFormula(label: ColumnRef, 
terms: Seq[Term]) {
  * @param hasIntercept whether the formula specifies fitting with an intercept.
  */
 private[ml] case class ResolvedRFormula(
-  label: String, terms: Seq[Seq[String]], hasIntercept: Boolean)
+  label: String, terms: Seq[Seq[String]], hasIntercept: Boolean) {
+
+  override def toString: String = {
+val ts = terms.map {
+  case t if t.length > 1 =>
+s"${t.mkString("{", ",", "}")}"
+  case t =>
+t.mkString
+}
+val termStr = ts.mkString("[", ",", "]")
+s"ResolvedRFormula(label=$label, terms=$termStr, 
hasIntercept=$hasIntercept)"
+  }
+}
 
 /**
  * R formula terms. See the R formula docs here for more information:

http://git-wip-us.apache.org/repos/asf/spark/blob/80b8711b/python/pyspark/ml/feature.py
--
diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index bfb2fb7..ca77ac3 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -2528,6 +2528,8 @@ class RFormula(JavaEstimator, HasFeaturesCol, 
HasLabelCol, JavaMLReadable, JavaM
 True
 >>> loadedRF.getLabelCol() == rf.getLabelCol()
 True
+>>> str(loadedRF)
+'RFormula(y ~ x + s) (uid=...)'
 >>> modelPath = temp_path + "/rFormulaModel"
 >>> model.save(modelPath)
 >>> loadedModel = RFormulaModel.load(modelPath)
@@ -2542,6 +2544,8 @@ class RFormula(JavaEstimator, HasFeaturesCol, 
HasLabelCol, JavaMLReadable, JavaM
 |0.0|0.0|  a|[0.0,1.0]|  0.0|
 +---+---+---+-+-+
 ...
+>>> str(loadedModel)
+'RFormulaModel(ResolvedRFormula(label=y, terms=[x,s], hasIntercept=true)) 
(uid=...)'
 
 .. versionadded:: 1.5.0
 """
@@ -2586,6 +2590,10 @@ class RFormula(JavaEstimator, HasFeaturesCol, 
HasLabelCol, JavaMLReadable, JavaM
 def _create_model(self, java_model):
 return RFormulaModel(java_model)
 
+def __str__(self):
+formulaStr = self.getFormula()

spark git commit: [SPARK-15945][MLLIB] Conversion between old/new vector columns in a DataFrame (Scala/Java)

2016-06-14 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 b75542603 -> f277cdf78


[SPARK-15945][MLLIB] Conversion between old/new vector columns in a DataFrame 
(Scala/Java)

## What changes were proposed in this pull request?

This PR provides conversion utils between old and new vector columns in a 
DataFrame, so users can migrate their datasets and pipelines manually. The methods 
are implemented under `MLUtils` and are called `convertVectorColumnsToML` and 
`convertVectorColumnsFromML`. Both take a DataFrame and a list of vector columns 
to be converted. They are no-ops on vector columns that are already converted, and 
a warning message is logged if an actual conversion happens.

This is the first sub-task under SPARK-15944 to make it easier to migrate 
existing pipelines to Spark 2.0.
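
A minimal usage sketch of the Scala API added here (the data and column names are illustrative only):

```scala
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("vector-conversion-demo").getOrCreate()

// A DataFrame whose "features" column still holds old mllib-style vectors.
val df = spark.createDataFrame(Seq(
  (0, OldVectors.sparse(2, Array(1), Array(1.0))),
  (1, OldVectors.dense(2.0, 3.0))
)).toDF("id", "features")

// Convert only the named column; columns that already use the new spark.ml
// vector type are left untouched, and a warning is logged when an actual
// conversion happens.
val converted = MLUtils.convertVectorColumnsToML(df, "features")
converted.printSchema()
```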

## How was this patch tested?

Unit tests in Scala and Java.

cc: yanboliang

Author: Xiangrui Meng 

Closes #13662 from mengxr/SPARK-15945.

(cherry picked from commit 63e0aebe22ba41c636ecaddd8647721d7690a1ec)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f277cdf7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f277cdf7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f277cdf7

Branch: refs/heads/branch-2.0
Commit: f277cdf787de4402cd6cdba5e15e38bb71d8c5c7
Parents: b755426
Author: Xiangrui Meng 
Authored: Tue Jun 14 18:57:45 2016 -0700
Committer: Yanbo Liang 
Committed: Tue Jun 14 18:58:02 2016 -0700

--
 .../org/apache/spark/mllib/util/MLUtils.scala   | 117 ++-
 .../spark/mllib/util/JavaMLUtilsSuite.java  |  49 
 .../apache/spark/mllib/util/MLUtilsSuite.scala  |  60 +-
 3 files changed, 218 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f277cdf7/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
index f0346e6..7d5bdff 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
@@ -17,14 +17,19 @@
 
 package org.apache.spark.mllib.util
 
+import scala.annotation.varargs
 import scala.reflect.ClassTag
 
 import org.apache.spark.SparkContext
 import org.apache.spark.annotation.Since
-import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, 
Vectors}
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.linalg.{VectorUDT => MLVectorUDT}
+import org.apache.spark.mllib.linalg._
 import org.apache.spark.mllib.linalg.BLAS.dot
 import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.rdd.{PartitionwiseSampledRDD, RDD}
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.functions.{col, udf}
 import org.apache.spark.storage.StorageLevel
 import org.apache.spark.util.random.BernoulliCellSampler
 
@@ -32,7 +37,7 @@ import org.apache.spark.util.random.BernoulliCellSampler
  * Helper methods to load, save and pre-process data used in ML Lib.
  */
 @Since("0.8.0")
-object MLUtils {
+object MLUtils extends Logging {
 
   private[mllib] lazy val EPSILON = {
 var eps = 1.0
@@ -50,7 +55,6 @@ object MLUtils {
* where the indices are one-based and in ascending order.
* This method parses each line into a 
[[org.apache.spark.mllib.regression.LabeledPoint]],
* where the feature indices are converted to zero-based.
-   *
* @param sc Spark context
* @param path file or directory path in any Hadoop-supported file system URI
* @param numFeatures number of features, which will be determined from the 
input data if a
@@ -145,7 +149,6 @@ object MLUtils {
* Save labeled data in LIBSVM format.
* @param data an RDD of LabeledPoint to be saved
* @param dir directory to save the data
-   *
* @see [[org.apache.spark.mllib.util.MLUtils#loadLibSVMFile]]
*/
   @Since("1.0.0")
@@ -254,6 +257,110 @@ object MLUtils {
   }
 
   /**
+   * Converts vector columns in an input Dataset from the 
[[org.apache.spark.mllib.linalg.Vector]]
+   * type to the new [[org.apache.spark.ml.linalg.Vector]] type under the 
`spark.ml` package.
+   * @param dataset input dataset
+   * @param cols a list of vector columns to be converted. New vector columns 
will be ignored. If
+   * unspecified, all old vector columns will be converted except 
nested ones.
+   * @return the input [[DataFrame]] with old vector columns converted to the 
new vector type
+   */
+  @Since("2.0.0")
+  @varargs
+  def convertVectorColumnsToML(dataset: Dataset[_], cols: String*): DataFrame 
= {
+val schema = dataset.schema
+val colSet =

spark git commit: [SPARK-15945][MLLIB] Conversion between old/new vector columns in a DataFrame (Scala/Java)

2016-06-14 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 42a28caf1 -> 63e0aebe2


[SPARK-15945][MLLIB] Conversion between old/new vector columns in a DataFrame 
(Scala/Java)

## What changes were proposed in this pull request?

This PR provides conversion utils between old and new vector columns in a 
DataFrame, so users can migrate their datasets and pipelines manually. The methods 
are implemented under `MLUtils` and are called `convertVectorColumnsToML` and 
`convertVectorColumnsFromML`. Both take a DataFrame and a list of vector columns 
to be converted. They are no-ops on vector columns that are already converted, and 
a warning message is logged if an actual conversion happens.

This is the first sub-task under SPARK-15944 to make it easier to migrate 
existing pipelines to Spark 2.0.

## How was this patch tested?

Unit tests in Scala and Java.

cc: yanboliang

Author: Xiangrui Meng 

Closes #13662 from mengxr/SPARK-15945.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/63e0aebe
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/63e0aebe
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/63e0aebe

Branch: refs/heads/master
Commit: 63e0aebe22ba41c636ecaddd8647721d7690a1ec
Parents: 42a28ca
Author: Xiangrui Meng 
Authored: Tue Jun 14 18:57:45 2016 -0700
Committer: Yanbo Liang 
Committed: Tue Jun 14 18:57:45 2016 -0700

--
 .../org/apache/spark/mllib/util/MLUtils.scala   | 117 ++-
 .../spark/mllib/util/JavaMLUtilsSuite.java  |  49 
 .../apache/spark/mllib/util/MLUtilsSuite.scala  |  60 +-
 3 files changed, 218 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/63e0aebe/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
index f0346e6..7d5bdff 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
@@ -17,14 +17,19 @@
 
 package org.apache.spark.mllib.util
 
+import scala.annotation.varargs
 import scala.reflect.ClassTag
 
 import org.apache.spark.SparkContext
 import org.apache.spark.annotation.Since
-import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, 
Vectors}
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.linalg.{VectorUDT => MLVectorUDT}
+import org.apache.spark.mllib.linalg._
 import org.apache.spark.mllib.linalg.BLAS.dot
 import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.rdd.{PartitionwiseSampledRDD, RDD}
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.functions.{col, udf}
 import org.apache.spark.storage.StorageLevel
 import org.apache.spark.util.random.BernoulliCellSampler
 
@@ -32,7 +37,7 @@ import org.apache.spark.util.random.BernoulliCellSampler
  * Helper methods to load, save and pre-process data used in ML Lib.
  */
 @Since("0.8.0")
-object MLUtils {
+object MLUtils extends Logging {
 
   private[mllib] lazy val EPSILON = {
 var eps = 1.0
@@ -50,7 +55,6 @@ object MLUtils {
* where the indices are one-based and in ascending order.
* This method parses each line into a 
[[org.apache.spark.mllib.regression.LabeledPoint]],
* where the feature indices are converted to zero-based.
-   *
* @param sc Spark context
* @param path file or directory path in any Hadoop-supported file system URI
* @param numFeatures number of features, which will be determined from the 
input data if a
@@ -145,7 +149,6 @@ object MLUtils {
* Save labeled data in LIBSVM format.
* @param data an RDD of LabeledPoint to be saved
* @param dir directory to save the data
-   *
* @see [[org.apache.spark.mllib.util.MLUtils#loadLibSVMFile]]
*/
   @Since("1.0.0")
@@ -254,6 +257,110 @@ object MLUtils {
   }
 
   /**
+   * Converts vector columns in an input Dataset from the 
[[org.apache.spark.mllib.linalg.Vector]]
+   * type to the new [[org.apache.spark.ml.linalg.Vector]] type under the 
`spark.ml` package.
+   * @param dataset input dataset
+   * @param cols a list of vector columns to be converted. New vector columns 
will be ignored. If
+   * unspecified, all old vector columns will be converted except 
nested ones.
+   * @return the input [[DataFrame]] with old vector columns converted to the 
new vector type
+   */
+  @Since("2.0.0")
+  @varargs
+  def convertVectorColumnsToML(dataset: Dataset[_], cols: String*): DataFrame 
= {
+val schema = dataset.schema
+val colSet = if (cols.nonEmpty) {
+  cols.flatMap { c =>
+val dataType = schema(c).dataType
+if (d

spark git commit: [SPARK-15608][ML][EXAMPLES][DOC] add examples and documents of ml.isotonic regression

2016-06-16 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 b3678eb7e -> 68e7a25cc


[SPARK-15608][ML][EXAMPLES][DOC] add examples and documents of ml.isotonic 
regression

## What changes were proposed in this pull request?

* Add ML doc for ml isotonic regression.
* Add Scala, Java, and Python examples for ml isotonic regression (a Scala sketch follows below).
* Modify the Scala, Java, and Python examples for mllib isotonic regression.
* Add data/mllib/sample_isotonic_regression_libsvm_data.txt.
* Delete data/mllib/sample_isotonic_regression_data.txt.
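
A rough sketch of what the new ml-side Scala example covers, using the LIBSVM-format sample file added by this patch (treat it as an illustration rather than the exact example code):

```scala
import org.apache.spark.ml.regression.IsotonicRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("isotonic-regression-demo").getOrCreate()

// The new sample data is in LIBSVM format, so it loads directly via the
// "libsvm" data source.
val dataset = spark.read.format("libsvm")
  .load("data/mllib/sample_isotonic_regression_libsvm_data.txt")

val model = new IsotonicRegression().fit(dataset)

println(s"Boundaries in increasing order: ${model.boundaries}")
println(s"Predictions associated with the boundaries: ${model.predictions}")
model.transform(dataset).show(5)
```
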
## How was this patch tested?

N/A

Author: WeichenXu 

Closes #13381 from WeichenXu123/add_isotonic_regression_doc.

(cherry picked from commit 9040d83bc2cdce06dab0e1bdee4f796da9a9a55c)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/68e7a25c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/68e7a25c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/68e7a25c

Branch: refs/heads/branch-2.0
Commit: 68e7a25cc06cbfe357be8d224c117abaa7ba94f4
Parents: b3678eb
Author: WeichenXu 
Authored: Thu Jun 16 17:35:40 2016 -0700
Committer: Yanbo Liang 
Committed: Thu Jun 16 17:35:51 2016 -0700

--
 data/mllib/sample_isotonic_regression_data.txt  | 100 ---
 .../sample_isotonic_regression_libsvm_data.txt  | 100 +++
 docs/ml-classification-regression.md|  70 +
 .../ml/JavaIsotonicRegressionExample.java   |  62 
 .../mllib/JavaIsotonicRegressionExample.java|  19 ++--
 .../python/ml/isotonic_regression_example.py|  54 ++
 .../python/mllib/isotonic_regression_example.py |  11 +-
 .../examples/ml/IsotonicRegressionExample.scala |  62 
 .../mllib/IsotonicRegressionExample.scala   |   9 +-
 9 files changed, 373 insertions(+), 114 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/68e7a25c/data/mllib/sample_isotonic_regression_data.txt
--
diff --git a/data/mllib/sample_isotonic_regression_data.txt 
b/data/mllib/sample_isotonic_regression_data.txt
deleted file mode 100644
index d257b50..000
--- a/data/mllib/sample_isotonic_regression_data.txt
+++ /dev/null
@@ -1,100 +0,0 @@
-0.24579296,0.01
-0.28505864,0.02
-0.31208567,0.03
-0.35900051,0.04
-0.35747068,0.05
-0.16675166,0.06
-0.17491076,0.07
-0.04181540,0.08
-0.04793473,0.09
-0.03926568,0.10
-0.12952575,0.11
-0.,0.12
-0.01376849,0.13
-0.13105558,0.14
-0.08873024,0.15
-0.12595614,0.16
-0.15247323,0.17
-0.25956145,0.18
-0.20040796,0.19
-0.19581846,0.20
-0.15757267,0.21
-0.13717491,0.22
-0.19020908,0.23
-0.19581846,0.24
-0.20091790,0.25
-0.16879143,0.26
-0.18510964,0.27
-0.20040796,0.28
-0.29576747,0.29
-0.43396226,0.30
-0.53391127,0.31
-0.52116267,0.32
-0.48546660,0.33
-0.49209587,0.34
-0.54156043,0.35
-0.59765426,0.36
-0.56144824,0.37
-0.58592555,0.38
-0.52983172,0.39
-0.50178480,0.40
-0.52626211,0.41
-0.58286588,0.42
-0.64660887,0.43
-0.68077511,0.44
-0.74298827,0.45
-0.64864865,0.46
-0.67261601,0.47
-0.65782764,0.48
-0.69811321,0.49
-0.63029067,0.50
-0.61601224,0.51
-0.63233044,0.52
-0.65323814,0.53
-0.65323814,0.54
-0.67363590,0.55
-0.67006629,0.56
-0.51555329,0.57
-0.50892402,0.58
-0.33299337,0.59
-0.36206017,0.60
-0.43090260,0.61
-0.45996940,0.62
-0.56348802,0.63
-0.54920959,0.64
-0.48393677,0.65
-0.48495665,0.66
-0.46965834,0.67
-0.45181030,0.68
-0.45843957,0.69
-0.47118817,0.70
-0.51555329,0.71
-0.58031617,0.72
-0.55481897,0.73
-0.56297807,0.74
-0.56603774,0.75
-0.57929628,0.76
-0.64762876,0.77
-0.66241713,0.78
-0.69301377,0.79
-0.65119837,0.80
-0.68332483,0.81
-0.66598674,0.82
-0.73890872,0.83
-0.73992861,0.84
-0.84242733,0.85
-0.91330954,0.86
-0.88016318,0.87
-0.90719021,0.88
-0.93115757,0.89
-0.93115757,0.90
-0.91942886,0.91
-0.92911780,0.92
-0.95665477,0.93
-0.95002550,0.94
-0.96940337,0.95
-1.,0.96
-0.89801122,0.97
-0.90311066,0.98
-0.90362060,0.99
-0.83477817,1.0
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/spark/blob/68e7a25c/data/mllib/sample_isotonic_regression_libsvm_data.txt
--
diff --git a/data/mllib/sample_isotonic_regression_libsvm_data.txt 
b/data/mllib/sample_isotonic_regression_libsvm_data.txt
new file mode 100644
index 000..f39fe02
--- /dev/null
+++ b/data/mllib/sample_isotonic_regression_libsvm_data.txt
@@ -0,0 +1,100 @@
+0.24579296 1:0.01
+0.28505864 1:0.02
+0.31208567 1:0.03
+0.35900051 1:0.04
+0.35747068 1:0.05
+0.16675166 1:0.06
+0.17491076 1:0.07
+0.04181540 1:0.08
+0.04793473 1:0.09
+0.03926568 1:0.10
+0.12952575 1:0.11
+0. 1:0.12
+0.01376849 1:0.13

spark git commit: [SPARK-15608][ML][EXAMPLES][DOC] add examples and documents of ml.isotonic regression

2016-06-16 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master d9c6628c4 -> 9040d83bc


[SPARK-15608][ML][EXAMPLES][DOC] add examples and documents of ml.isotonic 
regression

## What changes were proposed in this pull request?

* Add ML doc for ml isotonic regression.
* Add Scala, Java, and Python examples for ml isotonic regression.
* Modify the Scala, Java, and Python examples for mllib isotonic regression.
* Add data/mllib/sample_isotonic_regression_libsvm_data.txt.
* Delete data/mllib/sample_isotonic_regression_data.txt.
## How was this patch tested?

N/A

Author: WeichenXu 

Closes #13381 from WeichenXu123/add_isotonic_regression_doc.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9040d83b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9040d83b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9040d83b

Branch: refs/heads/master
Commit: 9040d83bc2cdce06dab0e1bdee4f796da9a9a55c
Parents: d9c6628
Author: WeichenXu 
Authored: Thu Jun 16 17:35:40 2016 -0700
Committer: Yanbo Liang 
Committed: Thu Jun 16 17:35:40 2016 -0700

--
 data/mllib/sample_isotonic_regression_data.txt  | 100 ---
 .../sample_isotonic_regression_libsvm_data.txt  | 100 +++
 docs/ml-classification-regression.md|  70 +
 .../ml/JavaIsotonicRegressionExample.java   |  62 
 .../mllib/JavaIsotonicRegressionExample.java|  19 ++--
 .../python/ml/isotonic_regression_example.py|  54 ++
 .../python/mllib/isotonic_regression_example.py |  11 +-
 .../examples/ml/IsotonicRegressionExample.scala |  62 
 .../mllib/IsotonicRegressionExample.scala   |   9 +-
 9 files changed, 373 insertions(+), 114 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9040d83b/data/mllib/sample_isotonic_regression_data.txt
--
diff --git a/data/mllib/sample_isotonic_regression_data.txt 
b/data/mllib/sample_isotonic_regression_data.txt
deleted file mode 100644
index d257b50..000
--- a/data/mllib/sample_isotonic_regression_data.txt
+++ /dev/null
@@ -1,100 +0,0 @@
-0.24579296,0.01
-0.28505864,0.02
-0.31208567,0.03
-0.35900051,0.04
-0.35747068,0.05
-0.16675166,0.06
-0.17491076,0.07
-0.04181540,0.08
-0.04793473,0.09
-0.03926568,0.10
-0.12952575,0.11
-0.,0.12
-0.01376849,0.13
-0.13105558,0.14
-0.08873024,0.15
-0.12595614,0.16
-0.15247323,0.17
-0.25956145,0.18
-0.20040796,0.19
-0.19581846,0.20
-0.15757267,0.21
-0.13717491,0.22
-0.19020908,0.23
-0.19581846,0.24
-0.20091790,0.25
-0.16879143,0.26
-0.18510964,0.27
-0.20040796,0.28
-0.29576747,0.29
-0.43396226,0.30
-0.53391127,0.31
-0.52116267,0.32
-0.48546660,0.33
-0.49209587,0.34
-0.54156043,0.35
-0.59765426,0.36
-0.56144824,0.37
-0.58592555,0.38
-0.52983172,0.39
-0.50178480,0.40
-0.52626211,0.41
-0.58286588,0.42
-0.64660887,0.43
-0.68077511,0.44
-0.74298827,0.45
-0.64864865,0.46
-0.67261601,0.47
-0.65782764,0.48
-0.69811321,0.49
-0.63029067,0.50
-0.61601224,0.51
-0.63233044,0.52
-0.65323814,0.53
-0.65323814,0.54
-0.67363590,0.55
-0.67006629,0.56
-0.51555329,0.57
-0.50892402,0.58
-0.33299337,0.59
-0.36206017,0.60
-0.43090260,0.61
-0.45996940,0.62
-0.56348802,0.63
-0.54920959,0.64
-0.48393677,0.65
-0.48495665,0.66
-0.46965834,0.67
-0.45181030,0.68
-0.45843957,0.69
-0.47118817,0.70
-0.51555329,0.71
-0.58031617,0.72
-0.55481897,0.73
-0.56297807,0.74
-0.56603774,0.75
-0.57929628,0.76
-0.64762876,0.77
-0.66241713,0.78
-0.69301377,0.79
-0.65119837,0.80
-0.68332483,0.81
-0.66598674,0.82
-0.73890872,0.83
-0.73992861,0.84
-0.84242733,0.85
-0.91330954,0.86
-0.88016318,0.87
-0.90719021,0.88
-0.93115757,0.89
-0.93115757,0.90
-0.91942886,0.91
-0.92911780,0.92
-0.95665477,0.93
-0.95002550,0.94
-0.96940337,0.95
-1.,0.96
-0.89801122,0.97
-0.90311066,0.98
-0.90362060,0.99
-0.83477817,1.0
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/spark/blob/9040d83b/data/mllib/sample_isotonic_regression_libsvm_data.txt
--
diff --git a/data/mllib/sample_isotonic_regression_libsvm_data.txt 
b/data/mllib/sample_isotonic_regression_libsvm_data.txt
new file mode 100644
index 000..f39fe02
--- /dev/null
+++ b/data/mllib/sample_isotonic_regression_libsvm_data.txt
@@ -0,0 +1,100 @@
+0.24579296 1:0.01
+0.28505864 1:0.02
+0.31208567 1:0.03
+0.35900051 1:0.04
+0.35747068 1:0.05
+0.16675166 1:0.06
+0.17491076 1:0.07
+0.04181540 1:0.08
+0.04793473 1:0.09
+0.03926568 1:0.10
+0.12952575 1:0.11
+0. 1:0.12
+0.01376849 1:0.13
+0.13105558 1:0.14
+0.08873024 1:0.15
+0.12595614 1:0.16
+0.15247323 1:0.17
+0.25956145 1:0.18
+0.20040796

spark git commit: [SPARK-15946][MLLIB] Conversion between old/new vector columns in a DataFrame (Python)

2016-06-17 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master af2a4b082 -> edb23f9e4


[SPARK-15946][MLLIB] Conversion between old/new vector columns in a DataFrame 
(Python)

## What changes were proposed in this pull request?

This PR implements Python wrappers for #13662 to convert between old and new 
vector columns in a DataFrame.

## How was this patch tested?

doctest in Python

cc: yanboliang

Author: Xiangrui Meng 

Closes #13731 from mengxr/SPARK-15946.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/edb23f9e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/edb23f9e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/edb23f9e

Branch: refs/heads/master
Commit: edb23f9e47eecfe60992dde0e037ec1985c77e1d
Parents: af2a4b0
Author: Xiangrui Meng 
Authored: Fri Jun 17 21:22:29 2016 -0700
Committer: Yanbo Liang 
Committed: Fri Jun 17 21:22:29 2016 -0700

--
 .../spark/mllib/api/python/PythonMLLibAPI.scala | 14 
 python/pyspark/mllib/util.py| 82 
 2 files changed, 96 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/edb23f9e/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
index 7df6160..f2c70ba 100644
--- 
a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
@@ -1201,6 +1201,20 @@ private[python] class PythonMLLibAPI extends 
Serializable {
 val spark = SparkSession.builder().sparkContext(sc).getOrCreate()
 spark.createDataFrame(blockMatrix.blocks)
   }
+
+  /**
+   * Python-friendly version of [[MLUtils.convertVectorColumnsToML()]].
+   */
+  def convertVectorColumnsToML(dataset: DataFrame, cols: JArrayList[String]): 
DataFrame = {
+MLUtils.convertVectorColumnsToML(dataset, cols.asScala: _*)
+  }
+
+  /**
+   * Python-friendly version of [[MLUtils.convertVectorColumnsFromML()]]
+   */
+  def convertVectorColumnsFromML(dataset: DataFrame, cols: 
JArrayList[String]): DataFrame = {
+MLUtils.convertVectorColumnsFromML(dataset, cols.asScala: _*)
+  }
 }
 
 /**

http://git-wip-us.apache.org/repos/asf/spark/blob/edb23f9e/python/pyspark/mllib/util.py
--
diff --git a/python/pyspark/mllib/util.py b/python/pyspark/mllib/util.py
index a316ee1..a7e6bcc 100644
--- a/python/pyspark/mllib/util.py
+++ b/python/pyspark/mllib/util.py
@@ -26,6 +26,7 @@ if sys.version > '3':
 from pyspark import SparkContext, since
 from pyspark.mllib.common import callMLlibFunc, inherit_doc
 from pyspark.mllib.linalg import Vectors, SparseVector, _convert_to_vector
+from pyspark.sql import DataFrame
 
 
 class MLUtils(object):
@@ -200,6 +201,86 @@ class MLUtils(object):
 """
 return callMLlibFunc("loadVectors", sc, path)
 
+@staticmethod
+@since("2.0.0")
+def convertVectorColumnsToML(dataset, *cols):
+"""
+Converts vector columns in an input DataFrame from the
+:py:class:`pyspark.mllib.linalg.Vector` type to the new
+:py:class:`pyspark.ml.linalg.Vector` type under the `spark.ml`
+package.
+
+:param dataset:
+  input dataset
+:param cols:
+  a list of vector columns to be converted.
+  New vector columns will be ignored. If unspecified, all old
+  vector columns will be converted excepted nested ones.
+:return:
+  the input dataset with old vector columns converted to the
+  new vector type
+
+>>> import pyspark
+>>> from pyspark.mllib.linalg import Vectors
+>>> from pyspark.mllib.util import MLUtils
+>>> df = spark.createDataFrame(
+... [(0, Vectors.sparse(2, [1], [1.0]), Vectors.dense(2.0, 3.0))],
+... ["id", "x", "y"])
+>>> r1 = MLUtils.convertVectorColumnsToML(df).first()
+>>> isinstance(r1.x, pyspark.ml.linalg.SparseVector)
+True
+>>> isinstance(r1.y, pyspark.ml.linalg.DenseVector)
+True
+>>> r2 = MLUtils.convertVectorColumnsToML(df, "x").first()
+>>> isinstance(r2.x, pyspark.ml.linalg.SparseVector)
+True
+>>> isinstance(r2.y, pyspark.mllib.linalg.DenseVector)
+True
+"""
+if not isinstance(dataset, DataFrame):
+raise TypeError("Input dataset must be a DataFrame but got 
{}.".format(type(dataset)))
+return callMLlibFunc("convertVectorColumnsToML", dataset, list(cols))
+
+@staticmethod
+@since("2.0.0")
+def con

spark git commit: [SPARK-15946][MLLIB] Conversion between old/new vector columns in a DataFrame (Python)

2016-06-17 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 f0de45cb1 -> 0a8fd2eb8


[SPARK-15946][MLLIB] Conversion between old/new vector columns in a DataFrame 
(Python)

## What changes were proposed in this pull request?

This PR implements Python wrappers for #13662 to convert between old and new 
vector columns in a DataFrame.

## How was this patch tested?

doctest in Python

cc: yanboliang

Author: Xiangrui Meng 

Closes #13731 from mengxr/SPARK-15946.

(cherry picked from commit edb23f9e47eecfe60992dde0e037ec1985c77e1d)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0a8fd2eb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0a8fd2eb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0a8fd2eb

Branch: refs/heads/branch-2.0
Commit: 0a8fd2eb8966afaff3030adef5fc6fd73171607c
Parents: f0de45c
Author: Xiangrui Meng 
Authored: Fri Jun 17 21:22:29 2016 -0700
Committer: Yanbo Liang 
Committed: Fri Jun 17 21:22:41 2016 -0700

--
 .../spark/mllib/api/python/PythonMLLibAPI.scala | 14 
 python/pyspark/mllib/util.py| 82 
 2 files changed, 96 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/0a8fd2eb/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
index 7df6160..f2c70ba 100644
--- 
a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
@@ -1201,6 +1201,20 @@ private[python] class PythonMLLibAPI extends 
Serializable {
 val spark = SparkSession.builder().sparkContext(sc).getOrCreate()
 spark.createDataFrame(blockMatrix.blocks)
   }
+
+  /**
+   * Python-friendly version of [[MLUtils.convertVectorColumnsToML()]].
+   */
+  def convertVectorColumnsToML(dataset: DataFrame, cols: JArrayList[String]): 
DataFrame = {
+MLUtils.convertVectorColumnsToML(dataset, cols.asScala: _*)
+  }
+
+  /**
+   * Python-friendly version of [[MLUtils.convertVectorColumnsFromML()]]
+   */
+  def convertVectorColumnsFromML(dataset: DataFrame, cols: 
JArrayList[String]): DataFrame = {
+MLUtils.convertVectorColumnsFromML(dataset, cols.asScala: _*)
+  }
 }
 
 /**

http://git-wip-us.apache.org/repos/asf/spark/blob/0a8fd2eb/python/pyspark/mllib/util.py
--
diff --git a/python/pyspark/mllib/util.py b/python/pyspark/mllib/util.py
index a316ee1..a7e6bcc 100644
--- a/python/pyspark/mllib/util.py
+++ b/python/pyspark/mllib/util.py
@@ -26,6 +26,7 @@ if sys.version > '3':
 from pyspark import SparkContext, since
 from pyspark.mllib.common import callMLlibFunc, inherit_doc
 from pyspark.mllib.linalg import Vectors, SparseVector, _convert_to_vector
+from pyspark.sql import DataFrame
 
 
 class MLUtils(object):
@@ -200,6 +201,86 @@ class MLUtils(object):
 """
 return callMLlibFunc("loadVectors", sc, path)
 
+@staticmethod
+@since("2.0.0")
+def convertVectorColumnsToML(dataset, *cols):
+"""
+Converts vector columns in an input DataFrame from the
+:py:class:`pyspark.mllib.linalg.Vector` type to the new
+:py:class:`pyspark.ml.linalg.Vector` type under the `spark.ml`
+package.
+
+:param dataset:
+  input dataset
+:param cols:
+  a list of vector columns to be converted.
+  New vector columns will be ignored. If unspecified, all old
+  vector columns will be converted excepted nested ones.
+:return:
+  the input dataset with old vector columns converted to the
+  new vector type
+
+>>> import pyspark
+>>> from pyspark.mllib.linalg import Vectors
+>>> from pyspark.mllib.util import MLUtils
+>>> df = spark.createDataFrame(
+... [(0, Vectors.sparse(2, [1], [1.0]), Vectors.dense(2.0, 3.0))],
+... ["id", "x", "y"])
+>>> r1 = MLUtils.convertVectorColumnsToML(df).first()
+>>> isinstance(r1.x, pyspark.ml.linalg.SparseVector)
+True
+>>> isinstance(r1.y, pyspark.ml.linalg.DenseVector)
+True
+>>> r2 = MLUtils.convertVectorColumnsToML(df, "x").first()
+>>> isinstance(r2.x, pyspark.ml.linalg.SparseVector)
+True
+>>> isinstance(r2.y, pyspark.mllib.linalg.DenseVector)
+True
+"""
+if not isinstance(dataset, DataFrame):
+raise TypeError("Input dataset must be a DataFrame but got 
{}.".format(type(dataset)))
+return callMLlibFu

spark git commit: [SPARK-18401][SPARKR][ML] SparkR random forest should support output original label.

2016-11-10 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master a3356343c -> 5ddf69470


[SPARK-18401][SPARKR][ML] SparkR random forest should support output original 
label.

## What changes were proposed in this pull request?
SparkR ```spark.randomForest``` classification prediction should output the 
original label rather than the indexed label. This issue is very similar to 
[SPARK-18291](https://issues.apache.org/jira/browse/SPARK-18291).
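
The R wrapper change follows the usual spark.ml pattern of restoring the original string label with ```IndexToString```; a hypothetical Scala sketch of that mechanism (toy data and made-up column names, not the wrapper code itself):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rf-original-label-demo").getOrCreate()

// Toy iris-like data.
val df = spark.createDataFrame(Seq(
  ("setosa", 1.4, 0.2), ("setosa", 1.3, 0.2), ("setosa", 1.5, 0.1),
  ("versicolor", 4.7, 1.4), ("versicolor", 4.5, 1.5), ("versicolor", 4.9, 1.5)
)).toDF("Species", "Petal_Length", "Petal_Width")

val labelIndexer = new StringIndexer()
  .setInputCol("Species").setOutputCol("label").fit(df)
val assembler = new VectorAssembler()
  .setInputCols(Array("Petal_Length", "Petal_Width")).setOutputCol("features")
val rf = new RandomForestClassifier()

// IndexToString maps the numeric prediction back to the original string label,
// which is what the R wrapper now does before handing results back to SparkR.
val labelConverter = new IndexToString()
  .setInputCol("prediction").setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

val model = new Pipeline()
  .setStages(Array(labelIndexer, assembler, rf, labelConverter))
  .fit(df)

model.transform(df).select("Species", "predictedLabel").show()
```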

## How was this patch tested?
Add unit tests.

Author: Yanbo Liang 

Closes #15842 from yanboliang/spark-18401.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5ddf6947
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5ddf6947
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5ddf6947

Branch: refs/heads/master
Commit: 5ddf69470b93c0b8a28bb4ac905e7670d9c50a95
Parents: a335634
Author: Yanbo Liang 
Authored: Thu Nov 10 17:13:10 2016 -0800
Committer: Yanbo Liang 
Committed: Thu Nov 10 17:13:10 2016 -0800

--
 R/pkg/inst/tests/testthat/test_mllib.R  | 24 +
 .../r/RandomForestClassificationWrapper.scala   | 28 +---
 2 files changed, 48 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5ddf6947/R/pkg/inst/tests/testthat/test_mllib.R
--
diff --git a/R/pkg/inst/tests/testthat/test_mllib.R 
b/R/pkg/inst/tests/testthat/test_mllib.R
index 33e9d0d..b76f75d 100644
--- a/R/pkg/inst/tests/testthat/test_mllib.R
+++ b/R/pkg/inst/tests/testthat/test_mllib.R
@@ -935,6 +935,10 @@ test_that("spark.randomForest Classification", {
   expect_equal(stats$numTrees, 20)
   expect_error(capture.output(stats), NA)
   expect_true(length(capture.output(stats)) > 6)
+  # Test string prediction values
+  predictions <- collect(predict(model, data))$prediction
+  expect_equal(length(grep("setosa", predictions)), 50)
+  expect_equal(length(grep("versicolor", predictions)), 50)
 
   modelPath <- tempfile(pattern = "spark-randomForestClassification", fileext 
= ".tmp")
   write.ml(model, modelPath)
@@ -947,6 +951,26 @@ test_that("spark.randomForest Classification", {
   expect_equal(stats$numClasses, stats2$numClasses)
 
   unlink(modelPath)
+
+  # Test numeric response variable
+  labelToIndex <- function(species) {
+switch(as.character(species),
+  setosa = 0.0,
+  versicolor = 1.0,
+  virginica = 2.0
+)
+  }
+  iris$NumericSpecies <- lapply(iris$Species, labelToIndex)
+  data <- suppressWarnings(createDataFrame(iris[-5]))
+  model <- spark.randomForest(data, NumericSpecies ~ Petal_Length + 
Petal_Width, "classification",
+  maxDepth = 5, maxBins = 16)
+  stats <- summary(model)
+  expect_equal(stats$numFeatures, 2)
+  expect_equal(stats$numTrees, 20)
+  # Test numeric prediction values
+  predictions <- collect(predict(model, data))$prediction
+  expect_equal(length(grep("1.0", predictions)), 50)
+  expect_equal(length(grep("2.0", predictions)), 50)
 })
 
 test_that("spark.gbt", {

http://git-wip-us.apache.org/repos/asf/spark/blob/5ddf6947/mllib/src/main/scala/org/apache/spark/ml/r/RandomForestClassificationWrapper.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/r/RandomForestClassificationWrapper.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/r/RandomForestClassificationWrapper.scala
index 6947ba7..31f846d 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/r/RandomForestClassificationWrapper.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/r/RandomForestClassificationWrapper.scala
@@ -23,9 +23,9 @@ import org.json4s.JsonDSL._
 import org.json4s.jackson.JsonMethods._
 
 import org.apache.spark.ml.{Pipeline, PipelineModel}
-import org.apache.spark.ml.attribute.AttributeGroup
+import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, 
NominalAttribute}
 import org.apache.spark.ml.classification.{RandomForestClassificationModel, 
RandomForestClassifier}
-import org.apache.spark.ml.feature.RFormula
+import org.apache.spark.ml.feature.{IndexToString, RFormula}
 import org.apache.spark.ml.linalg.Vector
 import org.apache.spark.ml.util._
 import org.apache.spark.sql.{DataFrame, Dataset}
@@ -35,6 +35,8 @@ private[r] class RandomForestClassifierWrapper private (
   val formula: String,
   val features: Array[String]) extends MLWritable {
 
+  import RandomForestClassifierWrapper._
+
   private val rfcModel: RandomForestClassificationModel =
 pipeline.stages(1).asInstanceOf[RandomForestClassificationModel]
 
@@ -46,7 +48,9 @@ private[r] class RandomForestClassifierWrapper private (
   def summary: String = rfcModel.toDebugString
 
   def transform(dataset: Dataset[_]): DataFrame = {
-pipeline.transform(d

spark git commit: [SPARK-18401][SPARKR][ML] SparkR random forest should support output original label.

2016-11-10 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 064d4315f -> 51dca6143


[SPARK-18401][SPARKR][ML] SparkR random forest should support output original 
label.

## What changes were proposed in this pull request?
SparkR ```spark.randomForest``` classification prediction should output the 
original label rather than the indexed label. This issue is very similar to 
[SPARK-18291](https://issues.apache.org/jira/browse/SPARK-18291).

## How was this patch tested?
Add unit tests.

Author: Yanbo Liang 

Closes #15842 from yanboliang/spark-18401.

(cherry picked from commit 5ddf69470b93c0b8a28bb4ac905e7670d9c50a95)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/51dca614
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/51dca614
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/51dca614

Branch: refs/heads/branch-2.1
Commit: 51dca6143670ec1c1cb090047c3941becaf41fa9
Parents: 064d431
Author: Yanbo Liang 
Authored: Thu Nov 10 17:13:10 2016 -0800
Committer: Yanbo Liang 
Committed: Thu Nov 10 17:13:26 2016 -0800

--
 R/pkg/inst/tests/testthat/test_mllib.R  | 24 +
 .../r/RandomForestClassificationWrapper.scala   | 28 +---
 2 files changed, 48 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/51dca614/R/pkg/inst/tests/testthat/test_mllib.R
--
diff --git a/R/pkg/inst/tests/testthat/test_mllib.R 
b/R/pkg/inst/tests/testthat/test_mllib.R
index 1e456ef..33e85b7 100644
--- a/R/pkg/inst/tests/testthat/test_mllib.R
+++ b/R/pkg/inst/tests/testthat/test_mllib.R
@@ -935,6 +935,10 @@ test_that("spark.randomForest Classification", {
   expect_equal(stats$numTrees, 20)
   expect_error(capture.output(stats), NA)
   expect_true(length(capture.output(stats)) > 6)
+  # Test string prediction values
+  predictions <- collect(predict(model, data))$prediction
+  expect_equal(length(grep("setosa", predictions)), 50)
+  expect_equal(length(grep("versicolor", predictions)), 50)
 
   modelPath <- tempfile(pattern = "spark-randomForestClassification", fileext 
= ".tmp")
   write.ml(model, modelPath)
@@ -947,6 +951,26 @@ test_that("spark.randomForest Classification", {
   expect_equal(stats$numClasses, stats2$numClasses)
 
   unlink(modelPath)
+
+  # Test numeric response variable
+  labelToIndex <- function(species) {
+switch(as.character(species),
+  setosa = 0.0,
+  versicolor = 1.0,
+  virginica = 2.0
+)
+  }
+  iris$NumericSpecies <- lapply(iris$Species, labelToIndex)
+  data <- suppressWarnings(createDataFrame(iris[-5]))
+  model <- spark.randomForest(data, NumericSpecies ~ Petal_Length + 
Petal_Width, "classification",
+  maxDepth = 5, maxBins = 16)
+  stats <- summary(model)
+  expect_equal(stats$numFeatures, 2)
+  expect_equal(stats$numTrees, 20)
+  # Test numeric prediction values
+  predictions <- collect(predict(model, data))$prediction
+  expect_equal(length(grep("1.0", predictions)), 50)
+  expect_equal(length(grep("2.0", predictions)), 50)
 })
 
 test_that("spark.gbt", {

http://git-wip-us.apache.org/repos/asf/spark/blob/51dca614/mllib/src/main/scala/org/apache/spark/ml/r/RandomForestClassificationWrapper.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/r/RandomForestClassificationWrapper.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/r/RandomForestClassificationWrapper.scala
index 6947ba7..31f846d 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/r/RandomForestClassificationWrapper.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/r/RandomForestClassificationWrapper.scala
@@ -23,9 +23,9 @@ import org.json4s.JsonDSL._
 import org.json4s.jackson.JsonMethods._
 
 import org.apache.spark.ml.{Pipeline, PipelineModel}
-import org.apache.spark.ml.attribute.AttributeGroup
+import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, 
NominalAttribute}
 import org.apache.spark.ml.classification.{RandomForestClassificationModel, 
RandomForestClassifier}
-import org.apache.spark.ml.feature.RFormula
+import org.apache.spark.ml.feature.{IndexToString, RFormula}
 import org.apache.spark.ml.linalg.Vector
 import org.apache.spark.ml.util._
 import org.apache.spark.sql.{DataFrame, Dataset}
@@ -35,6 +35,8 @@ private[r] class RandomForestClassifierWrapper private (
   val formula: String,
   val features: Array[String]) extends MLWritable {
 
+  import RandomForestClassifierWrapper._
+
   private val rfcModel: RandomForestClassificationModel =
 pipeline.stages(1).asInstanceOf[RandomForestClassificationModel]
 
@@ -46,7 +48,9 @@ private[r] class RandomForestClassifierWrapper private (
   def summary: String

spark git commit: [SPARK-14077][ML][FOLLOW-UP] Minor refactor and cleanup for NaiveBayes

2016-11-12 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master bc41d997e -> 22cb3a060


[SPARK-14077][ML][FOLLOW-UP] Minor refactor and cleanup for NaiveBayes

## What changes were proposed in this pull request?
* Refactor out ```trainWithLabelCheck``` and make ```mllib.NaiveBayes``` call into it.
* Avoid capturing the outer object for ```modelType``` (see the sketch after this list).
* Move ```requireNonnegativeValues``` and ```requireZeroOneBernoulliValues``` to the companion object.
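
A self-contained, hypothetical sketch of the closure-capture point in the second bullet (not the actual NaiveBayes code): reading the field into a local val first means the closure captures only a String instead of the whole enclosing object.

```scala
class Example(val modelType: String) {  // deliberately not Serializable

  // Captures `this` (the whole Example instance), because `modelType` here is
  // a field access on the outer object.
  def validatorCapturingThis: Double => Boolean =
    x => if (modelType == "bernoulli") x == 0.0 || x == 1.0 else x >= 0.0

  // Captures only a String, so the closure can be shipped to executors without
  // dragging `this` along.
  def validatorCapturingValue: Double => Boolean = {
    val localModelType = modelType
    x => if (localModelType == "bernoulli") x == 0.0 || x == 1.0 else x >= 0.0
  }
}
```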

## How was this patch tested?
Existing tests.

Author: Yanbo Liang 

Closes #15826 from yanboliang/spark-14077-2.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/22cb3a06
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/22cb3a06
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/22cb3a06

Branch: refs/heads/master
Commit: 22cb3a060a440205281b71686637679645454ca6
Parents: bc41d99
Author: Yanbo Liang 
Authored: Sat Nov 12 06:13:22 2016 -0800
Committer: Yanbo Liang 
Committed: Sat Nov 12 06:13:22 2016 -0800

--
 .../spark/ml/classification/NaiveBayes.scala| 72 ++--
 .../spark/mllib/classification/NaiveBayes.scala |  6 +-
 2 files changed, 39 insertions(+), 39 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/22cb3a06/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala 
b/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
index b03a07a..f1a7676 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
@@ -76,7 +76,7 @@ class NaiveBayes @Since("1.5.0") (
   extends ProbabilisticClassifier[Vector, NaiveBayes, NaiveBayesModel]
   with NaiveBayesParams with DefaultParamsWritable {
 
-  import NaiveBayes.{Bernoulli, Multinomial}
+  import NaiveBayes._
 
   @Since("1.5.0")
   def this() = this(Identifiable.randomUID("nb"))
@@ -110,21 +110,20 @@ class NaiveBayes @Since("1.5.0") (
   @Since("2.1.0")
   def setWeightCol(value: String): this.type = set(weightCol, value)
 
+  override protected def train(dataset: Dataset[_]): NaiveBayesModel = {
+trainWithLabelCheck(dataset, positiveLabel = true)
+  }
+
   /**
* ml assumes input labels in range [0, numClasses). But this implementation
* is also called by mllib NaiveBayes which allows other kinds of input 
labels
-   * such as {-1, +1}. Here we use this parameter to switch between different 
processing logic.
-   * It should be removed when we remove mllib NaiveBayes.
+   * such as {-1, +1}. `positiveLabel` is used to determine whether the label
+   * should be checked and it should be removed when we remove mllib 
NaiveBayes.
*/
-  private[spark] var isML: Boolean = true
-
-  private[spark] def setIsML(isML: Boolean): this.type = {
-this.isML = isML
-this
-  }
-
-  override protected def train(dataset: Dataset[_]): NaiveBayesModel = {
-if (isML) {
+  private[spark] def trainWithLabelCheck(
+  dataset: Dataset[_],
+  positiveLabel: Boolean): NaiveBayesModel = {
+if (positiveLabel) {
   val numClasses = getNumClasses(dataset)
   if (isDefined(thresholds)) {
 require($(thresholds).length == numClasses, 
this.getClass.getSimpleName +
@@ -133,28 +132,9 @@ class NaiveBayes @Since("1.5.0") (
   }
 }
 
-val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
-  val values = v match {
-case sv: SparseVector => sv.values
-case dv: DenseVector => dv.values
-  }
-
-  require(values.forall(_ >= 0.0),
-s"Naive Bayes requires nonnegative feature values but found $v.")
-}
-
-val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
-  val values = v match {
-case sv: SparseVector => sv.values
-case dv: DenseVector => dv.values
-  }
-
-  require(values.forall(v => v == 0.0 || v == 1.0),
-s"Bernoulli naive Bayes requires 0 or 1 feature values but found $v.")
-}
-
+val modelTypeValue = $(modelType)
 val requireValues: Vector => Unit = {
-  $(modelType) match {
+  modelTypeValue match {
 case Multinomial =>
   requireNonnegativeValues
 case Bernoulli =>
@@ -226,13 +206,33 @@ class NaiveBayes @Since("1.5.0") (
 @Since("1.6.0")
 object NaiveBayes extends DefaultParamsReadable[NaiveBayes] {
   /** String name for multinomial model type. */
-  private[spark] val Multinomial: String = "multinomial"
+  private[classification] val Multinomial: String = "multinomial"
 
   /** String name for Bernoulli model type. */
-  private[spark] val Bernoulli: String = "bernoulli"
+  private[classification

spark git commit: [SPARK-14077][ML][FOLLOW-UP] Minor refactor and cleanup for NaiveBayes

2016-11-12 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 893355143 -> b2ba83d10


[SPARK-14077][ML][FOLLOW-UP] Minor refactor and cleanup for NaiveBayes

## What changes were proposed in this pull request?
* Refactor out ```trainWithLabelCheck``` and make ```mllib.NaiveBayes``` call 
into it.
* Avoid capturing the outer object for ```modelType```.
* Move ```requireNonnegativeValues``` and ```requireZeroOneBernoulliValues``` 
to companion object.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang 

Closes #15826 from yanboliang/spark-14077-2.

(cherry picked from commit 22cb3a060a440205281b71686637679645454ca6)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b2ba83d1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b2ba83d1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b2ba83d1

Branch: refs/heads/branch-2.1
Commit: b2ba83d10ac06614c0126f4b0d913f6979051682
Parents: 8933551
Author: Yanbo Liang 
Authored: Sat Nov 12 06:13:22 2016 -0800
Committer: Yanbo Liang 
Committed: Sat Nov 12 06:18:45 2016 -0800

--
 .../spark/ml/classification/NaiveBayes.scala| 72 ++--
 .../spark/mllib/classification/NaiveBayes.scala |  6 +-
 2 files changed, 39 insertions(+), 39 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b2ba83d1/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala 
b/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
index b03a07a..f1a7676 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
@@ -76,7 +76,7 @@ class NaiveBayes @Since("1.5.0") (
   extends ProbabilisticClassifier[Vector, NaiveBayes, NaiveBayesModel]
   with NaiveBayesParams with DefaultParamsWritable {
 
-  import NaiveBayes.{Bernoulli, Multinomial}
+  import NaiveBayes._
 
   @Since("1.5.0")
   def this() = this(Identifiable.randomUID("nb"))
@@ -110,21 +110,20 @@ class NaiveBayes @Since("1.5.0") (
   @Since("2.1.0")
   def setWeightCol(value: String): this.type = set(weightCol, value)
 
+  override protected def train(dataset: Dataset[_]): NaiveBayesModel = {
+trainWithLabelCheck(dataset, positiveLabel = true)
+  }
+
   /**
* ml assumes input labels in range [0, numClasses). But this implementation
* is also called by mllib NaiveBayes which allows other kinds of input 
labels
-   * such as {-1, +1}. Here we use this parameter to switch between different 
processing logic.
-   * It should be removed when we remove mllib NaiveBayes.
+   * such as {-1, +1}. `positiveLabel` is used to determine whether the label
+   * should be checked and it should be removed when we remove mllib 
NaiveBayes.
*/
-  private[spark] var isML: Boolean = true
-
-  private[spark] def setIsML(isML: Boolean): this.type = {
-this.isML = isML
-this
-  }
-
-  override protected def train(dataset: Dataset[_]): NaiveBayesModel = {
-if (isML) {
+  private[spark] def trainWithLabelCheck(
+  dataset: Dataset[_],
+  positiveLabel: Boolean): NaiveBayesModel = {
+if (positiveLabel) {
   val numClasses = getNumClasses(dataset)
   if (isDefined(thresholds)) {
 require($(thresholds).length == numClasses, 
this.getClass.getSimpleName +
@@ -133,28 +132,9 @@ class NaiveBayes @Since("1.5.0") (
   }
 }
 
-val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
-  val values = v match {
-case sv: SparseVector => sv.values
-case dv: DenseVector => dv.values
-  }
-
-  require(values.forall(_ >= 0.0),
-s"Naive Bayes requires nonnegative feature values but found $v.")
-}
-
-val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
-  val values = v match {
-case sv: SparseVector => sv.values
-case dv: DenseVector => dv.values
-  }
-
-  require(values.forall(v => v == 0.0 || v == 1.0),
-s"Bernoulli naive Bayes requires 0 or 1 feature values but found $v.")
-}
-
+val modelTypeValue = $(modelType)
 val requireValues: Vector => Unit = {
-  $(modelType) match {
+  modelTypeValue match {
 case Multinomial =>
   requireNonnegativeValues
 case Bernoulli =>
@@ -226,13 +206,33 @@ class NaiveBayes @Since("1.5.0") (
 @Since("1.6.0")
 object NaiveBayes extends DefaultParamsReadable[NaiveBayes] {
   /** String name for multinomial model type. */
-  private[spark] val Multinomial: String = "multinomial"
+  private[classification] val Multinomial: String = "multinomial"
 
   /** String name fo

spark git commit: [SPARK-18412][SPARKR][ML] Fix exception for some SparkR ML algorithms training on libsvm data

2016-11-13 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master b91a51bb2 -> 07be232ea


[SPARK-18412][SPARKR][ML] Fix exception for some SparkR ML algorithms training 
on libsvm data

## What changes were proposed in this pull request?
* Fix the following exception, which is thrown when ```spark.randomForest``` (classification),
```spark.gbt``` (classification), ```spark.naiveBayes``` and ```spark.glm``` (binomial family)
are fitted on libsvm data.
```
java.lang.IllegalArgumentException: requirement failed: If label column already 
exists, forceIndexLabel can not be set with true.
```
See [SPARK-18412](https://issues.apache.org/jira/browse/SPARK-18412) for more detail about
how to reproduce this bug; a sketch of the relevant guard follows this list.
* Refactor out ```getFeaturesAndLabels``` to RWrapperUtils, since lots of ML 
algorithm wrappers use this function.
* Drop some unwanted columns when making prediction.
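
The exception quoted in the first bullet is an ```RFormula``` precondition: ```forceIndexLabel```
cannot be true when the label column already exists, which is exactly the situation for libsvm
input (it already provides ```label``` and ```features``` columns). A hedged, spark-shell-style
sketch of that kind of guard; ```rFormulaFor``` and the crude label extraction are illustrative
assumptions, not the actual patch:
```
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.Dataset

// Only force label indexing when the formula has to create the label column itself.
def rFormulaFor(formula: String, data: Dataset[_]): RFormula = {
  val labelCol = formula.split("~")(0).trim          // crude label extraction, illustration only
  val labelAlreadyExists = data.schema.fieldNames.contains(labelCol)
  new RFormula()
    .setFormula(formula)
    .setForceIndexLabel(!labelAlreadyExists)
}
```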

## How was this patch tested?
Add unit test.

Author: Yanbo Liang 

Closes #15851 from yanboliang/spark-18412.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/07be232e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/07be232e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/07be232e

Branch: refs/heads/master
Commit: 07be232ea12dfc8dc3701ca948814be7dbebf4ee
Parents: b91a51b
Author: Yanbo Liang 
Authored: Sun Nov 13 20:25:12 2016 -0800
Committer: Yanbo Liang 
Committed: Sun Nov 13 20:25:12 2016 -0800

--
 R/pkg/inst/tests/testthat/test_mllib.R  | 18 --
 .../spark/ml/r/GBTClassificationWrapper.scala   | 18 --
 .../r/GeneralizedLinearRegressionWrapper.scala  |  5 ++-
 .../apache/spark/ml/r/NaiveBayesWrapper.scala   | 14 +++-
 .../org/apache/spark/ml/r/RWrapperUtils.scala   | 36 +---
 .../r/RandomForestClassificationWrapper.scala   | 18 --
 6 files changed, 68 insertions(+), 41 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/07be232e/R/pkg/inst/tests/testthat/test_mllib.R
--
diff --git a/R/pkg/inst/tests/testthat/test_mllib.R 
b/R/pkg/inst/tests/testthat/test_mllib.R
index b76f75d..07df4b6 100644
--- a/R/pkg/inst/tests/testthat/test_mllib.R
+++ b/R/pkg/inst/tests/testthat/test_mllib.R
@@ -881,7 +881,8 @@ test_that("spark.kstest", {
   expect_match(capture.output(stats)[1], "Kolmogorov-Smirnov test summary:")
 })
 
-test_that("spark.randomForest Regression", {
+test_that("spark.randomForest", {
+  # regression
   data <- suppressWarnings(createDataFrame(longley))
   model <- spark.randomForest(data, Employed ~ ., "regression", maxDepth = 5, 
maxBins = 16,
   numTrees = 1)
@@ -923,9 +924,8 @@ test_that("spark.randomForest Regression", {
   expect_equal(stats$treeWeights, stats2$treeWeights)
 
   unlink(modelPath)
-})
 
-test_that("spark.randomForest Classification", {
+  # classification
   data <- suppressWarnings(createDataFrame(iris))
   model <- spark.randomForest(data, Species ~ Petal_Length + Petal_Width, 
"classification",
   maxDepth = 5, maxBins = 16)
@@ -971,6 +971,12 @@ test_that("spark.randomForest Classification", {
   predictions <- collect(predict(model, data))$prediction
   expect_equal(length(grep("1.0", predictions)), 50)
   expect_equal(length(grep("2.0", predictions)), 50)
+
+  # spark.randomForest classification can work on libsvm data
+  data <- 
read.df(absoluteSparkPath("data/mllib/sample_multiclass_classification_data.txt"),
+source = "libsvm")
+  model <- spark.randomForest(data, label ~ features, "classification")
+  expect_equal(summary(model)$numFeatures, 4)
 })
 
 test_that("spark.gbt", {
@@ -1039,6 +1045,12 @@ test_that("spark.gbt", {
   expect_equal(iris2$NumericSpecies, as.double(collect(predict(m, 
df))$prediction))
   expect_equal(s$numFeatures, 5)
   expect_equal(s$numTrees, 20)
+
+  # spark.gbt classification can work on libsvm data
+  data <- 
read.df(absoluteSparkPath("data/mllib/sample_binary_classification_data.txt"),
+source = "libsvm")
+  model <- spark.gbt(data, label ~ features, "classification")
+  expect_equal(summary(model)$numFeatures, 692)
 })
 
 sparkR.session.stop()

http://git-wip-us.apache.org/repos/asf/spark/blob/07be232e/mllib/src/main/scala/org/apache/spark/ml/r/GBTClassificationWrapper.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/r/GBTClassificationWrapper.scala 
b/mllib/src/main/scala/org/apache/spark/ml/r/GBTClassificationWrapper.scala
index 8946025..aacb41e 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/r/GBTClassificationWrapper.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/r/GBTClassificationWrapper.scala
@@ -23,10 +23,10 @@ import org.json4s.JsonDSL._
 

spark git commit: [SPARK-18412][SPARKR][ML] Fix exception for some SparkR ML algorithms training on libsvm data

2016-11-13 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 0c69224ed -> 8fc6455c0


[SPARK-18412][SPARKR][ML] Fix exception for some SparkR ML algorithms training 
on libsvm data

## What changes were proposed in this pull request?
* Fix the following exception, which is thrown when ```spark.randomForest``` (classification),
```spark.gbt``` (classification), ```spark.naiveBayes``` and ```spark.glm``` (binomial family)
are fitted on libsvm data.
```
java.lang.IllegalArgumentException: requirement failed: If label column already 
exists, forceIndexLabel can not be set with true.
```
See [SPARK-18412](https://issues.apache.org/jira/browse/SPARK-18412) for more 
detail about how to reproduce this bug.
* Refactor out ```getFeaturesAndLabels``` to RWrapperUtils, since lots of ML 
algorithm wrappers use this function.
* Drop some unwanted columns when making prediction.

## How was this patch tested?
Add unit test.

Author: Yanbo Liang 

Closes #15851 from yanboliang/spark-18412.

(cherry picked from commit 07be232ea12dfc8dc3701ca948814be7dbebf4ee)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8fc6455c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8fc6455c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8fc6455c

Branch: refs/heads/branch-2.1
Commit: 8fc6455c0b77f81be79908bb65e6264bf61c90e7
Parents: 0c69224
Author: Yanbo Liang 
Authored: Sun Nov 13 20:25:12 2016 -0800
Committer: Yanbo Liang 
Committed: Sun Nov 13 20:25:30 2016 -0800

--
 R/pkg/inst/tests/testthat/test_mllib.R  | 18 --
 .../spark/ml/r/GBTClassificationWrapper.scala   | 18 --
 .../r/GeneralizedLinearRegressionWrapper.scala  |  5 ++-
 .../apache/spark/ml/r/NaiveBayesWrapper.scala   | 14 +++-
 .../org/apache/spark/ml/r/RWrapperUtils.scala   | 36 +---
 .../r/RandomForestClassificationWrapper.scala   | 18 --
 6 files changed, 68 insertions(+), 41 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8fc6455c/R/pkg/inst/tests/testthat/test_mllib.R
--
diff --git a/R/pkg/inst/tests/testthat/test_mllib.R 
b/R/pkg/inst/tests/testthat/test_mllib.R
index 33e85b7..4831ce2 100644
--- a/R/pkg/inst/tests/testthat/test_mllib.R
+++ b/R/pkg/inst/tests/testthat/test_mllib.R
@@ -881,7 +881,8 @@ test_that("spark.kstest", {
   expect_match(capture.output(stats)[1], "Kolmogorov-Smirnov test summary:")
 })
 
-test_that("spark.randomForest Regression", {
+test_that("spark.randomForest", {
+  # regression
   data <- suppressWarnings(createDataFrame(longley))
   model <- spark.randomForest(data, Employed ~ ., "regression", maxDepth = 5, 
maxBins = 16,
   numTrees = 1)
@@ -923,9 +924,8 @@ test_that("spark.randomForest Regression", {
   expect_equal(stats$treeWeights, stats2$treeWeights)
 
   unlink(modelPath)
-})
 
-test_that("spark.randomForest Classification", {
+  # classification
   data <- suppressWarnings(createDataFrame(iris))
   model <- spark.randomForest(data, Species ~ Petal_Length + Petal_Width, 
"classification",
   maxDepth = 5, maxBins = 16)
@@ -971,6 +971,12 @@ test_that("spark.randomForest Classification", {
   predictions <- collect(predict(model, data))$prediction
   expect_equal(length(grep("1.0", predictions)), 50)
   expect_equal(length(grep("2.0", predictions)), 50)
+
+  # spark.randomForest classification can work on libsvm data
+  data <- 
read.df(absoluteSparkPath("data/mllib/sample_multiclass_classification_data.txt"),
+source = "libsvm")
+  model <- spark.randomForest(data, label ~ features, "classification")
+  expect_equal(summary(model)$numFeatures, 4)
 })
 
 test_that("spark.gbt", {
@@ -1039,6 +1045,12 @@ test_that("spark.gbt", {
   expect_equal(iris2$NumericSpecies, as.double(collect(predict(m, 
df))$prediction))
   expect_equal(s$numFeatures, 5)
   expect_equal(s$numTrees, 20)
+
+  # spark.gbt classification can work on libsvm data
+  data <- 
read.df(absoluteSparkPath("data/mllib/sample_binary_classification_data.txt"),
+source = "libsvm")
+  model <- spark.gbt(data, label ~ features, "classification")
+  expect_equal(summary(model)$numFeatures, 692)
 })
 
 sparkR.session.stop()

http://git-wip-us.apache.org/repos/asf/spark/blob/8fc6455c/mllib/src/main/scala/org/apache/spark/ml/r/GBTClassificationWrapper.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/r/GBTClassificationWrapper.scala 
b/mllib/src/main/scala/org/apache/spark/ml/r/GBTClassificationWrapper.scala
index 8946025..aacb41e 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/r/GBTClassificationWrapper.scala
+++ b/mllib/src/main/sc

spark git commit: [SPARK-18438][SPARKR][ML] spark.mlp should support RFormula.

2016-11-16 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 4ac9759f8 -> 95eb06bd7


[SPARK-18438][SPARKR][ML] spark.mlp should support RFormula.

## What changes were proposed in this pull request?
```spark.mlp``` should support ```RFormula``` like the other ML algorithm wrappers.
This patch also does some cleanup and improvement for ```spark.mlp```.
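
On the Scala side this amounts to putting an ```RFormula``` stage in front of the classifier,
as the other wrappers do. A hedged, spark-shell-style sketch of that wiring; the helper name,
layer sizes and default column names are illustrative assumptions, not taken from the patch:
```
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.DataFrame

// RFormula materializes the default "features"/"label" columns from the user
// formula, and the MLP classifier then consumes them.
def fitMlpWithFormula(data: DataFrame, formula: String, layers: Array[Int]) = {
  val rFormula = new RFormula().setFormula(formula)   // e.g. "label ~ features"
  val mlp = new MultilayerPerceptronClassifier()
    .setLayers(layers)                                // e.g. Array(4, 5, 4, 3)
    .setBlockSize(128)
    .setMaxIter(100)
  new Pipeline().setStages(Array[PipelineStage](rFormula, mlp)).fit(data)
}
```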

## How was this patch tested?
Unit tests.

Author: Yanbo Liang 

Closes #15883 from yanboliang/spark-18438.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/95eb06bd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/95eb06bd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/95eb06bd

Branch: refs/heads/master
Commit: 95eb06bd7d0f7110ef62c8d1cb6337c72b10d99f
Parents: 4ac9759
Author: Yanbo Liang 
Authored: Wed Nov 16 01:04:18 2016 -0800
Committer: Yanbo Liang 
Committed: Wed Nov 16 01:04:18 2016 -0800

--
 R/pkg/R/generics.R  |  2 +-
 R/pkg/R/mllib.R | 30 ++
 R/pkg/inst/tests/testthat/test_mllib.R  | 63 +---
 .../MultilayerPerceptronClassifierWrapper.scala | 61 ++-
 4 files changed, 96 insertions(+), 60 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/95eb06bd/R/pkg/R/generics.R
--
diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R
index 7653ca7..499c7b2 100644
--- a/R/pkg/R/generics.R
+++ b/R/pkg/R/generics.R
@@ -1373,7 +1373,7 @@ setGeneric("spark.logit", function(data, formula, ...) { 
standardGeneric("spark.
 
 #' @rdname spark.mlp
 #' @export
-setGeneric("spark.mlp", function(data, ...) { standardGeneric("spark.mlp") })
+setGeneric("spark.mlp", function(data, formula, ...) { 
standardGeneric("spark.mlp") })
 
 #' @rdname spark.naiveBayes
 #' @export

http://git-wip-us.apache.org/repos/asf/spark/blob/95eb06bd/R/pkg/R/mllib.R
--
diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index 1065b4b..265e64e 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -525,7 +525,7 @@ setMethod("write.ml", signature(object = "LDAModel", path = 
"character"),
 #' @note spark.isoreg since 2.1.0
 setMethod("spark.isoreg", signature(data = "SparkDataFrame", formula = 
"formula"),
   function(data, formula, isotonic = TRUE, featureIndex = 0, weightCol 
= NULL) {
-formula <- paste0(deparse(formula), collapse = "")
+formula <- paste(deparse(formula), collapse = "")
 
 if (is.null(weightCol)) {
   weightCol <- ""
@@ -775,7 +775,7 @@ setMethod("spark.logit", signature(data = "SparkDataFrame", 
formula = "formula")
tol = 1E-6, fitIntercept = TRUE, family = "auto", 
standardization = TRUE,
thresholds = 0.5, weightCol = NULL, aggregationDepth = 2,
probabilityCol = "probability") {
-formula <- paste0(deparse(formula), collapse = "")
+formula <- paste(deparse(formula), collapse = "")
 
 if (is.null(weightCol)) {
   weightCol <- ""
@@ -858,6 +858,8 @@ setMethod("summary", signature(object = 
"LogisticRegressionModel"),
 #'   Multilayer Perceptron}
 #'
 #' @param data a \code{SparkDataFrame} of observations and labels for model 
fitting.
+#' @param formula a symbolic description of the model to be fitted. Currently 
only a few formula
+#'operators are supported, including '~', '.', ':', '+', and 
'-'.
 #' @param blockSize blockSize parameter.
 #' @param layers integer vector containing the number of nodes for each layer
 #' @param solver solver parameter, supported options: "gd" (minibatch gradient 
descent) or "l-bfgs".
@@ -870,7 +872,7 @@ setMethod("summary", signature(object = 
"LogisticRegressionModel"),
 #' @param ... additional arguments passed to the method.
 #' @return \code{spark.mlp} returns a fitted Multilayer Perceptron 
Classification Model.
 #' @rdname spark.mlp
-#' @aliases spark.mlp,SparkDataFrame-method
+#' @aliases spark.mlp,SparkDataFrame,formula-method
 #' @name spark.mlp
 #' @seealso \link{read.ml}
 #' @export
@@ -879,7 +881,7 @@ setMethod("summary", signature(object = 
"LogisticRegressionModel"),
 #' df <- read.df("data/mllib/sample_multiclass_classification_data.txt", 
source = "libsvm")
 #'
 #' # fit a Multilayer Perceptron Classification Model
-#' model <- spark.mlp(df, blockSize = 128, layers = c(4, 3), solver = "l-bfgs",
+#' model <- spark.mlp(df, label ~ features, blockSize = 128, layers = c(4, 3), 
solver = "l-bfgs",
 #'maxIter = 100, tol = 0.5, stepSize = 1, seed = 1,
 #'initialWeights = c(0, 0, 0, 0, 0, 5, 5, 5, 5, 5, 9, 9, 
9, 9, 9))
 #'
@@ -896,9 +898,10 @@ setMethod("summary", signat

spark git commit: [SPARK-18438][SPARKR][ML] spark.mlp should support RFormula.

2016-11-16 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 436ae201f -> 7b57e480d


[SPARK-18438][SPARKR][ML] spark.mlp should support RFormula.

## What changes were proposed in this pull request?
```spark.mlp``` should support ```RFormula``` like the other ML algorithm wrappers.
This patch also does some cleanup and improvement for ```spark.mlp```.

## How was this patch tested?
Unit tests.

Author: Yanbo Liang 

Closes #15883 from yanboliang/spark-18438.

(cherry picked from commit 95eb06bd7d0f7110ef62c8d1cb6337c72b10d99f)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7b57e480
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7b57e480
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7b57e480

Branch: refs/heads/branch-2.1
Commit: 7b57e480d2f2c0695eb4036199cd0db52c6f2008
Parents: 436ae20
Author: Yanbo Liang 
Authored: Wed Nov 16 01:04:18 2016 -0800
Committer: Yanbo Liang 
Committed: Wed Nov 16 01:05:23 2016 -0800

--
 R/pkg/R/generics.R  |  2 +-
 R/pkg/R/mllib.R | 30 ++
 R/pkg/inst/tests/testthat/test_mllib.R  | 63 +---
 .../MultilayerPerceptronClassifierWrapper.scala | 61 ++-
 4 files changed, 96 insertions(+), 60 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7b57e480/R/pkg/R/generics.R
--
diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R
index 7653ca7..499c7b2 100644
--- a/R/pkg/R/generics.R
+++ b/R/pkg/R/generics.R
@@ -1373,7 +1373,7 @@ setGeneric("spark.logit", function(data, formula, ...) { 
standardGeneric("spark.
 
 #' @rdname spark.mlp
 #' @export
-setGeneric("spark.mlp", function(data, ...) { standardGeneric("spark.mlp") })
+setGeneric("spark.mlp", function(data, formula, ...) { 
standardGeneric("spark.mlp") })
 
 #' @rdname spark.naiveBayes
 #' @export

http://git-wip-us.apache.org/repos/asf/spark/blob/7b57e480/R/pkg/R/mllib.R
--
diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index 1065b4b..265e64e 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -525,7 +525,7 @@ setMethod("write.ml", signature(object = "LDAModel", path = 
"character"),
 #' @note spark.isoreg since 2.1.0
 setMethod("spark.isoreg", signature(data = "SparkDataFrame", formula = 
"formula"),
   function(data, formula, isotonic = TRUE, featureIndex = 0, weightCol 
= NULL) {
-formula <- paste0(deparse(formula), collapse = "")
+formula <- paste(deparse(formula), collapse = "")
 
 if (is.null(weightCol)) {
   weightCol <- ""
@@ -775,7 +775,7 @@ setMethod("spark.logit", signature(data = "SparkDataFrame", 
formula = "formula")
tol = 1E-6, fitIntercept = TRUE, family = "auto", 
standardization = TRUE,
thresholds = 0.5, weightCol = NULL, aggregationDepth = 2,
probabilityCol = "probability") {
-formula <- paste0(deparse(formula), collapse = "")
+formula <- paste(deparse(formula), collapse = "")
 
 if (is.null(weightCol)) {
   weightCol <- ""
@@ -858,6 +858,8 @@ setMethod("summary", signature(object = 
"LogisticRegressionModel"),
 #'   Multilayer Perceptron}
 #'
 #' @param data a \code{SparkDataFrame} of observations and labels for model 
fitting.
+#' @param formula a symbolic description of the model to be fitted. Currently 
only a few formula
+#'operators are supported, including '~', '.', ':', '+', and 
'-'.
 #' @param blockSize blockSize parameter.
 #' @param layers integer vector containing the number of nodes for each layer
 #' @param solver solver parameter, supported options: "gd" (minibatch gradient 
descent) or "l-bfgs".
@@ -870,7 +872,7 @@ setMethod("summary", signature(object = 
"LogisticRegressionModel"),
 #' @param ... additional arguments passed to the method.
 #' @return \code{spark.mlp} returns a fitted Multilayer Perceptron 
Classification Model.
 #' @rdname spark.mlp
-#' @aliases spark.mlp,SparkDataFrame-method
+#' @aliases spark.mlp,SparkDataFrame,formula-method
 #' @name spark.mlp
 #' @seealso \link{read.ml}
 #' @export
@@ -879,7 +881,7 @@ setMethod("summary", signature(object = 
"LogisticRegressionModel"),
 #' df <- read.df("data/mllib/sample_multiclass_classification_data.txt", 
source = "libsvm")
 #'
 #' # fit a Multilayer Perceptron Classification Model
-#' model <- spark.mlp(df, blockSize = 128, layers = c(4, 3), solver = "l-bfgs",
+#' model <- spark.mlp(df, label ~ features, blockSize = 128, layers = c(4, 3), 
solver = "l-bfgs",
 #'maxIter = 100, tol = 0.5, stepSize = 1, seed = 1,
 #'initialWeigh

spark git commit: [SPARK-18434][ML] Add missing ParamValidations for ML algos

2016-11-16 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 241e04bc0 -> c68f1a38a


[SPARK-18434][ML] Add missing ParamValidations for ML algos

## What changes were proposed in this pull request?
Add missing ParamValidations for ML algos
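The pattern, as the diff below shows for ```minDocFreq```, ```k```, ```vectorSize``` and friends,
is to attach a ```ParamValidators``` predicate when a param is declared, so invalid values are
rejected at set time rather than deep inside training. A small self-contained sketch; the
```ToyParams``` class is illustrative only, not part of Spark:
```
import org.apache.spark.ml.param.{IntParam, ParamMap, Params, ParamValidators}
import org.apache.spark.ml.util.Identifiable

// Toy params holder showing a validator attached at declaration time.
class ToyParams(override val uid: String) extends Params {
  def this() = this(Identifiable.randomUID("toy"))

  // "> 0" is enforced by ParamValidators.gt(0) whenever the param is set.
  final val k: IntParam =
    new IntParam(this, "k", "the number of components (> 0)", ParamValidators.gt(0))

  def setK(value: Int): this.type = set(k, value)

  override def copy(extra: ParamMap): ToyParams = defaultCopy(extra)
}

// new ToyParams().setK(3)  // fine
// new ToyParams().setK(0)  // throws IllegalArgumentException from the validator
```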
## How was this patch tested?
existing tests

Author: Zheng RuiFeng 

Closes #15881 from zhengruifeng/arg_checking.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c68f1a38
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c68f1a38
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c68f1a38

Branch: refs/heads/master
Commit: c68f1a38af67957ee28889667193da8f64bb4342
Parents: 241e04b
Author: Zheng RuiFeng 
Authored: Wed Nov 16 02:46:27 2016 -0800
Committer: Yanbo Liang 
Committed: Wed Nov 16 02:46:27 2016 -0800

--
 .../main/scala/org/apache/spark/ml/feature/IDF.scala   |  3 ++-
 .../main/scala/org/apache/spark/ml/feature/PCA.scala   |  3 ++-
 .../scala/org/apache/spark/ml/feature/Word2Vec.scala   | 13 -
 .../spark/ml/regression/IsotonicRegression.scala   |  3 ++-
 .../apache/spark/ml/regression/LinearRegression.scala  |  6 +-
 .../scala/org/apache/spark/ml/tree/treeParams.scala|  4 +++-
 6 files changed, 22 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c68f1a38/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala
index 6386dd8..46a0730 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala
@@ -44,7 +44,8 @@ private[feature] trait IDFBase extends Params with 
HasInputCol with HasOutputCol
* @group param
*/
   final val minDocFreq = new IntParam(
-this, "minDocFreq", "minimum number of documents in which a term should 
appear for filtering")
+this, "minDocFreq", "minimum number of documents in which a term should 
appear for filtering" +
+  " (>= 0)", ParamValidators.gtEq(0))
 
   setDefault(minDocFreq -> 0)
 

http://git-wip-us.apache.org/repos/asf/spark/blob/c68f1a38/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala
index 6b91348..444006f 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala
@@ -44,7 +44,8 @@ private[feature] trait PCAParams extends Params with 
HasInputCol with HasOutputC
* The number of principal components.
* @group param
*/
-  final val k: IntParam = new IntParam(this, "k", "the number of principal 
components")
+  final val k: IntParam = new IntParam(this, "k", "the number of principal 
components (> 0)",
+ParamValidators.gt(0))
 
   /** @group getParam */
   def getK: Int = $(k)

http://git-wip-us.apache.org/repos/asf/spark/blob/c68f1a38/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
index d53f3df..3ed08c9 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
@@ -43,7 +43,8 @@ private[feature] trait Word2VecBase extends Params
* @group param
*/
   final val vectorSize = new IntParam(
-this, "vectorSize", "the dimension of codes after transforming from words")
+this, "vectorSize", "the dimension of codes after transforming from words 
(> 0)",
+ParamValidators.gt(0))
   setDefault(vectorSize -> 100)
 
   /** @group getParam */
@@ -55,7 +56,8 @@ private[feature] trait Word2VecBase extends Params
* @group expertParam
*/
   final val windowSize = new IntParam(
-this, "windowSize", "the window size (context words from [-window, 
window])")
+this, "windowSize", "the window size (context words from [-window, 
window]) (> 0)",
+ParamValidators.gt(0))
   setDefault(windowSize -> 5)
 
   /** @group expertGetParam */
@@ -67,7 +69,8 @@ private[feature] trait Word2VecBase extends Params
* @group param
*/
   final val numPartitions = new IntParam(
-this, "numPartitions", "number of partitions for sentences of words")
+this, "numPartitions", "number of partitions for sentences of words (> 0)",
+ParamValidators.gt(0))
   setDefault(numPartitions -> 1)
 
   /** @group getParam */
@@ -80,7 +83,7 @@ pri

spark git commit: [SPARK-18434][ML] Add missing ParamValidations for ML algos

2016-11-16 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 820847008 -> 6b6eb4e52


[SPARK-18434][ML] Add missing ParamValidations for ML algos

## What changes were proposed in this pull request?
Add missing ParamValidations for ML algos
## How was this patch tested?
existing tests

Author: Zheng RuiFeng 

Closes #15881 from zhengruifeng/arg_checking.

(cherry picked from commit c68f1a38af67957ee28889667193da8f64bb4342)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6b6eb4e5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6b6eb4e5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6b6eb4e5

Branch: refs/heads/branch-2.1
Commit: 6b6eb4e520d07a27aa68d3450f3c7613b233d928
Parents: 8208470
Author: Zheng RuiFeng 
Authored: Wed Nov 16 02:46:27 2016 -0800
Committer: Yanbo Liang 
Committed: Wed Nov 16 02:46:54 2016 -0800

--
 .../main/scala/org/apache/spark/ml/feature/IDF.scala   |  3 ++-
 .../main/scala/org/apache/spark/ml/feature/PCA.scala   |  3 ++-
 .../scala/org/apache/spark/ml/feature/Word2Vec.scala   | 13 -
 .../spark/ml/regression/IsotonicRegression.scala   |  3 ++-
 .../apache/spark/ml/regression/LinearRegression.scala  |  6 +-
 .../scala/org/apache/spark/ml/tree/treeParams.scala|  4 +++-
 6 files changed, 22 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6b6eb4e5/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala
index 6386dd8..46a0730 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala
@@ -44,7 +44,8 @@ private[feature] trait IDFBase extends Params with 
HasInputCol with HasOutputCol
* @group param
*/
   final val minDocFreq = new IntParam(
-this, "minDocFreq", "minimum number of documents in which a term should 
appear for filtering")
+this, "minDocFreq", "minimum number of documents in which a term should 
appear for filtering" +
+  " (>= 0)", ParamValidators.gtEq(0))
 
   setDefault(minDocFreq -> 0)
 

http://git-wip-us.apache.org/repos/asf/spark/blob/6b6eb4e5/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala
index 6b91348..444006f 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala
@@ -44,7 +44,8 @@ private[feature] trait PCAParams extends Params with 
HasInputCol with HasOutputC
* The number of principal components.
* @group param
*/
-  final val k: IntParam = new IntParam(this, "k", "the number of principal 
components")
+  final val k: IntParam = new IntParam(this, "k", "the number of principal 
components (> 0)",
+ParamValidators.gt(0))
 
   /** @group getParam */
   def getK: Int = $(k)

http://git-wip-us.apache.org/repos/asf/spark/blob/6b6eb4e5/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
index d53f3df..3ed08c9 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
@@ -43,7 +43,8 @@ private[feature] trait Word2VecBase extends Params
* @group param
*/
   final val vectorSize = new IntParam(
-this, "vectorSize", "the dimension of codes after transforming from words")
+this, "vectorSize", "the dimension of codes after transforming from words 
(> 0)",
+ParamValidators.gt(0))
   setDefault(vectorSize -> 100)
 
   /** @group getParam */
@@ -55,7 +56,8 @@ private[feature] trait Word2VecBase extends Params
* @group expertParam
*/
   final val windowSize = new IntParam(
-this, "windowSize", "the window size (context words from [-window, 
window])")
+this, "windowSize", "the window size (context words from [-window, 
window]) (> 0)",
+ParamValidators.gt(0))
   setDefault(windowSize -> 5)
 
   /** @group expertGetParam */
@@ -67,7 +69,8 @@ private[feature] trait Word2VecBase extends Params
* @group param
*/
   final val numPartitions = new IntParam(
-this, "numPartitions", "number of partitions for sentences of words")
+this, "numPartitions", "number of partitions for sentences of words (> 0)",
+

spark git commit: [SPARK-18282][ML][PYSPARK] Add python clustering summaries for GMM and BKM

2016-11-21 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 658547974 -> e811fbf9e


[SPARK-18282][ML][PYSPARK] Add python clustering summaries for GMM and BKM

## What changes were proposed in this pull request?

Add model summary APIs for `GaussianMixtureModel` and `BisectingKMeansModel` in 
pyspark.
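
These mirror summaries that already exist on the Scala side (the diff below only touches their
```setSummary``` plumbing). A minimal spark-shell-style sketch of what is being exposed, assuming
a DataFrame ```df``` with a ```features``` column:
```
import org.apache.spark.ml.clustering.{BisectingKMeans, GaussianMixture}
import org.apache.spark.sql.DataFrame

// Training summaries on the Scala models; pyspark gains equivalent accessors.
def printClusterSizes(df: DataFrame): Unit = {
  val gmm = new GaussianMixture().setK(2).fit(df)
  if (gmm.hasSummary) println(gmm.summary.clusterSizes.mkString(", "))

  val bkm = new BisectingKMeans().setK(2).fit(df)
  if (bkm.hasSummary) println(bkm.summary.clusterSizes.mkString(", "))
}
```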

## How was this patch tested?

Unit tests.

Author: sethah 

Closes #15777 from sethah/pyspark_cluster_summaries.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e811fbf9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e811fbf9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e811fbf9

Branch: refs/heads/master
Commit: e811fbf9ed131bccbc46f3c5701c4ff317222fd9
Parents: 6585479
Author: sethah 
Authored: Mon Nov 21 05:36:49 2016 -0800
Committer: Yanbo Liang 
Committed: Mon Nov 21 05:36:49 2016 -0800

--
 .../ml/classification/LogisticRegression.scala  |  11 +-
 .../spark/ml/clustering/BisectingKMeans.scala   |   9 +-
 .../spark/ml/clustering/GaussianMixture.scala   |   9 +-
 .../org/apache/spark/ml/clustering/KMeans.scala |   9 +-
 .../GeneralizedLinearRegression.scala   |  11 +-
 .../spark/ml/regression/LinearRegression.scala  |  14 +-
 .../LogisticRegressionSuite.scala   |   2 +
 .../ml/clustering/BisectingKMeansSuite.scala|   3 +
 .../ml/clustering/GaussianMixtureSuite.scala|   3 +
 .../spark/ml/clustering/KMeansSuite.scala   |   3 +
 .../GeneralizedLinearRegressionSuite.scala  |   2 +
 .../ml/regression/LinearRegressionSuite.scala   |   2 +
 python/pyspark/ml/classification.py |  15 +-
 python/pyspark/ml/clustering.py | 162 ++-
 python/pyspark/ml/regression.py |  16 +-
 python/pyspark/ml/tests.py  |  32 
 16 files changed, 256 insertions(+), 47 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e811fbf9/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
index f58efd3..d07b4ad 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
@@ -648,7 +648,7 @@ class LogisticRegression @Since("1.2.0") (
 $(labelCol),
 $(featuresCol),
 objectiveHistory)
-  model.setSummary(logRegSummary)
+  model.setSummary(Some(logRegSummary))
 } else {
   model
 }
@@ -790,9 +790,9 @@ class LogisticRegressionModel private[spark] (
 }
   }
 
-  private[classification] def setSummary(
-  summary: LogisticRegressionTrainingSummary): this.type = {
-this.trainingSummary = Some(summary)
+  private[classification]
+  def setSummary(summary: Option[LogisticRegressionTrainingSummary]): 
this.type = {
+this.trainingSummary = summary
 this
   }
 
@@ -887,8 +887,7 @@ class LogisticRegressionModel private[spark] (
   override def copy(extra: ParamMap): LogisticRegressionModel = {
 val newModel = copyValues(new LogisticRegressionModel(uid, 
coefficientMatrix, interceptVector,
   numClasses, isMultinomial), extra)
-if (trainingSummary.isDefined) newModel.setSummary(trainingSummary.get)
-newModel.setParent(parent)
+newModel.setSummary(trainingSummary).setParent(parent)
   }
 
   override protected def raw2prediction(rawPrediction: Vector): Double = {

http://git-wip-us.apache.org/repos/asf/spark/blob/e811fbf9/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala 
b/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
index f8a606d..e6ca3aed 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
@@ -95,8 +95,7 @@ class BisectingKMeansModel private[ml] (
   @Since("2.0.0")
   override def copy(extra: ParamMap): BisectingKMeansModel = {
 val copied = copyValues(new BisectingKMeansModel(uid, parentModel), extra)
-if (trainingSummary.isDefined) copied.setSummary(trainingSummary.get)
-copied.setParent(this.parent)
+copied.setSummary(trainingSummary).setParent(this.parent)
   }
 
   @Since("2.0.0")
@@ -132,8 +131,8 @@ class BisectingKMeansModel private[ml] (
 
   private var trainingSummary: Option[BisectingKMeansSummary] = None
 
-  private[clustering] def setSummary(summary: BisectingKMeansSummary): 

spark git commit: [SPARK-18282][ML][PYSPARK] Add python clustering summaries for GMM and BKM

2016-11-21 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 fb4e6359d -> 31002e4a7


[SPARK-18282][ML][PYSPARK] Add python clustering summaries for GMM and BKM

## What changes were proposed in this pull request?

Add model summary APIs for `GaussianMixtureModel` and `BisectingKMeansModel` in 
pyspark.

## How was this patch tested?

Unit tests.

Author: sethah 

Closes #15777 from sethah/pyspark_cluster_summaries.

(cherry picked from commit e811fbf9ed131bccbc46f3c5701c4ff317222fd9)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/31002e4a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/31002e4a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/31002e4a

Branch: refs/heads/branch-2.1
Commit: 31002e4a77ca56492f41bf35e7c8f263d767d3aa
Parents: fb4e635
Author: sethah 
Authored: Mon Nov 21 05:36:49 2016 -0800
Committer: Yanbo Liang 
Committed: Mon Nov 21 05:37:34 2016 -0800

--
 .../ml/classification/LogisticRegression.scala  |  11 +-
 .../spark/ml/clustering/BisectingKMeans.scala   |   9 +-
 .../spark/ml/clustering/GaussianMixture.scala   |   9 +-
 .../org/apache/spark/ml/clustering/KMeans.scala |   9 +-
 .../GeneralizedLinearRegression.scala   |  11 +-
 .../spark/ml/regression/LinearRegression.scala  |  14 +-
 .../LogisticRegressionSuite.scala   |   2 +
 .../ml/clustering/BisectingKMeansSuite.scala|   3 +
 .../ml/clustering/GaussianMixtureSuite.scala|   3 +
 .../spark/ml/clustering/KMeansSuite.scala   |   3 +
 .../GeneralizedLinearRegressionSuite.scala  |   2 +
 .../ml/regression/LinearRegressionSuite.scala   |   2 +
 python/pyspark/ml/classification.py |  15 +-
 python/pyspark/ml/clustering.py | 162 ++-
 python/pyspark/ml/regression.py |  16 +-
 python/pyspark/ml/tests.py  |  32 
 16 files changed, 256 insertions(+), 47 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/31002e4a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
index f58efd3..d07b4ad 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
@@ -648,7 +648,7 @@ class LogisticRegression @Since("1.2.0") (
 $(labelCol),
 $(featuresCol),
 objectiveHistory)
-  model.setSummary(logRegSummary)
+  model.setSummary(Some(logRegSummary))
 } else {
   model
 }
@@ -790,9 +790,9 @@ class LogisticRegressionModel private[spark] (
 }
   }
 
-  private[classification] def setSummary(
-  summary: LogisticRegressionTrainingSummary): this.type = {
-this.trainingSummary = Some(summary)
+  private[classification]
+  def setSummary(summary: Option[LogisticRegressionTrainingSummary]): 
this.type = {
+this.trainingSummary = summary
 this
   }
 
@@ -887,8 +887,7 @@ class LogisticRegressionModel private[spark] (
   override def copy(extra: ParamMap): LogisticRegressionModel = {
 val newModel = copyValues(new LogisticRegressionModel(uid, 
coefficientMatrix, interceptVector,
   numClasses, isMultinomial), extra)
-if (trainingSummary.isDefined) newModel.setSummary(trainingSummary.get)
-newModel.setParent(parent)
+newModel.setSummary(trainingSummary).setParent(parent)
   }
 
   override protected def raw2prediction(rawPrediction: Vector): Double = {

http://git-wip-us.apache.org/repos/asf/spark/blob/31002e4a/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala 
b/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
index f8a606d..e6ca3aed 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
@@ -95,8 +95,7 @@ class BisectingKMeansModel private[ml] (
   @Since("2.0.0")
   override def copy(extra: ParamMap): BisectingKMeansModel = {
 val copied = copyValues(new BisectingKMeansModel(uid, parentModel), extra)
-if (trainingSummary.isDefined) copied.setSummary(trainingSummary.get)
-copied.setParent(this.parent)
+copied.setSummary(trainingSummary).setParent(this.parent)
   }
 
   @Since("2.0.0")
@@ -132,8 +131,8 @@ class BisectingKMeansModel private[ml] (
 
   private var trainingSummary: Option

spark git commit: [SPARK-18444][SPARKR] SparkR running in yarn-cluster mode should not download Spark package.

2016-11-22 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master ebeb0830a -> acb971577


[SPARK-18444][SPARKR] SparkR running in yarn-cluster mode should not download 
Spark package.

## What changes were proposed in this pull request?
When running a SparkR job in yarn-cluster mode, SparkR will download the Spark
package from the Apache website, which is not necessary.
```
./bin/spark-submit --master yarn-cluster ./examples/src/main/r/dataframe.R
```
The following is output:
```
Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from ‘package:base’:

as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
rank, rbind, sample, startsWith, subset, summary, transform, union

Spark not found in SPARK_HOME:
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
..
```
There's no ```SPARK_HOME``` in yarn-cluster mode, since the R process runs on a
remote host of the YARN cluster rather than on the client host. The JVM comes
up first and the R process then connects to it, so in such cases we should
never have to download Spark because Spark is already running.

## How was this patch tested?
Offline test.

Author: Yanbo Liang 

Closes #15888 from yanboliang/spark-18444.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/acb97157
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/acb97157
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/acb97157

Branch: refs/heads/master
Commit: acb97157796231fef74aba985825b05b607b9279
Parents: ebeb083
Author: Yanbo Liang 
Authored: Tue Nov 22 00:05:30 2016 -0800
Committer: Yanbo Liang 
Committed: Tue Nov 22 00:05:30 2016 -0800

--
 R/pkg/R/sparkR.R| 20 
 R/pkg/R/utils.R |  4 +++
 R/pkg/inst/tests/testthat/test_sparkR.R | 46 
 3 files changed, 64 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/acb97157/R/pkg/R/sparkR.R
--
diff --git a/R/pkg/R/sparkR.R b/R/pkg/R/sparkR.R
index 6b4a2f2..a7152b4 100644
--- a/R/pkg/R/sparkR.R
+++ b/R/pkg/R/sparkR.R
@@ -373,8 +373,13 @@ sparkR.session <- function(
 overrideEnvs(sparkConfigMap, paramMap)
   }
 
+  deployMode <- ""
+  if (exists("spark.submit.deployMode", envir = sparkConfigMap)) {
+deployMode <- sparkConfigMap[["spark.submit.deployMode"]]
+  }
+
   if (!exists(".sparkRjsc", envir = .sparkREnv)) {
-retHome <- sparkCheckInstall(sparkHome, master)
+retHome <- sparkCheckInstall(sparkHome, master, deployMode)
 if (!is.null(retHome)) sparkHome <- retHome
 sparkExecutorEnvMap <- new.env()
 sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
sparkExecutorEnvMap,
@@ -550,24 +555,27 @@ processSparkPackages <- function(packages) {
 #
 # @param sparkHome directory to find Spark package.
 # @param master the Spark master URL, used to check local or remote mode.
+# @param deployMode whether to deploy your driver on the worker nodes (cluster)
+#or locally as an external client (client).
 # @return NULL if no need to update sparkHome, and new sparkHome otherwise.
-sparkCheckInstall <- function(sparkHome, master) {
+sparkCheckInstall <- function(sparkHome, master, deployMode) {
   if (!isSparkRShell()) {
 if (!is.na(file.info(sparkHome)$isdir)) {
   msg <- paste0("Spark package found in SPARK_HOME: ", sparkHome)
   message(msg)
   NULL
 } else {
-  if (!nzchar(master) || isMasterLocal(master)) {
-msg <- paste0("Spark not found in SPARK_HOME: ",
-  sparkHome)
+  if (isMasterLocal(master)) {
+msg <- paste0("Spark not found in SPARK_HOME: ", sparkHome)
 message(msg)
 packageLocalDir <- install.spark()
 packageLocalDir
-  } else {
+  } else if (isClientMode(master) || deployMode == "client") {
 msg <- paste0("Spark not found in SPARK_HOME: ",
   sparkHome, "\n", installInstruction("remote"))
 stop(msg)
+  } else {
+NULL
   }
 }
   } else {

http://git-wip-us.apache.org/repos/asf/spark/blob/acb97157/R/pkg/R/utils.R
--
diff --git a/R/pkg/R/utils.R b/R/pkg/R/utils.R
index 2000454..098c0e3 100644
--- a/R/pkg/R/utils.R
+++ b/R/pkg/R/utils.R
@@ -777,6 +777,10 @@ isMasterLocal <- function(master) {
   grepl("^local(\\[([0-9]+|\\*)\\])?$", master, perl = TRUE)
 }
 
+isClientMode <- function(master) {
+  grepl("([a-z]+)-client$", master, perl = TRUE)
+}
+
 isSparkRShell <- function() {
   g

spark git commit: [SPARK-18444][SPARKR] SparkR running in yarn-cluster mode should not download Spark package.

2016-11-22 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 aaa2a173a -> c70214075


[SPARK-18444][SPARKR] SparkR running in yarn-cluster mode should not download 
Spark package.

## What changes were proposed in this pull request?
When running a SparkR job in yarn-cluster mode, SparkR will download the Spark
package from the Apache website, which is not necessary.
```
./bin/spark-submit --master yarn-cluster ./examples/src/main/r/dataframe.R
```
The following is output:
```
Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from ‘package:base’:

as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
rank, rbind, sample, startsWith, subset, summary, transform, union

Spark not found in SPARK_HOME:
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
..
```
There's no ```SPARK_HOME``` in yarn-cluster mode, since the R process runs on a
remote host of the YARN cluster rather than on the client host. The JVM comes
up first and the R process then connects to it, so in such cases we should
never have to download Spark because Spark is already running.

## How was this patch tested?
Offline test.

Author: Yanbo Liang 

Closes #15888 from yanboliang/spark-18444.

(cherry picked from commit acb97157796231fef74aba985825b05b607b9279)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c7021407
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c7021407
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c7021407

Branch: refs/heads/branch-2.1
Commit: c7021407597480bddf226ffa6d1d3f682408dfeb
Parents: aaa2a17
Author: Yanbo Liang 
Authored: Tue Nov 22 00:05:30 2016 -0800
Committer: Yanbo Liang 
Committed: Tue Nov 22 00:05:54 2016 -0800

--
 R/pkg/R/sparkR.R| 20 
 R/pkg/R/utils.R |  4 +++
 R/pkg/inst/tests/testthat/test_sparkR.R | 46 
 3 files changed, 64 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c7021407/R/pkg/R/sparkR.R
--
diff --git a/R/pkg/R/sparkR.R b/R/pkg/R/sparkR.R
index 6b4a2f2..a7152b4 100644
--- a/R/pkg/R/sparkR.R
+++ b/R/pkg/R/sparkR.R
@@ -373,8 +373,13 @@ sparkR.session <- function(
 overrideEnvs(sparkConfigMap, paramMap)
   }
 
+  deployMode <- ""
+  if (exists("spark.submit.deployMode", envir = sparkConfigMap)) {
+deployMode <- sparkConfigMap[["spark.submit.deployMode"]]
+  }
+
   if (!exists(".sparkRjsc", envir = .sparkREnv)) {
-retHome <- sparkCheckInstall(sparkHome, master)
+retHome <- sparkCheckInstall(sparkHome, master, deployMode)
 if (!is.null(retHome)) sparkHome <- retHome
 sparkExecutorEnvMap <- new.env()
 sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
sparkExecutorEnvMap,
@@ -550,24 +555,27 @@ processSparkPackages <- function(packages) {
 #
 # @param sparkHome directory to find Spark package.
 # @param master the Spark master URL, used to check local or remote mode.
+# @param deployMode whether to deploy your driver on the worker nodes (cluster)
+#or locally as an external client (client).
 # @return NULL if no need to update sparkHome, and new sparkHome otherwise.
-sparkCheckInstall <- function(sparkHome, master) {
+sparkCheckInstall <- function(sparkHome, master, deployMode) {
   if (!isSparkRShell()) {
 if (!is.na(file.info(sparkHome)$isdir)) {
   msg <- paste0("Spark package found in SPARK_HOME: ", sparkHome)
   message(msg)
   NULL
 } else {
-  if (!nzchar(master) || isMasterLocal(master)) {
-msg <- paste0("Spark not found in SPARK_HOME: ",
-  sparkHome)
+  if (isMasterLocal(master)) {
+msg <- paste0("Spark not found in SPARK_HOME: ", sparkHome)
 message(msg)
 packageLocalDir <- install.spark()
 packageLocalDir
-  } else {
+  } else if (isClientMode(master) || deployMode == "client") {
 msg <- paste0("Spark not found in SPARK_HOME: ",
   sparkHome, "\n", installInstruction("remote"))
 stop(msg)
+  } else {
+NULL
   }
 }
   } else {

http://git-wip-us.apache.org/repos/asf/spark/blob/c7021407/R/pkg/R/utils.R
--
diff --git a/R/pkg/R/utils.R b/R/pkg/R/utils.R
index 2000454..098c0e3 100644
--- a/R/pkg/R/utils.R
+++ b/R/pkg/R/utils.R
@@ -777,6 +777,10 @@ isMasterLocal <- function(master) {
   grepl("^local(\\[([0-9]+|\\*)\\])?$", master, perl = TRUE)
 }
 
+isClientMode <- fun

spark git commit: [SPARK-18444][SPARKR] SparkR running in yarn-cluster mode should not download Spark package.

2016-11-22 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 9dad3a7b0 -> a37238b06


[SPARK-18444][SPARKR] SparkR running in yarn-cluster mode should not download 
Spark package.

## What changes were proposed in this pull request?
When running a SparkR job in yarn-cluster mode, SparkR will download the Spark
package from the Apache website, which is not necessary.
```
./bin/spark-submit --master yarn-cluster ./examples/src/main/r/dataframe.R
```
The following is output:
```
Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from ‘package:base’:

as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
rank, rbind, sample, startsWith, subset, summary, transform, union

Spark not found in SPARK_HOME:
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
..
```
There's no ```SPARK_HOME``` in yarn-cluster mode, since the R process runs on a
remote host of the YARN cluster rather than on the client host. The JVM comes
up first and the R process then connects to it, so in such cases we should
never have to download Spark because Spark is already running.

## How was this patch tested?
Offline test.

Author: Yanbo Liang 

Closes #15888 from yanboliang/spark-18444.

(cherry picked from commit acb97157796231fef74aba985825b05b607b9279)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a37238b0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a37238b0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a37238b0

Branch: refs/heads/branch-2.0
Commit: a37238b06f525a1e870750650cf1a4f2885ea265
Parents: 9dad3a7
Author: Yanbo Liang 
Authored: Tue Nov 22 00:05:30 2016 -0800
Committer: Yanbo Liang 
Committed: Tue Nov 22 00:08:51 2016 -0800

--
 R/pkg/R/sparkR.R| 20 
 R/pkg/R/utils.R |  4 +++
 R/pkg/inst/tests/testthat/test_sparkR.R | 46 
 3 files changed, 64 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a37238b0/R/pkg/R/sparkR.R
--
diff --git a/R/pkg/R/sparkR.R b/R/pkg/R/sparkR.R
index cc6d591..6476693 100644
--- a/R/pkg/R/sparkR.R
+++ b/R/pkg/R/sparkR.R
@@ -369,8 +369,13 @@ sparkR.session <- function(
 overrideEnvs(sparkConfigMap, paramMap)
   }
 
+  deployMode <- ""
+  if (exists("spark.submit.deployMode", envir = sparkConfigMap)) {
+deployMode <- sparkConfigMap[["spark.submit.deployMode"]]
+  }
+
   if (!exists(".sparkRjsc", envir = .sparkREnv)) {
-retHome <- sparkCheckInstall(sparkHome, master)
+retHome <- sparkCheckInstall(sparkHome, master, deployMode)
 if (!is.null(retHome)) sparkHome <- retHome
 sparkExecutorEnvMap <- new.env()
 sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
sparkExecutorEnvMap,
@@ -546,24 +551,27 @@ processSparkPackages <- function(packages) {
 #
 # @param sparkHome directory to find Spark package.
 # @param master the Spark master URL, used to check local or remote mode.
+# @param deployMode whether to deploy your driver on the worker nodes (cluster)
+#or locally as an external client (client).
 # @return NULL if no need to update sparkHome, and new sparkHome otherwise.
-sparkCheckInstall <- function(sparkHome, master) {
+sparkCheckInstall <- function(sparkHome, master, deployMode) {
   if (!isSparkRShell()) {
 if (!is.na(file.info(sparkHome)$isdir)) {
   msg <- paste0("Spark package found in SPARK_HOME: ", sparkHome)
   message(msg)
   NULL
 } else {
-  if (!nzchar(master) || isMasterLocal(master)) {
-msg <- paste0("Spark not found in SPARK_HOME: ",
-  sparkHome)
+  if (isMasterLocal(master)) {
+msg <- paste0("Spark not found in SPARK_HOME: ", sparkHome)
 message(msg)
 packageLocalDir <- install.spark()
 packageLocalDir
-  } else {
+  } else if (isClientMode(master) || deployMode == "client") {
 msg <- paste0("Spark not found in SPARK_HOME: ",
   sparkHome, "\n", installInstruction("remote"))
 stop(msg)
+  } else {
+NULL
   }
 }
   } else {

http://git-wip-us.apache.org/repos/asf/spark/blob/a37238b0/R/pkg/R/utils.R
--
diff --git a/R/pkg/R/utils.R b/R/pkg/R/utils.R
index 248c575..581a9a4 100644
--- a/R/pkg/R/utils.R
+++ b/R/pkg/R/utils.R
@@ -694,6 +694,10 @@ isMasterLocal <- function(master) {
   grepl("^local(\\[([0-9]+|\\*)\\])?$", master, perl = TRUE)
 }
 
+isClientMode <- fun

spark git commit: [SPARK-18501][ML][SPARKR] Fix spark.glm errors when fitting on collinear data

2016-11-22 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master d0212eb0f -> 982b82e32


[SPARK-18501][ML][SPARKR] Fix spark.glm errors when fitting on collinear data

## What changes were proposed in this pull request?
* Fix SparkR ```spark.glm``` errors when fitting on collinear data, since the 
```standard error of coefficients, t value and p value``` are not available in 
this condition.
* The Scala/Python GLM summary should throw an exception if users request the 
```standard error of coefficients, t value and p value``` when the underlying 
WLS was solved by local "l-bfgs"; see the sketch below.

## How was this patch tested?
Add unit tests.

Author: Yanbo Liang 

Closes #15930 from yanboliang/spark-18501.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/982b82e3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/982b82e3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/982b82e3

Branch: refs/heads/master
Commit: 982b82e32e0fc7d30c5d557944a79eb3e6d2da59
Parents: d0212eb
Author: Yanbo Liang 
Authored: Tue Nov 22 19:17:48 2016 -0800
Committer: Yanbo Liang 
Committed: Tue Nov 22 19:17:48 2016 -0800

--
 R/pkg/R/mllib.R | 21 ++--
 R/pkg/inst/tests/testthat/test_mllib.R  |  9 
 .../r/GeneralizedLinearRegressionWrapper.scala  | 54 +++-
 .../GeneralizedLinearRegression.scala   | 46 ++---
 .../GeneralizedLinearRegressionSuite.scala  | 21 
 5 files changed, 115 insertions(+), 36 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/982b82e3/R/pkg/R/mllib.R
--
diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index 265e64e..02bc645 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -278,8 +278,10 @@ setMethod("glm", signature(formula = "formula", family = 
"ANY", data = "SparkDat
 
 #' @param object a fitted generalized linear model.
 #' @return \code{summary} returns a summary object of the fitted model, a list 
of components
-#' including at least the coefficients, null/residual deviance, 
null/residual degrees
-#' of freedom, AIC and number of iterations IRLS takes.
+#' including at least the coefficients matrix (which includes 
coefficients, standard error
+#' of coefficients, t value and p value), null/residual deviance, 
null/residual degrees of
+#' freedom, AIC and number of iterations IRLS takes. If there are 
collinear columns
+#' in your data, the coefficients matrix only provides coefficients.
 #'
 #' @rdname spark.glm
 #' @export
@@ -303,9 +305,18 @@ setMethod("summary", signature(object = 
"GeneralizedLinearRegressionModel"),
 } else {
   dataFrame(callJMethod(jobj, "rDevianceResiduals"))
 }
-coefficients <- matrix(coefficients, ncol = 4)
-colnames(coefficients) <- c("Estimate", "Std. Error", "t value", 
"Pr(>|t|)")
-rownames(coefficients) <- unlist(features)
+# If the underlying WeightedLeastSquares using "normal" solver, we 
can provide
+# coefficients, standard error of coefficients, t value and p 
value. Otherwise,
+# it will be fitted by local "l-bfgs", we can only provide 
coefficients.
+if (length(features) == length(coefficients)) {
+  coefficients <- matrix(coefficients, ncol = 1)
+  colnames(coefficients) <- c("Estimate")
+  rownames(coefficients) <- unlist(features)
+} else {
+  coefficients <- matrix(coefficients, ncol = 4)
+  colnames(coefficients) <- c("Estimate", "Std. Error", "t value", 
"Pr(>|t|)")
+  rownames(coefficients) <- unlist(features)
+}
 ans <- list(deviance.resid = deviance.resid, coefficients = 
coefficients,
 dispersion = dispersion, null.deviance = null.deviance,
 deviance = deviance, df.null = df.null, df.residual = 
df.residual,

http://git-wip-us.apache.org/repos/asf/spark/blob/982b82e3/R/pkg/inst/tests/testthat/test_mllib.R
--
diff --git a/R/pkg/inst/tests/testthat/test_mllib.R 
b/R/pkg/inst/tests/testthat/test_mllib.R
index 2a97a51..467e00c 100644
--- a/R/pkg/inst/tests/testthat/test_mllib.R
+++ b/R/pkg/inst/tests/testthat/test_mllib.R
@@ -169,6 +169,15 @@ test_that("spark.glm summary", {
   df <- suppressWarnings(createDataFrame(data))
   regStats <- summary(spark.glm(df, b ~ a1 + a2, regParam = 1.0))
   expect_equal(regStats$aic, 14.00976, tolerance = 1e-4) # 14.00976 is from 
summary() result
+
+  # Test spark.glm works on collinear data
+  A <- matrix(c(1, 2, 3, 4, 2, 4, 6, 8), 4, 2)
+  b <- c(1, 2, 3, 4)
+  data <- as.data.frame(cbi

spark git commit: [SPARK-18501][ML][SPARKR] Fix spark.glm errors when fitting on collinear data

2016-11-22 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 3be2d1e0b -> fc5fee83e


[SPARK-18501][ML][SPARKR] Fix spark.glm errors when fitting on collinear data

## What changes were proposed in this pull request?
* Fix SparkR ```spark.glm``` errors when fitting on collinear data, since the 
```standard error of coefficients, t value and p value``` are not available in 
this condition.
* The Scala/Python GLM summary should throw an exception if users request the 
```standard error of coefficients, t value and p value``` when the underlying 
WLS was solved by local "l-bfgs".

## How was this patch tested?
Add unit tests.

Author: Yanbo Liang 

Closes #15930 from yanboliang/spark-18501.

(cherry picked from commit 982b82e32e0fc7d30c5d557944a79eb3e6d2da59)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fc5fee83
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fc5fee83
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fc5fee83

Branch: refs/heads/branch-2.1
Commit: fc5fee83e363bc6df22459a9b1ba2ba11bfdfa20
Parents: 3be2d1e
Author: Yanbo Liang 
Authored: Tue Nov 22 19:17:48 2016 -0800
Committer: Yanbo Liang 
Committed: Tue Nov 22 19:18:30 2016 -0800

--
 R/pkg/R/mllib.R | 21 ++--
 R/pkg/inst/tests/testthat/test_mllib.R  |  9 
 .../r/GeneralizedLinearRegressionWrapper.scala  | 54 +++-
 .../GeneralizedLinearRegression.scala   | 46 ++---
 .../GeneralizedLinearRegressionSuite.scala  | 21 
 5 files changed, 115 insertions(+), 36 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/fc5fee83/R/pkg/R/mllib.R
--
diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index 265e64e..02bc645 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -278,8 +278,10 @@ setMethod("glm", signature(formula = "formula", family = 
"ANY", data = "SparkDat
 
 #' @param object a fitted generalized linear model.
 #' @return \code{summary} returns a summary object of the fitted model, a list 
of components
-#' including at least the coefficients, null/residual deviance, 
null/residual degrees
-#' of freedom, AIC and number of iterations IRLS takes.
+#' including at least the coefficients matrix (which includes 
coefficients, standard error
+#' of coefficients, t value and p value), null/residual deviance, 
null/residual degrees of
+#' freedom, AIC and number of iterations IRLS takes. If there are 
collinear columns
+#' in your data, the coefficients matrix only provides coefficients.
 #'
 #' @rdname spark.glm
 #' @export
@@ -303,9 +305,18 @@ setMethod("summary", signature(object = 
"GeneralizedLinearRegressionModel"),
 } else {
   dataFrame(callJMethod(jobj, "rDevianceResiduals"))
 }
-coefficients <- matrix(coefficients, ncol = 4)
-colnames(coefficients) <- c("Estimate", "Std. Error", "t value", 
"Pr(>|t|)")
-rownames(coefficients) <- unlist(features)
+# If the underlying WeightedLeastSquares using "normal" solver, we 
can provide
+# coefficients, standard error of coefficients, t value and p 
value. Otherwise,
+# it will be fitted by local "l-bfgs", we can only provide 
coefficients.
+if (length(features) == length(coefficients)) {
+  coefficients <- matrix(coefficients, ncol = 1)
+  colnames(coefficients) <- c("Estimate")
+  rownames(coefficients) <- unlist(features)
+} else {
+  coefficients <- matrix(coefficients, ncol = 4)
+  colnames(coefficients) <- c("Estimate", "Std. Error", "t value", 
"Pr(>|t|)")
+  rownames(coefficients) <- unlist(features)
+}
 ans <- list(deviance.resid = deviance.resid, coefficients = 
coefficients,
 dispersion = dispersion, null.deviance = null.deviance,
 deviance = deviance, df.null = df.null, df.residual = 
df.residual,

http://git-wip-us.apache.org/repos/asf/spark/blob/fc5fee83/R/pkg/inst/tests/testthat/test_mllib.R
--
diff --git a/R/pkg/inst/tests/testthat/test_mllib.R 
b/R/pkg/inst/tests/testthat/test_mllib.R
index 70a033d..b05be47 100644
--- a/R/pkg/inst/tests/testthat/test_mllib.R
+++ b/R/pkg/inst/tests/testthat/test_mllib.R
@@ -169,6 +169,15 @@ test_that("spark.glm summary", {
   df <- suppressWarnings(createDataFrame(data))
   regStats <- summary(spark.glm(df, b ~ a1 + a2, regParam = 1.0))
   expect_equal(regStats$aic, 14.00976, tolerance = 1e-4) # 14.00976 is from 
summary() result
+
+  # Test spark.glm works on colline

spark git commit: [SPARK-18520][ML] Add missing setXXXCol methods for BisectingKMeansModel and GaussianMixtureModel

2016-11-24 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 223fa218e -> 2dfabec38


[SPARK-18520][ML] Add missing setXXXCol methods for BisectingKMeansModel and 
GaussianMixtureModel

## What changes were proposed in this pull request?
add `setFeaturesCol` and `setPredictionCol` for `BisectingKMeansModel` and 
`GaussianMixtureModel`
add `setProbabilityCol` for `GaussianMixtureModel`
## How was this patch tested?
existing tests
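
For reference, a short Scala usage sketch of the new model-level setters (the 
app name, toy data and renamed columns are made-up assumptions); they matter 
when the DataFrame being scored names its columns differently from the 
training data:
```
import org.apache.spark.ml.clustering.{BisectingKMeans, GaussianMixture}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cluster-setters-sketch").getOrCreate()
import spark.implicits._

val train = Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
  Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)
).map(Tuple1.apply).toDF("features")

val bkm = new BisectingKMeans().setK(2).fit(train)
val gmm = new GaussianMixture().setK(2).fit(train)

// The scoring data uses a different feature column name.
val scoring = train.withColumnRenamed("features", "feat")

// New in this patch: setters on the fitted models themselves.
bkm.setFeaturesCol("feat").setPredictionCol("cluster").transform(scoring).show()
gmm.setFeaturesCol("feat")
  .setPredictionCol("cluster")
  .setProbabilityCol("clusterProb")
  .transform(scoring)
  .show()
```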

Author: Zheng RuiFeng 

Closes #15957 from zhengruifeng/bikm_set.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2dfabec3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2dfabec3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2dfabec3

Branch: refs/heads/master
Commit: 2dfabec38c24174e7f747c27c7144f7738483ec1
Parents: 223fa21
Author: Zheng RuiFeng 
Authored: Thu Nov 24 05:46:05 2016 -0800
Committer: Yanbo Liang 
Committed: Thu Nov 24 05:46:05 2016 -0800

--
 .../apache/spark/ml/clustering/BisectingKMeans.scala|  8 
 .../apache/spark/ml/clustering/GaussianMixture.scala| 12 
 2 files changed, 20 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/2dfabec3/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala 
b/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
index e6ca3aed..cf11ba3 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
@@ -98,6 +98,14 @@ class BisectingKMeansModel private[ml] (
 copied.setSummary(trainingSummary).setParent(this.parent)
   }
 
+  /** @group setParam */
+  @Since("2.1.0")
+  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+
+  /** @group setParam */
+  @Since("2.1.0")
+  def setPredictionCol(value: String): this.type = set(predictionCol, value)
+
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 transformSchema(dataset.schema, logging = true)

http://git-wip-us.apache.org/repos/asf/spark/blob/2dfabec3/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala 
b/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala
index 92d0b7d..19998ca 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala
@@ -87,6 +87,18 @@ class GaussianMixtureModel private[ml] (
 @Since("2.0.0") val gaussians: Array[MultivariateGaussian])
   extends Model[GaussianMixtureModel] with GaussianMixtureParams with 
MLWritable {
 
+  /** @group setParam */
+  @Since("2.1.0")
+  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+
+  /** @group setParam */
+  @Since("2.1.0")
+  def setPredictionCol(value: String): this.type = set(predictionCol, value)
+
+  /** @group setParam */
+  @Since("2.1.0")
+  def setProbabilityCol(value: String): this.type = set(probabilityCol, value)
+
   @Since("2.0.0")
   override def copy(extra: ParamMap): GaussianMixtureModel = {
 val copied = copyValues(new GaussianMixtureModel(uid, weights, gaussians), 
extra)





spark git commit: [SPARK-18520][ML] Add missing setXXXCol methods for BisectingKMeansModel and GaussianMixtureModel

2016-11-24 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 27d81d000 -> 04ec74f12


[SPARK-18520][ML] Add missing setXXXCol methods for BisectingKMeansModel and 
GaussianMixtureModel

## What changes were proposed in this pull request?
add `setFeaturesCol` and `setPredictionCol` for `BisectingKMeansModel` and 
`GaussianMixtureModel`
add `setProbabilityCol` for `GaussianMixtureModel`
## How was this patch tested?
existing tests

Author: Zheng RuiFeng 

Closes #15957 from zhengruifeng/bikm_set.

(cherry picked from commit 2dfabec38c24174e7f747c27c7144f7738483ec1)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/04ec74f1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/04ec74f1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/04ec74f1

Branch: refs/heads/branch-2.1
Commit: 04ec74f1274a164b2f72b31e2c147e042bf41bd9
Parents: 27d81d0
Author: Zheng RuiFeng 
Authored: Thu Nov 24 05:46:05 2016 -0800
Committer: Yanbo Liang 
Committed: Thu Nov 24 05:47:02 2016 -0800

--
 .../apache/spark/ml/clustering/BisectingKMeans.scala|  8 
 .../apache/spark/ml/clustering/GaussianMixture.scala| 12 
 2 files changed, 20 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/04ec74f1/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala 
b/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
index e6ca3aed..cf11ba3 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
@@ -98,6 +98,14 @@ class BisectingKMeansModel private[ml] (
 copied.setSummary(trainingSummary).setParent(this.parent)
   }
 
+  /** @group setParam */
+  @Since("2.1.0")
+  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+
+  /** @group setParam */
+  @Since("2.1.0")
+  def setPredictionCol(value: String): this.type = set(predictionCol, value)
+
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 transformSchema(dataset.schema, logging = true)

http://git-wip-us.apache.org/repos/asf/spark/blob/04ec74f1/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala 
b/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala
index 92d0b7d..19998ca 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala
@@ -87,6 +87,18 @@ class GaussianMixtureModel private[ml] (
 @Since("2.0.0") val gaussians: Array[MultivariateGaussian])
   extends Model[GaussianMixtureModel] with GaussianMixtureParams with 
MLWritable {
 
+  /** @group setParam */
+  @Since("2.1.0")
+  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+
+  /** @group setParam */
+  @Since("2.1.0")
+  def setPredictionCol(value: String): this.type = set(predictionCol, value)
+
+  /** @group setParam */
+  @Since("2.1.0")
+  def setProbabilityCol(value: String): this.type = set(probabilityCol, value)
+
   @Since("2.0.0")
   override def copy(extra: ParamMap): GaussianMixtureModel = {
 val copied = copyValues(new GaussianMixtureModel(uid, weights, gaussians), 
extra)





spark git commit: [SPARK-18481][ML] ML 2.1 QA: Remove deprecated methods for ML

2016-11-26 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master a88329d45 -> c4a7eef0c


[SPARK-18481][ML] ML 2.1 QA: Remove deprecated methods for ML

## What changes were proposed in this pull request?
Remove deprecated methods for ML.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang 

Closes #15913 from yanboliang/spark-18481.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c4a7eef0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c4a7eef0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c4a7eef0

Branch: refs/heads/master
Commit: c4a7eef0ce2d305c5c90a0a9a73b5a32eccfba95
Parents: a88329d
Author: Yanbo Liang 
Authored: Sat Nov 26 05:28:41 2016 -0800
Committer: Yanbo Liang 
Committed: Sat Nov 26 05:28:41 2016 -0800

--
 .../scala/org/apache/spark/ml/Pipeline.scala|  4 +
 .../spark/ml/classification/GBTClassifier.scala |  6 ++
 .../ml/classification/LogisticRegression.scala  |  8 +-
 .../classification/RandomForestClassifier.scala | 11 +--
 .../apache/spark/ml/feature/ChiSqSelector.scala |  7 --
 .../org/apache/spark/ml/param/params.scala  | 15 
 .../spark/ml/regression/GBTRegressor.scala  |  6 ++
 .../spark/ml/regression/LinearRegression.scala  |  3 -
 .../ml/regression/RandomForestRegressor.scala   | 10 +--
 .../org/apache/spark/ml/tree/treeModels.scala   |  5 --
 .../org/apache/spark/ml/tree/treeParams.scala   | 90 +---
 .../org/apache/spark/ml/util/ReadWrite.scala|  2 +-
 .../ml/classification/GBTClassifierSuite.scala  |  8 ++
 .../LogisticRegressionSuite.scala   |  6 ++
 project/MimaExcludes.scala  | 30 +++
 python/pyspark/ml/util.py   | 40 -
 16 files changed, 144 insertions(+), 107 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c4a7eef0/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala 
b/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala
index f406f8c..38176b9 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala
@@ -46,6 +46,10 @@ abstract class PipelineStage extends Params with Logging {
*
* Check transform validity and derive the output schema from the input 
schema.
*
+   * We check validity for interactions between parameters during 
`transformSchema` and
+   * raise an exception if any parameter value is invalid. Parameter value 
checks which
+   * do not depend on other parameters are handled by `Param.validate()`.
+   *
* Typical implementation should first conduct verification on schema change 
and parameter
* validity, including complex parameter interaction checks.
*/

http://git-wip-us.apache.org/repos/asf/spark/blob/c4a7eef0/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala 
b/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
index 52f93f5..ca52231 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
@@ -203,6 +203,12 @@ class GBTClassificationModel private[ml](
   @Since("1.4.0")
   override def trees: Array[DecisionTreeRegressionModel] = _trees
 
+  /**
+   * Number of trees in ensemble
+   */
+  @Since("2.0.0")
+  val getNumTrees: Int = trees.length
+
   @Since("1.4.0")
   override def treeWeights: Array[Double] = _treeWeights
 

http://git-wip-us.apache.org/repos/asf/spark/blob/c4a7eef0/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
index fe29926..41b84f4 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
@@ -40,7 +40,7 @@ import org.apache.spark.mllib.util.MLUtils
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.{DataFrame, Dataset, Row}
 import org.apache.spark.sql.functions.{col, lit}
-import org.apache.spark.sql.types.DoubleType
+import org.apache.spark.sql.types.{DataType, DoubleType, StructType}
 import org.apache.spark.storage.StorageLevel
 import org.apache.spark.util.VersionUtils
 
@@ -176,8 +176,12 @@ private[classification] tra

spark git commit: [SPARK-18481][ML] ML 2.1 QA: Remove deprecated methods for ML

2016-11-26 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 da66b9742 -> 830ee1345


[SPARK-18481][ML] ML 2.1 QA: Remove deprecated methods for ML

## What changes were proposed in this pull request?
Remove deprecated methods for ML.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang 

Closes #15913 from yanboliang/spark-18481.

(cherry picked from commit c4a7eef0ce2d305c5c90a0a9a73b5a32eccfba95)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/830ee134
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/830ee134
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/830ee134

Branch: refs/heads/branch-2.1
Commit: 830ee1345b491bf10fd089d931ef22e28f98e615
Parents: da66b97
Author: Yanbo Liang 
Authored: Sat Nov 26 05:28:41 2016 -0800
Committer: Yanbo Liang 
Committed: Sat Nov 26 05:29:32 2016 -0800

--
 .../scala/org/apache/spark/ml/Pipeline.scala|  4 +
 .../spark/ml/classification/GBTClassifier.scala |  6 ++
 .../ml/classification/LogisticRegression.scala  |  8 +-
 .../classification/RandomForestClassifier.scala | 11 +--
 .../apache/spark/ml/feature/ChiSqSelector.scala |  7 --
 .../org/apache/spark/ml/param/params.scala  | 15 
 .../spark/ml/regression/GBTRegressor.scala  |  6 ++
 .../spark/ml/regression/LinearRegression.scala  |  3 -
 .../ml/regression/RandomForestRegressor.scala   | 10 +--
 .../org/apache/spark/ml/tree/treeModels.scala   |  5 --
 .../org/apache/spark/ml/tree/treeParams.scala   | 90 +---
 .../org/apache/spark/ml/util/ReadWrite.scala|  2 +-
 .../ml/classification/GBTClassifierSuite.scala  |  8 ++
 .../LogisticRegressionSuite.scala   |  6 ++
 project/MimaExcludes.scala  | 30 +++
 python/pyspark/ml/util.py   | 40 -
 16 files changed, 144 insertions(+), 107 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/830ee134/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala 
b/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala
index f406f8c..38176b9 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala
@@ -46,6 +46,10 @@ abstract class PipelineStage extends Params with Logging {
*
* Check transform validity and derive the output schema from the input 
schema.
*
+   * We check validity for interactions between parameters during 
`transformSchema` and
+   * raise an exception if any parameter value is invalid. Parameter value 
checks which
+   * do not depend on other parameters are handled by `Param.validate()`.
+   *
* Typical implementation should first conduct verification on schema change 
and parameter
* validity, including complex parameter interaction checks.
*/

http://git-wip-us.apache.org/repos/asf/spark/blob/830ee134/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala 
b/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
index 52f93f5..ca52231 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
@@ -203,6 +203,12 @@ class GBTClassificationModel private[ml](
   @Since("1.4.0")
   override def trees: Array[DecisionTreeRegressionModel] = _trees
 
+  /**
+   * Number of trees in ensemble
+   */
+  @Since("2.0.0")
+  val getNumTrees: Int = trees.length
+
   @Since("1.4.0")
   override def treeWeights: Array[Double] = _treeWeights
 

http://git-wip-us.apache.org/repos/asf/spark/blob/830ee134/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
index fe29926..41b84f4 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
@@ -40,7 +40,7 @@ import org.apache.spark.mllib.util.MLUtils
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.{DataFrame, Dataset, Row}
 import org.apache.spark.sql.functions.{col, lit}
-import org.apache.spark.sql.types.DoubleType
+import org.apache.spark.sql.types.{DataType, DoubleType, StructType}
 import org.apache.spark.storage.

spark git commit: [SPARK-15819][PYSPARK][ML] Add KMeanSummary in KMeans of PySpark

2016-11-29 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 489845f3a -> 4c82ca86d


[SPARK-15819][PYSPARK][ML] Add KMeanSummary in KMeans of PySpark

## What changes were proposed in this pull request?

Add a Python API for `KMeansSummary`.
## How was this patch tested?

unit test added
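
The new Python properties are thin wrappers over the JVM-side model, so a 
Scala sketch of that underlying API may help place them (the app name and toy 
data are made-up assumptions):
```
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kmeans-summary-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
  Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)
).map(Tuple1.apply).toDF("features")

val model = new KMeans().setK(2).setSeed(1L).fit(df)

// hasSummary/summary are what the new PySpark properties delegate to.
if (model.hasSummary) {
  val summary = model.summary
  println(summary.k)                   // 2
  println(summary.clusterSizes.toSeq)  // e.g. two clusters of size 2
  summary.predictions.show()
}
```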

Author: Jeff Zhang 

Closes #13557 from zjffdu/SPARK-15819.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4c82ca86
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4c82ca86
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4c82ca86

Branch: refs/heads/master
Commit: 4c82ca86d979e5526a1583eef3c79c37dc68
Parents: 489845f
Author: Jeff Zhang 
Authored: Tue Nov 29 20:51:27 2016 -0800
Committer: Yanbo Liang 
Committed: Tue Nov 29 20:51:27 2016 -0800

--
 python/pyspark/ml/clustering.py | 41 
 python/pyspark/ml/tests.py  | 15 +
 2 files changed, 56 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4c82ca86/python/pyspark/ml/clustering.py
--
diff --git a/python/pyspark/ml/clustering.py b/python/pyspark/ml/clustering.py
index 7f8d845..35d0aef 100644
--- a/python/pyspark/ml/clustering.py
+++ b/python/pyspark/ml/clustering.py
@@ -292,6 +292,17 @@ class GaussianMixtureSummary(ClusteringSummary):
 return self._call_java("probability")
 
 
+class KMeansSummary(ClusteringSummary):
+"""
+.. note:: Experimental
+
+Summary of KMeans.
+
+.. versionadded:: 2.1.0
+"""
+pass
+
+
 class KMeansModel(JavaModel, JavaMLWritable, JavaMLReadable):
 """
 Model fitted by KMeans.
@@ -312,6 +323,27 @@ class KMeansModel(JavaModel, JavaMLWritable, 
JavaMLReadable):
 """
 return self._call_java("computeCost", dataset)
 
+@property
+@since("2.1.0")
+def hasSummary(self):
+"""
+Indicates whether a training summary exists for this model instance.
+"""
+return self._call_java("hasSummary")
+
+@property
+@since("2.1.0")
+def summary(self):
+"""
+Gets summary (e.g. cluster assignments, cluster sizes) of the model 
trained on the
+training set. An exception is thrown if no summary exists.
+"""
+if self.hasSummary:
+return KMeansSummary(self._call_java("summary"))
+else:
+raise RuntimeError("No training summary available for this %s" %
+   self.__class__.__name__)
+
 
 @inherit_doc
 class KMeans(JavaEstimator, HasFeaturesCol, HasPredictionCol, HasMaxIter, 
HasTol, HasSeed,
@@ -337,6 +369,13 @@ class KMeans(JavaEstimator, HasFeaturesCol, 
HasPredictionCol, HasMaxIter, HasTol
 True
 >>> rows[2].prediction == rows[3].prediction
 True
+>>> model.hasSummary
+True
+>>> summary = model.summary
+>>> summary.k
+2
+>>> summary.clusterSizes
+[2, 2]
 >>> kmeans_path = temp_path + "/kmeans"
 >>> kmeans.save(kmeans_path)
 >>> kmeans2 = KMeans.load(kmeans_path)
@@ -345,6 +384,8 @@ class KMeans(JavaEstimator, HasFeaturesCol, 
HasPredictionCol, HasMaxIter, HasTol
 >>> model_path = temp_path + "/kmeans_model"
 >>> model.save(model_path)
 >>> model2 = KMeansModel.load(model_path)
+>>> model2.hasSummary
+False
 >>> model.clusterCenters()[0] == model2.clusterCenters()[0]
 array([ True,  True], dtype=bool)
 >>> model.clusterCenters()[1] == model2.clusterCenters()[1]

http://git-wip-us.apache.org/repos/asf/spark/blob/4c82ca86/python/pyspark/ml/tests.py
--
diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py
index c0f0d40..a0c288a 100755
--- a/python/pyspark/ml/tests.py
+++ b/python/pyspark/ml/tests.py
@@ -1129,6 +1129,21 @@ class TrainingSummaryTest(SparkSessionTestCase):
 self.assertEqual(len(s.clusterSizes), 2)
 self.assertEqual(s.k, 2)
 
+def test_kmeans_summary(self):
+data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
+(Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
+df = self.spark.createDataFrame(data, ["features"])
+kmeans = KMeans(k=2, seed=1)
+model = kmeans.fit(df)
+self.assertTrue(model.hasSummary)
+s = model.summary
+self.assertTrue(isinstance(s.predictions, DataFrame))
+self.assertEqual(s.featuresCol, "features")
+self.assertEqual(s.predictionCol, "prediction")
+self.assertTrue(isinstance(s.cluster, DataFrame))
+self.assertEqual(len(s.clusterSizes), 2)
+self.assertEqual(s.k, 2)
+
 
 class OneVsRestTests(SparkSessionTestCase):
 



spark git commit: [SPARK-15819][PYSPARK][ML] Add KMeanSummary in KMeans of PySpark

2016-11-29 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 55b1142bd -> b95aad7ca


[SPARK-15819][PYSPARK][ML] Add KMeanSummary in KMeans of PySpark

## What changes were proposed in this pull request?

Add a Python API for `KMeansSummary`.
## How was this patch tested?

unit test added

Author: Jeff Zhang 

Closes #13557 from zjffdu/SPARK-15819.

(cherry picked from commit 4c82ca86d979e5526a1583eef3c79c37dc68)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b95aad7c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b95aad7c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b95aad7c

Branch: refs/heads/branch-2.1
Commit: b95aad7cad99a62851fe5e61692fda9bceb4b160
Parents: 55b1142
Author: Jeff Zhang 
Authored: Tue Nov 29 20:51:27 2016 -0800
Committer: Yanbo Liang 
Committed: Tue Nov 29 20:52:21 2016 -0800

--
 python/pyspark/ml/clustering.py | 41 
 python/pyspark/ml/tests.py  | 15 +
 2 files changed, 56 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b95aad7c/python/pyspark/ml/clustering.py
--
diff --git a/python/pyspark/ml/clustering.py b/python/pyspark/ml/clustering.py
index 7f8d845..35d0aef 100644
--- a/python/pyspark/ml/clustering.py
+++ b/python/pyspark/ml/clustering.py
@@ -292,6 +292,17 @@ class GaussianMixtureSummary(ClusteringSummary):
 return self._call_java("probability")
 
 
+class KMeansSummary(ClusteringSummary):
+"""
+.. note:: Experimental
+
+Summary of KMeans.
+
+.. versionadded:: 2.1.0
+"""
+pass
+
+
 class KMeansModel(JavaModel, JavaMLWritable, JavaMLReadable):
 """
 Model fitted by KMeans.
@@ -312,6 +323,27 @@ class KMeansModel(JavaModel, JavaMLWritable, 
JavaMLReadable):
 """
 return self._call_java("computeCost", dataset)
 
+@property
+@since("2.1.0")
+def hasSummary(self):
+"""
+Indicates whether a training summary exists for this model instance.
+"""
+return self._call_java("hasSummary")
+
+@property
+@since("2.1.0")
+def summary(self):
+"""
+Gets summary (e.g. cluster assignments, cluster sizes) of the model 
trained on the
+training set. An exception is thrown if no summary exists.
+"""
+if self.hasSummary:
+return KMeansSummary(self._call_java("summary"))
+else:
+raise RuntimeError("No training summary available for this %s" %
+   self.__class__.__name__)
+
 
 @inherit_doc
 class KMeans(JavaEstimator, HasFeaturesCol, HasPredictionCol, HasMaxIter, 
HasTol, HasSeed,
@@ -337,6 +369,13 @@ class KMeans(JavaEstimator, HasFeaturesCol, 
HasPredictionCol, HasMaxIter, HasTol
 True
 >>> rows[2].prediction == rows[3].prediction
 True
+>>> model.hasSummary
+True
+>>> summary = model.summary
+>>> summary.k
+2
+>>> summary.clusterSizes
+[2, 2]
 >>> kmeans_path = temp_path + "/kmeans"
 >>> kmeans.save(kmeans_path)
 >>> kmeans2 = KMeans.load(kmeans_path)
@@ -345,6 +384,8 @@ class KMeans(JavaEstimator, HasFeaturesCol, 
HasPredictionCol, HasMaxIter, HasTol
 >>> model_path = temp_path + "/kmeans_model"
 >>> model.save(model_path)
 >>> model2 = KMeansModel.load(model_path)
+>>> model2.hasSummary
+False
 >>> model.clusterCenters()[0] == model2.clusterCenters()[0]
 array([ True,  True], dtype=bool)
 >>> model.clusterCenters()[1] == model2.clusterCenters()[1]

http://git-wip-us.apache.org/repos/asf/spark/blob/b95aad7c/python/pyspark/ml/tests.py
--
diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py
index c0f0d40..a0c288a 100755
--- a/python/pyspark/ml/tests.py
+++ b/python/pyspark/ml/tests.py
@@ -1129,6 +1129,21 @@ class TrainingSummaryTest(SparkSessionTestCase):
 self.assertEqual(len(s.clusterSizes), 2)
 self.assertEqual(s.k, 2)
 
+def test_kmeans_summary(self):
+data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
+(Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
+df = self.spark.createDataFrame(data, ["features"])
+kmeans = KMeans(k=2, seed=1)
+model = kmeans.fit(df)
+self.assertTrue(model.hasSummary)
+s = model.summary
+self.assertTrue(isinstance(s.predictions, DataFrame))
+self.assertEqual(s.featuresCol, "features")
+self.assertEqual(s.predictionCol, "prediction")
+self.assertTrue(isinstance(s.cluster, DataFrame))
+self.assertEqual(len(s.clusterSizes), 2)
+self.assertEqual(s.k, 2)
+
 
 class

spark git commit: [SPARK-18476][SPARKR][ML] SparkR Logistic Regression should support output original label.

2016-11-30 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 0a811210f -> 2eb6764fb


[SPARK-18476][SPARKR][ML] SparkR Logistic Regression should support 
output original label.

## What changes were proposed in this pull request?

Similar to SPARK-18401, as a classification algorithm, logistic regression 
should output the original label instead of the index label.

In this PR, output of the original label is supported, and test cases are 
modified and added. The documentation is also updated.

## How was this patch tested?

Unit tests.
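
For context, recovering the original (string) label in spark.ml is usually 
done by pairing StringIndexer with IndexToString, which is the general pattern 
the SparkR wrapper applies internally. The sketch below is illustrative only; 
its data, app name, column names and pipeline layout are assumptions, not the 
wrapper's exact code:
```
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lor-original-label-sketch").getOrCreate()
import spark.implicits._

// Toy data with string labels.
val df = Seq(
  ("no", 1.14), ("no", 0.92), ("no", -0.95), ("yes", -1.11), ("yes", 0.28)
).toDF("label", "x")

// Index the string label, assemble features, fit, then map the numeric
// prediction back to the original label.
val indexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(df)
val assembler = new VectorAssembler().setInputCols(Array("x")).setOutputCol("features")
val lor = new LogisticRegression().setLabelCol("indexedLabel").setFeaturesCol("features")
val converter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(indexer.labels)

val pipeline = new Pipeline().setStages(Array[PipelineStage](indexer, assembler, lor, converter))
pipeline.fit(df).transform(df).select("label", "predictedLabel").show()
```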

Author: wm...@hotmail.com 

Closes #15910 from wangmiao1981/audit.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2eb6764f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2eb6764f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2eb6764f

Branch: refs/heads/master
Commit: 2eb6764fbb23553fc17772d8a4a1cad55ff7ba6e
Parents: 0a81121
Author: wm...@hotmail.com 
Authored: Wed Nov 30 20:32:17 2016 -0800
Committer: Yanbo Liang 
Committed: Wed Nov 30 20:32:17 2016 -0800

--
 R/pkg/R/mllib.R | 19 +-
 R/pkg/inst/tests/testthat/test_mllib.R  | 26 +-
 .../scala/org/apache/spark/SparkContext.scala   |  2 +-
 .../spark/ml/r/LogisticRegressionWrapper.scala  | 37 ++--
 4 files changed, 54 insertions(+), 30 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/2eb6764f/R/pkg/R/mllib.R
--
diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index 02bc645..eed8293 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -712,7 +712,6 @@ setMethod("predict", signature(object = "KMeansModel"),
 #'of L1 and L2. Default is 0.0 which is an L2 penalty.
 #' @param maxIter maximum iteration number.
 #' @param tol convergence tolerance of iterations.
-#' @param fitIntercept whether to fit an intercept term.
 #' @param family the name of family which is a description of the label 
distribution to be used in the model.
 #'   Supported options:
 #' \itemize{
@@ -747,11 +746,11 @@ setMethod("predict", signature(object = "KMeansModel"),
 #' \dontrun{
 #' sparkR.session()
 #' # binary logistic regression
-#' label <- c(1.0, 1.0, 1.0, 0.0, 0.0)
-#' feature <- c(1.1419053, 0.9194079, -0.9498666, -1.1069903, 0.2809776)
-#' binary_data <- as.data.frame(cbind(label, feature))
+#' label <- c(0.0, 0.0, 0.0, 1.0, 1.0)
+#' features <- c(1.1419053, 0.9194079, -0.9498666, -1.1069903, 0.2809776)
+#' binary_data <- as.data.frame(cbind(label, features))
 #' binary_df <- createDataFrame(binary_data)
-#' blr_model <- spark.logit(binary_df, label ~ feature, thresholds = 1.0)
+#' blr_model <- spark.logit(binary_df, label ~ features, thresholds = 1.0)
 #' blr_predict <- collect(select(predict(blr_model, binary_df), "prediction"))
 #'
 #' # summary of binary logistic regression
@@ -783,7 +782,7 @@ setMethod("predict", signature(object = "KMeansModel"),
 #' @note spark.logit since 2.1.0
 setMethod("spark.logit", signature(data = "SparkDataFrame", formula = 
"formula"),
   function(data, formula, regParam = 0.0, elasticNetParam = 0.0, 
maxIter = 100,
-   tol = 1E-6, fitIntercept = TRUE, family = "auto", 
standardization = TRUE,
+   tol = 1E-6, family = "auto", standardization = TRUE,
thresholds = 0.5, weightCol = NULL, aggregationDepth = 2,
probabilityCol = "probability") {
 formula <- paste(deparse(formula), collapse = "")
@@ -795,10 +794,10 @@ setMethod("spark.logit", signature(data = 
"SparkDataFrame", formula = "formula")
 jobj <- 
callJStatic("org.apache.spark.ml.r.LogisticRegressionWrapper", "fit",
 data@sdf, formula, as.numeric(regParam),
 as.numeric(elasticNetParam), 
as.integer(maxIter),
-as.numeric(tol), as.logical(fitIntercept),
-as.character(family), 
as.logical(standardization),
-as.array(thresholds), as.character(weightCol),
-as.integer(aggregationDepth), 
as.character(probabilityCol))
+as.numeric(tol), as.character(family),
+as.logical(standardization), 
as.array(thresholds),
+as.character(weightCol), 
as.integer(aggregationDepth),
+as.character(probabilityCol))
 new("LogisticRegressionModel", jobj = jobj)
   })
 

http://git-wip-us.apache.org/repos/asf/spark/blob/2eb6764f/R/pkg/inst/tests/testthat/test_mllib.R
-

spark git commit: [SPARK-18476][SPARKR][ML] SparkR Logistic Regression should support output original label.

2016-11-30 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 7d4596734 -> e8d8e3509


[SPARK-18476][SPARKR][ML] SparkR Logistic Regression should support 
output original label.

## What changes were proposed in this pull request?

Similar to SPARK-18401, as a classification algorithm, logistic regression 
should output the original label instead of the index label.

In this PR, output of the original label is supported, and test cases are 
modified and added. The documentation is also updated.

## How was this patch tested?

Unit tests.

Author: wm...@hotmail.com 

Closes #15910 from wangmiao1981/audit.

(cherry picked from commit 2eb6764fbb23553fc17772d8a4a1cad55ff7ba6e)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e8d8e350
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e8d8e350
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e8d8e350

Branch: refs/heads/branch-2.1
Commit: e8d8e350998e6e44a6dee7f78dbe2d1aa997c1d6
Parents: 7d45967
Author: wm...@hotmail.com 
Authored: Wed Nov 30 20:32:17 2016 -0800
Committer: Yanbo Liang 
Committed: Wed Nov 30 20:33:07 2016 -0800

--
 R/pkg/R/mllib.R | 19 +-
 R/pkg/inst/tests/testthat/test_mllib.R  | 26 +-
 .../scala/org/apache/spark/SparkContext.scala   |  2 +-
 .../spark/ml/r/LogisticRegressionWrapper.scala  | 37 ++--
 4 files changed, 54 insertions(+), 30 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e8d8e350/R/pkg/R/mllib.R
--
diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index 02bc645..eed8293 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -712,7 +712,6 @@ setMethod("predict", signature(object = "KMeansModel"),
 #'of L1 and L2. Default is 0.0 which is an L2 penalty.
 #' @param maxIter maximum iteration number.
 #' @param tol convergence tolerance of iterations.
-#' @param fitIntercept whether to fit an intercept term.
 #' @param family the name of family which is a description of the label 
distribution to be used in the model.
 #'   Supported options:
 #' \itemize{
@@ -747,11 +746,11 @@ setMethod("predict", signature(object = "KMeansModel"),
 #' \dontrun{
 #' sparkR.session()
 #' # binary logistic regression
-#' label <- c(1.0, 1.0, 1.0, 0.0, 0.0)
-#' feature <- c(1.1419053, 0.9194079, -0.9498666, -1.1069903, 0.2809776)
-#' binary_data <- as.data.frame(cbind(label, feature))
+#' label <- c(0.0, 0.0, 0.0, 1.0, 1.0)
+#' features <- c(1.1419053, 0.9194079, -0.9498666, -1.1069903, 0.2809776)
+#' binary_data <- as.data.frame(cbind(label, features))
 #' binary_df <- createDataFrame(binary_data)
-#' blr_model <- spark.logit(binary_df, label ~ feature, thresholds = 1.0)
+#' blr_model <- spark.logit(binary_df, label ~ features, thresholds = 1.0)
 #' blr_predict <- collect(select(predict(blr_model, binary_df), "prediction"))
 #'
 #' # summary of binary logistic regression
@@ -783,7 +782,7 @@ setMethod("predict", signature(object = "KMeansModel"),
 #' @note spark.logit since 2.1.0
 setMethod("spark.logit", signature(data = "SparkDataFrame", formula = 
"formula"),
   function(data, formula, regParam = 0.0, elasticNetParam = 0.0, 
maxIter = 100,
-   tol = 1E-6, fitIntercept = TRUE, family = "auto", 
standardization = TRUE,
+   tol = 1E-6, family = "auto", standardization = TRUE,
thresholds = 0.5, weightCol = NULL, aggregationDepth = 2,
probabilityCol = "probability") {
 formula <- paste(deparse(formula), collapse = "")
@@ -795,10 +794,10 @@ setMethod("spark.logit", signature(data = 
"SparkDataFrame", formula = "formula")
 jobj <- 
callJStatic("org.apache.spark.ml.r.LogisticRegressionWrapper", "fit",
 data@sdf, formula, as.numeric(regParam),
 as.numeric(elasticNetParam), 
as.integer(maxIter),
-as.numeric(tol), as.logical(fitIntercept),
-as.character(family), 
as.logical(standardization),
-as.array(thresholds), as.character(weightCol),
-as.integer(aggregationDepth), 
as.character(probabilityCol))
+as.numeric(tol), as.character(family),
+as.logical(standardization), 
as.array(thresholds),
+as.character(weightCol), 
as.integer(aggregationDepth),
+as.character(probabilityCol))
 new("LogisticRegressionModel", jobj = jobj)
   })
 

http://git-wip-us.apache.org/repos/asf/spark/

spark git commit: [SPARK-18625][ML] OneVsRestModel should support setFeaturesCol and setPredictionCol

2016-12-05 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 c13c2939f -> 88e07efe8


[SPARK-18625][ML] OneVsRestModel should support setFeaturesCol and 
setPredictionCol

## What changes were proposed in this pull request?
add `setFeaturesCol` and `setPredictionCol` for `OneVsRestModel`

## How was this patch tested?
added tests
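
A brief Scala usage sketch of the new setters (the app name, toy data and 
renamed column are made-up assumptions); the added test in the diff below 
exercises the same path:
```
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ovr-setters-sketch").getOrCreate()
import spark.implicits._

// Tiny three-class toy set.
val train = Seq(
  (0.0, Vectors.dense(0.0, 1.0)),
  (1.0, Vectors.dense(1.0, 0.0)),
  (2.0, Vectors.dense(1.0, 1.0))
).toDF("label", "features")

val ovrModel = new OneVsRest().setClassifier(new LogisticRegression).fit(train)

// New in this patch: point the fitted model at differently named columns.
val scoring = train.select($"features".as("fea"))
ovrModel.setFeaturesCol("fea").setPredictionCol("pred").transform(scoring).show()
```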

Author: Zheng RuiFeng 

Closes #16059 from zhengruifeng/ovrm_setCol.

(cherry picked from commit bdfe7f67468ecfd9927a1fec60d6605dd05ebe3f)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/88e07efe
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/88e07efe
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/88e07efe

Branch: refs/heads/branch-2.1
Commit: 88e07efe86512142eeada6a6f1f7fe858204c59b
Parents: c13c293
Author: Zheng RuiFeng 
Authored: Mon Dec 5 00:32:58 2016 -0800
Committer: Yanbo Liang 
Committed: Mon Dec 5 00:33:21 2016 -0800

--
 .../apache/spark/ml/classification/OneVsRest.scala|  9 +
 .../spark/ml/classification/OneVsRestSuite.scala  | 14 +-
 2 files changed, 22 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/88e07efe/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala 
b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala
index f4ab0a0..e58b30d 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala
@@ -140,6 +140,14 @@ final class OneVsRestModel private[ml] (
 this(uid, Metadata.empty, models.asScala.toArray)
   }
 
+  /** @group setParam */
+  @Since("2.1.0")
+  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+
+  /** @group setParam */
+  @Since("2.1.0")
+  def setPredictionCol(value: String): this.type = set(predictionCol, value)
+
   @Since("1.4.0")
   override def transformSchema(schema: StructType): StructType = {
 validateAndTransformSchema(schema, fitting = false, 
getClassifier.featuresDataType)
@@ -175,6 +183,7 @@ final class OneVsRestModel private[ml] (
 val updateUDF = udf { (predictions: Map[Int, Double], prediction: 
Vector) =>
   predictions + ((index, prediction(1)))
 }
+model.setFeaturesCol($(featuresCol))
 val transformedDataset = model.transform(df).select(columns: _*)
 val updatedDataset = transformedDataset
   .withColumn(tmpColName, updateUDF(col(accColName), 
col(rawPredictionCol)))

http://git-wip-us.apache.org/repos/asf/spark/blob/88e07efe/mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala
--
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala 
b/mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala
index 3f9bcec..aacb792 100644
--- 
a/mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala
+++ 
b/mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala
@@ -22,7 +22,7 @@ import org.apache.spark.ml.attribute.NominalAttribute
 import org.apache.spark.ml.classification.LogisticRegressionSuite._
 import org.apache.spark.ml.feature.LabeledPoint
 import org.apache.spark.ml.feature.StringIndexer
-import org.apache.spark.ml.linalg.{DenseMatrix, Vectors}
+import org.apache.spark.ml.linalg.Vectors
 import org.apache.spark.ml.param.{ParamMap, ParamsSuite}
 import org.apache.spark.ml.util.{DefaultReadWriteTest, MetadataUtils, 
MLTestingUtils}
 import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
@@ -33,6 +33,7 @@ import org.apache.spark.mllib.util.MLlibTestSparkContext
 import org.apache.spark.mllib.util.TestingUtils._
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.Dataset
+import org.apache.spark.sql.functions._
 import org.apache.spark.sql.types.Metadata
 
 class OneVsRestSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
@@ -136,6 +137,17 @@ class OneVsRestSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defau
 assert(outputFields.contains("p"))
   }
 
+  test("SPARK-18625 : OneVsRestModel should support setFeaturesCol and 
setPredictionCol") {
+val ova = new OneVsRest().setClassifier(new LogisticRegression)
+val ovaModel = ova.fit(dataset)
+val dataset2 = dataset.select(col("label").as("y"), 
col("features").as("fea"))
+ovaModel.setFeaturesCol("fea")
+ovaModel.setPredictionCol("pred")
+val transformedDataset = ovaModel.transform(dataset2)
+val outputFields = 

spark git commit: [SPARK-18625][ML] OneVsRestModel should support setFeaturesCol and setPredictionCol

2016-12-05 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master e9730b707 -> bdfe7f674


[SPARK-18625][ML] OneVsRestModel should support setFeaturesCol and 
setPredictionCol

## What changes were proposed in this pull request?
add `setFeaturesCol` and `setPredictionCol` for `OneVsRestModel`

## How was this patch tested?
added tests

Author: Zheng RuiFeng 

Closes #16059 from zhengruifeng/ovrm_setCol.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bdfe7f67
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bdfe7f67
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bdfe7f67

Branch: refs/heads/master
Commit: bdfe7f67468ecfd9927a1fec60d6605dd05ebe3f
Parents: e9730b7
Author: Zheng RuiFeng 
Authored: Mon Dec 5 00:32:58 2016 -0800
Committer: Yanbo Liang 
Committed: Mon Dec 5 00:32:58 2016 -0800

--
 .../apache/spark/ml/classification/OneVsRest.scala|  9 +
 .../spark/ml/classification/OneVsRestSuite.scala  | 14 +-
 2 files changed, 22 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/bdfe7f67/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala 
b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala
index f4ab0a0..e58b30d 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala
@@ -140,6 +140,14 @@ final class OneVsRestModel private[ml] (
 this(uid, Metadata.empty, models.asScala.toArray)
   }
 
+  /** @group setParam */
+  @Since("2.1.0")
+  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+
+  /** @group setParam */
+  @Since("2.1.0")
+  def setPredictionCol(value: String): this.type = set(predictionCol, value)
+
   @Since("1.4.0")
   override def transformSchema(schema: StructType): StructType = {
 validateAndTransformSchema(schema, fitting = false, 
getClassifier.featuresDataType)
@@ -175,6 +183,7 @@ final class OneVsRestModel private[ml] (
 val updateUDF = udf { (predictions: Map[Int, Double], prediction: 
Vector) =>
   predictions + ((index, prediction(1)))
 }
+model.setFeaturesCol($(featuresCol))
 val transformedDataset = model.transform(df).select(columns: _*)
 val updatedDataset = transformedDataset
   .withColumn(tmpColName, updateUDF(col(accColName), 
col(rawPredictionCol)))

http://git-wip-us.apache.org/repos/asf/spark/blob/bdfe7f67/mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala
--
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala 
b/mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala
index 3f9bcec..aacb792 100644
--- 
a/mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala
+++ 
b/mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala
@@ -22,7 +22,7 @@ import org.apache.spark.ml.attribute.NominalAttribute
 import org.apache.spark.ml.classification.LogisticRegressionSuite._
 import org.apache.spark.ml.feature.LabeledPoint
 import org.apache.spark.ml.feature.StringIndexer
-import org.apache.spark.ml.linalg.{DenseMatrix, Vectors}
+import org.apache.spark.ml.linalg.Vectors
 import org.apache.spark.ml.param.{ParamMap, ParamsSuite}
 import org.apache.spark.ml.util.{DefaultReadWriteTest, MetadataUtils, 
MLTestingUtils}
 import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
@@ -33,6 +33,7 @@ import org.apache.spark.mllib.util.MLlibTestSparkContext
 import org.apache.spark.mllib.util.TestingUtils._
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.Dataset
+import org.apache.spark.sql.functions._
 import org.apache.spark.sql.types.Metadata
 
 class OneVsRestSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
@@ -136,6 +137,17 @@ class OneVsRestSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defau
 assert(outputFields.contains("p"))
   }
 
+  test("SPARK-18625 : OneVsRestModel should support setFeaturesCol and 
setPredictionCol") {
+val ova = new OneVsRest().setClassifier(new LogisticRegression)
+val ovaModel = ova.fit(dataset)
+val dataset2 = dataset.select(col("label").as("y"), 
col("features").as("fea"))
+ovaModel.setFeaturesCol("fea")
+ovaModel.setPredictionCol("pred")
+val transformedDataset = ovaModel.transform(dataset2)
+val outputFields = transformedDataset.schema.fieldNames.toSet
+assert(outputFields === Set("y", "fea", "pred"))
+  }
+
  

spark git commit: [SPARK-18279][DOC][ML][SPARKR] Add R examples to ML programming guide.

2016-12-05 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master bdfe7f674 -> eb8dd6813


[SPARK-18279][DOC][ML][SPARKR] Add R examples to ML programming guide.

## What changes were proposed in this pull request?
Add R examples to the ML programming guide for the following algorithms, as a 
proof of concept (POC):
* spark.glm
* spark.survreg
* spark.naiveBayes
* spark.kmeans

The four algorithms have been available in SparkR since 2.0.0; docs for 
algorithms added during the 2.1 release cycle will be addressed in a separate 
follow-up PR.

## How was this patch tested?
These are screenshots of the generated ML programming guide for 
```GeneralizedLinearRegression```:
![image](https://cloud.githubusercontent.com/assets/1962026/20866403/babad856-b9e1-11e6-9984-62747801e8c4.png)

Author: Yanbo Liang 

Closes #16136 from yanboliang/spark-18279.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/eb8dd681
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/eb8dd681
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/eb8dd681

Branch: refs/heads/master
Commit: eb8dd68132998aa00902dfeb935db1358781e1c1
Parents: bdfe7f6
Author: Yanbo Liang 
Authored: Mon Dec 5 00:39:44 2016 -0800
Committer: Yanbo Liang 
Committed: Mon Dec 5 00:39:44 2016 -0800

--
 docs/ml-classification-regression.md | 22 ++
 docs/ml-clustering.md|  8 
 2 files changed, 30 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/eb8dd681/docs/ml-classification-regression.md
--
diff --git a/docs/ml-classification-regression.md 
b/docs/ml-classification-regression.md
index 43cc79b..5759593 100644
--- a/docs/ml-classification-regression.md
+++ b/docs/ml-classification-regression.md
@@ -389,6 +389,14 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.classificat
 
 {% include_example python/ml/naive_bayes_example.py %}
 
+
+
+
+Refer to the [R API docs](api/R/spark.naiveBayes.html) for more details.
+
+{% include_example naiveBayes r/ml.R %}
+
+
 
 
 
@@ -566,6 +574,13 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.regression.
 {% include_example python/ml/generalized_linear_regression_example.py %}
 
 
+
+
+Refer to the [R API docs](api/R/spark.glm.html) for more details.
+
+{% include_example glm r/ml.R %}
+
+
 
 
 
@@ -755,6 +770,13 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.regression.
 {% include_example python/ml/aft_survival_regression.py %}
 
 
+
+
+Refer to the [R API docs](api/R/spark.survreg.html) for more details.
+
+{% include_example survreg r/ml.R %}
+
+
 
 
 

http://git-wip-us.apache.org/repos/asf/spark/blob/eb8dd681/docs/ml-clustering.md
--
diff --git a/docs/ml-clustering.md b/docs/ml-clustering.md
index eedacb1..da23442 100644
--- a/docs/ml-clustering.md
+++ b/docs/ml-clustering.md
@@ -86,6 +86,14 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.clustering.
 
 {% include_example python/ml/kmeans_example.py %}
 
+
+
+
+Refer to the [R API docs](api/R/spark.kmeans.html) for more details.
+
+{% include_example kmeans r/ml.R %}
+
+
 
 
 ## Latent Dirichlet allocation (LDA)





spark git commit: [SPARK-18279][DOC][ML][SPARKR] Add R examples to ML programming guide.

2016-12-05 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 88e07efe8 -> 1821cbead


[SPARK-18279][DOC][ML][SPARKR] Add R examples to ML programming guide.

## What changes were proposed in this pull request?
Add R examples to ML programming guide for the following algorithms as POC:
* spark.glm
* spark.survreg
* spark.naiveBayes
* spark.kmeans

The four algorithms were added to SparkR in 2.0.0; more docs for algorithms 
added during the 2.1 release cycle will be addressed in a separate follow-up PR.

## How was this patch tested?
This is a screenshot of the generated ML programming guide for 
```GeneralizedLinearRegression```:
![image](https://cloud.githubusercontent.com/assets/1962026/20866403/babad856-b9e1-11e6-9984-62747801e8c4.png)

Author: Yanbo Liang 

Closes #16136 from yanboliang/spark-18279.

(cherry picked from commit eb8dd68132998aa00902dfeb935db1358781e1c1)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1821cbea
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1821cbea
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1821cbea

Branch: refs/heads/branch-2.1
Commit: 1821cbead1875fbe1c16d7c50563aa0839e1f70f
Parents: 88e07ef
Author: Yanbo Liang 
Authored: Mon Dec 5 00:39:44 2016 -0800
Committer: Yanbo Liang 
Committed: Mon Dec 5 00:40:33 2016 -0800

--
 docs/ml-classification-regression.md | 22 ++
 docs/ml-clustering.md|  8 
 2 files changed, 30 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1821cbea/docs/ml-classification-regression.md
--
diff --git a/docs/ml-classification-regression.md 
b/docs/ml-classification-regression.md
index c72c01f..5148ad0 100644
--- a/docs/ml-classification-regression.md
+++ b/docs/ml-classification-regression.md
@@ -389,6 +389,14 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.classificat
 
 {% include_example python/ml/naive_bayes_example.py %}
 
+
+
+
+Refer to the [R API docs](api/R/spark.naiveBayes.html) for more details.
+
+{% include_example naiveBayes r/ml.R %}
+
+
 
 
 
@@ -566,6 +574,13 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.regression.
 {% include_example python/ml/generalized_linear_regression_example.py %}
 
 
+
+
+Refer to the [R API docs](api/R/spark.glm.html) for more details.
+
+{% include_example glm r/ml.R %}
+
+
 
 
 
@@ -755,6 +770,13 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.regression.
 {% include_example python/ml/aft_survival_regression.py %}
 
 
+
+
+Refer to the [R API docs](api/R/spark.survreg.html) for more details.
+
+{% include_example survreg r/ml.R %}
+
+
 
 
 

http://git-wip-us.apache.org/repos/asf/spark/blob/1821cbea/docs/ml-clustering.md
--
diff --git a/docs/ml-clustering.md b/docs/ml-clustering.md
index 8a0a61c..4731abc 100644
--- a/docs/ml-clustering.md
+++ b/docs/ml-clustering.md
@@ -86,6 +86,14 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.clustering.
 
 {% include_example python/ml/kmeans_example.py %}
 
+
+
+
+Refer to the [R API docs](api/R/spark.kmeans.html) for more details.
+
+{% include_example kmeans r/ml.R %}
+
+
 
 
 ## Latent Dirichlet allocation (LDA)





spark git commit: [SPARK-18686][SPARKR][ML] Several cleanup and improvements for spark.logit.

2016-12-07 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 5c6bcdbda -> 90b59d1bf


[SPARK-18686][SPARKR][ML] Several cleanup and improvements for spark.logit.

## What changes were proposed in this pull request?
Several cleanup and improvements for ```spark.logit```:
* ```summary``` should return the coefficients matrix, and should output labels for 
each class if the model is a multinomial logistic regression model.
* ```summary``` should not return ```areaUnderROC, roc, pr, ...```, since most 
of them are DataFrames, which are less important for R users. Meanwhile, these 
metrics ignore instance weights (setting all to 1.0), which will change in a 
later Spark version. To avoid introducing breaking changes, we do not expose 
them currently.
* SparkR test improvement: comparing the training result with native R glmnet.
* Remove argument ```aggregationDepth``` from ```spark.logit```, since it's an 
expert Param (related to Spark architecture and job execution) that would 
rarely be used by R users.

## How was this patch tested?
Unit tests.

The ```summary``` output after this change:
multinomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> model <- spark.logit(df, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
 versicolor  virginica   setosa
(Intercept)  1.514031-2.609108   1.095077
Sepal_Length 0.02511006  0.2649821   -0.2900921
Sepal_Width  -0.5291215  -0.02016446 0.549286
Petal_Length 0.03647411  0.1544119   -0.190886
Petal_Width  0.000236092 0.4195804   -0.4198165
```
binomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> training <- df[df$Species %in% c("versicolor", "virginica"), ]
> model <- spark.logit(training, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
 Estimate
(Intercept)  -6.053815
Sepal_Length 0.2449379
Sepal_Width  0.1648321
Petal_Length 0.4730718
Petal_Width  1.031947
```

Author: Yanbo Liang 

Closes #16117 from yanboliang/spark-18686.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/90b59d1b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/90b59d1b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/90b59d1b

Branch: refs/heads/master
Commit: 90b59d1bf262b41c3a5f780697f504030f9d079c
Parents: 5c6bcdb
Author: Yanbo Liang 
Authored: Wed Dec 7 00:31:11 2016 -0800
Committer: Yanbo Liang 
Committed: Wed Dec 7 00:31:11 2016 -0800

--
 R/pkg/R/mllib.R |  86 +++--
 R/pkg/inst/tests/testthat/test_mllib.R  | 183 +--
 .../spark/ml/r/LogisticRegressionWrapper.scala  |  81 
 3 files changed, 203 insertions(+), 147 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/90b59d1b/R/pkg/R/mllib.R
--
diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index eed8293..074e9cb 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -733,8 +733,6 @@ setMethod("predict", signature(object = "KMeansModel"),
 #'  excepting that at most one value may be 0. The class with 
largest value p/t is predicted, where p
 #'  is the original probability of that class and t is the 
class's threshold.
 #' @param weightCol The weight column name.
-#' @param aggregationDepth depth for treeAggregate (>= 2). If the dimensions 
of features or the number of partitions
-#' are large, this param could be adjusted to a larger 
size.
 #' @param probabilityCol column name for predicted class conditional 
probabilities.
 #' @param ... additional arguments passed to the method.
 #' @return \code{spark.logit} returns a fitted logistic regression model
@@ -746,45 +744,35 @@ setMethod("predict", signature(object = "KMeansModel"),
 #' \dontrun{
 #' sparkR.session()
 #' # binary logistic regression
-#' label <- c(0.0, 0.0, 0.0, 1.0, 1.0)
-#' features <- c(1.1419053, 0.9194079, -0.9498666, -1.1069903, 0.2809776)
-#' binary_data <- as.data.frame(cbind(label, features))
-#' binary_df <- createDataFrame(binary_data)
-#' blr_model <- spark.logit(binary_df, label ~ features, thresholds = 1.0)
-#' blr_predict <- collect(select(predict(blr_model, binary_df), "prediction"))
-#'
-#' # summary of binary logistic regression
-#' blr_summary <- summary(blr_model)
-#' blr_fmeasure <- collect(select(blr_summary$fMeasureByThreshold, 
"threshold", "F-Measure"))
+#' df <- createDataFrame(iris)
+#' training <- df[df$Species %in% c("versicolor", "virginica"), ]
+#' model <- spark.logit(training, Species ~ ., regParam = 0.5)
+#' summary <- summary(model)
+#'
+#' # fitted values on training data
+#' fitted <- predict(model, training)
+#'
 #' # save fitted model to input path
 #' path <- "path/to/model"
-#' write.ml(blr_model, path)
+#' write.ml(m

spark git commit: [SPARK-18686][SPARKR][ML] Several cleanup and improvements for spark.logit.

2016-12-07 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 3750c6e9b -> 340e9aea4


[SPARK-18686][SPARKR][ML] Several cleanup and improvements for spark.logit.

## What changes were proposed in this pull request?
Several cleanup and improvements for ```spark.logit```:
* ```summary``` should return the coefficients matrix, and should output labels for 
each class if the model is a multinomial logistic regression model.
* ```summary``` should not return ```areaUnderROC, roc, pr, ...```, since most 
of them are DataFrames, which are less important for R users. Meanwhile, these 
metrics ignore instance weights (setting all to 1.0), which will change in a 
later Spark version. To avoid introducing breaking changes, we do not expose 
them currently.
* SparkR test improvement: comparing the training result with native R glmnet.
* Remove argument ```aggregationDepth``` from ```spark.logit```, since it's an 
expert Param (related to Spark architecture and job execution) that would 
rarely be used by R users.

## How was this patch tested?
Unit tests.

The ```summary``` output after this change:
multinomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> model <- spark.logit(df, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
 versicolor  virginica   setosa
(Intercept)  1.514031-2.609108   1.095077
Sepal_Length 0.02511006  0.2649821   -0.2900921
Sepal_Width  -0.5291215  -0.02016446 0.549286
Petal_Length 0.03647411  0.1544119   -0.190886
Petal_Width  0.000236092 0.4195804   -0.4198165
```
binomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> training <- df[df$Species %in% c("versicolor", "virginica"), ]
> model <- spark.logit(training, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
 Estimate
(Intercept)  -6.053815
Sepal_Length 0.2449379
Sepal_Width  0.1648321
Petal_Length 0.4730718
Petal_Width  1.031947
```

Author: Yanbo Liang 

Closes #16117 from yanboliang/spark-18686.

(cherry picked from commit 90b59d1bf262b41c3a5f780697f504030f9d079c)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/340e9aea
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/340e9aea
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/340e9aea

Branch: refs/heads/branch-2.1
Commit: 340e9aea4853805c42b8739004d93efe8fe16ba4
Parents: 3750c6e
Author: Yanbo Liang 
Authored: Wed Dec 7 00:31:11 2016 -0800
Committer: Yanbo Liang 
Committed: Wed Dec 7 00:32:32 2016 -0800

--
 R/pkg/R/mllib.R |  86 +++--
 R/pkg/inst/tests/testthat/test_mllib.R  | 183 +--
 .../spark/ml/r/LogisticRegressionWrapper.scala  |  81 
 3 files changed, 203 insertions(+), 147 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/340e9aea/R/pkg/R/mllib.R
--
diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index eed8293..074e9cb 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -733,8 +733,6 @@ setMethod("predict", signature(object = "KMeansModel"),
 #'  excepting that at most one value may be 0. The class with 
largest value p/t is predicted, where p
 #'  is the original probability of that class and t is the 
class's threshold.
 #' @param weightCol The weight column name.
-#' @param aggregationDepth depth for treeAggregate (>= 2). If the dimensions 
of features or the number of partitions
-#' are large, this param could be adjusted to a larger 
size.
 #' @param probabilityCol column name for predicted class conditional 
probabilities.
 #' @param ... additional arguments passed to the method.
 #' @return \code{spark.logit} returns a fitted logistic regression model
@@ -746,45 +744,35 @@ setMethod("predict", signature(object = "KMeansModel"),
 #' \dontrun{
 #' sparkR.session()
 #' # binary logistic regression
-#' label <- c(0.0, 0.0, 0.0, 1.0, 1.0)
-#' features <- c(1.1419053, 0.9194079, -0.9498666, -1.1069903, 0.2809776)
-#' binary_data <- as.data.frame(cbind(label, features))
-#' binary_df <- createDataFrame(binary_data)
-#' blr_model <- spark.logit(binary_df, label ~ features, thresholds = 1.0)
-#' blr_predict <- collect(select(predict(blr_model, binary_df), "prediction"))
-#'
-#' # summary of binary logistic regression
-#' blr_summary <- summary(blr_model)
-#' blr_fmeasure <- collect(select(blr_summary$fMeasureByThreshold, 
"threshold", "F-Measure"))
+#' df <- createDataFrame(iris)
+#' training <- df[df$Species %in% c("versicolor", "virginica"), ]
+#' model <- spark.logit(training, Species ~ ., regParam = 0.5)
+#' summary <- summary(model)
+#'
+#' # fitted values on training data
+#' fitted <- predict(model, training)
+#'
 #' 

spark git commit: [SPARK-18705][ML][DOC] Update user guide to reflect one pass solver for L1 and elastic-net

2016-12-07 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 9ab725eab -> 82253617f


[SPARK-18705][ML][DOC] Update user guide to reflect one pass solver for L1 and 
elastic-net

## What changes were proposed in this pull request?

WeightedLeastSquares now supports L1 and elastic net penalties and has an 
additional solver option: QuasiNewton. The docs are updated to reflect this 
change.
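
As a rough illustration (a sketch, not taken from this patch; the ```training``` 
DataFrame and the parameter values are assumed), requesting the normal-equation 
solver together with elastic-net regularization on ```LinearRegression``` looks like:

```
import org.apache.spark.ml.regression.LinearRegression

// Assumed: `training` is a DataFrame with "features" and "label" columns.
val lr = new LinearRegression()
  .setSolver("normal")      // normal-equation path backed by WeightedLeastSquares
  .setRegParam(0.01)        // regularization strength (lambda)
  .setElasticNetParam(0.5)  // alpha > 0 has no analytic solution, so the Quasi-Newton path is used
// val model = lr.fit(training)
```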

## How was this patch tested?

Docs only. Generated documentation to make sure the LaTeX looks OK.

Author: sethah 

Closes #16139 from sethah/SPARK-18705.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/82253617
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/82253617
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/82253617

Branch: refs/heads/master
Commit: 82253617f5b3cdbd418c48f94e748651ee80077e
Parents: 9ab725e
Author: sethah 
Authored: Wed Dec 7 19:41:32 2016 -0800
Committer: Yanbo Liang 
Committed: Wed Dec 7 19:41:32 2016 -0800

--
 docs/ml-advanced.md | 24 
 1 file changed, 16 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/82253617/docs/ml-advanced.md
--
diff --git a/docs/ml-advanced.md b/docs/ml-advanced.md
index 12a03d3..2747f2d 100644
--- a/docs/ml-advanced.md
+++ b/docs/ml-advanced.md
@@ -59,17 +59,25 @@ Given $n$ weighted observations $(w_i, a_i, b_i)$:
 
 The number of features for each observation is $m$. We use the following 
weighted least squares formulation:
 `\[   
-minimize_{x}\frac{1}{2} \sum_{i=1}^n \frac{w_i(a_i^T x -b_i)^2}{\sum_{k=1}^n 
w_k} + \frac{1}{2}\frac{\lambda}{\delta}\sum_{j=1}^m(\sigma_{j} x_{j})^2
+\min_{\mathbf{x}}\frac{1}{2} \sum_{i=1}^n \frac{w_i(\mathbf{a}_i^T \mathbf{x} 
-b_i)^2}{\sum_{k=1}^n w_k} + \frac{\lambda}{\delta}\left[\frac{1}{2}(1 - 
\alpha)\sum_{j=1}^m(\sigma_j x_j)^2 + \alpha\sum_{j=1}^m |\sigma_j x_j|\right]
 \]`
-where $\lambda$ is the regularization parameter, $\delta$ is the population 
standard deviation of the label
+where $\lambda$ is the regularization parameter, $\alpha$ is the elastic-net 
mixing parameter, $\delta$ is the population standard deviation of the label
 and $\sigma_j$ is the population standard deviation of the j-th feature column.
 
-This objective function has an analytic solution and it requires only one pass 
over the data to collect necessary statistics to solve.
-Unlike the original dataset which can only be stored in a distributed system,
-these statistics can be loaded into memory on a single machine if the number 
of features is relatively small, and then we can solve the objective function 
through Cholesky factorization on the driver.
+This objective function requires only one pass over the data to collect the 
statistics necessary to solve it. For an
+$n \times m$ data matrix, these statistics require only $O(m^2)$ storage and 
so can be stored on a single machine when $m$ (the number of features) is
+relatively small. We can then solve the normal equations on a single machine 
using local methods like direct Cholesky factorization or iterative 
optimization programs.
 
-WeightedLeastSquares only supports L2 regularization and provides options to 
enable or disable regularization and standardization.
-In order to make the normal equation approach efficient, WeightedLeastSquares 
requires that the number of features be no more than 4096. For larger problems, 
use L-BFGS instead.
+Spark MLlib currently supports two types of solvers for the normal equations: 
Cholesky factorization and Quasi-Newton methods (L-BFGS/OWL-QN). Cholesky 
factorization
+depends on a positive definite covariance matrix (i.e. columns of the data 
matrix must be linearly independent) and will fail if this condition is 
violated. Quasi-Newton methods
+are still capable of providing a reasonable solution even when the covariance 
matrix is not positive definite, so the normal equation solver can also fall 
back to 
+Quasi-Newton methods in this case. This fallback is currently always enabled 
for the `LinearRegression` and `GeneralizedLinearRegression` estimators.
+
+`WeightedLeastSquares` supports L1, L2, and elastic-net regularization and 
provides options to enable or disable regularization and standardization. In 
the case where no 
+L1 regularization is applied (i.e. $\alpha = 0$), there exists an analytical 
solution and either Cholesky or Quasi-Newton solver may be used. When $\alpha > 
0$ no analytical 
+solution exists and we instead use the Quasi-Newton solver to find the 
coefficients iteratively. 
+
+In order to make the normal equation approach efficient, 
`WeightedLeastSquares` requires that the number of features be no more than 
4096. For larger problems, use L-BFGS instead.
 
 ## Iteratively reweighted least

spark git commit: [SPARK-18705][ML][DOC] Update user guide to reflect one pass solver for L1 and elastic-net

2016-12-07 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 617ce3ba7 -> ab865cfd9


[SPARK-18705][ML][DOC] Update user guide to reflect one pass solver for L1 and 
elastic-net

## What changes were proposed in this pull request?

WeightedLeastSquares now supports L1 and elastic net penalties and has an 
additional solver option: QuasiNewton. The docs are updated to reflect this 
change.

## How was this patch tested?

Docs only. Generated documentation to make sure the LaTeX looks OK.

Author: sethah 

Closes #16139 from sethah/SPARK-18705.

(cherry picked from commit 82253617f5b3cdbd418c48f94e748651ee80077e)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ab865cfd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ab865cfd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ab865cfd

Branch: refs/heads/branch-2.1
Commit: ab865cfd9dc87154e7d4fc5d09168868c88db6b0
Parents: 617ce3b
Author: sethah 
Authored: Wed Dec 7 19:41:32 2016 -0800
Committer: Yanbo Liang 
Committed: Wed Dec 7 19:42:06 2016 -0800

--
 docs/ml-advanced.md | 24 
 1 file changed, 16 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ab865cfd/docs/ml-advanced.md
--
diff --git a/docs/ml-advanced.md b/docs/ml-advanced.md
index 12a03d3..2747f2d 100644
--- a/docs/ml-advanced.md
+++ b/docs/ml-advanced.md
@@ -59,17 +59,25 @@ Given $n$ weighted observations $(w_i, a_i, b_i)$:
 
 The number of features for each observation is $m$. We use the following 
weighted least squares formulation:
 `\[   
-minimize_{x}\frac{1}{2} \sum_{i=1}^n \frac{w_i(a_i^T x -b_i)^2}{\sum_{k=1}^n 
w_k} + \frac{1}{2}\frac{\lambda}{\delta}\sum_{j=1}^m(\sigma_{j} x_{j})^2
+\min_{\mathbf{x}}\frac{1}{2} \sum_{i=1}^n \frac{w_i(\mathbf{a}_i^T \mathbf{x} 
-b_i)^2}{\sum_{k=1}^n w_k} + \frac{\lambda}{\delta}\left[\frac{1}{2}(1 - 
\alpha)\sum_{j=1}^m(\sigma_j x_j)^2 + \alpha\sum_{j=1}^m |\sigma_j x_j|\right]
 \]`
-where $\lambda$ is the regularization parameter, $\delta$ is the population 
standard deviation of the label
+where $\lambda$ is the regularization parameter, $\alpha$ is the elastic-net 
mixing parameter, $\delta$ is the population standard deviation of the label
 and $\sigma_j$ is the population standard deviation of the j-th feature column.
 
-This objective function has an analytic solution and it requires only one pass 
over the data to collect necessary statistics to solve.
-Unlike the original dataset which can only be stored in a distributed system,
-these statistics can be loaded into memory on a single machine if the number 
of features is relatively small, and then we can solve the objective function 
through Cholesky factorization on the driver.
+This objective function requires only one pass over the data to collect the 
statistics necessary to solve it. For an
+$n \times m$ data matrix, these statistics require only $O(m^2)$ storage and 
so can be stored on a single machine when $m$ (the number of features) is
+relatively small. We can then solve the normal equations on a single machine 
using local methods like direct Cholesky factorization or iterative 
optimization programs.
 
-WeightedLeastSquares only supports L2 regularization and provides options to 
enable or disable regularization and standardization.
-In order to make the normal equation approach efficient, WeightedLeastSquares 
requires that the number of features be no more than 4096. For larger problems, 
use L-BFGS instead.
+Spark MLlib currently supports two types of solvers for the normal equations: 
Cholesky factorization and Quasi-Newton methods (L-BFGS/OWL-QN). Cholesky 
factorization
+depends on a positive definite covariance matrix (i.e. columns of the data 
matrix must be linearly independent) and will fail if this condition is 
violated. Quasi-Newton methods
+are still capable of providing a reasonable solution even when the covariance 
matrix is not positive definite, so the normal equation solver can also fall 
back to 
+Quasi-Newton methods in this case. This fallback is currently always enabled 
for the `LinearRegression` and `GeneralizedLinearRegression` estimators.
+
+`WeightedLeastSquares` supports L1, L2, and elastic-net regularization and 
provides options to enable or disable regularization and standardization. In 
the case where no 
+L1 regularization is applied (i.e. $\alpha = 0$), there exists an analytical 
solution and either Cholesky or Quasi-Newton solver may be used. When $\alpha > 
0$ no analytical 
+solution exists and we instead use the Quasi-Newton solver to find the 
coefficients iteratively. 
+
+In order to make the normal equation approach efficient, 
`WeightedLeastSquares` requires that the number of f

spark git commit: [SPARK-18326][SPARKR][ML] Review SparkR ML wrappers API for 2.1

2016-12-07 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 ab865cfd9 -> 1c3f1da82


[SPARK-18326][SPARKR][ML] Review SparkR ML wrappers API for 2.1

## What changes were proposed in this pull request?
Reviewing SparkR ML wrappers API for 2.1 release, mainly two issues:
* Remove ```probabilityCol``` from the argument list of ```spark.logit``` and 
```spark.randomForest```, since it is used when making predictions and should 
be an argument of ```predict```; we will work on this at 
[SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) in the next 
release cycle.
* Fix ```spark.als``` params to make them consistent with MLlib.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang 

Closes #16169 from yanboliang/spark-18326.

(cherry picked from commit 97255497d885f0f8ccfc808e868bc8aa5e4d1063)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1c3f1da8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1c3f1da8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1c3f1da8

Branch: refs/heads/branch-2.1
Commit: 1c3f1da82356426b6b550fee67e66dc82eaf1c85
Parents: ab865cf
Author: Yanbo Liang 
Authored: Wed Dec 7 20:23:28 2016 -0800
Committer: Yanbo Liang 
Committed: Wed Dec 7 20:23:45 2016 -0800

--
 R/pkg/R/mllib.R | 23 +---
 R/pkg/inst/tests/testthat/test_mllib.R  |  4 ++--
 .../spark/ml/r/LogisticRegressionWrapper.scala  |  4 +---
 .../r/RandomForestClassificationWrapper.scala   |  2 --
 4 files changed, 13 insertions(+), 20 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1c3f1da8/R/pkg/R/mllib.R
--
diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index 074e9cb..632e4ad 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -733,7 +733,6 @@ setMethod("predict", signature(object = "KMeansModel"),
 #'  excepting that at most one value may be 0. The class with 
largest value p/t is predicted, where p
 #'  is the original probability of that class and t is the 
class's threshold.
 #' @param weightCol The weight column name.
-#' @param probabilityCol column name for predicted class conditional 
probabilities.
 #' @param ... additional arguments passed to the method.
 #' @return \code{spark.logit} returns a fitted logistic regression model
 #' @rdname spark.logit
@@ -772,7 +771,7 @@ setMethod("predict", signature(object = "KMeansModel"),
 setMethod("spark.logit", signature(data = "SparkDataFrame", formula = 
"formula"),
   function(data, formula, regParam = 0.0, elasticNetParam = 0.0, 
maxIter = 100,
tol = 1E-6, family = "auto", standardization = TRUE,
-   thresholds = 0.5, weightCol = NULL, probabilityCol = 
"probability") {
+   thresholds = 0.5, weightCol = NULL) {
 formula <- paste(deparse(formula), collapse = "")
 
 if (is.null(weightCol)) {
@@ -784,7 +783,7 @@ setMethod("spark.logit", signature(data = "SparkDataFrame", 
formula = "formula")
 as.numeric(elasticNetParam), 
as.integer(maxIter),
 as.numeric(tol), as.character(family),
 as.logical(standardization), 
as.array(thresholds),
-as.character(weightCol), 
as.character(probabilityCol))
+as.character(weightCol))
 new("LogisticRegressionModel", jobj = jobj)
   })
 
@@ -1425,7 +1424,7 @@ setMethod("predict", signature(object = 
"GaussianMixtureModel"),
 #' @param userCol column name for user ids. Ids must be (or can be coerced 
into) integers.
 #' @param itemCol column name for item ids. Ids must be (or can be coerced 
into) integers.
 #' @param rank rank of the matrix factorization (> 0).
-#' @param reg regularization parameter (>= 0).
+#' @param regParam regularization parameter (>= 0).
 #' @param maxIter maximum number of iterations (>= 0).
 #' @param nonnegative logical value indicating whether to apply nonnegativity 
constraints.
 #' @param implicitPrefs logical value indicating whether to use implicit 
preference.
@@ -1464,21 +1463,21 @@ setMethod("predict", signature(object = 
"GaussianMixtureModel"),
 #'
 #' # set other arguments
 #' modelS <- spark.als(df, "rating", "user", "item", rank = 20,
-#' reg = 0.1, nonnegative = TRUE)
+#' regParam = 0.1, nonnegative = TRUE)
 #' statsS <- summary(modelS)
 #' }
 #' @note spark.als since 2.1.0
 setMethod("spark.als", signature(data = "SparkDataFrame"),
   function(data, ratingCol = "rating", userCol = "user", itemCol = 
"item",
-   rank = 10, reg = 0.1, maxI

spark git commit: [SPARK-18326][SPARKR][ML] Review SparkR ML wrappers API for 2.1

2016-12-07 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 82253617f -> 97255497d


[SPARK-18326][SPARKR][ML] Review SparkR ML wrappers API for 2.1

## What changes were proposed in this pull request?
Reviewing SparkR ML wrappers API for 2.1 release, mainly two issues:
* Remove ```probabilityCol``` from the argument list of ```spark.logit``` and 
```spark.randomForest```, since it is used when making predictions and should 
be an argument of ```predict```; we will work on this at 
[SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) in the next 
release cycle.
* Fix ```spark.als``` params to make them consistent with MLlib.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang 

Closes #16169 from yanboliang/spark-18326.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/97255497
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/97255497
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/97255497

Branch: refs/heads/master
Commit: 97255497d885f0f8ccfc808e868bc8aa5e4d1063
Parents: 8225361
Author: Yanbo Liang 
Authored: Wed Dec 7 20:23:28 2016 -0800
Committer: Yanbo Liang 
Committed: Wed Dec 7 20:23:28 2016 -0800

--
 R/pkg/R/mllib.R | 23 +---
 R/pkg/inst/tests/testthat/test_mllib.R  |  4 ++--
 .../spark/ml/r/LogisticRegressionWrapper.scala  |  4 +---
 .../r/RandomForestClassificationWrapper.scala   |  2 --
 4 files changed, 13 insertions(+), 20 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/97255497/R/pkg/R/mllib.R
--
diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index 074e9cb..632e4ad 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -733,7 +733,6 @@ setMethod("predict", signature(object = "KMeansModel"),
 #'  excepting that at most one value may be 0. The class with 
largest value p/t is predicted, where p
 #'  is the original probability of that class and t is the 
class's threshold.
 #' @param weightCol The weight column name.
-#' @param probabilityCol column name for predicted class conditional 
probabilities.
 #' @param ... additional arguments passed to the method.
 #' @return \code{spark.logit} returns a fitted logistic regression model
 #' @rdname spark.logit
@@ -772,7 +771,7 @@ setMethod("predict", signature(object = "KMeansModel"),
 setMethod("spark.logit", signature(data = "SparkDataFrame", formula = 
"formula"),
   function(data, formula, regParam = 0.0, elasticNetParam = 0.0, 
maxIter = 100,
tol = 1E-6, family = "auto", standardization = TRUE,
-   thresholds = 0.5, weightCol = NULL, probabilityCol = 
"probability") {
+   thresholds = 0.5, weightCol = NULL) {
 formula <- paste(deparse(formula), collapse = "")
 
 if (is.null(weightCol)) {
@@ -784,7 +783,7 @@ setMethod("spark.logit", signature(data = "SparkDataFrame", 
formula = "formula")
 as.numeric(elasticNetParam), 
as.integer(maxIter),
 as.numeric(tol), as.character(family),
 as.logical(standardization), 
as.array(thresholds),
-as.character(weightCol), 
as.character(probabilityCol))
+as.character(weightCol))
 new("LogisticRegressionModel", jobj = jobj)
   })
 
@@ -1425,7 +1424,7 @@ setMethod("predict", signature(object = 
"GaussianMixtureModel"),
 #' @param userCol column name for user ids. Ids must be (or can be coerced 
into) integers.
 #' @param itemCol column name for item ids. Ids must be (or can be coerced 
into) integers.
 #' @param rank rank of the matrix factorization (> 0).
-#' @param reg regularization parameter (>= 0).
+#' @param regParam regularization parameter (>= 0).
 #' @param maxIter maximum number of iterations (>= 0).
 #' @param nonnegative logical value indicating whether to apply nonnegativity 
constraints.
 #' @param implicitPrefs logical value indicating whether to use implicit 
preference.
@@ -1464,21 +1463,21 @@ setMethod("predict", signature(object = 
"GaussianMixtureModel"),
 #'
 #' # set other arguments
 #' modelS <- spark.als(df, "rating", "user", "item", rank = 20,
-#' reg = 0.1, nonnegative = TRUE)
+#' regParam = 0.1, nonnegative = TRUE)
 #' statsS <- summary(modelS)
 #' }
 #' @note spark.als since 2.1.0
 setMethod("spark.als", signature(data = "SparkDataFrame"),
   function(data, ratingCol = "rating", userCol = "user", itemCol = 
"item",
-   rank = 10, reg = 0.1, maxIter = 10, nonnegative = FALSE,
+   rank = 10, regParam = 0.1, maxIter = 10, nonnegative = 

spark git commit: [SPARK-18325][SPARKR][ML] SparkR ML wrappers example code and user guide

2016-12-08 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master b47b892e4 -> 9bf8f3cd4


[SPARK-18325][SPARKR][ML] SparkR ML wrappers example code and user guide

## What changes were proposed in this pull request?
* Add all R examples for ML wrappers which were added during the 2.1 release cycle.
* Split the whole ```ml.R``` example file into individual examples for each 
algorithm, which will make it convenient for users to rerun them.
* Add corresponding examples to ML user guide.
* Update ML section of SparkR user guide.

Note: MLlib Scala/Java/Python examples will be consistent; however, SparkR 
examples may differ from them, since R users may use the algorithms in a 
different way, for example, using an R ```formula``` to specify ```featuresCol``` 
and ```labelCol```.

## How was this patch tested?
Run all examples manually.

Author: Yanbo Liang 

Closes #16148 from yanboliang/spark-18325.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9bf8f3cd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9bf8f3cd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9bf8f3cd

Branch: refs/heads/master
Commit: 9bf8f3cd4f62f921c32fb50b8abf49576a80874f
Parents: b47b892
Author: Yanbo Liang 
Authored: Thu Dec 8 06:19:38 2016 -0800
Committer: Yanbo Liang 
Committed: Thu Dec 8 06:19:38 2016 -0800

--
 docs/ml-classification-regression.md |  67 +++-
 docs/ml-clustering.md|  18 +++-
 docs/ml-collaborative-filtering.md   |   8 ++
 docs/sparkr.md   |  46 
 examples/src/main/r/ml.R | 148 --
 examples/src/main/r/ml/als.R |  45 
 examples/src/main/r/ml/gaussianMixture.R |  42 
 examples/src/main/r/ml/gbt.R |  63 +++
 examples/src/main/r/ml/glm.R |  57 ++
 examples/src/main/r/ml/isoreg.R  |  42 
 examples/src/main/r/ml/kmeans.R  |  44 
 examples/src/main/r/ml/kstest.R  |  39 +++
 examples/src/main/r/ml/lda.R |  46 
 examples/src/main/r/ml/logit.R   |  63 +++
 examples/src/main/r/ml/ml.R  |  65 +++
 examples/src/main/r/ml/mlp.R |  48 +
 examples/src/main/r/ml/naiveBayes.R  |  41 +++
 examples/src/main/r/ml/randomForest.R|  63 +++
 examples/src/main/r/ml/survreg.R |  43 
 19 files changed, 810 insertions(+), 178 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9bf8f3cd/docs/ml-classification-regression.md
--
diff --git a/docs/ml-classification-regression.md 
b/docs/ml-classification-regression.md
index bb9390f..782ee58 100644
--- a/docs/ml-classification-regression.md
+++ b/docs/ml-classification-regression.md
@@ -75,6 +75,13 @@ More details on parameters can be found in the [Python API 
documentation](api/py
 {% include_example python/ml/logistic_regression_with_elastic_net.py %}
 
 
+
+
+More details on parameters can be found in the [R API 
documentation](api/R/spark.logit.html).
+
+{% include_example binomial r/ml/logit.R %}
+
+
 
 
 The `spark.ml` implementation of logistic regression also supports
@@ -171,6 +178,13 @@ model with elastic net regularization.
 {% include_example 
python/ml/multiclass_logistic_regression_with_elastic_net.py %}
 
 
+
+
+More details on parameters can be found in the [R API 
documentation](api/R/spark.logit.html).
+
+{% include_example multinomial r/ml/logit.R %}
+
+
 
 
 
@@ -242,6 +256,14 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.classificat
 
 {% include_example python/ml/random_forest_classifier_example.py %}
 
+
+
+
+Refer to the [R API docs](api/R/spark.randomForest.html) for more details.
+
+{% include_example classification r/ml/randomForest.R %}
+
+
 
 
 ## Gradient-boosted tree classifier
@@ -275,6 +297,14 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.classificat
 
 {% include_example python/ml/gradient_boosted_tree_classifier_example.py %}
 
+
+
+
+Refer to the [R API docs](api/R/spark.gbt.html) for more details.
+
+{% include_example classification r/ml/gbt.R %}
+
+
 
 
 ## Multilayer perceptron classifier
@@ -324,6 +354,13 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.classificat
 {% include_example python/ml/multilayer_perceptron_classification.py %}
 
 
+
+
+Refer to the [R API docs](api/R/spark.mlp.html) for more details.
+
+{% include_example r/ml/mlp.R %}
+
+
 
 
 
@@ -400,7 +437,7 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.classificat
 
 Refer to the [R API docs](api/R/spark.naiveBayes.html) for more details.
 
-{% include_example naiveBayes r/ml.R %}
+{% 

spark git commit: [SPARK-18325][SPARKR][ML] SparkR ML wrappers example code and user guide

2016-12-08 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 48aa6775d -> 9095c152e


[SPARK-18325][SPARKR][ML] SparkR ML wrappers example code and user guide

## What changes were proposed in this pull request?
* Add all R examples for ML wrappers which were added during the 2.1 release cycle.
* Split the whole ```ml.R``` example file into individual examples for each 
algorithm, which will make it convenient for users to rerun them.
* Add corresponding examples to ML user guide.
* Update ML section of SparkR user guide.

Note: MLlib Scala/Java/Python examples will be consistent; however, SparkR 
examples may differ from them, since R users may use the algorithms in a 
different way, for example, using an R ```formula``` to specify ```featuresCol``` 
and ```labelCol```.

## How was this patch tested?
Run all examples manually.

Author: Yanbo Liang 

Closes #16148 from yanboliang/spark-18325.

(cherry picked from commit 9bf8f3cd4f62f921c32fb50b8abf49576a80874f)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9095c152
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9095c152
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9095c152

Branch: refs/heads/branch-2.1
Commit: 9095c152e7fedf469dcc4887f5b6a1882cd74c28
Parents: 48aa677
Author: Yanbo Liang 
Authored: Thu Dec 8 06:19:38 2016 -0800
Committer: Yanbo Liang 
Committed: Thu Dec 8 06:20:28 2016 -0800

--
 docs/ml-classification-regression.md |  67 +++-
 docs/ml-clustering.md|  18 +++-
 docs/ml-collaborative-filtering.md   |   8 ++
 docs/sparkr.md   |  46 
 examples/src/main/r/ml.R | 148 --
 examples/src/main/r/ml/als.R |  45 
 examples/src/main/r/ml/gaussianMixture.R |  42 
 examples/src/main/r/ml/gbt.R |  63 +++
 examples/src/main/r/ml/glm.R |  57 ++
 examples/src/main/r/ml/isoreg.R  |  42 
 examples/src/main/r/ml/kmeans.R  |  44 
 examples/src/main/r/ml/kstest.R  |  39 +++
 examples/src/main/r/ml/lda.R |  46 
 examples/src/main/r/ml/logit.R   |  63 +++
 examples/src/main/r/ml/ml.R  |  65 +++
 examples/src/main/r/ml/mlp.R |  48 +
 examples/src/main/r/ml/naiveBayes.R  |  41 +++
 examples/src/main/r/ml/randomForest.R|  63 +++
 examples/src/main/r/ml/survreg.R |  43 
 19 files changed, 810 insertions(+), 178 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9095c152/docs/ml-classification-regression.md
--
diff --git a/docs/ml-classification-regression.md 
b/docs/ml-classification-regression.md
index 557a53c..2ffea64 100644
--- a/docs/ml-classification-regression.md
+++ b/docs/ml-classification-regression.md
@@ -75,6 +75,13 @@ More details on parameters can be found in the [Python API 
documentation](api/py
 {% include_example python/ml/logistic_regression_with_elastic_net.py %}
 
 
+
+
+More details on parameters can be found in the [R API 
documentation](api/R/spark.logit.html).
+
+{% include_example binomial r/ml/logit.R %}
+
+
 
 
 The `spark.ml` implementation of logistic regression also supports
@@ -171,6 +178,13 @@ model with elastic net regularization.
 {% include_example 
python/ml/multiclass_logistic_regression_with_elastic_net.py %}
 
 
+
+
+More details on parameters can be found in the [R API 
documentation](api/R/spark.logit.html).
+
+{% include_example multinomial r/ml/logit.R %}
+
+
 
 
 
@@ -242,6 +256,14 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.classificat
 
 {% include_example python/ml/random_forest_classifier_example.py %}
 
+
+
+
+Refer to the [R API docs](api/R/spark.randomForest.html) for more details.
+
+{% include_example classification r/ml/randomForest.R %}
+
+
 
 
 ## Gradient-boosted tree classifier
@@ -275,6 +297,14 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.classificat
 
 {% include_example python/ml/gradient_boosted_tree_classifier_example.py %}
 
+
+
+
+Refer to the [R API docs](api/R/spark.gbt.html) for more details.
+
+{% include_example classification r/ml/gbt.R %}
+
+
 
 
 ## Multilayer perceptron classifier
@@ -324,6 +354,13 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.classificat
 {% include_example python/ml/multilayer_perceptron_classification.py %}
 
 
+
+
+Refer to the [R API docs](api/R/spark.mlp.html) for more details.
+
+{% include_example r/ml/mlp.R %}
+
+
 
 
 
@@ -400,7 +437,7 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.classificat
 
 Refer to the [

spark git commit: [MINOR][SPARKR] fix kstest example error and add unit test

2016-12-13 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master e104e55c1 -> f2ddabfa0


[MINOR][SPARKR] fix kstest example error and add unit test

## What changes were proposed in this pull request?

While adding vignettes for kstest, I found some errors in the example:
1. There is a typo in the kstest call;
2. print.summary.KSTest doesn't work with the example.

Fix the example errors;
Add a new unit test for print.summary.KSTest;

## How was this patch tested?
Manual test;
Add new unit test;

Author: wm...@hotmail.com 

Closes #16259 from wangmiao1981/ks.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f2ddabfa
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f2ddabfa
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f2ddabfa

Branch: refs/heads/master
Commit: f2ddabfa09fda26ff0391d026dd67545dab33e01
Parents: e104e55
Author: wm...@hotmail.com 
Authored: Tue Dec 13 18:52:05 2016 -0800
Committer: Yanbo Liang 
Committed: Tue Dec 13 18:52:05 2016 -0800

--
 R/pkg/R/mllib.R| 4 ++--
 R/pkg/inst/tests/testthat/test_mllib.R | 6 ++
 2 files changed, 8 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f2ddabfa/R/pkg/R/mllib.R
--
diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index 5df843c..d736bbb 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -1595,14 +1595,14 @@ setMethod("write.ml", signature(object = "ALSModel", 
path = "character"),
 #' \dontrun{
 #' data <- data.frame(test = c(0.1, 0.15, 0.2, 0.3, 0.25))
 #' df <- createDataFrame(data)
-#' test <- spark.ktest(df, "test", "norm", c(0, 1))
+#' test <- spark.kstest(df, "test", "norm", c(0, 1))
 #'
 #' # get a summary of the test result
 #' testSummary <- summary(test)
 #' testSummary
 #'
 #' # print out the summary in an organized way
-#' print.summary.KSTest(test)
+#' print.summary.KSTest(testSummary)
 #' }
 #' @note spark.kstest since 2.1.0
 setMethod("spark.kstest", signature(data = "SparkDataFrame"),

http://git-wip-us.apache.org/repos/asf/spark/blob/f2ddabfa/R/pkg/inst/tests/testthat/test_mllib.R
--
diff --git a/R/pkg/inst/tests/testthat/test_mllib.R 
b/R/pkg/inst/tests/testthat/test_mllib.R
index 986af4a..0f0d831 100644
--- a/R/pkg/inst/tests/testthat/test_mllib.R
+++ b/R/pkg/inst/tests/testthat/test_mllib.R
@@ -986,6 +986,12 @@ test_that("spark.kstest", {
   expect_equal(stats$p.value, rStats$p.value, tolerance = 1e-4)
   expect_equal(stats$statistic, unname(rStats$statistic), tolerance = 1e-4)
   expect_match(capture.output(stats)[1], "Kolmogorov-Smirnov test summary:")
+
+  # Test print.summary.KSTest
+  printStats <- capture.output(print.summary.KSTest(stats))
+  expect_match(printStats[1], "Kolmogorov-Smirnov test summary:")
+  expect_match(printStats[5],
+   "Low presumption against null hypothesis: Sample follows 
theoretical distribution. ")
 })
 
 test_that("spark.randomForest", {





spark git commit: [MINOR][SPARKR] fix kstest example error and add unit test

2016-12-13 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 019d1fa3d -> 8ef005931


[MINOR][SPARKR] fix kstest example error and add unit test

## What changes were proposed in this pull request?

While adding vignettes for kstest, I found some errors in the example:
1. There is a typo in the kstest call;
2. print.summary.KSTest doesn't work with the example.

Fix the example errors;
Add a new unit test for print.summary.KSTest;

## How was this patch tested?
Manual test;
Add new unit test;

Author: wm...@hotmail.com 

Closes #16259 from wangmiao1981/ks.

(cherry picked from commit f2ddabfa09fda26ff0391d026dd67545dab33e01)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8ef00593
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8ef00593
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8ef00593

Branch: refs/heads/branch-2.1
Commit: 8ef005931a242d087f4879805571be0660aefaf9
Parents: 019d1fa
Author: wm...@hotmail.com 
Authored: Tue Dec 13 18:52:05 2016 -0800
Committer: Yanbo Liang 
Committed: Tue Dec 13 18:52:22 2016 -0800

--
 R/pkg/R/mllib.R| 4 ++--
 R/pkg/inst/tests/testthat/test_mllib.R | 6 ++
 2 files changed, 8 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8ef00593/R/pkg/R/mllib.R
--
diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index 5df843c..d736bbb 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -1595,14 +1595,14 @@ setMethod("write.ml", signature(object = "ALSModel", 
path = "character"),
 #' \dontrun{
 #' data <- data.frame(test = c(0.1, 0.15, 0.2, 0.3, 0.25))
 #' df <- createDataFrame(data)
-#' test <- spark.ktest(df, "test", "norm", c(0, 1))
+#' test <- spark.kstest(df, "test", "norm", c(0, 1))
 #'
 #' # get a summary of the test result
 #' testSummary <- summary(test)
 #' testSummary
 #'
 #' # print out the summary in an organized way
-#' print.summary.KSTest(test)
+#' print.summary.KSTest(testSummary)
 #' }
 #' @note spark.kstest since 2.1.0
 setMethod("spark.kstest", signature(data = "SparkDataFrame"),

http://git-wip-us.apache.org/repos/asf/spark/blob/8ef00593/R/pkg/inst/tests/testthat/test_mllib.R
--
diff --git a/R/pkg/inst/tests/testthat/test_mllib.R 
b/R/pkg/inst/tests/testthat/test_mllib.R
index 46dffe3..40c0446 100644
--- a/R/pkg/inst/tests/testthat/test_mllib.R
+++ b/R/pkg/inst/tests/testthat/test_mllib.R
@@ -986,6 +986,12 @@ test_that("spark.kstest", {
   expect_equal(stats$p.value, rStats$p.value, tolerance = 1e-4)
   expect_equal(stats$statistic, unname(rStats$statistic), tolerance = 1e-4)
   expect_match(capture.output(stats)[1], "Kolmogorov-Smirnov test summary:")
+
+  # Test print.summary.KSTest
+  printStats <- capture.output(print.summary.KSTest(stats))
+  expect_match(printStats[1], "Kolmogorov-Smirnov test summary:")
+  expect_match(printStats[5],
+   "Low presumption against null hypothesis: Sample follows 
theoretical distribution. ")
 })
 
 test_that("spark.randomForest", {





spark git commit: [SPARK-17645][MLLIB][ML] add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE)

2016-12-28 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 2af8b5cff -> 79ff85363


[SPARK-17645][MLLIB][ML] add feature selector method based on: False Discovery 
Rate (FDR) and Family wise error rate (FWE)

## What changes were proposed in this pull request?

Univariate feature selection works by selecting the best features based on 
univariate statistical tests.
FDR and FWE are popular univariate statistical tests for feature selection.
In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 
25 most-cited statistical papers. FDR uses the Benjamini-Hochberg procedure 
in this PR. https://en.wikipedia.org/wiki/False_discovery_rate.
In statistics, FWE is the probability of making one or more false discoveries, 
or type I errors, among all the hypotheses when performing multiple hypothesis 
tests.
https://en.wikipedia.org/wiki/Family-wise_error_rate

We add FDR and FWE methods for ChiSqSelector in this PR, as implemented in 
scikit-learn.
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
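
As a rough illustration of the new selector types (a sketch, not taken from this 
patch; ```df``` with the default "features"/"label" columns and the 0.05 
thresholds are assumed):

```
import org.apache.spark.ml.feature.ChiSqSelector

// "fdr" applies the Benjamini-Hochberg procedure to the chi-squared p-values;
// "fwe" bounds the family-wise error rate by thresholding p-values
// (Bonferroni-style, as in scikit-learn's SelectFwe).
val selector = new ChiSqSelector()
  .setSelectorType("fdr")            // or "fwe"
  .setFdr(0.05)                      // use setFwe(0.05) for the "fwe" method
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")
// val selected = selector.fit(df).transform(df)
```
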
## How was this patch tested?

Unit tests will be added soon.


Author: Peng 
Author: Peng, Meng 

Closes #15212 from mpjlu/fdr_fwe.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/79ff8536
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/79ff8536
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/79ff8536

Branch: refs/heads/master
Commit: 79ff8536315aef97ee940c52d71cd8de777c7ce6
Parents: 2af8b5c
Author: Peng 
Authored: Wed Dec 28 00:49:36 2016 -0800
Committer: Yanbo Liang 
Committed: Wed Dec 28 00:49:36 2016 -0800

--
 docs/ml-features.md |   6 +-
 docs/mllib-feature-extraction.md|   4 +-
 .../apache/spark/ml/feature/ChiSqSelector.scala |  48 +-
 .../spark/mllib/api/python/PythonMLLibAPI.scala |   4 +
 .../spark/mllib/feature/ChiSqSelector.scala |  62 ++--
 .../spark/ml/feature/ChiSqSelectorSuite.scala   |   6 +
 .../mllib/feature/ChiSqSelectorSuite.scala  | 147 +++
 python/pyspark/ml/feature.py|  74 +-
 python/pyspark/mllib/feature.py |  50 ++-
 9 files changed, 337 insertions(+), 64 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/79ff8536/docs/ml-features.md
--
diff --git a/docs/ml-features.md b/docs/ml-features.md
index ca1ccc4..1d34497 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1423,12 +1423,12 @@ for more details on the API.
 `ChiSqSelector` stands for Chi-Squared feature selection. It operates on 
labeled data with
 categorical features. ChiSqSelector uses the
 [Chi-Squared test of 
independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
-features to choose. It supports three selection methods: `numTopFeatures`, 
`percentile`, `fpr`:
-
+features to choose. It supports five selection methods: `numTopFeatures`, 
`percentile`, `fpr`, `fdr`, `fwe`:
 * `numTopFeatures` chooses a fixed number of top features according to a 
chi-squared test. This is akin to yielding the features with the most 
predictive power.
 * `percentile` is similar to `numTopFeatures` but chooses a fraction of all 
features instead of a fixed number.
 * `fpr` chooses all features whose p-value is below a threshold, thus 
controlling the false positive rate of selection.
-
+* `fdr` uses the [Benjamini-Hochberg 
procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
 to choose all features whose false discovery rate is below a threshold.
+* `fwe` chooses all features whose p-values are below a threshold, thus 
controlling the family-wise error rate of selection.
 By default, the selection method is `numTopFeatures`, with the default number 
of top features set to 50.
 The user can choose a selection method using `setSelectorType`.
 

http://git-wip-us.apache.org/repos/asf/spark/blob/79ff8536/docs/mllib-feature-extraction.md
--
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
index 42568c3..acd2894 100644
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -227,11 +227,13 @@ both speed and statistical learning behavior.
 
[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector)
 implements
 Chi-Squared feature selection. It operates on labeled data with categorical 
features. ChiSqSelector uses the
 [Chi-Squared test of 
independence](https://e

spark git commit: [MINOR][ML] Correct test cases of LoR raw2prediction & probability2prediction.

2016-12-28 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 79ff85363 -> 9cff67f34


[MINOR][ML] Correct test cases of LoR raw2prediction & probability2prediction.

## What changes were proposed in this pull request?
Correct test cases of ```LogisticRegression``` raw2prediction & 
probability2prediction.

## How was this patch tested?
Changed unit tests.

Author: Yanbo Liang 

Closes #16407 from yanboliang/raw-probability.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9cff67f3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9cff67f3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9cff67f3

Branch: refs/heads/master
Commit: 9cff67f3465bc6ffe1b5abee9501e3c17f8fd194
Parents: 79ff853
Author: Yanbo Liang 
Authored: Wed Dec 28 01:24:18 2016 -0800
Committer: Yanbo Liang 
Committed: Wed Dec 28 01:24:18 2016 -0800

--
 .../LogisticRegressionSuite.scala   | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9cff67f3/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
--
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 
b/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
index 9c4c59a..f8bcbee 100644
--- 
a/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
+++ 
b/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
@@ -359,8 +359,16 @@ class LogisticRegressionSuite
 assert(pred == predFromProb)
 }
 
-// force it to use probability2prediction
+// force it to use raw2prediction
 model.setProbabilityCol("")
+val resultsUsingRaw2Predict =
+  
model.transform(smallMultinomialDataset).select("prediction").as[Double].collect()
+
resultsUsingRaw2Predict.zip(results.select("prediction").as[Double].collect()).foreach
 {
+  case (pred1, pred2) => assert(pred1 === pred2)
+}
+
+// force it to use probability2prediction
+model.setRawPredictionCol("")
 val resultsUsingProb2Predict =
   
model.transform(smallMultinomialDataset).select("prediction").as[Double].collect()
 
resultsUsingProb2Predict.zip(results.select("prediction").as[Double].collect()).foreach
 {
@@ -405,8 +413,16 @@ class LogisticRegressionSuite
 assert(pred == predFromProb)
 }
 
-// force it to use probability2prediction
+// force it to use raw2prediction
 model.setProbabilityCol("")
+val resultsUsingRaw2Predict =
+  
model.transform(smallBinaryDataset).select("prediction").as[Double].collect()
+
resultsUsingRaw2Predict.zip(results.select("prediction").as[Double].collect()).foreach
 {
+  case (pred1, pred2) => assert(pred1 === pred2)
+}
+
+// force it to use probability2prediction
+model.setRawPredictionCol("")
 val resultsUsingProb2Predict =
   
model.transform(smallBinaryDataset).select("prediction").as[Double].collect()
 
resultsUsingProb2Predict.zip(results.select("prediction").as[Double].collect()).foreach
 {





spark git commit: [SPARK-17772][ML][TEST] Add test functions for ML sample weights

2016-12-28 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master d7bce3bd3 -> 6a475ae46


[SPARK-17772][ML][TEST] Add test functions for ML sample weights

## What changes were proposed in this pull request?

More and more ML algos are accepting sample weights, and they have been tested 
rather heterogeneously and with code duplication. This patch adds extensible 
helper methods to `MLTestingUtils` that can be reused by various algorithms 
accepting sample weights. Up to now, there seem to be a few tests that have 
been commonly implemented:

* Check that oversampling is the same as giving the instances sample weights 
proportional to the number of samples
* Check that outliers with tiny sample weights do not affect the algorithm's 
performance

This patch adds an additional test:

* Check that algorithms are invariant to constant scaling of the sample 
weights, i.e. uniform sample weights with `w_i = 1.0` are effectively the same 
as uniform sample weights with `w_i = 1` or `w_i = 0.0001` (a sketch of this check follows below)
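
A rough sketch of what this scaling-invariance check looks like (illustrative only, not the helper added by this patch; `df` is an assumed DataFrame with "label", "features" and "weight" columns):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.functions.col

// Fit once with the original weights and once with all weights scaled by a
// constant factor; the resulting models should be (nearly) identical.
val lr = new LogisticRegression().setWeightCol("weight")
val m1 = lr.fit(df)
val m2 = lr.fit(df.withColumn("weight", col("weight") * 1000.0))

// Compare coefficients and intercepts element-wise within a small tolerance.
m1.coefficientMatrix.toArray.zip(m2.coefficientMatrix.toArray).foreach {
  case (a, b) => assert(math.abs(a - b) < 0.05)
}
m1.interceptVector.toArray.zip(m2.interceptVector.toArray).foreach {
  case (a, b) => assert(math.abs(a - b) < 0.05)
}
```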

The instances of these tests occurred in LinearRegression, NaiveBayes, and 
LogisticRegression. Those tests have been removed/modified to use the new 
helper methods. These helper functions will be of use when 
[SPARK-9478](https://issues.apache.org/jira/browse/SPARK-9478) is implemented.

## How was this patch tested?

This patch only involves modifying test suites.

## Other notes

Both IsotonicRegression and GeneralizedLinearRegression also extend 
`HasWeightCol`. I did not modify those test suites because leaving them unchanged 
keeps this patch easier to review, and because they did not duplicate the same tests as 
the three suites that were modified. If we want to change them later, we can 
create a JIRA for it now, but it's open for debate.

Author: sethah 

Closes #15721 from sethah/SPARK-17772.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6a475ae4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6a475ae4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6a475ae4

Branch: refs/heads/master
Commit: 6a475ae466a7ce28d507244bf6db91be06ed81ef
Parents: d7bce3b
Author: sethah 
Authored: Wed Dec 28 07:01:14 2016 -0800
Committer: Yanbo Liang 
Committed: Wed Dec 28 07:01:14 2016 -0800

--
 .../LogisticRegressionSuite.scala   |  60 +++---
 .../ml/classification/NaiveBayesSuite.scala |  81 +
 .../ml/regression/LinearRegressionSuite.scala   | 120 ++-
 .../apache/spark/ml/util/MLTestingUtils.scala   | 111 +++--
 4 files changed, 154 insertions(+), 218 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6a475ae4/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
--
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 
b/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
index f8bcbee..1308210 100644
--- 
a/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
+++ 
b/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
@@ -1836,52 +1836,24 @@ class LogisticRegressionSuite
 .forall(x => x(0) >= x(1)))
   }
 
-  test("binary logistic regression with weighted data") {
-val numClasses = 2
-val numPoints = 40
-val outlierData = 
MLTestingUtils.genClassificationInstancesWithWeightedOutliers(spark,
-  numClasses, numPoints)
-val testData = Array.tabulate[LabeledPoint](numClasses) { i =>
-  LabeledPoint(i.toDouble, Vectors.dense(i.toDouble))
-}.toSeq.toDF()
-val lr = new 
LogisticRegression().setFamily("binomial").setWeightCol("weight")
-val model = lr.fit(outlierData)
-val results = model.transform(testData).select("label", 
"prediction").collect()
-
-// check that the predictions are the one to one mapping
-results.foreach { case Row(label: Double, pred: Double) =>
-  assert(label === pred)
+  test("logistic regression with sample weights") {
+def modelEquals(m1: LogisticRegressionModel, m2: LogisticRegressionModel): 
Unit = {
+  assert(m1.coefficientMatrix ~== m2.coefficientMatrix absTol 0.05)
+  assert(m1.interceptVector ~== m2.interceptVector absTol 0.05)
 }
-val (overSampledData, weightedData) =
-  MLTestingUtils.genEquivalentOversampledAndWeightedInstances(outlierData, 
"label", "features",
-42L)
-val weightedModel = lr.fit(weightedData)
-val overSampledModel = lr.setWeightCol("").fit(overSampledData)
-assert(weightedModel.coefficientMatrix ~== 
overSampledModel.coefficientMatrix relTol 0.01)
-  }
-
-  test("multinomial logistic regression with weighted data") {
-val numClasses = 5
-val numPoints =

spark git commit: [MINOR][ML][MLLIB] Remove work around for breeze sparse matrix.

2016-09-04 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master cdeb97a8c -> 1b001b520


[MINOR][ML][MLLIB] Remove work around for breeze sparse matrix.

## What changes were proposed in this pull request?
Since we have updated the breeze version to 0.12, we should remove the workaround for 
the breeze sparse matrix bug in v0.11.
I checked all MLlib code and found this is the only workaround for breeze 0.11.
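
As a rough illustration of the simplification (a sketch, not code from the patch), with breeze 0.12 the arrays backing a `CSCMatrix` can be wrapped directly:

```scala
import breeze.linalg.{CSCMatrix => BSM}
import org.apache.spark.ml.linalg.SparseMatrix

// Build a small breeze sparse matrix.
val builder = new BSM.Builder[Double](rows = 3, cols = 3)
builder.add(0, 0, 1.0)
builder.add(2, 1, 2.0)
val bsm: BSM[Double] = builder.result()

// With breeze 0.12 the backing arrays are already consistent, so no
// compact() workaround is needed before wrapping them.
val converted = new SparseMatrix(bsm.rows, bsm.cols, bsm.colPtrs, bsm.rowIndices, bsm.data)
```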

## How was this patch tested?
Existing tests.

Author: Yanbo Liang 

Closes #14953 from yanboliang/matrices.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1b001b52
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1b001b52
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1b001b52

Branch: refs/heads/master
Commit: 1b001b5203444cc8d5c4887a30e03e8fb298d17d
Parents: cdeb97a
Author: Yanbo Liang 
Authored: Sun Sep 4 05:38:47 2016 -0700
Committer: Yanbo Liang 
Committed: Sun Sep 4 05:38:47 2016 -0700

--
 .../main/scala/org/apache/spark/ml/linalg/Matrices.scala  | 10 +-
 .../scala/org/apache/spark/mllib/linalg/Matrices.scala| 10 +-
 2 files changed, 2 insertions(+), 18 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1b001b52/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala
--
diff --git 
a/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala 
b/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala
index 98080bb..207f662 100644
--- a/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala
+++ b/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala
@@ -846,16 +846,8 @@ object Matrices {
   case dm: BDM[Double] =>
 new DenseMatrix(dm.rows, dm.cols, dm.data, dm.isTranspose)
   case sm: BSM[Double] =>
-// Spark-11507. work around breeze issue 479.
-val mat = if (sm.colPtrs.last != sm.data.length) {
-  val matCopy = sm.copy
-  matCopy.compact()
-  matCopy
-} else {
-  sm
-}
 // There is no isTranspose flag for sparse matrices in Breeze
-new SparseMatrix(mat.rows, mat.cols, mat.colPtrs, mat.rowIndices, 
mat.data)
+new SparseMatrix(sm.rows, sm.cols, sm.colPtrs, sm.rowIndices, sm.data)
   case _ =>
 throw new UnsupportedOperationException(
   s"Do not support conversion from type ${breeze.getClass.getName}.")

http://git-wip-us.apache.org/repos/asf/spark/blob/1b001b52/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala
index ad882c9..8659cea 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala
@@ -983,16 +983,8 @@ object Matrices {
   case dm: BDM[Double] =>
 new DenseMatrix(dm.rows, dm.cols, dm.data, dm.isTranspose)
   case sm: BSM[Double] =>
-// Spark-11507. work around breeze issue 479.
-val mat = if (sm.colPtrs.last != sm.data.length) {
-  val matCopy = sm.copy
-  matCopy.compact()
-  matCopy
-} else {
-  sm
-}
 // There is no isTranspose flag for sparse matrices in Breeze
-new SparseMatrix(mat.rows, mat.cols, mat.colPtrs, mat.rowIndices, 
mat.data)
+new SparseMatrix(sm.rows, sm.cols, sm.colPtrs, sm.rowIndices, sm.data)
   case _ =>
 throw new UnsupportedOperationException(
   s"Do not support conversion from type ${breeze.getClass.getName}.")





spark git commit: [MINOR][ML] Correct weights doc of MultilayerPerceptronClassificationModel.

2016-09-06 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 6f13aa7df -> 39d538ddd


[MINOR][ML] Correct weights doc of MultilayerPerceptronClassificationModel.

## What changes were proposed in this pull request?
```weights``` of ```MultilayerPerceptronClassificationModel``` should be the 
output weights of the layers rather than the initial weights; this PR corrects it.

## How was this patch tested?
Doc change.

Author: Yanbo Liang 

Closes #14967 from yanboliang/mlp-weights.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/39d538dd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/39d538dd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/39d538dd

Branch: refs/heads/master
Commit: 39d538dddf7d44bf4603c966d0f7b2c92f1e951a
Parents: 6f13aa7
Author: Yanbo Liang 
Authored: Tue Sep 6 03:30:37 2016 -0700
Committer: Yanbo Liang 
Committed: Tue Sep 6 03:30:37 2016 -0700

--
 .../spark/ml/classification/MultilayerPerceptronClassifier.scala   | 2 +-
 python/pyspark/ml/classification.py| 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/39d538dd/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
index 7264a99..88fe7cb 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
@@ -288,7 +288,7 @@ object MultilayerPerceptronClassifier
  *
  * @param uid uid
  * @param layers array of layer sizes including input and output layers
- * @param weights vector of initial weights for the model that consists of the 
weights of layers
+ * @param weights the weights of layers
  * @return prediction model
  */
 @Since("1.5.0")

http://git-wip-us.apache.org/repos/asf/spark/blob/39d538dd/python/pyspark/ml/classification.py
--
diff --git a/python/pyspark/ml/classification.py 
b/python/pyspark/ml/classification.py
index d1522d7..b4c01fd 100644
--- a/python/pyspark/ml/classification.py
+++ b/python/pyspark/ml/classification.py
@@ -1276,7 +1276,7 @@ class MultilayerPerceptronClassificationModel(JavaModel, 
JavaPredictionModel, Ja
 @since("2.0.0")
 def weights(self):
 """
-vector of initial weights for the model that consists of the weights 
of layers.
+the weights of layers.
 """
 return self._call_java("weights")
 





spark git commit: [SPARK-17456][CORE] Utility for parsing Spark versions

2016-09-09 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 92ce8d484 -> 65b814bf5


[SPARK-17456][CORE] Utility for parsing Spark versions

## What changes were proposed in this pull request?

This patch adds methods for extracting major and minor versions as Int types in 
Scala from a Spark version string.

Motivation: There are many hacks within Spark's codebase to identify and 
compare Spark versions. We should add a simple utility to standardize these 
code paths, especially since there have been mistakes made in the past. This 
will let us add unit tests as well.  Currently, I want this functionality to 
check Spark versions to provide backwards compatibility for ML model 
persistence.
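
A quick sketch of the intended usage (mirroring the Scaladoc examples in the diff below; note the utility is `private[spark]`, so it is only callable from Spark's own code and tests):

```scala
import org.apache.spark.util.VersionUtils

// Extract the major/minor components from a Spark version string.
val major = VersionUtils.majorVersion("2.0.1-SNAPSHOT")    // 2
val minor = VersionUtils.minorVersion("2.0.1-SNAPSHOT")    // 0
val (maj, mnr) = VersionUtils.majorMinorVersion("2.1.0")   // (2, 1)

// Malformed version strings throw an IllegalArgumentException:
// VersionUtils.majorMinorVersion("not-a-version")
```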

## How was this patch tested?

Unit tests

Author: Joseph K. Bradley 

Closes #15017 from jkbradley/version-parsing.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/65b814bf
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/65b814bf
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/65b814bf

Branch: refs/heads/master
Commit: 65b814bf50e92e2e9b622d1602f18bacd217181c
Parents: 92ce8d4
Author: Joseph K. Bradley 
Authored: Fri Sep 9 05:35:10 2016 -0700
Committer: Yanbo Liang 
Committed: Fri Sep 9 05:35:10 2016 -0700

--
 .../org/apache/spark/util/VersionUtils.scala| 52 ++
 .../apache/spark/util/VersionUtilsSuite.scala   | 76 
 2 files changed, 128 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/65b814bf/core/src/main/scala/org/apache/spark/util/VersionUtils.scala
--
diff --git a/core/src/main/scala/org/apache/spark/util/VersionUtils.scala 
b/core/src/main/scala/org/apache/spark/util/VersionUtils.scala
new file mode 100644
index 000..828153b8
--- /dev/null
+++ b/core/src/main/scala/org/apache/spark/util/VersionUtils.scala
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+/**
+ * Utilities for working with Spark version strings
+ */
+private[spark] object VersionUtils {
+
+  private val majorMinorRegex = """^(\d+)\.(\d+)(\..*)?$""".r
+
+  /**
+   * Given a Spark version string, return the major version number.
+   * E.g., for 2.0.1-SNAPSHOT, return 2.
+   */
+  def majorVersion(sparkVersion: String): Int = 
majorMinorVersion(sparkVersion)._1
+
+  /**
+   * Given a Spark version string, return the minor version number.
+   * E.g., for 2.0.1-SNAPSHOT, return 0.
+   */
+  def minorVersion(sparkVersion: String): Int = 
majorMinorVersion(sparkVersion)._2
+
+  /**
+   * Given a Spark version string, return the (major version number, minor 
version number).
+   * E.g., for 2.0.1-SNAPSHOT, return (2, 0).
+   */
+  def majorMinorVersion(sparkVersion: String): (Int, Int) = {
+majorMinorRegex.findFirstMatchIn(sparkVersion) match {
+  case Some(m) =>
+(m.group(1).toInt, m.group(2).toInt)
+  case None =>
+throw new IllegalArgumentException(s"Spark tried to parse 
'$sparkVersion' as a Spark" +
+  s" version string, but it could not find the major and minor version 
numbers.")
+}
+  }
+}

http://git-wip-us.apache.org/repos/asf/spark/blob/65b814bf/core/src/test/scala/org/apache/spark/util/VersionUtilsSuite.scala
--
diff --git a/core/src/test/scala/org/apache/spark/util/VersionUtilsSuite.scala 
b/core/src/test/scala/org/apache/spark/util/VersionUtilsSuite.scala
new file mode 100644
index 000..aaf79eb
--- /dev/null
+++ b/core/src/test/scala/org/apache/spark/util/VersionUtilsSuite.scala
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You ma

spark git commit: [SPARK-17464][SPARKR][ML] SparkR spark.als argument reg should be 0.1 by default.

2016-09-09 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 65b814bf5 -> 2ed601217


[SPARK-17464][SPARKR][ML] SparkR spark.als argument reg should be 0.1 by 
default.

## What changes were proposed in this pull request?
The SparkR ```spark.als``` argument ```reg``` should be 0.1 by default, 
to be consistent with ML.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang 

Closes #15021 from yanboliang/spark-17464.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2ed60121
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2ed60121
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2ed60121

Branch: refs/heads/master
Commit: 2ed601217ffd8945829ac762fae35202f3e55686
Parents: 65b814b
Author: Yanbo Liang 
Authored: Fri Sep 9 05:43:34 2016 -0700
Committer: Yanbo Liang 
Committed: Fri Sep 9 05:43:34 2016 -0700

--
 R/pkg/R/mllib.R | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/2ed60121/R/pkg/R/mllib.R
--
diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index f321fd1..f8d1095 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -1241,7 +1241,7 @@ setMethod("predict", signature(object = 
"GaussianMixtureModel"),
 #' @note spark.als since 2.1.0
 setMethod("spark.als", signature(data = "SparkDataFrame"),
   function(data, ratingCol = "rating", userCol = "user", itemCol = 
"item",
-   rank = 10, reg = 1.0, maxIter = 10, nonnegative = FALSE,
+   rank = 10, reg = 0.1, maxIter = 10, nonnegative = FALSE,
implicitPrefs = FALSE, alpha = 1.0, numUserBlocks = 10, 
numItemBlocks = 10,
checkpointInterval = 10, seed = 0) {
 





spark git commit: [SPARK-15509][FOLLOW-UP][ML][SPARKR] R MLlib algorithms should support input columns "features" and "label"

2016-09-10 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 1fec3ce4e -> bcdd259c3


[SPARK-15509][FOLLOW-UP][ML][SPARKR] R MLlib algorithms should support input 
columns "features" and "label"

## What changes were proposed in this pull request?
#13584 resolved the conflict between the features/label columns and the 
```RFormula``` default ones when loading libsvm data, but it still left some 
issues that should be resolved:
1. It's not necessary to check and rename the label column.
```RFormula``` was designed to handle the case where a label column already 
exists (with the restriction that the existing label column must be of 
numeric/boolean type), so there is no need to change the column name to avoid a 
conflict. If the label column is not of numeric/boolean type, 
```RFormula``` will throw an exception.

2. We should rename the features column to a new name if there is a conflict, and 
appending a random value is enough since the column is only used internally. We did 
similar work when implementing ```SQLTransformer```.

3. We should set the correct new features column on the estimators. Take ```GLM``` 
as an example:
the ```GLM``` estimator should set its features column to the changed 
one (rFormula.getFeaturesCol) rather than the default "features". Although 
this makes no difference when training the model, it causes problems when predicting. The 
following is the prediction result of GLM before this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/18308227/84c3c452-74a8-11e6-9caa-9d6d846cc957.png)
We should drop the internally used features column name, otherwise it will appear 
in the prediction DataFrame and confuse users. This also matches the behavior of 
scenarios where there is no column name conflict.
After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/18308240/92082a04-74a8-11e6-9226-801f52b856d9.png)
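
A minimal sketch of the wiring described in point 3 (illustrative only; `data` is an assumed input DataFrame that already contains a "features" column, and the temporary column name is hypothetical):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.regression.GeneralizedLinearRegression

// The R wrapper picks a temporary, internal-only features column name when the
// input data already has a "features" column.
val tmpFeaturesCol = "features_" + math.abs(scala.util.Random.nextLong())
val rFormula = new RFormula()
  .setFormula("label ~ .")
  .setFeaturesCol(tmpFeaturesCol)

// The estimator must read the column RFormula actually produces, not the
// hard-coded default "features".
val glm = new GeneralizedLinearRegression()
  .setFeaturesCol(rFormula.getFeaturesCol)

val model = new Pipeline().setStages(Array(rFormula, glm)).fit(data)
```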

## How was this patch tested?
Existing unit tests.

Author: Yanbo Liang 

Closes #14993 from yanboliang/spark-15509.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bcdd259c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bcdd259c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bcdd259c

Branch: refs/heads/master
Commit: bcdd259c371b1dcdb41baf227867d7e2ecb923c6
Parents: 1fec3ce
Author: Yanbo Liang 
Authored: Sat Sep 10 00:27:10 2016 -0700
Committer: Yanbo Liang 
Committed: Sat Sep 10 00:27:10 2016 -0700

--
 .../ml/r/AFTSurvivalRegressionWrapper.scala |  1 +
 .../spark/ml/r/GaussianMixtureWrapper.scala |  1 +
 .../r/GeneralizedLinearRegressionWrapper.scala  |  1 +
 .../spark/ml/r/IsotonicRegressionWrapper.scala  |  1 +
 .../org/apache/spark/ml/r/KMeansWrapper.scala   |  1 +
 .../apache/spark/ml/r/NaiveBayesWrapper.scala   |  1 +
 .../org/apache/spark/ml/r/RWrapperUtils.scala   | 34 +++-
 .../apache/spark/ml/r/RWrapperUtilsSuite.scala  | 16 +++--
 8 files changed, 14 insertions(+), 42 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/bcdd259c/mllib/src/main/scala/org/apache/spark/ml/r/AFTSurvivalRegressionWrapper.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/r/AFTSurvivalRegressionWrapper.scala 
b/mllib/src/main/scala/org/apache/spark/ml/r/AFTSurvivalRegressionWrapper.scala
index 67d037e..bd965ac 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/r/AFTSurvivalRegressionWrapper.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/r/AFTSurvivalRegressionWrapper.scala
@@ -99,6 +99,7 @@ private[r] object AFTSurvivalRegressionWrapper extends 
MLReadable[AFTSurvivalReg
 val aft = new AFTSurvivalRegression()
   .setCensorCol(censorCol)
   .setFitIntercept(rFormula.hasIntercept)
+  .setFeaturesCol(rFormula.getFeaturesCol)
 
 val pipeline = new Pipeline()
   .setStages(Array(rFormulaModel, aft))

http://git-wip-us.apache.org/repos/asf/spark/blob/bcdd259c/mllib/src/main/scala/org/apache/spark/ml/r/GaussianMixtureWrapper.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/r/GaussianMixtureWrapper.scala 
b/mllib/src/main/scala/org/apache/spark/ml/r/GaussianMixtureWrapper.scala
index b654233..b708702 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/r/GaussianMixtureWrapper.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/r/GaussianMixtureWrapper.scala
@@ -85,6 +85,7 @@ private[r] object GaussianMixtureWrapper extends 
MLReadable[GaussianMixtureWrapp
   .setK(k)
   .setMaxIter(maxIter)
   .setTol(tol)
+  .setFeaturesCol(rFormula.getFeaturesCol)
 
 val pipeline = new Pipeline()
   .setStages(Array(rFormulaModel, gm))

http://git-wip-us.apache.org/repos/asf/spark/blob/bcdd259c/mllib/src/m

spark git commit: [SPARK-17585][PYSPARK][CORE] PySpark SparkContext.addFile supports adding files recursively

2016-09-21 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 61876a427 -> d3b886976


[SPARK-17585][PYSPARK][CORE] PySpark SparkContext.addFile supports adding files 
recursively

## What changes were proposed in this pull request?
In some cases users would like to add a directory as a dependency. In Scala they can use 
```SparkContext.addFile``` with the argument ```recursive=true``` to recursively 
add all files under the directory, but Python users can only add a single file, not a 
directory; we should support this in Python as well.
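
For reference, a small sketch of the existing Scala behavior that the new Python option mirrors (the paths are illustrative assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val sc = new SparkContext(new SparkConf().setAppName("addFileDemo").setMaster("local[*]"))

// Add a whole directory; recursive directories are currently supported only
// for Hadoop-supported filesystems.
sc.addFile("/tmp/my_deps", recursive = true)

// Resolve a file inside the added directory on the driver or an executor.
val localPath = SparkFiles.get("my_deps/config.txt")
```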

## How was this patch tested?
Unit test.

Author: Yanbo Liang 

Closes #15140 from yanboliang/spark-17585.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d3b88697
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d3b88697
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d3b88697

Branch: refs/heads/master
Commit: d3b88697638dcf32854fe21a6c53dfb3782773b9
Parents: 61876a4
Author: Yanbo Liang 
Authored: Wed Sep 21 01:37:03 2016 -0700
Committer: Yanbo Liang 
Committed: Wed Sep 21 01:37:03 2016 -0700

--
 .../spark/api/java/JavaSparkContext.scala   | 13 +
 python/pyspark/context.py   |  7 +--
 python/pyspark/tests.py | 20 +++-
 python/test_support/hello.txt   |  1 -
 python/test_support/hello/hello.txt |  1 +
 .../test_support/hello/sub_hello/sub_hello.txt  |  1 +
 6 files changed, 35 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d3b88697/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala 
b/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala
index 131f36f..4e50c26 100644
--- a/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala
+++ b/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala
@@ -670,6 +670,19 @@ class JavaSparkContext(val sc: SparkContext)
   }
 
   /**
+   * Add a file to be downloaded with this Spark job on every node.
+   * The `path` passed can be either a local file, a file in HDFS (or other 
Hadoop-supported
+   * filesystems), or an HTTP, HTTPS or FTP URI.  To access the file in Spark 
jobs,
+   * use `SparkFiles.get(fileName)` to find its download location.
+   *
+   * A directory can be given if the recursive option is set to true. 
Currently directories are only
+   * supported for Hadoop-supported filesystems.
+   */
+  def addFile(path: String, recursive: Boolean): Unit = {
+sc.addFile(path, recursive)
+  }
+
+  /**
* Adds a JAR dependency for all tasks to be executed on this SparkContext 
in the future.
* The `path` passed can be either a local file, a file in HDFS (or other 
Hadoop-supported
* filesystems), or an HTTP, HTTPS or FTP URI.

http://git-wip-us.apache.org/repos/asf/spark/blob/d3b88697/python/pyspark/context.py
--
diff --git a/python/pyspark/context.py b/python/pyspark/context.py
index 5c32f8e..7a7f59c 100644
--- a/python/pyspark/context.py
+++ b/python/pyspark/context.py
@@ -767,7 +767,7 @@ class SparkContext(object):
 SparkContext._next_accum_id += 1
 return Accumulator(SparkContext._next_accum_id - 1, value, accum_param)
 
-def addFile(self, path):
+def addFile(self, path, recursive=False):
 """
 Add a file to be downloaded with this Spark job on every node.
 The C{path} passed can be either a local file, a file in HDFS
@@ -778,6 +778,9 @@ class SparkContext(object):
 L{SparkFiles.get(fileName)} with the
 filename to find its download location.
 
+A directory can be given if the recursive option is set to True.
+Currently directories are only supported for Hadoop-supported 
filesystems.
+
 >>> from pyspark import SparkFiles
 >>> path = os.path.join(tempdir, "test.txt")
 >>> with open(path, "w") as testFile:
@@ -790,7 +793,7 @@ class SparkContext(object):
 >>> sc.parallelize([1, 2, 3, 4]).mapPartitions(func).collect()
 [100, 200, 300, 400]
 """
-self._jsc.sc().addFile(path)
+self._jsc.sc().addFile(path, recursive)
 
 def addPyFile(self, path):
 """

http://git-wip-us.apache.org/repos/asf/spark/blob/d3b88697/python/pyspark/tests.py
--
diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py
index 0a029b6..b075691 100644
--- a/python/pyspark/tests.py
+++ b/python/pyspark/tests.py
@@ -409,13 +409,23 @@ class AddFileTests(PySparkTestCase):
 self.assertEqual("Hello World!"

spark git commit: [SPARK-17577][SPARKR][CORE] SparkR support add files to Spark job and get by executors

2016-09-21 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 7cbe21644 -> c133907c5


[SPARK-17577][SPARKR][CORE] SparkR support add files to Spark job and get by 
executors

## What changes were proposed in this pull request?
Scala/Python users can add files to a Spark job via the submit option ```--files``` 
or ```SparkContext.addFile()```, and can then retrieve an added file with 
```SparkFiles.get(filename)```.
We should also support this functionality for SparkR users, since they have the same 
requirement for shared dependency files. For example, SparkR users can first 
download third-party R packages to the driver, add these files to the Spark 
job as dependencies through this API, and then each executor can install the packages 
with ```install.packages```.

## How was this patch tested?
Add unit test.

Author: Yanbo Liang 

Closes #15131 from yanboliang/spark-17577.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c133907c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c133907c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c133907c

Branch: refs/heads/master
Commit: c133907c5d9a6e6411b896b5e0cff48b2beff09f
Parents: 7cbe216
Author: Yanbo Liang 
Authored: Wed Sep 21 20:08:28 2016 -0700
Committer: Yanbo Liang 
Committed: Wed Sep 21 20:08:28 2016 -0700

--
 R/pkg/NAMESPACE |  3 ++
 R/pkg/R/context.R   | 48 
 R/pkg/inst/tests/testthat/test_context.R| 13 ++
 .../scala/org/apache/spark/SparkContext.scala   |  6 +--
 4 files changed, 67 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c133907c/R/pkg/NAMESPACE
--
diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE
index a5e9cbd..267a38c 100644
--- a/R/pkg/NAMESPACE
+++ b/R/pkg/NAMESPACE
@@ -336,6 +336,9 @@ export("as.DataFrame",
"read.parquet",
"read.text",
"spark.lapply",
+   "spark.addFile",
+   "spark.getSparkFilesRootDirectory",
+   "spark.getSparkFiles",
"sql",
"str",
"tableToDF",

http://git-wip-us.apache.org/repos/asf/spark/blob/c133907c/R/pkg/R/context.R
--
diff --git a/R/pkg/R/context.R b/R/pkg/R/context.R
index 13ade49..4793578 100644
--- a/R/pkg/R/context.R
+++ b/R/pkg/R/context.R
@@ -225,6 +225,54 @@ setCheckpointDir <- function(sc, dirName) {
   invisible(callJMethod(sc, "setCheckpointDir", 
suppressWarnings(normalizePath(dirName
 }
 
+#' Add a file or directory to be downloaded with this Spark job on every node.
+#'
+#' The path passed can be either a local file, a file in HDFS (or other 
Hadoop-supported
+#' filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark 
jobs,
+#' use spark.getSparkFiles(fileName) to find its download location.
+#'
+#' @rdname spark.addFile
+#' @param path The path of the file to be added
+#' @export
+#' @examples
+#'\dontrun{
+#' spark.addFile("~/myfile")
+#'}
+#' @note spark.addFile since 2.1.0
+spark.addFile <- function(path) {
+  sc <- getSparkContext()
+  invisible(callJMethod(sc, "addFile", suppressWarnings(normalizePath(path
+}
+
+#' Get the root directory that contains files added through spark.addFile.
+#'
+#' @rdname spark.getSparkFilesRootDirectory
+#' @return the root directory that contains files added through spark.addFile
+#' @export
+#' @examples
+#'\dontrun{
+#' spark.getSparkFilesRootDirectory()
+#'}
+#' @note spark.getSparkFilesRootDirectory since 2.1.0
+spark.getSparkFilesRootDirectory <- function() {
+  callJStatic("org.apache.spark.SparkFiles", "getRootDirectory")
+}
+
+#' Get the absolute path of a file added through spark.addFile.
+#'
+#' @rdname spark.getSparkFiles
+#' @param fileName The name of the file added through spark.addFile
+#' @return the absolute path of a file added through spark.addFile.
+#' @export
+#' @examples
+#'\dontrun{
+#' spark.getSparkFiles("myfile")
+#'}
+#' @note spark.getSparkFiles since 2.1.0
+spark.getSparkFiles <- function(fileName) {
+  callJStatic("org.apache.spark.SparkFiles", "get", as.character(fileName))
+}
+
 #' Run a function over a list of elements, distributing the computations with 
Spark
 #'
 #' Run a function over a list of elements, distributing the computations with 
Spark. Applies a

http://git-wip-us.apache.org/repos/asf/spark/blob/c133907c/R/pkg/inst/tests/testthat/test_context.R
--
diff --git a/R/pkg/inst/tests/testthat/test_context.R 
b/R/pkg/inst/tests/testthat/test_context.R
index 1ab7f31..0495418 100644
--- a/R/pkg/inst/tests/testthat/test_context.R
+++ b/R/pkg/inst/tests/testthat/test_context.R
@@ -166,3 +166,16 @@ test_that("spark

spark git commit: [SPARK-17315][FOLLOW-UP][SPARKR][ML] Fix print of Kolmogorov-Smirnov test summary

2016-09-21 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master c133907c5 -> 6902edab7


[SPARK-17315][FOLLOW-UP][SPARKR][ML] Fix print of Kolmogorov-Smirnov test 
summary

## What changes were proposed in this pull request?
#14881 added a Kolmogorov-Smirnov test wrapper to SparkR. I found that 
```print.summary.KSTest``` was implemented inappropriately and had no 
effect.
Running the following code for KSTest:
```r
data <- data.frame(test = c(0.1, 0.15, 0.2, 0.3, 0.25, -1, -0.5))
df <- createDataFrame(data)
testResult <- spark.kstest(df, "test", "norm")
summary(testResult)
```
Before this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/18615016/b9a2823a-7d4f-11e6-934b-128beade355e.png)
After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/18615014/aafe2798-7d4f-11e6-8b99-c705bb9fe8f2.png)
The new implementation is similar to 
[```print.summary.GeneralizedLinearRegressionModel```](https://github.com/apache/spark/blob/master/R/pkg/R/mllib.R#L284)
 of SparkR and 
[```print.summary.glm```](https://svn.r-project.org/R/trunk/src/library/stats/R/glm.R)
 of native R.

BTW, I removed the comparison against the ```print.summary.KSTest``` output in the unit test, since 
it only wraps the summary output, which has already been checked. Another reason 
is that the comparison would print summary information to the test console 
and make the test output messy.

## How was this patch tested?
Existing test.

Author: Yanbo Liang 

Closes #15139 from yanboliang/spark-17315.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6902edab
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6902edab
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6902edab

Branch: refs/heads/master
Commit: 6902edab7e80e96e3f57cf80f26cefb209d4d63c
Parents: c133907
Author: Yanbo Liang 
Authored: Wed Sep 21 20:14:18 2016 -0700
Committer: Yanbo Liang 
Committed: Wed Sep 21 20:14:18 2016 -0700

--
 R/pkg/R/mllib.R| 16 +---
 R/pkg/inst/tests/testthat/test_mllib.R | 16 ++--
 2 files changed, 11 insertions(+), 21 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6902edab/R/pkg/R/mllib.R
--
diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index 234b208..98db367 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -1398,20 +1398,22 @@ setMethod("summary", signature(object = "KSTest"),
 distParams <- unlist(callJMethod(jobj, "distParams"))
 degreesOfFreedom <- callJMethod(jobj, "degreesOfFreedom")
 
-list(p.value = pValue, statistic = statistic, nullHypothesis = 
nullHypothesis,
- nullHypothesis.name = distName, nullHypothesis.parameters = 
distParams,
- degreesOfFreedom = degreesOfFreedom)
+ans <- list(p.value = pValue, statistic = statistic, 
nullHypothesis = nullHypothesis,
+nullHypothesis.name = distName, 
nullHypothesis.parameters = distParams,
+degreesOfFreedom = degreesOfFreedom, jobj = jobj)
+class(ans) <- "summary.KSTest"
+ans
   })
 
 #  Prints the summary of KSTest
 
 #' @rdname spark.kstest
-#' @param x test result object of KSTest by \code{spark.kstest}.
+#' @param x summary object of KSTest returned by \code{summary}.
 #' @export
 #' @note print.summary.KSTest since 2.1.0
 print.summary.KSTest <- function(x, ...) {
-  jobj <- x@jobj
+  jobj <- x$jobj
   summaryStr <- callJMethod(jobj, "summary")
-  cat(summaryStr)
-  invisible(summaryStr)
+  cat(summaryStr, "\n")
+  invisible(x)
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/6902edab/R/pkg/inst/tests/testthat/test_mllib.R
--
diff --git a/R/pkg/inst/tests/testthat/test_mllib.R 
b/R/pkg/inst/tests/testthat/test_mllib.R
index 5b1404c..24c40a8 100644
--- a/R/pkg/inst/tests/testthat/test_mllib.R
+++ b/R/pkg/inst/tests/testthat/test_mllib.R
@@ -760,13 +760,7 @@ test_that("spark.kstest", {
 
   expect_equal(stats$p.value, rStats$p.value, tolerance = 1e-4)
   expect_equal(stats$statistic, unname(rStats$statistic), tolerance = 1e-4)
-
-  printStr <- print.summary.KSTest(testResult)
-  expect_match(printStr, paste0("Kolmogorov-Smirnov test summary:\\n",
-"degrees of freedom = 0 \\n",
-"statistic = 0.38208[0-9]* \\n",
-"pValue = 0.19849[0-9]* \\n",
-".*"), perl = TRUE)
+  expect_match(capture.output(stats)[1], "Kolmogorov-Smirnov test summary:")
 
   testResult <- spark.kstest(df, "test", "norm", -0.5)
   stats <- summary(testResult)
@@ -775,13 +769,7 @@ test_that("spark.kstes

spark git commit: [SPARK-17281][ML][MLLIB] Add treeAggregateDepth parameter for AFTSurvivalRegression

2016-09-22 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 646f38346 -> 72d9fba26


[SPARK-17281][ML][MLLIB] Add treeAggregateDepth parameter for 
AFTSurvivalRegression

## What changes were proposed in this pull request?

Add a treeAggregateDepth parameter to AFTSurvivalRegression to keep it consistent 
with LiR/LoR.
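
A brief usage sketch of the new expert param (illustrative; `training` is assumed to have "label", "censor" and "features" columns):

```scala
import org.apache.spark.ml.regression.AFTSurvivalRegression

// aggregationDepth (default 2) controls the depth of treeAggregate; it can be
// increased for very high-dimensional features or many partitions.
val aft = new AFTSurvivalRegression()
  .setCensorCol("censor")
  .setAggregationDepth(3)

val model = aft.fit(training)
```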

## How was this patch tested?

Existing tests.

Author: WeichenXu 

Closes #14851 from WeichenXu123/add_treeAggregate_param_for_survival_regression.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/72d9fba2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/72d9fba2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/72d9fba2

Branch: refs/heads/master
Commit: 72d9fba26c19aae73116fd0d00b566967934c6fc
Parents: 646f383
Author: WeichenXu 
Authored: Thu Sep 22 04:35:54 2016 -0700
Committer: Yanbo Liang 
Committed: Thu Sep 22 04:35:54 2016 -0700

--
 .../ml/regression/AFTSurvivalRegression.scala   | 24 
 python/pyspark/ml/regression.py | 11 +
 2 files changed, 25 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/72d9fba2/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala
index 3179f48..9d5ba99 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala
@@ -46,7 +46,7 @@ import org.apache.spark.storage.StorageLevel
  */
 private[regression] trait AFTSurvivalRegressionParams extends Params
   with HasFeaturesCol with HasLabelCol with HasPredictionCol with HasMaxIter
-  with HasTol with HasFitIntercept with Logging {
+  with HasTol with HasFitIntercept with HasAggregationDepth with Logging {
 
   /**
* Param for censor column name.
@@ -184,6 +184,17 @@ class AFTSurvivalRegression @Since("1.6.0") 
(@Since("1.6.0") override val uid: S
   setDefault(tol -> 1E-6)
 
   /**
+   * Suggested depth for treeAggregate (>= 2).
+   * If the dimensions of features or the number of partitions are large,
+   * this param could be adjusted to a larger size.
+   * Default is 2.
+   * @group expertSetParam
+   */
+  @Since("2.1.0")
+  def setAggregationDepth(value: Int): this.type = set(aggregationDepth, value)
+  setDefault(aggregationDepth -> 2)
+
+  /**
* Extract [[featuresCol]], [[labelCol]] and [[censorCol]] from input 
dataset,
* and put it in an RDD with strong types.
*/
@@ -207,7 +218,9 @@ class AFTSurvivalRegression @Since("1.6.0") 
(@Since("1.6.0") override val uid: S
   val combOp = (c1: MultivariateOnlineSummarizer, c2: 
MultivariateOnlineSummarizer) => {
 c1.merge(c2)
   }
-  instances.treeAggregate(new MultivariateOnlineSummarizer)(seqOp, combOp)
+  instances.treeAggregate(
+new MultivariateOnlineSummarizer
+  )(seqOp, combOp, $(aggregationDepth))
 }
 
 val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
@@ -222,7 +235,7 @@ class AFTSurvivalRegression @Since("1.6.0") 
(@Since("1.6.0") override val uid: S
 
 val bcFeaturesStd = instances.context.broadcast(featuresStd)
 
-val costFun = new AFTCostFun(instances, $(fitIntercept), bcFeaturesStd)
+val costFun = new AFTCostFun(instances, $(fitIntercept), bcFeaturesStd, 
$(aggregationDepth))
 val optimizer = new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
 
 /*
@@ -591,7 +604,8 @@ private class AFTAggregator(
 private class AFTCostFun(
 data: RDD[AFTPoint],
 fitIntercept: Boolean,
-bcFeaturesStd: Broadcast[Array[Double]]) extends DiffFunction[BDV[Double]] 
{
+bcFeaturesStd: Broadcast[Array[Double]],
+aggregationDepth: Int) extends DiffFunction[BDV[Double]] {
 
   override def calculate(parameters: BDV[Double]): (Double, BDV[Double]) = {
 
@@ -604,7 +618,7 @@ private class AFTCostFun(
   },
   combOp = (c1, c2) => (c1, c2) match {
 case (aggregator1, aggregator2) => aggregator1.merge(aggregator2)
-  })
+  }, depth = aggregationDepth)
 
 bcParameters.destroy(blocking = false)
 (aftAggregator.loss, aftAggregator.gradient)

http://git-wip-us.apache.org/repos/asf/spark/blob/72d9fba2/python/pyspark/ml/regression.py
--
diff --git a/python/pyspark/ml/regression.py b/python/pyspark/ml/regression.py
index 19afc72..55d3803 100644
--- a/python/pyspark/ml/regression.py
+++ b/python/pyspark/ml/regression.py
@@ -1088,7 +1088,8 @@ class GBTRegressionModel(TreeEnsembleModel, 
JavaPredi

spark git commit: [MINOR][SPARKR] Add sparkr-vignettes.html to gitignore.

2016-09-24 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 248916f55 -> 7945daed1


[MINOR][SPARKR] Add sparkr-vignettes.html to gitignore.

## What changes were proposed in this pull request?
Add ```sparkr-vignettes.html``` to ```.gitignore```.

## How was this patch tested?
No test needed.

Author: Yanbo Liang 

Closes #15215 from yanboliang/ignore.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7945daed
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7945daed
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7945daed

Branch: refs/heads/master
Commit: 7945daed12542587d51ece8f07e5c828b40db14a
Parents: 248916f
Author: Yanbo Liang 
Authored: Sat Sep 24 01:03:11 2016 -0700
Committer: Yanbo Liang 
Committed: Sat Sep 24 01:03:11 2016 -0700

--
 .gitignore | 1 +
 1 file changed, 1 insertion(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7945daed/.gitignore
--
diff --git a/.gitignore b/.gitignore
index cfa8ad0..39d17e1 100644
--- a/.gitignore
+++ b/.gitignore
@@ -24,6 +24,7 @@
 R-unit-tests.log
 R/unit-tests.out
 R/cran-check.out
+R/pkg/vignettes/sparkr-vignettes.html
 build/*.jar
 build/apache-maven*
 build/scala*





[2/2] spark git commit: [SPARK-16356][ML] Add testImplicits for ML unit tests and promote toDF()

2016-09-26 Thread yliang
[SPARK-16356][ML] Add testImplicits for ML unit tests and promote toDF()

## What changes were proposed in this pull request?

This was suggested in 
https://github.com/apache/spark/commit/101663f1ae222a919fc40510aa4f2bad22d1be6f#commitcomment-17114968.

This PR adds `testImplicits` to `MLlibTestSparkContext` so that implicits 
such as `toDF()` can be used across ML tests.

This PR also changes all usages of `spark.createDataFrame( ... )` to 
`toDF()` where applicable in the Scala ML tests.
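
A rough sketch of the resulting pattern inside a test suite (the suite name and data are hypothetical):

```scala
import org.apache.spark.SparkFunSuite
import org.apache.spark.mllib.util.MLlibTestSparkContext

class ToDFExampleSuite extends SparkFunSuite with MLlibTestSparkContext {

  import testImplicits._  // provided by MLlibTestSparkContext after this change

  test("create a DataFrame with toDF()") {
    // Previously: spark.createDataFrame(Seq((0, "a"), (1, "b"))).toDF("id", "label")
    val df = Seq((0, "a"), (1, "b")).toDF("id", "label")
    assert(df.count() === 2)
  }
}
```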

## How was this patch tested?

Existing tests should work.

Author: hyukjinkwon 

Closes #14035 from HyukjinKwon/minor-ml-test.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f234b7cd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f234b7cd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f234b7cd

Branch: refs/heads/master
Commit: f234b7cd795dd9baa3feff541c211b4daf39ccc6
Parents: 50b89d0
Author: hyukjinkwon 
Authored: Mon Sep 26 04:19:39 2016 -0700
Committer: Yanbo Liang 
Committed: Mon Sep 26 04:19:39 2016 -0700

--
 .../org/apache/spark/ml/PipelineSuite.scala |  13 +-
 .../ml/classification/ClassifierSuite.scala |  16 +--
 .../DecisionTreeClassifierSuite.scala   |   3 +-
 .../ml/classification/GBTClassifierSuite.scala  |   6 +-
 .../LogisticRegressionSuite.scala   |  43 +++
 .../MultilayerPerceptronClassifierSuite.scala   |  26 ++--
 .../ml/classification/NaiveBayesSuite.scala |  20 +--
 .../ml/classification/OneVsRestSuite.scala  |   4 +-
 .../RandomForestClassifierSuite.scala   |   3 +-
 .../apache/spark/ml/clustering/LDASuite.scala   |   6 +-
 .../BinaryClassificationEvaluatorSuite.scala|  14 ++-
 .../evaluation/RegressionEvaluatorSuite.scala   |   8 +-
 .../spark/ml/feature/BinarizerSuite.scala   |  16 +--
 .../spark/ml/feature/BucketizerSuite.scala  |  15 ++-
 .../spark/ml/feature/ChiSqSelectorSuite.scala   |   3 +-
 .../spark/ml/feature/CountVectorizerSuite.scala |  30 ++---
 .../org/apache/spark/ml/feature/DCTSuite.scala  |  10 +-
 .../spark/ml/feature/HashingTFSuite.scala   |  10 +-
 .../org/apache/spark/ml/feature/IDFSuite.scala  |   6 +-
 .../spark/ml/feature/InteractionSuite.scala |  53 
 .../spark/ml/feature/MaxAbsScalerSuite.scala|   5 +-
 .../spark/ml/feature/MinMaxScalerSuite.scala|  13 +-
 .../apache/spark/ml/feature/NGramSuite.scala|  35 +++---
 .../spark/ml/feature/NormalizerSuite.scala  |   4 +-
 .../spark/ml/feature/OneHotEncoderSuite.scala   |  10 +-
 .../org/apache/spark/ml/feature/PCASuite.scala  |   4 +-
 .../ml/feature/PolynomialExpansionSuite.scala   |  11 +-
 .../apache/spark/ml/feature/RFormulaSuite.scala | 126 ---
 .../spark/ml/feature/SQLTransformerSuite.scala  |   8 +-
 .../spark/ml/feature/StandardScalerSuite.scala  |  12 +-
 .../ml/feature/StopWordsRemoverSuite.scala  |  29 +++--
 .../spark/ml/feature/StringIndexerSuite.scala   |  32 ++---
 .../spark/ml/feature/TokenizerSuite.scala   |  17 +--
 .../spark/ml/feature/VectorAssemblerSuite.scala |  10 +-
 .../spark/ml/feature/VectorIndexerSuite.scala   |  15 ++-
 .../regression/AFTSurvivalRegressionSuite.scala |  26 ++--
 .../spark/ml/regression/GBTRegressorSuite.scala |   7 +-
 .../GeneralizedLinearRegressionSuite.scala  | 115 -
 .../ml/regression/IsotonicRegressionSuite.scala |  14 +--
 .../ml/regression/LinearRegressionSuite.scala   |  62 +
 .../tree/impl/GradientBoostedTreesSuite.scala   |   6 +-
 .../spark/ml/tuning/CrossValidatorSuite.scala   |  12 +-
 .../ml/tuning/TrainValidationSplitSuite.scala   |  13 +-
 .../apache/spark/mllib/util/MLUtilsSuite.scala  |  18 +--
 .../mllib/util/MLlibTestSparkContext.scala  |  13 +-
 45 files changed, 462 insertions(+), 460 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f234b7cd/mllib/src/test/scala/org/apache/spark/ml/PipelineSuite.scala
--
diff --git a/mllib/src/test/scala/org/apache/spark/ml/PipelineSuite.scala 
b/mllib/src/test/scala/org/apache/spark/ml/PipelineSuite.scala
index 3b490cd..6413ca1 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/PipelineSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/PipelineSuite.scala
@@ -36,6 +36,8 @@ import org.apache.spark.sql.types.StructType
 
 class PipelineSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
 
+  import testImplicits._
+
   abstract class MyModel extends Model[MyModel]
 
   test("pipeline") {
@@ -183,12 +185,11 @@ class PipelineSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defaul
   }
 
   test("pipeline validateParams") {
-val df = spark.createDataFrame(
-  Seq(
-(1, Vectors.dense(0.0, 1.0, 4.0),

[1/2] spark git commit: [SPARK-16356][ML] Add testImplicits for ML unit tests and promote toDF()

2016-09-26 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 50b89d05b -> f234b7cd7


http://git-wip-us.apache.org/repos/asf/spark/blob/f234b7cd/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala
--
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala 
b/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala
index b478fea..a6bbb94 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala
@@ -29,6 +29,8 @@ import org.apache.spark.sql.types.{DoubleType, StringType, 
StructField, StructTy
 class StringIndexerSuite
   extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
 
+  import testImplicits._
+
   test("params") {
 ParamsSuite.checkParams(new StringIndexer)
 val model = new StringIndexerModel("indexer", Array("a", "b"))
@@ -38,8 +40,8 @@ class StringIndexerSuite
   }
 
   test("StringIndexer") {
-val data = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, 
"a"), (5, "c")), 2)
-val df = spark.createDataFrame(data).toDF("id", "label")
+val data = Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
+val df = data.toDF("id", "label")
 val indexer = new StringIndexer()
   .setInputCol("label")
   .setOutputCol("labelIndex")
@@ -61,10 +63,10 @@ class StringIndexerSuite
   }
 
   test("StringIndexerUnseen") {
-val data = sc.parallelize(Seq((0, "a"), (1, "b"), (4, "b")), 2)
-val data2 = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c")), 2)
-val df = spark.createDataFrame(data).toDF("id", "label")
-val df2 = spark.createDataFrame(data2).toDF("id", "label")
+val data = Seq((0, "a"), (1, "b"), (4, "b"))
+val data2 = Seq((0, "a"), (1, "b"), (2, "c"))
+val df = data.toDF("id", "label")
+val df2 = data2.toDF("id", "label")
 val indexer = new StringIndexer()
   .setInputCol("label")
   .setOutputCol("labelIndex")
@@ -92,8 +94,8 @@ class StringIndexerSuite
   }
 
   test("StringIndexer with a numeric input column") {
-val data = sc.parallelize(Seq((0, 100), (1, 200), (2, 300), (3, 100), (4, 
100), (5, 300)), 2)
-val df = spark.createDataFrame(data).toDF("id", "label")
+val data = Seq((0, 100), (1, 200), (2, 300), (3, 100), (4, 100), (5, 300))
+val df = data.toDF("id", "label")
 val indexer = new StringIndexer()
   .setInputCol("label")
   .setOutputCol("labelIndex")
@@ -119,7 +121,7 @@ class StringIndexerSuite
   }
 
   test("StringIndexerModel can't overwrite output column") {
-val df = spark.createDataFrame(Seq((1, 2), (3, 4))).toDF("input", "output")
+val df = Seq((1, 2), (3, 4)).toDF("input", "output")
 intercept[IllegalArgumentException] {
   new StringIndexer()
 .setInputCol("input")
@@ -161,9 +163,7 @@ class StringIndexerSuite
 
   test("IndexToString.transform") {
 val labels = Array("a", "b", "c")
-val df0 = spark.createDataFrame(Seq(
-  (0, "a"), (1, "b"), (2, "c"), (0, "a")
-)).toDF("index", "expected")
+val df0 = Seq((0, "a"), (1, "b"), (2, "c"), (0, "a")).toDF("index", 
"expected")
 
 val idxToStr0 = new IndexToString()
   .setInputCol("index")
@@ -187,8 +187,8 @@ class StringIndexerSuite
   }
 
   test("StringIndexer, IndexToString are inverses") {
-val data = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, 
"a"), (5, "c")), 2)
-val df = spark.createDataFrame(data).toDF("id", "label")
+val data = Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
+val df = data.toDF("id", "label")
 val indexer = new StringIndexer()
   .setInputCol("label")
   .setOutputCol("labelIndex")
@@ -220,8 +220,8 @@ class StringIndexerSuite
   }
 
   test("StringIndexer metadata") {
-val data = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, 
"a"), (5, "c")), 2)
-val df = spark.createDataFrame(data).toDF("id", "label")
+val data = Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
+val df = data.toDF("id", "label")
 val indexer = new StringIndexer()
   .setInputCol("label")
   .setOutputCol("labelIndex")

http://git-wip-us.apache.org/repos/asf/spark/blob/f234b7cd/mllib/src/test/scala/org/apache/spark/ml/feature/TokenizerSuite.scala
--
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/feature/TokenizerSuite.scala 
b/mllib/src/test/scala/org/apache/spark/ml/feature/TokenizerSuite.scala
index f30bdc3..c895659 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/TokenizerSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/TokenizerSuite.scala
@@ -46,6 +46,7 @@ class RegexTokenizerSuite
   extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
 
   import org.apa

spark git commit: [SPARK-17577][FOLLOW-UP][SPARKR] SparkR spark.addFile supports adding directory recursively

2016-09-26 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 00be16df6 -> 93c743f1a


[SPARK-17577][FOLLOW-UP][SPARKR] SparkR spark.addFile supports adding directory 
recursively

## What changes were proposed in this pull request?
#15140 exposed ```JavaSparkContext.addFile(path: String, recursive: Boolean)``` 
to Python/R, so we can now update SparkR ```spark.addFile``` to support adding 
directories recursively.

## How was this patch tested?
Added unit test.

Author: Yanbo Liang 

Closes #15216 from yanboliang/spark-17577-2.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/93c743f1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/93c743f1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/93c743f1

Branch: refs/heads/master
Commit: 93c743f1aca433144611b11d4e1b169d66e0f57b
Parents: 00be16d
Author: Yanbo Liang 
Authored: Mon Sep 26 16:47:57 2016 -0700
Committer: Yanbo Liang 
Committed: Mon Sep 26 16:47:57 2016 -0700

--
 R/pkg/R/context.R|  9 +++--
 R/pkg/inst/tests/testthat/test_context.R | 22 ++
 2 files changed, 29 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/93c743f1/R/pkg/R/context.R
--
diff --git a/R/pkg/R/context.R b/R/pkg/R/context.R
index 4793578..fe2f3e3 100644
--- a/R/pkg/R/context.R
+++ b/R/pkg/R/context.R
@@ -231,17 +231,22 @@ setCheckpointDir <- function(sc, dirName) {
 #' filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark 
jobs,
 #' use spark.getSparkFiles(fileName) to find its download location.
 #'
+#' A directory can be given if the recursive option is set to true.
+#' Currently directories are only supported for Hadoop-supported filesystems.
+#' Refer Hadoop-supported filesystems at 
\url{https://wiki.apache.org/hadoop/HCFS}.
+#'
 #' @rdname spark.addFile
 #' @param path The path of the file to be added
+#' @param recursive Whether to add files recursively from the path. Default is 
FALSE.
 #' @export
 #' @examples
 #'\dontrun{
 #' spark.addFile("~/myfile")
 #'}
 #' @note spark.addFile since 2.1.0
-spark.addFile <- function(path) {
+spark.addFile <- function(path, recursive = FALSE) {
   sc <- getSparkContext()
-  invisible(callJMethod(sc, "addFile", suppressWarnings(normalizePath(path
+  invisible(callJMethod(sc, "addFile", suppressWarnings(normalizePath(path)), 
recursive))
 }
 
 #' Get the root directory that contains files added through spark.addFile.

http://git-wip-us.apache.org/repos/asf/spark/blob/93c743f1/R/pkg/inst/tests/testthat/test_context.R
--
diff --git a/R/pkg/inst/tests/testthat/test_context.R 
b/R/pkg/inst/tests/testthat/test_context.R
index 0495418..caca069 100644
--- a/R/pkg/inst/tests/testthat/test_context.R
+++ b/R/pkg/inst/tests/testthat/test_context.R
@@ -169,6 +169,7 @@ test_that("spark.lapply should perform simple transforms", {
 
 test_that("add and get file to be downloaded with Spark job on every node", {
   sparkR.sparkContext()
+  # Test add file.
   path <- tempfile(pattern = "hello", fileext = ".txt")
   filename <- basename(path)
   words <- "Hello World!"
@@ -177,5 +178,26 @@ test_that("add and get file to be downloaded with Spark 
job on every node", {
   download_path <- spark.getSparkFiles(filename)
   expect_equal(readLines(download_path), words)
   unlink(path)
+
+  # Test add directory recursively.
+  path <- paste0(tempdir(), "/", "recursive_dir")
+  dir.create(path)
+  dir_name <- basename(path)
+  path1 <- paste0(path, "/", "hello.txt")
+  file.create(path1)
+  sub_path <- paste0(path, "/", "sub_hello")
+  dir.create(sub_path)
+  path2 <- paste0(sub_path, "/", "sub_hello.txt")
+  file.create(path2)
+  words <- "Hello World!"
+  sub_words <- "Sub Hello World!"
+  writeLines(words, path1)
+  writeLines(sub_words, path2)
+  spark.addFile(path, recursive = TRUE)
+  download_path1 <- spark.getSparkFiles(paste0(dir_name, "/", "hello.txt"))
+  expect_equal(readLines(download_path1), words)
+  download_path2 <- spark.getSparkFiles(paste0(dir_name, "/", 
"sub_hello/sub_hello.txt"))
+  expect_equal(readLines(download_path2), sub_words)
+  unlink(path, recursive = TRUE)
   sparkR.session.stop()
 })





spark git commit: [SPARK-17138][ML][MLIB] Add Python API for multinomial logistic regression

2016-09-27 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 85b0a1575 -> 7f16affa2


[SPARK-17138][ML][MLIB] Add Python API for multinomial logistic regression

## What changes were proposed in this pull request?

Add Python API for multinomial logistic regression.

- add a `family` param to the Python API.
- expose `coefficientMatrix` and `interceptVector` on `LogisticRegressionModel`.
- add a Python-side test case for multinomial logistic regression.
- update the Python docs.

## How was this patch tested?

existing and added doc tests.

Author: WeichenXu 

Closes #14852 from WeichenXu123/add_MLOR_python.
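
The Scala API this wraps already exposes the same pieces; here is a minimal sketch (assuming a spark-shell session, so `spark` and its implicits are in scope, with toy data analogous to the doctest above) of fitting a multinomial model and reading the matrix-shaped coefficients.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

// Three-class toy data, analogous to `mdf` in the doctest above.
val mdf = Seq(
  (1.0, 2.0, Vectors.dense(1.0)),
  (0.0, 2.0, Vectors.sparse(1, Array.empty[Int], Array.empty[Double])),
  (2.0, 2.0, Vectors.dense(3.0))
).toDF("label", "weight", "features")

val mlor = new LogisticRegression()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setWeightCol("weight")
  .setFamily("multinomial")       // the same `family` param now exposed in Python

val model = mlor.fit(mdf)
println(model.coefficientMatrix)  // numClasses x numFeatures
println(model.interceptVector)    // one intercept per class
```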


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7f16affa
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7f16affa
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7f16affa

Branch: refs/heads/master
Commit: 7f16affa262b059580ed2775a7b05a767aa72315
Parents: 85b0a15
Author: WeichenXu 
Authored: Tue Sep 27 00:00:21 2016 -0700
Committer: Yanbo Liang 
Committed: Tue Sep 27 00:00:21 2016 -0700

--
 python/pyspark/ml/classification.py | 90 +---
 1 file changed, 70 insertions(+), 20 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7f16affa/python/pyspark/ml/classification.py
--
diff --git a/python/pyspark/ml/classification.py 
b/python/pyspark/ml/classification.py
index b4c01fd..505e7bf 100644
--- a/python/pyspark/ml/classification.py
+++ b/python/pyspark/ml/classification.py
@@ -67,21 +67,34 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, 
HasLabelCol, HasPredicti
  HasWeightCol, HasAggregationDepth, JavaMLWritable, 
JavaMLReadable):
 """
 Logistic regression.
-Currently, this class only supports binary classification.
+This class supports multinomial logistic (softmax) and binomial logistic 
regression.
 
 >>> from pyspark.sql import Row
 >>> from pyspark.ml.linalg import Vectors
->>> df = sc.parallelize([
+>>> bdf = sc.parallelize([
 ... Row(label=1.0, weight=2.0, features=Vectors.dense(1.0)),
 ... Row(label=0.0, weight=2.0, features=Vectors.sparse(1, [], 
[]))]).toDF()
->>> lr = LogisticRegression(maxIter=5, regParam=0.01, weightCol="weight")
->>> model = lr.fit(df)
->>> model.coefficients
+>>> blor = LogisticRegression(maxIter=5, regParam=0.01, weightCol="weight")
+>>> blorModel = blor.fit(bdf)
+>>> blorModel.coefficients
 DenseVector([5.5...])
->>> model.intercept
+>>> blorModel.intercept
 -2.68...
+>>> mdf = sc.parallelize([
+... Row(label=1.0, weight=2.0, features=Vectors.dense(1.0)),
+... Row(label=0.0, weight=2.0, features=Vectors.sparse(1, [], [])),
+... Row(label=2.0, weight=2.0, features=Vectors.dense(3.0))]).toDF()
+>>> mlor = LogisticRegression(maxIter=5, regParam=0.01, weightCol="weight",
+... family="multinomial")
+>>> mlorModel = mlor.fit(mdf)
+>>> print(mlorModel.coefficientMatrix)
+DenseMatrix([[-2.3...],
+ [ 0.2...],
+ [ 2.1... ]])
+>>> mlorModel.interceptVector
+DenseVector([2.0..., 0.8..., -2.8...])
 >>> test0 = sc.parallelize([Row(features=Vectors.dense(-1.0))]).toDF()
->>> result = model.transform(test0).head()
+>>> result = blorModel.transform(test0).head()
 >>> result.prediction
 0.0
 >>> result.probability
@@ -89,23 +102,23 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, 
HasLabelCol, HasPredicti
 >>> result.rawPrediction
 DenseVector([8.22..., -8.22...])
 >>> test1 = sc.parallelize([Row(features=Vectors.sparse(1, [0], 
[1.0]))]).toDF()
->>> model.transform(test1).head().prediction
+>>> blorModel.transform(test1).head().prediction
 1.0
->>> lr.setParams("vector")
+>>> blor.setParams("vector")
 Traceback (most recent call last):
 ...
 TypeError: Method setParams forces keyword arguments.
 >>> lr_path = temp_path + "/lr"
->>> lr.save(lr_path)
+>>> blor.save(lr_path)
 >>> lr2 = LogisticRegression.load(lr_path)
 >>> lr2.getMaxIter()
 5
 >>> model_path = temp_path + "/lr_model"
->>> model.save(model_path)
+>>> blorModel.save(model_path)
 >>> model2 = LogisticRegressionModel.load(model_path)
->>> model.coefficients[0] == model2.coefficients[0]
+>>> blorModel.coefficients[0] == model2.coefficients[0]
 True
->>> model.intercept == model2.intercept
+>>> blorModel.intercept == model2.intercept
 True
 
 .. versionadded:: 1.3.0
@@ -117,24 +130,29 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, 
HasLabelCol, HasPredicti
   "e.g. if threshold is p, then thresholds must be equal 
to [1-p, p].",

spark git commit: [SPARK-16356][FOLLOW-UP][ML] Enforce ML test of exception for local/distributed Dataset.

2016-09-29 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 37eb9184f -> a19a1bb59


[SPARK-16356][FOLLOW-UP][ML] Enforce ML test of exception for local/distributed 
Dataset.

## What changes were proposed in this pull request?
#14035 added ```testImplicits``` to ML unit tests and promoted ```toDF()```, but left one minor issue in ```VectorIndexerSuite```. If we create the DataFrame with ```Seq(...).toDF()```, one of the test cases throws a different error/exception than with ```sc.parallelize(Seq(...)).toDF()```. After an in-depth study, I found this is caused by the different behavior of local and distributed Datasets when a UDF fails at ```assert```: if the data is a local Dataset, the ```AssertionError``` is thrown directly; if the data is a distributed Dataset, it is wrapped in a ```SparkException```. I think we should enforce that this test covers both cases.

## How was this patch tested?
Unit test.

Author: Yanbo Liang 

Closes #15261 from yanboliang/spark-16356.
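
As a standalone illustration of the two failure shapes described above (a sketch only, not the `VectorIndexerSuite` code; assumes a spark-shell session on Spark 2.x, and the helper name is made up):

```scala
import org.apache.spark.SparkException
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

val check = udf { x: Double => assert(x >= 0.0, "negative value"); x }

// Illustrative helper: run the failing UDF and report which exception surfaced.
def failureShape(df: DataFrame): String =
  try { df.withColumn("y", check(col("x"))).collect(); "no error" }
  catch {
    case e: SparkException =>
      s"SparkException, cause = ${Option(e.getCause).map(_.getClass.getSimpleName).getOrElse("n/a")}"
    case _: AssertionError => "AssertionError thrown directly"
  }

println(failureShape(Seq(1.0, -2.0).toDF("x")))                  // local Dataset
println(failureShape(Seq(1.0, -2.0).toDF("x").repartition(2)))   // distributed Dataset
```

The suite now intercepts both shapes explicitly, which is the enforcement this patch adds.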


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a19a1bb5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a19a1bb5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a19a1bb5

Branch: refs/heads/master
Commit: a19a1bb59411177caaf99581e89098826b7d0c7b
Parents: 37eb918
Author: Yanbo Liang 
Authored: Thu Sep 29 00:54:26 2016 -0700
Committer: Yanbo Liang 
Committed: Thu Sep 29 00:54:26 2016 -0700

--
 .../apache/spark/ml/feature/VectorIndexerSuite.scala   | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a19a1bb5/mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala
--
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala 
b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala
index 4da1b13..b28ce2a 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala
@@ -88,9 +88,7 @@ class VectorIndexerSuite extends SparkFunSuite with 
MLlibTestSparkContext
 
 densePoints1 = densePoints1Seq.map(FeatureData).toDF()
 sparsePoints1 = sparsePoints1Seq.map(FeatureData).toDF()
-// TODO: If we directly use `toDF` without parallelize, the test in
-// "Throws error when given RDDs with different size vectors" is failed 
for an unknown reason.
-densePoints2 = sc.parallelize(densePoints2Seq, 2).map(FeatureData).toDF()
+densePoints2 = densePoints2Seq.map(FeatureData).toDF()
 sparsePoints2 = sparsePoints2Seq.map(FeatureData).toDF()
 badPoints = badPointsSeq.map(FeatureData).toDF()
   }
@@ -121,10 +119,17 @@ class VectorIndexerSuite extends SparkFunSuite with 
MLlibTestSparkContext
 
 model.transform(densePoints1) // should work
 model.transform(sparsePoints1) // should work
-intercept[SparkException] {
+// If the data is local Dataset, it throws AssertionError directly.
+intercept[AssertionError] {
   model.transform(densePoints2).collect()
   logInfo("Did not throw error when fit, transform were called on vectors 
of different lengths")
 }
+// If the data is distributed Dataset, it throws SparkException
+// which is the wrapper of AssertionError.
+intercept[SparkException] {
+  model.transform(densePoints2.repartition(2)).collect()
+  logInfo("Did not throw error when fit, transform were called on vectors 
of different lengths")
+}
 intercept[SparkException] {
   vectorIndexer.fit(badPoints)
   logInfo("Did not throw error when fitting vectors of different lengths 
in same RDD.")





spark git commit: [SPARK-17704][ML][MLLIB] ChiSqSelector performance improvement.

2016-09-29 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master a19a1bb59 -> f7082ac12


[SPARK-17704][ML][MLLIB] ChiSqSelector performance improvement.

## What changes were proposed in this pull request?
Several performance improvements for ```ChiSqSelector```:
1. Keep ```selectedFeatures``` sorted in ascending order. ```ChiSqSelectorModel.transform``` needs ```selectedFeatures``` sorted to make predictions, so we should sort it once at training time rather than at prediction time, since users usually train a model once and then use it for prediction many times.
2. When training an ```fpr``` type ```ChiSqSelectorModel```, it is not necessary to sort the ChiSq test results by statistic.

## How was this patch tested?
Existing unit tests.

Author: Yanbo Liang 

Closes #15277 from yanboliang/spark-17704.
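
The reason sorting pays off at training time is that `compress` can then walk a sparse vector's indices and the filter indices with a single two-pointer merge, with no per-transform sort. A standalone sketch of that merge (plain Scala, mirroring the sparse-vector branch above but not the Spark source itself):

```scala
import scala.collection.mutable.ArrayBuffer

// Keep only the entries of a sparse vector whose indices appear in the
// (ascending) filterIndices, re-indexed into the compressed space.
def selectSorted(indices: Array[Int], values: Array[Double],
                 filterIndices: Array[Int]): (Array[Int], Array[Double]) = {
  val newIndices = ArrayBuffer.empty[Int]
  val newValues = ArrayBuffer.empty[Double]
  var i = 0
  var j = 0
  while (i < indices.length && j < filterIndices.length) {
    if (indices(i) == filterIndices(j)) {
      newIndices += j; newValues += values(i); i += 1; j += 1
    } else if (indices(i) < filterIndices(j)) {
      i += 1   // nonzero feature not selected
    } else {
      j += 1   // selected feature is zero in this vector
    }
  }
  (newIndices.toArray, newValues.toArray)
}

// Keep features 1 and 3 of a vector with nonzeros at 0, 1, 3:
selectSorted(Array(0, 1, 3), Array(5.0, 6.0, 7.0), Array(1, 3))
// -> (Array(0, 1), Array(6.0, 7.0))
```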


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f7082ac1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f7082ac1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f7082ac1

Branch: refs/heads/master
Commit: f7082ac12518ae84d6d1d4b7330a9f12cf95e7c1
Parents: a19a1bb
Author: Yanbo Liang 
Authored: Thu Sep 29 04:30:42 2016 -0700
Committer: Yanbo Liang 
Committed: Thu Sep 29 04:30:42 2016 -0700

--
 .../spark/mllib/feature/ChiSqSelector.scala | 45 +---
 project/MimaExcludes.scala  |  3 --
 2 files changed, 30 insertions(+), 18 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f7082ac1/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala
index 0f7c6e8..706ce78 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala
@@ -35,12 +35,24 @@ import org.apache.spark.sql.{Row, SparkSession}
 /**
  * Chi Squared selector model.
  *
- * @param selectedFeatures list of indices to select (filter).
+ * @param selectedFeatures list of indices to select (filter). Must be ordered 
asc
  */
 @Since("1.3.0")
 class ChiSqSelectorModel @Since("1.3.0") (
   @Since("1.3.0") val selectedFeatures: Array[Int]) extends VectorTransformer 
with Saveable {
 
+  require(isSorted(selectedFeatures), "Array has to be sorted asc")
+
+  protected def isSorted(array: Array[Int]): Boolean = {
+var i = 1
+val len = array.length
+while (i < len) {
+  if (array(i) < array(i-1)) return false
+  i += 1
+}
+true
+  }
+
   /**
* Applies transformation on a vector.
*
@@ -57,22 +69,21 @@ class ChiSqSelectorModel @Since("1.3.0") (
* Preserves the order of filtered features the same as their indices are 
stored.
* Might be moved to Vector as .slice
* @param features vector
-   * @param filterIndices indices of features to filter
+   * @param filterIndices indices of features to filter, must be ordered asc
*/
   private def compress(features: Vector, filterIndices: Array[Int]): Vector = {
-val orderedIndices = filterIndices.sorted
 features match {
   case SparseVector(size, indices, values) =>
-val newSize = orderedIndices.length
+val newSize = filterIndices.length
 val newValues = new ArrayBuilder.ofDouble
 val newIndices = new ArrayBuilder.ofInt
 var i = 0
 var j = 0
 var indicesIdx = 0
 var filterIndicesIdx = 0
-while (i < indices.length && j < orderedIndices.length) {
+while (i < indices.length && j < filterIndices.length) {
   indicesIdx = indices(i)
-  filterIndicesIdx = orderedIndices(j)
+  filterIndicesIdx = filterIndices(j)
   if (indicesIdx == filterIndicesIdx) {
 newIndices += j
 newValues += values(i)
@@ -90,7 +101,7 @@ class ChiSqSelectorModel @Since("1.3.0") (
 Vectors.sparse(newSize, newIndices.result(), newValues.result())
   case DenseVector(values) =>
 val values = features.toArray
-Vectors.dense(orderedIndices.map(i => values(i)))
+Vectors.dense(filterIndices.map(i => values(i)))
   case other =>
 throw new UnsupportedOperationException(
   s"Only sparse and dense vectors are supported but got 
${other.getClass}.")
@@ -220,18 +231,22 @@ class ChiSqSelector @Since("2.1.0") () extends 
Serializable {
   @Since("1.3.0")
   def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
 val chiSqTestResult = Statistics.chiSqTest(data)
-  .zipWithIndex.sortBy { case (res, _) => -res.statistic }
 val features = selectorType match {
-  case ChiSqSelector.KBest => chiSqTestResult

spark git commit: [SPARK-14077][ML] Refactor NaiveBayes to support weighted instances

2016-09-29 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 74ac1c438 -> 1fad55968


[SPARK-14077][ML] Refactor NaiveBayes to support weighted instances

## What changes were proposed in this pull request?
1. Support weighted instances.
2. Use Dataset/DataFrame instead of RDD.
3. Make mllib a wrapper that calls ml.

## How was this patch tested?
Local manual tests in spark-shell and unit tests.

Author: Zheng RuiFeng 

Closes #12819 from zhengruifeng/weighted_nb.
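
A minimal usage sketch of the new weighted-instance support on the ml side (assumes a spark-shell session; data is a toy example):

```scala
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

val df = Seq(
  (0.0, 0.1, Vectors.dense(0.0, 0.0)),
  (0.0, 0.5, Vectors.dense(0.0, 1.0)),
  (1.0, 1.0, Vectors.dense(1.0, 0.0))
).toDF("label", "weight", "features")

val model = new NaiveBayes()
  .setModelType("multinomial")
  .setSmoothing(1.0)
  .setWeightCol("weight")   // new: per-instance weights
  .fit(df)

println(model.pi)      // log class priors
println(model.theta)   // log class-conditional probabilities
```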


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1fad5596
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1fad5596
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1fad5596

Branch: refs/heads/master
Commit: 1fad5596885aab8b32d2307c0edecbae50d5bd7a
Parents: 74ac1c4
Author: Zheng RuiFeng 
Authored: Thu Sep 29 23:55:42 2016 -0700
Committer: Yanbo Liang 
Committed: Thu Sep 29 23:55:42 2016 -0700

--
 .../spark/ml/classification/NaiveBayes.scala| 154 ++-
 .../spark/mllib/classification/NaiveBayes.scala |  99 
 .../ml/classification/NaiveBayesSuite.scala |  50 +-
 3 files changed, 191 insertions(+), 112 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1fad5596/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala 
b/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
index f939a1c..0d652aa 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
@@ -19,23 +19,20 @@ package org.apache.spark.ml.classification
 
 import org.apache.hadoop.fs.Path
 
-import org.apache.spark.SparkException
 import org.apache.spark.annotation.Since
 import org.apache.spark.ml.PredictorParams
 import org.apache.spark.ml.linalg._
 import org.apache.spark.ml.param.{DoubleParam, Param, ParamMap, 
ParamValidators}
+import org.apache.spark.ml.param.shared.HasWeightCol
 import org.apache.spark.ml.util._
-import org.apache.spark.mllib.classification.{NaiveBayes => OldNaiveBayes}
-import org.apache.spark.mllib.classification.{NaiveBayesModel => 
OldNaiveBayesModel}
-import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}
-import org.apache.spark.mllib.util.MLUtils
-import org.apache.spark.rdd.RDD
-import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.Dataset
+import org.apache.spark.sql.functions.{col, lit}
+import org.apache.spark.sql.types.DoubleType
 
 /**
  * Params for Naive Bayes Classifiers.
  */
-private[ml] trait NaiveBayesParams extends PredictorParams {
+private[ml] trait NaiveBayesParams extends PredictorParams with HasWeightCol {
 
   /**
* The smoothing parameter.
@@ -56,7 +53,7 @@ private[ml] trait NaiveBayesParams extends PredictorParams {
*/
   final val modelType: Param[String] = new Param[String](this, "modelType", 
"The model type " +
 "which is a string (case-sensitive). Supported options: multinomial 
(default) and bernoulli.",
-ParamValidators.inArray[String](OldNaiveBayes.supportedModelTypes.toArray))
+ParamValidators.inArray[String](NaiveBayes.supportedModelTypes.toArray))
 
   /** @group getParam */
   final def getModelType: String = $(modelType)
@@ -64,7 +61,7 @@ private[ml] trait NaiveBayesParams extends PredictorParams {
 
 /**
  * Naive Bayes Classifiers.
- * It supports both Multinomial NB
+ * It supports Multinomial NB
  * 
([[http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html]])
  * which can handle finitely supported discrete data. For example, by 
converting documents into
  * TF-IDF vectors, it can be used for document classification. By making every 
vector a
@@ -78,6 +75,8 @@ class NaiveBayes @Since("1.5.0") (
   extends ProbabilisticClassifier[Vector, NaiveBayes, NaiveBayesModel]
   with NaiveBayesParams with DefaultParamsWritable {
 
+  import NaiveBayes.{Bernoulli, Multinomial}
+
   @Since("1.5.0")
   def this() = this(Identifiable.randomUID("nb"))
 
@@ -98,7 +97,17 @@ class NaiveBayes @Since("1.5.0") (
*/
   @Since("1.5.0")
   def setModelType(value: String): this.type = set(modelType, value)
-  setDefault(modelType -> OldNaiveBayes.Multinomial)
+  setDefault(modelType -> NaiveBayes.Multinomial)
+
+  /**
+   * Sets the value of param [[weightCol]].
+   * If this is not set or empty, we treat all instance weights as 1.0.
+   * Default is not set, so all instances have weight one.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setWeightCol(value: String): this.type = set(weightCol, value)
 
   override protected def train(dataset: Dataset[_]): NaiveBayesModel = {
 val numCl

spark git commit: [SPARK-14077][ML][FOLLOW-UP] Revert change for NB Model's Load to maintain compatibility with the model stored before 2.0

2016-09-30 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 1fad55968 -> 8e491af52


[SPARK-14077][ML][FOLLOW-UP] Revert change for NB Model's Load to maintain 
compatibility with the model stored before 2.0

## What changes were proposed in this pull request?
Revert the change to NaiveBayesModel's load path to maintain compatibility with models stored before 2.0.

## How was this patch tested?
local build

Author: Zheng RuiFeng 

Closes #15313 from zhengruifeng/revert_save_load.
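
For context, the loader keeps old models readable by converting the legacy mllib vector/matrix columns to the new ml types. A hedged sketch of those conversion helpers on a hypothetical DataFrame (assumes a spark-shell session):

```scala
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.util.MLUtils

// Hypothetical data holding a pre-2.0 (mllib) vector column, standing in for
// what sparkSession.read.parquet(dataPath) would return for an old model.
val oldData = spark.createDataFrame(Seq(
  (0.0, OldVectors.dense(0.1, 0.9))
)).toDF("label", "pi")

// Convert the mllib vector UDT column into the ml vector UDT; a matrix column
// would go through MLUtils.convertMatrixColumnsToML the same way.
val converted = MLUtils.convertVectorColumnsToML(oldData, "pi")
converted.printSchema()
```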


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8e491af5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8e491af5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8e491af5

Branch: refs/heads/master
Commit: 8e491af52930886cbe0c54e7d67add3796ddb15f
Parents: 1fad559
Author: Zheng RuiFeng 
Authored: Fri Sep 30 08:18:48 2016 -0700
Committer: Yanbo Liang 
Committed: Fri Sep 30 08:18:48 2016 -0700

--
 .../org/apache/spark/ml/classification/NaiveBayes.scala  | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8e491af5/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala 
b/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
index 0d652aa..6775745 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
@@ -25,7 +25,8 @@ import org.apache.spark.ml.linalg._
 import org.apache.spark.ml.param.{DoubleParam, Param, ParamMap, 
ParamValidators}
 import org.apache.spark.ml.param.shared.HasWeightCol
 import org.apache.spark.ml.util._
-import org.apache.spark.sql.Dataset
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.sql.{Dataset, Row}
 import org.apache.spark.sql.functions.{col, lit}
 import org.apache.spark.sql.types.DoubleType
 
@@ -362,9 +363,11 @@ object NaiveBayesModel extends MLReadable[NaiveBayesModel] 
{
   val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
 
   val dataPath = new Path(path, "data").toString
-  val data = sparkSession.read.parquet(dataPath).select("pi", 
"theta").head()
-  val pi = data.getAs[Vector](0)
-  val theta = data.getAs[Matrix](1)
+  val data = sparkSession.read.parquet(dataPath)
+  val vecConverted = MLUtils.convertVectorColumnsToML(data, "pi")
+  val Row(pi: Vector, theta: Matrix) = 
MLUtils.convertMatrixColumnsToML(vecConverted, "theta")
+.select("pi", "theta")
+.head()
   val model = new NaiveBayesModel(metadata.uid, pi, theta)
 
   DefaultParamsReader.getAndSetParams(model, metadata)





spark git commit: [SPARK-17744][ML] Parity check between the ml and mllib test suites for NB

2016-10-04 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 7d5160883 -> c17f97183


[SPARK-17744][ML] Parity check between the ml and mllib test suites for NB

## What changes were proposed in this pull request?
1. Parity check and add missing test suites for ml's NaiveBayes.
2. Remove some unused imports.

## How was this patch tested?
Manual tests in spark-shell.

Author: Zheng RuiFeng 

Closes #15312 from zhengruifeng/nb_test_parity.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c17f9718
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c17f9718
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c17f9718

Branch: refs/heads/master
Commit: c17f971839816e68f8abe2c8eb4e4db47c57ab67
Parents: 7d51608
Author: Zheng RuiFeng 
Authored: Tue Oct 4 06:54:48 2016 -0700
Committer: Yanbo Liang 
Committed: Tue Oct 4 06:54:48 2016 -0700

--
 .../apache/spark/ml/feature/LabeledPoint.scala  |  2 +-
 .../spark/ml/feature/QuantileDiscretizer.scala  |  2 +-
 .../org/apache/spark/ml/python/MLSerDe.scala|  5 --
 .../spark/ml/regression/GBTRegressor.scala  |  2 +-
 .../spark/ml/regression/LinearRegression.scala  |  1 -
 .../ml/classification/NaiveBayesSuite.scala | 69 +++-
 python/pyspark/ml/classification.py |  1 -
 7 files changed, 70 insertions(+), 12 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c17f9718/mllib/src/main/scala/org/apache/spark/ml/feature/LabeledPoint.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/feature/LabeledPoint.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/LabeledPoint.scala
index 6cefa70..7d8e4ad 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/LabeledPoint.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/LabeledPoint.scala
@@ -25,7 +25,7 @@ import org.apache.spark.ml.linalg.Vector
 /**
  * :: Experimental ::
  *
- * Class that represents the features and labels of a data point.
+ * Class that represents the features and label of a data point.
  *
  * @param label Label for this data point.
  * @param features List of features for this data point.

http://git-wip-us.apache.org/repos/asf/spark/blob/c17f9718/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
index 1e59d71..05e034d 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
@@ -25,7 +25,7 @@ import org.apache.spark.ml.param._
 import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
 import org.apache.spark.ml.util._
 import org.apache.spark.sql.Dataset
-import org.apache.spark.sql.types.{DoubleType, StructType}
+import org.apache.spark.sql.types.StructType
 
 /**
  * Params for [[QuantileDiscretizer]].

http://git-wip-us.apache.org/repos/asf/spark/blob/c17f9718/mllib/src/main/scala/org/apache/spark/ml/python/MLSerDe.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/python/MLSerDe.scala 
b/mllib/src/main/scala/org/apache/spark/ml/python/MLSerDe.scala
index 4b805e1..da62f85 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/python/MLSerDe.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/python/MLSerDe.scala
@@ -19,17 +19,12 @@ package org.apache.spark.ml.python
 
 import java.io.OutputStream
 import java.nio.{ByteBuffer, ByteOrder}
-import java.util.{ArrayList => JArrayList}
-
-import scala.collection.JavaConverters._
 
 import net.razorvine.pickle._
 
-import org.apache.spark.api.java.JavaRDD
 import org.apache.spark.api.python.SerDeUtil
 import org.apache.spark.ml.linalg._
 import org.apache.spark.mllib.api.python.SerDeBase
-import org.apache.spark.rdd.RDD
 
 /**
  * SerDe utility functions for pyspark.ml.

http://git-wip-us.apache.org/repos/asf/spark/blob/c17f9718/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala 
b/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala
index ce35593..bb01f9d 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala
@@ -21,7 +21,7 @@ import com.github.fommil.netlib.BLAS.{getInstance => blas}
 import org.json4s.{DefaultFormats, JObject}
 import org.json4s.JsonDSL._
 
-

spark git commit: [MINOR][ML] Avoid 2D array flatten in NB training.

2016-10-05 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master b678e465a -> 7aeb20be7


[MINOR][ML] Avoid 2D array flatten in NB training.

## What changes were proposed in this pull request?
Avoid flattening a 2D array in ```NaiveBayes``` training, since the flatten method can be expensive (it allocates another array and copies the data into it).

## How was this patch tested?
Existing tests.

Author: Yanbo Liang 

Closes #15359 from yanboliang/nb-theta.
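
A standalone sketch of the layout change: one flat row-major buffer indexed as `i * numFeatures + j`, handed straight to `DenseMatrix` with `isTransposed = true`, instead of a 2-D array plus `flatten`.

```scala
import org.apache.spark.ml.linalg.DenseMatrix

val numLabels = 3
val numFeatures = 4

// One flat buffer instead of Array.fill(numLabels, numFeatures)(0.0) + flatten,
// which would allocate a 2-D array and then copy it again.
val thetaArray = new Array[Double](numLabels * numFeatures)
thetaArray(2 * numFeatures + 1) = 0.5        // same cell as theta(2)(1) in a 2-D layout

// isTransposed = true tells DenseMatrix the values are laid out row-major.
val theta = new DenseMatrix(numLabels, numFeatures, thetaArray, true)
println(theta(2, 1))                         // 0.5
```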


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7aeb20be
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7aeb20be
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7aeb20be

Branch: refs/heads/master
Commit: 7aeb20be7e999523784aca7be1a7c9c99dec125e
Parents: b678e46
Author: Yanbo Liang 
Authored: Wed Oct 5 23:03:09 2016 -0700
Committer: Yanbo Liang 
Committed: Wed Oct 5 23:03:09 2016 -0700

--
 .../org/apache/spark/ml/classification/NaiveBayes.scala  | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7aeb20be/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala 
b/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
index 6775745..e565a6f 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
@@ -176,8 +176,8 @@ class NaiveBayes @Since("1.5.0") (
 val numLabels = aggregated.length
 val numDocuments = aggregated.map(_._2._1).sum
 
-val piArray = Array.fill[Double](numLabels)(0.0)
-val thetaArrays = Array.fill[Double](numLabels, numFeatures)(0.0)
+val piArray = new Array[Double](numLabels)
+val thetaArray = new Array[Double](numLabels * numFeatures)
 
 val lambda = $(smoothing)
 val piLogDenom = math.log(numDocuments + numLabels * lambda)
@@ -193,14 +193,14 @@ class NaiveBayes @Since("1.5.0") (
   }
   var j = 0
   while (j < numFeatures) {
-thetaArrays(i)(j) = math.log(sumTermFreqs(j) + lambda) - thetaLogDenom
+thetaArray(i * numFeatures + j) = math.log(sumTermFreqs(j) + lambda) - 
thetaLogDenom
 j += 1
   }
   i += 1
 }
 
 val pi = Vectors.dense(piArray)
-val theta = new DenseMatrix(numLabels, thetaArrays(0).length, 
thetaArrays.flatten, true)
+val theta = new DenseMatrix(numLabels, numFeatures, thetaArray, true)
 new NaiveBayesModel(uid, pi, theta)
   }
 





spark git commit: [SPARK-17792][ML] L-BFGS solver for linear regression does not accept general numeric label column types

2016-10-06 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 49d11d499 -> 3713bb199


[SPARK-17792][ML] L-BFGS solver for linear regression does not accept general 
numeric label column types

## What changes were proposed in this pull request?

Before, we computed `instances` in LinearRegression in two spots, even though 
they did the same thing. One of them did not cast the label column to 
`DoubleType`. This patch consolidates the computation and always casts the 
label column to `DoubleType`.

## How was this patch tested?

Added a unit test to check all solvers. This test failed before this patch.

Author: sethah 

Closes #15364 from sethah/linreg_numeric_type.
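
A quick sketch of the case this fixes: an integer-typed label column fed to the "l-bfgs" path (assumes a spark-shell session; column names are illustrative).

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import spark.implicits._

// Integer label column; previously the "l-bfgs" path pattern-matched on Double
// labels only, so this would fail before the consolidated cast to DoubleType.
val df = Seq((1, 1.0, 2.0), (3, 2.0, 1.0), (5, 3.0, 0.0)).toDF("label", "x1", "x2")

val assembled = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("features")
  .transform(df)

val model = new LinearRegression().setSolver("l-bfgs").setMaxIter(5).fit(assembled)
println(model.coefficients)
```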


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3713bb19
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3713bb19
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3713bb19

Branch: refs/heads/master
Commit: 3713bb199142c5e06e2e527c99650f02f41f47b1
Parents: 49d11d4
Author: sethah 
Authored: Thu Oct 6 21:10:17 2016 -0700
Committer: Yanbo Liang 
Committed: Thu Oct 6 21:10:17 2016 -0700

--
 .../spark/ml/regression/LinearRegression.scala | 17 ++---
 .../ml/regression/LinearRegressionSuite.scala  |  8 +---
 2 files changed, 11 insertions(+), 14 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3713bb19/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala 
b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
index 536c58f..025ed20 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
@@ -188,17 +188,18 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") 
override val uid: String
 val numFeatures = 
dataset.select(col($(featuresCol))).first().getAs[Vector](0).size
 val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else 
col($(weightCol))
 
+val instances: RDD[Instance] = dataset.select(
+  col($(labelCol)).cast(DoubleType), w, col($(featuresCol))).rdd.map {
+  case Row(label: Double, weight: Double, features: Vector) =>
+Instance(label, weight, features)
+}
+
 if (($(solver) == "auto" && $(elasticNetParam) == 0.0 &&
   numFeatures <= WeightedLeastSquares.MAX_NUM_FEATURES) || $(solver) == 
"normal") {
   require($(elasticNetParam) == 0.0, "Only L2 regularization can be used 
when normal " +
 "solver is used.'")
   // For low dimensional data, WeightedLeastSquares is more efficiently 
since the
   // training algorithm only requires one pass through the data. 
(SPARK-10668)
-  val instances: RDD[Instance] = dataset.select(
-col($(labelCol)).cast(DoubleType), w, col($(featuresCol))).rdd.map {
-  case Row(label: Double, weight: Double, features: Vector) =>
-Instance(label, weight, features)
-  }
 
   val optimizer = new WeightedLeastSquares($(fitIntercept), $(regParam),
 $(standardization), true)
@@ -221,12 +222,6 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") 
override val uid: String
   return lrModel.setSummary(trainingSummary)
 }
 
-val instances: RDD[Instance] =
-  dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map {
-case Row(label: Double, weight: Double, features: Vector) =>
-  Instance(label, weight, features)
-  }
-
 val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE
 if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
 

http://git-wip-us.apache.org/repos/asf/spark/blob/3713bb19/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala
--
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala
 
b/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala
index 5ae371b..1c94ec6 100644
--- 
a/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala
+++ 
b/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala
@@ -1015,12 +1015,14 @@ class LinearRegressionSuite
   }
 
   test("should support all NumericType labels and not support other types") {
-val lr = new LinearRegression().setMaxIter(1)
-MLTestingUtils.checkNumericTypes[LinearRegressionModel, LinearRegression](
-  lr, spark, isClassification = false) { (expected, actual) =>
+for (solver <- Seq("auto", "l-bfgs", "normal")) {
+  val lr = new LinearRegression().setMaxIter

spark git commit: [SPARK-17792][ML] L-BFGS solver for linear regression does not accept general numeric label column types

2016-10-06 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 b1a9c41e8 -> 594a2cf6f


[SPARK-17792][ML] L-BFGS solver for linear regression does not accept general 
numeric label column types

## What changes were proposed in this pull request?

Before, we computed `instances` in LinearRegression in two spots, even though 
they did the same thing. One of them did not cast the label column to 
`DoubleType`. This patch consolidates the computation and always casts the 
label column to `DoubleType`.

## How was this patch tested?

Added a unit test to check all solvers. This test failed before this patch.

Author: sethah 

Closes #15364 from sethah/linreg_numeric_type.

(cherry picked from commit 3713bb199142c5e06e2e527c99650f02f41f47b1)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/594a2cf6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/594a2cf6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/594a2cf6

Branch: refs/heads/branch-2.0
Commit: 594a2cf6f7c74c54127b8c3947aadbe0052b404c
Parents: b1a9c41
Author: sethah 
Authored: Thu Oct 6 21:10:17 2016 -0700
Committer: Yanbo Liang 
Committed: Thu Oct 6 21:14:44 2016 -0700

--
 .../spark/ml/regression/LinearRegression.scala | 17 ++---
 .../ml/regression/LinearRegressionSuite.scala  |  8 +---
 2 files changed, 11 insertions(+), 14 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/594a2cf6/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala 
b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
index f82f2c3..600bbcb 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
@@ -163,17 +163,18 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") 
override val uid: String
 val numFeatures = 
dataset.select(col($(featuresCol))).first().getAs[Vector](0).size
 val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else 
col($(weightCol))
 
+val instances: RDD[Instance] = dataset.select(
+  col($(labelCol)).cast(DoubleType), w, col($(featuresCol))).rdd.map {
+  case Row(label: Double, weight: Double, features: Vector) =>
+Instance(label, weight, features)
+}
+
 if (($(solver) == "auto" && $(elasticNetParam) == 0.0 &&
   numFeatures <= WeightedLeastSquares.MAX_NUM_FEATURES) || $(solver) == 
"normal") {
   require($(elasticNetParam) == 0.0, "Only L2 regularization can be used 
when normal " +
 "solver is used.'")
   // For low dimensional data, WeightedLeastSquares is more efficiently 
since the
   // training algorithm only requires one pass through the data. 
(SPARK-10668)
-  val instances: RDD[Instance] = dataset.select(
-col($(labelCol)).cast(DoubleType), w, col($(featuresCol))).rdd.map {
-  case Row(label: Double, weight: Double, features: Vector) =>
-Instance(label, weight, features)
-  }
 
   val optimizer = new WeightedLeastSquares($(fitIntercept), $(regParam),
 $(standardization), true)
@@ -196,12 +197,6 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") 
override val uid: String
   return lrModel.setSummary(trainingSummary)
 }
 
-val instances: RDD[Instance] =
-  dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map {
-case Row(label: Double, weight: Double, features: Vector) =>
-  Instance(label, weight, features)
-  }
-
 val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE
 if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
 

http://git-wip-us.apache.org/repos/asf/spark/blob/594a2cf6/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala
--
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala
 
b/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala
index 265f2f4..df67a3a 100644
--- 
a/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala
+++ 
b/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala
@@ -1019,12 +1019,14 @@ class LinearRegressionSuite
   }
 
   test("should support all NumericType labels and not support other types") {
-val lr = new LinearRegression().setMaxIter(1)
-MLTestingUtils.checkNumericTypes[LinearRegressionModel, LinearRegression](
-  lr, spark, isClassification = false) { (expected, actual) =

spark git commit: [SPARK-15957][ML] RFormula supports forcing to index label

2016-10-10 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master b515768f2 -> 19401a203


[SPARK-15957][ML] RFormula supports forcing to index label

## What changes were proposed in this pull request?
Currently ```RFormula``` indexes the label only when it is of string type. If the label is numeric and ```RFormula``` is used to produce features for a classification model, there are no label attributes in the label column metadata. Those attributes are useful when making predictions for classification, so for classification we can force label indexing via ```StringIndexer``` whether the label is numeric or string. SparkR wrappers can then extract the label attributes from the label column metadata successfully. This feature helps us fix bugs similar to [SPARK-15153](https://issues.apache.org/jira/browse/SPARK-15153).
For regression, the label is still kept as a numeric type.
In this PR, we add a param ```forceIndexLabel``` to control whether ```RFormula``` forces label indexing.

## How was this patch tested?
Unit tests.

Author: Yanbo Liang 

Closes #13675 from yanboliang/spark-15957.
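
A minimal usage sketch of the new param (assumes a spark-shell session; toy data with a numeric label that is really a class indicator):

```scala
import org.apache.spark.ml.feature.RFormula
import spark.implicits._

val df = Seq((1.0, 1.0, "a"), (0.0, 2.0, "b"), (1.0, 0.0, "a")).toDF("y", "x", "s")

val rf = new RFormula()
  .setFormula("y ~ x + s")
  .setForceIndexLabel(true)   // index the numeric label so the label column carries ML attributes

val output = rf.fit(df).transform(df)
output.select("features", "label").show()
```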


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/19401a20
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/19401a20
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/19401a20

Branch: refs/heads/master
Commit: 19401a203b441e3355f0d3fc3fd062b6d5bdee1f
Parents: b515768
Author: Yanbo Liang 
Authored: Mon Oct 10 22:50:59 2016 -0700
Committer: Yanbo Liang 
Committed: Mon Oct 10 22:50:59 2016 -0700

--
 .../org/apache/spark/ml/feature/RFormula.scala  | 29 ++--
 .../apache/spark/ml/feature/RFormulaSuite.scala | 27 +-
 2 files changed, 52 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/19401a20/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
index 2ee899b..3898986 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
@@ -26,7 +26,7 @@ import org.apache.spark.annotation.{Experimental, Since}
 import org.apache.spark.ml.{Estimator, Model, Pipeline, PipelineModel, 
PipelineStage, Transformer}
 import org.apache.spark.ml.attribute.AttributeGroup
 import org.apache.spark.ml.linalg.VectorUDT
-import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.{BooleanParam, Param, ParamMap}
 import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
 import org.apache.spark.ml.util._
 import org.apache.spark.sql.{DataFrame, Dataset}
@@ -104,6 +104,27 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override 
val uid: String)
   @Since("1.5.0")
   def setLabelCol(value: String): this.type = set(labelCol, value)
 
+  /**
+   * Force to index the label whether it is numeric or string type.
+   * Usually we index the label only when it is of string type.
+   * If the formula is used by classification algorithms,
+   * we can force label indexing even for a numeric label by setting this param to true.
+   * Default: false.
+   * @group param
+   */
+  @Since("2.1.0")
+  val forceIndexLabel: BooleanParam = new BooleanParam(this, "forceIndexLabel",
+"Force to index label whether it is numeric or string")
+  setDefault(forceIndexLabel -> false)
+
+  /** @group getParam */
+  @Since("2.1.0")
+  def getForceIndexLabel: Boolean = $(forceIndexLabel)
+
+  /** @group setParam */
+  @Since("2.1.0")
+  def setForceIndexLabel(value: Boolean): this.type = set(forceIndexLabel, 
value)
+
   /** Whether the formula specifies fitting an intercept. */
   private[ml] def hasIntercept: Boolean = {
 require(isDefined(formula), "Formula must be defined first.")
@@ -167,8 +188,8 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override 
val uid: String)
 encoderStages += new VectorAttributeRewriter($(featuresCol), 
prefixesToRewrite.toMap)
 encoderStages += new ColumnPruner(tempColumns.toSet)
 
-if (dataset.schema.fieldNames.contains(resolvedFormula.label) &&
-  dataset.schema(resolvedFormula.label).dataType == StringType) {
+if ((dataset.schema.fieldNames.contains(resolvedFormula.label) &&
+  dataset.schema(resolvedFormula.label).dataType == StringType) || 
$(forceIndexLabel)) {
   encoderStages += new StringIndexer()
 .setInputCol(resolvedFormula.label)
 .setOutputCol($(labelCol))
@@ -181,6 +202,8 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override 
val uid: String)
   @Since("1.5.0")
   // optimistic schema; does not contain any ML attributes
   override def transformSchema(schema: StructType): 

spark git commit: [SPARK-17745][ML][PYSPARK] update NB python api - add weight col parameter

2016-10-12 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 6f20a92ca -> 0d4a69527


[SPARK-17745][ML][PYSPARK] update NB python api - add weight col parameter

## What changes were proposed in this pull request?

update python api for NaiveBayes: add weight col parameter.

## How was this patch tested?

doctests added.

Author: WeichenXu 

Closes #15406 from WeichenXu123/nb_python_update.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0d4a6952
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0d4a6952
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0d4a6952

Branch: refs/heads/master
Commit: 0d4a695279c514c76aa0e9288c70ac7aaef91b03
Parents: 6f20a92
Author: WeichenXu 
Authored: Wed Oct 12 19:52:57 2016 -0700
Committer: Yanbo Liang 
Committed: Wed Oct 12 19:52:57 2016 -0700

--
 python/pyspark/ml/classification.py | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/0d4a6952/python/pyspark/ml/classification.py
--
diff --git a/python/pyspark/ml/classification.py 
b/python/pyspark/ml/classification.py
index ea60fab..3f763a1 100644
--- a/python/pyspark/ml/classification.py
+++ b/python/pyspark/ml/classification.py
@@ -981,7 +981,7 @@ class GBTClassificationModel(TreeEnsembleModel, 
JavaPredictionModel, JavaMLWrita
 
 @inherit_doc
 class NaiveBayes(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, 
HasProbabilityCol,
- HasRawPredictionCol, HasThresholds, JavaMLWritable, 
JavaMLReadable):
+ HasRawPredictionCol, HasThresholds, HasWeightCol, 
JavaMLWritable, JavaMLReadable):
 """
 Naive Bayes Classifiers.
 It supports both Multinomial and Bernoulli NB. `Multinomial NB
@@ -995,23 +995,23 @@ class NaiveBayes(JavaEstimator, HasFeaturesCol, 
HasLabelCol, HasPredictionCol, H
 >>> from pyspark.sql import Row
 >>> from pyspark.ml.linalg import Vectors
 >>> df = spark.createDataFrame([
-... Row(label=0.0, features=Vectors.dense([0.0, 0.0])),
-... Row(label=0.0, features=Vectors.dense([0.0, 1.0])),
-... Row(label=1.0, features=Vectors.dense([1.0, 0.0]))])
->>> nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
+... Row(label=0.0, weight=0.1, features=Vectors.dense([0.0, 0.0])),
+... Row(label=0.0, weight=0.5, features=Vectors.dense([0.0, 1.0])),
+... Row(label=1.0, weight=1.0, features=Vectors.dense([1.0, 0.0]))])
+>>> nb = NaiveBayes(smoothing=1.0, modelType="multinomial", 
weightCol="weight")
 >>> model = nb.fit(df)
 >>> model.pi
-DenseVector([-0.51..., -0.91...])
+DenseVector([-0.81..., -0.58...])
 >>> model.theta
-DenseMatrix(2, 2, [-1.09..., -0.40..., -0.40..., -1.09...], 1)
+DenseMatrix(2, 2, [-0.91..., -0.51..., -0.40..., -1.09...], 1)
 >>> test0 = sc.parallelize([Row(features=Vectors.dense([1.0, 
0.0]))]).toDF()
 >>> result = model.transform(test0).head()
 >>> result.prediction
 1.0
 >>> result.probability
-DenseVector([0.42..., 0.57...])
+DenseVector([0.32..., 0.67...])
 >>> result.rawPrediction
-DenseVector([-1.60..., -1.32...])
+DenseVector([-1.72..., -0.99...])
 >>> test1 = sc.parallelize([Row(features=Vectors.sparse(2, [0], 
[1.0]))]).toDF()
 >>> model.transform(test1).head().prediction
 1.0
@@ -1045,11 +1045,11 @@ class NaiveBayes(JavaEstimator, HasFeaturesCol, 
HasLabelCol, HasPredictionCol, H
 @keyword_only
 def __init__(self, featuresCol="features", labelCol="label", 
predictionCol="prediction",
  probabilityCol="probability", 
rawPredictionCol="rawPrediction", smoothing=1.0,
- modelType="multinomial", thresholds=None):
+ modelType="multinomial", thresholds=None, weightCol=None):
 """
 __init__(self, featuresCol="features", labelCol="label", 
predictionCol="prediction", \
  probabilityCol="probability", 
rawPredictionCol="rawPrediction", smoothing=1.0, \
- modelType="multinomial", thresholds=None)
+ modelType="multinomial", thresholds=None, weightCol=None)
 """
 super(NaiveBayes, self).__init__()
 self._java_obj = self._new_java_obj(
@@ -1062,11 +1062,11 @@ class NaiveBayes(JavaEstimator, HasFeaturesCol, 
HasLabelCol, HasPredictionCol, H
 @since("1.5.0")
 def setParams(self, featuresCol="features", labelCol="label", 
predictionCol="prediction",
   probabilityCol="probability", 
rawPredictionCol="rawPrediction", smoothing=1.0,
-  modelType="multinomial", thresholds=None):
+  modelType="multinomial", thresholds=None, weightCol=None):
   

spark git commit: [SPARK-17835][ML][MLLIB] Optimize NaiveBayes mllib wrapper to eliminate extra pass on data

2016-10-12 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 0d4a69527 -> 21cb59f1c


[SPARK-17835][ML][MLLIB] Optimize NaiveBayes mllib wrapper to eliminate extra 
pass on data

## What changes were proposed in this pull request?
[SPARK-14077](https://issues.apache.org/jira/browse/SPARK-14077) copied the ```NaiveBayes``` implementation from mllib to ml and left mllib as a wrapper. However, there are some differences between mllib and ml in how labels are handled:
* mllib allows input labels such as {-1, +1}, whereas ml assumes input labels in the range [0, numClasses).
* mllib ```NaiveBayesModel``` exposes ```labels```, but ml does not because of the assumption mentioned above.

During the copy in [SPARK-14077](https://issues.apache.org/jira/browse/SPARK-14077), we used
```val labels = data.map(_.label).distinct().collect().sorted```
to get the distinct labels first and then encode them for training. This adds an extra Spark job compared with the original implementation. Since ```NaiveBayes``` only does one aggregation pass during training, adding another pass is less efficient. Instead, we can collect the labels in a single pass during ```NaiveBayes``` training and send them to the MLlib side.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang 

Closes #15402 from yanboliang/spark-17835.
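
A hedged sketch of the label difference described above: ml's `NaiveBayes` expects labels in [0, numClasses), so mllib-style labels such as {-1, +1} have to be re-encoded before calling the ml estimator directly (this is not the wrapper code itself; assumes a spark-shell session).

```scala
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, when}
import spark.implicits._

// mllib tolerates labels like {-1, +1} ...
val raw = Seq(
  (-1.0, Vectors.dense(0.0, 1.0)),
  (1.0, Vectors.dense(1.0, 0.0))
).toDF("rawLabel", "features")

// ... but ml assumes labels in [0, numClasses), so re-encode first.
val indexed = raw.withColumn("label", when(col("rawLabel") === -1.0, 0.0).otherwise(1.0))

val model = new NaiveBayes().fit(indexed)
println(model.pi)
```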


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/21cb59f1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/21cb59f1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/21cb59f1

Branch: refs/heads/master
Commit: 21cb59f1cd137d96b2596f1abe691b544581cf59
Parents: 0d4a695
Author: Yanbo Liang 
Authored: Wed Oct 12 19:56:40 2016 -0700
Committer: Yanbo Liang 
Committed: Wed Oct 12 19:56:40 2016 -0700

--
 .../spark/ml/classification/NaiveBayes.scala| 46 
 .../spark/mllib/classification/NaiveBayes.scala | 15 +++
 2 files changed, 43 insertions(+), 18 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/21cb59f1/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala 
b/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
index e565a6f..994ed99 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
@@ -110,16 +110,28 @@ class NaiveBayes @Since("1.5.0") (
   @Since("2.1.0")
   def setWeightCol(value: String): this.type = set(weightCol, value)
 
-  override protected def train(dataset: Dataset[_]): NaiveBayesModel = {
-val numClasses = getNumClasses(dataset)
+  /**
+   * ml assumes input labels in range [0, numClasses). But this implementation
+   * is also called by mllib NaiveBayes which allows other kinds of input 
labels
+   * such as {-1, +1}. Here we use this parameter to switch between different 
processing logic.
+   * It should be removed when we remove mllib NaiveBayes.
+   */
+  private[spark] var isML: Boolean = true
 
-if (isDefined(thresholds)) {
-  require($(thresholds).length == numClasses, this.getClass.getSimpleName +
-".train() called with non-matching numClasses and thresholds.length." +
-s" numClasses=$numClasses, but thresholds has length 
${$(thresholds).length}")
-}
+  private[spark] def setIsML(isML: Boolean): this.type = {
+this.isML = isML
+this
+  }
 
-val numFeatures = 
dataset.select(col($(featuresCol))).head().getAs[Vector](0).size
+  override protected def train(dataset: Dataset[_]): NaiveBayesModel = {
+if (isML) {
+  val numClasses = getNumClasses(dataset)
+  if (isDefined(thresholds)) {
+require($(thresholds).length == numClasses, 
this.getClass.getSimpleName +
+  ".train() called with non-matching numClasses and 
thresholds.length." +
+  s" numClasses=$numClasses, but thresholds has length 
${$(thresholds).length}")
+  }
+}
 
 val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
   val values = v match {
@@ -153,6 +165,7 @@ class NaiveBayes @Since("1.5.0") (
   }
 }
 
+val numFeatures = 
dataset.select(col($(featuresCol))).head().getAs[Vector](0).size
 val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else 
col($(weightCol))
 
 // Aggregates term frequencies per label.
@@ -176,6 +189,7 @@ class NaiveBayes @Since("1.5.0") (
 val numLabels = aggregated.length
 val numDocuments = aggregated.map(_._2._1).sum
 
+val labelArray = new Array[Double](numLabels)
 val piArray = new Array[Double](numLabels)
 val thetaArray = new Array[Double](numLabels * numFeatures)
 
@@ -183,6 +197,7 @@ class Nai

spark git commit: [SPARK-15957][FOLLOW-UP][ML][PYSPARK] Add Python API for RFormula forceIndexLabel.

2016-10-13 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 9dc0ca060 -> 44cbb61b3


[SPARK-15957][FOLLOW-UP][ML][PYSPARK] Add Python API for RFormula 
forceIndexLabel.

## What changes were proposed in this pull request?
Follow-up work of #13675, add Python API for ```RFormula forceIndexLabel```.

## How was this patch tested?
Unit test.

Author: Yanbo Liang 

Closes #15430 from yanboliang/spark-15957-python.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/44cbb61b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/44cbb61b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/44cbb61b

Branch: refs/heads/master
Commit: 44cbb61b34a98e3e0d8e2543a4eb6e950e0019a5
Parents: 9dc0ca0
Author: Yanbo Liang 
Authored: Thu Oct 13 19:44:24 2016 -0700
Committer: Yanbo Liang 
Committed: Thu Oct 13 19:44:24 2016 -0700

--
 python/pyspark/ml/feature.py | 31 +++
 python/pyspark/ml/tests.py   | 16 
 2 files changed, 43 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/44cbb61b/python/pyspark/ml/feature.py
--
diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index 64b21ca..a33c3e7 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -2494,21 +2494,30 @@ class RFormula(JavaEstimator, HasFeaturesCol, 
HasLabelCol, JavaMLReadable, JavaM
 formula = Param(Params._dummy(), "formula", "R model formula",
 typeConverter=TypeConverters.toString)
 
+forceIndexLabel = Param(Params._dummy(), "forceIndexLabel",
+"Force to index label whether it is numeric or 
string",
+typeConverter=TypeConverters.toBoolean)
+
 @keyword_only
-def __init__(self, formula=None, featuresCol="features", labelCol="label"):
+def __init__(self, formula=None, featuresCol="features", labelCol="label",
+ forceIndexLabel=False):
 """
-__init__(self, formula=None, featuresCol="features", labelCol="label")
+__init__(self, formula=None, featuresCol="features", labelCol="label", 
\
+ forceIndexLabel=False)
 """
 super(RFormula, self).__init__()
 self._java_obj = 
self._new_java_obj("org.apache.spark.ml.feature.RFormula", self.uid)
+self._setDefault(forceIndexLabel=False)
 kwargs = self.__init__._input_kwargs
 self.setParams(**kwargs)
 
 @keyword_only
 @since("1.5.0")
-def setParams(self, formula=None, featuresCol="features", 
labelCol="label"):
+def setParams(self, formula=None, featuresCol="features", labelCol="label",
+  forceIndexLabel=False):
 """
-setParams(self, formula=None, featuresCol="features", labelCol="label")
+setParams(self, formula=None, featuresCol="features", 
labelCol="label", \
+  forceIndexLabel=False)
 Sets params for RFormula.
 """
 kwargs = self.setParams._input_kwargs
@@ -2528,6 +2537,20 @@ class RFormula(JavaEstimator, HasFeaturesCol, 
HasLabelCol, JavaMLReadable, JavaM
 """
 return self.getOrDefault(self.formula)
 
+@since("2.1.0")
+def setForceIndexLabel(self, value):
+"""
+Sets the value of :py:attr:`forceIndexLabel`.
+"""
+return self._set(forceIndexLabel=value)
+
+@since("2.1.0")
+def getForceIndexLabel(self):
+"""
+Gets the value of :py:attr:`forceIndexLabel`.
+"""
+return self.getOrDefault(self.forceIndexLabel)
+
 def _create_model(self, java_model):
 return RFormulaModel(java_model)
 

http://git-wip-us.apache.org/repos/asf/spark/blob/44cbb61b/python/pyspark/ml/tests.py
--
diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py
index e233549..9d46cc3 100755
--- a/python/pyspark/ml/tests.py
+++ b/python/pyspark/ml/tests.py
@@ -477,6 +477,22 @@ class FeatureTests(SparkSessionTestCase):
 feature, expected = r
 self.assertEqual(feature, expected)
 
+def test_rformula_force_index_label(self):
+df = self.spark.createDataFrame([
+(1.0, 1.0, "a"),
+(0.0, 2.0, "b"),
+(1.0, 0.0, "a")], ["y", "x", "s"])
+# Does not index label by default since it's numeric type.
+rf = RFormula(formula="y ~ x + s")
+model = rf.fit(df)
+transformedDF = model.transform(df)
+self.assertEqual(transformedDF.head().label, 1.0)
+# Force to index label.
+rf2 = RFormula(formula="y ~ x + s").setForceIndexLabel(True)
+model2 = rf2.fit(df)
+transformedDF2 = mo

spark git commit: [SPARK-15402][ML][PYSPARK] PySpark ml.evaluation should support save/load

2016-10-14 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 2fb12b0a3 -> 1db8feab8


[SPARK-15402][ML][PYSPARK] PySpark ml.evaluation should support save/load

## What changes were proposed in this pull request?
Since ```ml.evaluation``` already supports save/load on the Scala side, supporting it on the Python side is straightforward.

## How was this patch tested?
Add python doctest.

Author: Yanbo Liang 

Closes #13194 from yanboliang/spark-15402.
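
The Scala-side persistence the Python API now mirrors looks like this (a minimal sketch; the path is a placeholder):

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator

val evaluator = new RegressionEvaluator().setMetricName("mae")
evaluator.write.overwrite().save("/tmp/regression-evaluator")   // placeholder path

val loaded = RegressionEvaluator.load("/tmp/regression-evaluator")
println(loaded.getMetricName)   // "mae"
```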


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1db8feab
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1db8feab
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1db8feab

Branch: refs/heads/master
Commit: 1db8feab8c564053c05e8bdc1a7f5026fd637d4f
Parents: 2fb12b0
Author: Yanbo Liang 
Authored: Fri Oct 14 04:17:03 2016 -0700
Committer: Yanbo Liang 
Committed: Fri Oct 14 04:17:03 2016 -0700

--
 python/pyspark/ml/evaluation.py | 45 
 1 file changed, 36 insertions(+), 9 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1db8feab/python/pyspark/ml/evaluation.py
--
diff --git a/python/pyspark/ml/evaluation.py b/python/pyspark/ml/evaluation.py
index 1fe8772..7aa16fa 100644
--- a/python/pyspark/ml/evaluation.py
+++ b/python/pyspark/ml/evaluation.py
@@ -22,6 +22,7 @@ from pyspark.ml.wrapper import JavaParams
 from pyspark.ml.param import Param, Params, TypeConverters
 from pyspark.ml.param.shared import HasLabelCol, HasPredictionCol, 
HasRawPredictionCol
 from pyspark.ml.common import inherit_doc
+from pyspark.ml.util import JavaMLReadable, JavaMLWritable
 
 __all__ = ['Evaluator', 'BinaryClassificationEvaluator', 'RegressionEvaluator',
'MulticlassClassificationEvaluator']
@@ -103,7 +104,8 @@ class JavaEvaluator(JavaParams, Evaluator):
 
 
 @inherit_doc
-class BinaryClassificationEvaluator(JavaEvaluator, HasLabelCol, 
HasRawPredictionCol):
+class BinaryClassificationEvaluator(JavaEvaluator, HasLabelCol, 
HasRawPredictionCol,
+JavaMLReadable, JavaMLWritable):
 """
 .. note:: Experimental
 
@@ -121,6 +123,11 @@ class BinaryClassificationEvaluator(JavaEvaluator, 
HasLabelCol, HasRawPrediction
 0.70...
 >>> evaluator.evaluate(dataset, {evaluator.metricName: "areaUnderPR"})
 0.83...
+>>> bce_path = temp_path + "/bce"
+>>> evaluator.save(bce_path)
+>>> evaluator2 = BinaryClassificationEvaluator.load(bce_path)
+>>> str(evaluator2.getRawPredictionCol())
+'raw'
 
 .. versionadded:: 1.4.0
 """
@@ -172,7 +179,8 @@ class BinaryClassificationEvaluator(JavaEvaluator, 
HasLabelCol, HasRawPrediction
 
 
 @inherit_doc
-class RegressionEvaluator(JavaEvaluator, HasLabelCol, HasPredictionCol):
+class RegressionEvaluator(JavaEvaluator, HasLabelCol, HasPredictionCol,
+  JavaMLReadable, JavaMLWritable):
 """
 .. note:: Experimental
 
@@ -190,6 +198,11 @@ class RegressionEvaluator(JavaEvaluator, HasLabelCol, 
HasPredictionCol):
 0.993...
 >>> evaluator.evaluate(dataset, {evaluator.metricName: "mae"})
 2.649...
+>>> re_path = temp_path + "/re"
+>>> evaluator.save(re_path)
+>>> evaluator2 = RegressionEvaluator.load(re_path)
+>>> str(evaluator2.getPredictionCol())
+'raw'
 
 .. versionadded:: 1.4.0
 """
@@ -244,7 +257,8 @@ class RegressionEvaluator(JavaEvaluator, HasLabelCol, 
HasPredictionCol):
 
 
 @inherit_doc
-class MulticlassClassificationEvaluator(JavaEvaluator, HasLabelCol, 
HasPredictionCol):
+class MulticlassClassificationEvaluator(JavaEvaluator, HasLabelCol, 
HasPredictionCol,
+JavaMLReadable, JavaMLWritable):
 """
 .. note:: Experimental
 
@@ -260,6 +274,11 @@ class MulticlassClassificationEvaluator(JavaEvaluator, 
HasLabelCol, HasPredictio
 0.66...
 >>> evaluator.evaluate(dataset, {evaluator.metricName: "accuracy"})
 0.66...
+>>> mce_path = temp_path + "/mce"
+>>> evaluator.save(mce_path)
+>>> evaluator2 = MulticlassClassificationEvaluator.load(mce_path)
+>>> str(evaluator2.getPredictionCol())
+'prediction'
 
 .. versionadded:: 1.5.0
 """
@@ -311,19 +330,27 @@ class MulticlassClassificationEvaluator(JavaEvaluator, 
HasLabelCol, HasPredictio
 
 if __name__ == "__main__":
 import doctest
+import tempfile
+import pyspark.ml.evaluation
 from pyspark.sql import SparkSession
-globs = globals().copy()
+globs = pyspark.ml.evaluation.__dict__.copy()
 # The small batch size here ensures that we see multiple batches,
 # even in these small test examples:
 spark = SparkSession.builder\
 .master("local[2]")\
 .appName("ml.evaluation tests")\
 .getO

spark git commit: [SPARK-14634][ML] Add BisectingKMeansSummary

2016-10-14 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 1db8feab8 -> a1b136d05


[SPARK-14634][ML] Add BisectingKMeansSummary

## What changes were proposed in this pull request?
Add BisectingKMeansSummary
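
A minimal sketch of how the new summary is reached from a fitted model; `dataset` is assumed to be a DataFrame with a "features" vector column:

```scala
import org.apache.spark.ml.clustering.BisectingKMeans

val model = new BisectingKMeans().setK(3).setSeed(1L).fit(dataset)

if (model.hasSummary) {
  val summary = model.summary            // BisectingKMeansSummary
  println(s"k = ${summary.k}")
  summary.predictions.show(5)            // input rows plus the prediction column
}
```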

## How was this patch tested?
unit test

Author: Zheng RuiFeng 

Closes #12394 from zhengruifeng/biKMSummary.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a1b136d0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a1b136d0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a1b136d0

Branch: refs/heads/master
Commit: a1b136d05c6c458ae8211b0844bfc98d7693fa42
Parents: 1db8fea
Author: Zheng RuiFeng 
Authored: Fri Oct 14 04:25:14 2016 -0700
Committer: Yanbo Liang 
Committed: Fri Oct 14 04:25:14 2016 -0700

--
 .../spark/ml/clustering/BisectingKMeans.scala   | 74 +++-
 .../ml/clustering/BisectingKMeansSuite.scala| 18 -
 .../ml/clustering/GaussianMixtureSuite.scala|  2 +-
 .../spark/ml/clustering/KMeansSuite.scala   |  2 +-
 4 files changed, 91 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a1b136d0/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala 
b/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
index a97bd0f..add8ee2 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
@@ -19,6 +19,7 @@ package org.apache.spark.ml.clustering
 
 import org.apache.hadoop.fs.Path
 
+import org.apache.spark.SparkException
 import org.apache.spark.annotation.{Experimental, Since}
 import org.apache.spark.ml.{Estimator, Model}
 import org.apache.spark.ml.linalg.{Vector, VectorUDT}
@@ -127,6 +128,29 @@ class BisectingKMeansModel private[ml] (
 
   @Since("2.0.0")
   override def write: MLWriter = new 
BisectingKMeansModel.BisectingKMeansModelWriter(this)
+
+  private var trainingSummary: Option[BisectingKMeansSummary] = None
+
+  private[clustering] def setSummary(summary: BisectingKMeansSummary): 
this.type = {
+this.trainingSummary = Some(summary)
+this
+  }
+
+  /**
+   * Return true if there exists summary of model.
+   */
+  @Since("2.1.0")
+  def hasSummary: Boolean = trainingSummary.nonEmpty
+
+  /**
+   * Gets summary of model on training set. An exception is
+   * thrown if `trainingSummary == None`.
+   */
+  @Since("2.1.0")
+  def summary: BisectingKMeansSummary = trainingSummary.getOrElse {
+throw new SparkException(
+  s"No training summary available for the ${this.getClass.getSimpleName}")
+  }
 }
 
 object BisectingKMeansModel extends MLReadable[BisectingKMeansModel] {
@@ -228,14 +252,22 @@ class BisectingKMeans @Since("2.0.0") (
   case Row(point: Vector) => OldVectors.fromML(point)
 }
 
+val instr = Instrumentation.create(this, rdd)
+instr.logParams(featuresCol, predictionCol, k, maxIter, seed, 
minDivisibleClusterSize)
+
 val bkm = new MLlibBisectingKMeans()
   .setK($(k))
   .setMaxIterations($(maxIter))
   .setMinDivisibleClusterSize($(minDivisibleClusterSize))
   .setSeed($(seed))
 val parentModel = bkm.run(rdd)
-val model = new BisectingKMeansModel(uid, parentModel)
-copyValues(model.setParent(this))
+val model = copyValues(new BisectingKMeansModel(uid, 
parentModel).setParent(this))
+val summary = new BisectingKMeansSummary(
+  model.transform(dataset), $(predictionCol), $(featuresCol), $(k))
+model.setSummary(summary)
+val m = model.setSummary(summary)
+instr.logSuccess(m)
+m
   }
 
   @Since("2.0.0")
@@ -251,3 +283,41 @@ object BisectingKMeans extends 
DefaultParamsReadable[BisectingKMeans] {
   @Since("2.0.0")
   override def load(path: String): BisectingKMeans = super.load(path)
 }
+
+
+/**
+ * :: Experimental ::
+ * Summary of BisectingKMeans.
+ *
+ * @param predictions  [[DataFrame]] produced by 
[[BisectingKMeansModel.transform()]]
+ * @param predictionCol  Name for column of predicted clusters in `predictions`
+ * @param featuresCol  Name for column of features in `predictions`
+ * @param k  Number of clusters
+ */
+@Since("2.1.0")
+@Experimental
+class BisectingKMeansSummary private[clustering] (
+@Since("2.1.0") @transient val predictions: DataFrame,
+@Since("2.1.0") val predictionCol: String,
+@Since("2.1.0") val featuresCol: String,
+@Since("2.1.0") val k: Int) extends Serializable {
+
+  /**
+   * Cluster centers of the transformed data.
+   */
+  @Since("2.1.0")
+  @transient lazy val cluster: DataFrame = predictions.select(predictionCol)
+
+  /**
+   * Size of (number of data points

spark git commit: [SPARK-17986][ML] SQLTransformer should remove temporary tables

2016-10-22 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 01b26a064 -> ab3363e9f


[SPARK-17986][ML] SQLTransformer should remove temporary tables

## What changes were proposed in this pull request?

A call to the method `SQLTransformer.transform` previously would create a 
temporary table and never delete it. This change adds a call to 
`dropTempView()` that deletes this temporary table before returning the result 
so that the table will not remain in spark's table catalog. Because `tableName` 
is randomized and not exposed, there should be no expected use of this table 
outside of the `transform` method.
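
A minimal sketch of the user-visible effect, assuming an active `spark` session (data and statement are illustrative):

```scala
import org.apache.spark.ml.feature.SQLTransformer

val df = spark.createDataFrame(Seq((0, 1.0, 3.0), (2, 2.0, 5.0)))
  .toDF("id", "v1", "v2")

val sqlTrans = new SQLTransformer()
  .setStatement("SELECT *, (v1 + v2) AS v3 FROM __THIS__")
sqlTrans.transform(df).show()

// With this change, the randomly named temp view created inside transform()
// is dropped again, so nothing is left behind in the catalog.
assert(spark.catalog.listTables().count() == 0)
```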

## How was this patch tested?

A single new assertion was added to the existing test of the 
`SQLTransformer.transform` method that all temporary tables are removed. 
Without the corresponding code change, this new assertion fails. I am not aware 
of any circumstances in which removing this temporary view would be bad for 
performance or correctness in other ways, but some expertise here would be 
helpful.

Author: Drew Robb 

Closes #15526 from drewrobb/SPARK-17986.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ab3363e9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ab3363e9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ab3363e9

Branch: refs/heads/master
Commit: ab3363e9f6b1f7fc26682509fe7382c570f91778
Parents: 01b26a0
Author: Drew Robb 
Authored: Sat Oct 22 01:59:36 2016 -0700
Committer: Yanbo Liang 
Committed: Sat Oct 22 01:59:36 2016 -0700

--
 .../main/scala/org/apache/spark/ml/feature/SQLTransformer.scala  | 4 +++-
 .../scala/org/apache/spark/ml/feature/SQLTransformerSuite.scala  | 1 +
 2 files changed, 4 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ab3363e9/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala
index 259be26..b25fff9 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala
@@ -67,7 +67,9 @@ class SQLTransformer @Since("1.6.0") (@Since("1.6.0") 
override val uid: String)
 val tableName = Identifiable.randomUID(uid)
 dataset.createOrReplaceTempView(tableName)
 val realStatement = $(statement).replace(tableIdentifier, tableName)
-dataset.sparkSession.sql(realStatement)
+val result = dataset.sparkSession.sql(realStatement)
+dataset.sparkSession.catalog.dropTempView(tableName)
+result
   }
 
   @Since("1.6.0")

http://git-wip-us.apache.org/repos/asf/spark/blob/ab3363e9/mllib/src/test/scala/org/apache/spark/ml/feature/SQLTransformerSuite.scala
--
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/feature/SQLTransformerSuite.scala 
b/mllib/src/test/scala/org/apache/spark/ml/feature/SQLTransformerSuite.scala
index 2346407..753f890 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/SQLTransformerSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/SQLTransformerSuite.scala
@@ -43,6 +43,7 @@ class SQLTransformerSuite
 assert(result.schema.toString == resultSchema.toString)
 assert(resultSchema == expected.schema)
 assert(result.collect().toSeq == expected.collect().toSeq)
+assert(original.sparkSession.catalog.listTables().count() == 0)
   }
 
   test("read/write") {





spark git commit: [SPARK-17986][ML] SQLTransformer should remove temporary tables

2016-10-22 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 a0c03c925 -> b959dab32


[SPARK-17986][ML] SQLTransformer should remove temporary tables

## What changes were proposed in this pull request?

A call to the method `SQLTransformer.transform` previously would create a 
temporary table and never delete it. This change adds a call to 
`dropTempView()` that deletes this temporary table before returning the result 
so that the table will not remain in spark's table catalog. Because `tableName` 
is randomized and not exposed, there should be no expected use of this table 
outside of the `transform` method.

## How was this patch tested?

A single new assertion was added to the existing test of the 
`SQLTransformer.transform` method that all temporary tables are removed. 
Without the corresponding code change, this new assertion fails. I am not aware 
of any circumstances in which removing this temporary view would be bad for 
performance or correctness in other ways, but some expertise here would be 
helpful.

Author: Drew Robb 

Closes #15526 from drewrobb/SPARK-17986.

(cherry picked from commit ab3363e9f6b1f7fc26682509fe7382c570f91778)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b959dab3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b959dab3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b959dab3

Branch: refs/heads/branch-2.0
Commit: b959dab32a455e0f9a9ea0fd2111e28a5faf796c
Parents: a0c03c9
Author: Drew Robb 
Authored: Sat Oct 22 01:59:36 2016 -0700
Committer: Yanbo Liang 
Committed: Sat Oct 22 02:00:05 2016 -0700

--
 .../main/scala/org/apache/spark/ml/feature/SQLTransformer.scala  | 4 +++-
 .../scala/org/apache/spark/ml/feature/SQLTransformerSuite.scala  | 1 +
 2 files changed, 4 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b959dab3/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala
index 259be26..b25fff9 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala
@@ -67,7 +67,9 @@ class SQLTransformer @Since("1.6.0") (@Since("1.6.0") 
override val uid: String)
 val tableName = Identifiable.randomUID(uid)
 dataset.createOrReplaceTempView(tableName)
 val realStatement = $(statement).replace(tableIdentifier, tableName)
-dataset.sparkSession.sql(realStatement)
+val result = dataset.sparkSession.sql(realStatement)
+dataset.sparkSession.catalog.dropTempView(tableName)
+result
   }
 
   @Since("1.6.0")

http://git-wip-us.apache.org/repos/asf/spark/blob/b959dab3/mllib/src/test/scala/org/apache/spark/ml/feature/SQLTransformerSuite.scala
--
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/feature/SQLTransformerSuite.scala 
b/mllib/src/test/scala/org/apache/spark/ml/feature/SQLTransformerSuite.scala
index 1401ea9..9d3c007 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/SQLTransformerSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/SQLTransformerSuite.scala
@@ -43,6 +43,7 @@ class SQLTransformerSuite
 assert(result.schema.toString == resultSchema.toString)
 assert(resultSchema == expected.schema)
 assert(result.collect().toSeq == expected.collect().toSeq)
+assert(original.sparkSession.catalog.listTables().count() == 0)
   }
 
   test("read/write") {





spark git commit: [SPARK-17748][ML] One pass solver for Weighted Least Squares with ElasticNet

2016-10-24 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 483c37c58 -> 78d740a08


[SPARK-17748][ML] One pass solver for Weighted Least Squares with ElasticNet

## What changes were proposed in this pull request?

1. Make a pluggable solver interface for `WeightedLeastSquares`
2. Add a `QuasiNewton` solver to handle elastic net regularization for 
`WeightedLeastSquares`
3. Add method `BLAS.dspmv` used by QN solver
4. Add mechanism for WLS to handle singular covariance matrices by falling back 
to QN solver when Cholesky fails.
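
In user-facing terms, the "normal" solver in `LinearRegression` can now be combined with an L1/elastic-net penalty; a minimal sketch, where the parameter values are illustrative and `training` is assumed to be a DataFrame with "label" and "features" columns:

```scala
import org.apache.spark.ml.regression.LinearRegression

// Before this change, an L1/elastic-net penalty required the "l-bfgs" solver;
// now the WeightedLeastSquares path handles it as well.
val lr = new LinearRegression()
  .setSolver("normal")
  .setRegParam(0.3)
  .setElasticNetParam(0.5)

val model = lr.fit(training)
println(s"coefficients: ${model.coefficients}, intercept: ${model.intercept}")
```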

## How was this patch tested?
Unit tests - see below.

## Design choices

**Pluggable Normal Solver**

Before, the `WeightedLeastSquares` package always used the Cholesky 
decomposition solver to compute the solution to the normal equations. Now, we 
specify the solver as a constructor argument to the `WeightedLeastSquares`. We 
introduce a new trait:

```scala
private[ml] sealed trait NormalEquationSolver {

  def solve(
      bBar: Double,
      bbBar: Double,
      abBar: DenseVector,
      aaBar: DenseVector,
      aBar: DenseVector): NormalEquationSolution
}
```


We extend this trait for different variants of normal equation solvers. In the 
future, we can easily add others (like QR) using this interface.

**Always train in the standardized space**

The normal solver did not previously standardize the data, but this patch 
introduces a change such that we always solve the normal equations in the 
standardized space. We convert back to the original space in the same way that 
is done for distributed L-BFGS/OWL-QN. We add test cases for zero variance 
features/labels.

**Use L-BFGS locally to solve normal equations for singular matrix**

When linear regression with the normal solver is called for a singular matrix, 
we initially try to solve with Cholesky. We use the output of `lapack.dppsv` to 
determine if the matrix is singular. If it is, we fall back to using L-BFGS 
locally to solve the normal equations. We add test cases for this as well.
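
For example, a constant feature column makes the normal-equation matrix singular; with the fallback in place, the "normal" solver still trains (a minimal sketch, assuming an active `spark` session and illustrative data):

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression

// The second feature is constant, so Cholesky alone would fail on the
// singular normal equations; the solver now falls back to local L-BFGS.
val singular = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(1.0, 7.0)),
  (2.0, Vectors.dense(2.0, 7.0)),
  (3.0, Vectors.dense(3.0, 7.0))
)).toDF("label", "features")

val model = new LinearRegression().setSolver("normal").fit(singular)
```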

## Test cases
I found it helpful to enumerate some of the test cases and hopefully it makes 
review easier.

**WeightedLeastSquares**

1. Constant columns - Cholesky solver fails with no regularization, Auto solver 
falls back to QN, and QN trains successfully.
2. Collinear features - Cholesky solver fails with no regularization, Auto 
solver falls back to QN, and QN trains successfully.
3. Label is constant zero - no training is performed regardless of intercept. 
Coefficients are zero and intercept is zero.
4. Label is constant - if fitIntercept, then no training is performed and 
intercept equals label mean. If not fitIntercept, then we train and return an 
answer that matches R's lm package.
5. Test with L1 - go through various combinations of L1/L2, standardization, 
fitIntercept and verify that output matches glmnet.
6. Initial intercept - verify that setting the initial intercept to label mean 
is correct by training model with strong L1 regularization so that all 
coefficients are zero and intercept converges to label mean.
7. Test diagInvAtWA - since we are standardizing features now during training, 
we should test that the inverse is computed to match R.

**LinearRegression**
1. For all existing L1 test cases, test the "normal" solver too.
2. Check that using the normal solver now handles singular matrices.
3. Check that using the normal solver with L1 produces an objective history in 
the model summary, but does not produce the inverse of AtA.

**BLAS**
1. Test new method `dspmv`.

## Performance Testing
This patch will speed up linear regression with L1/elasticnet penalties when 
the feature size is < 4096. I have not conducted performance tests at scale, 
only observed by testing locally that there is a speed improvement.

We should decide if this PR needs to be blocked before performance testing is 
conducted.

Author: sethah 

Closes #15394 from sethah/SPARK-17748.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/78d740a0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/78d740a0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/78d740a0

Branch: refs/heads/master
Commit: 78d740a08a04b74b49b5cba4bb6a821631390ab4
Parents: 483c37c
Author: sethah 
Authored: Mon Oct 24 23:47:59 2016 -0700
Committer: Yanbo Liang 
Committed: Mon Oct 24 23:47:59 2016 -0700

--
 .../scala/org/apache/spark/ml/linalg/BLAS.scala |  18 +
 .../org/apache/spark/ml/linalg/BLASSuite.scala  |  45 ++
 .../IterativelyReweightedLeastSquares.scala |   4 +-
 .../spark/ml/optim/NormalEquationSolver.scala   | 163 +++
 .../spark/ml/optim/WeightedLeastSquares.scala   | 270 +---
 .../GeneralizedLinearRegression.scala   |   4 +-
 .../spark/ml/regression/LinearRegression.scala  |  20 +-
 .../mllib/linalg/CholeskyDecomposition.scala|   4 +-
 ...IterativelyReweightedLeastSquaresSuite.scala |  

spark git commit: [SPARK-14634][ML][FOLLOWUP] Delete superfluous line in BisectingKMeans

2016-10-25 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 6f31833db -> 38cdd6ccd


[SPARK-14634][ML][FOLLOWUP] Delete superfluous line in BisectingKMeans

## What changes were proposed in this pull request?
As commented by jkbradley in https://github.com/apache/spark/pull/12394, 
`model.setSummary(summary)` is superfluous

## How was this patch tested?
existing tests

Author: Zheng RuiFeng 

Closes #15619 from zhengruifeng/del_superfluous.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/38cdd6cc
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/38cdd6cc
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/38cdd6cc

Branch: refs/heads/master
Commit: 38cdd6ccdaba7f8da985c4f4efe5bd93a46a2b53
Parents: 6f31833
Author: Zheng RuiFeng 
Authored: Tue Oct 25 03:19:50 2016 -0700
Committer: Yanbo Liang 
Committed: Tue Oct 25 03:19:50 2016 -0700

--
 .../scala/org/apache/spark/ml/clustering/BisectingKMeans.scala | 5 ++---
 .../src/main/scala/org/apache/spark/ml/clustering/KMeans.scala | 6 +++---
 2 files changed, 5 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/38cdd6cc/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala 
b/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
index add8ee2..ef2d918 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
@@ -265,9 +265,8 @@ class BisectingKMeans @Since("2.0.0") (
 val summary = new BisectingKMeansSummary(
   model.transform(dataset), $(predictionCol), $(featuresCol), $(k))
 model.setSummary(summary)
-val m = model.setSummary(summary)
-instr.logSuccess(m)
-m
+instr.logSuccess(model)
+model
   }
 
   @Since("2.0.0")

http://git-wip-us.apache.org/repos/asf/spark/blob/38cdd6cc/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
b/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
index b04e828..0d2405b 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
@@ -324,9 +324,9 @@ class KMeans @Since("1.5.0") (
 val model = copyValues(new KMeansModel(uid, parentModel).setParent(this))
 val summary = new KMeansSummary(
   model.transform(dataset), $(predictionCol), $(featuresCol), $(k))
-val m = model.setSummary(summary)
-instr.logSuccess(m)
-m
+model.setSummary(summary)
+instr.logSuccess(model)
+model
   }
 
   @Since("1.5.0")





spark git commit: [SPARK-17748][FOLLOW-UP][ML] Fix build error for Scala 2.10.

2016-10-25 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 38cdd6ccd -> ac8ff920f


[SPARK-17748][FOLLOW-UP][ML] Fix build error for Scala 2.10.

## What changes were proposed in this pull request?
#15394 introduced a build error for Scala 2.10; this PR fixes it.

## How was this patch tested?
Existing test.

Author: Yanbo Liang 

Closes #15625 from yanboliang/spark-17748-scala.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ac8ff920
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ac8ff920
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ac8ff920

Branch: refs/heads/master
Commit: ac8ff920faec6ee06e17212e2b5d2ee117495e87
Parents: 38cdd6c
Author: Yanbo Liang 
Authored: Tue Oct 25 10:22:02 2016 -0700
Committer: Yanbo Liang 
Committed: Tue Oct 25 10:22:02 2016 -0700

--
 .../spark/ml/optim/WeightedLeastSquaresSuite.scala | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ac8ff920/mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala
--
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala
 
b/mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala
index 5f638b4..3cdab03 100644
--- 
a/mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala
+++ 
b/mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala
@@ -280,7 +280,7 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with 
MLlibTestSparkContext
 }
 
 // Cholesky also fails when regularization is added but we don't wish to 
standardize
-val wls = new WeightedLeastSquares(true, regParam = 0.5, elasticNetParam = 
0.0,
+val wls = new WeightedLeastSquares(fitIntercept = true, regParam = 0.5, 
elasticNetParam = 0.0,
   standardizeFeatures = false, standardizeLabel = false,
   solverType = WeightedLeastSquares.Cholesky)
 intercept[SingularMatrixException] {
@@ -470,10 +470,11 @@ class WeightedLeastSquaresSuite extends SparkFunSuite 
with MLlibTestSparkContext
 var idx = 0
 for (fitIntercept <- Seq(false, true);
  regParam <- Seq(0.1, 0.5, 1.0);
- standardizeFeatures <- Seq(false, true);
+ standardization <- Seq(false, true);
  elasticNetParam <- Seq(0.1, 0.5, 1.0)) {
-  val wls = new WeightedLeastSquares(fitIntercept, regParam, 
elasticNetParam = elasticNetParam,
-standardizeFeatures, standardizeLabel = true, solverType = 
WeightedLeastSquares.Auto)
+  val wls = new WeightedLeastSquares(fitIntercept, regParam, 
elasticNetParam,
+standardizeFeatures = standardization, standardizeLabel = true,
+solverType = WeightedLeastSquares.Auto)
 .fit(instances)
   val actual = Vectors.dense(wls.intercept, wls.coefficients(0), 
wls.coefficients(1))
   assert(actual ~== expected(idx) absTol 1e-4)
@@ -528,10 +529,10 @@ class WeightedLeastSquaresSuite extends SparkFunSuite 
with MLlibTestSparkContext
 var idx = 0
 for (fitIntercept <- Seq(false, true);
  regParam <- Seq(0.0, 0.1, 1.0);
- standardizeFeatures <- Seq(false, true)) {
+ standardization <- Seq(false, true)) {
   for (solver <- WeightedLeastSquares.supportedSolvers) {
 val wls = new WeightedLeastSquares(fitIntercept, regParam, 
elasticNetParam = 0.0,
-  standardizeFeatures, standardizeLabel = true, solverType = solver)
+  standardizeFeatures = standardization, standardizeLabel = true, 
solverType = solver)
   .fit(instances)
 val actual = Vectors.dense(wls.intercept, wls.coefficients(0), 
wls.coefficients(1))
 assert(actual ~== expected(idx) absTol 1e-4)





spark git commit: [SPARK-17748][FOLLOW-UP][ML] Reorg variables of WeightedLeastSquares.

2016-10-26 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 4bee95407 -> 312ea3f7f


[SPARK-17748][FOLLOW-UP][ML] Reorg variables of WeightedLeastSquares.

## What changes were proposed in this pull request?
This is follow-up work of #15394.
Reorganize some variables of ```WeightedLeastSquares``` and fix one minor issue in ```WeightedLeastSquaresSuite```.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang 

Closes #15621 from yanboliang/spark-17748.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/312ea3f7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/312ea3f7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/312ea3f7

Branch: refs/heads/master
Commit: 312ea3f7f65532818e11016d6d780ad47485175f
Parents: 4bee954
Author: Yanbo Liang 
Authored: Wed Oct 26 09:28:28 2016 -0700
Committer: Yanbo Liang 
Committed: Wed Oct 26 09:28:28 2016 -0700

--
 .../spark/ml/optim/WeightedLeastSquares.scala   | 139 +++
 .../ml/optim/WeightedLeastSquaresSuite.scala|  15 +-
 2 files changed, 86 insertions(+), 68 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/312ea3f7/mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala 
b/mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala
index 2223f12..90c24e1 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala
@@ -101,23 +101,19 @@ private[ml] class WeightedLeastSquares(
 summary.validate()
 logInfo(s"Number of instances: ${summary.count}.")
 val k = if (fitIntercept) summary.k + 1 else summary.k
+val numFeatures = summary.k
 val triK = summary.triK
 val wSum = summary.wSum
-val bBar = summary.bBar
-val bbBar = summary.bbBar
-val aBar = summary.aBar
-val aStd = summary.aStd
-val abBar = summary.abBar
-val aaBar = summary.aaBar
-val numFeatures = abBar.size
+
 val rawBStd = summary.bStd
+val rawBBar = summary.bBar
 // if b is constant (rawBStd is zero), then b cannot be scaled. In this 
case
-// setting bStd=abs(bBar) ensures that b is not scaled anymore in l-bfgs 
algorithm.
-val bStd = if (rawBStd == 0.0) math.abs(bBar) else rawBStd
+// setting bStd=abs(rawBBar) ensures that b is not scaled anymore in 
l-bfgs algorithm.
+val bStd = if (rawBStd == 0.0) math.abs(rawBBar) else rawBStd
 
 if (rawBStd == 0) {
-  if (fitIntercept || bBar == 0.0) {
-if (bBar == 0.0) {
+  if (fitIntercept || rawBBar == 0.0) {
+if (rawBBar == 0.0) {
   logWarning(s"Mean and standard deviation of the label are zero, so 
the coefficients " +
 s"and the intercept will all be zero; as a result, training is not 
needed.")
 } else {
@@ -126,7 +122,7 @@ private[ml] class WeightedLeastSquares(
 s"training is not needed.")
 }
 val coefficients = new DenseVector(Array.ofDim(numFeatures))
-val intercept = bBar
+val intercept = rawBBar
 val diagInvAtWA = new DenseVector(Array(0D))
 return new WeightedLeastSquaresModel(coefficients, intercept, 
diagInvAtWA, Array(0D))
   } else {
@@ -137,53 +133,70 @@ private[ml] class WeightedLeastSquares(
   }
 }
 
-// scale aBar to standardized space in-place
-val aBarValues = aBar.values
-var j = 0
-while (j < numFeatures) {
-  if (aStd(j) == 0.0) {
-aBarValues(j) = 0.0
-  } else {
-aBarValues(j) /= aStd(j)
-  }
-  j += 1
-}
+val bBar = summary.bBar / bStd
+val bbBar = summary.bbBar / (bStd * bStd)
 
-// scale abBar to standardized space in-place
-val abBarValues = abBar.values
+val aStd = summary.aStd
 val aStdValues = aStd.values
-j = 0
-while (j < numFeatures) {
-  if (aStdValues(j) == 0.0) {
-abBarValues(j) = 0.0
-  } else {
-abBarValues(j) /= (aStdValues(j) * bStd)
+
+val aBar = {
+  val _aBar = summary.aBar
+  val _aBarValues = _aBar.values
+  var i = 0
+  // scale aBar to standardized space in-place
+  while (i < numFeatures) {
+if (aStdValues(i) == 0.0) {
+  _aBarValues(i) = 0.0
+} else {
+  _aBarValues(i) /= aStdValues(i)
+}
+i += 1
   }
-  j += 1
+  _aBar
 }
+val aBarValues = aBar.values
 
-// scale aaBar to standardized space in-place
-val aaBarValues = aaBar.values
-j = 0
-var p = 0
-while (j < numFeatures) {
-  val aStdJ = aStdValues(j)
+val abBar = {
+  val _abBar = sum

spark git commit: [SPARK-18109][ML] Add instrumentation to GMM

2016-10-28 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master ab5f938bc -> 569788a55


[SPARK-18109][ML] Add instrumentation to GMM

## What changes were proposed in this pull request?

Add instrumentation to GMM

## How was this patch tested?

Test in spark-shell
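
The instrumentation itself is internal, but its effect can be seen by fitting a model in spark-shell and watching the driver's INFO log; a minimal sketch with illustrative data and an assumed `spark` session:

```scala
import org.apache.spark.ml.clustering.GaussianMixture
import org.apache.spark.ml.linalg.Vectors

val data = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(-0.1, -0.05)),
  Tuple1(Vectors.dense(-0.01, -0.1)),
  Tuple1(Vectors.dense(0.9, 0.8)),
  Tuple1(Vectors.dense(0.75, 0.935))
)).toDF("features")

// While fit() runs, the new Instrumentation entries (logged params,
// numFeatures, success) show up in the INFO log.
val model = new GaussianMixture().setK(2).fit(data)
```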

Author: Zheng RuiFeng 

Closes #15636 from zhengruifeng/gmm_instr.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/569788a5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/569788a5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/569788a5

Branch: refs/heads/master
Commit: 569788a55e4c6b218fb697e1e54c6138ffe657a6
Parents: ab5f938
Author: Zheng RuiFeng 
Authored: Fri Oct 28 00:40:06 2016 -0700
Committer: Yanbo Liang 
Committed: Fri Oct 28 00:40:06 2016 -0700

--
 .../scala/org/apache/spark/ml/clustering/GaussianMixture.scala | 6 ++
 1 file changed, 6 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/569788a5/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala 
b/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala
index e3cb92f..8fac63f 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala
@@ -323,6 +323,9 @@ class GaussianMixture @Since("2.0.0") (
   case Row(point: Vector) => OldVectors.fromML(point)
 }
 
+val instr = Instrumentation.create(this, rdd)
+instr.logParams(featuresCol, predictionCol, probabilityCol, k, maxIter, 
seed, tol)
+
 val algo = new MLlibGM()
   .setK($(k))
   .setMaxIterations($(maxIter))
@@ -337,6 +340,9 @@ class GaussianMixture @Since("2.0.0") (
 val summary = new GaussianMixtureSummary(model.transform(dataset),
   $(predictionCol), $(probabilityCol), $(featuresCol), $(k))
 model.setSummary(summary)
+instr.logNumFeatures(model.gaussians.head.mean.size)
+instr.logSuccess(model)
+model
   }
 
   @Since("2.0.0")





spark git commit: [SPARK-18133][EXAMPLES][ML] Python ML Pipeline Example has syntax e…

2016-10-28 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 569788a55 -> e9746f87d


[SPARK-18133][EXAMPLES][ML] Python ML Pipeline Example has syntax e…

## What changes were proposed in this pull request?

In Python 3, there is only one integer type (i.e., int), which mostly behaves 
like the long type in Python 2. Since Python 3 does not accept the "L" suffix, it 
has been removed from all examples.

## How was this patch tested?

Unit tests.

…rrors]

Author: Jagadeesan 

Closes #15660 from jagadeesanas2/SPARK-18133.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e9746f87
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e9746f87
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e9746f87

Branch: refs/heads/master
Commit: e9746f87d0b553b8115948acb79f7e32c23dfd86
Parents: 569788a
Author: Jagadeesan 
Authored: Fri Oct 28 02:26:55 2016 -0700
Committer: Yanbo Liang 
Committed: Fri Oct 28 02:26:55 2016 -0700

--
 examples/src/main/python/ml/cross_validator.py  |  8 
 .../src/main/python/ml/gaussian_mixture_example.py  |  2 +-
 examples/src/main/python/ml/pipeline_example.py | 16 
 .../mllib/binary_classification_metrics_example.py  |  2 +-
 .../python/mllib/multi_class_metrics_example.py |  2 +-
 5 files changed, 15 insertions(+), 15 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e9746f87/examples/src/main/python/ml/cross_validator.py
--
diff --git a/examples/src/main/python/ml/cross_validator.py 
b/examples/src/main/python/ml/cross_validator.py
index 907eec6..db70543 100644
--- a/examples/src/main/python/ml/cross_validator.py
+++ b/examples/src/main/python/ml/cross_validator.py
@@ -84,10 +84,10 @@ if __name__ == "__main__":
 
 # Prepare test documents, which are unlabeled.
 test = spark.createDataFrame([
-(4L, "spark i j k"),
-(5L, "l m n"),
-(6L, "mapreduce spark"),
-(7L, "apache hadoop")
+(4, "spark i j k"),
+(5, "l m n"),
+(6, "mapreduce spark"),
+(7, "apache hadoop")
 ], ["id", "text"])
 
 # Make predictions on test documents. cvModel uses the best model found 
(lrModel).

http://git-wip-us.apache.org/repos/asf/spark/blob/e9746f87/examples/src/main/python/ml/gaussian_mixture_example.py
--
diff --git a/examples/src/main/python/ml/gaussian_mixture_example.py 
b/examples/src/main/python/ml/gaussian_mixture_example.py
index 8ad450b..e4a0d31 100644
--- a/examples/src/main/python/ml/gaussian_mixture_example.py
+++ b/examples/src/main/python/ml/gaussian_mixture_example.py
@@ -38,7 +38,7 @@ if __name__ == "__main__":
 # loads data
 dataset = 
spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
 
-gmm = GaussianMixture().setK(2).setSeed(538009335L)
+gmm = GaussianMixture().setK(2).setSeed(538009335)
 model = gmm.fit(dataset)
 
 print("Gaussians shown as a DataFrame: ")

http://git-wip-us.apache.org/repos/asf/spark/blob/e9746f87/examples/src/main/python/ml/pipeline_example.py
--
diff --git a/examples/src/main/python/ml/pipeline_example.py 
b/examples/src/main/python/ml/pipeline_example.py
index f63e4db..e1fab7c 100644
--- a/examples/src/main/python/ml/pipeline_example.py
+++ b/examples/src/main/python/ml/pipeline_example.py
@@ -35,10 +35,10 @@ if __name__ == "__main__":
 # $example on$
 # Prepare training documents from a list of (id, text, label) tuples.
 training = spark.createDataFrame([
-(0L, "a b c d e spark", 1.0),
-(1L, "b d", 0.0),
-(2L, "spark f g h", 1.0),
-(3L, "hadoop mapreduce", 0.0)
+(0, "a b c d e spark", 1.0),
+(1, "b d", 0.0),
+(2, "spark f g h", 1.0),
+(3, "hadoop mapreduce", 0.0)
 ], ["id", "text", "label"])
 
 # Configure an ML pipeline, which consists of three stages: tokenizer, 
hashingTF, and lr.
@@ -52,10 +52,10 @@ if __name__ == "__main__":
 
 # Prepare test documents, which are unlabeled (id, text) tuples.
 test = spark.createDataFrame([
-(4L, "spark i j k"),
-(5L, "l m n"),
-(6L, "spark hadoop spark"),
-(7L, "apache hadoop")
+(4, "spark i j k"),
+(5, "l m n"),
+(6, "spark hadoop spark"),
+(7, "apache hadoop")
 ], ["id", "text"])
 
 # Make predictions on test documents and print columns of interest.

http://git-wip-us.apache.org/repos/asf/spark/blob/e9746f87/examples/src/main/python/mllib/binary_classification_metrics_example.py
--
diff --git 
a/examples/src/main/python/mllib/binary_cla

spark git commit: [SPARK-18177][ML][PYSPARK] Add missing 'subsamplingRate' of pyspark GBTClassifier

2016-11-03 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 0ea5d5b24 -> 9dc9f9a5d


[SPARK-18177][ML][PYSPARK] Add missing 'subsamplingRate' of pyspark 
GBTClassifier

## What changes were proposed in this pull request?
Add missing 'subsamplingRate' of pyspark GBTClassifier
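
The Python param mirrors the one already exposed on the Scala side; a minimal Scala sketch of the equivalent setting, where `training` is assumed to have "label" and "features" columns:

```scala
import org.apache.spark.ml.classification.GBTClassifier

// subsamplingRate is the fraction of training data used to learn each tree,
// in the range (0, 1].
val gbt = new GBTClassifier()
  .setMaxIter(20)
  .setStepSize(0.1)
  .setSubsamplingRate(0.8)

val model = gbt.fit(training)
```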

## How was this patch tested?
existing tests

Author: Zheng RuiFeng 

Closes #15692 from zhengruifeng/gbt_subsamplingRate.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9dc9f9a5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9dc9f9a5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9dc9f9a5

Branch: refs/heads/master
Commit: 9dc9f9a5dde37d085808a264cfb9cf4d4f72417d
Parents: 0ea5d5b
Author: Zheng RuiFeng 
Authored: Thu Nov 3 07:45:20 2016 -0700
Committer: Yanbo Liang 
Committed: Thu Nov 3 07:45:20 2016 -0700

--
 python/pyspark/ml/classification.py | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9dc9f9a5/python/pyspark/ml/classification.py
--
diff --git a/python/pyspark/ml/classification.py 
b/python/pyspark/ml/classification.py
index d9ff356..56c8c62 100644
--- a/python/pyspark/ml/classification.py
+++ b/python/pyspark/ml/classification.py
@@ -900,19 +900,19 @@ class GBTClassifier(JavaEstimator, HasFeaturesCol, 
HasLabelCol, HasPredictionCol
 def __init__(self, featuresCol="features", labelCol="label", 
predictionCol="prediction",
  maxDepth=5, maxBins=32, minInstancesPerNode=1, 
minInfoGain=0.0,
  maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, 
lossType="logistic",
- maxIter=20, stepSize=0.1, seed=None):
+ maxIter=20, stepSize=0.1, seed=None, subsamplingRate=1.0):
 """
 __init__(self, featuresCol="features", labelCol="label", 
predictionCol="prediction", \
  maxDepth=5, maxBins=32, minInstancesPerNode=1, 
minInfoGain=0.0, \
  maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, 
\
- lossType="logistic", maxIter=20, stepSize=0.1, seed=None)
+ lossType="logistic", maxIter=20, stepSize=0.1, seed=None, 
subsamplingRate=1.0)
 """
 super(GBTClassifier, self).__init__()
 self._java_obj = self._new_java_obj(
 "org.apache.spark.ml.classification.GBTClassifier", self.uid)
 self._setDefault(maxDepth=5, maxBins=32, minInstancesPerNode=1, 
minInfoGain=0.0,
  maxMemoryInMB=256, cacheNodeIds=False, 
checkpointInterval=10,
- lossType="logistic", maxIter=20, stepSize=0.1)
+ lossType="logistic", maxIter=20, stepSize=0.1, 
subsamplingRate=1.0)
 kwargs = self.__init__._input_kwargs
 self.setParams(**kwargs)
 
@@ -921,12 +921,12 @@ class GBTClassifier(JavaEstimator, HasFeaturesCol, 
HasLabelCol, HasPredictionCol
 def setParams(self, featuresCol="features", labelCol="label", 
predictionCol="prediction",
   maxDepth=5, maxBins=32, minInstancesPerNode=1, 
minInfoGain=0.0,
   maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10,
-  lossType="logistic", maxIter=20, stepSize=0.1, seed=None):
+  lossType="logistic", maxIter=20, stepSize=0.1, seed=None, 
subsamplingRate=1.0):
 """
 setParams(self, featuresCol="features", labelCol="label", 
predictionCol="prediction", \
   maxDepth=5, maxBins=32, minInstancesPerNode=1, 
minInfoGain=0.0, \
   maxMemoryInMB=256, cacheNodeIds=False, 
checkpointInterval=10, \
-  lossType="logistic", maxIter=20, stepSize=0.1, seed=None)
+  lossType="logistic", maxIter=20, stepSize=0.1, seed=None, 
subsamplingRate=1.0)
 Sets params for Gradient Boosted Tree Classification.
 """
 kwargs = self.setParams._input_kwargs





spark git commit: [SPARK-18177][ML][PYSPARK] Add missing 'subsamplingRate' of pyspark GBTClassifier

2016-11-03 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 71104c9c9 -> 99891e56e


[SPARK-18177][ML][PYSPARK] Add missing 'subsamplingRate' of pyspark 
GBTClassifier

## What changes were proposed in this pull request?
Add missing 'subsamplingRate' of pyspark GBTClassifier

## How was this patch tested?
existing tests

Author: Zheng RuiFeng 

Closes #15692 from zhengruifeng/gbt_subsamplingRate.

(cherry picked from commit 9dc9f9a5dde37d085808a264cfb9cf4d4f72417d)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/99891e56
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/99891e56
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/99891e56

Branch: refs/heads/branch-2.1
Commit: 99891e56ea286580323fd82e303064d3c0730d85
Parents: 71104c9
Author: Zheng RuiFeng 
Authored: Thu Nov 3 07:45:20 2016 -0700
Committer: Yanbo Liang 
Committed: Thu Nov 3 07:45:56 2016 -0700

--
 python/pyspark/ml/classification.py | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/99891e56/python/pyspark/ml/classification.py
--
diff --git a/python/pyspark/ml/classification.py 
b/python/pyspark/ml/classification.py
index d9ff356..56c8c62 100644
--- a/python/pyspark/ml/classification.py
+++ b/python/pyspark/ml/classification.py
@@ -900,19 +900,19 @@ class GBTClassifier(JavaEstimator, HasFeaturesCol, 
HasLabelCol, HasPredictionCol
 def __init__(self, featuresCol="features", labelCol="label", 
predictionCol="prediction",
  maxDepth=5, maxBins=32, minInstancesPerNode=1, 
minInfoGain=0.0,
  maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, 
lossType="logistic",
- maxIter=20, stepSize=0.1, seed=None):
+ maxIter=20, stepSize=0.1, seed=None, subsamplingRate=1.0):
 """
 __init__(self, featuresCol="features", labelCol="label", 
predictionCol="prediction", \
  maxDepth=5, maxBins=32, minInstancesPerNode=1, 
minInfoGain=0.0, \
  maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, 
\
- lossType="logistic", maxIter=20, stepSize=0.1, seed=None)
+ lossType="logistic", maxIter=20, stepSize=0.1, seed=None, 
subsamplingRate=1.0)
 """
 super(GBTClassifier, self).__init__()
 self._java_obj = self._new_java_obj(
 "org.apache.spark.ml.classification.GBTClassifier", self.uid)
 self._setDefault(maxDepth=5, maxBins=32, minInstancesPerNode=1, 
minInfoGain=0.0,
  maxMemoryInMB=256, cacheNodeIds=False, 
checkpointInterval=10,
- lossType="logistic", maxIter=20, stepSize=0.1)
+ lossType="logistic", maxIter=20, stepSize=0.1, 
subsamplingRate=1.0)
 kwargs = self.__init__._input_kwargs
 self.setParams(**kwargs)
 
@@ -921,12 +921,12 @@ class GBTClassifier(JavaEstimator, HasFeaturesCol, 
HasLabelCol, HasPredictionCol
 def setParams(self, featuresCol="features", labelCol="label", 
predictionCol="prediction",
   maxDepth=5, maxBins=32, minInstancesPerNode=1, 
minInfoGain=0.0,
   maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10,
-  lossType="logistic", maxIter=20, stepSize=0.1, seed=None):
+  lossType="logistic", maxIter=20, stepSize=0.1, seed=None, 
subsamplingRate=1.0):
 """
 setParams(self, featuresCol="features", labelCol="label", 
predictionCol="prediction", \
   maxDepth=5, maxBins=32, minInstancesPerNode=1, 
minInfoGain=0.0, \
   maxMemoryInMB=256, cacheNodeIds=False, 
checkpointInterval=10, \
-  lossType="logistic", maxIter=20, stepSize=0.1, seed=None)
+  lossType="logistic", maxIter=20, stepSize=0.1, seed=None, 
subsamplingRate=1.0)
 Sets params for Gradient Boosted Tree Classification.
 """
 kwargs = self.setParams._input_kwargs





spark git commit: [SPARK-18276][ML] ML models should copy the training summary and set parent

2016-11-05 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 15d392688 -> 23ce0d1e9


[SPARK-18276][ML] ML models should copy the training summary and set parent

## What changes were proposed in this pull request?

Only some of the models which contain a training summary currently set the 
summaries in the copy method. Linear/Logistic regression do, GLR, GMM, KM, and 
BKM do not. Additionally, these copy methods did not set the parent pointer of 
the copied model. This patch modifies the copy methods of the four models 
mentioned above to copy the training summary and set the parent.
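
A minimal sketch of the contract the new tests enforce, using KMeans as one of the affected models; `dataset` with a "features" column is assumed:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.param.ParamMap

val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)

val copied = model.copy(ParamMap.empty)
// After this patch the copy keeps the training summary and the parent pointer.
assert(copied.hasSummary)
assert(copied.parent == model.parent)
```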

## How was this patch tested?

Add unit tests in Linear/Logistic/GeneralizedLinear regression and 
GaussianMixture/KMeans/BisectingKMeans to check the parent pointer of the 
copied model and check that the copied model has a summary.

Author: sethah 

Closes #15773 from sethah/SPARK-18276.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/23ce0d1e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/23ce0d1e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/23ce0d1e

Branch: refs/heads/master
Commit: 23ce0d1e91076d90c1a87d698a94d283d08cf899
Parents: 15d3926
Author: sethah 
Authored: Sat Nov 5 22:38:07 2016 -0700
Committer: Yanbo Liang 
Committed: Sat Nov 5 22:38:07 2016 -0700

--
 .../org/apache/spark/ml/clustering/BisectingKMeans.scala |  5 +++--
 .../org/apache/spark/ml/clustering/GaussianMixture.scala |  5 +++--
 .../scala/org/apache/spark/ml/clustering/KMeans.scala|  5 +++--
 .../ml/regression/GeneralizedLinearRegression.scala  |  6 --
 .../apache/spark/ml/tuning/TrainValidationSplit.scala|  2 +-
 .../ml/classification/LogisticRegressionSuite.scala  | 11 +++
 .../spark/ml/clustering/BisectingKMeansSuite.scala   | 10 +-
 .../spark/ml/clustering/GaussianMixtureSuite.scala   | 10 +-
 .../org/apache/spark/ml/clustering/KMeansSuite.scala | 10 +-
 .../ml/regression/GeneralizedLinearRegressionSuite.scala |  5 -
 .../spark/ml/regression/LinearRegressionSuite.scala  |  5 -
 .../spark/ml/tuning/TrainValidationSplitSuite.scala  |  8 ++--
 12 files changed, 62 insertions(+), 20 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/23ce0d1e/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala 
b/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
index 2718dd9..f8a606d 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
@@ -94,8 +94,9 @@ class BisectingKMeansModel private[ml] (
 
   @Since("2.0.0")
   override def copy(extra: ParamMap): BisectingKMeansModel = {
-val copied = new BisectingKMeansModel(uid, parentModel)
-copyValues(copied, extra)
+val copied = copyValues(new BisectingKMeansModel(uid, parentModel), extra)
+if (trainingSummary.isDefined) copied.setSummary(trainingSummary.get)
+copied.setParent(this.parent)
   }
 
   @Since("2.0.0")

http://git-wip-us.apache.org/repos/asf/spark/blob/23ce0d1e/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala 
b/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala
index 8fac63f..a0bd66e 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala
@@ -89,8 +89,9 @@ class GaussianMixtureModel private[ml] (
 
   @Since("2.0.0")
   override def copy(extra: ParamMap): GaussianMixtureModel = {
-val copied = new GaussianMixtureModel(uid, weights, gaussians)
-copyValues(copied, extra).setParent(this.parent)
+val copied = copyValues(new GaussianMixtureModel(uid, weights, gaussians), 
extra)
+if (trainingSummary.isDefined) copied.setSummary(trainingSummary.get)
+copied.setParent(this.parent)
   }
 
   @Since("2.0.0")

http://git-wip-us.apache.org/repos/asf/spark/blob/23ce0d1e/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
b/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
index 85bb8c9..a0d481b 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
@@ -108,8 +108,9 @@ cla

spark git commit: [SPARK-18276][ML] ML models should copy the training summary and set parent

2016-11-05 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 e9f1d4aaa -> c42301f1e


[SPARK-18276][ML] ML models should copy the training summary and set parent

## What changes were proposed in this pull request?

Only some of the models which contain a training summary currently set the 
summaries in the copy method. Linear/Logistic regression do, GLR, GMM, KM, and 
BKM do not. Additionally, these copy methods did not set the parent pointer of 
the copied model. This patch modifies the copy methods of the four models 
mentioned above to copy the training summary and set the parent.

## How was this patch tested?

Add unit tests in Linear/Logistic/GeneralizedLinear regression and 
GaussianMixture/KMeans/BisectingKMeans to check the parent pointer of the 
copied model and check that the copied model has a summary.

Author: sethah 

Closes #15773 from sethah/SPARK-18276.

(cherry picked from commit 23ce0d1e91076d90c1a87d698a94d283d08cf899)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c42301f1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c42301f1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c42301f1

Branch: refs/heads/branch-2.1
Commit: c42301f1eb09565cfaa044b05984ed67879bd946
Parents: e9f1d4a
Author: sethah 
Authored: Sat Nov 5 22:38:07 2016 -0700
Committer: Yanbo Liang 
Committed: Sat Nov 5 22:38:40 2016 -0700

--
 .../org/apache/spark/ml/clustering/BisectingKMeans.scala |  5 +++--
 .../org/apache/spark/ml/clustering/GaussianMixture.scala |  5 +++--
 .../scala/org/apache/spark/ml/clustering/KMeans.scala|  5 +++--
 .../ml/regression/GeneralizedLinearRegression.scala  |  6 --
 .../apache/spark/ml/tuning/TrainValidationSplit.scala|  2 +-
 .../ml/classification/LogisticRegressionSuite.scala  | 11 +++
 .../spark/ml/clustering/BisectingKMeansSuite.scala   | 10 +-
 .../spark/ml/clustering/GaussianMixtureSuite.scala   | 10 +-
 .../org/apache/spark/ml/clustering/KMeansSuite.scala | 10 +-
 .../ml/regression/GeneralizedLinearRegressionSuite.scala |  5 -
 .../spark/ml/regression/LinearRegressionSuite.scala  |  5 -
 .../spark/ml/tuning/TrainValidationSplitSuite.scala  |  8 ++--
 12 files changed, 62 insertions(+), 20 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c42301f1/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala 
b/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
index 2718dd9..f8a606d 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
@@ -94,8 +94,9 @@ class BisectingKMeansModel private[ml] (
 
   @Since("2.0.0")
   override def copy(extra: ParamMap): BisectingKMeansModel = {
-val copied = new BisectingKMeansModel(uid, parentModel)
-copyValues(copied, extra)
+val copied = copyValues(new BisectingKMeansModel(uid, parentModel), extra)
+if (trainingSummary.isDefined) copied.setSummary(trainingSummary.get)
+copied.setParent(this.parent)
   }
 
   @Since("2.0.0")

http://git-wip-us.apache.org/repos/asf/spark/blob/c42301f1/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala 
b/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala
index 8fac63f..a0bd66e 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala
@@ -89,8 +89,9 @@ class GaussianMixtureModel private[ml] (
 
   @Since("2.0.0")
   override def copy(extra: ParamMap): GaussianMixtureModel = {
-val copied = new GaussianMixtureModel(uid, weights, gaussians)
-copyValues(copied, extra).setParent(this.parent)
+val copied = copyValues(new GaussianMixtureModel(uid, weights, gaussians), 
extra)
+if (trainingSummary.isDefined) copied.setSummary(trainingSummary.get)
+copied.setParent(this.parent)
   }
 
   @Since("2.0.0")

http://git-wip-us.apache.org/repos/asf/spark/blob/c42301f1/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
b/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
index 85bb8c9..a0d481b 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/clustering/K

spark git commit: [SPARK-18210][ML] Pipeline.copy does not create an instance with the same UID

2016-11-06 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 dcbf3fd4b -> d2f2cf68a


[SPARK-18210][ML] Pipeline.copy does not create an instance with the same UID

## What changes were proposed in this pull request?

Motivation:
`org.apache.spark.ml.Pipeline.copy(extra: ParamMap)` does not create an 
instance with the same UID. It does not conform to the method specification 
from its base class `org.apache.spark.ml.param.Params.copy(extra: ParamMap)`

Solution:
- fix for Pipeline UID
- introduced new tests for `org.apache.spark.ml.Pipeline.copy`
- minor improvements in test for `org.apache.spark.ml.PipelineModel.copy`
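
A minimal sketch of the contract being restored (stage and extra params are illustrative):

```scala
import org.apache.spark.ml.{Pipeline, Transformer}
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.ml.param.ParamMap

val hashingTF = new HashingTF().setNumFeatures(100)
val pipeline = new Pipeline().setStages(Array[Transformer](hashingTF))

// Params.copy requires an instance with the same UID; extra params still apply.
val copied = pipeline.copy(ParamMap(hashingTF.numFeatures -> 10))
assert(copied.uid == pipeline.uid)
assert(copied.getStages(0).asInstanceOf[HashingTF].getNumFeatures == 10)
```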

## How was this patch tested?

Introduced new unit test: `org.apache.spark.ml.PipelineSuite."Pipeline.copy"`
Improved existing unit test: 
`org.apache.spark.ml.PipelineSuite."PipelineModel.copy"`

Author: Wojciech Szymanski 

Closes #15759 from wojtek-szymanski/SPARK-18210.

(cherry picked from commit b89d0556dff0520ab35882382242fbfa7d9478eb)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d2f2cf68
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d2f2cf68
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d2f2cf68

Branch: refs/heads/branch-2.1
Commit: d2f2cf68a62a3f8beb7cdfef8393acfdcb785975
Parents: dcbf3fd
Author: Wojciech Szymanski 
Authored: Sun Nov 6 07:43:13 2016 -0800
Committer: Yanbo Liang 
Committed: Sun Nov 6 07:43:36 2016 -0800

--
 .../scala/org/apache/spark/ml/Pipeline.scala|  2 +-
 .../org/apache/spark/ml/PipelineSuite.scala | 22 ++--
 2 files changed, 21 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d2f2cf68/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala 
b/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala
index 195a93e..f406f8c 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala
@@ -169,7 +169,7 @@ class Pipeline @Since("1.4.0") (
   override def copy(extra: ParamMap): Pipeline = {
     val map = extractParamMap(extra)
     val newStages = map(stages).map(_.copy(extra))
-    new Pipeline().setStages(newStages)
+    new Pipeline(uid).setStages(newStages)
   }
 
   @Since("1.2.0")

http://git-wip-us.apache.org/repos/asf/spark/blob/d2f2cf68/mllib/src/test/scala/org/apache/spark/ml/PipelineSuite.scala
--
diff --git a/mllib/src/test/scala/org/apache/spark/ml/PipelineSuite.scala 
b/mllib/src/test/scala/org/apache/spark/ml/PipelineSuite.scala
index 6413ca1..dafc6c2 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/PipelineSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/PipelineSuite.scala
@@ -101,13 +101,31 @@ class PipelineSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defaul
 }
   }
 
+  test("Pipeline.copy") {
+    val hashingTF = new HashingTF()
+      .setNumFeatures(100)
+    val pipeline = new Pipeline("pipeline").setStages(Array[Transformer](hashingTF))
+    val copied = pipeline.copy(ParamMap(hashingTF.numFeatures -> 10))
+
+    assert(copied.uid === pipeline.uid,
+      "copy should create an instance with the same UID")
+    assert(copied.getStages(0).asInstanceOf[HashingTF].getNumFeatures === 10,
+      "copy should handle extra stage params")
+  }
+
   test("PipelineModel.copy") {
     val hashingTF = new HashingTF()
       .setNumFeatures(100)
-    val model = new PipelineModel("pipeline", Array[Transformer](hashingTF))
+    val model = new PipelineModel("pipelineModel", Array[Transformer](hashingTF))
+      .setParent(new Pipeline())
     val copied = model.copy(ParamMap(hashingTF.numFeatures -> 10))
-    require(copied.stages(0).asInstanceOf[HashingTF].getNumFeatures === 10,
+
+    assert(copied.uid === model.uid,
+      "copy should create an instance with the same UID")
+    assert(copied.stages(0).asInstanceOf[HashingTF].getNumFeatures === 10,
       "copy should handle extra stage params")
+    assert(copied.parent === model.parent,
+      "copy should create an instance with the same parent")
   }
 
   test("pipeline model constructors") {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-18210][ML] Pipeline.copy does not create an instance with the same UID

2016-11-06 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 340f09d10 -> b89d0556d


[SPARK-18210][ML] Pipeline.copy does not create an instance with the same UID

## What changes were proposed in this pull request?

Motivation:
`org.apache.spark.ml.Pipeline.copy(extra: ParamMap)` does not create an 
instance with the same UID. It does not conform to the method specification 
from its base class `org.apache.spark.ml.param.Params.copy(extra: ParamMap)`

Solution:
- fix for Pipeline UID
- introduced new tests for `org.apache.spark.ml.Pipeline.copy`
- minor improvements in test for `org.apache.spark.ml.PipelineModel.copy`

## How was this patch tested?

Introduced new unit test: `org.apache.spark.ml.PipelineSuite."Pipeline.copy"`
Improved existing unit test: 
`org.apache.spark.ml.PipelineSuite."PipelineModel.copy"`

Author: Wojciech Szymanski 

Closes #15759 from wojtek-szymanski/SPARK-18210.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b89d0556
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b89d0556
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b89d0556

Branch: refs/heads/master
Commit: b89d0556dff0520ab35882382242fbfa7d9478eb
Parents: 340f09d
Author: Wojciech Szymanski 
Authored: Sun Nov 6 07:43:13 2016 -0800
Committer: Yanbo Liang 
Committed: Sun Nov 6 07:43:13 2016 -0800

--
 .../scala/org/apache/spark/ml/Pipeline.scala|  2 +-
 .../org/apache/spark/ml/PipelineSuite.scala | 22 ++--
 2 files changed, 21 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b89d0556/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala 
b/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala
index 195a93e..f406f8c 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala
@@ -169,7 +169,7 @@ class Pipeline @Since("1.4.0") (
   override def copy(extra: ParamMap): Pipeline = {
     val map = extractParamMap(extra)
     val newStages = map(stages).map(_.copy(extra))
-    new Pipeline().setStages(newStages)
+    new Pipeline(uid).setStages(newStages)
   }
 
   @Since("1.2.0")

http://git-wip-us.apache.org/repos/asf/spark/blob/b89d0556/mllib/src/test/scala/org/apache/spark/ml/PipelineSuite.scala
--
diff --git a/mllib/src/test/scala/org/apache/spark/ml/PipelineSuite.scala 
b/mllib/src/test/scala/org/apache/spark/ml/PipelineSuite.scala
index 6413ca1..dafc6c2 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/PipelineSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/PipelineSuite.scala
@@ -101,13 +101,31 @@ class PipelineSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defaul
 }
   }
 
+  test("Pipeline.copy") {
+    val hashingTF = new HashingTF()
+      .setNumFeatures(100)
+    val pipeline = new Pipeline("pipeline").setStages(Array[Transformer](hashingTF))
+    val copied = pipeline.copy(ParamMap(hashingTF.numFeatures -> 10))
+
+    assert(copied.uid === pipeline.uid,
+      "copy should create an instance with the same UID")
+    assert(copied.getStages(0).asInstanceOf[HashingTF].getNumFeatures === 10,
+      "copy should handle extra stage params")
+  }
+
   test("PipelineModel.copy") {
     val hashingTF = new HashingTF()
       .setNumFeatures(100)
-    val model = new PipelineModel("pipeline", Array[Transformer](hashingTF))
+    val model = new PipelineModel("pipelineModel", Array[Transformer](hashingTF))
+      .setParent(new Pipeline())
     val copied = model.copy(ParamMap(hashingTF.numFeatures -> 10))
-    require(copied.stages(0).asInstanceOf[HashingTF].getNumFeatures === 10,
+
+    assert(copied.uid === model.uid,
+      "copy should create an instance with the same UID")
+    assert(copied.stages(0).asInstanceOf[HashingTF].getNumFeatures === 10,
       "copy should handle extra stage params")
+    assert(copied.parent === model.parent,
+      "copy should create an instance with the same parent")
   }
 
   test("pipeline model constructors") {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-18291][SPARKR][ML] SparkR glm predict should output original label when family = binomial.

2016-11-07 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master a814eeac6 -> daa975f4b


[SPARK-18291][SPARKR][ML] SparkR glm predict should output original label when 
family = binomial.

## What changes were proposed in this pull request?
SparkR ```spark.glm``` predict should output the original label when family = "binomial".

## How was this patch tested?
Add unit test.
You can also run the following code to test:
```R
training <- suppressWarnings(createDataFrame(iris))
training <- training[training$Species %in% c("versicolor", "virginica"), ]
model <- spark.glm(training, Species ~ Sepal_Length + Sepal_Width, family = binomial(link = "logit"))
showDF(predict(model, training))
```
Before this change:
```
+------------+-----------+------------+-----------+----------+-----+-------------------+
|Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|   Species|label|         prediction|
+------------+-----------+------------+-----------+----------+-----+-------------------+
|         7.0|        3.2|         4.7|        1.4|versicolor|  0.0| 0.8271421517601544|
|         6.4|        3.2|         4.5|        1.5|versicolor|  0.0| 0.6044595910413112|
|         6.9|        3.1|         4.9|        1.5|versicolor|  0.0| 0.7916340858281998|
|         5.5|        2.3|         4.0|        1.3|versicolor|  0.0|0.16080518180591158|
|         6.5|        2.8|         4.6|        1.5|versicolor|  0.0| 0.6112229217050189|
|         5.7|        2.8|         4.5|        1.3|versicolor|  0.0| 0.2555087295500885|
|         6.3|        3.3|         4.7|        1.6|versicolor|  0.0| 0.5681507664364834|
|         4.9|        2.4|         3.3|        1.0|versicolor|  0.0|0.05990570219972002|
|         6.6|        2.9|         4.6|        1.3|versicolor|  0.0| 0.6644434078306246|
|         5.2|        2.7|         3.9|        1.4|versicolor|  0.0|0.11293577405862379|
|         5.0|        2.0|         3.5|        1.0|versicolor|  0.0|0.06152372321585971|
|         5.9|        3.0|         4.2|        1.5|versicolor|  0.0|0.35250697207602555|
|         6.0|        2.2|         4.0|        1.0|versicolor|  0.0|0.32267018290814303|
|         6.1|        2.9|         4.7|        1.4|versicolor|  0.0|  0.433391153814592|
|         5.6|        2.9|         3.6|        1.3|versicolor|  0.0| 0.2280744262436993|
|         6.7|        3.1|         4.4|        1.4|versicolor|  0.0| 0.7219848389339459|
|         5.6|        3.0|         4.5|        1.5|versicolor|  0.0|0.23527698971404695|
|         5.8|        2.7|         4.1|        1.0|versicolor|  0.0|  0.285024533520016|
|         6.2|        2.2|         4.5|        1.5|versicolor|  0.0| 0.4107047877447493|
|         5.6|        2.5|         3.9|        1.1|versicolor|  0.0|0.20083561961645083|
+------------+-----------+------------+-----------+----------+-----+-------------------+
```
After this change:
```
+------------+-----------+------------+-----------+----------+-----+----------+
|Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|   Species|label|prediction|
+------------+-----------+------------+-----------+----------+-----+----------+
|         7.0|        3.2|         4.7|        1.4|versicolor|  0.0| virginica|
|         6.4|        3.2|         4.5|        1.5|versicolor|  0.0| virginica|
|         6.9|        3.1|         4.9|        1.5|versicolor|  0.0| virginica|
|         5.5|        2.3|         4.0|        1.3|versicolor|  0.0|versicolor|
|         6.5|        2.8|         4.6|        1.5|versicolor|  0.0| virginica|
|         5.7|        2.8|         4.5|        1.3|versicolor|  0.0|versicolor|
|         6.3|        3.3|         4.7|        1.6|versicolor|  0.0| virginica|
|         4.9|        2.4|         3.3|        1.0|versicolor|  0.0|versicolor|
|         6.6|        2.9|         4.6|        1.3|versicolor|  0.0| virginica|
|         5.2|        2.7|         3.9|        1.4|versicolor|  0.0|versicolor|
|         5.0|        2.0|         3.5|        1.0|versicolor|  0.0|versicolor|
|         5.9|        3.0|         4.2|        1.5|versicolor|  0.0|versicolor|
|         6.0|        2.2|         4.0|        1.0|versicolor|  0.0|versicolor|
|         6.1|        2.9|         4.7|        1.4|versicolor|  0.0|versicolor|
|         5.6|        2.9|         3.6|        1.3|versicolor|  0.0|versicolor|
|         6.7|        3.1|         4.4|        1.4|versicolor|  0.0| virginica|
|         5.6|        3.0|         4.5|        1.5|versicolor|  0.0|versicolor|
|         5.8|        2.7|         4.1|        1.0|versicolor|  0.0|versicolor|
|         6.2|        2.2|         4.5|        1.5|versicolor|  0.0|versicolor|
|         5.6|        2.5|         3.9|        1.1|versicolor|  0.0|versicolor|
+------------+-----------+------------+-----------+----------+-----+----------+
```
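Conceptually, the change is a post-processing step: for family = "binomial" the model predicts a probability, and the wrapper now maps that probability back to one of the two original label strings instead of returning the raw value. A rough Scala sketch of that mapping, using a few probabilities from the table above with an assumed 0.5 threshold and index-to-label order (hypothetical, not the actual SparkR wrapper code):

```scala
// Hypothetical sketch: turn a binomial GLM's predicted probability back into
// one of the two original string labels, as the SparkR wrapper now does.
object BinomialLabelDemo extends App {
  // Index-to-label mapping produced when the string label was encoded (assumed order).
  val labels = Array("versicolor", "virginica") // index 0 -> versicolor, index 1 -> virginica

  // Predicted probabilities of class 1 for a few rows (values taken from the table above).
  val predictedProbabilities = Seq(0.8271421517601544, 0.16080518180591158, 0.5681507664364834)

  // Threshold at 0.5 and map the resulting index back to the original label string.
  val predictedLabels = predictedProbabilities.map { p =>
    val idx = if (p > 0.5) 1 else 0
    labels(idx)
  }

  println(predictedLabels.mkString(", ")) // virginica, versicolor, virginica
}
```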

Author: Yanbo Liang 

Closes #15788 from yanboliang/spark-18291.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commi

spark git commit: [SPARK-18291][SPARKR][ML] SparkR glm predict should output original label when family = binomial.

2016-11-07 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 df40ee2b4 -> 6b332909f


[SPARK-18291][SPARKR][ML] SparkR glm predict should output original label when 
family = binomial.

## What changes were proposed in this pull request?
SparkR ```spark.glm``` predict should output the original label when family = "binomial".

## How was this patch tested?
Add unit test.
You can also run the following code to test:
```R
training <- suppressWarnings(createDataFrame(iris))
training <- training[training$Species %in% c("versicolor", "virginica"), ]
model <- spark.glm(training, Species ~ Sepal_Length + Sepal_Width, family = binomial(link = "logit"))
showDF(predict(model, training))
```
Before this change:
```
+------------+-----------+------------+-----------+----------+-----+-------------------+
|Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|   Species|label|         prediction|
+------------+-----------+------------+-----------+----------+-----+-------------------+
|         7.0|        3.2|         4.7|        1.4|versicolor|  0.0| 0.8271421517601544|
|         6.4|        3.2|         4.5|        1.5|versicolor|  0.0| 0.6044595910413112|
|         6.9|        3.1|         4.9|        1.5|versicolor|  0.0| 0.7916340858281998|
|         5.5|        2.3|         4.0|        1.3|versicolor|  0.0|0.16080518180591158|
|         6.5|        2.8|         4.6|        1.5|versicolor|  0.0| 0.6112229217050189|
|         5.7|        2.8|         4.5|        1.3|versicolor|  0.0| 0.2555087295500885|
|         6.3|        3.3|         4.7|        1.6|versicolor|  0.0| 0.5681507664364834|
|         4.9|        2.4|         3.3|        1.0|versicolor|  0.0|0.05990570219972002|
|         6.6|        2.9|         4.6|        1.3|versicolor|  0.0| 0.6644434078306246|
|         5.2|        2.7|         3.9|        1.4|versicolor|  0.0|0.11293577405862379|
|         5.0|        2.0|         3.5|        1.0|versicolor|  0.0|0.06152372321585971|
|         5.9|        3.0|         4.2|        1.5|versicolor|  0.0|0.35250697207602555|
|         6.0|        2.2|         4.0|        1.0|versicolor|  0.0|0.32267018290814303|
|         6.1|        2.9|         4.7|        1.4|versicolor|  0.0|  0.433391153814592|
|         5.6|        2.9|         3.6|        1.3|versicolor|  0.0| 0.2280744262436993|
|         6.7|        3.1|         4.4|        1.4|versicolor|  0.0| 0.7219848389339459|
|         5.6|        3.0|         4.5|        1.5|versicolor|  0.0|0.23527698971404695|
|         5.8|        2.7|         4.1|        1.0|versicolor|  0.0|  0.285024533520016|
|         6.2|        2.2|         4.5|        1.5|versicolor|  0.0| 0.4107047877447493|
|         5.6|        2.5|         3.9|        1.1|versicolor|  0.0|0.20083561961645083|
+------------+-----------+------------+-----------+----------+-----+-------------------+
```
After this change:
```
+------------+-----------+------------+-----------+----------+-----+----------+
|Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|   Species|label|prediction|
+------------+-----------+------------+-----------+----------+-----+----------+
|         7.0|        3.2|         4.7|        1.4|versicolor|  0.0| virginica|
|         6.4|        3.2|         4.5|        1.5|versicolor|  0.0| virginica|
|         6.9|        3.1|         4.9|        1.5|versicolor|  0.0| virginica|
|         5.5|        2.3|         4.0|        1.3|versicolor|  0.0|versicolor|
|         6.5|        2.8|         4.6|        1.5|versicolor|  0.0| virginica|
|         5.7|        2.8|         4.5|        1.3|versicolor|  0.0|versicolor|
|         6.3|        3.3|         4.7|        1.6|versicolor|  0.0| virginica|
|         4.9|        2.4|         3.3|        1.0|versicolor|  0.0|versicolor|
|         6.6|        2.9|         4.6|        1.3|versicolor|  0.0| virginica|
|         5.2|        2.7|         3.9|        1.4|versicolor|  0.0|versicolor|
|         5.0|        2.0|         3.5|        1.0|versicolor|  0.0|versicolor|
|         5.9|        3.0|         4.2|        1.5|versicolor|  0.0|versicolor|
|         6.0|        2.2|         4.0|        1.0|versicolor|  0.0|versicolor|
|         6.1|        2.9|         4.7|        1.4|versicolor|  0.0|versicolor|
|         5.6|        2.9|         3.6|        1.3|versicolor|  0.0|versicolor|
|         6.7|        3.1|         4.4|        1.4|versicolor|  0.0| virginica|
|         5.6|        3.0|         4.5|        1.5|versicolor|  0.0|versicolor|
|         5.8|        2.7|         4.1|        1.0|versicolor|  0.0|versicolor|
|         6.2|        2.2|         4.5|        1.5|versicolor|  0.0|versicolor|
|         5.6|        2.5|         3.9|        1.1|versicolor|  0.0|versicolor|
+------------+-----------+------------+-----------+----------+-----+----------+
```

Author: Yanbo Liang 

Closes #15788 from yanboliang/spark-18291.

(cherry picked from commit daa975f4bfa4f904697bf3365a4be9987032e490)
Signed-off-by: Yanbo Liang 


Project: http:/

spark git commit: [SPARK-18901][ML] Require in LR LogisticAggregator is redundant

2017-04-24 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 776a2c0e9 -> 90264aced


[SPARK-18901][ML] Require in LR LogisticAggregator is redundant

## What changes were proposed in this pull request?

In `MultivariateOnlineSummarizer`, `add` and `merge` already check instance weights and feature sizes, so the corresponding checks in LR are redundant and are removed in this PR.
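In other words, the summarizer pass over the same data has already rejected negative weights and mismatched dimensions before the aggregator runs, so repeating the `require` calls in the per-instance hot loop only adds overhead. A schematic sketch of that division of labor with simplified stand-in classes (not the real spark.ml internals):

```scala
// Simplified stand-ins: validation happens once in the summarizer pass,
// so the per-instance aggregator loop can skip the redundant require calls.
final case class Instance(label: Double, weight: Double, features: Array[Double])

final class Summarizer(numFeatures: Int) {
  def add(instance: Instance): this.type = {
    require(instance.weight >= 0.0, s"instance weight, ${instance.weight} has to be >= 0.0")
    require(instance.features.length == numFeatures,
      s"Dimensions mismatch. Expecting $numFeatures but got ${instance.features.length}.")
    this
  }
}

final class Aggregator(numFeatures: Int) {
  private var weightSum = 0.0
  // No require here: inputs were already validated by the summarizer pass.
  def add(instance: Instance): this.type = {
    if (instance.weight == 0.0) return this
    weightSum += instance.weight
    this
  }
  def totalWeight: Double = weightSum
}

object RedundantRequireDemo extends App {
  val data = Seq(Instance(1.0, 1.0, Array(0.5, 2.0)), Instance(0.0, 2.0, Array(1.5, 0.1)))
  val summarizer = new Summarizer(numFeatures = 2)
  data.foreach(summarizer.add)                                 // validation happens once here
  val agg = data.foldLeft(new Aggregator(2))((a, x) => a.add(x)) // hot loop stays check-free
  println(s"aggregated weight = ${agg.totalWeight}")
}
```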

## How was this patch tested?

Existing tests.

Author: wm...@hotmail.com 

Closes #17478 from wangmiao1981/logit.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/90264ace
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/90264ace
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/90264ace

Branch: refs/heads/master
Commit: 90264aced7cfdf265636517b91e5d1324fe60112
Parents: 776a2c0
Author: wm...@hotmail.com 
Authored: Mon Apr 24 23:43:06 2017 +0800
Committer: Yanbo Liang 
Committed: Mon Apr 24 23:43:06 2017 +0800

--
 .../org/apache/spark/ml/classification/LogisticRegression.scala | 5 -
 1 file changed, 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/90264ace/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
index bc81546..44b3478 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
@@ -1571,9 +1571,6 @@ private class LogisticAggregator(
    */
   def add(instance: Instance): this.type = {
     instance match { case Instance(label, weight, features) =>
-      require(numFeatures == features.size, s"Dimensions mismatch when adding new instance." +
-        s" Expecting $numFeatures but got ${features.size}.")
-      require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0")
 
       if (weight == 0.0) return this
 
@@ -1596,8 +1593,6 @@ private class LogisticAggregator(
    * @return This LogisticAggregator object.
    */
   def merge(other: LogisticAggregator): this.type = {
-    require(numFeatures == other.numFeatures, s"Dimensions mismatch when merging with another " +
-      s"LogisticAggregator. Expecting $numFeatures but got ${other.numFeatures}.")
 
     if (other.weightSum != 0.0) {
       weightSum += other.weightSum


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-18901][ML] Require in LR LogisticAggregator is redundant

2017-04-24 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 2bef01f64 -> cf16c3250


[SPARK-18901][ML] Require in LR LogisticAggregator is redundant

## What changes were proposed in this pull request?

In `MultivariateOnlineSummarizer`, `add` and `merge` already check instance weights and feature sizes, so the corresponding checks in LR are redundant and are removed in this PR.

## How was this patch tested?

Existing tests.

Author: wm...@hotmail.com 

Closes #17478 from wangmiao1981/logit.

(cherry picked from commit 90264aced7cfdf265636517b91e5d1324fe60112)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cf16c325
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cf16c325
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cf16c325

Branch: refs/heads/branch-2.2
Commit: cf16c3250e946c4f89edc999d8764e8fa3dfb056
Parents: 2bef01f
Author: wm...@hotmail.com 
Authored: Mon Apr 24 23:43:06 2017 +0800
Committer: Yanbo Liang 
Committed: Mon Apr 24 23:43:23 2017 +0800

--
 .../org/apache/spark/ml/classification/LogisticRegression.scala | 5 -
 1 file changed, 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/cf16c325/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
index bc81546..44b3478 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
@@ -1571,9 +1571,6 @@ private class LogisticAggregator(
    */
   def add(instance: Instance): this.type = {
     instance match { case Instance(label, weight, features) =>
-      require(numFeatures == features.size, s"Dimensions mismatch when adding new instance." +
-        s" Expecting $numFeatures but got ${features.size}.")
-      require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0")
 
       if (weight == 0.0) return this
 
@@ -1596,8 +1593,6 @@ private class LogisticAggregator(
    * @return This LogisticAggregator object.
    */
   def merge(other: LogisticAggregator): this.type = {
-    require(numFeatures == other.numFeatures, s"Dimensions mismatch when merging with another " +
-      s"LogisticAggregator. Expecting $numFeatures but got ${other.numFeatures}.")
 
     if (other.weightSum != 0.0) {
       weightSum += other.weightSum


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-18901][FOLLOWUP][ML] Require in LR LogisticAggregator is redundant

2017-04-25 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 0bc7a9021 -> 387565cf1


[SPARK-18901][FOLLOWUP][ML] Require in LR LogisticAggregator is redundant

## What changes were proposed in this pull request?

This is a follow-up PR of #17478.

## How was this patch tested?

Existing tests

Author: wangmiao1981 

Closes #17754 from wangmiao1981/followup.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/387565cf
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/387565cf
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/387565cf

Branch: refs/heads/master
Commit: 387565cf14b490810f9479ff3adbf776e2edecdc
Parents: 0bc7a90
Author: wangmiao1981 
Authored: Tue Apr 25 16:30:36 2017 +0800
Committer: Yanbo Liang 
Committed: Tue Apr 25 16:30:36 2017 +0800

--
 .../scala/org/apache/spark/ml/classification/LinearSVC.scala| 5 ++---
 .../scala/org/apache/spark/ml/regression/LinearRegression.scala | 5 -
 2 files changed, 2 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/387565cf/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala
index f76b14e..7507c75 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala
@@ -458,9 +458,7 @@ private class LinearSVCAggregator(
    */
   def add(instance: Instance): this.type = {
     instance match { case Instance(label, weight, features) =>
-      require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0")
-      require(numFeatures == features.size, s"Dimensions mismatch when adding new instance." +
-        s" Expecting $numFeatures but got ${features.size}.")
+
       if (weight == 0.0) return this
       val localFeaturesStd = bcFeaturesStd.value
       val localCoefficients = coefficientsArray
@@ -512,6 +510,7 @@ private class LinearSVCAggregator(
    * @return This LinearSVCAggregator object.
    */
   def merge(other: LinearSVCAggregator): this.type = {
+
     if (other.weightSum != 0.0) {
       weightSum += other.weightSum
       lossSum += other.lossSum

http://git-wip-us.apache.org/repos/asf/spark/blob/387565cf/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala 
b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
index f7e3c8f..eaad549 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
@@ -971,9 +971,6 @@ private class LeastSquaresAggregator(
    */
   def add(instance: Instance): this.type = {
     instance match { case Instance(label, weight, features) =>
-      require(dim == features.size, s"Dimensions mismatch when adding new sample." +
-        s" Expecting $dim but got ${features.size}.")
-      require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0")
 
       if (weight == 0.0) return this
 
@@ -1005,8 +1002,6 @@ private class LeastSquaresAggregator(
    * @return This LeastSquaresAggregator object.
    */
   def merge(other: LeastSquaresAggregator): this.type = {
-    require(dim == other.dim, s"Dimensions mismatch when merging with another " +
-      s"LeastSquaresAggregator. Expecting $dim but got ${other.dim}.")
 
     if (other.weightSum != 0) {
       totalCnt += other.totalCnt


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-18901][FOLLOWUP][ML] Require in LR LogisticAggregator is redundant

2017-04-25 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 b62ebd91b -> e2591c6d7


[SPARK-18901][FOLLOWUP][ML] Require in LR LogisticAggregator is redundant

## What changes were proposed in this pull request?

This is a follow-up PR of #17478.

## How was this patch tested?

Existing tests

Author: wangmiao1981 

Closes #17754 from wangmiao1981/followup.

(cherry picked from commit 387565cf14b490810f9479ff3adbf776e2edecdc)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e2591c6d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e2591c6d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e2591c6d

Branch: refs/heads/branch-2.2
Commit: e2591c6d74081e9edad2e8982c0125a4f1d21437
Parents: b62ebd9
Author: wangmiao1981 
Authored: Tue Apr 25 16:30:36 2017 +0800
Committer: Yanbo Liang 
Committed: Tue Apr 25 16:30:53 2017 +0800

--
 .../scala/org/apache/spark/ml/classification/LinearSVC.scala| 5 ++---
 .../scala/org/apache/spark/ml/regression/LinearRegression.scala | 5 -
 2 files changed, 2 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e2591c6d/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala
index f76b14e..7507c75 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala
@@ -458,9 +458,7 @@ private class LinearSVCAggregator(
    */
   def add(instance: Instance): this.type = {
     instance match { case Instance(label, weight, features) =>
-      require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0")
-      require(numFeatures == features.size, s"Dimensions mismatch when adding new instance." +
-        s" Expecting $numFeatures but got ${features.size}.")
+
       if (weight == 0.0) return this
       val localFeaturesStd = bcFeaturesStd.value
       val localCoefficients = coefficientsArray
@@ -512,6 +510,7 @@ private class LinearSVCAggregator(
    * @return This LinearSVCAggregator object.
    */
   def merge(other: LinearSVCAggregator): this.type = {
+
     if (other.weightSum != 0.0) {
       weightSum += other.weightSum
       lossSum += other.lossSum

http://git-wip-us.apache.org/repos/asf/spark/blob/e2591c6d/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala 
b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
index f7e3c8f..eaad549 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
@@ -971,9 +971,6 @@ private class LeastSquaresAggregator(
    */
   def add(instance: Instance): this.type = {
     instance match { case Instance(label, weight, features) =>
-      require(dim == features.size, s"Dimensions mismatch when adding new sample." +
-        s" Expecting $dim but got ${features.size}.")
-      require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0")
 
       if (weight == 0.0) return this
 
@@ -1005,8 +1002,6 @@ private class LeastSquaresAggregator(
    * @return This LeastSquaresAggregator object.
    */
   def merge(other: LeastSquaresAggregator): this.type = {
-    require(dim == other.dim, s"Dimensions mismatch when merging with another " +
-      s"LeastSquaresAggregator. Expecting $dim but got ${other.dim}.")
 
     if (other.weightSum != 0) {
       totalCnt += other.totalCnt


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org


