spark git commit: [SPARK-23291][R][FOLLOWUP] Update SparkR migration note for
Repository: spark Updated Branches: refs/heads/master 56a52e0a5 -> 1c9c5de95 [SPARK-23291][R][FOLLOWUP] Update SparkR migration note for ## What changes were proposed in this pull request? This PR fixes the migration note for SPARK-23291 since it's going to be backported to 2.3.1. See the discussion in https://issues.apache.org/jira/browse/SPARK-23291 ## How was this patch tested? N/A Author: hyukjinkwon Closes #21249 from HyukjinKwon/SPARK-23291. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1c9c5de9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1c9c5de9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1c9c5de9 Branch: refs/heads/master Commit: 1c9c5de951ed86290bcd7d8edaab952b8cacd290 Parents: 56a52e0 Author: hyukjinkwon Authored: Mon May 7 14:52:14 2018 -0700 Committer: Yanbo Liang Committed: Mon May 7 14:52:14 2018 -0700 -- docs/sparkr.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1c9c5de9/docs/sparkr.md -- diff --git a/docs/sparkr.md b/docs/sparkr.md index 7fabab5..4faad2c 100644 --- a/docs/sparkr.md +++ b/docs/sparkr.md @@ -664,6 +664,6 @@ You can inspect the search path in R with [`search()`](https://stat.ethz.ch/R-ma - For `summary`, option for statistics to compute has been added. Its output is changed from that of `describe`. - A warning can be raised if versions of SparkR package and the Spark JVM do not match. -## Upgrading to Spark 2.4.0 +## Upgrading to SparkR 2.3.1 and above - - The `start` parameter of `substr` method was wrongly subtracted by one, previously. In other words, the index specified by `start` parameter was considered as 0-base. This can lead to inconsistent substring results and also does not match with the behaviour with `substr` in R. It has been fixed so the `start` parameter of `substr` method is now 1-base, e.g., therefore to get the same result as `substr(df$a, 2, 5)`, it should be changed to `substr(df$a, 1, 4)`. + - In SparkR 2.3.0 and earlier, the `start` parameter of the `substr` method was wrongly subtracted by one and considered as 0-based. This can lead to inconsistent substring results and also does not match the behaviour of `substr` in R. In version 2.3.1 and later, it has been fixed so the `start` parameter of the `substr` method is now 1-based. As an example, `substr(lit('abcdef'), 2, 4)` would result in `abc` in SparkR 2.3.0, and the result would be `bcd` in SparkR 2.3.1.
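For illustration, a minimal PySpark sketch of the 1-based substring semantics that the fixed SparkR `substr` now matches; PySpark's `Column.substr(startPos, length)` has always been 1-based, and the DataFrame here is a stand-in:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
df = spark.range(1).select(lit("abcdef").alias("a"))

# Column.substr(startPos, length) is 1-based: start=2, length=3 -> "bcd",
# the same result SparkR's substr(df$a, 2, 4) returns from 2.3.1 onwards
df.select(df.a.substr(2, 3).alias("s")).show()
```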
spark git commit: [SPARK-23291][SQL][R][BRANCH-2.3] R's substr should not reduce starting position by 1 when calling Scala API
Repository: spark Updated Branches: refs/heads/branch-2.3 f87785a76 -> 3a22feab4 [SPARK-23291][SQL][R][BRANCH-2.3] R's substr should not reduce starting position by 1 when calling Scala API ## What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/commit/24b5c69ee3feded439e5bb6390e4b63f503eeafe and https://github.com/apache/spark/pull/21249 There's no conflict, but I opened this just to run the tests and to be sure. See the discussion in https://issues.apache.org/jira/browse/SPARK-23291 ## How was this patch tested? Jenkins tests. Author: hyukjinkwon Author: Liang-Chi Hsieh Closes #21250 from HyukjinKwon/SPARK-23291-backport. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3a22feab Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3a22feab Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3a22feab Branch: refs/heads/branch-2.3 Commit: 3a22feab4dc9f0cffe3aaec692e27ab277666507 Parents: f87785a Author: hyukjinkwon Authored: Mon May 7 14:48:28 2018 -0700 Committer: Yanbo Liang Committed: Mon May 7 14:48:28 2018 -0700 -- R/pkg/R/column.R | 10 -- R/pkg/tests/fulltests/test_sparkSQL.R | 1 + docs/sparkr.md| 4 3 files changed, 13 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3a22feab/R/pkg/R/column.R -- diff --git a/R/pkg/R/column.R b/R/pkg/R/column.R index 3095adb..3d6d9f9 100644 --- a/R/pkg/R/column.R +++ b/R/pkg/R/column.R @@ -164,12 +164,18 @@ setMethod("alias", #' @aliases substr,Column-method #' #' @param x a Column. -#' @param start starting position. +#' @param start starting position. It should be 1-based. #' @param stop ending position. +#' @examples +#' \dontrun{ +#' df <- createDataFrame(list(list(a="abcdef"))) +#' collect(select(df, substr(df$a, 1, 4))) # the result is `abcd`. +#' collect(select(df, substr(df$a, 2, 4))) # the result is `bcd`. +#' } #' @note substr since 1.4.0 setMethod("substr", signature(x = "Column"), function(x, start, stop) { -jc <- callJMethod(x@jc, "substr", as.integer(start - 1), as.integer(stop - start + 1)) +jc <- callJMethod(x@jc, "substr", as.integer(start), as.integer(stop - start + 1)) column(jc) }) http://git-wip-us.apache.org/repos/asf/spark/blob/3a22feab/R/pkg/tests/fulltests/test_sparkSQL.R -- diff --git a/R/pkg/tests/fulltests/test_sparkSQL.R b/R/pkg/tests/fulltests/test_sparkSQL.R index 5197838..bed26ec 100644 --- a/R/pkg/tests/fulltests/test_sparkSQL.R +++ b/R/pkg/tests/fulltests/test_sparkSQL.R @@ -1649,6 +1649,7 @@ test_that("string operators", { expect_false(first(select(df, startsWith(df$name, "m")))[[1]]) expect_true(first(select(df, endsWith(df$name, "el")))[[1]]) expect_equal(first(select(df, substr(df$name, 1, 2)))[[1]], "Mi") + expect_equal(first(select(df, substr(df$name, 4, 6)))[[1]], "hae") if (as.numeric(R.version$major) >= 3 && as.numeric(R.version$minor) >= 3) { expect_true(startsWith("Hello World", "Hello")) expect_false(endsWith("Hello World", "a")) http://git-wip-us.apache.org/repos/asf/spark/blob/3a22feab/docs/sparkr.md -- diff --git a/docs/sparkr.md b/docs/sparkr.md index 6685b58..73f9424 100644 --- a/docs/sparkr.md +++ b/docs/sparkr.md @@ -663,3 +663,7 @@ You can inspect the search path in R with [`search()`](https://stat.ethz.ch/R-ma - The `stringsAsFactors` parameter was previously ignored with `collect`, for example, in `collect(createDataFrame(iris), stringsAsFactors = TRUE))`. It has been corrected. - For `summary`, option for statistics to compute has been added.
Its output is changed from that of `describe`. - A warning can be raised if versions of SparkR package and the Spark JVM do not match. + +## Upgrading to SparkR 2.3.1 and above + + - In SparkR 2.3.0 and earlier, the `start` parameter of the `substr` method was wrongly subtracted by one and considered as 0-based. This can lead to inconsistent substring results and also does not match the behaviour of `substr` in R. In version 2.3.1 and later, it has been fixed so the `start` parameter of the `substr` method is now 1-based. As an example, `substr(lit('abcdef'), 2, 4)` would result in `abc` in SparkR 2.3.0, and the result would be `bcd` in SparkR 2.3.1.
spark git commit: [SPARK-19634][SQL][ML][FOLLOW-UP] Improve interface of dataframe vectorized summarizer
Repository: spark Updated Branches: refs/heads/master 9c289a5cb -> d3ae3e1e8 [SPARK-19634][SQL][ML][FOLLOW-UP] Improve interface of dataframe vectorized summarizer ## What changes were proposed in this pull request? Make several improvements in dataframe vectorized summarizer. 1. Make the summarizer return `Vector` type for all metrics (except "count"). Previously it returned the "WrappedArray" type, which was not very convenient. 2. Make `MetricsAggregate` inherit the `ImplicitCastInputTypes` trait, so it can check and implicitly cast input values. 3. Add a "weight" parameter for all single-metric methods. 4. Update doc and improve the example code in doc. 5. Simplified test cases. ## How was this patch tested? Test added and simplified. Author: WeichenXu Closes #19156 from WeichenXu123/improve_vec_summarizer. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d3ae3e1e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d3ae3e1e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d3ae3e1e Branch: refs/heads/master Commit: d3ae3e1e894f88a8500752d9633fe9ad00da5f20 Parents: 9c289a5 Author: WeichenXu Authored: Wed Dec 20 19:53:35 2017 -0800 Committer: Yanbo Liang Committed: Wed Dec 20 19:53:35 2017 -0800 -- .../org/apache/spark/ml/stat/Summarizer.scala | 128 --- .../spark/ml/stat/JavaSummarizerSuite.java | 64 .../apache/spark/ml/stat/SummarizerSuite.scala | 362 ++- 3 files changed, 341 insertions(+), 213 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d3ae3e1e/mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala b/mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala index cae41ed..9bed74a 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala @@ -24,7 +24,7 @@ import org.apache.spark.internal.Logging import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} import org.apache.spark.sql.Column import org.apache.spark.sql.catalyst.InternalRow -import org.apache.spark.sql.catalyst.expressions.{Expression, UnsafeArrayData} +import org.apache.spark.sql.catalyst.expressions.{Expression, ImplicitCastInputTypes, UnsafeArrayData} import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, Complete, TypedImperativeAggregate} import org.apache.spark.sql.functions.lit import org.apache.spark.sql.types._ @@ -41,7 +41,7 @@ sealed abstract class SummaryBuilder { /** * Returns an aggregate object that contains the summary of the column with the requested metrics. * @param featuresCol a column that contains features Vector object. - * @param weightCol a column that contains weight value. + * @param weightCol a column that contains weight value. Default weight is 1.0. * @return an aggregate column that contains the statistics. The exact content of this * structure is determined during the creation of the builder. */ @@ -50,6 +50,7 @@ sealed abstract class SummaryBuilder { @Since("2.3.0") def summary(featuresCol: Column): Column = summary(featuresCol, lit(1.0)) + } /** @@ -60,15 +61,18 @@ sealed abstract class SummaryBuilder { * This class lets users pick the statistics they would like to extract for a given column. Here is * an example in Scala: * {{{ - * val dataframe = ...
// Some dataframe containing a feature column - * val allStats = dataframe.select(Summarizer.metrics("min", "max").summary($"features")) - * val Row(Row(min_, max_)) = allStats.first() + * import org.apache.spark.ml.linalg._ + * import org.apache.spark.sql.Row + * val dataframe = ... // Some dataframe containing a feature column and a weight column + * val multiStatsDF = dataframe.select( + * Summarizer.metrics("min", "max", "count").summary($"features", $"weight")) + * val Row(Row(minVec, maxVec, count)) = multiStatsDF.first() * }}} * * If one wants to get a single metric, shortcuts are also available: * {{{ * val meanDF = dataframe.select(Summarizer.mean($"features")) - * val Row(mean_) = meanDF.first() + * val Row(meanVec) = meanDF.first() * }}} * * Note: Currently, the performance of this interface is about 2x~3x slower than using the RDD * interface. @@ -94,8 +98,7 @@ object Summarizer extends Logging { * - min: the minimum for each coefficient. * - normL2: the Euclidean norm for each coefficient. * - normL1: the L1 norm of each coefficient (sum of the absolute values). - * @param firstMetric the metric being
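For reference, a hedged sketch of the same summarizer interface through its Python wrapper (`pyspark.ml.stat.Summarizer` is assumed to be available in your Spark release; it mirrors the Scala API shown above):

```python
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Summarizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(Vectors.dense(1.0, 2.0), 1.0), (Vectors.dense(3.0, 4.0), 2.0)],
    ["features", "weight"])

# multi-metric summary returns a single struct column
df.select(Summarizer.metrics("min", "max", "count")
          .summary(df.features, df.weight).alias("stats")).show(truncate=False)

# single-metric shortcut, with the optional weight column
df.select(Summarizer.mean(df.features, df.weight)).show()
```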
spark git commit: [SPARK-22810][ML][PYSPARK] Expose Python API for LinearRegression with huber loss.
Repository: spark Updated Branches: refs/heads/master 0114c89d0 -> fb0562f34 [SPARK-22810][ML][PYSPARK] Expose Python API for LinearRegression with huber loss. ## What changes were proposed in this pull request? Expose Python API for _LinearRegression_ with _huber_ loss. ## How was this patch tested? Unit test. Author: Yanbo LiangCloses #19994 from yanboliang/spark-22810. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fb0562f3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fb0562f3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fb0562f3 Branch: refs/heads/master Commit: fb0562f34605cd27fd39d09e6664a46e55eac327 Parents: 0114c89 Author: Yanbo Liang Authored: Wed Dec 20 17:51:42 2017 -0800 Committer: Yanbo Liang Committed: Wed Dec 20 17:51:42 2017 -0800 -- .../pyspark/ml/param/_shared_params_code_gen.py | 3 +- python/pyspark/ml/param/shared.py | 23 +++ python/pyspark/ml/regression.py | 64 +++- python/pyspark/ml/tests.py | 21 +++ 4 files changed, 96 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fb0562f3/python/pyspark/ml/param/_shared_params_code_gen.py -- diff --git a/python/pyspark/ml/param/_shared_params_code_gen.py b/python/pyspark/ml/param/_shared_params_code_gen.py index 130d1a0..d55d209 100644 --- a/python/pyspark/ml/param/_shared_params_code_gen.py +++ b/python/pyspark/ml/param/_shared_params_code_gen.py @@ -154,7 +154,8 @@ if __name__ == "__main__": ("aggregationDepth", "suggested depth for treeAggregate (>= 2).", "2", "TypeConverters.toInt"), ("parallelism", "the number of threads to use when running parallel algorithms (>= 1).", - "1", "TypeConverters.toInt")] + "1", "TypeConverters.toInt"), +("loss", "the loss function to be optimized.", None, "TypeConverters.toString")] code = [] for name, doc, defaultValueStr, typeConverter in shared: http://git-wip-us.apache.org/repos/asf/spark/blob/fb0562f3/python/pyspark/ml/param/shared.py -- diff --git a/python/pyspark/ml/param/shared.py b/python/pyspark/ml/param/shared.py index 4041d9c..e5c5ddf 100644 --- a/python/pyspark/ml/param/shared.py +++ b/python/pyspark/ml/param/shared.py @@ -632,6 +632,29 @@ class HasParallelism(Params): return self.getOrDefault(self.parallelism) +class HasLoss(Params): +""" +Mixin for param loss: the loss function to be optimized. +""" + +loss = Param(Params._dummy(), "loss", "the loss function to be optimized.", typeConverter=TypeConverters.toString) + +def __init__(self): +super(HasLoss, self).__init__() + +def setLoss(self, value): +""" +Sets the value of :py:attr:`loss`. +""" +return self._set(loss=value) + +def getLoss(self): +""" +Gets the value of loss or its default value. +""" +return self.getOrDefault(self.loss) + + class DecisionTreeParams(Params): """ Mixin for Decision Tree parameters. 
http://git-wip-us.apache.org/repos/asf/spark/blob/fb0562f3/python/pyspark/ml/regression.py -- diff --git a/python/pyspark/ml/regression.py b/python/pyspark/ml/regression.py index 9d5b768..f0812bd 100644 --- a/python/pyspark/ml/regression.py +++ b/python/pyspark/ml/regression.py @@ -39,23 +39,26 @@ __all__ = ['AFTSurvivalRegression', 'AFTSurvivalRegressionModel', @inherit_doc class LinearRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasMaxIter, HasRegParam, HasTol, HasElasticNetParam, HasFitIntercept, - HasStandardization, HasSolver, HasWeightCol, HasAggregationDepth, + HasStandardization, HasSolver, HasWeightCol, HasAggregationDepth, HasLoss, JavaMLWritable, JavaMLReadable): """ Linear regression. -The learning objective is to minimize the squared error, with regularization. -The specific squared error loss function used is: L = 1/2n ||A coefficients - y||^2^ +The learning objective is to minimize the specified loss function, with regularization. +This supports two kinds of loss: -This supports multiple types of regularization: - - * none (a.k.a. ordinary least squares) +* squaredError (a.k.a squared loss) +* huber (a hybrid of squared error for relatively small errors and absolute error for \ +relatively large ones, and we estimate the scale parameter
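A hedged usage sketch of the Python API exposed by this change; the `epsilon` knob that controls the squared-to-absolute switchover is assumed from the underlying Scala estimator, and the tiny DataFrame is only for illustration:

```python
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
train_df = spark.createDataFrame(
    [(1.0, Vectors.dense(0.0)), (2.0, Vectors.dense(1.0)),
     (3.1, Vectors.dense(2.0)), (100.0, Vectors.dense(3.0))],  # last row: outlier
    ["label", "features"])

# loss="huber" makes the fit robust to the outlier row
lr = LinearRegression(loss="huber", epsilon=1.35, maxIter=50, regParam=0.0)
model = lr.fit(train_df)
print(model.coefficients, model.intercept, model.scale)
```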
spark git commit: [SPARK-3181][ML] Implement huber loss for LinearRegression.
Repository: spark Updated Branches: refs/heads/master 2a29a60da -> 1e44dd004 [SPARK-3181][ML] Implement huber loss for LinearRegression. ## What changes were proposed in this pull request? MLlib ```LinearRegression``` supports _huber_ loss in addition to _leastSquares_ loss. The huber loss objective function is: ![image](https://user-images.githubusercontent.com/1962026/29554124-9544d198-8750-11e7-8afa-33579ec419d5.png) Refer Eq.(6) and Eq.(8) in [A robust hybrid of lasso and ridge regression](http://statweb.stanford.edu/~owen/reports/hhu.pdf). This objective is jointly convex as a function of (w, σ) ∈ R × (0, ∞), so we can use L-BFGS-B to solve it. The current implementation is a straightforward port of Python scikit-learn's [```HuberRegressor```](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.HuberRegressor.html). There are some differences: * We use mean loss (```lossSum/weightSum```), but sklearn uses total loss (```lossSum```). * We multiply the loss function and L2 regularization by 1/2. Multiplying the whole formula by a constant factor does not affect the result; we just keep consistent with the _leastSquares_ loss. So if fitting w/o regularization, MLlib and sklearn produce the same output. If fitting w/ regularization, MLlib should set ```regParam``` divided by the number of instances to match the output of sklearn. ## How was this patch tested? Unit tests. Author: Yanbo Liang Closes #19020 from yanboliang/spark-3181. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1e44dd00 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1e44dd00 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1e44dd00 Branch: refs/heads/master Commit: 1e44dd004425040912f2cf16362d2c13f12e1689 Parents: 2a29a60 Author: Yanbo Liang Authored: Wed Dec 13 21:19:14 2017 -0800 Committer: Yanbo Liang Committed: Wed Dec 13 21:19:14 2017 -0800 -- .../ml/optim/aggregator/HuberAggregator.scala | 150 ++ .../ml/param/shared/SharedParamsCodeGen.scala | 3 +- .../spark/ml/param/shared/sharedParams.scala| 17 ++ .../spark/ml/regression/LinearRegression.scala | 299 +++ .../optim/aggregator/HuberAggregatorSuite.scala | 170 +++ .../ml/regression/LinearRegressionSuite.scala | 244 ++- project/MimaExcludes.scala | 5 + 7 files changed, 823 insertions(+), 65 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1e44dd00/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala b/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala new file mode 100644 index 000..13f64d2 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala @@ -0,0 +1,150 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.optim.aggregator + +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.linalg.Vector + +/** + * HuberAggregator computes the gradient and loss for a huber loss function, + * as used in robust regression for samples in sparse or dense vector in an online fashion. + * + * The huber loss function based on: + * http://statweb.stanford.edu/~owen/reports/hhu.pdf;>Art B. Owen (2006), + * A robust hybrid of lasso and ridge regression. + * + * Two HuberAggregator can be merged together to have a summary of loss and gradient of + * the corresponding joint dataset. + * + * The huber loss function is given by + * + * + * $$ + * \begin{align} + * \min_{w, \sigma}\frac{1}{2n}{\sum_{i=1}^n\left(\sigma + + * H_m\left(\frac{X_{i}w - y_{i}}{\sigma}\right)\sigma\right) + \frac{1}{2}\lambda {||w||_2}^2} + * \end{align} + * $$ + * + * + *
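As a plain-Python illustration of the hybrid penalty H_m referenced in the formula above (not the MLlib implementation, which aggregates losses and gradients distributively):

```python
def huber_penalty(residual, m=1.35):
    # quadratic for |residual| <= m, linear beyond, with the two pieces
    # matched so the penalty and its derivative are continuous at |residual| = m
    r = abs(residual)
    return 0.5 * r * r if r <= m else m * (r - 0.5 * m)
```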
spark git commit: [SPARK-21087][ML][FOLLOWUP] Sync SharedParamsCodeGen and sharedParams.
Repository: spark Updated Branches: refs/heads/master 17cdabb88 -> b03af8b58 [SPARK-21087][ML][FOLLOWUP] Sync SharedParamsCodeGen and sharedParams. ## What changes were proposed in this pull request? #19208 modified ```sharedParams.scala``` directly, but the change was not generated by ```SharedParamsCodeGen.scala```. This introduced a mismatch between them. ## How was this patch tested? Existing test. Author: Yanbo Liang Closes #19958 from yanboliang/spark-21087. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b03af8b5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b03af8b5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b03af8b5 Branch: refs/heads/master Commit: b03af8b582b9b71b09eaf3a1c01d1b3ef5f072e8 Parents: 17cdabb Author: Yanbo Liang Authored: Tue Dec 12 17:37:01 2017 -0800 Committer: Yanbo Liang Committed: Tue Dec 12 17:37:01 2017 -0800 -- .../spark/ml/param/shared/SharedParamsCodeGen.scala | 8 .../org/apache/spark/ml/param/shared/sharedParams.scala | 10 ++ 2 files changed, 10 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b03af8b5/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala b/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala index c540629..a267bbc 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala @@ -84,10 +84,10 @@ private[shared] object SharedParamsCodeGen { ParamDesc[String]("solver", "the solver algorithm for optimization", finalFields = false), ParamDesc[Int]("aggregationDepth", "suggested depth for treeAggregate (>= 2)", Some("2"), isValid = "ParamValidators.gtEq(2)", isExpertParam = true), - ParamDesc[Boolean]("collectSubModels", "If set to false, then only the single best " + -"sub-model will be available after fitting. If set to true, then all sub-models will be " + -"available. Warning: For large models, collecting all sub-models can cause OOMs on the " + -"Spark driver.", + ParamDesc[Boolean]("collectSubModels", "whether to collect a list of sub-models trained " + "during tuning. If set to false, then only the single best sub-model will be available " + "after fitting. If set to true, then all sub-models will be available. Warning: For " + "large models, collecting all sub-models can cause OOMs on the Spark driver", Some("false"), isExpertParam = true) ) http://git-wip-us.apache.org/repos/asf/spark/blob/b03af8b5/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala b/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala index 34aa38a..0004f08 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala @@ -470,15 +470,17 @@ trait HasAggregationDepth extends Params { } /** - * Trait for shared param collectSubModels (default: false). + * Trait for shared param collectSubModels (default: false). This trait may be changed or + * removed between minor versions. */ -private[ml] trait HasCollectSubModels extends Params { +@DeveloperApi +trait HasCollectSubModels extends Params { /** - * Param for whether to collect a list of sub-models trained during tuning.
+ * Param for whether to collect a list of sub-models trained during tuning. If set to false, then only the single best sub-model will be available after fitting. If set to true, then all sub-models will be available. Warning: For large models, collecting all sub-models can cause OOMs on the Spark driver. * @group expertParam */ - final val collectSubModels: BooleanParam = new BooleanParam(this, "collectSubModels", "whether to collect a list of sub-models trained during tuning") + final val collectSubModels: BooleanParam = new BooleanParam(this, "collectSubModels", "whether to collect a list of sub-models trained during tuning. If set to false, then only the single best sub-model will be available after fitting. If set to true, then all sub-models will be available. Warning: For large models, collecting all sub-models can cause OOMs on the Spark driver") setDefault(collectSubModels, false)
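A hedged sketch of how the `collectSubModels` param is meant to be used from the tuning API, shown through the Python wrapper, which is assumed to expose the param in your Spark release; `train_df` is a placeholder DataFrame:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    collectSubModels=True)  # keeps all sub-models; can OOM the driver
cvModel = cv.fit(train_df)  # train_df is a placeholder DataFrame
print(cvModel.subModels)    # one list of sub-models per fold
```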
spark git commit: [SPARK-22289][ML] Add JSON support for Matrix parameters (LR with coefficients bound)
Repository: spark Updated Branches: refs/heads/branch-2.2 9e2d96d1d -> 00cdb38dc [SPARK-22289][ML] Add JSON support for Matrix parameters (LR with coefficients bound) ## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-22289 add JSON encoding/decoding for Param[Matrix]. The issue was reported by Nic Eggert during saving LR model with LowerBoundsOnCoefficients. There're two ways to resolve this as I see: 1. Support save/load on LogisticRegressionParams, and also adjust the save/load in LogisticRegression and LogisticRegressionModel. 2. Directly support Matrix in Param.jsonEncode, similar to what we have done for Vector. After some discussion in jira, we prefer the fix to support Matrix as a valid Param type, for simplicity and convenience for other classes. Note that in the implementation, I added a "class" field in the JSON object to match different JSON converters when loading, which is for preciseness and future extension. ## How was this patch tested? new unit test to cover the LR case and JsonMatrixConverter Author: Yuhao YangCloses #19525 from hhbyyh/lrsave. (cherry picked from commit 10c27a6559803797e89c28ced11c1087127b82eb) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/00cdb38d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/00cdb38d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/00cdb38d Branch: refs/heads/branch-2.2 Commit: 00cdb38dcd0f617de7f0559214a8b1a35e9b179c Parents: 9e2d96d Author: Yuhao Yang Authored: Tue Dec 12 11:27:01 2017 -0800 Committer: Yanbo Liang Committed: Tue Dec 12 11:27:40 2017 -0800 -- .../org/apache/spark/ml/linalg/Matrices.scala | 7 ++ .../spark/ml/linalg/JsonMatrixConverter.scala | 79 .../org/apache/spark/ml/param/params.scala | 36 +++-- .../LogisticRegressionSuite.scala | 11 +++ .../ml/linalg/JsonMatrixConverterSuite.scala| 45 +++ 5 files changed, 170 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/00cdb38d/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala -- diff --git a/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala b/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala index 07f3bc2..ed3e493 100644 --- a/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala +++ b/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala @@ -476,6 +476,9 @@ class DenseMatrix @Since("2.0.0") ( @Since("2.0.0") object DenseMatrix { + private[ml] def unapply(dm: DenseMatrix): Option[(Int, Int, Array[Double], Boolean)] = +Some((dm.numRows, dm.numCols, dm.values, dm.isTransposed)) + /** * Generate a `DenseMatrix` consisting of zeros. * @param numRows number of rows of the matrix @@ -827,6 +830,10 @@ class SparseMatrix @Since("2.0.0") ( @Since("2.0.0") object SparseMatrix { + private[ml] def unapply( + sm: SparseMatrix): Option[(Int, Int, Array[Int], Array[Int], Array[Double], Boolean)] = +Some((sm.numRows, sm.numCols, sm.colPtrs, sm.rowIndices, sm.values, sm.isTransposed)) + /** * Generate a `SparseMatrix` from Coordinate List (COO) format. Input must be an array of * (i, j, value) tuples. 
Entries that have duplicate values of i and j are http://git-wip-us.apache.org/repos/asf/spark/blob/00cdb38d/mllib/src/main/scala/org/apache/spark/ml/linalg/JsonMatrixConverter.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/linalg/JsonMatrixConverter.scala b/mllib/src/main/scala/org/apache/spark/ml/linalg/JsonMatrixConverter.scala new file mode 100644 index 000..0bee643 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/linalg/JsonMatrixConverter.scala @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific
spark git commit: [SPARK-22289][ML] Add JSON support for Matrix parameters (LR with coefficients bound)
Repository: spark Updated Branches: refs/heads/master e6dc5f280 -> 10c27a655 [SPARK-22289][ML] Add JSON support for Matrix parameters (LR with coefficients bound) ## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-22289 add JSON encoding/decoding for Param[Matrix]. The issue was reported by Nic Eggert during saving LR model with LowerBoundsOnCoefficients. There're two ways to resolve this as I see: 1. Support save/load on LogisticRegressionParams, and also adjust the save/load in LogisticRegression and LogisticRegressionModel. 2. Directly support Matrix in Param.jsonEncode, similar to what we have done for Vector. After some discussion in jira, we prefer the fix to support Matrix as a valid Param type, for simplicity and convenience for other classes. Note that in the implementation, I added a "class" field in the JSON object to match different JSON converters when loading, which is for preciseness and future extension. ## How was this patch tested? new unit test to cover the LR case and JsonMatrixConverter Author: Yuhao YangCloses #19525 from hhbyyh/lrsave. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/10c27a65 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/10c27a65 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/10c27a65 Branch: refs/heads/master Commit: 10c27a6559803797e89c28ced11c1087127b82eb Parents: e6dc5f2 Author: Yuhao Yang Authored: Tue Dec 12 11:27:01 2017 -0800 Committer: Yanbo Liang Committed: Tue Dec 12 11:27:01 2017 -0800 -- .../org/apache/spark/ml/linalg/Matrices.scala | 7 ++ .../spark/ml/linalg/JsonMatrixConverter.scala | 79 .../org/apache/spark/ml/param/params.scala | 36 +++-- .../LogisticRegressionSuite.scala | 11 +++ .../ml/linalg/JsonMatrixConverterSuite.scala| 45 +++ 5 files changed, 170 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/10c27a65/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala -- diff --git a/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala b/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala index 66c5362..14428c6 100644 --- a/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala +++ b/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala @@ -476,6 +476,9 @@ class DenseMatrix @Since("2.0.0") ( @Since("2.0.0") object DenseMatrix { + private[ml] def unapply(dm: DenseMatrix): Option[(Int, Int, Array[Double], Boolean)] = +Some((dm.numRows, dm.numCols, dm.values, dm.isTransposed)) + /** * Generate a `DenseMatrix` consisting of zeros. * @param numRows number of rows of the matrix @@ -827,6 +830,10 @@ class SparseMatrix @Since("2.0.0") ( @Since("2.0.0") object SparseMatrix { + private[ml] def unapply( + sm: SparseMatrix): Option[(Int, Int, Array[Int], Array[Int], Array[Double], Boolean)] = +Some((sm.numRows, sm.numCols, sm.colPtrs, sm.rowIndices, sm.values, sm.isTransposed)) + /** * Generate a `SparseMatrix` from Coordinate List (COO) format. Input must be an array of * (i, j, value) tuples. 
Entries that have duplicate values of i and j are http://git-wip-us.apache.org/repos/asf/spark/blob/10c27a65/mllib/src/main/scala/org/apache/spark/ml/linalg/JsonMatrixConverter.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/linalg/JsonMatrixConverter.scala b/mllib/src/main/scala/org/apache/spark/ml/linalg/JsonMatrixConverter.scala new file mode 100644 index 000..0bee643 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/linalg/JsonMatrixConverter.scala @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.linalg + +import
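A hedged sketch of the scenario this fix enables: persisting an estimator whose `lowerBoundsOnCoefficients` is a `Matrix` param. The Python wrapper delegates to the same Scala `jsonEncode` path; the feature count of 3 and the save path are assumptions for illustration:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Matrices

# binomial LR expects a 1 x numFeatures bounds matrix (3 features assumed here)
lr = LogisticRegression(
    lowerBoundsOnCoefficients=Matrices.dense(1, 3, [0.0, 0.0, 0.0]))

lr.write().overwrite().save("/tmp/bounded_lr")  # Matrix param is JSON-encoded
loaded = LogisticRegression.load("/tmp/bounded_lr")
print(loaded.getLowerBoundsOnCoefficients())
```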
spark git commit: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSuite test data to data/mllib.
Repository: spark Updated Branches: refs/heads/master 7475a9655 -> 3da3d7635 [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSuite test data to data/mllib. ## What changes were proposed in this pull request? Move ```ClusteringEvaluatorSuite``` test data(iris) to data/mllib, to prevent from re-creating a new folder. ## How was this patch tested? Existing tests. Author: Yanbo LiangCloses #19648 from yanboliang/spark-14516. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3da3d763 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3da3d763 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3da3d763 Branch: refs/heads/master Commit: 3da3d76352cc471252a54088cc55208bb4ea5b3a Parents: 7475a96 Author: Yanbo Liang Authored: Tue Nov 7 20:07:30 2017 -0800 Committer: Yanbo Liang Committed: Tue Nov 7 20:07:30 2017 -0800 -- data/mllib/iris_libsvm.txt | 150 +++ mllib/src/test/resources/test-data/iris.libsvm | 150 --- .../evaluation/ClusteringEvaluatorSuite.scala | 30 ++-- 3 files changed, 161 insertions(+), 169 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3da3d763/data/mllib/iris_libsvm.txt -- diff --git a/data/mllib/iris_libsvm.txt b/data/mllib/iris_libsvm.txt new file mode 100644 index 000..db95901 --- /dev/null +++ b/data/mllib/iris_libsvm.txt @@ -0,0 +1,150 @@ +0.0 1:5.1 2:3.5 3:1.4 4:0.2 +0.0 1:4.9 2:3.0 3:1.4 4:0.2 +0.0 1:4.7 2:3.2 3:1.3 4:0.2 +0.0 1:4.6 2:3.1 3:1.5 4:0.2 +0.0 1:5.0 2:3.6 3:1.4 4:0.2 +0.0 1:5.4 2:3.9 3:1.7 4:0.4 +0.0 1:4.6 2:3.4 3:1.4 4:0.3 +0.0 1:5.0 2:3.4 3:1.5 4:0.2 +0.0 1:4.4 2:2.9 3:1.4 4:0.2 +0.0 1:4.9 2:3.1 3:1.5 4:0.1 +0.0 1:5.4 2:3.7 3:1.5 4:0.2 +0.0 1:4.8 2:3.4 3:1.6 4:0.2 +0.0 1:4.8 2:3.0 3:1.4 4:0.1 +0.0 1:4.3 2:3.0 3:1.1 4:0.1 +0.0 1:5.8 2:4.0 3:1.2 4:0.2 +0.0 1:5.7 2:4.4 3:1.5 4:0.4 +0.0 1:5.4 2:3.9 3:1.3 4:0.4 +0.0 1:5.1 2:3.5 3:1.4 4:0.3 +0.0 1:5.7 2:3.8 3:1.7 4:0.3 +0.0 1:5.1 2:3.8 3:1.5 4:0.3 +0.0 1:5.4 2:3.4 3:1.7 4:0.2 +0.0 1:5.1 2:3.7 3:1.5 4:0.4 +0.0 1:4.6 2:3.6 3:1.0 4:0.2 +0.0 1:5.1 2:3.3 3:1.7 4:0.5 +0.0 1:4.8 2:3.4 3:1.9 4:0.2 +0.0 1:5.0 2:3.0 3:1.6 4:0.2 +0.0 1:5.0 2:3.4 3:1.6 4:0.4 +0.0 1:5.2 2:3.5 3:1.5 4:0.2 +0.0 1:5.2 2:3.4 3:1.4 4:0.2 +0.0 1:4.7 2:3.2 3:1.6 4:0.2 +0.0 1:4.8 2:3.1 3:1.6 4:0.2 +0.0 1:5.4 2:3.4 3:1.5 4:0.4 +0.0 1:5.2 2:4.1 3:1.5 4:0.1 +0.0 1:5.5 2:4.2 3:1.4 4:0.2 +0.0 1:4.9 2:3.1 3:1.5 4:0.1 +0.0 1:5.0 2:3.2 3:1.2 4:0.2 +0.0 1:5.5 2:3.5 3:1.3 4:0.2 +0.0 1:4.9 2:3.1 3:1.5 4:0.1 +0.0 1:4.4 2:3.0 3:1.3 4:0.2 +0.0 1:5.1 2:3.4 3:1.5 4:0.2 +0.0 1:5.0 2:3.5 3:1.3 4:0.3 +0.0 1:4.5 2:2.3 3:1.3 4:0.3 +0.0 1:4.4 2:3.2 3:1.3 4:0.2 +0.0 1:5.0 2:3.5 3:1.6 4:0.6 +0.0 1:5.1 2:3.8 3:1.9 4:0.4 +0.0 1:4.8 2:3.0 3:1.4 4:0.3 +0.0 1:5.1 2:3.8 3:1.6 4:0.2 +0.0 1:4.6 2:3.2 3:1.4 4:0.2 +0.0 1:5.3 2:3.7 3:1.5 4:0.2 +0.0 1:5.0 2:3.3 3:1.4 4:0.2 +1.0 1:7.0 2:3.2 3:4.7 4:1.4 +1.0 1:6.4 2:3.2 3:4.5 4:1.5 +1.0 1:6.9 2:3.1 3:4.9 4:1.5 +1.0 1:5.5 2:2.3 3:4.0 4:1.3 +1.0 1:6.5 2:2.8 3:4.6 4:1.5 +1.0 1:5.7 2:2.8 3:4.5 4:1.3 +1.0 1:6.3 2:3.3 3:4.7 4:1.6 +1.0 1:4.9 2:2.4 3:3.3 4:1.0 +1.0 1:6.6 2:2.9 3:4.6 4:1.3 +1.0 1:5.2 2:2.7 3:3.9 4:1.4 +1.0 1:5.0 2:2.0 3:3.5 4:1.0 +1.0 1:5.9 2:3.0 3:4.2 4:1.5 +1.0 1:6.0 2:2.2 3:4.0 4:1.0 +1.0 1:6.1 2:2.9 3:4.7 4:1.4 +1.0 1:5.6 2:2.9 3:3.6 4:1.3 +1.0 1:6.7 2:3.1 3:4.4 4:1.4 +1.0 1:5.6 2:3.0 3:4.5 4:1.5 +1.0 1:5.8 2:2.7 3:4.1 4:1.0 +1.0 1:6.2 2:2.2 3:4.5 4:1.5 +1.0 1:5.6 2:2.5 3:3.9 4:1.1 +1.0 1:5.9 2:3.2 3:4.8 4:1.8 +1.0 1:6.1 2:2.8 3:4.0 4:1.3 +1.0 1:6.3 2:2.5 3:4.9 4:1.5 +1.0 1:6.1 2:2.8 3:4.7 4:1.2 +1.0 1:6.4 2:2.9 3:4.3 4:1.3 
+1.0 1:6.6 2:3.0 3:4.4 4:1.4 +1.0 1:6.8 2:2.8 3:4.8 4:1.4 +1.0 1:6.7 2:3.0 3:5.0 4:1.7 +1.0 1:6.0 2:2.9 3:4.5 4:1.5 +1.0 1:5.7 2:2.6 3:3.5 4:1.0 +1.0 1:5.5 2:2.4 3:3.8 4:1.1 +1.0 1:5.5 2:2.4 3:3.7 4:1.0 +1.0 1:5.8 2:2.7 3:3.9 4:1.2 +1.0 1:6.0 2:2.7 3:5.1 4:1.6 +1.0 1:5.4 2:3.0 3:4.5 4:1.5 +1.0 1:6.0 2:3.4 3:4.5 4:1.6 +1.0 1:6.7 2:3.1 3:4.7 4:1.5 +1.0 1:6.3 2:2.3 3:4.4 4:1.3 +1.0 1:5.6 2:3.0 3:4.1 4:1.3 +1.0 1:5.5 2:2.5 3:4.0 4:1.3 +1.0 1:5.5 2:2.6 3:4.4 4:1.2 +1.0 1:6.1 2:3.0 3:4.6 4:1.4 +1.0 1:5.8 2:2.6 3:4.0 4:1.2 +1.0 1:5.0 2:2.3 3:3.3 4:1.0 +1.0 1:5.6 2:2.7 3:4.2 4:1.3 +1.0 1:5.7 2:3.0 3:4.2 4:1.2 +1.0 1:5.7 2:2.9 3:4.2 4:1.3 +1.0 1:6.2 2:2.9 3:4.3 4:1.3 +1.0 1:5.1 2:2.5 3:3.0 4:1.1 +1.0 1:5.7 2:2.8 3:4.1 4:1.3 +2.0 1:6.3 2:3.3 3:6.0 4:2.5 +2.0 1:5.8 2:2.7 3:5.1 4:1.9 +2.0 1:7.1 2:3.0 3:5.9 4:2.1 +2.0 1:6.3 2:2.9 3:5.6 4:1.8 +2.0 1:6.5 2:3.0 3:5.8 4:2.2 +2.0 1:7.6 2:3.0 3:6.6 4:2.1 +2.0 1:4.9 2:2.5 3:4.5 4:1.7 +2.0 1:7.3 2:2.9 3:6.3 4:1.8 +2.0 1:6.7 2:2.5 3:5.8 4:1.8 +2.0 1:7.2 2:3.6 3:6.1 4:2.5 +2.0 1:6.5 2:3.2 3:5.1 4:2.0 +2.0 1:6.4 2:2.7 3:5.3 4:1.9 +2.0 1:6.8 2:3.0 3:5.5
spark git commit: [SPARK-21981][PYTHON][ML] Added Python interface for ClusteringEvaluator
Repository: spark Updated Branches: refs/heads/master fedf6961b -> 5ac96854c [SPARK-21981][PYTHON][ML] Added Python interface for ClusteringEvaluator ## What changes were proposed in this pull request? Added Python interface for ClusteringEvaluator ## How was this patch tested? Manual test, eg. the example Python code in the comments. cc yanboliang Author: Marco GaidoAuthor: Marco Gaido Closes #19204 from mgaido91/SPARK-21981. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5ac96854 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5ac96854 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5ac96854 Branch: refs/heads/master Commit: 5ac96854cc6186fa2dad602d0906ff2705e3f610 Parents: fedf696 Author: Marco Gaido Authored: Fri Sep 22 13:12:33 2017 +0800 Committer: Yanbo Liang Committed: Fri Sep 22 13:12:33 2017 +0800 -- python/pyspark/ml/evaluation.py | 76 +++- 1 file changed, 74 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5ac96854/python/pyspark/ml/evaluation.py -- diff --git a/python/pyspark/ml/evaluation.py b/python/pyspark/ml/evaluation.py index 09cdf9b..aa8dbe7 100644 --- a/python/pyspark/ml/evaluation.py +++ b/python/pyspark/ml/evaluation.py @@ -20,12 +20,13 @@ from abc import abstractmethod, ABCMeta from pyspark import since, keyword_only from pyspark.ml.wrapper import JavaParams from pyspark.ml.param import Param, Params, TypeConverters -from pyspark.ml.param.shared import HasLabelCol, HasPredictionCol, HasRawPredictionCol +from pyspark.ml.param.shared import HasLabelCol, HasPredictionCol, HasRawPredictionCol, \ +HasFeaturesCol from pyspark.ml.common import inherit_doc from pyspark.ml.util import JavaMLReadable, JavaMLWritable __all__ = ['Evaluator', 'BinaryClassificationEvaluator', 'RegressionEvaluator', - 'MulticlassClassificationEvaluator'] + 'MulticlassClassificationEvaluator', 'ClusteringEvaluator'] @inherit_doc @@ -325,6 +326,77 @@ class MulticlassClassificationEvaluator(JavaEvaluator, HasLabelCol, HasPredictio kwargs = self._input_kwargs return self._set(**kwargs) + +@inherit_doc +class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol, + JavaMLReadable, JavaMLWritable): +""" +.. note:: Experimental + +Evaluator for Clustering results, which expects two input +columns: prediction and features. + +>>> from pyspark.ml.linalg import Vectors +>>> featureAndPredictions = map(lambda x: (Vectors.dense(x[0]), x[1]), +... [([0.0, 0.5], 0.0), ([0.5, 0.0], 0.0), ([10.0, 11.0], 1.0), +... ([10.5, 11.5], 1.0), ([1.0, 1.0], 0.0), ([8.0, 6.0], 1.0)]) +>>> dataset = spark.createDataFrame(featureAndPredictions, ["features", "prediction"]) +... +>>> evaluator = ClusteringEvaluator(predictionCol="prediction") +>>> evaluator.evaluate(dataset) +0.9079... +>>> ce_path = temp_path + "/ce" +>>> evaluator.save(ce_path) +>>> evaluator2 = ClusteringEvaluator.load(ce_path) +>>> str(evaluator2.getPredictionCol()) +'prediction' + +.. 
versionadded:: 2.3.0 +""" +metricName = Param(Params._dummy(), "metricName", + "metric name in evaluation (silhouette)", + typeConverter=TypeConverters.toString) + +@keyword_only +def __init__(self, predictionCol="prediction", featuresCol="features", + metricName="silhouette"): +""" +__init__(self, predictionCol="prediction", featuresCol="features", \ + metricName="silhouette") +""" +super(ClusteringEvaluator, self).__init__() +self._java_obj = self._new_java_obj( +"org.apache.spark.ml.evaluation.ClusteringEvaluator", self.uid) +self._setDefault(metricName="silhouette") +kwargs = self._input_kwargs +self._set(**kwargs) + +@since("2.3.0") +def setMetricName(self, value): +""" +Sets the value of :py:attr:`metricName`. +""" +return self._set(metricName=value) + +@since("2.3.0") +def getMetricName(self): +""" +Gets the value of metricName or its default value. +""" +return self.getOrDefault(self.metricName) + +@keyword_only +@since("2.3.0") +def setParams(self, predictionCol="prediction", featuresCol="features", + metricName="silhouette"): +""" +setParams(self, predictionCol="prediction", featuresCol="features", \ + metricName="silhouette") +
spark git commit: [MINOR][ML] Remove unnecessary default value setting for evaluators.
Repository: spark Updated Branches: refs/heads/master 8319432af -> 2f962422a [MINOR][ML] Remove unnecessary default value setting for evaluators. ## What changes were proposed in this pull request? Remove unnecessary default value setting for all evaluators, as we have set them in corresponding _HasXXX_ base classes. ## How was this patch tested? Existing tests. Author: Yanbo Liang Closes #19262 from yanboliang/evaluation. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2f962422 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2f962422 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2f962422 Branch: refs/heads/master Commit: 2f962422a25020582c915e15819f91f43c0b9d68 Parents: 8319432 Author: Yanbo Liang Authored: Tue Sep 19 22:22:35 2017 +0800 Committer: Yanbo Liang Committed: Tue Sep 19 22:22:35 2017 +0800 -- python/pyspark/ml/evaluation.py | 9 +++-- 1 file changed, 3 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2f962422/python/pyspark/ml/evaluation.py -- diff --git a/python/pyspark/ml/evaluation.py b/python/pyspark/ml/evaluation.py index 7cb8d62..09cdf9b 100644 --- a/python/pyspark/ml/evaluation.py +++ b/python/pyspark/ml/evaluation.py @@ -146,8 +146,7 @@ class BinaryClassificationEvaluator(JavaEvaluator, HasLabelCol, HasRawPrediction super(BinaryClassificationEvaluator, self).__init__() self._java_obj = self._new_java_obj( "org.apache.spark.ml.evaluation.BinaryClassificationEvaluator", self.uid) -self._setDefault(rawPredictionCol="rawPrediction", labelCol="label", - metricName="areaUnderROC") +self._setDefault(metricName="areaUnderROC") kwargs = self._input_kwargs self._set(**kwargs) @@ -224,8 +223,7 @@ class RegressionEvaluator(JavaEvaluator, HasLabelCol, HasPredictionCol, super(RegressionEvaluator, self).__init__() self._java_obj = self._new_java_obj( "org.apache.spark.ml.evaluation.RegressionEvaluator", self.uid) -self._setDefault(predictionCol="prediction", labelCol="label", - metricName="rmse") +self._setDefault(metricName="rmse") kwargs = self._input_kwargs self._set(**kwargs) @@ -297,8 +295,7 @@ class MulticlassClassificationEvaluator(JavaEvaluator, HasLabelCol, HasPredictio super(MulticlassClassificationEvaluator, self).__init__() self._java_obj = self._new_java_obj( "org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator", self.uid) -self._setDefault(predictionCol="prediction", labelCol="label", - metricName="f1") +self._setDefault(metricName="f1") kwargs = self._input_kwargs self._set(**kwargs)
spark git commit: [SPARK-18608][ML][FOLLOWUP] Fix double caching for PySpark OneVsRest.
Repository: spark Updated Branches: refs/heads/branch-2.2 3a692e355 -> 51e5a821d [SPARK-18608][ML][FOLLOWUP] Fix double caching for PySpark OneVsRest. ## What changes were proposed in this pull request? #19197 fixed double caching for MLlib algorithms, but missed PySpark ```OneVsRest```; this PR fixes it. ## How was this patch tested? Existing tests. Author: Yanbo Liang Closes #19220 from yanboliang/SPARK-18608. (cherry picked from commit c76153cc7dd25b8de5266fe119095066be7f78f5) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/51e5a821 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/51e5a821 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/51e5a821 Branch: refs/heads/branch-2.2 Commit: 51e5a821dcaa1d5f529afafc88cb8cfb4ad48e09 Parents: 3a692e3 Author: Yanbo Liang Authored: Thu Sep 14 14:09:44 2017 +0800 Committer: Yanbo Liang Committed: Thu Sep 14 14:10:10 2017 +0800 -- python/pyspark/ml/classification.py | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/51e5a821/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index 80bb054..ea6800a 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -1576,8 +1576,7 @@ class OneVsRest(Estimator, OneVsRestParams, MLReadable, MLWritable): multiclassLabeled = dataset.select(labelCol, featuresCol) # persist if underlying dataset is not persistent. -handlePersistence = \ -dataset.rdd.getStorageLevel() == StorageLevel(False, False, False, False) +handlePersistence = dataset.storageLevel == StorageLevel(False, False, False, False) if handlePersistence: multiclassLabeled.persist(StorageLevel.MEMORY_AND_DISK) @@ -1690,8 +1689,7 @@ class OneVsRestModel(Model, OneVsRestParams, MLReadable, MLWritable): newDataset = dataset.withColumn(accColName, initUDF(dataset[origCols[0]])) # persist if underlying dataset is not persistent. -handlePersistence = \ -dataset.rdd.getStorageLevel() == StorageLevel(False, False, False, False) +handlePersistence = dataset.storageLevel == StorageLevel(False, False, False, False) if handlePersistence: newDataset.persist(StorageLevel.MEMORY_AND_DISK)
spark git commit: [SPARK-18608][ML][FOLLOWUP] Fix double caching for PySpark OneVsRest.
Repository: spark Updated Branches: refs/heads/master 66cb72d7b -> c76153cc7 [SPARK-18608][ML][FOLLOWUP] Fix double caching for PySpark OneVsRest. ## What changes were proposed in this pull request? #19197 fixed double caching for MLlib algorithms, but missed PySpark ```OneVsRest```; this PR fixes it. ## How was this patch tested? Existing tests. Author: Yanbo Liang Closes #19220 from yanboliang/SPARK-18608. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c76153cc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c76153cc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c76153cc Branch: refs/heads/master Commit: c76153cc7dd25b8de5266fe119095066be7f78f5 Parents: 66cb72d Author: Yanbo Liang Authored: Thu Sep 14 14:09:44 2017 +0800 Committer: Yanbo Liang Committed: Thu Sep 14 14:09:44 2017 +0800 -- python/pyspark/ml/classification.py | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c76153cc/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index 0caafa6..27ad1e8 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -1773,8 +1773,7 @@ class OneVsRest(Estimator, OneVsRestParams, HasParallelism, JavaMLReadable, Java multiclassLabeled = dataset.select(labelCol, featuresCol) # persist if underlying dataset is not persistent. -handlePersistence = \ -dataset.rdd.getStorageLevel() == StorageLevel(False, False, False, False) +handlePersistence = dataset.storageLevel == StorageLevel(False, False, False, False) if handlePersistence: multiclassLabeled.persist(StorageLevel.MEMORY_AND_DISK) @@ -1928,8 +1927,7 @@ class OneVsRestModel(Model, OneVsRestParams, JavaMLReadable, JavaMLWritable): newDataset = dataset.withColumn(accColName, initUDF(dataset[origCols[0]])) # persist if underlying dataset is not persistent. -handlePersistence = \ -dataset.rdd.getStorageLevel() == StorageLevel(False, False, False, False) +handlePersistence = dataset.storageLevel == StorageLevel(False, False, False, False) if handlePersistence: newDataset.persist(StorageLevel.MEMORY_AND_DISK)
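The core of the fix is the caching guard: `dataset.storageLevel` sees a DataFrame-level `cache()`, while the old `dataset.rdd.getStorageLevel()` did not, so an already-cached input could be persisted a second time. A minimal sketch of the corrected pattern (the training body here is a stand-in, not the OneVsRest code):

```python
from pyspark import StorageLevel

def fit_with_guard(dataset):
    # StorageLevel(False, False, False, False) means "not persisted at all"
    handle_persistence = dataset.storageLevel == StorageLevel(False, False, False, False)
    if handle_persistence:
        dataset.persist(StorageLevel.MEMORY_AND_DISK)
    try:
        return dataset.count()  # stand-in for the per-class training loop
    finally:
        if handle_persistence:
            dataset.unpersist()
```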
spark git commit: [MINOR][DOC] Add missing call of `update()` in examples of PeriodicGraphCheckpointer & PeriodicRDDCheckpointer
Repository: spark Updated Branches: refs/heads/master 8d8641f12 -> 66cb72d7b [MINOR][DOC] Add missing call of `update()` in examples of PeriodicGraphCheckpointer & PeriodicRDDCheckpointer ## What changes were proposed in this pull request? The examples forgot to call `update()` with `graph1` & `rdd1` in the docs for `PeriodicGraphCheckpointer` & `PeriodicRDDCheckpointer`. ## How was this patch tested? Existing tests. Author: Zheng RuiFeng Closes #19198 from zhengruifeng/fix_doc_checkpointer. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/66cb72d7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/66cb72d7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/66cb72d7 Branch: refs/heads/master Commit: 66cb72d7b9178774ba253e244bb2eddb1345b21f Parents: 8d8641f Author: Zheng RuiFeng Authored: Thu Sep 14 14:04:43 2017 +0800 Committer: Yanbo Liang Committed: Thu Sep 14 14:04:43 2017 +0800 -- .../scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala | 1 + .../org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala| 1 + 2 files changed, 2 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/66cb72d7/core/src/main/scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala -- diff --git a/core/src/main/scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala b/core/src/main/scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala index ab72add..facbb83 100644 --- a/core/src/main/scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala +++ b/core/src/main/scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala @@ -50,6 +50,7 @@ import org.apache.spark.util.PeriodicCheckpointer * {{{ * val (rdd1, rdd2, rdd3, ...) = ... * val cp = new PeriodicRDDCheckpointer(2, sc) + * cp.update(rdd1) * rdd1.count(); * // persisted: rdd1 * cp.update(rdd2) http://git-wip-us.apache.org/repos/asf/spark/blob/66cb72d7/graphx/src/main/scala/org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala -- diff --git a/graphx/src/main/scala/org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala b/graphx/src/main/scala/org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala index fda501a..539b66f 100644 --- a/graphx/src/main/scala/org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala +++ b/graphx/src/main/scala/org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala @@ -50,6 +50,7 @@ import org.apache.spark.util.PeriodicCheckpointer * {{{ * val (graph1, graph2, graph3, ...) = ... * val cp = new PeriodicGraphCheckpointer(2, sc) + * cp.updateGraph(graph1) * graph1.vertices.count(); graph1.edges.count() * // persisted: graph1 * cp.updateGraph(graph2)
spark git commit: [SPARK-21854] Added LogisticRegressionTrainingSummary for MultinomialLogisticRegression in Python API
Repository: spark Updated Branches: refs/heads/master dcbb22943 -> 8d8641f12 [SPARK-21854] Added LogisticRegressionTrainingSummary for MultinomialLogisticRegression in Python API ## What changes were proposed in this pull request? Added LogisticRegressionTrainingSummary for MultinomialLogisticRegression in Python API ## How was this patch tested? Added unit test. Author: Ming Jiang Author: Ming Jiang Author: jmwdpk Closes #19185 from jmwdpk/SPARK-21854. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8d8641f1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8d8641f1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8d8641f1 Branch: refs/heads/master Commit: 8d8641f12250b0a9d370ff9354407c27af7cfcf4 Parents: dcbb229 Author: Ming Jiang Authored: Thu Sep 14 13:53:28 2017 +0800 Committer: Yanbo Liang Committed: Thu Sep 14 13:53:28 2017 +0800 -- .../LogisticRegressionSuite.scala | 12 ++ python/pyspark/ml/classification.py | 120 ++- python/pyspark/ml/tests.py | 55 - 3 files changed, 183 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8d8641f1/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala index d43c7cd..14f5508 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala @@ -2416,6 +2416,18 @@ class LogisticRegressionSuite blorSummary.recallByThreshold.collect() === sameBlorSummary.recallByThreshold.collect()) assert( blorSummary.precisionByThreshold.collect() === sameBlorSummary.precisionByThreshold.collect()) +assert(blorSummary.labels === sameBlorSummary.labels) +assert(blorSummary.truePositiveRateByLabel === sameBlorSummary.truePositiveRateByLabel) +assert(blorSummary.falsePositiveRateByLabel === sameBlorSummary.falsePositiveRateByLabel) +assert(blorSummary.precisionByLabel === sameBlorSummary.precisionByLabel) +assert(blorSummary.recallByLabel === sameBlorSummary.recallByLabel) +assert(blorSummary.fMeasureByLabel === sameBlorSummary.fMeasureByLabel) +assert(blorSummary.accuracy === sameBlorSummary.accuracy) +assert(blorSummary.weightedTruePositiveRate === sameBlorSummary.weightedTruePositiveRate) +assert(blorSummary.weightedFalsePositiveRate === sameBlorSummary.weightedFalsePositiveRate) +assert(blorSummary.weightedRecall === sameBlorSummary.weightedRecall) +assert(blorSummary.weightedPrecision === sameBlorSummary.weightedPrecision) +assert(blorSummary.weightedFMeasure === sameBlorSummary.weightedFMeasure) lr.setFamily("multinomial") val mlorModel = lr.fit(smallMultinomialDataset) http://git-wip-us.apache.org/repos/asf/spark/blob/8d8641f1/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index fbb9e7f..0caafa6 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -529,9 +529,11 @@ class LogisticRegressionModel(JavaModel, JavaClassificationModel, JavaMLWritable trained on the training set. An exception is thrown if `trainingSummary is None`.
""" if self.hasSummary: -java_blrt_summary = self._call_java("summary") -# Note: Once multiclass is added, update this to return correct summary -return BinaryLogisticRegressionTrainingSummary(java_blrt_summary) +java_lrt_summary = self._call_java("summary") +if self.numClasses <= 2: +return BinaryLogisticRegressionTrainingSummary(java_lrt_summary) +else: +return LogisticRegressionTrainingSummary(java_lrt_summary) else: raise RuntimeError("No training summary available for this %s" % self.__class__.__name__) @@ -587,6 +589,14 @@ class LogisticRegressionSummary(JavaWrapper): return self._call_java("probabilityCol") @property +@since("2.3.0") +def predictionCol(self): +""" +Field in "predictions" which gives the prediction of
spark git commit: [SPARK-21690][ML] one-pass imputer
Repository: spark Updated Branches: refs/heads/master ca00cc70d -> 0fa5b7cac [SPARK-21690][ML] one-pass imputer ## What changes were proposed in this pull request? Parallelize the computation of all columns. Performance tests:

|numColumns| Mean(Old) | Median(Old) | Mean(RDD) | Median(RDD) | Mean(DF) | Median(DF) |
|--|--|--|--|--|--|--|
|1|0.0771394713|0.0658712813|0.080779802|0.04816598149996|0.1052550987001|0.0499620203|
|10|0.723434063099|0.5954440414|0.0867935197|0.1326342865998|0.0925572488999|0.1573943635|
|100|7.3756451568|6.2196631259|0.1911931552|0.862537681701|0.5557462431|1.721683798202|

## How was this patch tested? Existing tests. Author: Zheng RuiFeng Closes #18902 from zhengruifeng/parallelize_imputer. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0fa5b7ca Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0fa5b7ca Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0fa5b7ca Branch: refs/heads/master Commit: 0fa5b7cacca4e867dd9f787cc2801616967932a4 Parents: ca00cc7 Author: Zheng RuiFeng Authored: Wed Sep 13 20:12:21 2017 +0800 Committer: Yanbo Liang Committed: Wed Sep 13 20:12:21 2017 +0800 -- .../org/apache/spark/ml/feature/Imputer.scala | 56 ++-- 1 file changed, 41 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0fa5b7ca/mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala index 9e023b9..1f36ece 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala @@ -133,23 +133,49 @@ class Imputer @Since("2.2.0") (@Since("2.2.0") override val uid: String) override def fit(dataset: Dataset[_]): ImputerModel = { transformSchema(dataset.schema, logging = true) val spark = dataset.sparkSession -import spark.implicits._ -val surrogates = $(inputCols).map { inputCol => - val ic = col(inputCol) - val filtered = dataset.select(ic.cast(DoubleType)) -.filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN) - if(filtered.take(1).length == 0) { -throw new SparkException(s"surrogate cannot be computed. " + - s"All the values in $inputCol are Null, Nan or missingValue(${$(missingValue)})") - } - val surrogate = $(strategy) match { -case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first() -case Imputer.median => filtered.stat.approxQuantile(inputCol, Array(0.5), 0.001).head - } - surrogate + +val cols = $(inputCols).map { inputCol => + when(col(inputCol).equalTo($(missingValue)), null) +.when(col(inputCol).isNaN, null) +.otherwise(col(inputCol)) +.cast("double") +.as(inputCol) +} + +val results = $(strategy) match { + case Imputer.mean => +// Function avg will ignore null automatically. +// For a column only containing null, avg will return null. +val row = dataset.select(cols.map(avg): _*).head() +Array.range(0, $(inputCols).length).map { i => + if (row.isNullAt(i)) { +Double.NaN + } else { +row.getDouble(i) + } +} + + case Imputer.median => +// Function approxQuantile will ignore null automatically. +// For a column only containing null, approxQuantile will return an empty array.
+dataset.select(cols: _*).stat.approxQuantile($(inputCols), Array(0.5), 0.001) + .map { array => +if (array.isEmpty) { + Double.NaN +} else { + array.head +} + } +} + +val emptyCols = $(inputCols).zip(results).filter(_._2.isNaN).map(_._1) +if (emptyCols.nonEmpty) { + throw new SparkException(s"surrogate cannot be computed. " + +s"All the values in ${emptyCols.mkString(",")} are Null, Nan or " + +s"missingValue(${$(missingValue)})") } -val rows = spark.sparkContext.parallelize(Seq(Row.fromSeq(surrogates))) +val rows = spark.sparkContext.parallelize(Seq(Row.fromSeq(results))) val schema = StructType($(inputCols).map(col => StructField(col, DoubleType, nullable = false))) val surrogateDF = spark.createDataFrame(rows, schema) copyValues(new ImputerModel(uid, surrogateDF).setParent(this))
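Since the change is internal to `fit()`, a usage-level PySpark sketch may help: imputing several columns at once, which after this patch costs a single pass over the data instead of one job per column (data and column names are made up):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Imputer

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (1.0, float("nan")),
    (2.0, 4.0),
    (float("nan"), 6.0),
    (4.0, 8.0),
], ["a", "b"])

# Both surrogates (here: column means, ignoring NaN) are computed together.
imputer = Imputer(strategy="mean", inputCols=["a", "b"],
                  outputCols=["a_imputed", "b_imputed"])
imputer.fit(df).transform(df).show()
```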
spark git commit: [SPARK-14516][ML] Adding ClusteringEvaluator with the implementation of Cosine silhouette and squared Euclidean silhouette.
Repository: spark Updated Branches: refs/heads/master e2ac2f1c7 -> dd7816758 [SPARK-14516][ML] Adding ClusteringEvaluator with the implementation of Cosine silhouette and squared Euclidean silhouette. ## What changes were proposed in this pull request? This PR adds the ClusteringEvaluator, an Evaluator that contains two metrics:

- **cosineSilhouette**: the Silhouette measure using the cosine distance;
- **squaredSilhouette**: the Silhouette measure using the squared Euclidean distance.

The implementation of the two metrics follows the algorithm proposed and explained [here](https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view). These algorithms were designed for a distributed, parallel environment, so they perform reasonably well, unlike a naive Silhouette implementation that follows the definition directly. ## How was this patch tested? The patch has been tested with the additional unit tests added (comparing the results with those provided by the [Python sklearn library](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html)). Author: Marco Gaido Closes #18538 from mgaido91/SPARK-14516. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dd781675 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dd781675 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dd781675 Branch: refs/heads/master Commit: dd7816758516b303d79adaac856670c3ccda11ce Parents: e2ac2f1 Author: Marco Gaido Authored: Tue Sep 12 17:59:53 2017 +0800 Committer: Yanbo Liang Committed: Tue Sep 12 17:59:53 2017 +0800 -- .../ml/evaluation/ClusteringEvaluator.scala | 436 +++ mllib/src/test/resources/test-data/iris.libsvm | 150 +++ .../evaluation/ClusteringEvaluatorSuite.scala | 89 3 files changed, 675 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dd781675/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala b/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala new file mode 100644 index 000..d6ec522 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala @@ -0,0 +1,436 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ */ + +package org.apache.spark.ml.evaluation + +import org.apache.spark.SparkContext +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol} +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils} +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.functions.{avg, col, udf} +import org.apache.spark.sql.types.DoubleType + +/** + * :: Experimental :: + * + * Evaluator for clustering results. + * The metric computes the Silhouette measure + * using the squared Euclidean distance. + * + * The Silhouette is a measure for the validation + * of the consistency within clusters. It ranges + * between 1 and -1, where a value close to 1 + * means that the points in a cluster are close + * to the other points in the same cluster and + * far from the points of the other clusters. + */ +@Experimental +@Since("2.3.0") +class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: String) + extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable { + + @Since("2.3.0") + def this() = this(Identifiable.randomUID("cluEval")) + + @Since("2.3.0") + override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap) +
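This commit adds the Scala evaluator; a PySpark wrapper followed in a later release. Assuming a version where `pyspark.ml.evaluation.ClusteringEvaluator` is available, a minimal sketch with made-up data:

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Two well-separated blobs, so the silhouette should be close to 1.
df = spark.createDataFrame(
    [(Vectors.dense(x, y),) for (x, y) in
     [(0.0, 0.0), (0.1, 0.1), (9.0, 9.0), (9.1, 9.1)]],
    ["features"])

predictions = KMeans(k=2, seed=1).fit(df).transform(df)

# Default metric: Silhouette with squared Euclidean distance.
print(ClusteringEvaluator().evaluate(predictions))
```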
spark git commit: [SPARK-21856] Add probability and rawPrediction to MLPC for Python
Repository: spark Updated Branches: refs/heads/master 828fab035 -> 4bab8f599 [SPARK-21856] Add probability and rawPrediction to MLPC for Python Probability and rawPrediction have been added to MultilayerPerceptronClassifier for Python. Added a unit test. Author: Chunsheng Ji Closes #19172 from chunshengji/SPARK-21856. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4bab8f59 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4bab8f59 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4bab8f59 Branch: refs/heads/master Commit: 4bab8f5996d94a468a40fde2961ebebafc393508 Parents: 828fab0 Author: Chunsheng Ji Authored: Mon Sep 11 16:52:48 2017 +0800 Committer: Yanbo Liang Committed: Mon Sep 11 16:52:48 2017 +0800 -- python/pyspark/ml/classification.py | 15 ++- python/pyspark/ml/tests.py | 20 2 files changed, 30 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4bab8f59/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index f0f42a3..aa747f3 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -1356,7 +1356,8 @@ class NaiveBayesModel(JavaModel, JavaClassificationModel, JavaMLWritable, JavaML @inherit_doc class MultilayerPerceptronClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasMaxIter, HasTol, HasSeed, HasStepSize, HasSolver, - JavaMLWritable, JavaMLReadable): + JavaMLWritable, JavaMLReadable, HasProbabilityCol, + HasRawPredictionCol): """ Classifier trainer based on the Multilayer Perceptron. Each layer has sigmoid activation function, output layer has softmax. @@ -1425,11 +1426,13 @@ class MultilayerPerceptronClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, @keyword_only def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, stepSize=0.03, - solver="l-bfgs", initialWeights=None): + solver="l-bfgs", initialWeights=None, probabilityCol="probability", + rawPredicitionCol="rawPrediction"): """ __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", \ maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, stepSize=0.03, \ - solver="l-bfgs", initialWeights=None) + solver="l-bfgs", initialWeights=None, probabilityCol="probability", \ + rawPredicitionCol="rawPrediction") """ super(MultilayerPerceptronClassifier, self).__init__() self._java_obj = self._new_java_obj( @@ -1442,11 +1445,13 @@ class MultilayerPerceptronClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, @since("1.6.0") def setParams(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, stepSize=0.03, - solver="l-bfgs", initialWeights=None): + solver="l-bfgs", initialWeights=None, probabilityCol="probability", + rawPredicitionCol="rawPrediction"): """ setParams(self, featuresCol="features", labelCol="label", predictionCol="prediction", \ maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, stepSize=0.03, \ - solver="l-bfgs", initialWeights=None) + solver="l-bfgs", initialWeights=None, probabilityCol="probability", \ + rawPredicitionCol="rawPrediction"): Sets params for MultilayerPerceptronClassifier.
""" kwargs = self._input_kwargs http://git-wip-us.apache.org/repos/asf/spark/blob/4bab8f59/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index 509698f..15d6c76 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -1655,6 +1655,26 @@ class LogisticRegressionTest(SparkSessionTestCase): np.allclose(model.interceptVector.toArray(), [-0.9057, -1.1392, -0.0033], atol=1E-4)) +class MultilayerPerceptronClassifierTest(SparkSessionTestCase): + +def test_raw_and_probability_prediction(self): + +data_path =
spark git commit: [SPARK-21108][ML] convert LinearSVC to aggregator framework
Repository: spark Updated Branches: refs/heads/master 05af2de0f -> f3676d639 [SPARK-21108][ML] convert LinearSVC to aggregator framework ## What changes were proposed in this pull request? Convert LinearSVC to the new aggregator framework. ## How was this patch tested? Existing unit tests. Author: Yuhao Yang Closes #18315 from hhbyyh/svcAggregator. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f3676d63 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f3676d63 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f3676d63 Branch: refs/heads/master Commit: f3676d63913e0706e071b71e1742b8d57b102fba Parents: 05af2de Author: Yuhao Yang Authored: Fri Aug 25 10:22:27 2017 +0800 Committer: Yanbo Liang Committed: Fri Aug 25 10:22:27 2017 +0800 -- .../spark/ml/classification/LinearSVC.scala | 204 ++- .../ml/optim/aggregator/HingeAggregator.scala | 105 ++ .../ml/classification/LinearSVCSuite.scala | 7 +- .../optim/aggregator/HingeAggregatorSuite.scala | 163 +++ .../aggregator/LogisticAggregatorSuite.scala| 2 - 5 files changed, 286 insertions(+), 195 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f3676d63/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala index 8d556de..3b0666c 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala @@ -25,11 +25,11 @@ import org.apache.hadoop.fs.Path import org.apache.spark.SparkException import org.apache.spark.annotation.{Experimental, Since} -import org.apache.spark.broadcast.Broadcast import org.apache.spark.internal.Logging import org.apache.spark.ml.feature.Instance import org.apache.spark.ml.linalg._ -import org.apache.spark.ml.linalg.BLAS._ +import org.apache.spark.ml.optim.aggregator.HingeAggregator +import org.apache.spark.ml.optim.loss.{L2Regularization, RDDLossFunction} import org.apache.spark.ml.param._ import org.apache.spark.ml.param.shared._ import org.apache.spark.ml.util._ @@ -214,10 +214,20 @@ class LinearSVC @Since("2.2.0") ( } val featuresStd = summarizer.variance.toArray.map(math.sqrt) + val getFeaturesStd = (j: Int) => featuresStd(j) val regParamL2 = $(regParam) val bcFeaturesStd = instances.context.broadcast(featuresStd) - val costFun = new LinearSVCCostFun(instances, $(fitIntercept), -$(standardization), bcFeaturesStd, regParamL2, $(aggregationDepth)) + val regularization = if (regParamL2 != 0.0) { +val shouldApply = (idx: Int) => idx >= 0 && idx < numFeatures +Some(new L2Regularization(regParamL2, shouldApply, + if ($(standardization)) None else Some(getFeaturesStd))) + } else { +None + } + + val getAggregatorFunc = new HingeAggregator(bcFeaturesStd, $(fitIntercept))(_) + val costFun = new RDDLossFunction(instances, getAggregatorFunc, regularization, +$(aggregationDepth)) def regParamL1Fun = (index: Int) => 0D val optimizer = new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol)) @@ -372,189 +382,3 @@ object LinearSVCModel extends MLReadable[LinearSVCModel] { } } } - -/** - * LinearSVCCostFun implements Breeze's DiffFunction[T] for hinge loss function - */ -private class LinearSVCCostFun( -instances: RDD[Instance], -fitIntercept: Boolean, -standardization: Boolean, -bcFeaturesStd: Broadcast[Array[Double]], -regParamL2: Double, -aggregationDepth:
Int) extends DiffFunction[BDV[Double]] { - - override def calculate(coefficients: BDV[Double]): (Double, BDV[Double]) = { -val coeffs = Vectors.fromBreeze(coefficients) -val bcCoeffs = instances.context.broadcast(coeffs) -val featuresStd = bcFeaturesStd.value -val numFeatures = featuresStd.length - -val svmAggregator = { - val seqOp = (c: LinearSVCAggregator, instance: Instance) => c.add(instance) - val combOp = (c1: LinearSVCAggregator, c2: LinearSVCAggregator) => c1.merge(c2) - - instances.treeAggregate( -new LinearSVCAggregator(bcCoeffs, bcFeaturesStd, fitIntercept) - )(seqOp, combOp, aggregationDepth) -} - -val totalGradientArray = svmAggregator.gradient.toArray -// regVal is the sum of coefficients squares excluding intercept for L2 regularization. -val regVal = if (regParamL2 == 0.0) { - 0.0 -} else { - var sum = 0.0 -
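To make the refactor concrete, here is a toy NumPy sketch of the aggregation pattern the new HingeAggregator plugs into RDDLossFunction: each partition accumulates hinge loss and gradient in one pass, and partial results are merged, mirroring `treeAggregate`. This illustrates the pattern only, not the Scala implementation (standardization and broadcast details are omitted):

```python
import numpy as np

def hinge_partial(w, b, X, y):
    """One partition's contribution: loss and gradient in a single pass.
    Labels y in {0, 1} are mapped to {-1, +1}, as in LinearSVC."""
    s = 2.0 * y - 1.0
    margins = s * (X @ w + b)
    active = margins < 1.0                 # hinge is zero past margin 1
    loss = np.maximum(0.0, 1.0 - margins).sum()
    grad_w = -(s[active][:, None] * X[active]).sum(axis=0)
    grad_b = -s[active].sum()
    return loss, grad_w, grad_b, len(y)

def merge(p1, p2):
    """Combine two partial aggregates (the combOp of treeAggregate)."""
    return tuple(a + b for a, b in zip(p1, p2))

# Example: two "partitions" of toy data.
rng = np.random.default_rng(0)
X1, y1 = rng.normal(size=(5, 3)), rng.integers(0, 2, 5)
X2, y2 = rng.normal(size=(5, 3)), rng.integers(0, 2, 5)
w, b = np.zeros(3), 0.0
loss, gw, gb, n = merge(hinge_partial(w, b, X1, y1),
                        hinge_partial(w, b, X2, y2))
print(loss / n, gw / n, gb / n)
```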
spark git commit: [ML][MINOR] Make sharedParams update.
Repository: spark Updated Branches: refs/heads/master 3c0c2d09c -> 342961905 [ML][MINOR] Make sharedParams update. ## What changes were proposed in this pull request? ```sharedParams.scala``` is generated by ```SharedParamsCodeGen```, but it is out of date in master. Probably someone updated ```sharedParams.scala``` manually; this PR fixes the issue. ## How was this patch tested? Offline check. Author: Yanbo Liang Closes #19011 from yanboliang/sharedParams. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/34296190 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/34296190 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/34296190 Branch: refs/heads/master Commit: 34296190558435fce73184fb7fb1e3d2ced7c3f6 Parents: 3c0c2d0 Author: Yanbo Liang Authored: Wed Aug 23 11:06:53 2017 +0800 Committer: Yanbo Liang Committed: Wed Aug 23 11:06:53 2017 +0800 -- .../main/scala/org/apache/spark/ml/param/shared/sharedParams.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/34296190/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala b/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala index 545e45e..6061d9c 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala @@ -154,7 +154,7 @@ private[ml] trait HasVarianceCol extends Params { } /** - * Trait for shared param threshold (default: 0.5). + * Trait for shared param threshold. */ private[ml] trait HasThreshold extends Params {
spark git commit: [SPARK-19762][ML][FOLLOWUP] Add necessary comments to L2Regularization.
Repository: spark Updated Branches: refs/heads/master 84b5b16ea -> c108a5d30 [SPARK-19762][ML][FOLLOWUP] Add necessary comments to L2Regularization. ## What changes were proposed in this pull request? MLlib ```LinearRegression/LogisticRegression/LinearSVC``` always standardize the data during training to improve the rate of convergence, regardless of whether _standardization_ is true or false. If _standardization_ is false, we perform reverse standardization by penalizing each component differently to get effectively the same objective function when the training dataset is not standardized. We should keep these comments in the code to let developers understand how we handle this correctly. ## How was this patch tested? Existing tests; this only adds some comments in code. Author: Yanbo Liang Closes #18992 from yanboliang/SPARK-19762. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c108a5d3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c108a5d3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c108a5d3 Branch: refs/heads/master Commit: c108a5d30e821fef23709681fca7da22bc507129 Parents: 84b5b16 Author: Yanbo Liang Authored: Tue Aug 22 08:43:18 2017 +0800 Committer: Yanbo Liang Committed: Tue Aug 22 08:43:18 2017 +0800 -- .../ml/optim/loss/DifferentiableRegularization.scala | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c108a5d3/mllib/src/main/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularization.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularization.scala b/mllib/src/main/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularization.scala index 7ac7c22..929374e 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularization.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularization.scala @@ -39,9 +39,13 @@ private[ml] trait DifferentiableRegularization[T] extends DiffFunction[T] { * * @param regParam The magnitude of the regularization. * @param shouldApply A function (Int => Boolean) indicating whether a given index should have - *regularization applied to it. + *regularization applied to it. Usually we don't apply regularization to + *the intercept. * @param applyFeaturesStd Option for a function which maps coefficient index (column major) to the - * feature standard deviation. If `None`, no standardization is applied. + * feature standard deviation. Since we always standardize the data during + * training, if `standardization` is false, we have to reverse + * standardization by penalizing each component differently by this param. + * If `standardization` is true, this should be `None`. */ private[ml] class L2Regularization( override val regParam: Double, @@ -57,6 +61,11 @@ private[ml] class L2Regularization( val coef = coefficients(j) applyFeaturesStd match { case Some(getStd) => + // If `standardization` is false, we still standardize the data + // to improve the rate of convergence; as a result, we have to + // perform this reverse standardization by penalizing each component + // differently to get effectively the same objective function when + // the training dataset is not standardized. val std = getStd(j) if (std != 0.0) { val temp = coef / (std * std) @@ -66,6 +75,7 @@ private[ml] class L2Regularization( 0.0 } case None => + // If `standardization` is true, compute L2 regularization normally.
sum += coef * coef gradient(j) = coef * regParam }
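A toy NumPy sketch of the objective term those comments describe, under the stated assumption that training always runs on standardized features; the names are illustrative and this is not Spark's API:

```python
import numpy as np

def l2_value_and_gradient(coefficients, reg_param, features_std=None):
    """L2 term and its gradient. `features_std` is supplied only when the
    user asked for standardization=False: each component is then divided
    by std_j^2 to undo the internal scaling, per the comments above."""
    grad = np.zeros_like(coefficients)
    total = 0.0
    for j, coef in enumerate(coefficients):
        if features_std is None:
            # standardization=True: the plain 0.5 * lambda * ||w||^2 term
            total += coef * coef
            grad[j] = reg_param * coef
        elif features_std[j] != 0.0:
            # standardization=False: reverse the internal standardization
            temp = coef / (features_std[j] ** 2)
            total += coef * temp
            grad[j] = reg_param * temp
    return 0.5 * reg_param * total, grad

w = np.array([0.5, -1.2, 2.0])
print(l2_value_and_gradient(w, reg_param=0.1))
print(l2_value_and_gradient(w, reg_param=0.1,
                            features_std=np.array([1.0, 2.0, 0.5])))
```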
spark git commit: [SPARK-19634][ML] Multivariate summarizer - dataframes API
Repository: spark Updated Branches: refs/heads/master 966083105 -> 07549b20a [SPARK-19634][ML] Multivariate summarizer - dataframes API ## What changes were proposed in this pull request? This patch adds the DataFrames API to the multivariate summarizer (mean, variance, etc.). In addition to all the features of MultivariateOnlineSummarizer, it also allows the user to select a subset of the metrics. ## How was this patch tested? Testcases added. ## Performance Resolved several performance issues in #17419; further optimization is pending on the SQL team's work. One of the SQL-layer performance issues related to these features has been resolved in #18712, thanks liancheng and cloud-fan. ### Performance data (tested on my laptop, with 2 partitions; tries out = 20, warm up = 10) The unit of the test results is records/millisecond (higher is better).

Vector size/records number | 1/1000 | 10/100 | 100/100 | 1000/10 | 1/1
--|--|--|--|--|--
Dataframe | 15149 | 7441 | 2118 | 224 | 21
RDD from Dataframe | 4992 | 4440 | 2328 | 320 | 33
raw RDD | 53931 | 20683 | 3966 | 528 | 53

Author: WeichenXu Closes #18798 from WeichenXu123/SPARK-19634-dataframe-summarizer. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/07549b20 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/07549b20 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/07549b20 Branch: refs/heads/master Commit: 07549b20a3fc2a282e080f76a2be075e4dd5ebc7 Parents: 9660831 Author: WeichenXu Authored: Wed Aug 16 10:41:05 2017 +0800 Committer: Yanbo Liang Committed: Wed Aug 16 10:41:05 2017 +0800 -- .../org/apache/spark/ml/linalg/VectorUDT.scala | 24 +- .../org/apache/spark/ml/stat/Summarizer.scala | 596 +++ .../apache/spark/ml/stat/SummarizerSuite.scala | 582 ++ .../sql/catalyst/expressions/Projection.scala | 6 + .../expressions/aggregate/interfaces.scala | 6 + 5 files changed, 1203 insertions(+), 11 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/07549b20/mllib/src/main/scala/org/apache/spark/ml/linalg/VectorUDT.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/linalg/VectorUDT.scala b/mllib/src/main/scala/org/apache/spark/ml/linalg/VectorUDT.scala index 9178613..37f173b 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/linalg/VectorUDT.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/linalg/VectorUDT.scala @@ -27,17 +27,7 @@ import org.apache.spark.sql.types._ */ private[spark] class VectorUDT extends UserDefinedType[Vector] { - override def sqlType: StructType = { -// type: 0 = sparse, 1 = dense -// We only use "values" for dense vectors, and "size", "indices", and "values" for sparse -// vectors. The "values" field is nullable because we might want to add binary vectors later, -// which uses "size" and "indices", but not "values".
-StructType(Seq( - StructField("type", ByteType, nullable = false), - StructField("size", IntegerType, nullable = true), - StructField("indices", ArrayType(IntegerType, containsNull = false), nullable = true), - StructField("values", ArrayType(DoubleType, containsNull = false), nullable = true))) - } + override final def sqlType: StructType = _sqlType override def serialize(obj: Vector): InternalRow = { obj match { @@ -94,4 +84,16 @@ private[spark] class VectorUDT extends UserDefinedType[Vector] { override def typeName: String = "vector" private[spark] override def asNullable: VectorUDT = this + + private[this] val _sqlType = { +// type: 0 = sparse, 1 = dense +// We only use "values" for dense vectors, and "size", "indices", and "values" for sparse +// vectors. The "values" field is nullable because we might want to add binary vectors later, +// which uses "size" and "indices", but not "values". +StructType(Seq( + StructField("type", ByteType, nullable = false), + StructField("size", IntegerType, nullable = true), + StructField("indices", ArrayType(IntegerType, containsNull = false), nullable = true), + StructField("values", ArrayType(DoubleType, containsNull = false), nullable = true))) + } } http://git-wip-us.apache.org/repos/asf/spark/blob/07549b20/mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala b/mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala new file mode 100644 index 000..7e408b9 --- /dev/null +++
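The commit adds the Scala/DataFrame API; a PySpark wrapper (`pyspark.ml.stat.Summarizer`) arrived in a later release. Assuming a version that includes it, a sketch of selecting only the metrics you need:

```python
from pyspark.sql import SparkSession
from pyspark.ml.stat import Summarizer
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(Vectors.dense(1.0, 2.0),), (Vectors.dense(3.0, 4.0),)],
    ["features"])

# Request a subset of the metrics; only these are computed.
stats = Summarizer.metrics("mean", "variance")
df.select(stats.summary(df.features)).show(truncate=False)

# Shorthand for a single metric.
df.select(Summarizer.mean(df.features)).show()
```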
spark git commit: [SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in strong Wolfe line search
Repository: spark Updated Branches: refs/heads/branch-2.2 d02331452 -> 7446be332 [SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in strong Wolfe line search ## What changes were proposed in this pull request? Update breeze to 0.13.2 for an emergency bugfix in strong Wolfe line search: https://github.com/scalanlp/breeze/pull/651 ## How was this patch tested? N/A Author: WeichenXu Closes #18797 from WeichenXu123/update-breeze. (cherry picked from commit b35660dd0e930f4b484a079d9e2516b0a7dacf1d) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7446be33 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7446be33 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7446be33 Branch: refs/heads/branch-2.2 Commit: 7446be3328ea75a5197b2587e3a8e2ca7977726b Parents: d023314 Author: WeichenXu Authored: Wed Aug 9 14:44:10 2017 +0800 Committer: Yanbo Liang Committed: Wed Aug 9 14:44:39 2017 +0800 -- dev/deps/spark-deps-hadoop-2.6| 4 ++-- dev/deps/spark-deps-hadoop-2.7| 4 ++-- .../spark/ml/regression/AFTSurvivalRegression.scala | 2 ++ .../ml/regression/AFTSurvivalRegressionSuite.scala| 1 - .../org/apache/spark/ml/util/MLTestingUtils.scala | 1 - .../apache/spark/mllib/optimization/LBFGSSuite.scala | 4 ++-- pom.xml | 2 +- python/pyspark/ml/regression.py | 14 +++--- 8 files changed, 16 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7446be33/dev/deps/spark-deps-hadoop-2.6 -- diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/deps/spark-deps-hadoop-2.6 index 9287bd4..02c0b21 100644 --- a/dev/deps/spark-deps-hadoop-2.6 +++ b/dev/deps/spark-deps-hadoop-2.6 @@ -19,8 +19,8 @@ avro-mapred-1.7.7-hadoop2.jar base64-2.3.8.jar bcprov-jdk15on-1.51.jar bonecp-0.8.0.RELEASE.jar -breeze-macros_2.11-0.13.1.jar -breeze_2.11-0.13.1.jar +breeze-macros_2.11-0.13.2.jar +breeze_2.11-0.13.2.jar calcite-avatica-1.2.0-incubating.jar calcite-core-1.2.0-incubating.jar calcite-linq4j-1.2.0-incubating.jar http://git-wip-us.apache.org/repos/asf/spark/blob/7446be33/dev/deps/spark-deps-hadoop-2.7 -- diff --git a/dev/deps/spark-deps-hadoop-2.7 b/dev/deps/spark-deps-hadoop-2.7 index ab1de3d..47e28de 100644 --- a/dev/deps/spark-deps-hadoop-2.7 +++ b/dev/deps/spark-deps-hadoop-2.7 @@ -19,8 +19,8 @@ avro-mapred-1.7.7-hadoop2.jar base64-2.3.8.jar bcprov-jdk15on-1.51.jar bonecp-0.8.0.RELEASE.jar -breeze-macros_2.11-0.13.1.jar -breeze_2.11-0.13.1.jar +breeze-macros_2.11-0.13.2.jar +breeze_2.11-0.13.2.jar calcite-avatica-1.2.0-incubating.jar calcite-core-1.2.0-incubating.jar calcite-linq4j-1.2.0-incubating.jar http://git-wip-us.apache.org/repos/asf/spark/blob/7446be33/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala index 094853b..0891994 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala @@ -553,6 +553,8 @@ private class AFTAggregator( val ti = data.label val delta = data.censor +require(ti > 0.0, "The lifetime or label should be greater than 0.") + val localFeaturesStd = bcFeaturesStd.value val margin = { http://git-wip-us.apache.org/repos/asf/spark/blob/7446be33/mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala -- diff --git
a/mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala index fb39e50..02e5c6d 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala @@ -364,7 +364,6 @@ class AFTSurvivalRegressionSuite test("should support all NumericType censors, and not support other types") { val df = spark.createDataFrame(Seq( - (0, Vectors.dense(0)), (1, Vectors.dense(1)), (2, Vectors.dense(2)), (3, Vectors.dense(3)),
spark git commit: [SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in strong Wolfe line search
Repository: spark Updated Branches: refs/heads/master ae8a2b149 -> b35660dd0 [SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in strong Wolfe line search ## What changes were proposed in this pull request? Update breeze to 0.13.2 for an emergency bugfix in strong Wolfe line search: https://github.com/scalanlp/breeze/pull/651 ## How was this patch tested? N/A Author: WeichenXu Closes #18797 from WeichenXu123/update-breeze. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b35660dd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b35660dd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b35660dd Branch: refs/heads/master Commit: b35660dd0e930f4b484a079d9e2516b0a7dacf1d Parents: ae8a2b1 Author: WeichenXu Authored: Wed Aug 9 14:44:10 2017 +0800 Committer: Yanbo Liang Committed: Wed Aug 9 14:44:10 2017 +0800 -- dev/deps/spark-deps-hadoop-2.6| 4 ++-- dev/deps/spark-deps-hadoop-2.7| 4 ++-- .../spark/ml/regression/AFTSurvivalRegression.scala | 2 ++ .../ml/regression/AFTSurvivalRegressionSuite.scala| 1 - .../org/apache/spark/ml/util/MLTestingUtils.scala | 1 - .../apache/spark/mllib/optimization/LBFGSSuite.scala | 4 ++-- pom.xml | 2 +- python/pyspark/ml/regression.py | 14 +++--- 8 files changed, 16 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b35660dd/dev/deps/spark-deps-hadoop-2.6 -- diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/deps/spark-deps-hadoop-2.6 index a41183a..d7587fb 100644 --- a/dev/deps/spark-deps-hadoop-2.6 +++ b/dev/deps/spark-deps-hadoop-2.6 @@ -22,8 +22,8 @@ avro-mapred-1.7.7-hadoop2.jar base64-2.3.8.jar bcprov-jdk15on-1.51.jar bonecp-0.8.0.RELEASE.jar -breeze-macros_2.11-0.13.1.jar -breeze_2.11-0.13.1.jar +breeze-macros_2.11-0.13.2.jar +breeze_2.11-0.13.2.jar calcite-avatica-1.2.0-incubating.jar calcite-core-1.2.0-incubating.jar calcite-linq4j-1.2.0-incubating.jar http://git-wip-us.apache.org/repos/asf/spark/blob/b35660dd/dev/deps/spark-deps-hadoop-2.7 -- diff --git a/dev/deps/spark-deps-hadoop-2.7 b/dev/deps/spark-deps-hadoop-2.7 index 5e1321b..887eeca 100644 --- a/dev/deps/spark-deps-hadoop-2.7 +++ b/dev/deps/spark-deps-hadoop-2.7 @@ -22,8 +22,8 @@ avro-mapred-1.7.7-hadoop2.jar base64-2.3.8.jar bcprov-jdk15on-1.51.jar bonecp-0.8.0.RELEASE.jar -breeze-macros_2.11-0.13.1.jar -breeze_2.11-0.13.1.jar +breeze-macros_2.11-0.13.2.jar +breeze_2.11-0.13.2.jar calcite-avatica-1.2.0-incubating.jar calcite-core-1.2.0-incubating.jar calcite-linq4j-1.2.0-incubating.jar http://git-wip-us.apache.org/repos/asf/spark/blob/b35660dd/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala index 094853b..0891994 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala @@ -553,6 +553,8 @@ private class AFTAggregator( val ti = data.label val delta = data.censor +require(ti > 0.0, "The lifetime or label should be greater than 0.") + val localFeaturesStd = bcFeaturesStd.value val margin = { http://git-wip-us.apache.org/repos/asf/spark/blob/b35660dd/mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala
b/mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala index fb39e50..02e5c6d 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala @@ -364,7 +364,6 @@ class AFTSurvivalRegressionSuite test("should support all NumericType censors, and not support other types") { val df = spark.createDataFrame(Seq( - (0, Vectors.dense(0)), (1, Vectors.dense(1)), (2, Vectors.dense(2)), (3, Vectors.dense(3)), http://git-wip-us.apache.org/repos/asf/spark/blob/b35660dd/mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala
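Besides the dependency bump, note the new `require(ti > 0.0, ...)` guard above: AFT labels are lifetimes and must now be strictly positive. A PySpark sketch with valid data (the rows come from the standard AFT example; values are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.ml.regression import AFTSurvivalRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Every label (lifetime) is > 0; a 0.0 label would now fail fit() with
# "The lifetime or label should be greater than 0."
df = spark.createDataFrame([
    (1.218, 1.0, Vectors.dense(1.560, -0.605)),
    (2.949, 0.0, Vectors.dense(0.346, 2.158)),
    (3.627, 0.0, Vectors.dense(1.380, 0.231)),
    (0.273, 1.0, Vectors.dense(0.520, 1.151)),
], ["label", "censor", "features"])

model = AFTSurvivalRegression().fit(df)
print(model.coefficients, model.intercept, model.scale)
```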
spark git commit: [SPARK-21306][ML] For branch 2.0, OneVsRest should support setWeightCol
Repository: spark Updated Branches: refs/heads/branch-2.0 c27a01aec -> 9f670ce5d [SPARK-21306][ML] For branch 2.0, OneVsRest should support setWeightCol The PR is related to #18554, and is modified for branch 2.0. ## What changes were proposed in this pull request? Add a `setWeightCol` method for OneVsRest. `weightCol` is ignored if the classifier doesn't inherit the HasWeightCol trait. ## How was this patch tested? + [x] added a unit test. Author: Yan Facai (颜发才) Closes #18764 from facaiy/BUG/branch-2.0_OneVsRest_support_setWeightCol. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9f670ce5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9f670ce5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9f670ce5 Branch: refs/heads/branch-2.0 Commit: 9f670ce5d1aeef737226185d78f07147f0cc2693 Parents: c27a01a Author: Yan Facai (颜发才) Authored: Tue Aug 8 11:18:15 2017 +0800 Committer: Yanbo Liang Committed: Tue Aug 8 11:18:15 2017 +0800 -- .../spark/ml/classification/OneVsRest.scala | 39 ++-- .../ml/classification/OneVsRestSuite.scala | 11 ++ python/pyspark/ml/classification.py | 27 +++--- python/pyspark/ml/tests.py | 14 +++ 4 files changed, 82 insertions(+), 9 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9f670ce5/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala index f4ab0a0..770d5db 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala @@ -34,6 +34,7 @@ import org.apache.spark.ml._ import org.apache.spark.ml.attribute._ import org.apache.spark.ml.linalg.Vector import org.apache.spark.ml.param.{Param, ParamMap, ParamPair, Params} +import org.apache.spark.ml.param.shared.HasWeightCol import org.apache.spark.ml.util._ import org.apache.spark.sql.{DataFrame, Dataset, Row} import org.apache.spark.sql.functions._ @@ -53,7 +54,8 @@ private[ml] trait ClassifierTypeTrait { /** * Params for [[OneVsRest]]. */ -private[ml] trait OneVsRestParams extends PredictorParams with ClassifierTypeTrait { +private[ml] trait OneVsRestParams extends PredictorParams + with ClassifierTypeTrait with HasWeightCol { /** * param for the base binary classifier that we reduce multiclass classification into. @@ -290,6 +292,18 @@ final class OneVsRest @Since("1.4.0") ( @Since("1.5.0") def setPredictionCol(value: String): this.type = set(predictionCol, value) + /** + * Sets the value of param [[weightCol]]. + * + * This is ignored if weight is not supported by [[classifier]]. + * If this is not set or empty, we treat all instance weights as 1.0. + * Default is not set, so all instances have weight one.
+ * + * @group setParam + */ + @Since("2.3.0") + def setWeightCol(value: String): this.type = set(weightCol, value) + @Since("1.4.0") override def transformSchema(schema: StructType): StructType = { validateAndTransformSchema(schema, fitting = true, getClassifier.featuresDataType) @@ -308,7 +322,20 @@ final class OneVsRest @Since("1.4.0") ( } val numClasses = MetadataUtils.getNumClasses(labelSchema).fold(computeNumClasses())(identity) -val multiclassLabeled = dataset.select($(labelCol), $(featuresCol)) +val weightColIsUsed = isDefined(weightCol) && $(weightCol).nonEmpty && { + getClassifier match { +case _: HasWeightCol => true +case c => + logWarning(s"weightCol is ignored, as it is not supported by $c now.") + false + } +} + +val multiclassLabeled = if (weightColIsUsed) { + dataset.select($(labelCol), $(featuresCol), $(weightCol)) +} else { + dataset.select($(labelCol), $(featuresCol)) +} // persist if underlying dataset is not persistent. val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE @@ -328,7 +355,13 @@ final class OneVsRest @Since("1.4.0") ( paramMap.put(classifier.labelCol -> labelColName) paramMap.put(classifier.featuresCol -> getFeaturesCol) paramMap.put(classifier.predictionCol -> getPredictionCol) - classifier.fit(trainingDataset, paramMap) + if (weightColIsUsed) { +val classifier_ = classifier.asInstanceOf[ClassifierType with HasWeightCol] +paramMap.put(classifier_.weightCol -> getWeightCol) +classifier_.fit(trainingDataset, paramMap) + }
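A PySpark sketch of the new knob, assuming a build with this backport; the data and weight values are illustrative. LogisticRegression mixes in HasWeightCol, so the weights flow through to each binary sub-model:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (0.0, 1.0, Vectors.dense(0.0, 0.2)),
    (1.0, 2.0, Vectors.dense(1.0, 0.8)),
    (2.0, 1.0, Vectors.dense(2.0, 0.1)),
    (0.0, 3.0, Vectors.dense(0.1, 0.3)),
], ["label", "weight", "features"])

# For a base classifier without weight support, weightCol would be
# ignored with a warning instead.
ovr = OneVsRest(classifier=LogisticRegression(maxIter=10),
                weightCol="weight")
ovr.fit(df).transform(df).show()
```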
spark git commit: [SPARK-21306][ML] For branch 2.1, OneVsRest should support setWeightCol
Repository: spark Updated Branches: refs/heads/branch-2.1 444cca14d -> 9b749b6ce [SPARK-21306][ML] For branch 2.1, OneVsRest should support setWeightCol The PR is related to #18554, and is modified for branch 2.1. ## What changes were proposed in this pull request? Add a `setWeightCol` method for OneVsRest. `weightCol` is ignored if the classifier doesn't inherit the HasWeightCol trait. ## How was this patch tested? + [x] added a unit test. Author: Yan Facai (颜发才) Closes #18763 from facaiy/BUG/branch-2.1_OneVsRest_support_setWeightCol. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9b749b6c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9b749b6c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9b749b6c Branch: refs/heads/branch-2.1 Commit: 9b749b6ce6b86caf8a73d6993490fc140b9ad282 Parents: 444cca1 Author: Yan Facai (颜发才) Authored: Tue Aug 8 11:05:36 2017 +0800 Committer: Yanbo Liang Committed: Tue Aug 8 11:05:36 2017 +0800 -- .../spark/ml/classification/OneVsRest.scala | 39 ++-- .../ml/classification/OneVsRestSuite.scala | 10 + python/pyspark/ml/classification.py | 27 +++--- python/pyspark/ml/tests.py | 14 +++ 4 files changed, 81 insertions(+), 9 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9b749b6c/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala index e58b30d..c4a8f1f 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala @@ -34,6 +34,7 @@ import org.apache.spark.ml._ import org.apache.spark.ml.attribute._ import org.apache.spark.ml.linalg.Vector import org.apache.spark.ml.param.{Param, ParamMap, ParamPair, Params} +import org.apache.spark.ml.param.shared.HasWeightCol import org.apache.spark.ml.util._ import org.apache.spark.sql.{DataFrame, Dataset, Row} import org.apache.spark.sql.functions._ @@ -53,7 +54,8 @@ private[ml] trait ClassifierTypeTrait { /** * Params for [[OneVsRest]]. */ -private[ml] trait OneVsRestParams extends PredictorParams with ClassifierTypeTrait { +private[ml] trait OneVsRestParams extends PredictorParams + with ClassifierTypeTrait with HasWeightCol { /** * param for the base binary classifier that we reduce multiclass classification into. @@ -299,6 +301,18 @@ final class OneVsRest @Since("1.4.0") ( @Since("1.5.0") def setPredictionCol(value: String): this.type = set(predictionCol, value) + /** + * Sets the value of param [[weightCol]]. + * + * This is ignored if weight is not supported by [[classifier]]. + * If this is not set or empty, we treat all instance weights as 1.0. + * Default is not set, so all instances have weight one.
+ * + * @group setParam + */ + @Since("2.3.0") + def setWeightCol(value: String): this.type = set(weightCol, value) + @Since("1.4.0") override def transformSchema(schema: StructType): StructType = { validateAndTransformSchema(schema, fitting = true, getClassifier.featuresDataType) @@ -317,7 +331,20 @@ final class OneVsRest @Since("1.4.0") ( } val numClasses = MetadataUtils.getNumClasses(labelSchema).fold(computeNumClasses())(identity) -val multiclassLabeled = dataset.select($(labelCol), $(featuresCol)) +val weightColIsUsed = isDefined(weightCol) && $(weightCol).nonEmpty && { + getClassifier match { +case _: HasWeightCol => true +case c => + logWarning(s"weightCol is ignored, as it is not supported by $c now.") + false + } +} + +val multiclassLabeled = if (weightColIsUsed) { + dataset.select($(labelCol), $(featuresCol), $(weightCol)) +} else { + dataset.select($(labelCol), $(featuresCol)) +} // persist if underlying dataset is not persistent. val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE @@ -337,7 +364,13 @@ final class OneVsRest @Since("1.4.0") ( paramMap.put(classifier.labelCol -> labelColName) paramMap.put(classifier.featuresCol -> getFeaturesCol) paramMap.put(classifier.predictionCol -> getPredictionCol) - classifier.fit(trainingDataset, paramMap) + if (weightColIsUsed) { +val classifier_ = classifier.asInstanceOf[ClassifierType with HasWeightCol] +paramMap.put(classifier_.weightCol -> getWeightCol) +classifier_.fit(trainingDataset, paramMap) + }
spark git commit: [SPARK-19270][FOLLOW-UP][ML] PySpark GLR model.summary should return a printable representation.
Repository: spark Updated Branches: refs/heads/master fdcee028a -> f763d8464 [SPARK-19270][FOLLOW-UP][ML] PySpark GLR model.summary should return a printable representation. ## What changes were proposed in this pull request? PySpark GLR ```model.summary``` should return a printable representation by calling Scala ```toString```. ## How was this patch tested?

```
from pyspark.ml.regression import GeneralizedLinearRegression
dataset = spark.read.format("libsvm").load("data/mllib/sample_linear_regression_data.txt")
glr = GeneralizedLinearRegression(family="gaussian", link="identity", maxIter=10, regParam=0.3)
model = glr.fit(dataset)
model.summary
```

Before this PR: ![image](https://user-images.githubusercontent.com/1962026/29021059-e221633e-7b96-11e7-8d77-5d53f89c81a9.png) After this PR: ![image](https://user-images.githubusercontent.com/1962026/29021097-fce80fa6-7b96-11e7-8ab4-7e113d447d5d.png) Author: Yanbo Liang Closes #18870 from yanboliang/spark-19270. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f763d846 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f763d846 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f763d846 Branch: refs/heads/master Commit: f763d8464b32852d7fd33e962e5476a7f03bc6c6 Parents: fdcee02 Author: Yanbo Liang Authored: Tue Aug 8 08:43:58 2017 +0800 Committer: Yanbo Liang Committed: Tue Aug 8 08:43:58 2017 +0800 -- python/pyspark/ml/regression.py | 3 +++ 1 file changed, 3 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f763d846/python/pyspark/ml/regression.py -- diff --git a/python/pyspark/ml/regression.py b/python/pyspark/ml/regression.py index 2cc6234..72374ac 100644 --- a/python/pyspark/ml/regression.py +++ b/python/pyspark/ml/regression.py @@ -1745,6 +1745,9 @@ class GeneralizedLinearRegressionTrainingSummary(GeneralizedLinearRegressionSumm """ return self._call_java("pValues") +def __repr__(self): +return self._call_java("toString") + if __name__ == "__main__": import doctest
spark git commit: [SPARK-20601][ML] Python API for Constrained Logistic Regression
Repository: spark Updated Branches: refs/heads/master 14e75758a -> 845c039ce [SPARK-20601][ML] Python API for Constrained Logistic Regression ## What changes were proposed in this pull request? Python API for Constrained Logistic Regression based on #17922; thanks for the original contribution from zero323. ## How was this patch tested? Unit tests. Author: zero323 Author: Yanbo Liang Closes #18759 from yanboliang/SPARK-20601. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/845c039c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/845c039c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/845c039c Branch: refs/heads/master Commit: 845c039ceb1662632a97631b110e875e934894ad Parents: 14e7575 Author: zero323 Authored: Wed Aug 2 18:10:26 2017 +0800 Committer: Yanbo Liang Committed: Wed Aug 2 18:10:26 2017 +0800 -- python/pyspark/ml/classification.py | 105 +-- python/pyspark/ml/param/__init__.py | 11 +++- python/pyspark/ml/tests.py | 37 +++ 3 files changed, 148 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/845c039c/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index ab1617b..bccf8e7 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -252,18 +252,55 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti "be used in the model. Supported options: auto, binomial, multinomial", typeConverter=TypeConverters.toString) +lowerBoundsOnCoefficients = Param(Params._dummy(), "lowerBoundsOnCoefficients", + "The lower bounds on coefficients if fitting under bound " + "constrained optimization. The bound matrix must be " + "compatible with the shape " + "(1, number of features) for binomial regression, or " + "(number of classes, number of features) " + "for multinomial regression.", + typeConverter=TypeConverters.toMatrix) + +upperBoundsOnCoefficients = Param(Params._dummy(), "upperBoundsOnCoefficients", + "The upper bounds on coefficients if fitting under bound " + "constrained optimization. The bound matrix must be " + "compatible with the shape " + "(1, number of features) for binomial regression, or " + "(number of classes, number of features) " + "for multinomial regression.", + typeConverter=TypeConverters.toMatrix) + +lowerBoundsOnIntercepts = Param(Params._dummy(), "lowerBoundsOnIntercepts", +"The lower bounds on intercepts if fitting under bound " +"constrained optimization. The bounds vector size must be " +"equal with 1 for binomial regression, or the number of " +"classes for multinomial regression.", +typeConverter=TypeConverters.toVector) + +upperBoundsOnIntercepts = Param(Params._dummy(), "upperBoundsOnIntercepts", +"The upper bounds on intercepts if fitting under bound " +"constrained optimization. The bound vector size must be " +"equal with 1 for binomial regression, or the number of " +"classes for multinomial regression.", +typeConverter=TypeConverters.toVector) + @keyword_only def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxIter=100, regParam=0.0, elasticNetParam=0.0, tol=1e-6, fitIntercept=True, threshold=0.5, thresholds=None, probabilityCol="probability", rawPredictionCol="rawPrediction", standardization=True, weightCol=None, - aggregationDepth=2, family="auto"): + aggregationDepth=2, family="auto", + lowerBoundsOnCoefficients=None, upperBoundsOnCoefficients=None, +
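A minimal PySpark sketch of the new box constraints, assuming a build with this patch (Spark 2.3+); data and bound values are illustrative. For the binomial case the coefficient bounds are (1, numFeatures) matrices and the intercept bounds are length-1 vectors:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Matrices, Vectors

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (0.0, Vectors.dense(0.0, 1.0)),
    (1.0, Vectors.dense(1.0, 0.0)),
    (0.0, Vectors.dense(0.5, 1.5)),
    (1.0, Vectors.dense(1.5, 0.5)),
], ["label", "features"])

# Force non-negative coefficients and a bounded intercept.
blr = LogisticRegression(
    lowerBoundsOnCoefficients=Matrices.dense(1, 2, [0.0, 0.0]),
    upperBoundsOnCoefficients=Matrices.dense(1, 2, [10.0, 10.0]),
    lowerBoundsOnIntercepts=Vectors.dense(-5.0),
    upperBoundsOnIntercepts=Vectors.dense(5.0))
model = blr.fit(df)
print(model.coefficients, model.intercept)
```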
spark git commit: [SPARK-21388][ML][PYSPARK] GBTs inherit from HasStepSize & LinearSVC from HasThreshold
Repository: spark Updated Branches: refs/heads/master 5fd0294ff -> 253a07e43 [SPARK-21388][ML][PYSPARK] GBTs inherit from HasStepSize & LinearSVC from HasThreshold ## What changes were proposed in this pull request? GBTs inherit from HasStepSize & LinearSVC/Binarizer from HasThreshold ## How was this patch tested? Existing tests. Author: Zheng RuiFeng Author: Ruifeng Zheng Closes #18612 from zhengruifeng/override_HasXXX. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/253a07e4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/253a07e4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/253a07e4 Branch: refs/heads/master Commit: 253a07e43a35f3494aa5e5ead9f4997c653325aa Parents: 5fd0294 Author: Zheng RuiFeng Authored: Tue Aug 1 21:34:26 2017 +0800 Committer: Yanbo Liang Committed: Tue Aug 1 21:34:26 2017 +0800 -- .../spark/ml/classification/LinearSVC.scala | 7 ++- .../ml/classification/LogisticRegression.scala | 1 + .../org/apache/spark/ml/feature/Word2Vec.scala | 1 - .../ml/param/shared/SharedParamsCodeGen.scala| 6 +++--- .../spark/ml/param/shared/sharedParams.scala | 6 ++ .../org/apache/spark/ml/tree/treeParams.scala| 7 ++- python/pyspark/ml/classification.py | 19 ++- python/pyspark/ml/regression.py | 5 + 8 files changed, 21 insertions(+), 31 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/253a07e4/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala index d6ed6a4..8d556de 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala @@ -42,7 +42,7 @@ import org.apache.spark.sql.functions.{col, lit} /** Params for linear SVM Classifier. */ private[classification] trait LinearSVCParams extends ClassifierParams with HasRegParam with HasMaxIter with HasFitIntercept with HasTol with HasStandardization with HasWeightCol - with HasAggregationDepth { + with HasAggregationDepth with HasThreshold { /** * Param for threshold in binary classification prediction.
@@ -53,11 +53,8 @@ private[classification] trait LinearSVCParams extends ClassifierParams with HasR * * @group param */ - final val threshold: DoubleParam = new DoubleParam(this, "threshold", + final override val threshold: DoubleParam = new DoubleParam(this, "threshold", "threshold in binary classification prediction applied to rawPrediction") - - /** @group getParam */ - def getThreshold: Double = $(threshold) } /** http://git-wip-us.apache.org/repos/asf/spark/blob/253a07e4/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala index 6bba7f9..21957d9 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala @@ -366,6 +366,7 @@ class LogisticRegression @Since("1.2.0") ( @Since("1.5.0") override def setThreshold(value: Double): this.type = super.setThreshold(value) + setDefault(threshold -> 0.5) @Since("1.5.0") override def getThreshold: Double = super.getThreshold http://git-wip-us.apache.org/repos/asf/spark/blob/253a07e4/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala index b6909b3..d4c8e4b 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala @@ -19,7 +19,6 @@ package org.apache.spark.ml.feature import org.apache.hadoop.fs.Path -import org.apache.spark.SparkContext import org.apache.spark.annotation.Since import org.apache.spark.ml.{Estimator, Model} import org.apache.spark.ml.linalg.{BLAS, Vector, Vectors, VectorUDT} http://git-wip-us.apache.org/repos/asf/spark/blob/253a07e4/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala
spark git commit: [SPARK-21575][SPARKR] Eliminate needless synchronization in java-R serialization
Repository: spark Updated Branches: refs/heads/master 44e501ace -> 106eaa9b9 [SPARK-21575][SPARKR] Eliminate needless synchronization in java-R serialization ## What changes were proposed in this pull request? Remove surplus synchronized blocks. ## How was this patch tested? Unit tests run OK. Author: iurii.ant Closes #18775 from SereneAnt/eliminate_unnecessary_synchronization_in_java-R_serialization. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/106eaa9b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/106eaa9b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/106eaa9b Branch: refs/heads/master Commit: 106eaa9b95192f0cdbb382c11efdcb85032e679b Parents: 44e501a Author: iurii.ant Authored: Mon Jul 31 10:42:09 2017 +0800 Committer: Yanbo Liang Committed: Mon Jul 31 10:42:09 2017 +0800 -- .../org/apache/spark/api/r/JVMObjectTracker.scala | 16 ++-- 1 file changed, 2 insertions(+), 14 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/106eaa9b/core/src/main/scala/org/apache/spark/api/r/JVMObjectTracker.scala -- diff --git a/core/src/main/scala/org/apache/spark/api/r/JVMObjectTracker.scala b/core/src/main/scala/org/apache/spark/api/r/JVMObjectTracker.scala index 3432700..fe7438a 100644 --- a/core/src/main/scala/org/apache/spark/api/r/JVMObjectTracker.scala +++ b/core/src/main/scala/org/apache/spark/api/r/JVMObjectTracker.scala @@ -37,13 +37,7 @@ private[r] class JVMObjectTracker { /** * Returns the JVM object associated with the input key or None if not found. */ - final def get(id: JVMObjectId): Option[Object] = this.synchronized { -if (objMap.containsKey(id)) { - Some(objMap.get(id)) -} else { - None -} - } + final def get(id: JVMObjectId): Option[Object] = Option(objMap.get(id)) /** * Returns the JVM object associated with the input key or throws an exception if not found. @@ -67,13 +61,7 @@ private[r] class JVMObjectTracker { /** * Removes and returns a JVM object with the specific ID from the tracker, or None if not found. */ - final def remove(id: JVMObjectId): Option[Object] = this.synchronized { -if (objMap.containsKey(id)) { - Some(objMap.remove(id)) -} else { - None -} - } + final def remove(id: JVMObjectId): Option[Object] = Option(objMap.remove(id)) /** * Number of JVM objects being tracked.
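The design point, sketched in Python for illustration (the real code relies on ConcurrentHashMap, whose `get`/`remove` are already atomic): a single lookup makes the old check-then-act pattern, and the lock that guarded it, unnecessary.

```python
import threading

objs = {}                  # stands in for the tracker's concurrent map
lock = threading.Lock()

# Before: check-then-act needs a lock, because another thread could
# remove the key between the membership test and the lookup.
def get_with_lock(key):
    with lock:
        if key in objs:
            return objs[key]
        return None

# After: one atomic lookup, no lock. dict.get (like ConcurrentHashMap.get)
# returns a sentinel for missing keys, so the two-step pattern collapses
# into a single step.
def get_lock_free(key):
    return objs.get(key)
```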
spark git commit: Revert "[SPARK-21306][ML] OneVsRest should support setWeightCol"
Repository: spark Updated Branches: refs/heads/branch-2.1 8520d7c6d -> 258ca40cf Revert "[SPARK-21306][ML] OneVsRest should support setWeightCol" This reverts commit 8520d7c6d5e880dea3c1a8a874148c07222b4b4b. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/258ca40c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/258ca40c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/258ca40c Branch: refs/heads/branch-2.1 Commit: 258ca40cf43eedae59b014a41fc6197df9bde299 Parents: 8520d7c Author: Yanbo LiangAuthored: Fri Jul 28 20:24:54 2017 +0800 Committer: Yanbo Liang Committed: Fri Jul 28 20:24:54 2017 +0800 -- .../spark/ml/classification/OneVsRest.scala | 39 ++-- .../ml/classification/OneVsRestSuite.scala | 10 - python/pyspark/ml/classification.py | 27 +++--- python/pyspark/ml/tests.py | 14 --- 4 files changed, 9 insertions(+), 81 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/258ca40c/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala index c4a8f1f..e58b30d 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala @@ -34,7 +34,6 @@ import org.apache.spark.ml._ import org.apache.spark.ml.attribute._ import org.apache.spark.ml.linalg.Vector import org.apache.spark.ml.param.{Param, ParamMap, ParamPair, Params} -import org.apache.spark.ml.param.shared.HasWeightCol import org.apache.spark.ml.util._ import org.apache.spark.sql.{DataFrame, Dataset, Row} import org.apache.spark.sql.functions._ @@ -54,8 +53,7 @@ private[ml] trait ClassifierTypeTrait { /** * Params for [[OneVsRest]]. */ -private[ml] trait OneVsRestParams extends PredictorParams - with ClassifierTypeTrait with HasWeightCol { +private[ml] trait OneVsRestParams extends PredictorParams with ClassifierTypeTrait { /** * param for the base binary classifier that we reduce multiclass classification into. @@ -301,18 +299,6 @@ final class OneVsRest @Since("1.4.0") ( @Since("1.5.0") def setPredictionCol(value: String): this.type = set(predictionCol, value) - /** - * Sets the value of param [[weightCol]]. - * - * This is ignored if weight is not supported by [[classifier]]. - * If this is not set or empty, we treat all instance weights as 1.0. - * Default is not set, so all instances have weight one. - * - * @group setParam - */ - @Since("2.3.0") - def setWeightCol(value: String): this.type = set(weightCol, value) - @Since("1.4.0") override def transformSchema(schema: StructType): StructType = { validateAndTransformSchema(schema, fitting = true, getClassifier.featuresDataType) @@ -331,20 +317,7 @@ final class OneVsRest @Since("1.4.0") ( } val numClasses = MetadataUtils.getNumClasses(labelSchema).fold(computeNumClasses())(identity) -val weightColIsUsed = isDefined(weightCol) && $(weightCol).nonEmpty && { - getClassifier match { -case _: HasWeightCol => true -case c => - logWarning(s"weightCol is ignored, as it is not supported by $c now.") - false - } -} - -val multiclassLabeled = if (weightColIsUsed) { - dataset.select($(labelCol), $(featuresCol), $(weightCol)) -} else { - dataset.select($(labelCol), $(featuresCol)) -} +val multiclassLabeled = dataset.select($(labelCol), $(featuresCol)) // persist if underlying dataset is not persistent. 
val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE @@ -364,13 +337,7 @@ final class OneVsRest @Since("1.4.0") ( paramMap.put(classifier.labelCol -> labelColName) paramMap.put(classifier.featuresCol -> getFeaturesCol) paramMap.put(classifier.predictionCol -> getPredictionCol) - if (weightColIsUsed) { -val classifier_ = classifier.asInstanceOf[ClassifierType with HasWeightCol] -paramMap.put(classifier_.weightCol -> getWeightCol) -classifier_.fit(trainingDataset, paramMap) - } else { -classifier.fit(trainingDataset, paramMap) - } + classifier.fit(trainingDataset, paramMap) }.toArray[ClassificationModel[_, _]] if (handlePersistence) { http://git-wip-us.apache.org/repos/asf/spark/blob/258ca40c/mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala -- diff --git
spark git commit: Revert "[SPARK-21306][ML] OneVsRest should support setWeightCol"
Repository: spark Updated Branches: refs/heads/branch-2.0 ccb827224 -> f8ae2bdd2 Revert "[SPARK-21306][ML] OneVsRest should support setWeightCol" This reverts commit ccb82722450c20c9cdea2b2c68783943213a5aa1. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f8ae2bdd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f8ae2bdd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f8ae2bdd Branch: refs/heads/branch-2.0 Commit: f8ae2bdd2112780ec2b1104119bac2b718a55413 Parents: ccb8272 Author: Yanbo LiangAuthored: Fri Jul 28 19:45:14 2017 +0800 Committer: Yanbo Liang Committed: Fri Jul 28 19:45:14 2017 +0800 -- .../spark/ml/classification/OneVsRest.scala | 39 ++-- .../ml/classification/OneVsRestSuite.scala | 10 - python/pyspark/ml/classification.py | 27 +++--- python/pyspark/ml/tests.py | 14 --- 4 files changed, 9 insertions(+), 81 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f8ae2bdd/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala index 770d5db..f4ab0a0 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala @@ -34,7 +34,6 @@ import org.apache.spark.ml._ import org.apache.spark.ml.attribute._ import org.apache.spark.ml.linalg.Vector import org.apache.spark.ml.param.{Param, ParamMap, ParamPair, Params} -import org.apache.spark.ml.param.shared.HasWeightCol import org.apache.spark.ml.util._ import org.apache.spark.sql.{DataFrame, Dataset, Row} import org.apache.spark.sql.functions._ @@ -54,8 +53,7 @@ private[ml] trait ClassifierTypeTrait { /** * Params for [[OneVsRest]]. */ -private[ml] trait OneVsRestParams extends PredictorParams - with ClassifierTypeTrait with HasWeightCol { +private[ml] trait OneVsRestParams extends PredictorParams with ClassifierTypeTrait { /** * param for the base binary classifier that we reduce multiclass classification into. @@ -292,18 +290,6 @@ final class OneVsRest @Since("1.4.0") ( @Since("1.5.0") def setPredictionCol(value: String): this.type = set(predictionCol, value) - /** - * Sets the value of param [[weightCol]]. - * - * This is ignored if weight is not supported by [[classifier]]. - * If this is not set or empty, we treat all instance weights as 1.0. - * Default is not set, so all instances have weight one. - * - * @group setParam - */ - @Since("2.3.0") - def setWeightCol(value: String): this.type = set(weightCol, value) - @Since("1.4.0") override def transformSchema(schema: StructType): StructType = { validateAndTransformSchema(schema, fitting = true, getClassifier.featuresDataType) @@ -322,20 +308,7 @@ final class OneVsRest @Since("1.4.0") ( } val numClasses = MetadataUtils.getNumClasses(labelSchema).fold(computeNumClasses())(identity) -val weightColIsUsed = isDefined(weightCol) && $(weightCol).nonEmpty && { - getClassifier match { -case _: HasWeightCol => true -case c => - logWarning(s"weightCol is ignored, as it is not supported by $c now.") - false - } -} - -val multiclassLabeled = if (weightColIsUsed) { - dataset.select($(labelCol), $(featuresCol), $(weightCol)) -} else { - dataset.select($(labelCol), $(featuresCol)) -} +val multiclassLabeled = dataset.select($(labelCol), $(featuresCol)) // persist if underlying dataset is not persistent. 
val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE @@ -355,13 +328,7 @@ final class OneVsRest @Since("1.4.0") ( paramMap.put(classifier.labelCol -> labelColName) paramMap.put(classifier.featuresCol -> getFeaturesCol) paramMap.put(classifier.predictionCol -> getPredictionCol) - if (weightColIsUsed) { -val classifier_ = classifier.asInstanceOf[ClassifierType with HasWeightCol] -paramMap.put(classifier_.weightCol -> getWeightCol) -classifier_.fit(trainingDataset, paramMap) - } else { -classifier.fit(trainingDataset, paramMap) - } + classifier.fit(trainingDataset, paramMap) }.toArray[ClassificationModel[_, _]] if (handlePersistence) { http://git-wip-us.apache.org/repos/asf/spark/blob/f8ae2bdd/mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala -- diff --git
spark git commit: [SPARK-21306][ML] OneVsRest should support setWeightCol
Repository: spark Updated Branches: refs/heads/branch-2.0 d7b9d6235 -> ccb827224 [SPARK-21306][ML] OneVsRest should support setWeightCol ## What changes were proposed in this pull request? add `setWeightCol` method for OneVsRest. `weightCol` is ignored if classifier doesn't inherit HasWeightCol trait. ## How was this patch tested? + [x] add an unit test. Author: Yan Facai (é¢åæ)Closes #18554 from facaiy/BUG/oneVsRest_missing_weightCol. (cherry picked from commit a5a3189974ea4628e9489eb50099a5432174e80c) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ccb82722 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ccb82722 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ccb82722 Branch: refs/heads/branch-2.0 Commit: ccb82722450c20c9cdea2b2c68783943213a5aa1 Parents: d7b9d62 Author: Yan Facai (é¢åæ) Authored: Fri Jul 28 10:10:35 2017 +0800 Committer: Yanbo Liang Committed: Fri Jul 28 10:20:27 2017 +0800 -- .../spark/ml/classification/OneVsRest.scala | 39 ++-- .../ml/classification/OneVsRestSuite.scala | 10 + python/pyspark/ml/classification.py | 27 +++--- python/pyspark/ml/tests.py | 14 +++ 4 files changed, 81 insertions(+), 9 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ccb82722/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala index f4ab0a0..770d5db 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala @@ -34,6 +34,7 @@ import org.apache.spark.ml._ import org.apache.spark.ml.attribute._ import org.apache.spark.ml.linalg.Vector import org.apache.spark.ml.param.{Param, ParamMap, ParamPair, Params} +import org.apache.spark.ml.param.shared.HasWeightCol import org.apache.spark.ml.util._ import org.apache.spark.sql.{DataFrame, Dataset, Row} import org.apache.spark.sql.functions._ @@ -53,7 +54,8 @@ private[ml] trait ClassifierTypeTrait { /** * Params for [[OneVsRest]]. */ -private[ml] trait OneVsRestParams extends PredictorParams with ClassifierTypeTrait { +private[ml] trait OneVsRestParams extends PredictorParams + with ClassifierTypeTrait with HasWeightCol { /** * param for the base binary classifier that we reduce multiclass classification into. @@ -290,6 +292,18 @@ final class OneVsRest @Since("1.4.0") ( @Since("1.5.0") def setPredictionCol(value: String): this.type = set(predictionCol, value) + /** + * Sets the value of param [[weightCol]]. + * + * This is ignored if weight is not supported by [[classifier]]. + * If this is not set or empty, we treat all instance weights as 1.0. + * Default is not set, so all instances have weight one. 
+ * + * @group setParam + */ + @Since("2.3.0") + def setWeightCol(value: String): this.type = set(weightCol, value) + @Since("1.4.0") override def transformSchema(schema: StructType): StructType = { validateAndTransformSchema(schema, fitting = true, getClassifier.featuresDataType) @@ -308,7 +322,20 @@ final class OneVsRest @Since("1.4.0") ( } val numClasses = MetadataUtils.getNumClasses(labelSchema).fold(computeNumClasses())(identity) -val multiclassLabeled = dataset.select($(labelCol), $(featuresCol)) +val weightColIsUsed = isDefined(weightCol) && $(weightCol).nonEmpty && { + getClassifier match { +case _: HasWeightCol => true +case c => + logWarning(s"weightCol is ignored, as it is not supported by $c now.") + false + } +} + +val multiclassLabeled = if (weightColIsUsed) { + dataset.select($(labelCol), $(featuresCol), $(weightCol)) +} else { + dataset.select($(labelCol), $(featuresCol)) +} // persist if underlying dataset is not persistent. val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE @@ -328,7 +355,13 @@ final class OneVsRest @Since("1.4.0") ( paramMap.put(classifier.labelCol -> labelColName) paramMap.put(classifier.featuresCol -> getFeaturesCol) paramMap.put(classifier.predictionCol -> getPredictionCol) - classifier.fit(trainingDataset, paramMap) + if (weightColIsUsed) { +val classifier_ = classifier.asInstanceOf[ClassifierType with HasWeightCol] +paramMap.put(classifier_.weightCol -> getWeightCol) +
spark git commit: [SPARK-21306][ML] OneVsRest should support setWeightCol
Repository: spark Updated Branches: refs/heads/branch-2.1 94987987a -> 8520d7c6d [SPARK-21306][ML] OneVsRest should support setWeightCol ## What changes were proposed in this pull request? add `setWeightCol` method for OneVsRest. `weightCol` is ignored if classifier doesn't inherit HasWeightCol trait. ## How was this patch tested? + [x] add an unit test. Author: Yan Facai (é¢åæ)Closes #18554 from facaiy/BUG/oneVsRest_missing_weightCol. (cherry picked from commit a5a3189974ea4628e9489eb50099a5432174e80c) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8520d7c6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8520d7c6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8520d7c6 Branch: refs/heads/branch-2.1 Commit: 8520d7c6d5e880dea3c1a8a874148c07222b4b4b Parents: 9498798 Author: Yan Facai (é¢åæ) Authored: Fri Jul 28 10:10:35 2017 +0800 Committer: Yanbo Liang Committed: Fri Jul 28 10:15:59 2017 +0800 -- .../spark/ml/classification/OneVsRest.scala | 39 ++-- .../ml/classification/OneVsRestSuite.scala | 10 + python/pyspark/ml/classification.py | 27 +++--- python/pyspark/ml/tests.py | 14 +++ 4 files changed, 81 insertions(+), 9 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8520d7c6/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala index e58b30d..c4a8f1f 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala @@ -34,6 +34,7 @@ import org.apache.spark.ml._ import org.apache.spark.ml.attribute._ import org.apache.spark.ml.linalg.Vector import org.apache.spark.ml.param.{Param, ParamMap, ParamPair, Params} +import org.apache.spark.ml.param.shared.HasWeightCol import org.apache.spark.ml.util._ import org.apache.spark.sql.{DataFrame, Dataset, Row} import org.apache.spark.sql.functions._ @@ -53,7 +54,8 @@ private[ml] trait ClassifierTypeTrait { /** * Params for [[OneVsRest]]. */ -private[ml] trait OneVsRestParams extends PredictorParams with ClassifierTypeTrait { +private[ml] trait OneVsRestParams extends PredictorParams + with ClassifierTypeTrait with HasWeightCol { /** * param for the base binary classifier that we reduce multiclass classification into. @@ -299,6 +301,18 @@ final class OneVsRest @Since("1.4.0") ( @Since("1.5.0") def setPredictionCol(value: String): this.type = set(predictionCol, value) + /** + * Sets the value of param [[weightCol]]. + * + * This is ignored if weight is not supported by [[classifier]]. + * If this is not set or empty, we treat all instance weights as 1.0. + * Default is not set, so all instances have weight one. 
+ * + * @group setParam + */ + @Since("2.3.0") + def setWeightCol(value: String): this.type = set(weightCol, value) + @Since("1.4.0") override def transformSchema(schema: StructType): StructType = { validateAndTransformSchema(schema, fitting = true, getClassifier.featuresDataType) @@ -317,7 +331,20 @@ final class OneVsRest @Since("1.4.0") ( } val numClasses = MetadataUtils.getNumClasses(labelSchema).fold(computeNumClasses())(identity) -val multiclassLabeled = dataset.select($(labelCol), $(featuresCol)) +val weightColIsUsed = isDefined(weightCol) && $(weightCol).nonEmpty && { + getClassifier match { +case _: HasWeightCol => true +case c => + logWarning(s"weightCol is ignored, as it is not supported by $c now.") + false + } +} + +val multiclassLabeled = if (weightColIsUsed) { + dataset.select($(labelCol), $(featuresCol), $(weightCol)) +} else { + dataset.select($(labelCol), $(featuresCol)) +} // persist if underlying dataset is not persistent. val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE @@ -337,7 +364,13 @@ final class OneVsRest @Since("1.4.0") ( paramMap.put(classifier.labelCol -> labelColName) paramMap.put(classifier.featuresCol -> getFeaturesCol) paramMap.put(classifier.predictionCol -> getPredictionCol) - classifier.fit(trainingDataset, paramMap) + if (weightColIsUsed) { +val classifier_ = classifier.asInstanceOf[ClassifierType with HasWeightCol] +paramMap.put(classifier_.weightCol -> getWeightCol) +
spark git commit: [SPARK-21306][ML] OneVsRest should support setWeightCol
Repository: spark Updated Branches: refs/heads/master f44ead89f -> a5a318997 [SPARK-21306][ML] OneVsRest should support setWeightCol ## What changes were proposed in this pull request? add `setWeightCol` method for OneVsRest. `weightCol` is ignored if classifier doesn't inherit HasWeightCol trait. ## How was this patch tested? + [x] add an unit test. Author: Yan Facai (é¢åæ)Closes #18554 from facaiy/BUG/oneVsRest_missing_weightCol. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a5a31899 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a5a31899 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a5a31899 Branch: refs/heads/master Commit: a5a3189974ea4628e9489eb50099a5432174e80c Parents: f44ead8 Author: Yan Facai (é¢åæ) Authored: Fri Jul 28 10:10:35 2017 +0800 Committer: Yanbo Liang Committed: Fri Jul 28 10:10:35 2017 +0800 -- .../spark/ml/classification/OneVsRest.scala | 39 ++-- .../ml/classification/OneVsRestSuite.scala | 10 + python/pyspark/ml/classification.py | 27 +++--- python/pyspark/ml/tests.py | 14 +++ 4 files changed, 81 insertions(+), 9 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a5a31899/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala index 7cbcccf..05b8c3a 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala @@ -34,6 +34,7 @@ import org.apache.spark.ml._ import org.apache.spark.ml.attribute._ import org.apache.spark.ml.linalg.Vector import org.apache.spark.ml.param.{Param, ParamMap, ParamPair, Params} +import org.apache.spark.ml.param.shared.HasWeightCol import org.apache.spark.ml.util._ import org.apache.spark.sql.{DataFrame, Dataset, Row} import org.apache.spark.sql.functions._ @@ -53,7 +54,8 @@ private[ml] trait ClassifierTypeTrait { /** * Params for [[OneVsRest]]. */ -private[ml] trait OneVsRestParams extends PredictorParams with ClassifierTypeTrait { +private[ml] trait OneVsRestParams extends PredictorParams + with ClassifierTypeTrait with HasWeightCol { /** * param for the base binary classifier that we reduce multiclass classification into. @@ -294,6 +296,18 @@ final class OneVsRest @Since("1.4.0") ( @Since("1.5.0") def setPredictionCol(value: String): this.type = set(predictionCol, value) + /** + * Sets the value of param [[weightCol]]. + * + * This is ignored if weight is not supported by [[classifier]]. + * If this is not set or empty, we treat all instance weights as 1.0. + * Default is not set, so all instances have weight one. 
+ * + * @group setParam + */ + @Since("2.3.0") + def setWeightCol(value: String): this.type = set(weightCol, value) + @Since("1.4.0") override def transformSchema(schema: StructType): StructType = { validateAndTransformSchema(schema, fitting = true, getClassifier.featuresDataType) @@ -317,7 +331,20 @@ final class OneVsRest @Since("1.4.0") ( val numClasses = MetadataUtils.getNumClasses(labelSchema).fold(computeNumClasses())(identity) instr.logNumClasses(numClasses) -val multiclassLabeled = dataset.select($(labelCol), $(featuresCol)) +val weightColIsUsed = isDefined(weightCol) && $(weightCol).nonEmpty && { + getClassifier match { +case _: HasWeightCol => true +case c => + logWarning(s"weightCol is ignored, as it is not supported by $c now.") + false + } +} + +val multiclassLabeled = if (weightColIsUsed) { + dataset.select($(labelCol), $(featuresCol), $(weightCol)) +} else { + dataset.select($(labelCol), $(featuresCol)) +} // persist if underlying dataset is not persistent. val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE @@ -337,7 +364,13 @@ final class OneVsRest @Since("1.4.0") ( paramMap.put(classifier.labelCol -> labelColName) paramMap.put(classifier.featuresCol -> getFeaturesCol) paramMap.put(classifier.predictionCol -> getPredictionCol) - classifier.fit(trainingDataset, paramMap) + if (weightColIsUsed) { +val classifier_ = classifier.asInstanceOf[ClassifierType with HasWeightCol] +paramMap.put(classifier_.weightCol -> getWeightCol) +classifier_.fit(trainingDataset, paramMap) + } else { +classifier.fit(trainingDataset, paramMap) + }
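A usage sketch for the new param (the names `training` and `weight` are assumed, not from the patch). Per the logic above, the weight column is forwarded to each binary sub-classifier only when that classifier mixes in HasWeightCol; otherwise it is ignored with a warning.

import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

val ovr = new OneVsRest()
  .setClassifier(new LogisticRegression())  // LogisticRegression supports weightCol
  .setWeightCol("weight")                   // assumed column of instance weights
// val ovrModel = ovr.fit(training)         // `training` is an assumed DataFrame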
spark git commit: [SPARK-19270][ML] Add summary table to GLM summary
Repository: spark Updated Branches: refs/heads/master 2ff35a057 -> ddcd2e826 [SPARK-19270][ML] Add summary table to GLM summary ## What changes were proposed in this pull request? Add R-like summary table to GLM summary, which includes feature name (if exist), parameter estimate, standard error, t-stat and p-value. This allows scala users to easily gather these commonly used inference results. srowen yanboliang felixcheung ## How was this patch tested? New tests. One for testing feature Name, and one for testing the summary Table. Author: actuaryzhangAuthor: Wayne Zhang Author: Yanbo Liang Closes #16630 from actuaryzhang/glmTable. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ddcd2e82 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ddcd2e82 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ddcd2e82 Branch: refs/heads/master Commit: ddcd2e8269db36e4b43edd5cee921d4b12def203 Parents: 2ff35a0 Author: actuaryzhang Authored: Thu Jul 27 22:00:59 2017 +0800 Committer: Yanbo Liang Committed: Thu Jul 27 22:00:59 2017 +0800 -- .../r/GeneralizedLinearRegressionWrapper.scala | 39 ++- .../GeneralizedLinearRegression.scala | 111 ++- .../GeneralizedLinearRegressionSuite.scala | 83 +- 3 files changed, 199 insertions(+), 34 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ddcd2e82/mllib/src/main/scala/org/apache/spark/ml/r/GeneralizedLinearRegressionWrapper.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/r/GeneralizedLinearRegressionWrapper.scala b/mllib/src/main/scala/org/apache/spark/ml/r/GeneralizedLinearRegressionWrapper.scala index ee1fc9b..176a6cf 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/r/GeneralizedLinearRegressionWrapper.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/r/GeneralizedLinearRegressionWrapper.scala @@ -83,11 +83,7 @@ private[r] object GeneralizedLinearRegressionWrapper .setStringIndexerOrderType(stringIndexerOrderType) checkDataColumns(rFormula, data) val rFormulaModel = rFormula.fit(data) -// get labels and feature names from output schema -val schema = rFormulaModel.transform(data).schema -val featureAttrs = AttributeGroup.fromStructField(schema(rFormula.getFeaturesCol)) - .attributes.get -val features = featureAttrs.map(_.name.get) + // assemble and fit the pipeline val glr = new GeneralizedLinearRegression() .setFamily(family) @@ -113,37 +109,16 @@ private[r] object GeneralizedLinearRegressionWrapper val summary = glm.summary val rFeatures: Array[String] = if (glm.getFitIntercept) { - Array("(Intercept)") ++ features + Array("(Intercept)") ++ summary.featureNames } else { - features + summary.featureNames } val rCoefficients: Array[Double] = if (summary.isNormalSolver) { - val rCoefficientStandardErrors = if (glm.getFitIntercept) { -Array(summary.coefficientStandardErrors.last) ++ - summary.coefficientStandardErrors.dropRight(1) - } else { -summary.coefficientStandardErrors - } - - val rTValues = if (glm.getFitIntercept) { -Array(summary.tValues.last) ++ summary.tValues.dropRight(1) - } else { -summary.tValues - } - - val rPValues = if (glm.getFitIntercept) { -Array(summary.pValues.last) ++ summary.pValues.dropRight(1) - } else { -summary.pValues - } - - if (glm.getFitIntercept) { -Array(glm.intercept) ++ glm.coefficients.toArray ++ - rCoefficientStandardErrors ++ rTValues ++ rPValues - } else { -glm.coefficients.toArray ++ rCoefficientStandardErrors ++ rTValues ++ rPValues - } + summary.coefficientsWithStatistics.map(_._2) ++ 
+summary.coefficientsWithStatistics.map(_._3) ++ +summary.coefficientsWithStatistics.map(_._4) ++ +summary.coefficientsWithStatistics.map(_._5) } else { if (glm.getFitIntercept) { Array(glm.intercept) ++ glm.coefficients.toArray http://git-wip-us.apache.org/repos/asf/spark/blob/ddcd2e82/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala index 815607f..917a4d2 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
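A sketch of how the new summary accessors are consumed. The names featureNames and coefficientsWithStatistics come from the wrapper diff above; depending on the Spark version they may be package-private, so treat this as an illustration rather than guaranteed public API, and `dataset` is an assumed DataFrame.

import org.apache.spark.ml.regression.GeneralizedLinearRegression

val glr = new GeneralizedLinearRegression()
  .setFamily("poisson")
  .setLink("log")
// val model = glr.fit(dataset)
// val summary = model.summary
// summary.featureNames                // e.g. Array("(Intercept)", "V4", "V5")
// summary.coefficientsWithStatistics  // (name, estimate, std error, t-stat, p-value)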
spark git commit: [MINOR][ML] Reorg RFormula params.
Repository: spark Updated Branches: refs/heads/master 256358f66 -> 5d1850d4b [MINOR][ML] Reorg RFormula params. ## What changes were proposed in this pull request? There are mainly two reasons for this reorg: * Some params are placed in ```RFormulaBase```, while others are placed in ```RFormula```, this is disordered. * ```RFormulaModel``` should have params ```handleInvalid```, ```formula``` and ```forceIndexLabel```, that users can get invalid values handling policy, formula or whether to force index label if they only have a ```RFormulaModel```. So we need move these params to ```RFormulaBase``` which is also inherited by ```RFormulaModel```. * ```RFormulaModel``` should support set different ```handleInvalid``` when cross validation. ## How was this patch tested? Existing tests. Author: Yanbo LiangCloses #18681 from yanboliang/rformula-reorg. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5d1850d4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5d1850d4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5d1850d4 Branch: refs/heads/master Commit: 5d1850d4b541a8108c934a174097f3c7e10b5315 Parents: 256358f Author: Yanbo Liang Authored: Thu Jul 20 20:07:16 2017 +0800 Committer: Yanbo Liang Committed: Thu Jul 20 20:07:16 2017 +0800 -- .../org/apache/spark/ml/feature/RFormula.scala | 95 ++-- 1 file changed, 47 insertions(+), 48 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5d1850d4/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala index c224454..7da3339 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala @@ -35,7 +35,51 @@ import org.apache.spark.sql.types._ /** * Base trait for [[RFormula]] and [[RFormulaModel]]. */ -private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol { +private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol with HasHandleInvalid { + + /** + * R formula parameter. The formula is provided in string form. + * @group param + */ + @Since("1.5.0") + val formula: Param[String] = new Param(this, "formula", "R model formula") + + /** @group getParam */ + @Since("1.5.0") + def getFormula: String = $(formula) + + /** + * Force to index label whether it is numeric or string type. + * Usually we index label only when it is string type. + * If the formula was used by classification algorithms, + * we can force to index label even it is numeric type by setting this param with true. + * Default: false. + * @group param + */ + @Since("2.1.0") + val forceIndexLabel: BooleanParam = new BooleanParam(this, "forceIndexLabel", +"Force to index label whether it is numeric or string") + setDefault(forceIndexLabel -> false) + + /** @group getParam */ + @Since("2.1.0") + def getForceIndexLabel: Boolean = $(forceIndexLabel) + + /** + * Param for how to handle invalid data (unseen or NULL values) in features and label column + * of string type. Options are 'skip' (filter out rows with invalid data), + * 'error' (throw an error), or 'keep' (put invalid data in a special additional + * bucket, at index numLabels). 
+ * Default: "error" + * @group param + */ + @Since("2.3.0") + final override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", +"How to handle invalid data (unseen or NULL values) in features and label column of string " + +"type. Options are 'skip' (filter out rows with invalid data), error (throw an error), " + +"or 'keep' (put invalid data in a special additional bucket, at index numLabels).", +ParamValidators.inArray(StringIndexer.supportedHandleInvalids)) + setDefault(handleInvalid, StringIndexer.ERROR_INVALID) /** * Param for how to order categories of a string FEATURE column used by `StringIndexer`. @@ -68,6 +112,7 @@ private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol { "The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', " + "RFormula drops the same category as R when encoding strings.", ParamValidators.inArray(StringIndexer.supportedStringOrderType)) + setDefault(stringIndexerOrderType, StringIndexer.frequencyDesc) /** @group getParam */ @Since("2.3.0") @@ -108,20 +153,12 @@ private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol { @Experimental @Since("1.5.0") class RFormula
spark git commit: [SPARK-20307][ML][SPARKR][FOLLOW-UP] RFormula should handle invalid for both features and label column.
Repository: spark Updated Branches: refs/heads/master 74ac1fb08 -> 69e5282d3 [SPARK-20307][ML][SPARKR][FOLLOW-UP] RFormula should handle invalid for both features and label column. ## What changes were proposed in this pull request? ```RFormula``` should handle invalid for both features and label column. #18496 only handle invalid values in features column. This PR add handling invalid values for label column and test cases. ## How was this patch tested? Add test cases. Author: Yanbo LiangCloses #18613 from yanboliang/spark-20307. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/69e5282d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/69e5282d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/69e5282d Branch: refs/heads/master Commit: 69e5282d3c2998611680d3e10f2830d4e9c5f750 Parents: 74ac1fb Author: Yanbo Liang Authored: Sat Jul 15 20:56:38 2017 +0800 Committer: Yanbo Liang Committed: Sat Jul 15 20:56:38 2017 +0800 -- R/pkg/tests/fulltests/test_mllib_tree.R | 2 +- .../org/apache/spark/ml/feature/RFormula.scala | 9 ++-- .../apache/spark/ml/feature/RFormulaSuite.scala | 49 +++- python/pyspark/ml/feature.py| 5 +- 4 files changed, 57 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/69e5282d/R/pkg/tests/fulltests/test_mllib_tree.R -- diff --git a/R/pkg/tests/fulltests/test_mllib_tree.R b/R/pkg/tests/fulltests/test_mllib_tree.R index 66a0693..e31a65f 100644 --- a/R/pkg/tests/fulltests/test_mllib_tree.R +++ b/R/pkg/tests/fulltests/test_mllib_tree.R @@ -225,7 +225,7 @@ test_that("spark.randomForest", { expect_error(collect(predictions)) model <- spark.randomForest(traindf, clicked ~ ., type = "classification", maxDepth = 10, maxBins = 10, numTrees = 10, - handleInvalid = "skip") + handleInvalid = "keep") predictions <- predict(model, testdf) expect_equal(class(collect(predictions)$clicked[1]), "character") http://git-wip-us.apache.org/repos/asf/spark/blob/69e5282d/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala index bb7acaf..c224454 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala @@ -134,16 +134,16 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String) def getFormula: String = $(formula) /** - * Param for how to handle invalid data (unseen labels or NULL values). - * Options are 'skip' (filter out rows with invalid data), + * Param for how to handle invalid data (unseen or NULL values) in features and label column + * of string type. Options are 'skip' (filter out rows with invalid data), * 'error' (throw an error), or 'keep' (put invalid data in a special additional * bucket, at index numLabels). * Default: "error" * @group param */ @Since("2.3.0") - override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", -"How to handle invalid data (unseen labels or NULL values). " + + override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", "How to " + +"handle invalid data (unseen or NULL values) in features and label column of string type. 
" + "Options are 'skip' (filter out rows with invalid data), error (throw an error), " + "or 'keep' (put invalid data in a special additional bucket, at index numLabels).", ParamValidators.inArray(StringIndexer.supportedHandleInvalids)) @@ -265,6 +265,7 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String) encoderStages += new StringIndexer() .setInputCol(resolvedFormula.label) .setOutputCol($(labelCol)) +.setHandleInvalid($(handleInvalid)) } val pipelineModel = new Pipeline(uid).setStages(encoderStages.toArray).fit(dataset) http://git-wip-us.apache.org/repos/asf/spark/blob/69e5282d/mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala index 23570d6..5d09c90 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala +++
spark git commit: [SPARK-18619][ML] Make QuantileDiscretizer/Bucketizer/StringIndexer/RFormula inherit from HasHandleInvalid
Repository: spark Updated Branches: refs/heads/master aaad34dc2 -> d2d2a5de1 [SPARK-18619][ML] Make QuantileDiscretizer/Bucketizer/StringIndexer/RFormula inherit from HasHandleInvalid ## What changes were proposed in this pull request? 1, HasHandleInvaild support override 2, Make QuantileDiscretizer/Bucketizer/StringIndexer/RFormula inherit from HasHandleInvalid ## How was this patch tested? existing tests [JIRA](https://issues.apache.org/jira/browse/SPARK-18619) Author: Zheng RuiFengCloses #18582 from zhengruifeng/heritate_HasHandleInvalid. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d2d2a5de Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d2d2a5de Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d2d2a5de Branch: refs/heads/master Commit: d2d2a5de186ddf381d0bdb353b23d64ff0224e7f Parents: aaad34d Author: Zheng RuiFeng Authored: Wed Jul 12 22:09:03 2017 +0800 Committer: Yanbo Liang Committed: Wed Jul 12 22:09:03 2017 +0800 -- .../apache/spark/ml/feature/Bucketizer.scala| 14 ++--- .../spark/ml/feature/QuantileDiscretizer.scala | 13 ++--- .../org/apache/spark/ml/feature/RFormula.scala | 13 ++--- .../apache/spark/ml/feature/StringIndexer.scala | 13 ++--- .../ml/param/shared/SharedParamsCodeGen.scala | 2 +- .../spark/ml/param/shared/sharedParams.scala| 2 +- .../GeneralizedLinearRegression.scala | 2 +- .../spark/ml/regression/LinearRegression.scala | 14 ++--- python/pyspark/ml/feature.py| 60 9 files changed, 53 insertions(+), 80 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d2d2a5de/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala index 46b512f..6a11a75 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala @@ -24,7 +24,7 @@ import org.apache.spark.annotation.Since import org.apache.spark.ml.Model import org.apache.spark.ml.attribute.NominalAttribute import org.apache.spark.ml.param._ -import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, HasOutputCol} import org.apache.spark.ml.util._ import org.apache.spark.sql._ import org.apache.spark.sql.expressions.UserDefinedFunction @@ -36,7 +36,8 @@ import org.apache.spark.sql.types.{DoubleType, StructField, StructType} */ @Since("1.4.0") final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String) - extends Model[Bucketizer] with HasInputCol with HasOutputCol with DefaultParamsWritable { + extends Model[Bucketizer] with HasHandleInvalid with HasInputCol with HasOutputCol +with DefaultParamsWritable { @Since("1.4.0") def this() = this(Identifiable.randomUID("bucketizer")) @@ -84,17 +85,12 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String * Default: "error" * @group param */ - // TODO: SPARK-18619 Make Bucketizer inherit from HasHandleInvalid. @Since("2.1.0") - val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", "how to handle " + -"invalid entries. Options are skip (filter out rows with invalid values), " + + override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", +"how to handle invalid entries. 
Options are skip (filter out rows with invalid values), " + "error (throw an error), or keep (keep invalid values in a special additional bucket).", ParamValidators.inArray(Bucketizer.supportedHandleInvalids)) - /** @group getParam */ - @Since("2.1.0") - def getHandleInvalid: String = $(handleInvalid) - /** @group setParam */ @Since("2.1.0") def setHandleInvalid(value: String): this.type = set(handleInvalid, value) http://git-wip-us.apache.org/repos/asf/spark/blob/d2d2a5de/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala index feceeba..95e8830 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala @@ -22,7 +22,7 @@ import org.apache.spark.internal.Logging import org.apache.spark.ml._ import
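From the caller's side nothing changes, since handleInvalid is now inherited from HasHandleInvalid rather than redeclared per transformer. A minimal Bucketizer sketch (column names hypothetical), where "keep" places NaN inputs in an extra bucket:

import org.apache.spark.ml.feature.Bucketizer

val bucketizer = new Bucketizer()
  .setInputCol("hour")       // hypothetical column
  .setOutputCol("bucket")
  .setSplits(Array(Double.NegativeInfinity, 6.0, 12.0, 18.0, Double.PositiveInfinity))
  .setHandleInvalid("keep")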
spark git commit: [SPARK-21285][ML] VectorAssembler reports the column name of unsupported data type
Repository: spark Updated Branches: refs/heads/master 7fcbb9b57 -> 56536e999 [SPARK-21285][ML] VectorAssembler reports the column name of unsupported data type ## What changes were proposed in this pull request? add the column name in the exception which is raised by unsupported data type. ## How was this patch tested? + [x] pass all tests. Author: Yan Facai (é¢åæ)Closes #18523 from facaiy/ENH/vectorassembler_add_col. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/56536e99 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/56536e99 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/56536e99 Branch: refs/heads/master Commit: 56536e9992ac4ea771758463962e49bba410e896 Parents: 7fcbb9b Author: Yan Facai (é¢åæ) Authored: Fri Jul 7 18:32:01 2017 +0800 Committer: Yanbo Liang Committed: Fri Jul 7 18:32:01 2017 +0800 -- .../apache/spark/ml/feature/VectorAssembler.scala| 15 +-- .../spark/ml/feature/VectorAssemblerSuite.scala | 5 - 2 files changed, 13 insertions(+), 7 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/56536e99/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala index ca90053..73f27d1 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala @@ -113,12 +113,15 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String) override def transformSchema(schema: StructType): StructType = { val inputColNames = $(inputCols) val outputColName = $(outputCol) -val inputDataTypes = inputColNames.map(name => schema(name).dataType) -inputDataTypes.foreach { - case _: NumericType | BooleanType => - case t if t.isInstanceOf[VectorUDT] => - case other => -throw new IllegalArgumentException(s"Data type $other is not supported.") +val incorrectColumns = inputColNames.flatMap { name => + schema(name).dataType match { +case _: NumericType | BooleanType => None +case t if t.isInstanceOf[VectorUDT] => None +case other => Some(s"Data type $other of column $name is not supported.") + } +} +if (incorrectColumns.nonEmpty) { + throw new IllegalArgumentException(incorrectColumns.mkString("\n")) } if (schema.fieldNames.contains(outputColName)) { throw new IllegalArgumentException(s"Output column $outputColName already exists.") http://git-wip-us.apache.org/repos/asf/spark/blob/56536e99/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala index 46cced3..6aef1c6 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala @@ -79,7 +79,10 @@ class VectorAssemblerSuite val thrown = intercept[IllegalArgumentException] { assembler.transform(df) } -assert(thrown.getMessage contains "Data type StringType is not supported") +assert(thrown.getMessage contains + "Data type StringType of column a is not supported.\n" + + "Data type StringType of column b is not supported.\n" + + "Data type StringType of column c is not supported.") } test("ML attributes") { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional 
commands, e-mail: commits-h...@spark.apache.org
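A sketch of the improved failure mode, using the string columns a, b, c from the test above: transformSchema now reports every offending column by name, one message per line, instead of naming only the type.

import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("a", "b", "c"))
  .setOutputCol("features")
// assembler.transform(df) now throws IllegalArgumentException with messages like:
//   Data type StringType of column a is not supported.
// where `df` is the assumed DataFrame from the test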
spark git commit: [SPARK-21310][ML][PYSPARK] Expose offset in PySpark
Repository: spark Updated Branches: refs/heads/master a38643256 -> 4852b7d44 [SPARK-21310][ML][PYSPARK] Expose offset in PySpark ## What changes were proposed in this pull request? Add offset to PySpark in GLM as in #16699. ## How was this patch tested? Python test Author: actuaryzhangCloses #18534 from actuaryzhang/pythonOffset. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4852b7d4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4852b7d4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4852b7d4 Branch: refs/heads/master Commit: 4852b7d447e872079c2c81428354adc825a87b27 Parents: a386432 Author: actuaryzhang Authored: Wed Jul 5 18:41:00 2017 +0800 Committer: Yanbo Liang Committed: Wed Jul 5 18:41:00 2017 +0800 -- python/pyspark/ml/regression.py | 25 + python/pyspark/ml/tests.py | 14 ++ 2 files changed, 35 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4852b7d4/python/pyspark/ml/regression.py -- diff --git a/python/pyspark/ml/regression.py b/python/pyspark/ml/regression.py index 84d8433..f0ff7a5 100644 --- a/python/pyspark/ml/regression.py +++ b/python/pyspark/ml/regression.py @@ -1376,17 +1376,20 @@ class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, Ha typeConverter=TypeConverters.toFloat) solver = Param(Params._dummy(), "solver", "The solver algorithm for optimization. Supported " + "options: irls.", typeConverter=TypeConverters.toString) +offsetCol = Param(Params._dummy(), "offsetCol", "The offset column name. If this is not set " + + "or empty, we treat all instance offsets as 0.0", + typeConverter=TypeConverters.toString) @keyword_only def __init__(self, labelCol="label", featuresCol="features", predictionCol="prediction", family="gaussian", link=None, fitIntercept=True, maxIter=25, tol=1e-6, regParam=0.0, weightCol=None, solver="irls", linkPredictionCol=None, - variancePower=0.0, linkPower=None): + variancePower=0.0, linkPower=None, offsetCol=None): """ __init__(self, labelCol="label", featuresCol="features", predictionCol="prediction", \ family="gaussian", link=None, fitIntercept=True, maxIter=25, tol=1e-6, \ regParam=0.0, weightCol=None, solver="irls", linkPredictionCol=None, \ - variancePower=0.0, linkPower=None) + variancePower=0.0, linkPower=None, offsetCol=None) """ super(GeneralizedLinearRegression, self).__init__() self._java_obj = self._new_java_obj( @@ -1402,12 +1405,12 @@ class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, Ha def setParams(self, labelCol="label", featuresCol="features", predictionCol="prediction", family="gaussian", link=None, fitIntercept=True, maxIter=25, tol=1e-6, regParam=0.0, weightCol=None, solver="irls", linkPredictionCol=None, - variancePower=0.0, linkPower=None): + variancePower=0.0, linkPower=None, offsetCol=None): """ setParams(self, labelCol="label", featuresCol="features", predictionCol="prediction", \ family="gaussian", link=None, fitIntercept=True, maxIter=25, tol=1e-6, \ regParam=0.0, weightCol=None, solver="irls", linkPredictionCol=None, \ - variancePower=0.0, linkPower=None) + variancePower=0.0, linkPower=None, offsetCol=None) Sets params for generalized linear regression. """ kwargs = self._input_kwargs @@ -1486,6 +1489,20 @@ class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, Ha """ return self.getOrDefault(self.linkPower) +@since("2.3.0") +def setOffsetCol(self, value): +""" +Sets the value of :py:attr:`offsetCol`. 
+""" +return self._set(offsetCol=value) + +@since("2.3.0") +def getOffsetCol(self): +""" +Gets the value of offsetCol or its default value. +""" +return self.getOrDefault(self.offsetCol) + class GeneralizedLinearRegressionModel(JavaModel, JavaPredictionModel, JavaMLWritable, JavaMLReadable): http://git-wip-us.apache.org/repos/asf/spark/blob/4852b7d4/python/pyspark/ml/tests.py -- diff --git
spark git commit: [SPARK-19852][PYSPARK][ML] Python StringIndexer supports 'keep' to handle invalid data
Repository: spark Updated Branches: refs/heads/master c605fee01 -> c19680be1 [SPARK-19852][PYSPARK][ML] Python StringIndexer supports 'keep' to handle invalid data ## What changes were proposed in this pull request? This PR is to maintain API parity with changes made in SPARK-17498 to support a new option 'keep' in StringIndexer to handle unseen labels or NULL values with PySpark. Note: This is updated version of #17237 , the primary author of this PR is VinceShieh . ## How was this patch tested? Unit tests. Author: VinceShiehAuthor: Yanbo Liang Closes #18453 from yanboliang/spark-19852. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c19680be Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c19680be Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c19680be Branch: refs/heads/master Commit: c19680be1c532dded1e70edce7a981ba28af09ad Parents: c605fee Author: Yanbo Liang Authored: Sun Jul 2 16:17:03 2017 +0800 Committer: Yanbo Liang Committed: Sun Jul 2 16:17:03 2017 +0800 -- python/pyspark/ml/feature.py | 6 ++ python/pyspark/ml/tests.py | 21 + 2 files changed, 27 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c19680be/python/pyspark/ml/feature.py -- diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py index 77de1cc..25ad06f 100755 --- a/python/pyspark/ml/feature.py +++ b/python/pyspark/ml/feature.py @@ -2132,6 +2132,12 @@ class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid, "frequencyDesc, frequencyAsc, alphabetDesc, alphabetAsc.", typeConverter=TypeConverters.toString) +handleInvalid = Param(Params._dummy(), "handleInvalid", "how to handle invalid data (unseen " + + "labels or NULL values). Options are 'skip' (filter out rows with " + + "invalid data), error (throw an error), or 'keep' (put invalid data " + + "in a special additional bucket, at index numLabels).", + typeConverter=TypeConverters.toString) + @keyword_only def __init__(self, inputCol=None, outputCol=None, handleInvalid="error", stringOrderType="frequencyDesc"): http://git-wip-us.apache.org/repos/asf/spark/blob/c19680be/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index 17a3947..ffb8b0a 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -551,6 +551,27 @@ class FeatureTests(SparkSessionTestCase): for i in range(0, len(expected)): self.assertTrue(all(observed[i]["features"].toArray() == expected[i])) +def test_string_indexer_handle_invalid(self): +df = self.spark.createDataFrame([ +(0, "a"), +(1, "d"), +(2, None)], ["id", "label"]) + +si1 = StringIndexer(inputCol="label", outputCol="indexed", handleInvalid="keep", +stringOrderType="alphabetAsc") +model1 = si1.fit(df) +td1 = model1.transform(df) +actual1 = td1.select("id", "indexed").collect() +expected1 = [Row(id=0, indexed=0.0), Row(id=1, indexed=1.0), Row(id=2, indexed=2.0)] +self.assertEqual(actual1, expected1) + +si2 = si1.setHandleInvalid("skip") +model2 = si2.fit(df) +td2 = model2.transform(df) +actual2 = td2.select("id", "indexed").collect() +expected2 = [Row(id=0, indexed=0.0), Row(id=1, indexed=1.0)] +self.assertEqual(actual2, expected2) + class HasInducedError(Params): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
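The Scala side of this option already existed (SPARK-17498); a sketch mirroring the Python test above: "keep" sends unseen labels and NULLs to the extra bucket at index numLabels, while "skip" filters those rows out, exactly as the two assertions verify.

import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexed")
  .setHandleInvalid("keep")
// val model = indexer.fit(df)   // `df` is the assumed DataFrame from the test
// model.transform(df)           // NULL label maps to index numLabels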
spark git commit: [SPARK-18518][ML] HasSolver supports override
Repository: spark Updated Branches: refs/heads/master 37ef32e51 -> e0b047eaf [SPARK-18518][ML] HasSolver supports override ## What changes were proposed in this pull request? 1, make param support non-final with `finalFields` option 2, generate `HasSolver` with `finalFields = false` 3, override `solver` in LiR, GLR, and make MLPC inherit `HasSolver` ## How was this patch tested? existing tests Author: Ruifeng ZhengAuthor: Zheng RuiFeng Closes #16028 from zhengruifeng/param_non_final. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e0b047ea Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e0b047ea Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e0b047ea Branch: refs/heads/master Commit: e0b047eafed92eadf6842a9df964438095e12d41 Parents: 37ef32e Author: Ruifeng Zheng Authored: Sat Jul 1 15:37:41 2017 +0800 Committer: Yanbo Liang Committed: Sat Jul 1 15:37:41 2017 +0800 -- .../MultilayerPerceptronClassifier.scala| 19 .../ml/param/shared/SharedParamsCodeGen.scala | 11 +++-- .../spark/ml/param/shared/sharedParams.scala| 8 ++-- .../GeneralizedLinearRegression.scala | 21 - .../spark/ml/regression/LinearRegression.scala | 46 +++- python/pyspark/ml/classification.py | 18 +--- python/pyspark/ml/regression.py | 5 +++ 7 files changed, 82 insertions(+), 46 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e0b047ea/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala index ec39f96..ceba11e 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala @@ -27,13 +27,16 @@ import org.apache.spark.ml.ann.{FeedForwardTopology, FeedForwardTrainer} import org.apache.spark.ml.feature.LabeledPoint import org.apache.spark.ml.linalg.{Vector, Vectors} import org.apache.spark.ml.param._ -import org.apache.spark.ml.param.shared.{HasMaxIter, HasSeed, HasStepSize, HasTol} +import org.apache.spark.ml.param.shared._ import org.apache.spark.ml.util._ import org.apache.spark.sql.Dataset /** Params for Multilayer Perceptron. */ private[classification] trait MultilayerPerceptronParams extends PredictorParams - with HasSeed with HasMaxIter with HasTol with HasStepSize { + with HasSeed with HasMaxIter with HasTol with HasStepSize with HasSolver { + + import MultilayerPerceptronClassifier._ + /** * Layer sizes including input size and output size. * @@ -78,14 +81,10 @@ private[classification] trait MultilayerPerceptronParams extends PredictorParams * @group expertParam */ @Since("2.0.0") - final val solver: Param[String] = new Param[String](this, "solver", + final override val solver: Param[String] = new Param[String](this, "solver", "The solver algorithm for optimization. Supported options: " + - s"${MultilayerPerceptronClassifier.supportedSolvers.mkString(", ")}. (Default l-bfgs)", - ParamValidators.inArray[String](MultilayerPerceptronClassifier.supportedSolvers)) - - /** @group expertGetParam */ - @Since("2.0.0") - final def getSolver: String = $(solver) + s"${supportedSolvers.mkString(", ")}. (Default l-bfgs)", +ParamValidators.inArray[String](supportedSolvers)) /** * The initial weights of the model. 
@@ -101,7 +100,7 @@ private[classification] trait MultilayerPerceptronParams extends PredictorParams final def getInitialWeights: Vector = $(initialWeights) setDefault(maxIter -> 100, tol -> 1e-6, blockSize -> 128, -solver -> MultilayerPerceptronClassifier.LBFGS, stepSize -> 0.03) +solver -> LBFGS, stepSize -> 0.03) } /** Label to vector converter. */ http://git-wip-us.apache.org/repos/asf/spark/blob/e0b047ea/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala b/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala index 013817a..23e0d45 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala @@ -80,8 +80,7 @@
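Caller-facing behaviour is unchanged by making solver a shared, overridable param; each estimator still constrains its own value set. A minimal sketch with LinearRegression, which accepts "auto", "normal", or "l-bfgs":

import org.apache.spark.ml.regression.LinearRegression

val lir = new LinearRegression().setSolver("normal")  // exact normal-equation solve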
spark git commit: [SPARK-21275][ML] Update GLM test to use supportedFamilyNames
Repository: spark Updated Branches: refs/heads/master b1d719e7c -> 37ef32e51 [SPARK-21275][ML] Update GLM test to use supportedFamilyNames ## What changes were proposed in this pull request? Update GLM test to use supportedFamilyNames as suggested here: https://github.com/apache/spark/pull/16699#discussion-diff-100574976R855 Author: actuaryzhangCloses #18495 from actuaryzhang/mlGlmTest2. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/37ef32e5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/37ef32e5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/37ef32e5 Branch: refs/heads/master Commit: 37ef32e515ea071afe63b56ba0d4299bb76e8a75 Parents: b1d719e Author: actuaryzhang Authored: Sat Jul 1 14:57:57 2017 +0800 Committer: Yanbo Liang Committed: Sat Jul 1 14:57:57 2017 +0800 -- .../GeneralizedLinearRegressionSuite.scala | 33 ++-- 1 file changed, 16 insertions(+), 17 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/37ef32e5/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala index 83f1344..a47bd17 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala @@ -749,15 +749,15 @@ class GeneralizedLinearRegressionSuite library(statmod) y <- c(1.0, 0.5, 0.7, 0.3) w <- c(1, 2, 3, 4) - for (fam in list(gaussian(), poisson(), binomial(), Gamma(), tweedie(1.6))) { + for (fam in list(binomial(), Gamma(), gaussian(), poisson(), tweedie(1.6))) { model1 <- glm(y ~ 1, family = fam) model2 <- glm(y ~ 1, family = fam, weights = w) print(as.vector(c(coef(model1), coef(model2 } - [1] 0.625 0.530 - [1] -0.4700036 -0.6348783 [1] 0.5108256 0.1201443 [1] 1.60 1.886792 + [1] 0.625 0.530 + [1] -0.4700036 -0.6348783 [1] 1.325782 1.463641 */ @@ -768,13 +768,13 @@ class GeneralizedLinearRegressionSuite Instance(0.3, 4.0, Vectors.zeros(0)) ).toDF() -val expected = Seq(0.625, 0.530, -0.4700036, -0.6348783, 0.5108256, 0.1201443, - 1.60, 1.886792, 1.325782, 1.463641) +val expected = Seq(0.5108256, 0.1201443, 1.60, 1.886792, 0.625, 0.530, + -0.4700036, -0.6348783, 1.325782, 1.463641) import GeneralizedLinearRegression._ var idx = 0 -for (family <- Seq("gaussian", "poisson", "binomial", "gamma", "tweedie")) { +for (family <- GeneralizedLinearRegression.supportedFamilyNames.sortWith(_ < _)) { for (useWeight <- Seq(false, true)) { val trainer = new GeneralizedLinearRegression().setFamily(family) if (useWeight) trainer.setWeightCol("weight") @@ -807,7 +807,7 @@ class GeneralizedLinearRegressionSuite 0.5, 2.1, 0.5, 1.0, 2.0, 0.9, 0.4, 1.0, 2.0, 1.0, 0.7, 0.7, 0.0, 3.0, 3.0), 4, 5, byrow = TRUE)) - families <- list(gaussian, binomial, poisson, Gamma, tweedie(1.5)) + families <- list(binomial, Gamma, gaussian, poisson, tweedie(1.5)) f1 <- V1 ~ -1 + V4 + V5 f2 <- V1 ~ V4 + V5 for (f in c(f1, f2)) { @@ -816,15 +816,15 @@ class GeneralizedLinearRegressionSuite print(as.vector(coef(model))) } } - [1] 0.5169222 -0.334 [1] 0.9419107 -0.6864404 - [1] 0.1812436 -0.6568422 [1] -0.2869094 0.7857710 + [1] 0.5169222 -0.334 + [1] 0.1812436 -0.6568422 [1] 0.1055254 0.2979113 - [1] -0.05990345 0.53188982 -0.32118415 [1] -0.2147117 0.9911750 -0.6356096 - [1] -1.5616130 
0.6646470 -0.3192581 [1] 0.3390397 -0.3406099 0.6870259 + [1] -0.05990345 0.53188982 -0.32118415 + [1] -1.5616130 0.6646470 -0.3192581 [1] 0.3665034 0.1039416 0.1484616 */ val dataset = Seq( @@ -835,23 +835,22 @@ class GeneralizedLinearRegressionSuite ).toDF() val expected = Seq( - Vectors.dense(0, 0.5169222, -0.334), Vectors.dense(0, 0.9419107, -0.6864404), - Vectors.dense(0, 0.1812436, -0.6568422), Vectors.dense(0, -0.2869094, 0.785771), + Vectors.dense(0, 0.5169222, -0.334), + Vectors.dense(0, 0.1812436, -0.6568422), Vectors.dense(0, 0.1055254, 0.2979113), - Vectors.dense(-0.05990345, 0.53188982, -0.32118415),
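The test above replaces a hand-written family list with `GeneralizedLinearRegression.supportedFamilyNames` sorted alphabetically, so the expected coefficients must be listed in that same order. A minimal PySpark sketch of the loop's logic (toy DataFrame `df` with `label` and `weight` columns assumed; `supportedFamilyNames` itself is not exposed to Python, so the sorted list is spelled out):

```python
from pyspark.ml.regression import GeneralizedLinearRegression

# Iterate the supported families in a fixed alphabetical order,
# fitting intercept-only models with and without instance weights.
families = sorted(["binomial", "gamma", "gaussian", "poisson", "tweedie"])
for family in families:
    for use_weight in (False, True):
        glr = GeneralizedLinearRegression(family=family)
        if family == "tweedie":
            glr.setVariancePower(1.6)  # tweedie requires an explicit variance power
        if use_weight:
            glr.setWeightCol("weight")
        # model = glr.fit(df)  # expected values are consumed in the same sorted order
```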
spark git commit: [ML] Fix scala-2.10 build failure of GeneralizedLinearRegressionSuite.
Repository: spark Updated Branches: refs/heads/master 3c2fc19d4 -> 528c9281a [ML] Fix scala-2.10 build failure of GeneralizedLinearRegressionSuite. ## What changes were proposed in this pull request? Fix scala-2.10 build failure of ```GeneralizedLinearRegressionSuite```. ## How was this patch tested? Build with scala-2.10. Author: Yanbo Liang. Closes #18489 from yanboliang/glr. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/528c9281 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/528c9281 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/528c9281 Branch: refs/heads/master Commit: 528c9281aecc49e9bff204dd303962c705c6f237 Parents: 3c2fc19 Author: Yanbo Liang Authored: Fri Jun 30 23:25:14 2017 +0800 Committer: Yanbo Liang Committed: Fri Jun 30 23:25:14 2017 +0800 -- .../ml/regression/GeneralizedLinearRegressionSuite.scala | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/528c9281/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala index cfaa573..83f1344 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala @@ -1075,7 +1075,7 @@ class GeneralizedLinearRegressionSuite val seCoefR = Array(1.23439, 0.9669, 3.56866) val tValsR = Array(0.80297, -0.65737, -0.06017) val pValsR = Array(0.42199, 0.51094, 0.95202) -val dispersionR = 1 +val dispersionR = 1.0 val nullDevianceR = 2.17561 val residualDevianceR = 0.00018 val residualDegreeOfFreedomNullR = 3 @@ -1114,7 +1114,7 @@ class GeneralizedLinearRegressionSuite assert(x._1 ~== x._2 absTol 1E-3) } summary.tValues.zip(tValsR).foreach{ x => assert(x._1 ~== x._2 absTol 1E-3) } summary.pValues.zip(pValsR).foreach{ x => assert(x._1 ~== x._2 absTol 1E-3) } -assert(summary.dispersion ~== dispersionR absTol 1E-3) +assert(summary.dispersion === dispersionR) assert(summary.nullDeviance ~== nullDevianceR absTol 1E-3) assert(summary.deviance ~== residualDevianceR absTol 1E-3) assert(summary.residualDegreeOfFreedom === residualDegreeOfFreedomR) @@ -1190,7 +1190,7 @@ class GeneralizedLinearRegressionSuite val seCoefR = Array(1.16826, 0.41703, 1.96249) val tValsR = Array(-2.46387, 2.12428, -2.32757) val pValsR = Array(0.01374, 0.03365, 0.01993) -val dispersionR = 1 +val dispersionR = 1.0 val nullDevianceR = 22.55853 val residualDevianceR = 9.5622 val residualDegreeOfFreedomNullR = 3 @@ -1229,7 +1229,7 @@ class GeneralizedLinearRegressionSuite assert(x._1 ~== x._2 absTol 1E-3) } summary.tValues.zip(tValsR).foreach{ x => assert(x._1 ~== x._2 absTol 1E-3) } summary.pValues.zip(pValsR).foreach{ x => assert(x._1 ~== x._2 absTol 1E-3) } -assert(summary.dispersion ~== dispersionR absTol 1E-3) +assert(summary.dispersion === dispersionR) assert(summary.nullDeviance ~== nullDevianceR absTol 1E-3) assert(summary.deviance ~== residualDevianceR absTol 1E-3) assert(summary.residualDegreeOfFreedom === residualDegreeOfFreedomR)
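The fix pins `dispersionR` to the `Double` literal `1.0` (the mixed `Int`/`Double` tolerance comparison is what broke the Scala 2.10 build) and switches to exact equality, which is appropriate because dispersion is fixed at 1.0 by definition for binomial and poisson families rather than estimated. The same testing principle in a small Python sketch (illustrative values):

```python
import math

dispersion = 1.0
assert dispersion == 1.0                            # exact check: the value is definitional
assert math.isclose(2.17561, 2.1756, abs_tol=1e-3)  # tolerant check: the value is estimated
```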
spark git commit: [SPARK-18710][ML] Add offset in GLM
Repository: spark Updated Branches: refs/heads/master 52981715b -> 49d767d83 [SPARK-18710][ML] Add offset in GLM ## What changes were proposed in this pull request? Add support for offset in GLM. This is useful for at least two reasons: 1. Account for exposure: e.g., when modeling the number of accidents, we may need to use miles driven as an offset to access factors on frequency. 2. Test incremental effects of new variables: we can use predictions from the existing model as offset and run a much smaller model on only new variables. This avoids re-estimating the large model with all variables (old + new) and can be very important for efficient large-scaled analysis. ## How was this patch tested? New test. yanboliang srowen felixcheung sethah Author: actuaryzhangCloses #16699 from actuaryzhang/offset. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/49d767d8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/49d767d8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/49d767d8 Branch: refs/heads/master Commit: 49d767d838691fc7d964be2c4349662f5500ff2b Parents: 5298171 Author: actuaryzhang Authored: Fri Jun 30 20:02:15 2017 +0800 Committer: Yanbo Liang Committed: Fri Jun 30 20:02:15 2017 +0800 -- .../org/apache/spark/ml/feature/Instance.scala | 21 + .../IterativelyReweightedLeastSquares.scala | 14 +- .../spark/ml/optim/WeightedLeastSquares.scala | 2 +- .../GeneralizedLinearRegression.scala | 184 -- ...IterativelyReweightedLeastSquaresSuite.scala | 40 +- .../GeneralizedLinearRegressionSuite.scala | 634 +++ 6 files changed, 534 insertions(+), 361 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/49d767d8/mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala index cce3ca4..dd56fbb 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala @@ -27,3 +27,24 @@ import org.apache.spark.ml.linalg.Vector * @param features The vector of features for this data point. */ private[ml] case class Instance(label: Double, weight: Double, features: Vector) + +/** + * Case class that represents an instance of data point with + * label, weight, offset and features. + * This is mainly used in GeneralizedLinearRegression currently. + * + * @param label Label for this data point. + * @param weight The weight of this instance. + * @param offset The offset used for this data point. + * @param features The vector of features for this data point. + */ +private[ml] case class OffsetInstance( +label: Double, +weight: Double, +offset: Double, +features: Vector) { + + /** Converts to an [[Instance]] object by leaving out the offset. 
*/ + def toInstance: Instance = Instance(label, weight, features) + +} http://git-wip-us.apache.org/repos/asf/spark/blob/49d767d8/mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala b/mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala index 9c49551..6961b45 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala @@ -18,7 +18,7 @@ package org.apache.spark.ml.optim import org.apache.spark.internal.Logging -import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.feature.{Instance, OffsetInstance} import org.apache.spark.ml.linalg._ import org.apache.spark.rdd.RDD @@ -43,7 +43,7 @@ private[ml] class IterativelyReweightedLeastSquaresModel( * find M-estimator in robust regression and other optimization problems. * * @param initialModel the initial guess model. - * @param reweightFunc the reweight function which is used to update offsets and weights + * @param reweightFunc the reweight function which is used to update working labels and weights * at each iteration. * @param fitIntercept whether to fit intercept. * @param regParam L2 regularization parameter used by WLS. @@ -57,13 +57,13 @@ private[ml] class IterativelyReweightedLeastSquaresModel( */ private[ml] class IterativelyReweightedLeastSquares( val initialModel: WeightedLeastSquaresModel, -val reweightFunc:
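The offset enters the linear predictor with a fixed coefficient of one, which is exactly what the exposure use case needs. A minimal sketch of case 1 on the Python side (column names assumed, and assuming the Python wrapper exposes the same `offsetCol` param as the Scala estimator):

```python
from pyspark.ml.regression import GeneralizedLinearRegression

# Model accident counts with log(miles driven) as the offset:
# eta = X*beta + offset, so the exposure scales the expected count.
glr = (GeneralizedLinearRegression()
       .setFamily("poisson")
       .setLink("log")
       .setOffsetCol("logExposure"))
# model = glr.fit(claims_df)  # claims_df: label, features, logExposure columns
```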
spark git commit: [SPARK-14657][SPARKR][ML] RFormula w/o intercept should output reference category when encoding string terms
Repository: spark Updated Branches: refs/heads/master 376d90d55 -> 0c8444cf6 [SPARK-14657][SPARKR][ML] RFormula w/o intercept should output reference category when encoding string terms ## What changes were proposed in this pull request? Please see [SPARK-14657](https://issues.apache.org/jira/browse/SPARK-14657) for detail of this bug. I searched online and test some other cases, found when we fit R glm model(or other models powered by R formula) w/o intercept on a dataset including string/category features, one of the categories in the first category feature is being used as reference category, we will not drop any category for that feature. I think we should keep consistent semantics between Spark RFormula and R formula. ## How was this patch tested? Add standard unit tests. cc mengxr Author: Yanbo LiangCloses #12414 from yanboliang/spark-14657. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0c8444cf Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0c8444cf Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0c8444cf Branch: refs/heads/master Commit: 0c8444cf6d0620cd219ddcf5f50b12ff648639e9 Parents: 376d90d Author: Yanbo Liang Authored: Thu Jun 29 10:32:32 2017 +0800 Committer: Yanbo Liang Committed: Thu Jun 29 10:32:32 2017 +0800 -- .../org/apache/spark/ml/feature/RFormula.scala | 10 ++- .../apache/spark/ml/feature/RFormulaSuite.scala | 83 2 files changed, 92 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0c8444cf/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala index 1fad0a6..4b44878 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala @@ -205,12 +205,20 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String) }.toMap // Then we handle one-hot encoding and interactions between terms. +var keepReferenceCategory = false val encodedTerms = resolvedFormula.terms.map { case Seq(term) if dataset.schema(term).dataType == StringType => val encodedCol = tmpColumn("onehot") -encoderStages += new OneHotEncoder() +var encoder = new OneHotEncoder() .setInputCol(indexed(term)) .setOutputCol(encodedCol) +// Formula w/o intercept, one of the categories in the first category feature is +// being used as reference category, we will not drop any category for that feature. 
+if (!hasIntercept && !keepReferenceCategory) { + encoder = encoder.setDropLast(false) + keepReferenceCategory = true +} +encoderStages += encoder prefixesToRewrite(encodedCol + "_") = term + "_" encodedCol case Seq(term) => http://git-wip-us.apache.org/repos/asf/spark/blob/0c8444cf/mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala index 41d0062..23570d6 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala @@ -213,6 +213,89 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul assert(result.collect() === expected.collect()) } + test("formula w/o intercept, we should output reference category when encoding string terms") { +/* + R code: + + df <- data.frame(id = c(1, 2, 3, 4), + a = c("foo", "bar", "bar", "baz"), + b = c("zq", "zz", "zz", "zz"), + c = c(4, 4, 5, 5)) + model.matrix(id ~ a + b + c - 1, df) + + abar abaz afoo bzz c + 1001 0 4 + 2100 1 4 + 3100 1 5 + 4010 1 5 + + model.matrix(id ~ a:b + c - 1, df) + + c abar:bzq abaz:bzq afoo:bzq abar:bzz abaz:bzz afoo:bzz + 1 4001000 + 2 4000100 + 3 5000100 + 4 5000010 +*/ +val original = Seq((1, "foo", "zq", 4), (2, "bar", "zz", 4), (3, "bar", "zz", 5), + (4, "baz", "zz", 5)).toDF("id", "a", "b",
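A minimal PySpark counterpart of the R `model.matrix` checks above (assuming a DataFrame `df` with the same id/a/b/c columns as the R snippet): with `- 1` in the formula, the first string feature now keeps its reference category, matching R.

```python
from pyspark.ml.feature import RFormula

formula = RFormula(formula="id ~ a + b + c - 1")
encoded = formula.fit(df).transform(df)
# With the fix, the encoded vectors line up with R's
# model.matrix(id ~ a + b + c - 1, df) column for column.
encoded.select("features", "label").show(truncate=False)
```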
spark git commit: [SPARK-20899][PYSPARK] PySpark supports stringIndexerOrderType in RFormula
Repository: spark Updated Branches: refs/heads/master 35b644bd0 -> ff5676b01 [SPARK-20899][PYSPARK] PySpark supports stringIndexerOrderType in RFormula ## What changes were proposed in this pull request? PySpark supports stringIndexerOrderType in RFormula as in #17967. ## How was this patch tested? docstring test Author: actuaryzhangCloses #18122 from actuaryzhang/PythonRFormula. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ff5676b0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ff5676b0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ff5676b0 Branch: refs/heads/master Commit: ff5676b01ffd8adfe753cb749582579cbd496e7f Parents: 35b644b Author: actuaryzhang Authored: Wed May 31 01:02:19 2017 +0800 Committer: Yanbo Liang Committed: Wed May 31 01:02:19 2017 +0800 -- python/pyspark/ml/feature.py | 33 - python/pyspark/ml/tests.py | 13 + 2 files changed, 41 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ff5676b0/python/pyspark/ml/feature.py -- diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py index 955bc97..77de1cc 100755 --- a/python/pyspark/ml/feature.py +++ b/python/pyspark/ml/feature.py @@ -3043,26 +3043,35 @@ class RFormula(JavaEstimator, HasFeaturesCol, HasLabelCol, JavaMLReadable, JavaM "Force to index label whether it is numeric or string", typeConverter=TypeConverters.toBoolean) +stringIndexerOrderType = Param(Params._dummy(), "stringIndexerOrderType", + "How to order categories of a string feature column used by " + + "StringIndexer. The last category after ordering is dropped " + + "when encoding strings. Supported options: frequencyDesc, " + + "frequencyAsc, alphabetDesc, alphabetAsc. The default value " + + "is frequencyDesc. When the ordering is set to alphabetDesc, " + + "RFormula drops the same category as R when encoding strings.", + typeConverter=TypeConverters.toString) + @keyword_only def __init__(self, formula=None, featuresCol="features", labelCol="label", - forceIndexLabel=False): + forceIndexLabel=False, stringIndexerOrderType="frequencyDesc"): """ __init__(self, formula=None, featuresCol="features", labelCol="label", \ - forceIndexLabel=False) + forceIndexLabel=False, stringIndexerOrderType="frequencyDesc") """ super(RFormula, self).__init__() self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.RFormula", self.uid) -self._setDefault(forceIndexLabel=False) +self._setDefault(forceIndexLabel=False, stringIndexerOrderType="frequencyDesc") kwargs = self._input_kwargs self.setParams(**kwargs) @keyword_only @since("1.5.0") def setParams(self, formula=None, featuresCol="features", labelCol="label", - forceIndexLabel=False): + forceIndexLabel=False, stringIndexerOrderType="frequencyDesc"): """ setParams(self, formula=None, featuresCol="features", labelCol="label", \ - forceIndexLabel=False) + forceIndexLabel=False, stringIndexerOrderType="frequencyDesc") Sets params for RFormula. """ kwargs = self._input_kwargs @@ -3096,6 +3105,20 @@ class RFormula(JavaEstimator, HasFeaturesCol, HasLabelCol, JavaMLReadable, JavaM """ return self.getOrDefault(self.forceIndexLabel) +@since("2.3.0") +def setStringIndexerOrderType(self, value): +""" +Sets the value of :py:attr:`stringIndexerOrderType`. +""" +return self._set(stringIndexerOrderType=value) + +@since("2.3.0") +def getStringIndexerOrderType(self): +""" +Gets the value of :py:attr:`stringIndexerOrderType` or its default value 'frequencyDesc'. 
+""" +return self.getOrDefault(self.stringIndexerOrderType) + def _create_model(self, java_model): return RFormulaModel(java_model) http://git-wip-us.apache.org/repos/asf/spark/blob/ff5676b0/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index 0daf29d..17a3947 100755 ---
spark git commit: [SPARK-14659][ML] RFormula consistent with R when handling strings
Repository: spark Updated Branches: refs/heads/master 2dbe0c528 -> f47700c9c [SPARK-14659][ML] RFormula consistent with R when handling strings ## What changes were proposed in this pull request? When handling strings, the category dropped by RFormula and R are different: - RFormula drops the least frequent level - R drops the first level after ascending alphabetical ordering This PR supports different string ordering types in StringIndexer #17879 so that RFormula can drop the same level as R when handling strings using`stringOrderType = "alphabetDesc"`. ## How was this patch tested? new tests Author: Wayne ZhangCloses #17967 from actuaryzhang/RFormula. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f47700c9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f47700c9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f47700c9 Branch: refs/heads/master Commit: f47700c9cadd72a2495f97f250790449705f631f Parents: 2dbe0c5 Author: Wayne Zhang Authored: Fri May 26 10:44:40 2017 +0800 Committer: Yanbo Liang Committed: Fri May 26 10:44:40 2017 +0800 -- .../org/apache/spark/ml/feature/RFormula.scala | 44 +- .../apache/spark/ml/feature/StringIndexer.scala | 4 +- .../apache/spark/ml/feature/RFormulaSuite.scala | 84 3 files changed, 129 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f47700c9/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala index 5a3e292..1fad0a6 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala @@ -26,7 +26,7 @@ import org.apache.spark.annotation.{Experimental, Since} import org.apache.spark.ml.{Estimator, Model, Pipeline, PipelineModel, PipelineStage, Transformer} import org.apache.spark.ml.attribute.AttributeGroup import org.apache.spark.ml.linalg.VectorUDT -import org.apache.spark.ml.param.{BooleanParam, Param, ParamMap} +import org.apache.spark.ml.param.{BooleanParam, Param, ParamMap, ParamValidators} import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} import org.apache.spark.ml.util._ import org.apache.spark.sql.{DataFrame, Dataset} @@ -37,6 +37,42 @@ import org.apache.spark.sql.types._ */ private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol { + /** + * Param for how to order categories of a string FEATURE column used by `StringIndexer`. + * The last category after ordering is dropped when encoding strings. + * Supported options: 'frequencyDesc', 'frequencyAsc', 'alphabetDesc', 'alphabetAsc'. + * The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', `RFormula` + * drops the same category as R when encoding strings. 
+ * + * The options are explained using an example `'b', 'a', 'b', 'a', 'c', 'b'`: + * {{{ + * +-+---+--+ + * | Option | Category mapped to 0 by StringIndexer | Category dropped by RFormula| + * +-+---+--+ + * | 'frequencyDesc' | most frequent category ('b') | least frequent category ('c')| + * | 'frequencyAsc' | least frequent category ('c') | most frequent category ('b') | + * | 'alphabetDesc' | last alphabetical category ('c') | first alphabetical category ('a')| + * | 'alphabetAsc' | first alphabetical category ('a') | last alphabetical category ('c') | + * +-+---+--+ + * }}} + * Note that this ordering option is NOT used for the label column. When the label column is + * indexed, it uses the default descending frequency ordering in `StringIndexer`. + * + * @group param + */ + @Since("2.3.0") + final val stringIndexerOrderType: Param[String] = new Param(this, "stringIndexerOrderType", +"How to order categories of a string FEATURE column used by StringIndexer. " + +"The last category after ordering is dropped when encoding strings. " + +s"Supported options: ${StringIndexer.supportedStringOrderType.mkString(", ")}. " + +"The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', " + +"RFormula drops the same category as R when encoding strings.", +
spark git commit: [SPARK-20768][PYSPARK][ML] Expose numPartitions (expert) param of PySpark FPGrowth.
Repository: spark Updated Branches: refs/heads/branch-2.2 9cbf39f1c -> e01f1f222 [SPARK-20768][PYSPARK][ML] Expose numPartitions (expert) param of PySpark FPGrowth. ## What changes were proposed in this pull request? Expose numPartitions (expert) param of PySpark FPGrowth. ## How was this patch tested? + [x] Pass all unit tests. Author: Yan Facai (颜发才). Closes #18058 from facaiy/ENH/pyspark_fpg_add_num_partition. (cherry picked from commit 139da116f130ed21481d3e9bdee5df4b8d7760ac) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e01f1f22 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e01f1f22 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e01f1f22 Branch: refs/heads/branch-2.2 Commit: e01f1f222bcb7c469b1e1595e9338ed478d99894 Parents: 9cbf39f Author: Yan Facai (颜发才) Authored: Thu May 25 21:40:39 2017 +0800 Committer: Yanbo Liang Committed: Thu May 25 21:40:52 2017 +0800 -- python/pyspark/ml/fpm.py | 30 +- 1 file changed, 29 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e01f1f22/python/pyspark/ml/fpm.py -- diff --git a/python/pyspark/ml/fpm.py b/python/pyspark/ml/fpm.py index 6ff7d2c..dd7dda5 100644 --- a/python/pyspark/ml/fpm.py +++ b/python/pyspark/ml/fpm.py @@ -49,6 +49,32 @@ class HasMinSupport(Params): return self.getOrDefault(self.minSupport) +class HasNumPartitions(Params): +""" +Mixin for param numPartitions: Number of partitions (at least 1) used by parallel FP-growth. +""" + +numPartitions = Param( +Params._dummy(), +"numPartitions", +"Number of partitions (at least 1) used by parallel FP-growth. " + +"By default the param is not set, " + +"and partition number of the input dataset is used.", +typeConverter=TypeConverters.toInt) + +def setNumPartitions(self, value): +""" +Sets the value of :py:attr:`numPartitions`. +""" +return self._set(numPartitions=value) + +def getNumPartitions(self): +""" +Gets the value of :py:attr:`numPartitions` or its default value. +""" +return self.getOrDefault(self.numPartitions) + + class HasMinConfidence(Params): """ Mixin for param minConfidence. @@ -127,7 +153,9 @@ class FPGrowthModel(JavaModel, JavaMLWritable, JavaMLReadable): class FPGrowth(JavaEstimator, HasItemsCol, HasPredictionCol, - HasMinSupport, HasMinConfidence, JavaMLWritable, JavaMLReadable): + HasMinSupport, HasNumPartitions, HasMinConfidence, + JavaMLWritable, JavaMLReadable): + """ .. note:: Experimental
spark git commit: [SPARK-20768][PYSPARK][ML] Expose numPartitions (expert) param of PySpark FPGrowth.
Repository: spark Updated Branches: refs/heads/master 913a6bfe4 -> 139da116f [SPARK-20768][PYSPARK][ML] Expose numPartitions (expert) param of PySpark FPGrowth. ## What changes were proposed in this pull request? Expose numPartitions (expert) param of PySpark FPGrowth. ## How was this patch tested? + [x] Pass all unit tests. Author: Yan Facai (颜发才). Closes #18058 from facaiy/ENH/pyspark_fpg_add_num_partition. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/139da116 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/139da116 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/139da116 Branch: refs/heads/master Commit: 139da116f130ed21481d3e9bdee5df4b8d7760ac Parents: 913a6bf Author: Yan Facai (颜发才) Authored: Thu May 25 21:40:39 2017 +0800 Committer: Yanbo Liang Committed: Thu May 25 21:40:39 2017 +0800 -- python/pyspark/ml/fpm.py | 30 +- 1 file changed, 29 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/139da116/python/pyspark/ml/fpm.py -- diff --git a/python/pyspark/ml/fpm.py b/python/pyspark/ml/fpm.py index 6ff7d2c..dd7dda5 100644 --- a/python/pyspark/ml/fpm.py +++ b/python/pyspark/ml/fpm.py @@ -49,6 +49,32 @@ class HasMinSupport(Params): return self.getOrDefault(self.minSupport) +class HasNumPartitions(Params): +""" +Mixin for param numPartitions: Number of partitions (at least 1) used by parallel FP-growth. +""" + +numPartitions = Param( +Params._dummy(), +"numPartitions", +"Number of partitions (at least 1) used by parallel FP-growth. " + +"By default the param is not set, " + +"and partition number of the input dataset is used.", +typeConverter=TypeConverters.toInt) + +def setNumPartitions(self, value): +""" +Sets the value of :py:attr:`numPartitions`. +""" +return self._set(numPartitions=value) + +def getNumPartitions(self): +""" +Gets the value of :py:attr:`numPartitions` or its default value. +""" +return self.getOrDefault(self.numPartitions) + + class HasMinConfidence(Params): """ Mixin for param minConfidence. @@ -127,7 +153,9 @@ class FPGrowthModel(JavaModel, JavaMLWritable, JavaMLReadable): class FPGrowth(JavaEstimator, HasItemsCol, HasPredictionCol, - HasMinSupport, HasMinConfidence, JavaMLWritable, JavaMLReadable): + HasMinSupport, HasNumPartitions, HasMinConfidence, + JavaMLWritable, JavaMLReadable): + """ .. note:: Experimental
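A minimal sketch of the newly exposed expert param (toy itemsets assumed); when `numPartitions` is left unset, FP-growth inherits the input DataFrame's partitioning:

```python
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0, ["a", "b"]), (1, ["a", "c"]), (2, ["a"])], ["id", "items"])

# Force the parallel FP-growth mining step onto 4 partitions.
fp = FPGrowth(itemsCol="items", minSupport=0.5, numPartitions=4)
model = fp.fit(df)
model.freqItemsets.show()
```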
spark git commit: [SPARK-19281][FOLLOWUP][ML] Minor fix for PySpark FPGrowth.
Repository: spark Updated Branches: refs/heads/branch-2.2 8896c4ee9 -> 9cbf39f1c [SPARK-19281][FOLLOWUP][ML] Minor fix for PySpark FPGrowth. ## What changes were proposed in this pull request? Follow-up for #17218, some minor fix for PySpark ```FPGrowth```. ## How was this patch tested? Existing UT. Author: Yanbo Liang. Closes #18089 from yanboliang/spark-19281. (cherry picked from commit 913a6bfe4b0eb6b80a03b858ab4b2767194103de) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9cbf39f1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9cbf39f1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9cbf39f1 Branch: refs/heads/branch-2.2 Commit: 9cbf39f1c74f16483865cd93d6ffc3c521e878a7 Parents: 8896c4e Author: Yanbo Liang Authored: Thu May 25 20:15:15 2017 +0800 Committer: Yanbo Liang Committed: Thu May 25 20:15:38 2017 +0800 -- python/pyspark/ml/fpm.py | 21 +++-- 1 file changed, 11 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9cbf39f1/python/pyspark/ml/fpm.py -- diff --git a/python/pyspark/ml/fpm.py b/python/pyspark/ml/fpm.py index b30d4ed..6ff7d2c 100644 --- a/python/pyspark/ml/fpm.py +++ b/python/pyspark/ml/fpm.py @@ -23,17 +23,17 @@ from pyspark.ml.param.shared import * __all__ = ["FPGrowth", "FPGrowthModel"] -class HasSupport(Params): +class HasMinSupport(Params): """ -Mixin for param support. +Mixin for param minSupport. """ minSupport = Param( Params._dummy(), "minSupport", -"""Minimal support level of the frequent pattern. [0.0, 1.0]. -Any pattern that appears more than (minSupport * size-of-the-dataset) -times will be output""", +"Minimal support level of the frequent pattern. [0.0, 1.0]. " + +"Any pattern that appears more than (minSupport * size-of-the-dataset) " + +"times will be output in the frequent itemsets.", typeConverter=TypeConverters.toFloat) def setMinSupport(self, value): @@ -49,16 +49,17 @@ class HasSupport(Params): return self.getOrDefault(self.minSupport) -class HasConfidence(Params): +class HasMinConfidence(Params): """ -Mixin for param confidence. +Mixin for param minConfidence. """ minConfidence = Param( Params._dummy(), "minConfidence", -"""Minimal confidence for generating Association Rule. [0.0, 1.0] -Note that minConfidence has no effect during fitting.""", +"Minimal confidence for generating Association Rule. [0.0, 1.0]. " + +"minConfidence will not affect the mining for frequent itemsets, " + +"but will affect the association rules generation.", typeConverter=TypeConverters.toFloat) def setMinConfidence(self, value): @@ -126,7 +127,7 @@ class FPGrowthModel(JavaModel, JavaMLWritable, JavaMLReadable): class FPGrowth(JavaEstimator, HasItemsCol, HasPredictionCol, - HasSupport, HasConfidence, JavaMLWritable, JavaMLReadable): + HasMinSupport, HasMinConfidence, JavaMLWritable, JavaMLReadable): """ .. note:: Experimental
spark git commit: [SPARK-19281][FOLLOWUP][ML] Minor fix for PySpark FPGrowth.
Repository: spark Updated Branches: refs/heads/master 3f94e64aa -> 913a6bfe4 [SPARK-19281][FOLLOWUP][ML] Minor fix for PySpark FPGrowth. ## What changes were proposed in this pull request? Follow-up for #17218, some minor fix for PySpark ```FPGrowth```. ## How was this patch tested? Existing UT. Author: Yanbo Liang. Closes #18089 from yanboliang/spark-19281. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/913a6bfe Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/913a6bfe Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/913a6bfe Branch: refs/heads/master Commit: 913a6bfe4b0eb6b80a03b858ab4b2767194103de Parents: 3f94e64 Author: Yanbo Liang Authored: Thu May 25 20:15:15 2017 +0800 Committer: Yanbo Liang Committed: Thu May 25 20:15:15 2017 +0800 -- python/pyspark/ml/fpm.py | 21 +++-- 1 file changed, 11 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/913a6bfe/python/pyspark/ml/fpm.py -- diff --git a/python/pyspark/ml/fpm.py b/python/pyspark/ml/fpm.py index b30d4ed..6ff7d2c 100644 --- a/python/pyspark/ml/fpm.py +++ b/python/pyspark/ml/fpm.py @@ -23,17 +23,17 @@ from pyspark.ml.param.shared import * __all__ = ["FPGrowth", "FPGrowthModel"] -class HasSupport(Params): +class HasMinSupport(Params): """ -Mixin for param support. +Mixin for param minSupport. """ minSupport = Param( Params._dummy(), "minSupport", -"""Minimal support level of the frequent pattern. [0.0, 1.0]. -Any pattern that appears more than (minSupport * size-of-the-dataset) -times will be output""", +"Minimal support level of the frequent pattern. [0.0, 1.0]. " + +"Any pattern that appears more than (minSupport * size-of-the-dataset) " + +"times will be output in the frequent itemsets.", typeConverter=TypeConverters.toFloat) def setMinSupport(self, value): @@ -49,16 +49,17 @@ class HasSupport(Params): return self.getOrDefault(self.minSupport) -class HasConfidence(Params): +class HasMinConfidence(Params): """ -Mixin for param confidence. +Mixin for param minConfidence. """ minConfidence = Param( Params._dummy(), "minConfidence", -"""Minimal confidence for generating Association Rule. [0.0, 1.0] -Note that minConfidence has no effect during fitting.""", +"Minimal confidence for generating Association Rule. [0.0, 1.0]. " + +"minConfidence will not affect the mining for frequent itemsets, " + +"but will affect the association rules generation.", typeConverter=TypeConverters.toFloat) def setMinConfidence(self, value): @@ -126,7 +127,7 @@ class FPGrowthModel(JavaModel, JavaMLWritable, JavaMLReadable): class FPGrowth(JavaEstimator, HasItemsCol, HasPredictionCol, - HasSupport, HasConfidence, JavaMLWritable, JavaMLReadable): + HasMinSupport, HasMinConfidence, JavaMLWritable, JavaMLReadable): """ .. note:: Experimental
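As the reworded docstrings spell out, `minConfidence` never changes which itemsets are mined, only which association rules are generated; a quick sketch (reusing the toy `df` from the example above):

```python
from pyspark.ml.fpm import FPGrowth

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.9)
model = fp.fit(df)
model.freqItemsets.show()      # identical for any minConfidence value
model.associationRules.show()  # only rules with confidence >= 0.9 survive
```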
spark git commit: [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel
Repository: spark Updated Branches: refs/heads/branch-2.0 4dd34d004 -> 72e1f83d7 [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel ## What changes were proposed in this pull request? Fixed TypeError with python3 and numpy 1.12.1. Numpy's `reshape` no longer takes floats as arguments as of 1.12. Also, python3 uses float division for `/`; we should be using `//` to ensure that `_dataWithBiasSize` doesn't get set to a float. ## How was this patch tested? Existing tests run using python3 and numpy 1.12. Author: Bago Amirbekian. Closes #18081 from MrBago/BF-py3floatbug. (cherry picked from commit bc66a77bbe2120cc21bd8da25194efca4cde13c3) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/72e1f83d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/72e1f83d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/72e1f83d Branch: refs/heads/branch-2.0 Commit: 72e1f83d78e51b53c104d1cd101c10bbe557c047 Parents: 4dd34d0 Author: Bago Amirbekian Authored: Wed May 24 22:55:38 2017 +0800 Committer: Yanbo Liang Committed: Wed May 24 23:00:01 2017 +0800 -- python/pyspark/mllib/classification.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/72e1f83d/python/pyspark/mllib/classification.py -- diff --git a/python/pyspark/mllib/classification.py b/python/pyspark/mllib/classification.py index 9f53ed0..e04eeb2 100644 --- a/python/pyspark/mllib/classification.py +++ b/python/pyspark/mllib/classification.py @@ -171,7 +171,7 @@ class LogisticRegressionModel(LinearClassificationModel): self._dataWithBiasSize = None self._weightsMatrix = None else: -self._dataWithBiasSize = self._coeff.size / (self._numClasses - 1) +self._dataWithBiasSize = self._coeff.size // (self._numClasses - 1) self._weightsMatrix = self._coeff.toArray().reshape(self._numClasses - 1, self._dataWithBiasSize)
spark git commit: [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel
Repository: spark Updated Branches: refs/heads/branch-2.1 f4538c95f -> 13adc0fc0 [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel ## What changes were proposed in this pull request? Fixed TypeError with python3 and numpy 1.12.1. Numpy's `reshape` no longer takes floats as arguments as of 1.12. Also, python3 uses float division for `/`; we should be using `//` to ensure that `_dataWithBiasSize` doesn't get set to a float. ## How was this patch tested? Existing tests run using python3 and numpy 1.12. Author: Bago Amirbekian. Closes #18081 from MrBago/BF-py3floatbug. (cherry picked from commit bc66a77bbe2120cc21bd8da25194efca4cde13c3) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/13adc0fc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/13adc0fc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/13adc0fc Branch: refs/heads/branch-2.1 Commit: 13adc0fc0e940a4ea8b703241666440357a597e3 Parents: f4538c9 Author: Bago Amirbekian Authored: Wed May 24 22:55:38 2017 +0800 Committer: Yanbo Liang Committed: Wed May 24 22:58:16 2017 +0800 -- python/pyspark/mllib/classification.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/13adc0fc/python/pyspark/mllib/classification.py -- diff --git a/python/pyspark/mllib/classification.py b/python/pyspark/mllib/classification.py index 9f53ed0..e04eeb2 100644 --- a/python/pyspark/mllib/classification.py +++ b/python/pyspark/mllib/classification.py @@ -171,7 +171,7 @@ class LogisticRegressionModel(LinearClassificationModel): self._dataWithBiasSize = None self._weightsMatrix = None else: -self._dataWithBiasSize = self._coeff.size / (self._numClasses - 1) +self._dataWithBiasSize = self._coeff.size // (self._numClasses - 1) self._weightsMatrix = self._coeff.toArray().reshape(self._numClasses - 1, self._dataWithBiasSize)
spark git commit: [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel
Repository: spark Updated Branches: refs/heads/branch-2.2 1d107242f -> 83aeac9e0 [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel ## What changes were proposed in this pull request? Fixed TypeError with python3 and numpy 1.12.1. Numpy's `reshape` no longer takes floats as arguments as of 1.12. Also, python3 uses float division for `/`; we should be using `//` to ensure that `_dataWithBiasSize` doesn't get set to a float. ## How was this patch tested? Existing tests run using python3 and numpy 1.12. Author: Bago Amirbekian. Closes #18081 from MrBago/BF-py3floatbug. (cherry picked from commit bc66a77bbe2120cc21bd8da25194efca4cde13c3) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/83aeac9e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/83aeac9e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/83aeac9e Branch: refs/heads/branch-2.2 Commit: 83aeac9e0590e99010d0af8e067822d0ed0971fe Parents: 1d10724 Author: Bago Amirbekian Authored: Wed May 24 22:55:38 2017 +0800 Committer: Yanbo Liang Committed: Wed May 24 22:56:28 2017 +0800 -- python/pyspark/mllib/classification.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/83aeac9e/python/pyspark/mllib/classification.py -- diff --git a/python/pyspark/mllib/classification.py b/python/pyspark/mllib/classification.py index 9f53ed0..e04eeb2 100644 --- a/python/pyspark/mllib/classification.py +++ b/python/pyspark/mllib/classification.py @@ -171,7 +171,7 @@ class LogisticRegressionModel(LinearClassificationModel): self._dataWithBiasSize = None self._weightsMatrix = None else: -self._dataWithBiasSize = self._coeff.size / (self._numClasses - 1) +self._dataWithBiasSize = self._coeff.size // (self._numClasses - 1) self._weightsMatrix = self._coeff.toArray().reshape(self._numClasses - 1, self._dataWithBiasSize)
spark git commit: [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel
Repository: spark Updated Branches: refs/heads/master 1816eb3be -> bc66a77bb [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel ## What changes were proposed in this pull request? Fixed TypeError with python3 and numpy 1.12.1. Numpy's `reshape` no longer takes floats as arguments as of 1.12. Also, python3 uses float division for `/`; we should be using `//` to ensure that `_dataWithBiasSize` doesn't get set to a float. ## How was this patch tested? Existing tests run using python3 and numpy 1.12. Author: Bago Amirbekian. Closes #18081 from MrBago/BF-py3floatbug. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bc66a77b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bc66a77b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bc66a77b Branch: refs/heads/master Commit: bc66a77bbe2120cc21bd8da25194efca4cde13c3 Parents: 1816eb3 Author: Bago Amirbekian Authored: Wed May 24 22:55:38 2017 +0800 Committer: Yanbo Liang Committed: Wed May 24 22:55:38 2017 +0800 -- python/pyspark/mllib/classification.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/bc66a77b/python/pyspark/mllib/classification.py -- diff --git a/python/pyspark/mllib/classification.py b/python/pyspark/mllib/classification.py index 9f53ed0..e04eeb2 100644 --- a/python/pyspark/mllib/classification.py +++ b/python/pyspark/mllib/classification.py @@ -171,7 +171,7 @@ class LogisticRegressionModel(LinearClassificationModel): self._dataWithBiasSize = None self._weightsMatrix = None else: -self._dataWithBiasSize = self._coeff.size / (self._numClasses - 1) +self._dataWithBiasSize = self._coeff.size // (self._numClasses - 1) self._weightsMatrix = self._coeff.toArray().reshape(self._numClasses - 1, self._dataWithBiasSize)
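The one-character change is easiest to see in isolation; a standalone sketch of the failure mode (illustrative sizes, not the patched code):

```python
import numpy as np

coeff = np.arange(6.0)  # stand-in for the flattened coefficient vector
num_classes = 3

bad_rows = coeff.size / (num_classes - 1)    # Python 3 true division -> 3.0, a float
# coeff.reshape(num_classes - 1, bad_rows)   # NumPy >= 1.12 raises TypeError here

rows = coeff.size // (num_classes - 1)       # floor division -> 3, an int
mat = coeff.reshape(num_classes - 1, rows)   # OK: shape (2, 3)
```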
spark git commit: [SPARK-20631][FOLLOW-UP] Fix incorrect tests.
Repository: spark Updated Branches: refs/heads/branch-2.2 e936a96ba -> 1d107242f [SPARK-20631][FOLLOW-UP] Fix incorrect tests. ## What changes were proposed in this pull request? - Fix incorrect tests for `_check_thresholds`. - Move test to `ParamTests`. ## How was this patch tested? Unit tests. Author: zero323. Closes #18085 from zero323/SPARK-20631-FOLLOW-UP. (cherry picked from commit 1816eb3bef930407dc9e083de08f5105725c55d1) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1d107242 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1d107242 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1d107242 Branch: refs/heads/branch-2.2 Commit: 1d107242f8ec842c009e0b427f6e4a8313d99aa2 Parents: e936a96 Author: zero323 Authored: Wed May 24 19:57:44 2017 +0800 Committer: Yanbo Liang Committed: Wed May 24 19:58:40 2017 +0800 -- python/pyspark/ml/tests.py | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1d107242/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index a3393c6..0daf29d 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -404,6 +404,18 @@ class ParamTests(PySparkTestCase): self.assertEqual(tp._paramMap, copied_no_extra) self.assertEqual(tp._defaultParamMap, tp_copy._defaultParamMap) +def test_logistic_regression_check_thresholds(self): +self.assertIsInstance( +LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]), +LogisticRegression +) + +self.assertRaisesRegexp( +ValueError, +"Logistic Regression getThreshold found inconsistent.*$", +LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5] +) + class EvaluatorTests(SparkSessionTestCase): @@ -807,18 +819,6 @@ class PersistenceTest(SparkSessionTestCase): except OSError: pass -def logistic_regression_check_thresholds(self): -self.assertIsInstance( -LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]), -LogisticRegressionModel -) - -self.assertRaisesRegexp( -ValueError, -"Logistic Regression getThreshold found inconsistent.*$", -LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5] -) - def _compare_params(self, m1, m2, param): """ Compare 2 ML Params instances for the given param, and assert both have the same param value
spark git commit: [SPARK-20631][FOLLOW-UP] Fix incorrect tests.
Repository: spark Updated Branches: refs/heads/master 9afcf127d -> 1816eb3be [SPARK-20631][FOLLOW-UP] Fix incorrect tests. ## What changes were proposed in this pull request? - Fix incorrect tests for `_check_thresholds`. - Move test to `ParamTests`. ## How was this patch tested? Unit tests. Author: zero323. Closes #18085 from zero323/SPARK-20631-FOLLOW-UP. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1816eb3b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1816eb3b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1816eb3b Branch: refs/heads/master Commit: 1816eb3bef930407dc9e083de08f5105725c55d1 Parents: 9afcf12 Author: zero323 Authored: Wed May 24 19:57:44 2017 +0800 Committer: Yanbo Liang Committed: Wed May 24 19:57:44 2017 +0800 -- python/pyspark/ml/tests.py | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1816eb3b/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index a3393c6..0daf29d 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -404,6 +404,18 @@ class ParamTests(PySparkTestCase): self.assertEqual(tp._paramMap, copied_no_extra) self.assertEqual(tp._defaultParamMap, tp_copy._defaultParamMap) +def test_logistic_regression_check_thresholds(self): +self.assertIsInstance( +LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]), +LogisticRegression +) + +self.assertRaisesRegexp( +ValueError, +"Logistic Regression getThreshold found inconsistent.*$", +LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5] +) + class EvaluatorTests(SparkSessionTestCase): @@ -807,18 +819,6 @@ class PersistenceTest(SparkSessionTestCase): except OSError: pass -def logistic_regression_check_thresholds(self): -self.assertIsInstance( -LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]), -LogisticRegressionModel -) - -self.assertRaisesRegexp( -ValueError, -"Logistic Regression getThreshold found inconsistent.*$", -LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5] -) - def _compare_params(self, m1, m2, param): """ Compare 2 ML Params instances for the given param, and assert both have the same param value
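The rename is more than cosmetic: `unittest` only collects methods whose names start with `test`, so the old method body, wrong assertion and all, was silently skipped. A minimal illustration:

```python
import unittest

class Example(unittest.TestCase):
    def check_thresholds(self):       # no "test" prefix: never collected, never fails
        self.fail("unreachable under the default test loader")

    def test_check_thresholds(self):  # collected and executed by the runner
        self.assertTrue(True)

if __name__ == "__main__":
    unittest.main()
```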
spark git commit: [SPARK-20764][ML][PYSPARK][FOLLOWUP] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version
Repository: spark Updated Branches: refs/heads/branch-2.2 ee9d5975e -> e936a96ba [SPARK-20764][ML][PYSPARK][FOLLOWUP] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version ## What changes were proposed in this pull request? Add test cases for PR-18062 ## How was this patch tested? The existing UT Author: Peng. Closes #18068 from mpjlu/moreTest. (cherry picked from commit 9afcf127d31b5477a539dde6e5f01861532a1c4c) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e936a96b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e936a96b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e936a96b Branch: refs/heads/branch-2.2 Commit: e936a96badfeeb2051ee35dc4b0fbecefa9bf4cb Parents: ee9d597 Author: Peng Authored: Wed May 24 19:54:17 2017 +0800 Committer: Yanbo Liang Committed: Wed May 24 19:54:58 2017 +0800 -- python/pyspark/ml/tests.py | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e936a96b/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index 51a3e8e..a3393c6 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -1066,6 +1066,7 @@ class TrainingSummaryTest(SparkSessionTestCase): self.assertAlmostEqual(s.r2, 1.0, 2) self.assertTrue(isinstance(s.residuals, DataFrame)) self.assertEqual(s.numInstances, 2) +self.assertEqual(s.degreesOfFreedom, 1) devResiduals = s.devianceResiduals self.assertTrue(isinstance(devResiduals, list) and isinstance(devResiduals[0], float)) coefStdErr = s.coefficientStandardErrors @@ -1075,7 +1076,8 @@ class TrainingSummaryTest(SparkSessionTestCase): pValues = s.pValues self.assertTrue(isinstance(pValues, list) and isinstance(pValues[0], float)) # test evaluation (with training dataset) produces a summary with same values -# one check is enough to verify a summary is returned, Scala version runs full test +# one check is enough to verify a summary is returned +# The child class LinearRegressionTrainingSummary runs full test sameSummary = model.evaluate(df) self.assertAlmostEqual(sameSummary.explainedVariance, s.explainedVariance) @@ -1093,6 +1095,7 @@ class TrainingSummaryTest(SparkSessionTestCase): self.assertEqual(s.numIterations, 1) # this should default to a single iteration of WLS self.assertTrue(isinstance(s.predictions, DataFrame)) self.assertEqual(s.predictionCol, "prediction") +self.assertEqual(s.numInstances, 2) self.assertTrue(isinstance(s.residuals(), DataFrame)) self.assertTrue(isinstance(s.residuals("pearson"), DataFrame)) coefStdErr = s.coefficientStandardErrors @@ -1111,7 +1114,8 @@ class TrainingSummaryTest(SparkSessionTestCase): self.assertTrue(isinstance(s.nullDeviance, float)) self.assertTrue(isinstance(s.dispersion, float)) # test evaluation (with training dataset) produces a summary with same values -# one check is enough to verify a summary is returned, Scala version runs full test +# one check is enough to verify a summary is returned +# The child class GeneralizedLinearRegressionTrainingSummary runs full test sameSummary = model.evaluate(df) self.assertAlmostEqual(sameSummary.deviance, s.deviance)
spark git commit: [SPARK-20764][ML][PYSPARK][FOLLOWUP] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version
Repository: spark Updated Branches: refs/heads/master d76633e3c -> 9afcf127d [SPARK-20764][ML][PYSPARK][FOLLOWUP] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version ## What changes were proposed in this pull request? Add test cases for PR-18062 ## How was this patch tested? The existing UT Author: Peng. Closes #18068 from mpjlu/moreTest. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9afcf127 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9afcf127 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9afcf127 Branch: refs/heads/master Commit: 9afcf127d31b5477a539dde6e5f01861532a1c4c Parents: d76633e Author: Peng Authored: Wed May 24 19:54:17 2017 +0800 Committer: Yanbo Liang Committed: Wed May 24 19:54:17 2017 +0800 -- python/pyspark/ml/tests.py | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9afcf127/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index 51a3e8e..a3393c6 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -1066,6 +1066,7 @@ class TrainingSummaryTest(SparkSessionTestCase): self.assertAlmostEqual(s.r2, 1.0, 2) self.assertTrue(isinstance(s.residuals, DataFrame)) self.assertEqual(s.numInstances, 2) +self.assertEqual(s.degreesOfFreedom, 1) devResiduals = s.devianceResiduals self.assertTrue(isinstance(devResiduals, list) and isinstance(devResiduals[0], float)) coefStdErr = s.coefficientStandardErrors @@ -1075,7 +1076,8 @@ class TrainingSummaryTest(SparkSessionTestCase): pValues = s.pValues self.assertTrue(isinstance(pValues, list) and isinstance(pValues[0], float)) # test evaluation (with training dataset) produces a summary with same values -# one check is enough to verify a summary is returned, Scala version runs full test +# one check is enough to verify a summary is returned +# The child class LinearRegressionTrainingSummary runs full test sameSummary = model.evaluate(df) self.assertAlmostEqual(sameSummary.explainedVariance, s.explainedVariance) @@ -1093,6 +1095,7 @@ class TrainingSummaryTest(SparkSessionTestCase): self.assertEqual(s.numIterations, 1) # this should default to a single iteration of WLS self.assertTrue(isinstance(s.predictions, DataFrame)) self.assertEqual(s.predictionCol, "prediction") +self.assertEqual(s.numInstances, 2) self.assertTrue(isinstance(s.residuals(), DataFrame)) self.assertTrue(isinstance(s.residuals("pearson"), DataFrame)) coefStdErr = s.coefficientStandardErrors @@ -1111,7 +1114,8 @@ class TrainingSummaryTest(SparkSessionTestCase): self.assertTrue(isinstance(s.nullDeviance, float)) self.assertTrue(isinstance(s.dispersion, float)) # test evaluation (with training dataset) produces a summary with same values -# one check is enough to verify a summary is returned, Scala version runs full test +# one check is enough to verify a summary is returned +# The child class GeneralizedLinearRegressionTrainingSummary runs full test sameSummary = model.evaluate(df) self.assertAlmostEqual(sameSummary.deviance, s.deviance)
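A sketch of the summary fields the new assertions exercise (a training DataFrame `df` with label, features, and weight columns is assumed):

```python
from pyspark.ml.regression import GeneralizedLinearRegression, LinearRegression

lr_summary = LinearRegression(weightCol="weight").fit(df).summary
lr_summary.numInstances      # now asserted on the Python side...
lr_summary.degreesOfFreedom  # ...together with the degrees of freedom

glr_summary = GeneralizedLinearRegression().fit(df).summary
glr_summary.numInstances     # exposed on the GLR training summary as well
```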
spark git commit: [MINOR][SPARKR][ML] Joint coefficients with intercept for SparkR linear SVM summary.
Repository: spark Updated Branches: refs/heads/master 442287ae2 -> ad09e4ca0 [MINOR][SPARKR][ML] Joint coefficients with intercept for SparkR linear SVM summary. ## What changes were proposed in this pull request? Joint coefficients with intercept for SparkR linear SVM summary. ## How was this patch tested? Existing tests. Author: Yanbo LiangCloses #18035 from yanboliang/svm-r. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ad09e4ca Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ad09e4ca Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ad09e4ca Branch: refs/heads/master Commit: ad09e4ca045715d053a672c2ba23f598f06085d8 Parents: 442287a Author: Yanbo Liang Authored: Tue May 23 16:16:14 2017 +0800 Committer: Yanbo Liang Committed: Tue May 23 16:16:14 2017 +0800 -- R/pkg/R/mllib_classification.R | 38 .../tests/testthat/test_mllib_classification.R | 3 +- .../apache/spark/ml/r/LinearSVCWrapper.scala| 12 +-- 3 files changed, 26 insertions(+), 27 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ad09e4ca/R/pkg/R/mllib_classification.R -- diff --git a/R/pkg/R/mllib_classification.R b/R/pkg/R/mllib_classification.R index 4db9cc3..306a9b8 100644 --- a/R/pkg/R/mllib_classification.R +++ b/R/pkg/R/mllib_classification.R @@ -46,15 +46,16 @@ setClass("MultilayerPerceptronClassificationModel", representation(jobj = "jobj" #' @note NaiveBayesModel since 2.0.0 setClass("NaiveBayesModel", representation(jobj = "jobj")) -#' linear SVM Model +#' Linear SVM Model #' -#' Fits an linear SVM model against a SparkDataFrame. It is a binary classifier, similar to svm in glmnet package +#' Fits a linear SVM model against a SparkDataFrame, similar to svm in e1071 package. +#' Currently only supports binary classification model with linear kernel. #' Users can print, make predictions on the produced model and save the model to the input path. #' #' @param data SparkDataFrame for training. #' @param formula A symbolic description of the model to be fitted. Currently only a few formula #'operators are supported, including '~', '.', ':', '+', and '-'. -#' @param regParam The regularization parameter. +#' @param regParam The regularization parameter. Only supports L2 regularization currently. #' @param maxIter Maximum iteration number. #' @param tol Convergence tolerance of iterations. #' @param standardization Whether to standardize the training features before fitting the model. The coefficients @@ -111,10 +112,10 @@ setMethod("spark.svmLinear", signature(data = "SparkDataFrame", formula = "formu new("LinearSVCModel", jobj = jobj) }) -# Predicted values based on an LinearSVCModel model +# Predicted values based on a LinearSVCModel model #' @param newData a SparkDataFrame for testing. -#' @return \code{predict} returns the predicted values based on an LinearSVCModel. +#' @return \code{predict} returns the predicted values based on a LinearSVCModel. #' @rdname spark.svmLinear #' @aliases predict,LinearSVCModel,SparkDataFrame-method #' @export @@ -124,13 +125,12 @@ setMethod("predict", signature(object = "LinearSVCModel"), predict_internal(object, newData) }) -# Get the summary of an LinearSVCModel +# Get the summary of a LinearSVCModel -#' @param object an LinearSVCModel fitted by \code{spark.svmLinear}. +#' @param object a LinearSVCModel fitted by \code{spark.svmLinear}. #' @return \code{summary} returns summary information of the fitted model, which is a list. 
#' The list includes \code{coefficients} (coefficients of the fitted model), -#' \code{intercept} (intercept of the fitted model), \code{numClasses} (number of classes), -#' \code{numFeatures} (number of features). +#' \code{numClasses} (number of classes), \code{numFeatures} (number of features). #' @rdname spark.svmLinear #' @aliases summary,LinearSVCModel-method #' @export @@ -138,22 +138,14 @@ setMethod("predict", signature(object = "LinearSVCModel"), setMethod("summary", signature(object = "LinearSVCModel"), function(object) { jobj <- object@jobj -features <- callJMethod(jobj, "features") -labels <- callJMethod(jobj, "labels") -coefficients <- callJMethod(jobj, "coefficients") -nCol <- length(coefficients) / length(features) -coefficients <- matrix(unlist(coefficients), ncol = nCol) -intercept <- callJMethod(jobj, "intercept") +features <- callJMethod(jobj, "rFeatures") +coefficients
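For comparison, the analogous PySpark `LinearSVC` (also new in 2.2) keeps `coefficients` and `intercept` as separate model attributes; the SparkR change above only affects presentation, folding the intercept into the named coefficients matrix returned by `summary()`. A minimal sketch with made-up data and column names (editor's illustration, not part of the patch):

```python
from pyspark.ml.classification import LinearSVC
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy binary-classification data; column names are illustrative assumptions.
df = spark.createDataFrame([
    (1.0, Vectors.dense(0.0, 1.1)),
    (0.0, Vectors.dense(2.0, 1.0)),
    (1.0, Vectors.dense(0.1, 1.3)),
], ["label", "features"])

model = LinearSVC(regParam=0.01, maxIter=10).fit(df)

# PySpark keeps these separate; SparkR's summary() now joins them into
# one coefficients matrix with the intercept as its first row.
print(model.coefficients)
print(model.intercept)
```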
spark git commit: [MINOR][SPARKR][ML] Joint coefficients with intercept for SparkR linear SVM summary.
Repository: spark Updated Branches: refs/heads/branch-2.2 06c985c1b -> dbb068f4f [MINOR][SPARKR][ML] Joint coefficients with intercept for SparkR linear SVM summary. ## What changes were proposed in this pull request? Joint coefficients with intercept for SparkR linear SVM summary. ## How was this patch tested? Existing tests. Author: Yanbo Liang. Closes #18035 from yanboliang/svm-r. (cherry picked from commit ad09e4ca045715d053a672c2ba23f598f06085d8) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dbb068f4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dbb068f4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dbb068f4 Branch: refs/heads/branch-2.2 Commit: dbb068f4f280fd48c991302f9e9728378926b1a2 Parents: 06c985c Author: Yanbo Liang Authored: Tue May 23 16:16:14 2017 +0800 Committer: Yanbo Liang Committed: Tue May 23 16:16:29 2017 +0800 -- R/pkg/R/mllib_classification.R | 38  .../tests/testthat/test_mllib_classification.R | 3 +- .../apache/spark/ml/r/LinearSVCWrapper.scala| 12 +-- 3 files changed, 26 insertions(+), 27 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dbb068f4/R/pkg/R/mllib_classification.R -- diff --git a/R/pkg/R/mllib_classification.R b/R/pkg/R/mllib_classification.R index 4db9cc3..306a9b8 100644 --- a/R/pkg/R/mllib_classification.R +++ b/R/pkg/R/mllib_classification.R @@ -46,15 +46,16 @@ setClass("MultilayerPerceptronClassificationModel", representation(jobj = "jobj" #' @note NaiveBayesModel since 2.0.0 setClass("NaiveBayesModel", representation(jobj = "jobj")) -#' linear SVM Model +#' Linear SVM Model #' -#' Fits an linear SVM model against a SparkDataFrame. It is a binary classifier, similar to svm in glmnet package +#' Fits a linear SVM model against a SparkDataFrame, similar to svm in e1071 package. +#' Currently only supports binary classification model with linear kernel. #' Users can print, make predictions on the produced model and save the model to the input path. #' #' @param data SparkDataFrame for training. #' @param formula A symbolic description of the model to be fitted. Currently only a few formula #'operators are supported, including '~', '.', ':', '+', and '-'. -#' @param regParam The regularization parameter. +#' @param regParam The regularization parameter. Only supports L2 regularization currently. #' @param maxIter Maximum iteration number. #' @param tol Convergence tolerance of iterations. #' @param standardization Whether to standardize the training features before fitting the model. The coefficients @@ -111,10 +112,10 @@ setMethod("spark.svmLinear", signature(data = "SparkDataFrame", formula = "formu new("LinearSVCModel", jobj = jobj) }) -# Predicted values based on an LinearSVCModel model +# Predicted values based on a LinearSVCModel model #' @param newData a SparkDataFrame for testing. -#' @return \code{predict} returns the predicted values based on an LinearSVCModel. +#' @return \code{predict} returns the predicted values based on a LinearSVCModel. #' @rdname spark.svmLinear #' @aliases predict,LinearSVCModel,SparkDataFrame-method #' @export @@ -124,13 +125,12 @@ setMethod("predict", signature(object = "LinearSVCModel"), predict_internal(object, newData) }) -# Get the summary of an LinearSVCModel +# Get the summary of a LinearSVCModel -#' @param object an LinearSVCModel fitted by \code{spark.svmLinear}. +#' @param object a LinearSVCModel fitted by \code{spark.svmLinear}. 
#' @return \code{summary} returns summary information of the fitted model, which is a list. #' The list includes \code{coefficients} (coefficients of the fitted model), -#' \code{intercept} (intercept of the fitted model), \code{numClasses} (number of classes), -#' \code{numFeatures} (number of features). +#' \code{numClasses} (number of classes), \code{numFeatures} (number of features). #' @rdname spark.svmLinear #' @aliases summary,LinearSVCModel-method #' @export @@ -138,22 +138,14 @@ setMethod("predict", signature(object = "LinearSVCModel"), setMethod("summary", signature(object = "LinearSVCModel"), function(object) { jobj <- object@jobj -features <- callJMethod(jobj, "features") -labels <- callJMethod(jobj, "labels") -coefficients <- callJMethod(jobj, "coefficients") -nCol <- length(coefficients) / length(features) -coefficients <- matrix(unlist(coefficients), ncol = nCol) -
spark git commit: [SPARK-20764][ML][PYSPARK] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version
Repository: spark Updated Branches: refs/heads/branch-2.2 a57553279 -> a0bf5c47c [SPARK-20764][ML][PYSPARK] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version ## What changes were proposed in this pull request? SPARK-20097 exposed degreesOfFreedom in LinearRegressionSummary and numInstances in GeneralizedLinearRegressionSummary. Python API should be updated to reflect these changes. ## How was this patch tested? The existing UT Author: Peng. Closes #18062 from mpjlu/spark-20764. (cherry picked from commit cfca01136bd7443c1d9daf8e8e256635eec20ddc) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a0bf5c47 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a0bf5c47 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a0bf5c47 Branch: refs/heads/branch-2.2 Commit: a0bf5c47cb9c72d73616f876a4521ae80e2e4ecb Parents: a575532 Author: Peng Authored: Mon May 22 22:42:37 2017 +0800 Committer: Yanbo Liang Committed: Mon May 22 22:42:56 2017 +0800 -- python/pyspark/ml/regression.py | 16 1 file changed, 16 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a0bf5c47/python/pyspark/ml/regression.py -- diff --git a/python/pyspark/ml/regression.py b/python/pyspark/ml/regression.py index 3c3fcc8..2d17f95 100644 --- a/python/pyspark/ml/regression.py +++ b/python/pyspark/ml/regression.py @@ -324,6 +324,14 @@ class LinearRegressionSummary(JavaWrapper): return self._call_java("numInstances") @property +@since("2.2.0") +def degreesOfFreedom(self): +""" +Degrees of freedom. +""" +return self._call_java("degreesOfFreedom") + +@property @since("2.0.0") def devianceResiduals(self): """ @@ -1566,6 +1574,14 @@ class GeneralizedLinearRegressionSummary(JavaWrapper): return self._call_java("predictionCol") @property +@since("2.2.0") +def numInstances(self): +""" +Number of instances in DataFrame predictions. +""" +return self._call_java("numInstances") + +@property @since("2.0.0") def rank(self): """
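A minimal PySpark sketch of the two properties this patch exposes (editor's illustration with made-up data and column names, not part of the patch):

```python
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import GeneralizedLinearRegression, LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy regression data; "label"/"features" column names are assumptions.
df = spark.createDataFrame([
    (1.0, Vectors.dense(0.0)),
    (2.0, Vectors.dense(1.0)),
    (3.0, Vectors.dense(2.0)),
    (5.0, Vectors.dense(3.0)),
], ["label", "features"])

# degreesOfFreedom on the linear regression summary, newly visible in Python.
print(LinearRegression(maxIter=5).fit(df).summary.degreesOfFreedom)

# numInstances on the GLR summary, newly visible in Python.
print(GeneralizedLinearRegression(family="gaussian").fit(df).summary.numInstances)
```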
spark git commit: [SPARK-20764][ML][PYSPARK] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version
Repository: spark Updated Branches: refs/heads/master f3ed62a38 -> cfca01136 [SPARK-20764][ML][PYSPARK] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version ## What changes were proposed in this pull request? SPARK-20097 exposed degreesOfFreedom in LinearRegressionSummary and numInstances in GeneralizedLinearRegressionSummary. Python API should be updated to reflect these changes. ## How was this patch tested? The existing UT Author: Peng. Closes #18062 from mpjlu/spark-20764. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cfca0113 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cfca0113 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cfca0113 Branch: refs/heads/master Commit: cfca01136bd7443c1d9daf8e8e256635eec20ddc Parents: f3ed62a Author: Peng Authored: Mon May 22 22:42:37 2017 +0800 Committer: Yanbo Liang Committed: Mon May 22 22:42:37 2017 +0800 -- python/pyspark/ml/regression.py | 16 1 file changed, 16 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cfca0113/python/pyspark/ml/regression.py -- diff --git a/python/pyspark/ml/regression.py b/python/pyspark/ml/regression.py index 3c3fcc8..2d17f95 100644 --- a/python/pyspark/ml/regression.py +++ b/python/pyspark/ml/regression.py @@ -324,6 +324,14 @@ class LinearRegressionSummary(JavaWrapper): return self._call_java("numInstances") @property +@since("2.2.0") +def degreesOfFreedom(self): +""" +Degrees of freedom. +""" +return self._call_java("degreesOfFreedom") + +@property @since("2.0.0") def devianceResiduals(self): """ @@ -1566,6 +1574,14 @@ class GeneralizedLinearRegressionSummary(JavaWrapper): return self._call_java("predictionCol") @property +@since("2.2.0") +def numInstances(self): +""" +Number of instances in DataFrame predictions. +""" +return self._call_java("numInstances") + +@property @since("2.0.0") def rank(self): """
spark git commit: [SPARK-20505][ML] Add docs and examples for ml.stat.Correlation and ml.stat.ChiSquareTest.
Repository: spark Updated Branches: refs/heads/branch-2.2 b8fa79cec -> ba0117c27 [SPARK-20505][ML] Add docs and examples for ml.stat.Correlation and ml.stat.ChiSquareTest. ## What changes were proposed in this pull request? Add docs and examples for ```ml.stat.Correlation``` and ```ml.stat.ChiSquareTest```. ## How was this patch tested? Generated docs and ran examples manually, successfully. Author: Yanbo Liang. Closes #17994 from yanboliang/spark-20505. (cherry picked from commit 697a5e5517e32c5ef44c273e3b26662d0eb70f24) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ba0117c2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ba0117c2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ba0117c2 Branch: refs/heads/branch-2.2 Commit: ba0117c2716a6a3b9810bc17b67f9f502c49fa9b Parents: b8fa79c Author: Yanbo Liang Authored: Thu May 18 11:54:09 2017 +0800 Committer: Yanbo Liang Committed: Thu May 18 11:54:21 2017 +0800 -- docs/_data/menu-ml.yaml | 2 + docs/ml-statistics.md | 92 .../examples/ml/JavaChiSquareTestExample.java | 75 .../examples/ml/JavaCorrelationExample.java | 72 +++ .../main/python/ml/chi_square_test_example.py | 52 +++ .../src/main/python/ml/correlation_example.py | 51 +++ .../examples/ml/ChiSquareTestExample.scala | 63 ++ .../spark/examples/ml/CorrelationExample.scala | 63 ++ 8 files changed, 470 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ba0117c2/docs/_data/menu-ml.yaml -- diff --git a/docs/_data/menu-ml.yaml b/docs/_data/menu-ml.yaml index 047423f..b5a6641 100644 --- a/docs/_data/menu-ml.yaml +++ b/docs/_data/menu-ml.yaml @@ -1,3 +1,5 @@ +- text: Basic statistics + url: ml-statistics.html - text: Pipelines url: ml-pipeline.html - text: Extracting, transforming and selecting features http://git-wip-us.apache.org/repos/asf/spark/blob/ba0117c2/docs/ml-statistics.md -- diff --git a/docs/ml-statistics.md b/docs/ml-statistics.md new file mode 100644 index 000..abfb3ca --- /dev/null +++ b/docs/ml-statistics.md @@ -0,0 +1,92 @@ +--- +layout: global +title: Basic Statistics +displayTitle: Basic Statistics +--- + + +`\[ +\newcommand{\R}{\mathbb{R}} +\newcommand{\E}{\mathbb{E}} +\newcommand{\x}{\mathbf{x}} +\newcommand{\y}{\mathbf{y}} +\newcommand{\wv}{\mathbf{w}} +\newcommand{\av}{\mathbf{\alpha}} +\newcommand{\bv}{\mathbf{b}} +\newcommand{\N}{\mathbb{N}} +\newcommand{\id}{\mathbf{I}} +\newcommand{\ind}{\mathbf{1}} +\newcommand{\0}{\mathbf{0}} +\newcommand{\unit}{\mathbf{e}} +\newcommand{\one}{\mathbf{1}} +\newcommand{\zero}{\mathbf{0}} +\]` + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). +{:toc} + +## Correlation + +Calculating the correlation between two series of data is a common operation in Statistics. In `spark.ml` +we provide the flexibility to calculate pairwise correlations among many series. The supported +correlation methods are currently Pearson's and Spearman's correlation. + + + +[`Correlation`](api/scala/index.html#org.apache.spark.ml.stat.Correlation$) +computes the correlation matrix for the input Dataset of Vectors using the specified method. +The output will be a DataFrame that contains the correlation matrix of the column of vectors. + +{% include_example scala/org/apache/spark/examples/ml/CorrelationExample.scala %} + + + +[`Correlation`](api/java/org/apache/spark/ml/stat/Correlation.html) +computes the correlation matrix for the input Dataset of Vectors using the specified method. 
+The output will be a DataFrame that contains the correlation matrix of the column of vectors. + +{% include_example java/org/apache/spark/examples/ml/JavaCorrelationExample.java %} + + + +[`Correlation`](api/python/pyspark.ml.html#pyspark.ml.stat.Correlation$) +computes the correlation matrix for the input Dataset of Vectors using the specified method. +The output will be a DataFrame that contains the correlation matrix of the column of vectors. + +{% include_example python/ml/correlation_example.py %} + + + + +## Hypothesis testing + +Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically +significant, whether this result occurred by chance or not. `spark.ml` currently supports Pearson's +Chi-squared ( $\chi^2$) tests for independence. + +`ChiSquareTest` conducts Pearson's independence test for every feature against the label. +For
spark git commit: [SPARK-20505][ML] Add docs and examples for ml.stat.Correlation and ml.stat.ChiSquareTest.
Repository: spark Updated Branches: refs/heads/master 324a904d8 -> 697a5e551 [SPARK-20505][ML] Add docs and examples for ml.stat.Correlation and ml.stat.ChiSquareTest. ## What changes were proposed in this pull request? Add docs and examples for ```ml.stat.Correlation``` and ```ml.stat.ChiSquareTest```. ## How was this patch tested? Generated docs and ran examples manually, successfully. Author: Yanbo Liang. Closes #17994 from yanboliang/spark-20505. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/697a5e55 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/697a5e55 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/697a5e55 Branch: refs/heads/master Commit: 697a5e5517e32c5ef44c273e3b26662d0eb70f24 Parents: 324a904 Author: Yanbo Liang Authored: Thu May 18 11:54:09 2017 +0800 Committer: Yanbo Liang Committed: Thu May 18 11:54:09 2017 +0800 -- docs/_data/menu-ml.yaml | 2 + docs/ml-statistics.md | 92 .../examples/ml/JavaChiSquareTestExample.java | 75 .../examples/ml/JavaCorrelationExample.java | 72 +++ .../main/python/ml/chi_square_test_example.py | 52 +++ .../src/main/python/ml/correlation_example.py | 51 +++ .../examples/ml/ChiSquareTestExample.scala | 63 ++ .../spark/examples/ml/CorrelationExample.scala | 63 ++ 8 files changed, 470 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/697a5e55/docs/_data/menu-ml.yaml -- diff --git a/docs/_data/menu-ml.yaml b/docs/_data/menu-ml.yaml index 047423f..b5a6641 100644 --- a/docs/_data/menu-ml.yaml +++ b/docs/_data/menu-ml.yaml @@ -1,3 +1,5 @@ +- text: Basic statistics + url: ml-statistics.html - text: Pipelines url: ml-pipeline.html - text: Extracting, transforming and selecting features http://git-wip-us.apache.org/repos/asf/spark/blob/697a5e55/docs/ml-statistics.md -- diff --git a/docs/ml-statistics.md b/docs/ml-statistics.md new file mode 100644 index 000..abfb3ca --- /dev/null +++ b/docs/ml-statistics.md @@ -0,0 +1,92 @@ +--- +layout: global +title: Basic Statistics +displayTitle: Basic Statistics +--- + + +`\[ +\newcommand{\R}{\mathbb{R}} +\newcommand{\E}{\mathbb{E}} +\newcommand{\x}{\mathbf{x}} +\newcommand{\y}{\mathbf{y}} +\newcommand{\wv}{\mathbf{w}} +\newcommand{\av}{\mathbf{\alpha}} +\newcommand{\bv}{\mathbf{b}} +\newcommand{\N}{\mathbb{N}} +\newcommand{\id}{\mathbf{I}} +\newcommand{\ind}{\mathbf{1}} +\newcommand{\0}{\mathbf{0}} +\newcommand{\unit}{\mathbf{e}} +\newcommand{\one}{\mathbf{1}} +\newcommand{\zero}{\mathbf{0}} +\]` + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). +{:toc} + +## Correlation + +Calculating the correlation between two series of data is a common operation in Statistics. In `spark.ml` +we provide the flexibility to calculate pairwise correlations among many series. The supported +correlation methods are currently Pearson's and Spearman's correlation. + + + +[`Correlation`](api/scala/index.html#org.apache.spark.ml.stat.Correlation$) +computes the correlation matrix for the input Dataset of Vectors using the specified method. +The output will be a DataFrame that contains the correlation matrix of the column of vectors. + +{% include_example scala/org/apache/spark/examples/ml/CorrelationExample.scala %} + + + +[`Correlation`](api/java/org/apache/spark/ml/stat/Correlation.html) +computes the correlation matrix for the input Dataset of Vectors using the specified method. +The output will be a DataFrame that contains the correlation matrix of the column of vectors. 
+ +{% include_example java/org/apache/spark/examples/ml/JavaCorrelationExample.java %} + + + +[`Correlation`](api/python/pyspark.ml.html#pyspark.ml.stat.Correlation$) +computes the correlation matrix for the input Dataset of Vectors using the specified method. +The output will be a DataFrame that contains the correlation matrix of the column of vectors. + +{% include_example python/ml/correlation_example.py %} + + + + +## Hypothesis testing + +Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically +significant, whether this result occurred by chance or not. `spark.ml` currently supports Pearson's +Chi-squared ( $\chi^2$) tests for independence. + +`ChiSquareTest` conducts Pearson's independence test for every feature against the label. +For each feature, the (feature, label) pairs are converted into a contingency matrix for which +the Chi-squared statistic is
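A condensed sketch in the spirit of the `correlation_example.py` and `chi_square_test_example.py` files this patch adds (editor's illustration; the data is made up):

```python
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import ChiSquareTest, Correlation
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A vector column for pairwise correlations; values are illustrative.
df = spark.createDataFrame([
    (Vectors.dense(1.0, 0.0, 3.0),),
    (Vectors.dense(2.0, 5.0, 1.0),),
    (Vectors.dense(4.0, 2.0, 8.0),),
], ["features"])

print(Correlation.corr(df, "features").head())              # Pearson (default)
print(Correlation.corr(df, "features", "spearman").head())  # Spearman

# ChiSquareTest needs a label column as well.
labeled = spark.createDataFrame(
    [(0.0, Vectors.dense(0.5, 10.0)), (1.0, Vectors.dense(1.5, 20.0))],
    ["label", "features"])
print(ChiSquareTest.test(labeled, "features", "label").head())
```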
spark git commit: [SPARK-20707][ML] ML deprecated APIs should be removed in major release.
Repository: spark Updated Branches: refs/heads/branch-2.2 10e599f69 -> a869e8bfd [SPARK-20707][ML] ML deprecated APIs should be removed in major release. ## What changes were proposed in this pull request? Before 2.2, MLlib kept removing APIs deprecated in the last feature/minor release. But from Spark 2.2, we decided to remove deprecated APIs only in a major release, so we need to change the corresponding annotations to tell users those will be removed in 3.0. Meanwhile, this fixes bugs in the ML documents: the original ML docs couldn't show deprecation annotations in the ```MLWriter``` and ```MLReader``` related classes; we correct that in this PR. Before: ![image](https://cloud.githubusercontent.com/assets/1962026/25939889/f8c55f20-3666-11e7-9fa2-0605bfb3ed06.png) After: ![image](https://cloud.githubusercontent.com/assets/1962026/25939870/e9b0d5be-3666-11e7-9765-5e04885e4b32.png) ## How was this patch tested? Existing tests. Author: Yanbo Liang. Closes #17946 from yanboliang/spark-20707. (cherry picked from commit d4022d49514cc1f8ffc5bfe243186ec3748df475) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a869e8bf Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a869e8bf Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a869e8bf Branch: refs/heads/branch-2.2 Commit: a869e8bfdc23b9e3796a7c4d51f91902b5a067d2 Parents: 10e599f Author: Yanbo Liang Authored: Tue May 16 10:08:23 2017 +0800 Committer: Yanbo Liang Committed: Tue May 16 10:08:35 2017 +0800 -- .../org/apache/spark/ml/tree/treeParams.scala | 60 ++-- .../org/apache/spark/ml/util/ReadWrite.scala| 4 +- python/docs/pyspark.ml.rst | 8 +++ python/pyspark/ml/util.py | 16 -- 4 files changed, 51 insertions(+), 37 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a869e8bf/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala b/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala index cd1950b..3fc3ac5 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala @@ -110,77 +110,77 @@ private[ml] trait DecisionTreeParams extends PredictorParams maxMemoryInMB -> 256, cacheNodeIds -> false, checkpointInterval -> 10) /** - * @deprecated This method is deprecated and will be removed in 2.2.0. + * @deprecated This method is deprecated and will be removed in 3.0.0. * @group setParam */ - @deprecated("This method is deprecated and will be removed in 2.2.0.", "2.1.0") + @deprecated("This method is deprecated and will be removed in 3.0.0.", "2.1.0") def setMaxDepth(value: Int): this.type = set(maxDepth, value) /** @group getParam */ final def getMaxDepth: Int = $(maxDepth) /** - * @deprecated This method is deprecated and will be removed in 2.2.0. + * @deprecated This method is deprecated and will be removed in 3.0.0. * @group setParam */ - @deprecated("This method is deprecated and will be removed in 2.2.0.", "2.1.0") + @deprecated("This method is deprecated and will be removed in 3.0.0.", "2.1.0") def setMaxBins(value: Int): this.type = set(maxBins, value) /** @group getParam */ final def getMaxBins: Int = $(maxBins) /** - * @deprecated This method is deprecated and will be removed in 2.2.0. + * @deprecated This method is deprecated and will be removed in 3.0.0. 
* @group setParam */ - @deprecated("This method is deprecated and will be removed in 2.2.0.", "2.1.0") + @deprecated("This method is deprecated and will be removed in 3.0.0.", "2.1.0") def setMinInstancesPerNode(value: Int): this.type = set(minInstancesPerNode, value) /** @group getParam */ final def getMinInstancesPerNode: Int = $(minInstancesPerNode) /** - * @deprecated This method is deprecated and will be removed in 2.2.0. + * @deprecated This method is deprecated and will be removed in 3.0.0. * @group setParam */ - @deprecated("This method is deprecated and will be removed in 2.2.0.", "2.1.0") + @deprecated("This method is deprecated and will be removed in 3.0.0.", "2.1.0") def setMinInfoGain(value: Double): this.type = set(minInfoGain, value) /** @group getParam */ final def getMinInfoGain: Double = $(minInfoGain) /** - * @deprecated This method is deprecated and will be removed in 2.2.0. + * @deprecated This method is deprecated and will be removed in 3.0.0. * @group setParam */ - @deprecated("This
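On the Python side, deprecation notes like these usually surface at runtime through the `warnings` module; a hypothetical sketch of the pattern only (the function name is invented, not from this patch):

```python
import warnings

def set_max_depth(value):
    """Hypothetical setter illustrating the deprecation pattern."""
    warnings.warn(
        "This method is deprecated and will be removed in 3.0.0.",
        DeprecationWarning)
    # ... delegate to the supported code path here ...
```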
spark git commit: [SPARK-20669][ML] LoR.family and LDA.optimizer should be case insensitive
Repository: spark Updated Branches: refs/heads/master b0888d1ac -> 9970aa096 [SPARK-20669][ML] LoR.family and LDA.optimizer should be case insensitive ## What changes were proposed in this pull request? Make param `family` in LoR and `optimizer` in LDA case-insensitive. ## How was this patch tested? Updated tests. yanboliang Author: Zheng RuiFeng. Closes #17910 from zhengruifeng/lr_family_lowercase. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9970aa09 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9970aa09 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9970aa09 Branch: refs/heads/master Commit: 9970aa0962ec253a6e838aea26a627689dc5b011 Parents: b0888d1 Author: Zheng RuiFeng Authored: Mon May 15 23:21:44 2017 +0800 Committer: Yanbo Liang Committed: Mon May 15 23:21:44 2017 +0800 -- .../ml/classification/LogisticRegression.scala | 4 +-- .../org/apache/spark/ml/clustering/LDA.scala| 30 ++-- .../LogisticRegressionSuite.scala | 11 +++ .../apache/spark/ml/clustering/LDASuite.scala | 10 +++ 4 files changed, 38 insertions(+), 17 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9970aa09/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala index 42dc7fb..0534872 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala @@ -94,7 +94,7 @@ private[classification] trait LogisticRegressionParams extends ProbabilisticClas final val family: Param[String] = new Param(this, "family", "The name of family which is a description of the label distribution to be used in the " + s"model. Supported options: ${supportedFamilyNames.mkString(", ")}.", -ParamValidators.inArray[String](supportedFamilyNames)) +(value: String) => supportedFamilyNames.contains(value.toLowerCase(Locale.ROOT))) /** @group getParam */ @Since("2.1.0") @@ -526,7 +526,7 @@ class LogisticRegression @Since("1.2.0") ( case None => histogram.length } -val isMultinomial = $(family) match { +val isMultinomial = getFamily.toLowerCase(Locale.ROOT) match { case "binomial" => require(numClasses == 1 || numClasses == 2, s"Binomial family only supports 1 or 2 " + s"outcome classes but found $numClasses.") http://git-wip-us.apache.org/repos/asf/spark/blob/9970aa09/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala b/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala index e3026c8..3da29b1 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala @@ -174,8 +174,7 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM @Since("1.6.0") final val optimizer = new Param[String](this, "optimizer", "Optimizer or inference" + " algorithm used to estimate the LDA model. 
Supported: " + supportedOptimizers.mkString(", "), -(o: String) => - ParamValidators.inArray(supportedOptimizers).apply(o.toLowerCase(Locale.ROOT))) +(value: String) => supportedOptimizers.contains(value.toLowerCase(Locale.ROOT))) /** @group getParam */ @Since("1.6.0") @@ -325,7 +324,7 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM s" ${getDocConcentration.length}, but k = $getK. docConcentration must be an array of" + s" length either 1 (scalar) or k (num topics).") } - getOptimizer match { + getOptimizer.toLowerCase(Locale.ROOT) match { case "online" => require(getDocConcentration.forall(_ >= 0), "For Online LDA optimizer, docConcentration values must be >= 0. Found values: " + @@ -337,7 +336,7 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM } } if (isSet(topicConcentration)) { - getOptimizer match { + getOptimizer.toLowerCase(Locale.ROOT) match { case "online" => require(getTopicConcentration >= 0, s"For Online LDA optimizer, topicConcentration" + s" must be >= 0. Found value: $getTopicConcentration") @@ -350,17 +349,18 @@ private[clustering] trait
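Since the validation happens JVM-side, the fix flows through to the Python and R APIs as well; a hedged PySpark sketch (editor's illustration, not part of the patch):

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.clustering import LDA

# After this patch the validators lowercase the value before matching,
# so mixed-case spellings are accepted at fit time.
lor = LogisticRegression(family="Binomial")  # matches "binomial"
lda = LDA(optimizer="Online")                # matches "online"
print(lor.getFamily(), lda.getOptimizer())
```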
spark git commit: [SPARK-20606][ML] Revert "[] ML 2.2 QA: Remove deprecated methods for ML"
Repository: spark Updated Branches: refs/heads/branch-2.2 3eb0ee06a -> 80a57fa90 [SPARK-20606][ML] Revert "[] ML 2.2 QA: Remove deprecated methods for ML" This reverts commit b8733e0ad9f5a700f385e210450fd2c10137293e. Author: Yanbo Liang. Closes #17944 from yanboliang/spark-20606-revert. (cherry picked from commit 0698e6c88ca11fdfd6e5498cab784cf6dbcdfacb) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/80a57fa9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/80a57fa9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/80a57fa9 Branch: refs/heads/branch-2.2 Commit: 80a57fa90be8dca4340345c09b4ea28fbf11a516 Parents: 3eb0ee0 Author: Yanbo Liang Authored: Thu May 11 14:48:13 2017 +0800 Committer: Yanbo Liang Committed: Thu May 11 14:48:26 2017 +0800 -- .../classification/DecisionTreeClassifier.scala | 18 ++-- .../spark/ml/classification/GBTClassifier.scala | 24 ++--- .../classification/RandomForestClassifier.scala | 24 ++--- .../ml/regression/DecisionTreeRegressor.scala | 18 ++-- .../spark/ml/regression/GBTRegressor.scala | 24 ++--- .../ml/regression/RandomForestRegressor.scala | 24 ++--- .../org/apache/spark/ml/tree/treeParams.scala | 105 +++ .../org/apache/spark/ml/util/ReadWrite.scala| 16 +++ project/MimaExcludes.scala | 68 python/pyspark/ml/util.py | 32 ++ 10 files changed, 219 insertions(+), 134 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/80a57fa9/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala index 5fb105c..9f60f08 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala @@ -54,27 +54,27 @@ class DecisionTreeClassifier @Since("1.4.0") ( /** @group setParam */ @Since("1.4.0") - def setMaxDepth(value: Int): this.type = set(maxDepth, value) + override def setMaxDepth(value: Int): this.type = set(maxDepth, value) /** @group setParam */ @Since("1.4.0") - def setMaxBins(value: Int): this.type = set(maxBins, value) + override def setMaxBins(value: Int): this.type = set(maxBins, value) /** @group setParam */ @Since("1.4.0") - def setMinInstancesPerNode(value: Int): this.type = set(minInstancesPerNode, value) + override def setMinInstancesPerNode(value: Int): this.type = set(minInstancesPerNode, value) /** @group setParam */ @Since("1.4.0") - def setMinInfoGain(value: Double): this.type = set(minInfoGain, value) + override def setMinInfoGain(value: Double): this.type = set(minInfoGain, value) /** @group expertSetParam */ @Since("1.4.0") - def setMaxMemoryInMB(value: Int): this.type = set(maxMemoryInMB, value) + override def setMaxMemoryInMB(value: Int): this.type = set(maxMemoryInMB, value) /** @group expertSetParam */ @Since("1.4.0") - def setCacheNodeIds(value: Boolean): this.type = set(cacheNodeIds, value) + override def setCacheNodeIds(value: Boolean): this.type = set(cacheNodeIds, value) /** * Specifies how often to checkpoint the cached node IDs. 
@@ -86,15 +86,15 @@ class DecisionTreeClassifier @Since("1.4.0") ( * @group setParam */ @Since("1.4.0") - def setCheckpointInterval(value: Int): this.type = set(checkpointInterval, value) + override def setCheckpointInterval(value: Int): this.type = set(checkpointInterval, value) /** @group setParam */ @Since("1.4.0") - def setImpurity(value: String): this.type = set(impurity, value) + override def setImpurity(value: String): this.type = set(impurity, value) /** @group setParam */ @Since("1.6.0") - def setSeed(value: Long): this.type = set(seed, value) + override def setSeed(value: Long): this.type = set(seed, value) override protected def train(dataset: Dataset[_]): DecisionTreeClassificationModel = { val categoricalFeatures: Map[Int, Int] = http://git-wip-us.apache.org/repos/asf/spark/blob/80a57fa9/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala index 263ed10..ade0960 100644 ---
spark git commit: [SPARK-20606][ML] Revert "[] ML 2.2 QA: Remove deprecated methods for ML"
Repository: spark Updated Branches: refs/heads/master 8ddbc431d -> 0698e6c88 [SPARK-20606][ML] Revert "[] ML 2.2 QA: Remove deprecated methods for ML" This reverts commit b8733e0ad9f5a700f385e210450fd2c10137293e. Author: Yanbo Liang. Closes #17944 from yanboliang/spark-20606-revert. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0698e6c8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0698e6c8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0698e6c8 Branch: refs/heads/master Commit: 0698e6c88ca11fdfd6e5498cab784cf6dbcdfacb Parents: 8ddbc43 Author: Yanbo Liang Authored: Thu May 11 14:48:13 2017 +0800 Committer: Yanbo Liang Committed: Thu May 11 14:48:13 2017 +0800 -- .../classification/DecisionTreeClassifier.scala | 18 ++-- .../spark/ml/classification/GBTClassifier.scala | 24 ++--- .../classification/RandomForestClassifier.scala | 24 ++--- .../ml/regression/DecisionTreeRegressor.scala | 18 ++-- .../spark/ml/regression/GBTRegressor.scala | 24 ++--- .../ml/regression/RandomForestRegressor.scala | 24 ++--- .../org/apache/spark/ml/tree/treeParams.scala | 105 +++ .../org/apache/spark/ml/util/ReadWrite.scala| 16 +++ project/MimaExcludes.scala | 68 python/pyspark/ml/util.py | 32 ++ 10 files changed, 219 insertions(+), 134 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0698e6c8/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala index 5fb105c..9f60f08 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala @@ -54,27 +54,27 @@ class DecisionTreeClassifier @Since("1.4.0") ( /** @group setParam */ @Since("1.4.0") - def setMaxDepth(value: Int): this.type = set(maxDepth, value) + override def setMaxDepth(value: Int): this.type = set(maxDepth, value) /** @group setParam */ @Since("1.4.0") - def setMaxBins(value: Int): this.type = set(maxBins, value) + override def setMaxBins(value: Int): this.type = set(maxBins, value) /** @group setParam */ @Since("1.4.0") - def setMinInstancesPerNode(value: Int): this.type = set(minInstancesPerNode, value) + override def setMinInstancesPerNode(value: Int): this.type = set(minInstancesPerNode, value) /** @group setParam */ @Since("1.4.0") - def setMinInfoGain(value: Double): this.type = set(minInfoGain, value) + override def setMinInfoGain(value: Double): this.type = set(minInfoGain, value) /** @group expertSetParam */ @Since("1.4.0") - def setMaxMemoryInMB(value: Int): this.type = set(maxMemoryInMB, value) + override def setMaxMemoryInMB(value: Int): this.type = set(maxMemoryInMB, value) /** @group expertSetParam */ @Since("1.4.0") - def setCacheNodeIds(value: Boolean): this.type = set(cacheNodeIds, value) + override def setCacheNodeIds(value: Boolean): this.type = set(cacheNodeIds, value) /** * Specifies how often to checkpoint the cached node IDs. 
@@ -86,15 +86,15 @@ class DecisionTreeClassifier @Since("1.4.0") ( * @group setParam */ @Since("1.4.0") - def setCheckpointInterval(value: Int): this.type = set(checkpointInterval, value) + override def setCheckpointInterval(value: Int): this.type = set(checkpointInterval, value) /** @group setParam */ @Since("1.4.0") - def setImpurity(value: String): this.type = set(impurity, value) + override def setImpurity(value: String): this.type = set(impurity, value) /** @group setParam */ @Since("1.6.0") - def setSeed(value: Long): this.type = set(seed, value) + override def setSeed(value: Long): this.type = set(seed, value) override protected def train(dataset: Dataset[_]): DecisionTreeClassificationModel = { val categoricalFeatures: Map[Int, Int] = http://git-wip-us.apache.org/repos/asf/spark/blob/0698e6c8/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala index 263ed10..ade0960 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala +++
spark git commit: [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params
Repository: spark Updated Branches: refs/heads/branch-2.0 46659974e -> d86dae8fe [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params ## What changes were proposed in this pull request? - Replace `getParam` calls with `getOrDefault` calls. - Fix exception message to avoid unintended `TypeError`. - Add unit tests ## How was this patch tested? New unit tests. Author: zero323. Closes #17891 from zero323/SPARK-20631. (cherry picked from commit 804949c6bf00b8e26c39d48bbcc4d0470ee84e47) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d86dae8f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d86dae8f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d86dae8f Branch: refs/heads/branch-2.0 Commit: d86dae8feec5e9bf77dd5ba0cf9caa1b955de020 Parents: 4665997 Author: zero323 Authored: Wed May 10 16:57:52 2017 +0800 Committer: Yanbo Liang Committed: Wed May 10 17:00:22 2017 +0800 -- python/pyspark/ml/classification.py | 6 +++--- python/pyspark/ml/tests.py | 12 2 files changed, 15 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d86dae8f/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index bfeda7c..0a30321 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -200,13 +200,13 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti def _checkThresholdConsistency(self): if self.isSet(self.threshold) and self.isSet(self.thresholds): -ts = self.getParam(self.thresholds) +ts = self.getOrDefault(self.thresholds) if len(ts) != 2: raise ValueError("Logistic Regression getThreshold only applies to" + " binary classification, but thresholds has length != 2." + - " thresholds: " + ",".join(ts)) + " thresholds: {0}".format(str(ts))) t = 1.0/(1.0 + ts[0]/ts[1]) -t2 = self.getParam(self.threshold) +t2 = self.getOrDefault(self.threshold) if abs(t2 - t) >= 1E-5: raise ValueError("Logistic Regression getThreshold found inconsistent values for" + " threshold (%g) and thresholds (equivalent to %g)" % (t2, t)) http://git-wip-us.apache.org/repos/asf/spark/blob/d86dae8f/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index 3c346b9..87f0aff 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -765,6 +765,18 @@ class PersistenceTest(SparkSessionTestCase): except OSError: pass +def logistic_regression_check_thresholds(self): +self.assertIsInstance( +LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]), +LogisticRegressionModel +) + +self.assertRaisesRegexp( +ValueError, +"Logistic Regression getThreshold found inconsistent.*$", +LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5] +) + def _compare_params(self, m1, m2, param): """ Compare 2 ML Params instances for the given param, and assert both have the same param value
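The underlying bug: `getParam` returns the `Param` object itself rather than its value, so the consistency check blew up with an unintended `TypeError` before it could raise the intended `ValueError`. (Note that the added test method lacks the `test_` prefix, so a standard unittest runner will not collect it.) A hedged sketch of the fixed behavior (editor's illustration, not part of the patch):

```python
from pyspark.ml.classification import LogisticRegression

# Consistent threshold/thresholds: construction succeeds.
LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5])

# Inconsistent values: with getOrDefault the check reads the actual
# values and raises a clear ValueError instead of a TypeError.
try:
    LogisticRegression(threshold=0.42, thresholds=[0.5, 0.5])
except ValueError as e:
    print(e)
```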
spark git commit: [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params
Repository: spark Updated Branches: refs/heads/master 0ef16bd4b -> 804949c6b [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params ## What changes were proposed in this pull request? - Replace `getParam` calls with `getOrDefault` calls. - Fix exception message to avoid unintended `TypeError`. - Add unit tests ## How was this patch tested? New unit tests. Author: zero323. Closes #17891 from zero323/SPARK-20631. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/804949c6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/804949c6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/804949c6 Branch: refs/heads/master Commit: 804949c6bf00b8e26c39d48bbcc4d0470ee84e47 Parents: 0ef16bd Author: zero323 Authored: Wed May 10 16:57:52 2017 +0800 Committer: Yanbo Liang Committed: Wed May 10 16:57:52 2017 +0800 -- python/pyspark/ml/classification.py | 6 +++--- python/pyspark/ml/tests.py | 12 2 files changed, 15 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/804949c6/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index a9756ea..dcc12d9 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -349,13 +349,13 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti def _checkThresholdConsistency(self): if self.isSet(self.threshold) and self.isSet(self.thresholds): -ts = self.getParam(self.thresholds) +ts = self.getOrDefault(self.thresholds) if len(ts) != 2: raise ValueError("Logistic Regression getThreshold only applies to" + " binary classification, but thresholds has length != 2." + - " thresholds: " + ",".join(ts)) + " thresholds: {0}".format(str(ts))) t = 1.0/(1.0 + ts[0]/ts[1]) -t2 = self.getParam(self.threshold) +t2 = self.getOrDefault(self.threshold) if abs(t2 - t) >= 1E-5: raise ValueError("Logistic Regression getThreshold found inconsistent values for" + " threshold (%g) and thresholds (equivalent to %g)" % (t2, t)) http://git-wip-us.apache.org/repos/asf/spark/blob/804949c6/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index 571ac4b..51a3e8e 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -807,6 +807,18 @@ class PersistenceTest(SparkSessionTestCase): except OSError: pass +def logistic_regression_check_thresholds(self): +self.assertIsInstance( +LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]), +LogisticRegressionModel +) + +self.assertRaisesRegexp( +ValueError, +"Logistic Regression getThreshold found inconsistent.*$", +LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5] +) + def _compare_params(self, m1, m2, param): """ Compare 2 ML Params instances for the given param, and assert both have the same param value
spark git commit: [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params
Repository: spark Updated Branches: refs/heads/branch-2.1 8e097890a -> 69786ea3a [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params ## What changes were proposed in this pull request? - Replace `getParam` calls with `getOrDefault` calls. - Fix exception message to avoid unintended `TypeError`. - Add unit tests ## How was this patch tested? New unit tests. Author: zero323. Closes #17891 from zero323/SPARK-20631. (cherry picked from commit 804949c6bf00b8e26c39d48bbcc4d0470ee84e47) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/69786ea3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/69786ea3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/69786ea3 Branch: refs/heads/branch-2.1 Commit: 69786ea3a972af1b29a332dc11ac507ed4368cc6 Parents: 8e09789 Author: zero323 Authored: Wed May 10 16:57:52 2017 +0800 Committer: Yanbo Liang Committed: Wed May 10 16:58:34 2017 +0800 -- python/pyspark/ml/classification.py | 6 +++--- python/pyspark/ml/tests.py | 12 2 files changed, 15 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/69786ea3/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index 570a414..2b47c40 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -238,13 +238,13 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti def _checkThresholdConsistency(self): if self.isSet(self.threshold) and self.isSet(self.thresholds): -ts = self.getParam(self.thresholds) +ts = self.getOrDefault(self.thresholds) if len(ts) != 2: raise ValueError("Logistic Regression getThreshold only applies to" + " binary classification, but thresholds has length != 2." + - " thresholds: " + ",".join(ts)) + " thresholds: {0}".format(str(ts))) t = 1.0/(1.0 + ts[0]/ts[1]) -t2 = self.getParam(self.threshold) +t2 = self.getOrDefault(self.threshold) if abs(t2 - t) >= 1E-5: raise ValueError("Logistic Regression getThreshold found inconsistent values for" + " threshold (%g) and thresholds (equivalent to %g)" % (t2, t)) http://git-wip-us.apache.org/repos/asf/spark/blob/69786ea3/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index 70e0c6d..7152036 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -808,6 +808,18 @@ class PersistenceTest(SparkSessionTestCase): except OSError: pass +def logistic_regression_check_thresholds(self): +self.assertIsInstance( +LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]), +LogisticRegressionModel +) + +self.assertRaisesRegexp( +ValueError, +"Logistic Regression getThreshold found inconsistent.*$", +LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5] +) + def _compare_params(self, m1, m2, param): """ Compare 2 ML Params instances for the given param, and assert both have the same param value
spark git commit: [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params
Repository: spark Updated Branches: refs/heads/branch-2.2 ef50a9548 -> 3ed2f4d51 [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params ## What changes were proposed in this pull request? - Replace `getParam` calls with `getOrDefault` calls. - Fix exception message to avoid unintended `TypeError`. - Add unit tests ## How was this patch tested? New unit tests. Author: zero323. Closes #17891 from zero323/SPARK-20631. (cherry picked from commit 804949c6bf00b8e26c39d48bbcc4d0470ee84e47) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3ed2f4d5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3ed2f4d5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3ed2f4d5 Branch: refs/heads/branch-2.2 Commit: 3ed2f4d516ce02dfef929195778f8214703913d8 Parents: ef50a95 Author: zero323 Authored: Wed May 10 16:57:52 2017 +0800 Committer: Yanbo Liang Committed: Wed May 10 16:58:08 2017 +0800 -- python/pyspark/ml/classification.py | 6 +++--- python/pyspark/ml/tests.py | 12 2 files changed, 15 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3ed2f4d5/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index a9756ea..dcc12d9 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -349,13 +349,13 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti def _checkThresholdConsistency(self): if self.isSet(self.threshold) and self.isSet(self.thresholds): -ts = self.getParam(self.thresholds) +ts = self.getOrDefault(self.thresholds) if len(ts) != 2: raise ValueError("Logistic Regression getThreshold only applies to" + " binary classification, but thresholds has length != 2." + - " thresholds: " + ",".join(ts)) + " thresholds: {0}".format(str(ts))) t = 1.0/(1.0 + ts[0]/ts[1]) -t2 = self.getParam(self.threshold) +t2 = self.getOrDefault(self.threshold) if abs(t2 - t) >= 1E-5: raise ValueError("Logistic Regression getThreshold found inconsistent values for" + " threshold (%g) and thresholds (equivalent to %g)" % (t2, t)) http://git-wip-us.apache.org/repos/asf/spark/blob/3ed2f4d5/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index 571ac4b..51a3e8e 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -807,6 +807,18 @@ class PersistenceTest(SparkSessionTestCase): except OSError: pass +def logistic_regression_check_thresholds(self): +self.assertIsInstance( +LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]), +LogisticRegressionModel +) + +self.assertRaisesRegexp( +ValueError, +"Logistic Regression getThreshold found inconsistent.*$", +LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5] +) + def _compare_params(self, m1, m2, param): """ Compare 2 ML Params instances for the given param, and assert both have the same param value
spark git commit: [SPARK-20606][ML] ML 2.2 QA: Remove deprecated methods for ML
Repository: spark Updated Branches: refs/heads/branch-2.2 4bbfad44e -> 4b7aa0b1d [SPARK-20606][ML] ML 2.2 QA: Remove deprecated methods for ML ## What changes were proposed in this pull request? Remove ML methods we deprecated in 2.1. ## How was this patch tested? Existing tests. Author: Yanbo Liang. Closes #17867 from yanboliang/spark-20606. (cherry picked from commit b8733e0ad9f5a700f385e210450fd2c10137293e) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4b7aa0b1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4b7aa0b1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4b7aa0b1 Branch: refs/heads/branch-2.2 Commit: 4b7aa0b1dbd85e2238acba45e8f94c097358fb72 Parents: 4bbfad4 Author: Yanbo Liang Authored: Tue May 9 17:30:37 2017 +0800 Committer: Yanbo Liang Committed: Tue May 9 17:30:50 2017 +0800 -- .../classification/DecisionTreeClassifier.scala | 18 ++-- .../spark/ml/classification/GBTClassifier.scala | 24 ++--- .../classification/RandomForestClassifier.scala | 24 ++--- .../ml/regression/DecisionTreeRegressor.scala | 18 ++-- .../spark/ml/regression/GBTRegressor.scala | 24 ++--- .../ml/regression/RandomForestRegressor.scala | 24 ++--- .../org/apache/spark/ml/tree/treeParams.scala | 105 --- .../org/apache/spark/ml/util/ReadWrite.scala| 16 --- project/MimaExcludes.scala | 68 python/pyspark/ml/util.py | 32 -- 10 files changed, 134 insertions(+), 219 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4b7aa0b1/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala index 9f60f08..5fb105c 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala @@ -54,27 +54,27 @@ class DecisionTreeClassifier @Since("1.4.0") ( /** @group setParam */ @Since("1.4.0") - override def setMaxDepth(value: Int): this.type = set(maxDepth, value) + def setMaxDepth(value: Int): this.type = set(maxDepth, value) /** @group setParam */ @Since("1.4.0") - override def setMaxBins(value: Int): this.type = set(maxBins, value) + def setMaxBins(value: Int): this.type = set(maxBins, value) /** @group setParam */ @Since("1.4.0") - override def setMinInstancesPerNode(value: Int): this.type = set(minInstancesPerNode, value) + def setMinInstancesPerNode(value: Int): this.type = set(minInstancesPerNode, value) /** @group setParam */ @Since("1.4.0") - override def setMinInfoGain(value: Double): this.type = set(minInfoGain, value) + def setMinInfoGain(value: Double): this.type = set(minInfoGain, value) /** @group expertSetParam */ @Since("1.4.0") - override def setMaxMemoryInMB(value: Int): this.type = set(maxMemoryInMB, value) + def setMaxMemoryInMB(value: Int): this.type = set(maxMemoryInMB, value) /** @group expertSetParam */ @Since("1.4.0") - override def setCacheNodeIds(value: Boolean): this.type = set(cacheNodeIds, value) + def setCacheNodeIds(value: Boolean): this.type = set(cacheNodeIds, value) /** * Specifies how often to checkpoint the cached node IDs. 
@@ -86,15 +86,15 @@ class DecisionTreeClassifier @Since("1.4.0") ( * @group setParam */ @Since("1.4.0") - override def setCheckpointInterval(value: Int): this.type = set(checkpointInterval, value) + def setCheckpointInterval(value: Int): this.type = set(checkpointInterval, value) /** @group setParam */ @Since("1.4.0") - override def setImpurity(value: String): this.type = set(impurity, value) + def setImpurity(value: String): this.type = set(impurity, value) /** @group setParam */ @Since("1.6.0") - override def setSeed(value: Long): this.type = set(seed, value) + def setSeed(value: Long): this.type = set(seed, value) override protected def train(dataset: Dataset[_]): DecisionTreeClassificationModel = { val categoricalFeatures: Map[Int, Int] = http://git-wip-us.apache.org/repos/asf/spark/blob/4b7aa0b1/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
spark git commit: [SPARK-20606][ML] ML 2.2 QA: Remove deprecated methods for ML
Repository: spark Updated Branches: refs/heads/master be53a7835 -> b8733e0ad [SPARK-20606][ML] ML 2.2 QA: Remove deprecated methods for ML ## What changes were proposed in this pull request? Remove ML methods we deprecated in 2.1. ## How was this patch tested? Existing tests. Author: Yanbo Liang. Closes #17867 from yanboliang/spark-20606. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b8733e0a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b8733e0a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b8733e0a Branch: refs/heads/master Commit: b8733e0ad9f5a700f385e210450fd2c10137293e Parents: be53a78 Author: Yanbo Liang Authored: Tue May 9 17:30:37 2017 +0800 Committer: Yanbo Liang Committed: Tue May 9 17:30:37 2017 +0800 -- .../classification/DecisionTreeClassifier.scala | 18 ++-- .../spark/ml/classification/GBTClassifier.scala | 24 ++--- .../classification/RandomForestClassifier.scala | 24 ++--- .../ml/regression/DecisionTreeRegressor.scala | 18 ++-- .../spark/ml/regression/GBTRegressor.scala | 24 ++--- .../ml/regression/RandomForestRegressor.scala | 24 ++--- .../org/apache/spark/ml/tree/treeParams.scala | 105 --- .../org/apache/spark/ml/util/ReadWrite.scala| 16 --- project/MimaExcludes.scala | 68 python/pyspark/ml/util.py | 32 -- 10 files changed, 134 insertions(+), 219 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b8733e0a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala index 9f60f08..5fb105c 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala @@ -54,27 +54,27 @@ class DecisionTreeClassifier @Since("1.4.0") ( /** @group setParam */ @Since("1.4.0") - override def setMaxDepth(value: Int): this.type = set(maxDepth, value) + def setMaxDepth(value: Int): this.type = set(maxDepth, value) /** @group setParam */ @Since("1.4.0") - override def setMaxBins(value: Int): this.type = set(maxBins, value) + def setMaxBins(value: Int): this.type = set(maxBins, value) /** @group setParam */ @Since("1.4.0") - override def setMinInstancesPerNode(value: Int): this.type = set(minInstancesPerNode, value) + def setMinInstancesPerNode(value: Int): this.type = set(minInstancesPerNode, value) /** @group setParam */ @Since("1.4.0") - override def setMinInfoGain(value: Double): this.type = set(minInfoGain, value) + def setMinInfoGain(value: Double): this.type = set(minInfoGain, value) /** @group expertSetParam */ @Since("1.4.0") - override def setMaxMemoryInMB(value: Int): this.type = set(maxMemoryInMB, value) + def setMaxMemoryInMB(value: Int): this.type = set(maxMemoryInMB, value) /** @group expertSetParam */ @Since("1.4.0") - override def setCacheNodeIds(value: Boolean): this.type = set(cacheNodeIds, value) + def setCacheNodeIds(value: Boolean): this.type = set(cacheNodeIds, value) /** * Specifies how often to checkpoint the cached node IDs. 
@@ -86,15 +86,15 @@ class DecisionTreeClassifier @Since("1.4.0") ( * @group setParam */ @Since("1.4.0") - override def setCheckpointInterval(value: Int): this.type = set(checkpointInterval, value) + def setCheckpointInterval(value: Int): this.type = set(checkpointInterval, value) /** @group setParam */ @Since("1.4.0") - override def setImpurity(value: String): this.type = set(impurity, value) + def setImpurity(value: String): this.type = set(impurity, value) /** @group setParam */ @Since("1.6.0") - override def setSeed(value: Long): this.type = set(seed, value) + def setSeed(value: Long): this.type = set(seed, value) override protected def train(dataset: Dataset[_]): DecisionTreeClassificationModel = { val categoricalFeatures: Map[Int, Int] = http://git-wip-us.apache.org/repos/asf/spark/blob/b8733e0a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala index ade0960..263ed10 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala +++
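A note on what the `override` removals above mean in practice: the deprecated setter definitions lived in the shared tree params traits (see the large deletion in `org/apache/spark/ml/tree/treeParams.scala` in the file list), so once they are gone the concrete estimators define the setters themselves and the `override` modifier has to go. Caller code is unaffected, since each setter still returns `this.type` and chains as before. A minimal sketch (the column names are placeholder assumptions, not part of this commit):

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier

// Chained setters keep working exactly as before the cleanup; only the
// place where they are defined (estimator vs. deprecated trait) changed.
val dtc = new DecisionTreeClassifier()
  .setLabelCol("label")        // assumed column name
  .setFeaturesCol("features")  // assumed column name
  .setMaxDepth(5)
  .setMaxBins(32)
  .setImpurity("gini")
  .setSeed(42L)
```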
spark git commit: [SPARK-20574][ML] Allow Bucketizer to handle non-Double numeric column
Repository: spark Updated Branches: refs/heads/master bfc8c79c8 -> 0d16faab9 [SPARK-20574][ML] Allow Bucketizer to handle non-Double numeric column ## What changes were proposed in this pull request? Bucketizer currently requires the input column to be Double, but the logic should work on any numeric data type. Many practical problems have integer/float data types, and it could get very tedious to manually cast them into Double before calling Bucketizer. This PR extends Bucketizer to handle all numeric types. ## How was this patch tested? New test. Author: Wayne Zhang Closes #17840 from actuaryzhang/bucketizer. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0d16faab Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0d16faab Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0d16faab Branch: refs/heads/master Commit: 0d16faab90e4cd1f73c5b749dbda7bc2a400b26f Parents: bfc8c79 Author: Wayne Zhang Authored: Fri May 5 10:23:58 2017 +0800 Committer: Yanbo Liang Committed: Fri May 5 10:23:58 2017 +0800 -- .../apache/spark/ml/feature/Bucketizer.scala| 4 ++-- .../spark/ml/feature/BucketizerSuite.scala | 25 2 files changed, 27 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0d16faab/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala index d1f3b2a..bb8f2a3 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala @@ -116,7 +116,7 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String Bucketizer.binarySearchForBuckets($(splits), feature, keepInvalid) } -val newCol = bucketizer(filteredDataset($(inputCol))) +val newCol = bucketizer(filteredDataset($(inputCol)).cast(DoubleType)) val newField = prepOutputField(filteredDataset.schema) filteredDataset.withColumn($(outputCol), newCol, newField.metadata) } @@ -130,7 +130,7 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String @Since("1.4.0") override def transformSchema(schema: StructType): StructType = { -SchemaUtils.checkColumnType(schema, $(inputCol), DoubleType) +SchemaUtils.checkNumericType(schema, $(inputCol)) SchemaUtils.appendColumn(schema, prepOutputField(schema)) } http://git-wip-us.apache.org/repos/asf/spark/blob/0d16faab/mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala index aac2913..420fb17 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala @@ -26,6 +26,8 @@ import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils} import org.apache.spark.ml.util.TestingUtils._ import org.apache.spark.mllib.util.MLlibTestSparkContext import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { @@ -162,6 +164,29 @@ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa .setSplits(Array(0.1, 0.8, 0.9)) testDefaultReadWrite(t) } + + test("Bucket numeric
features") { +val splits = Array(-3.0, 0.0, 3.0) +val data = Array(-2.0, -1.0, 0.0, 1.0, 2.0) +val expectedBuckets = Array(0.0, 0.0, 1.0, 1.0, 1.0) +val dataFrame: DataFrame = data.zip(expectedBuckets).toSeq.toDF("feature", "expected") + +val bucketizer: Bucketizer = new Bucketizer() + .setInputCol("feature") + .setOutputCol("result") + .setSplits(splits) + +val types = Seq(ShortType, IntegerType, LongType, FloatType, DoubleType, + ByteType, DecimalType(10, 0)) +for (mType <- types) { + val df = dataFrame.withColumn("feature", col("feature").cast(mType)) + bucketizer.transform(df).select("result", "expected").collect().foreach { +case Row(x: Double, y: Double) => + assert(x === y, "The result is not correct after bucketing in type " + +mType.toString + ". " + s"Expected $y but found $x.") + } +} + } } private object BucketizerSuite extends
spark git commit: [SPARK-20574][ML] Allow Bucketizer to handle non-Double numeric column
Repository: spark Updated Branches: refs/heads/branch-2.2 425ed26d2 -> c8756288d [SPARK-20574][ML] Allow Bucketizer to handle non-Double numeric column ## What changes were proposed in this pull request? Bucketizer currently requires the input column to be Double, but the logic should work on any numeric data type. Many practical problems have integer/float data types, and it could get very tedious to manually cast them into Double before calling Bucketizer. This PR extends Bucketizer to handle all numeric types. ## How was this patch tested? New test. Author: Wayne Zhang Closes #17840 from actuaryzhang/bucketizer. (cherry picked from commit 0d16faab90e4cd1f73c5b749dbda7bc2a400b26f) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c8756288 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c8756288 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c8756288 Branch: refs/heads/branch-2.2 Commit: c8756288de12cfd9528d8d3ff73ff600909d657a Parents: 425ed26 Author: Wayne Zhang Authored: Fri May 5 10:23:58 2017 +0800 Committer: Yanbo Liang Committed: Fri May 5 10:24:12 2017 +0800 -- .../apache/spark/ml/feature/Bucketizer.scala| 4 ++-- .../spark/ml/feature/BucketizerSuite.scala | 25 2 files changed, 27 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c8756288/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala index d1f3b2a..bb8f2a3 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala @@ -116,7 +116,7 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String Bucketizer.binarySearchForBuckets($(splits), feature, keepInvalid) } -val newCol = bucketizer(filteredDataset($(inputCol))) +val newCol = bucketizer(filteredDataset($(inputCol)).cast(DoubleType)) val newField = prepOutputField(filteredDataset.schema) filteredDataset.withColumn($(outputCol), newCol, newField.metadata) } @@ -130,7 +130,7 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String @Since("1.4.0") override def transformSchema(schema: StructType): StructType = { -SchemaUtils.checkColumnType(schema, $(inputCol), DoubleType) +SchemaUtils.checkNumericType(schema, $(inputCol)) SchemaUtils.appendColumn(schema, prepOutputField(schema)) } http://git-wip-us.apache.org/repos/asf/spark/blob/c8756288/mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala index aac2913..420fb17 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala @@ -26,6 +26,8 @@ import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils} import org.apache.spark.ml.util.TestingUtils._ import org.apache.spark.mllib.util.MLlibTestSparkContext import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { @@ -162,6 +164,29 @@ class BucketizerSuite extends SparkFunSuite with
MLlibTestSparkContext with Defa .setSplits(Array(0.1, 0.8, 0.9)) testDefaultReadWrite(t) } + + test("Bucket numeric features") { +val splits = Array(-3.0, 0.0, 3.0) +val data = Array(-2.0, -1.0, 0.0, 1.0, 2.0) +val expectedBuckets = Array(0.0, 0.0, 1.0, 1.0, 1.0) +val dataFrame: DataFrame = data.zip(expectedBuckets).toSeq.toDF("feature", "expected") + +val bucketizer: Bucketizer = new Bucketizer() + .setInputCol("feature") + .setOutputCol("result") + .setSplits(splits) + +val types = Seq(ShortType, IntegerType, LongType, FloatType, DoubleType, + ByteType, DecimalType(10, 0)) +for (mType <- types) { + val df = dataFrame.withColumn("feature", col("feature").cast(mType)) + bucketizer.transform(df).select("result", "expected").collect().foreach { +case Row(x: Double, y: Double) => + assert(x === y, "The result is not correct after bucketing in type " + +
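A hedged sketch of what the patch enables, mirroring the expectations in the new test above (assumes a local `SparkSession` named `spark`; the app and column names are illustrative): an integer input column can now be bucketized directly, because `transformSchema` accepts any numeric type and the value is cast to `DoubleType` inside `transform`.

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("bucketizer-sketch").getOrCreate()
import spark.implicits._

// IntegerType column; previously this required a manual cast to Double.
val df = Seq(-2, -1, 0, 1, 2).toDF("feature")

val bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(Array(-3.0, 0.0, 3.0))

// Negative values fall into bucket 0.0, the rest into bucket 1.0,
// matching the expected buckets in the new test.
bucketizer.transform(df).show()
```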
spark git commit: [SPARK-20047][FOLLOWUP][ML] Constrained Logistic Regression follow up
Repository: spark Updated Branches: refs/heads/branch-2.2 b6727795f -> 425ed26d2 [SPARK-20047][FOLLOWUP][ML] Constrained Logistic Regression follow up ## What changes were proposed in this pull request? Address some minor comments for #17715: * Put bound-constrained optimization params under expertParams. * Update some docs. ## How was this patch tested? Existing tests. Author: Yanbo LiangCloses #17829 from yanboliang/spark-20047-followup. (cherry picked from commit c5dceb8c65545169bc96628140b5acdaa85dd226) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/425ed26d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/425ed26d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/425ed26d Branch: refs/heads/branch-2.2 Commit: 425ed26d2a0f6d3308bdb4fcbf0cedc6ef12612e Parents: b672779 Author: Yanbo Liang Authored: Thu May 4 17:56:43 2017 +0800 Committer: Yanbo Liang Committed: Thu May 4 17:57:08 2017 +0800 -- .../ml/classification/LogisticRegression.scala | 54 +--- 1 file changed, 35 insertions(+), 19 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/425ed26d/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala index d7dde32..42dc7fb 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala @@ -183,14 +183,15 @@ private[classification] trait LogisticRegressionParams extends ProbabilisticClas * The bound matrix must be compatible with the shape (1, number of features) for binomial * regression, or (number of classes, number of features) for multinomial regression. * Otherwise, it throws exception. + * Default is none. * - * @group param + * @group expertParam */ @Since("2.2.0") val lowerBoundsOnCoefficients: Param[Matrix] = new Param(this, "lowerBoundsOnCoefficients", "The lower bounds on coefficients if fitting under bound constrained optimization.") - /** @group getParam */ + /** @group expertGetParam */ @Since("2.2.0") def getLowerBoundsOnCoefficients: Matrix = $(lowerBoundsOnCoefficients) @@ -199,14 +200,15 @@ private[classification] trait LogisticRegressionParams extends ProbabilisticClas * The bound matrix must be compatible with the shape (1, number of features) for binomial * regression, or (number of classes, number of features) for multinomial regression. * Otherwise, it throws exception. + * Default is none. * - * @group param + * @group expertParam */ @Since("2.2.0") val upperBoundsOnCoefficients: Param[Matrix] = new Param(this, "upperBoundsOnCoefficients", "The upper bounds on coefficients if fitting under bound constrained optimization.") - /** @group getParam */ + /** @group expertGetParam */ @Since("2.2.0") def getUpperBoundsOnCoefficients: Matrix = $(upperBoundsOnCoefficients) @@ -214,14 +216,15 @@ private[classification] trait LogisticRegressionParams extends ProbabilisticClas * The lower bounds on intercepts if fitting under bound constrained optimization. * The bounds vector size must be equal with 1 for binomial regression, or the number * of classes for multinomial regression. Otherwise, it throws exception. + * Default is none. 
* - * @group param + * @group expertParam */ @Since("2.2.0") val lowerBoundsOnIntercepts: Param[Vector] = new Param(this, "lowerBoundsOnIntercepts", "The lower bounds on intercepts if fitting under bound constrained optimization.") - /** @group getParam */ + /** @group expertGetParam */ @Since("2.2.0") def getLowerBoundsOnIntercepts: Vector = $(lowerBoundsOnIntercepts) @@ -229,14 +232,15 @@ private[classification] trait LogisticRegressionParams extends ProbabilisticClas * The upper bounds on intercepts if fitting under bound constrained optimization. * The bound vector size must be equal with 1 for binomial regression, or the number * of classes for multinomial regression. Otherwise, it throws exception. + * Default is none. * - * @group param + * @group expertParam */ @Since("2.2.0") val upperBoundsOnIntercepts: Param[Vector] = new Param(this, "upperBoundsOnIntercepts", "The upper bounds on intercepts if fitting under bound constrained optimization.") - /** @group getParam */ + /** @group
spark git commit: [SPARK-20047][FOLLOWUP][ML] Constrained Logistic Regression follow up
Repository: spark Updated Branches: refs/heads/master 57b64703e -> c5dceb8c6 [SPARK-20047][FOLLOWUP][ML] Constrained Logistic Regression follow up ## What changes were proposed in this pull request? Address some minor comments for #17715: * Put bound-constrained optimization params under expertParams. * Update some docs. ## How was this patch tested? Existing tests. Author: Yanbo LiangCloses #17829 from yanboliang/spark-20047-followup. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c5dceb8c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c5dceb8c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c5dceb8c Branch: refs/heads/master Commit: c5dceb8c65545169bc96628140b5acdaa85dd226 Parents: 57b6470 Author: Yanbo Liang Authored: Thu May 4 17:56:43 2017 +0800 Committer: Yanbo Liang Committed: Thu May 4 17:56:43 2017 +0800 -- .../ml/classification/LogisticRegression.scala | 54 +--- 1 file changed, 35 insertions(+), 19 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c5dceb8c/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala index d7dde32..42dc7fb 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala @@ -183,14 +183,15 @@ private[classification] trait LogisticRegressionParams extends ProbabilisticClas * The bound matrix must be compatible with the shape (1, number of features) for binomial * regression, or (number of classes, number of features) for multinomial regression. * Otherwise, it throws exception. + * Default is none. * - * @group param + * @group expertParam */ @Since("2.2.0") val lowerBoundsOnCoefficients: Param[Matrix] = new Param(this, "lowerBoundsOnCoefficients", "The lower bounds on coefficients if fitting under bound constrained optimization.") - /** @group getParam */ + /** @group expertGetParam */ @Since("2.2.0") def getLowerBoundsOnCoefficients: Matrix = $(lowerBoundsOnCoefficients) @@ -199,14 +200,15 @@ private[classification] trait LogisticRegressionParams extends ProbabilisticClas * The bound matrix must be compatible with the shape (1, number of features) for binomial * regression, or (number of classes, number of features) for multinomial regression. * Otherwise, it throws exception. + * Default is none. * - * @group param + * @group expertParam */ @Since("2.2.0") val upperBoundsOnCoefficients: Param[Matrix] = new Param(this, "upperBoundsOnCoefficients", "The upper bounds on coefficients if fitting under bound constrained optimization.") - /** @group getParam */ + /** @group expertGetParam */ @Since("2.2.0") def getUpperBoundsOnCoefficients: Matrix = $(upperBoundsOnCoefficients) @@ -214,14 +216,15 @@ private[classification] trait LogisticRegressionParams extends ProbabilisticClas * The lower bounds on intercepts if fitting under bound constrained optimization. * The bounds vector size must be equal with 1 for binomial regression, or the number * of classes for multinomial regression. Otherwise, it throws exception. + * Default is none. 
* - * @group param + * @group expertParam */ @Since("2.2.0") val lowerBoundsOnIntercepts: Param[Vector] = new Param(this, "lowerBoundsOnIntercepts", "The lower bounds on intercepts if fitting under bound constrained optimization.") - /** @group getParam */ + /** @group expertGetParam */ @Since("2.2.0") def getLowerBoundsOnIntercepts: Vector = $(lowerBoundsOnIntercepts) @@ -229,14 +232,15 @@ private[classification] trait LogisticRegressionParams extends ProbabilisticClas * The upper bounds on intercepts if fitting under bound constrained optimization. * The bound vector size must be equal with 1 for binomial regression, or the number * of classes for multinomial regression. Otherwise, it throws exception. + * Default is none. * - * @group param + * @group expertParam */ @Since("2.2.0") val upperBoundsOnIntercepts: Param[Vector] = new Param(this, "upperBoundsOnIntercepts", "The upper bounds on intercepts if fitting under bound constrained optimization.") - /** @group getParam */ + /** @group expertGetParam */ @Since("2.2.0") def getUpperBoundsOnIntercepts: Vector = $(upperBoundsOnIntercepts) @@ -256,7 +260,7 @@
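For context, a hedged sketch of how the expert params documented above are used (the training DataFrame `training` and its three features are assumptions, not part of this commit): for binomial regression the coefficient bound matrix must have shape (1, number of features) and the intercept bound vector must have length 1, as the param docs state.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Matrices, Vectors}

// Constrain a binomial model to non-negative coefficients and intercept.
val lr = new LogisticRegression()
  .setFitIntercept(true)
  .setLowerBoundsOnCoefficients(Matrices.dense(1, 3, Array(0.0, 0.0, 0.0)))
  .setLowerBoundsOnIntercepts(Vectors.dense(0.0))
// val model = lr.fit(training)  // `training` is a placeholder DataFrame
```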
spark git commit: [MINOR][ML] Fix some PySpark & SparkR flaky tests
Repository: spark Updated Branches: refs/heads/branch-2.2 612952251 -> 34dec68d7 [MINOR][ML] Fix some PySpark & SparkR flaky tests ## What changes were proposed in this pull request? Some PySpark & SparkR tests run with a tiny dataset and a tiny ```maxIter```, which means they have not converged. I don't think checking intermediate results during iteration makes sense, and these intermediate results may be fragile and unstable, so we should switch to checking the converged results. We hit this issue at #17746 when we upgraded breeze to 0.13.1. ## How was this patch tested? Existing tests. Author: Yanbo Liang Closes #17757 from yanboliang/flaky-test. (cherry picked from commit dbb06c689c157502cb081421baecce411832aad8) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/34dec68d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/34dec68d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/34dec68d Branch: refs/heads/branch-2.2 Commit: 34dec68d7eb647d997fdb27fe65d579c74b39e58 Parents: 6129522 Author: Yanbo Liang Authored: Wed Apr 26 21:34:18 2017 +0800 Committer: Yanbo Liang Committed: Wed Apr 26 21:34:35 2017 +0800 -- .../tests/testthat/test_mllib_classification.R | 17 + python/pyspark/ml/classification.py | 71 ++-- 2 files changed, 38 insertions(+), 50 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/34dec68d/R/pkg/inst/tests/testthat/test_mllib_classification.R -- diff --git a/R/pkg/inst/tests/testthat/test_mllib_classification.R b/R/pkg/inst/tests/testthat/test_mllib_classification.R index af7cbdc..cbc7087 100644 --- a/R/pkg/inst/tests/testthat/test_mllib_classification.R +++ b/R/pkg/inst/tests/testthat/test_mllib_classification.R @@ -284,22 +284,11 @@ test_that("spark.mlp", { c("1.0", "1.0", "1.0", "1.0", "0.0", "1.0", "2.0", "2.0", "1.0", "0.0")) # test initialWeights - model <- spark.mlp(df, label ~ features, layers = c(4, 3), maxIter = 2, initialWeights = + model <- spark.mlp(df, label ~ features, layers = c(4, 3), initialWeights = c(0, 0, 0, 0, 0, 5, 5, 5, 5, 5, 9, 9, 9, 9, 9)) mlpPredictions <- collect(select(predict(model, mlpTestDF), "prediction")) expect_equal(head(mlpPredictions$prediction, 10), - c("1.0", "1.0", "2.0", "1.0", "2.0", "1.0", "2.0", "2.0", "1.0", "0.0")) - - model <- spark.mlp(df, label ~ features, layers = c(4, 3), maxIter = 2, initialWeights = -c(0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0, 5.0, 9.0, 9.0, 9.0, 9.0, 9.0)) - mlpPredictions <- collect(select(predict(model, mlpTestDF), "prediction")) - expect_equal(head(mlpPredictions$prediction, 10), - c("1.0", "1.0", "2.0", "1.0", "2.0", "1.0", "2.0", "2.0", "1.0", "0.0")) - - model <- spark.mlp(df, label ~ features, layers = c(4, 3), maxIter = 2) - mlpPredictions <- collect(select(predict(model, mlpTestDF), "prediction")) - expect_equal(head(mlpPredictions$prediction, 10), - c("1.0", "1.0", "1.0", "1.0", "0.0", "1.0", "0.0", "0.0", "1.0", "0.0")) + c("1.0", "1.0", "1.0", "1.0", "0.0", "1.0", "2.0", "2.0", "1.0", "0.0")) # Test formula works well df <- suppressWarnings(createDataFrame(iris)) @@ -310,8 +299,6 @@ test_that("spark.mlp", { expect_equal(summary$numOfOutputs, 3) expect_equal(summary$layers, c(4, 3)) expect_equal(length(summary$weights), 15) - expect_equal(head(summary$weights, 5), list(-0.5793153, -4.652961, 6.216155, -6.649478, - -10.51147), tolerance = 1e-3) }) test_that("spark.naiveBayes", { http://git-wip-us.apache.org/repos/asf/spark/blob/34dec68d/python/pyspark/ml/classification.py -- diff
--git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index 8649683..a9756ea 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -185,34 +185,33 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti >>> from pyspark.sql import Row >>> from pyspark.ml.linalg import Vectors >>> bdf = sc.parallelize([ -... Row(label=1.0, weight=2.0, features=Vectors.dense(1.0)), -... Row(label=0.0, weight=2.0, features=Vectors.sparse(1, [], []))]).toDF() ->>> blor = LogisticRegression(maxIter=5, regParam=0.01, weightCol="weight") +... Row(label=1.0, weight=1.0, features=Vectors.dense(0.0, 5.0)), +... Row(label=0.0, weight=2.0, features=Vectors.dense(1.0, 2.0)), +... Row(label=1.0, weight=3.0, features=Vectors.dense(2.0, 1.0)), +
spark git commit: [MINOR][ML] Fix some PySpark & SparkR flaky tests
Repository: spark Updated Branches: refs/heads/master 7fecf5130 -> dbb06c689 [MINOR][ML] Fix some PySpark & SparkR flaky tests ## What changes were proposed in this pull request? Some PySpark & SparkR tests run with a tiny dataset and a tiny ```maxIter```, which means they have not converged. I don't think checking intermediate results during iteration makes sense, and these intermediate results may be fragile and unstable, so we should switch to checking the converged results. We hit this issue at #17746 when we upgraded breeze to 0.13.1. ## How was this patch tested? Existing tests. Author: Yanbo Liang Closes #17757 from yanboliang/flaky-test. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dbb06c68 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dbb06c68 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dbb06c68 Branch: refs/heads/master Commit: dbb06c689c157502cb081421baecce411832aad8 Parents: 7fecf51 Author: Yanbo Liang Authored: Wed Apr 26 21:34:18 2017 +0800 Committer: Yanbo Liang Committed: Wed Apr 26 21:34:18 2017 +0800 -- .../tests/testthat/test_mllib_classification.R | 17 + python/pyspark/ml/classification.py | 71 ++-- 2 files changed, 38 insertions(+), 50 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dbb06c68/R/pkg/inst/tests/testthat/test_mllib_classification.R -- diff --git a/R/pkg/inst/tests/testthat/test_mllib_classification.R b/R/pkg/inst/tests/testthat/test_mllib_classification.R index af7cbdc..cbc7087 100644 --- a/R/pkg/inst/tests/testthat/test_mllib_classification.R +++ b/R/pkg/inst/tests/testthat/test_mllib_classification.R @@ -284,22 +284,11 @@ test_that("spark.mlp", { c("1.0", "1.0", "1.0", "1.0", "0.0", "1.0", "2.0", "2.0", "1.0", "0.0")) # test initialWeights - model <- spark.mlp(df, label ~ features, layers = c(4, 3), maxIter = 2, initialWeights = + model <- spark.mlp(df, label ~ features, layers = c(4, 3), initialWeights = c(0, 0, 0, 0, 0, 5, 5, 5, 5, 5, 9, 9, 9, 9, 9)) mlpPredictions <- collect(select(predict(model, mlpTestDF), "prediction")) expect_equal(head(mlpPredictions$prediction, 10), - c("1.0", "1.0", "2.0", "1.0", "2.0", "1.0", "2.0", "2.0", "1.0", "0.0")) - - model <- spark.mlp(df, label ~ features, layers = c(4, 3), maxIter = 2, initialWeights = -c(0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0, 5.0, 9.0, 9.0, 9.0, 9.0, 9.0)) - mlpPredictions <- collect(select(predict(model, mlpTestDF), "prediction")) - expect_equal(head(mlpPredictions$prediction, 10), - c("1.0", "1.0", "2.0", "1.0", "2.0", "1.0", "2.0", "2.0", "1.0", "0.0")) - - model <- spark.mlp(df, label ~ features, layers = c(4, 3), maxIter = 2) - mlpPredictions <- collect(select(predict(model, mlpTestDF), "prediction")) - expect_equal(head(mlpPredictions$prediction, 10), - c("1.0", "1.0", "1.0", "1.0", "0.0", "1.0", "0.0", "0.0", "1.0", "0.0")) + c("1.0", "1.0", "1.0", "1.0", "0.0", "1.0", "2.0", "2.0", "1.0", "0.0")) # Test formula works well df <- suppressWarnings(createDataFrame(iris)) @@ -310,8 +299,6 @@ test_that("spark.mlp", { expect_equal(summary$numOfOutputs, 3) expect_equal(summary$layers, c(4, 3)) expect_equal(length(summary$weights), 15) - expect_equal(head(summary$weights, 5), list(-0.5793153, -4.652961, 6.216155, -6.649478, - -10.51147), tolerance = 1e-3) }) test_that("spark.naiveBayes", { http://git-wip-us.apache.org/repos/asf/spark/blob/dbb06c68/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index 8649683..a9756ea
100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -185,34 +185,33 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti >>> from pyspark.sql import Row >>> from pyspark.ml.linalg import Vectors >>> bdf = sc.parallelize([ -... Row(label=1.0, weight=2.0, features=Vectors.dense(1.0)), -... Row(label=0.0, weight=2.0, features=Vectors.sparse(1, [], []))]).toDF() ->>> blor = LogisticRegression(maxIter=5, regParam=0.01, weightCol="weight") +... Row(label=1.0, weight=1.0, features=Vectors.dense(0.0, 5.0)), +... Row(label=0.0, weight=2.0, features=Vectors.dense(1.0, 2.0)), +... Row(label=1.0, weight=3.0, features=Vectors.dense(2.0, 1.0)), +... Row(label=0.0, weight=4.0, features=Vectors.dense(3.0, 3.0))]).toDF() +>>> blor =
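The principle behind these test changes, sketched in Scala (the DataFrame `df` is a placeholder): assert on a converged model rather than on coefficients captured after a couple of iterations, since only the converged solution is stable across numerics upgrades such as the breeze 0.13.1 bump.

```scala
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setMaxIter(100)  // generous cap so the tolerance, not the cap, stops training
  .setTol(1e-6)
// val model = lr.fit(df)                       // `df` is an assumed dataset
// assert(model.summary.totalIterations < 100)  // i.e. training actually converged
// ... assertions on model.coefficients are now stable across releases ...
```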
spark git commit: [SPARK-18901][FOLLOWUP][ML] Require in LR LogisticAggregator is redundant
Repository: spark Updated Branches: refs/heads/branch-2.2 b62ebd91b -> e2591c6d7 [SPARK-18901][FOLLOWUP][ML] Require in LR LogisticAggregator is redundant ## What changes were proposed in this pull request? This is a follow-up PR of #17478. ## How was this patch tested? Existing tests Author: wangmiao1981Closes #17754 from wangmiao1981/followup. (cherry picked from commit 387565cf14b490810f9479ff3adbf776e2edecdc) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e2591c6d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e2591c6d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e2591c6d Branch: refs/heads/branch-2.2 Commit: e2591c6d74081e9edad2e8982c0125a4f1d21437 Parents: b62ebd9 Author: wangmiao1981 Authored: Tue Apr 25 16:30:36 2017 +0800 Committer: Yanbo Liang Committed: Tue Apr 25 16:30:53 2017 +0800 -- .../scala/org/apache/spark/ml/classification/LinearSVC.scala| 5 ++--- .../scala/org/apache/spark/ml/regression/LinearRegression.scala | 5 - 2 files changed, 2 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e2591c6d/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala index f76b14e..7507c75 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala @@ -458,9 +458,7 @@ private class LinearSVCAggregator( */ def add(instance: Instance): this.type = { instance match { case Instance(label, weight, features) => - require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0") - require(numFeatures == features.size, s"Dimensions mismatch when adding new instance." + -s" Expecting $numFeatures but got ${features.size}.") + if (weight == 0.0) return this val localFeaturesStd = bcFeaturesStd.value val localCoefficients = coefficientsArray @@ -512,6 +510,7 @@ private class LinearSVCAggregator( * @return This LinearSVCAggregator object. */ def merge(other: LinearSVCAggregator): this.type = { + if (other.weightSum != 0.0) { weightSum += other.weightSum lossSum += other.lossSum http://git-wip-us.apache.org/repos/asf/spark/blob/e2591c6d/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala index f7e3c8f..eaad549 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala @@ -971,9 +971,6 @@ private class LeastSquaresAggregator( */ def add(instance: Instance): this.type = { instance match { case Instance(label, weight, features) => - require(dim == features.size, s"Dimensions mismatch when adding new sample." + -s" Expecting $dim but got ${features.size}.") - require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0") if (weight == 0.0) return this @@ -1005,8 +1002,6 @@ private class LeastSquaresAggregator( * @return This LeastSquaresAggregator object. */ def merge(other: LeastSquaresAggregator): this.type = { -require(dim == other.dim, s"Dimensions mismatch when merging with another " + - s"LeastSquaresAggregator. 
Expecting $dim but got ${other.dim}.") if (other.weightSum != 0) { totalCnt += other.totalCnt - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-18901][FOLLOWUP][ML] Require in LR LogisticAggregator is redundant
Repository: spark Updated Branches: refs/heads/master 0bc7a9021 -> 387565cf1 [SPARK-18901][FOLLOWUP][ML] Require in LR LogisticAggregator is redundant ## What changes were proposed in this pull request? This is a follow-up PR of #17478. ## How was this patch tested? Existing tests Author: wangmiao1981Closes #17754 from wangmiao1981/followup. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/387565cf Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/387565cf Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/387565cf Branch: refs/heads/master Commit: 387565cf14b490810f9479ff3adbf776e2edecdc Parents: 0bc7a90 Author: wangmiao1981 Authored: Tue Apr 25 16:30:36 2017 +0800 Committer: Yanbo Liang Committed: Tue Apr 25 16:30:36 2017 +0800 -- .../scala/org/apache/spark/ml/classification/LinearSVC.scala| 5 ++--- .../scala/org/apache/spark/ml/regression/LinearRegression.scala | 5 - 2 files changed, 2 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/387565cf/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala index f76b14e..7507c75 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala @@ -458,9 +458,7 @@ private class LinearSVCAggregator( */ def add(instance: Instance): this.type = { instance match { case Instance(label, weight, features) => - require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0") - require(numFeatures == features.size, s"Dimensions mismatch when adding new instance." + -s" Expecting $numFeatures but got ${features.size}.") + if (weight == 0.0) return this val localFeaturesStd = bcFeaturesStd.value val localCoefficients = coefficientsArray @@ -512,6 +510,7 @@ private class LinearSVCAggregator( * @return This LinearSVCAggregator object. */ def merge(other: LinearSVCAggregator): this.type = { + if (other.weightSum != 0.0) { weightSum += other.weightSum lossSum += other.lossSum http://git-wip-us.apache.org/repos/asf/spark/blob/387565cf/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala index f7e3c8f..eaad549 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala @@ -971,9 +971,6 @@ private class LeastSquaresAggregator( */ def add(instance: Instance): this.type = { instance match { case Instance(label, weight, features) => - require(dim == features.size, s"Dimensions mismatch when adding new sample." + -s" Expecting $dim but got ${features.size}.") - require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0") if (weight == 0.0) return this @@ -1005,8 +1002,6 @@ private class LeastSquaresAggregator( * @return This LeastSquaresAggregator object. */ def merge(other: LeastSquaresAggregator): this.type = { -require(dim == other.dim, s"Dimensions mismatch when merging with another " + - s"LeastSquaresAggregator. 
Expecting $dim but got ${other.dim}.") if (other.weightSum != 0) { totalCnt += other.totalCnt - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-18901][ML] Require in LR LogisticAggregator is redundant
Repository: spark Updated Branches: refs/heads/branch-2.2 2bef01f64 -> cf16c3250 [SPARK-18901][ML] Require in LR LogisticAggregator is redundant ## What changes were proposed in this pull request? In MultivariateOnlineSummarizer, `add` and `merge` have checks for weights and feature sizes. The corresponding checks in LR are redundant and are removed in this PR. ## How was this patch tested? Existing tests. Author: wm...@hotmail.com Closes #17478 from wangmiao1981/logit. (cherry picked from commit 90264aced7cfdf265636517b91e5d1324fe60112) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cf16c325 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cf16c325 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cf16c325 Branch: refs/heads/branch-2.2 Commit: cf16c3250e946c4f89edc999d8764e8fa3dfb056 Parents: 2bef01f Author: wm...@hotmail.com Authored: Mon Apr 24 23:43:06 2017 +0800 Committer: Yanbo Liang Committed: Mon Apr 24 23:43:23 2017 +0800 -- .../org/apache/spark/ml/classification/LogisticRegression.scala | 5 - 1 file changed, 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cf16c325/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala index bc81546..44b3478 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala @@ -1571,9 +1571,6 @@ private class LogisticAggregator( */ def add(instance: Instance): this.type = { instance match { case Instance(label, weight, features) => - require(numFeatures == features.size, s"Dimensions mismatch when adding new instance." + -s" Expecting $numFeatures but got ${features.size}.") - require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0") if (weight == 0.0) return this @@ -1596,8 +1593,6 @@ private class LogisticAggregator( * @return This LogisticAggregator object. */ def merge(other: LogisticAggregator): this.type = { -require(numFeatures == other.numFeatures, s"Dimensions mismatch when merging with another " + - s"LogisticAggregator. Expecting $numFeatures but got ${other.numFeatures}.") if (other.weightSum != 0.0) { weightSum += other.weightSum - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-18901][ML] Require in LR LogisticAggregator is redundant
Repository: spark Updated Branches: refs/heads/master 776a2c0e9 -> 90264aced [SPARK-18901][ML] Require in LR LogisticAggregator is redundant ## What changes were proposed in this pull request? In MultivariateOnlineSummarizer, `add` and `merge` have checks for weights and feature sizes. The corresponding checks in LR are redundant and are removed in this PR. ## How was this patch tested? Existing tests. Author: wm...@hotmail.com Closes #17478 from wangmiao1981/logit. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/90264ace Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/90264ace Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/90264ace Branch: refs/heads/master Commit: 90264aced7cfdf265636517b91e5d1324fe60112 Parents: 776a2c0 Author: wm...@hotmail.com Authored: Mon Apr 24 23:43:06 2017 +0800 Committer: Yanbo Liang Committed: Mon Apr 24 23:43:06 2017 +0800 -- .../org/apache/spark/ml/classification/LogisticRegression.scala | 5 - 1 file changed, 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/90264ace/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala index bc81546..44b3478 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala @@ -1571,9 +1571,6 @@ private class LogisticAggregator( */ def add(instance: Instance): this.type = { instance match { case Instance(label, weight, features) => - require(numFeatures == features.size, s"Dimensions mismatch when adding new instance." + -s" Expecting $numFeatures but got ${features.size}.") - require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0") if (weight == 0.0) return this @@ -1596,8 +1593,6 @@ private class LogisticAggregator( * @return This LogisticAggregator object. */ def merge(other: LogisticAggregator): this.type = { -require(numFeatures == other.numFeatures, s"Dimensions mismatch when merging with another " + - s"LogisticAggregator. Expecting $numFeatures but got ${other.numFeatures}.") if (other.weightSum != 0.0) { weightSum += other.weightSum - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
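To make the reasoning concrete, an illustrative skeleton (not Spark's actual class) of the add/merge shape left behind once the redundant requires are gone: per the PR description, `MultivariateOnlineSummarizer` already rejects negative weights and mismatched dimensions upstream, so the per-instance hot path keeps only the zero-weight short-circuit, and `merge` skips empty partitions.

```scala
import org.apache.spark.ml.linalg.Vector

// Local stand-in for Spark's private[ml] Instance case class.
case class Instance(label: Double, weight: Double, features: Vector)

class SketchAggregator extends Serializable {
  private var weightSum = 0.0
  private var lossSum = 0.0

  def add(instance: Instance): this.type = instance match {
    case Instance(label, weight, features) =>
      if (weight == 0.0) return this  // only cheap check left on the hot path
      // ... accumulate gradient and loss from (label, features) here ...
      weightSum += weight
      this
  }

  def merge(other: SketchAggregator): this.type = {
    if (other.weightSum != 0.0) {  // partitions that saw no data contribute nothing
      weightSum += other.weightSum
      lossSum += other.lossSum
    }
    this
  }
}
```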
spark git commit: [MINOR][SPARKR] Move 'Data type mapping between R and Spark' to right place in SparkR doc.
Repository: spark Updated Branches: refs/heads/master 3fada2f50 -> 1d00761b9 [MINOR][SPARKR] Move 'Data type mapping between R and Spark' to right place in SparkR doc. The section ```Data type mapping between R and Spark``` is currently in the wrong place in the SparkR doc; this change moves it to a separate section. ## What changes were proposed in this pull request? Before this PR: ![image](https://cloud.githubusercontent.com/assets/1962026/24340911/bc01a532-126a-11e7-9a08-0d60d13a547c.png) After this PR: ![image](https://cloud.githubusercontent.com/assets/1962026/24340938/d9d32a9a-126a-11e7-8891-d2f5b46e0c71.png) Author: Yanbo Liang Closes #17440 from yanboliang/sparkr-doc. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1d00761b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1d00761b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1d00761b Branch: refs/heads/master Commit: 1d00761b9176a1f42976057ca78638c5b0763abc Parents: 3fada2f Author: Yanbo Liang Authored: Mon Mar 27 17:37:24 2017 -0700 Committer: Yanbo Liang Committed: Mon Mar 27 17:37:24 2017 -0700 -- docs/sparkr.md | 138 ++-- 1 file changed, 69 insertions(+), 69 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1d00761b/docs/sparkr.md -- diff --git a/docs/sparkr.md b/docs/sparkr.md index d7ffd9b..a1a35a7 100644 --- a/docs/sparkr.md +++ b/docs/sparkr.md @@ -394,75 +394,6 @@ head(result[order(result$max_eruption, decreasing = TRUE), ]) {% endhighlight %} - Data type mapping between R and Spark - -RSpark - - byte - byte - - - integer - integer - - - float - float - - - double - double - - - numeric - double - - - character - string - - - string - string - - - binary - binary - - - raw - binary - - - logical - boolean - - - https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html;>POSIXct - timestamp - - - https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html;>POSIXlt - timestamp - - - https://stat.ethz.ch/R-manual/R-devel/library/base/html/Dates.html;>Date - date - - - array - array - - - list - array - - - env - map - - - Run local R functions distributed using `spark.lapply` # spark.lapply @@ -557,6 +488,75 @@ SparkR supports a subset of the available R formula operators for model fitting, The following example shows how to save/load a MLlib model by SparkR. {% include_example read_write r/ml/ml.R %} +# Data type mapping between R and Spark +RSpark + + byte + byte + + + integer + integer + + + float + float + + + double + double + + + numeric + double + + + character + string + + + string + string + + + binary + binary + + + raw + binary + + + logical + boolean + + + https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html;>POSIXct + timestamp + + + https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html;>POSIXlt + timestamp + + + https://stat.ethz.ch/R-manual/R-devel/library/base/html/Dates.html;>Date + date + + + array + array + + + list + array + + + env + map + + + # R Function Name Conflicts When loading and attaching a new package in R, it is possible to have a name [conflict](https://stat.ethz.ch/R-manual/R-devel/library/base/html/library.html), where a - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-19925][SPARKR] Fix SparkR spark.getSparkFiles fails when it was called on executors.
Repository: spark Updated Branches: refs/heads/branch-2.1 c4d2b8338 -> 277ed375b [SPARK-19925][SPARKR] Fix SparkR spark.getSparkFiles fails when it was called on executors. ## What changes were proposed in this pull request? SparkR ```spark.getSparkFiles``` fails when called on executors; see details at [SPARK-19925](https://issues.apache.org/jira/browse/SPARK-19925). ## How was this patch tested? Add unit tests, and verify this fix on standalone and YARN clusters. Author: Yanbo Liang Closes #17274 from yanboliang/spark-19925. (cherry picked from commit 478fbc866fbfdb4439788583281863ecea14e8af) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/277ed375 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/277ed375 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/277ed375 Branch: refs/heads/branch-2.1 Commit: 277ed375b0af3e8fe2a8b9dee62997dcf16d5872 Parents: c4d2b83 Author: Yanbo Liang Authored: Tue Mar 21 21:50:54 2017 -0700 Committer: Yanbo Liang Committed: Tue Mar 21 22:12:55 2017 -0700 -- R/pkg/R/context.R | 16 ++-- R/pkg/inst/tests/testthat/test_context.R| 7 +++ .../main/scala/org/apache/spark/api/r/RRunner.scala | 2 ++ 3 files changed, 23 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/277ed375/R/pkg/R/context.R -- diff --git a/R/pkg/R/context.R b/R/pkg/R/context.R index 1a0dd65..634bdcb 100644 --- a/R/pkg/R/context.R +++ b/R/pkg/R/context.R @@ -330,7 +330,13 @@ spark.addFile <- function(path, recursive = FALSE) { #'} #' @note spark.getSparkFilesRootDirectory since 2.1.0 spark.getSparkFilesRootDirectory <- function() { - callJStatic("org.apache.spark.SparkFiles", "getRootDirectory") + if (Sys.getenv("SPARKR_IS_RUNNING_ON_WORKER") == "") { +# Running on driver. +callJStatic("org.apache.spark.SparkFiles", "getRootDirectory") + } else { +# Running on worker. +Sys.getenv("SPARKR_SPARKFILES_ROOT_DIR") + } } #' Get the absolute path of a file added through spark.addFile. @@ -345,7 +351,13 @@ spark.getSparkFilesRootDirectory <- function() { #'} #' @note spark.getSparkFiles since 2.1.0 spark.getSparkFiles <- function(fileName) { - callJStatic("org.apache.spark.SparkFiles", "get", as.character(fileName)) + if (Sys.getenv("SPARKR_IS_RUNNING_ON_WORKER") == "") { +# Running on driver. +callJStatic("org.apache.spark.SparkFiles", "get", as.character(fileName)) + } else { +# Running on worker. +file.path(spark.getSparkFilesRootDirectory(), as.character(fileName)) + } } #' Run a function over a list of elements, distributing the computations with Spark http://git-wip-us.apache.org/repos/asf/spark/blob/277ed375/R/pkg/inst/tests/testthat/test_context.R -- diff --git a/R/pkg/inst/tests/testthat/test_context.R b/R/pkg/inst/tests/testthat/test_context.R index caca069..c847113 100644 --- a/R/pkg/inst/tests/testthat/test_context.R +++ b/R/pkg/inst/tests/testthat/test_context.R @@ -177,6 +177,13 @@ test_that("add and get file to be downloaded with Spark job on every node", { spark.addFile(path) download_path <- spark.getSparkFiles(filename) expect_equal(readLines(download_path), words) + + # Test spark.getSparkFiles works well on executors. + seq <- seq(from = 1, to = 10, length.out = 5) + f <- function(seq) { spark.getSparkFiles(filename) } + results <- spark.lapply(seq, f) + for (i in 1:5) { expect_equal(basename(results[[i]]), filename) } + unlink(path) # Test add directory recursively.
http://git-wip-us.apache.org/repos/asf/spark/blob/277ed375/core/src/main/scala/org/apache/spark/api/r/RRunner.scala -- diff --git a/core/src/main/scala/org/apache/spark/api/r/RRunner.scala b/core/src/main/scala/org/apache/spark/api/r/RRunner.scala index 29e21b3..8811839 100644 --- a/core/src/main/scala/org/apache/spark/api/r/RRunner.scala +++ b/core/src/main/scala/org/apache/spark/api/r/RRunner.scala @@ -347,6 +347,8 @@ private[r] object RRunner { pb.environment().put("SPARKR_RLIBDIR", rLibDir.mkString(",")) pb.environment().put("SPARKR_WORKER_PORT", port.toString) pb.environment().put("SPARKR_BACKEND_CONNECTION_TIMEOUT", rConnectionTimeout.toString) +pb.environment().put("SPARKR_SPARKFILES_ROOT_DIR", SparkFiles.getRootDirectory()) +pb.environment().put("SPARKR_IS_RUNNING_ON_WORKER", "TRUE") pb.redirectErrorStream(true) // redirect stderr into stdout val proc = pb.start() val
spark git commit: [SPARK-19925][SPARKR] Fix SparkR spark.getSparkFiles fails when it was called on executors.
Repository: spark Updated Branches: refs/heads/master c1e87e384 -> 478fbc866 [SPARK-19925][SPARKR] Fix SparkR spark.getSparkFiles fails when it was called on executors. ## What changes were proposed in this pull request? SparkR ```spark.getSparkFiles``` fails when called on executors; see details at [SPARK-19925](https://issues.apache.org/jira/browse/SPARK-19925). ## How was this patch tested? Add unit tests, and verify this fix on standalone and YARN clusters. Author: Yanbo Liang Closes #17274 from yanboliang/spark-19925. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/478fbc86 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/478fbc86 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/478fbc86 Branch: refs/heads/master Commit: 478fbc866fbfdb4439788583281863ecea14e8af Parents: c1e87e3 Author: Yanbo Liang Authored: Tue Mar 21 21:50:54 2017 -0700 Committer: Yanbo Liang Committed: Tue Mar 21 21:50:54 2017 -0700 -- R/pkg/R/context.R | 16 ++-- R/pkg/inst/tests/testthat/test_context.R| 7 +++ .../main/scala/org/apache/spark/api/r/RRunner.scala | 2 ++ 3 files changed, 23 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/478fbc86/R/pkg/R/context.R -- diff --git a/R/pkg/R/context.R b/R/pkg/R/context.R index 1ca573e..50856e3 100644 --- a/R/pkg/R/context.R +++ b/R/pkg/R/context.R @@ -330,7 +330,13 @@ spark.addFile <- function(path, recursive = FALSE) { #'} #' @note spark.getSparkFilesRootDirectory since 2.1.0 spark.getSparkFilesRootDirectory <- function() { - callJStatic("org.apache.spark.SparkFiles", "getRootDirectory") + if (Sys.getenv("SPARKR_IS_RUNNING_ON_WORKER") == "") { +# Running on driver. +callJStatic("org.apache.spark.SparkFiles", "getRootDirectory") + } else { +# Running on worker. +Sys.getenv("SPARKR_SPARKFILES_ROOT_DIR") + } } #' Get the absolute path of a file added through spark.addFile. @@ -345,7 +351,13 @@ spark.getSparkFilesRootDirectory <- function() { #'} #' @note spark.getSparkFiles since 2.1.0 spark.getSparkFiles <- function(fileName) { - callJStatic("org.apache.spark.SparkFiles", "get", as.character(fileName)) + if (Sys.getenv("SPARKR_IS_RUNNING_ON_WORKER") == "") { +# Running on driver. +callJStatic("org.apache.spark.SparkFiles", "get", as.character(fileName)) + } else { +# Running on worker. +file.path(spark.getSparkFilesRootDirectory(), as.character(fileName)) + } } #' Run a function over a list of elements, distributing the computations with Spark http://git-wip-us.apache.org/repos/asf/spark/blob/478fbc86/R/pkg/inst/tests/testthat/test_context.R -- diff --git a/R/pkg/inst/tests/testthat/test_context.R b/R/pkg/inst/tests/testthat/test_context.R index caca069..c847113 100644 --- a/R/pkg/inst/tests/testthat/test_context.R +++ b/R/pkg/inst/tests/testthat/test_context.R @@ -177,6 +177,13 @@ test_that("add and get file to be downloaded with Spark job on every node", { spark.addFile(path) download_path <- spark.getSparkFiles(filename) expect_equal(readLines(download_path), words) + + # Test spark.getSparkFiles works well on executors. + seq <- seq(from = 1, to = 10, length.out = 5) + f <- function(seq) { spark.getSparkFiles(filename) } + results <- spark.lapply(seq, f) + for (i in 1:5) { expect_equal(basename(results[[i]]), filename) } + unlink(path) # Test add directory recursively.
http://git-wip-us.apache.org/repos/asf/spark/blob/478fbc86/core/src/main/scala/org/apache/spark/api/r/RRunner.scala -- diff --git a/core/src/main/scala/org/apache/spark/api/r/RRunner.scala b/core/src/main/scala/org/apache/spark/api/r/RRunner.scala index 29e21b3..8811839 100644 --- a/core/src/main/scala/org/apache/spark/api/r/RRunner.scala +++ b/core/src/main/scala/org/apache/spark/api/r/RRunner.scala @@ -347,6 +347,8 @@ private[r] object RRunner { pb.environment().put("SPARKR_RLIBDIR", rLibDir.mkString(",")) pb.environment().put("SPARKR_WORKER_PORT", port.toString) pb.environment().put("SPARKR_BACKEND_CONNECTION_TIMEOUT", rConnectionTimeout.toString) +pb.environment().put("SPARKR_SPARKFILES_ROOT_DIR", SparkFiles.getRootDirectory()) +pb.environment().put("SPARKR_IS_RUNNING_ON_WORKER", "TRUE") pb.redirectErrorStream(true) // redirect stderr into stdout val proc = pb.start() val errThread = startStdoutThread(proc) - To unsubscribe,
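For comparison, the JVM-side behavior that this R fix mirrors, as a hedged Scala sketch (assumes an active `SparkContext` named `sc` and a local file `data.txt`; both are placeholders): `SparkFiles.get` resolves against the driver's root directory on the driver and against the per-executor download directory inside tasks, which is the same driver/worker split the new `SPARKR_IS_RUNNING_ON_WORKER` check reproduces for R workers.

```scala
import org.apache.spark.SparkFiles

sc.addFile("data.txt")  // ship the file to every node

// Inside a task, SparkFiles.get resolves to the executor-local copy.
val executorPaths = sc.parallelize(1 to 2)
  .map(_ => SparkFiles.get("data.txt"))
  .collect()

// On the driver, the same call resolves against the driver's root directory.
val driverPath = SparkFiles.get("data.txt")
```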
spark git commit: [SPARK-19806][ML][PYSPARK] PySpark GeneralizedLinearRegression supports tweedie distribution.
Repository: spark Updated Branches: refs/heads/master 1fa58868b -> 81303f7ca [SPARK-19806][ML][PYSPARK] PySpark GeneralizedLinearRegression supports tweedie distribution. ## What changes were proposed in this pull request? PySpark ```GeneralizedLinearRegression``` supports tweedie distribution. ## How was this patch tested? Add unit tests. Author: Yanbo LiangCloses #17146 from yanboliang/spark-19806. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/81303f7c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/81303f7c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/81303f7c Branch: refs/heads/master Commit: 81303f7ca7808d51229411dce8feeed8c23dbe15 Parents: 1fa5886 Author: Yanbo Liang Authored: Wed Mar 8 02:09:36 2017 -0800 Committer: Yanbo Liang Committed: Wed Mar 8 02:09:36 2017 -0800 -- .../GeneralizedLinearRegression.scala | 8 +-- python/pyspark/ml/regression.py | 61 +--- python/pyspark/ml/tests.py | 20 +++ 3 files changed, 77 insertions(+), 12 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/81303f7c/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala index 110764d..3be8b53 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala @@ -66,7 +66,7 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam /** * Param for the power in the variance function of the Tweedie distribution which provides * the relationship between the variance and mean of the distribution. - * Only applicable for the Tweedie family. + * Only applicable to the Tweedie family. * (see https://en.wikipedia.org/wiki/Tweedie_distribution;> * Tweedie Distribution (Wikipedia)) * Supported values: 0 and [1, Inf). @@ -79,7 +79,7 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam final val variancePower: DoubleParam = new DoubleParam(this, "variancePower", "The power in the variance function of the Tweedie distribution which characterizes " + "the relationship between the variance and mean of the distribution. " + -"Only applicable for the Tweedie family. Supported values: 0 and [1, Inf).", +"Only applicable to the Tweedie family. Supported values: 0 and [1, Inf).", (x: Double) => x >= 1.0 || x == 0.0) /** @group getParam */ @@ -106,7 +106,7 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam def getLink: String = $(link) /** - * Param for the index in the power link function. Only applicable for the Tweedie family. + * Param for the index in the power link function. Only applicable to the Tweedie family. * Note that link power 0, 1, -1 or 0.5 corresponds to the Log, Identity, Inverse or Sqrt * link, respectively. * When not set, this value defaults to 1 - [[variancePower]], which matches the R "statmod" @@ -116,7 +116,7 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam */ @Since("2.2.0") final val linkPower: DoubleParam = new DoubleParam(this, "linkPower", -"The index in the power link function. Only applicable for the Tweedie family.") +"The index in the power link function. 
Only applicable to the Tweedie family.") /** @group getParam */ @Since("2.2.0") http://git-wip-us.apache.org/repos/asf/spark/blob/81303f7c/python/pyspark/ml/regression.py -- diff --git a/python/pyspark/ml/regression.py b/python/pyspark/ml/regression.py index b199bf2..3c3fcc8 100644 --- a/python/pyspark/ml/regression.py +++ b/python/pyspark/ml/regression.py @@ -1294,8 +1294,8 @@ class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, Ha Fit a Generalized Linear Model specified by giving a symbolic description of the linear predictor (link function) and a description of the error distribution (family). It supports -"gaussian", "binomial", "poisson" and "gamma" as family. Valid link functions for each family -is listed below. The first link function of each family is the default one. +"gaussian", "binomial", "poisson", "gamma" and "tweedie" as family. Valid link functions for +each family is listed below.
spark git commit: [SPARK-19745][ML] SVCAggregator captures coefficients in its closure
Repository: spark Updated Branches: refs/heads/master 8417a7ae6 -> 93ae176e8 [SPARK-19745][ML] SVCAggregator captures coefficients in its closure ## What changes were proposed in this pull request? JIRA: [SPARK-19745](https://issues.apache.org/jira/browse/SPARK-19745) Reorganize SVCAggregator to avoid serializing coefficients. This patch also makes the gradient array a `lazy val` which will avoid materializing a large array on the driver before shipping the class to the executors. This improvement stems from https://github.com/apache/spark/pull/16037. Actually, probably all ML aggregators can benefit from this. We can either: a.) separate the gradient improvement into another patch b.) keep what's here _plus_ add the lazy evaluation to all other aggregators in this patch or c.) keep it as is. ## How was this patch tested? This is an interesting question! I don't know of a reasonable way to test this right now. Ideally, we could perform an optimization and look at the shuffle write data for each task, and we could compare the size to what we know it should be: `numCoefficients * 8 bytes`. Not sure if there is a good way to do that right now? We could discuss this here or in another JIRA, but I suspect it would be a significant undertaking. Author: sethah Closes #17076 from sethah/svc_agg. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/93ae176e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/93ae176e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/93ae176e Branch: refs/heads/master Commit: 93ae176e8943d6b346c80deea778bffd188366a1 Parents: 8417a7a Author: sethah Authored: Thu Mar 2 19:38:25 2017 -0800 Committer: Yanbo Liang Committed: Thu Mar 2 19:38:25 2017 -0800 -- .../spark/ml/classification/LinearSVC.scala | 29  .../ml/classification/LogisticRegression.scala | 2 +- .../spark/ml/clustering/GaussianMixture.scala | 6 ++-- .../ml/regression/AFTSurvivalRegression.scala | 2 +- .../spark/ml/regression/LinearRegression.scala | 2 +- .../ml/classification/LinearSVCSuite.scala | 17 +++- 6 files changed, 34 insertions(+), 24 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/93ae176e/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala index bf6e76d..f76b14e 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala @@ -440,19 +440,14 @@ private class LinearSVCAggregator( private val numFeatures: Int = bcFeaturesStd.value.length private val numFeaturesPlusIntercept: Int = if (fitIntercept) numFeatures + 1 else numFeatures - private val coefficients: Vector = bcCoefficients.value private var weightSum: Double = 0.0 private var lossSum: Double = 0.0 - require(numFeaturesPlusIntercept == coefficients.size, s"Dimension mismatch. 
Coefficients " + -s"length ${coefficients.size}, FeaturesStd length ${numFeatures}, fitIntercept: $fitIntercept") - - private val coefficientsArray = coefficients match { -case dv: DenseVector => dv.values -case _ => - throw new IllegalArgumentException( -s"coefficients only supports dense vector but got type ${coefficients.getClass}.") + @transient private lazy val coefficientsArray = bcCoefficients.value match { +case DenseVector(values) => values +case _ => throw new IllegalArgumentException(s"coefficients only supports dense vector" + + s" but got type ${bcCoefficients.value.getClass}.") } - private val gradientSumArray = Array.fill[Double](coefficientsArray.length)(0) + private lazy val gradientSumArray = new Array[Double](numFeaturesPlusIntercept) /** * Add a new training instance to this LinearSVCAggregator, and update the loss and gradient @@ -463,6 +458,9 @@ private class LinearSVCAggregator( */ def add(instance: Instance): this.type = { instance match { case Instance(label, weight, features) => + require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0") + require(numFeatures == features.size, s"Dimensions mismatch when adding new instance." + +s" Expecting $numFeatures but got ${features.size}.") if (weight == 0.0) return this val localFeaturesStd = bcFeaturesStd.value val localCoefficients = coefficientsArray @@ -530,18 +528,15 @@ private class LinearSVCAggregator( this } - def loss:
spark git commit: [SPARK-19734][PYTHON][ML] Correct OneHotEncoder doc string to say dropLast
Repository: spark Updated Branches: refs/heads/master 3bd8ddf7c -> d2a879762 [SPARK-19734][PYTHON][ML] Correct OneHotEncoder doc string to say dropLast ## What changes were proposed in this pull request? Updates the doc string to match up with the code i.e. say dropLast instead of includeFirst ## How was this patch tested? Not much, since it's a doc-like change. Will run unit tests via Jenkins job. Author: Mark GroverCloses #17127 from markgrover/spark_19734. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d2a87976 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d2a87976 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d2a87976 Branch: refs/heads/master Commit: d2a879762a2b4f3c4d703cc183275af12b3c7de1 Parents: 3bd8ddf Author: Mark Grover Authored: Wed Mar 1 22:57:34 2017 -0800 Committer: Yanbo Liang Committed: Wed Mar 1 22:57:34 2017 -0800 -- python/pyspark/ml/feature.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d2a87976/python/pyspark/ml/feature.py -- diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py index 67c12d8..83cf763 100755 --- a/python/pyspark/ml/feature.py +++ b/python/pyspark/ml/feature.py @@ -1363,7 +1363,7 @@ class OneHotEncoder(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadable, @keyword_only def __init__(self, dropLast=True, inputCol=None, outputCol=None): """ -__init__(self, includeFirst=True, inputCol=None, outputCol=None) +__init__(self, dropLast=True, inputCol=None, outputCol=None) """ super(OneHotEncoder, self).__init__() self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.OneHotEncoder", self.uid) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
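For context, a small sketch of the corrected signature in use (toy data; assumes Spark 2.x, where `OneHotEncoder` is a plain `Transformer` with no fit step):

```python
from pyspark.ml.feature import OneHotEncoder
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0.0,), (1.0,), (2.0,)], ["categoryIndex"])

# dropLast=True (the default) omits the last category: a three-level column
# becomes a 2-dimensional vector, and index 2.0 encodes as all zeros.
encoder = OneHotEncoder(dropLast=True, inputCol="categoryIndex",
                        outputCol="categoryVec")
encoder.transform(df).show(truncate=False)
```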
spark git commit: [MINOR][ML] Fix comments in LSH Examples and Python API
Repository: spark Updated Branches: refs/heads/master de2b53df4 -> 3bd8ddf7c [MINOR][ML] Fix comments in LSH Examples and Python API ## What changes were proposed in this pull request? Remove `org.apache.spark.examples.` from the `Run with` comments in the LSH examples. Add a slash in one of the Python docs. ## How was this patch tested? Run examples using the commands in the comments. Author: Yun Ni Closes #17104 from Yunni/yunn_minor. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3bd8ddf7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3bd8ddf7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3bd8ddf7 Branch: refs/heads/master Commit: 3bd8ddf7c34be35e5adeb802d6e63120f9f11713 Parents: de2b53d Author: Yun Ni Authored: Wed Mar 1 22:55:13 2017 -0800 Committer: Yanbo Liang Committed: Wed Mar 1 22:55:13 2017 -0800 -- .../spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java | 2 +- .../java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java | 2 +- .../spark/examples/ml/BucketedRandomProjectionLSHExample.scala | 2 +- .../scala/org/apache/spark/examples/ml/MinHashLSHExample.scala | 2 +- python/pyspark/ml/feature.py | 2 +- 5 files changed, 5 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3bd8ddf7/examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java -- diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java index 4594e34..ff917b7 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java @@ -42,7 +42,7 @@ import static org.apache.spark.sql.functions.col; /** * An example demonstrating BucketedRandomProjectionLSH. * Run with: - * bin/run-example org.apache.spark.examples.ml.JavaBucketedRandomProjectionLSHExample + * bin/run-example ml.JavaBucketedRandomProjectionLSHExample */ public class JavaBucketedRandomProjectionLSHExample { public static void main(String[] args) { http://git-wip-us.apache.org/repos/asf/spark/blob/3bd8ddf7/examples/src/main/java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java -- diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java index 0aace46..e164598 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java @@ -42,7 +42,7 @@ import static org.apache.spark.sql.functions.col; /** * An example demonstrating MinHashLSH. 
* Run with: - * bin/run-example org.apache.spark.examples.ml.JavaMinHashLSHExample + * bin/run-example ml.JavaMinHashLSHExample */ public class JavaMinHashLSHExample { public static void main(String[] args) { http://git-wip-us.apache.org/repos/asf/spark/blob/3bd8ddf7/examples/src/main/scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala -- diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala index 654535c..16da4fa 100644 --- a/examples/src/main/scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala +++ b/examples/src/main/scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala @@ -28,7 +28,7 @@ import org.apache.spark.sql.SparkSession /** * An example demonstrating BucketedRandomProjectionLSH. * Run with: - * bin/run-example org.apache.spark.examples.ml.BucketedRandomProjectionLSHExample + * bin/run-example ml.BucketedRandomProjectionLSHExample */ object BucketedRandomProjectionLSHExample { def main(args: Array[String]): Unit = { http://git-wip-us.apache.org/repos/asf/spark/blob/3bd8ddf7/examples/src/main/scala/org/apache/spark/examples/ml/MinHashLSHExample.scala -- diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/MinHashLSHExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/MinHashLSHExample.scala index 6c1e222..b94ab9b 100644 ---
spark git commit: [MINOR][ML][DOC] Document default value for GeneralizedLinearRegression.linkPower
Repository: spark Updated Branches: refs/heads/master 410392ed7 -> 6ab60542e [MINOR][ML][DOC] Document default value for GeneralizedLinearRegression.linkPower Add Scaladoc for GeneralizedLinearRegression.linkPower default value Follow-up to https://github.com/apache/spark/pull/16344 Author: Joseph K. BradleyCloses #17069 from jkbradley/tweedie-comment. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6ab60542 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6ab60542 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6ab60542 Branch: refs/heads/master Commit: 6ab60542e8e803b1d91371a92f4aaef6a64106f6 Parents: 410392e Author: Joseph K. Bradley Authored: Sat Feb 25 22:24:08 2017 -0800 Committer: Yanbo Liang Committed: Sat Feb 25 22:24:08 2017 -0800 -- .../apache/spark/ml/regression/GeneralizedLinearRegression.scala | 2 ++ 1 file changed, 2 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6ab60542/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala index fdeadaf..110764d 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala @@ -109,6 +109,8 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam * Param for the index in the power link function. Only applicable for the Tweedie family. * Note that link power 0, 1, -1 or 0.5 corresponds to the Log, Identity, Inverse or Sqrt * link, respectively. + * When not set, this value defaults to 1 - [[variancePower]], which matches the R "statmod" + * package. * * @group param */ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
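The documented default is simple arithmetic; a tiny illustrative snippet spells out the rule (the helper below is hypothetical, written only for this note):

```python
# When linkPower is not set, GLM uses 1 - variancePower (matching R "statmod").
def default_link_power(variance_power):  # hypothetical helper, for illustration
    return 1.0 - variance_power

for p, family in [(0.0, "gaussian"), (1.0, "poisson"), (2.0, "gamma")]:
    print("variancePower=%.1f (%s) -> default linkPower=%.1f"
          % (p, family, default_link_power(p)))
# linkPower 1.0 is the identity link, 0.0 the log link, -1.0 the inverse link.
```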
spark git commit: [SPARK-18285][SPARKR] SparkR approxQuantile supports input multiple columns
Repository: spark Updated Branches: refs/heads/master 1a3f5f8c5 -> b40659838 [SPARK-18285][SPARKR] SparkR approxQuantile supports input multiple columns ## What changes were proposed in this pull request? SparkR ```approxQuantile``` supports input multiple columns. ## How was this patch tested? Unit test. Author: Yanbo LiangCloses #16951 from yanboliang/spark-19619. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b4065983 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b4065983 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b4065983 Branch: refs/heads/master Commit: b40659838245ecaefb4e83d2ec6155f3f23a6675 Parents: 1a3f5f8 Author: Yanbo Liang Authored: Fri Feb 17 11:58:39 2017 -0800 Committer: Yanbo Liang Committed: Fri Feb 17 11:58:39 2017 -0800 -- R/pkg/R/generics.R| 2 +- R/pkg/R/stats.R | 25 + R/pkg/inst/tests/testthat/test_sparkSQL.R | 18 +- 3 files changed, 31 insertions(+), 14 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b4065983/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index 68864e6..11940d3 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -66,7 +66,7 @@ setGeneric("freqItems", function(x, cols, support = 0.01) { standardGeneric("fre # @rdname approxQuantile # @export setGeneric("approxQuantile", - function(x, col, probabilities, relativeError) { + function(x, cols, probabilities, relativeError) { standardGeneric("approxQuantile") }) http://git-wip-us.apache.org/repos/asf/spark/blob/b4065983/R/pkg/R/stats.R -- diff --git a/R/pkg/R/stats.R b/R/pkg/R/stats.R index dcd7198..8d1d165 100644 --- a/R/pkg/R/stats.R +++ b/R/pkg/R/stats.R @@ -138,9 +138,9 @@ setMethod("freqItems", signature(x = "SparkDataFrame", cols = "character"), collect(dataFrame(sct)) }) -#' Calculates the approximate quantiles of a numerical column of a SparkDataFrame +#' Calculates the approximate quantiles of numerical columns of a SparkDataFrame #' -#' Calculates the approximate quantiles of a numerical column of a SparkDataFrame. +#' Calculates the approximate quantiles of numerical columns of a SparkDataFrame. #' The result of this algorithm has the following deterministic bound: #' If the SparkDataFrame has N elements and if we request the quantile at probability p up to #' error err, then the algorithm will return a sample x from the SparkDataFrame so that the @@ -149,15 +149,19 @@ setMethod("freqItems", signature(x = "SparkDataFrame", cols = "character"), #' This method implements a variation of the Greenwald-Khanna algorithm (with some speed #' optimizations). The algorithm was first present in [[http://dx.doi.org/10.1145/375663.375670 #' Space-efficient Online Computation of Quantile Summaries]] by Greenwald and Khanna. +#' Note that rows containing any NA values will be removed before calculation. #' #' @param x A SparkDataFrame. -#' @param col The name of the numerical column. +#' @param cols A single column name, or a list of names for multiple columns. #' @param probabilities A list of quantile probabilities. Each number must belong to [0, 1]. #' For example 0 is the minimum, 0.5 is the median, 1 is the maximum. #' @param relativeError The relative target precision to achieve (>= 0). If set to zero, #' the exact quantiles are computed, which could be very expensive. #' Note that values greater than 1 are accepted but give the same result as 1. -#' @return The approximate quantiles at the given probabilities. 
+#' @return The approximate quantiles at the given probabilities. If the input is a single column name, +#' the output is a list of approximate quantiles in that column; If the input is +#' multiple column names, the output should be a list, and each element in it is a list of +#' numeric values which represents the approximate quantiles in corresponding column. #' #' @rdname approxQuantile #' @name approxQuantile @@ -171,12 +175,17 @@ setMethod("freqItems", signature(x = "SparkDataFrame", cols = "character"), #' } #' @note approxQuantile since 2.0.0 setMethod("approxQuantile", - signature(x = "SparkDataFrame", col = "character", + signature(x = "SparkDataFrame", cols = "character", probabilities = "numeric", relativeError = "numeric"), - function(x, col, probabilities,
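For comparison, PySpark's `DataFrame.approxQuantile` has the matching single-versus-multiple column behavior (one assumption worth noting: this requires a Spark version where the Python API also accepts a list of columns, i.e. 2.2+):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0), (4.0, 40.0)], ["a", "b"])

# Single column: a flat list of quantiles.
print(df.approxQuantile("a", [0.25, 0.5, 0.75], 0.0))
# Multiple columns: a list containing one list of quantiles per column.
print(df.approxQuantile(["a", "b"], [0.25, 0.5, 0.75], 0.0))
```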
spark git commit: [SPARK-18080][ML][PYTHON] Python API & Examples for Locality Sensitive Hashing
Repository: spark Updated Branches: refs/heads/master 21b4ba2d6 -> 08c1972a0 [SPARK-18080][ML][PYTHON] Python API & Examples for Locality Sensitive Hashing ## What changes were proposed in this pull request? This pull request includes python API and examples for LSH. The API changes was based on yanboliang 's PR #15768 and resolved conflicts and API changes on the Scala API. The examples are consistent with Scala examples of MinHashLSH and BucketedRandomProjectionLSH. ## How was this patch tested? API and examples are tested using spark-submit: `bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py` `bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py` User guide changes are generated and manually inspected: `SKIP_API=1 jekyll build` Author: Yun NiAuthor: Yanbo Liang Author: Yunni Closes #16715 from Yunni/spark-18080. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/08c1972a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/08c1972a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/08c1972a Branch: refs/heads/master Commit: 08c1972a0661d42f300520cc6e5fb31023de093b Parents: 21b4ba2 Author: Yun Ni Authored: Wed Feb 15 16:26:05 2017 -0800 Committer: Yanbo Liang Committed: Wed Feb 15 16:26:05 2017 -0800 -- docs/ml-features.md | 17 ++ .../JavaBucketedRandomProjectionLSHExample.java | 38 ++- .../examples/ml/JavaMinHashLSHExample.java | 57 +++- .../bucketed_random_projection_lsh_example.py | 81 ++ .../src/main/python/ml/min_hash_lsh_example.py | 81 ++ .../ml/BucketedRandomProjectionLSHExample.scala | 39 ++- .../spark/examples/ml/MinHashLSHExample.scala | 43 ++- .../scala/org/apache/spark/ml/feature/LSH.scala | 7 +- python/pyspark/ml/feature.py| 291 +++ 9 files changed, 601 insertions(+), 53 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/08c1972a/docs/ml-features.md -- diff --git a/docs/ml-features.md b/docs/ml-features.md index 13d97a2..57605ba 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -1558,6 +1558,15 @@ for more details on the API. {% include_example java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java %} + + + +Refer to the [BucketedRandomProjectionLSH Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.BucketedRandomProjectionLSH) +for more details on the API. + +{% include_example python/ml/bucketed_random_projection_lsh_example.py %} + + ### MinHash for Jaccard Distance @@ -1590,4 +1599,12 @@ for more details on the API. {% include_example java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java %} + + + +Refer to the [MinHashLSH Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.MinHashLSH) +for more details on the API. 
+ +{% include_example python/ml/min_hash_lsh_example.py %} + http://git-wip-us.apache.org/repos/asf/spark/blob/08c1972a/examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java -- diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java index ca3ee5a..4594e34 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java @@ -35,8 +35,15 @@ import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; + +import static org.apache.spark.sql.functions.col; // $example off$ +/** + * An example demonstrating BucketedRandomProjectionLSH. + * Run with: + * bin/run-example org.apache.spark.examples.ml.JavaBucketedRandomProjectionLSHExample + */ public class JavaBucketedRandomProjectionLSHExample { public static void main(String[] args) { SparkSession spark = SparkSession @@ -61,7 +68,7 @@ public class JavaBucketedRandomProjectionLSHExample { StructType schema = new StructType(new StructField[]{ new StructField("id", DataTypes.IntegerType, false, Metadata.empty()), - new StructField("keys", new VectorUDT(), false, Metadata.empty()) + new StructField("features", new VectorUDT(), false, Metadata.empty()) }); Dataset dfA = spark.createDataFrame(dataA, schema); Dataset dfB =
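A condensed sketch of the new Python usage (toy data and the 0.8 threshold are illustrative, loosely following the bundled `min_hash_lsh_example.py`):

```python
from pyspark.ml.feature import MinHashLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dfA = spark.createDataFrame([
    (0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0])),
    (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]))], ["id", "features"])
dfB = spark.createDataFrame([
    (3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0])),
    (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]))], ["id", "features"])

mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
model = mh.fit(dfA)

# Approximate join of the two datasets on Jaccard distance below 0.8.
model.approxSimilarityJoin(dfA, dfB, 0.8, distCol="JaccardDistance") \
    .select("datasetA.id", "datasetB.id", "JaccardDistance").show()
```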
spark git commit: [SPARK-18929][ML] Add Tweedie distribution in GLM
Repository: spark Updated Branches: refs/heads/master 90817a6cd -> 4172ff80d [SPARK-18929][ML] Add Tweedie distribution in GLM ## What changes were proposed in this pull request? I propose to add the full Tweedie family into the GeneralizedLinearRegression model. The Tweedie family is characterized by a power variance function. Currently supported distributions such as Gaussian, Poisson and Gamma families are a special case of the Tweedie https://en.wikipedia.org/wiki/Tweedie_distribution. yanboliang srowen sethah Author: actuaryzhangAuthor: Wayne Zhang Closes #16344 from actuaryzhang/tweedie. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4172ff80 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4172ff80 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4172ff80 Branch: refs/heads/master Commit: 4172ff80dd9ca9cde4f310953bfc386cbfc62ba4 Parents: 90817a6 Author: actuaryzhang Authored: Thu Jan 26 23:01:13 2017 -0800 Committer: Yanbo Liang Committed: Thu Jan 26 23:01:13 2017 -0800 -- .../GeneralizedLinearRegression.scala | 359 +++ .../GeneralizedLinearRegressionSuite.scala | 291 ++- 2 files changed, 567 insertions(+), 83 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4172ff80/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala index 3ffed39..c4f41d0 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala @@ -48,7 +48,7 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam /** * Param for the name of family which is a description of the error distribution * to be used in the model. - * Supported options: "gaussian", "binomial", "poisson" and "gamma". + * Supported options: "gaussian", "binomial", "poisson", "gamma" and "tweedie". * Default is "gaussian". * * @group param @@ -64,9 +64,34 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam def getFamily: String = $(family) /** + * Param for the power in the variance function of the Tweedie distribution which provides + * the relationship between the variance and mean of the distribution. + * Only applicable for the Tweedie family. + * (see https://en.wikipedia.org/wiki/Tweedie_distribution;> + * Tweedie Distribution (Wikipedia)) + * Supported values: 0 and [1, Inf). + * Note that variance power 0, 1, or 2 corresponds to the Gaussian, Poisson or Gamma + * family, respectively. + * + * @group param + */ + @Since("2.2.0") + final val variancePower: DoubleParam = new DoubleParam(this, "variancePower", +"The power in the variance function of the Tweedie distribution which characterizes " + +"the relationship between the variance and mean of the distribution. " + +"Only applicable for the Tweedie family. Supported values: 0 and [1, Inf).", +(x: Double) => x >= 1.0 || x == 0.0) + + /** @group getParam */ + @Since("2.2.0") + def getVariancePower: Double = $(variancePower) + + /** * Param for the name of link function which provides the relationship * between the linear predictor and the mean of the distribution function. * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt". 
+ * This is used only when family is not "tweedie". The link function for the "tweedie" family + * must be specified through [[linkPower]]. * * @group param */ @@ -81,6 +106,21 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam def getLink: String = $(link) /** + * Param for the index in the power link function. Only applicable for the Tweedie family. + * Note that link power 0, 1, -1 or 0.5 corresponds to the Log, Identity, Inverse or Sqrt + * link, respectively. + * + * @group param + */ + @Since("2.2.0") + final val linkPower: DoubleParam = new DoubleParam(this, "linkPower", +"The index in the power link function. Only applicable for the Tweedie family.") + + /** @group getParam */ + @Since("2.2.0") + def getLinkPower: Double = $(linkPower) + + /** * Param for link prediction (linear predictor) column name. * Default is not set, which means we do not output link prediction. * @@
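For reference, the "power variance function" named above can be stated compactly: the Tweedie family assumes

    Var(Y) = \phi \, \mu^{p}

for mean \mu, dispersion \phi and variance power p, so p = 0, 1 and 2 recover the Gaussian, Poisson and Gamma variance functions, while 1 < p < 2 gives the compound Poisson-Gamma distributions often used for insurance claims data.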
spark git commit: [SPARK-19313][ML][MLLIB] GaussianMixture should limit the number of features
Repository: spark Updated Branches: refs/heads/master 76db394f2 -> 0e821ec6f [SPARK-19313][ML][MLLIB] GaussianMixture should limit the number of features ## What changes were proposed in this pull request? The following test will fail on current master scala test("gmm fails on high dimensional data") { val ctx = spark.sqlContext import ctx.implicits._ val df = Seq( Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(0, 4), Array(3.0, 8.0)), Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(1, 5), Array(4.0, 9.0))) .map(Tuple1.apply).toDF("features") val gm = new GaussianMixture() intercept[IllegalArgumentException] { gm.fit(df) } } Instead, you'll get an `ArrayIndexOutOfBoundsException` or something similar for MLlib. That's because the covariance matrix allocates an array of `numFeatures * numFeatures`, and in this case we get integer overflow. While there is currently a warning that the algorithm does not perform well for high number of features, we should perform an appropriate check to communicate this limitation to users. This patch adds a `require(numFeatures < GaussianMixture.MAX_NUM_FEATURES)` check to ML and MLlib algorithms. For the feature limitation, we can limit it such that we do not get numerical overflow to something like `math.sqrt(Integer.MaxValue).toInt` (about 46k) which eliminates the cryptic error. However in, for example WLS, we need to collect an array on the order of `numFeatures * numFeatures` to the driver and we therefore limit to 4096 features. We may want to keep that convention here for consistency. ## How was this patch tested? Unit tests in ML and MLlib. Author: sethahCloses #16661 from sethah/gmm_high_dim. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0e821ec6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0e821ec6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0e821ec6 Branch: refs/heads/master Commit: 0e821ec6fa98f4b0aa6e2eb6fecd18cc1ee6f3f2 Parents: 76db394 Author: sethah Authored: Wed Jan 25 07:12:25 2017 -0800 Committer: Yanbo Liang Committed: Wed Jan 25 07:12:25 2017 -0800 -- .../apache/spark/ml/clustering/GaussianMixture.scala | 14 +++--- .../spark/mllib/clustering/GaussianMixture.scala | 15 --- .../spark/ml/clustering/GaussianMixtureSuite.scala | 14 ++ .../mllib/clustering/GaussianMixtureSuite.scala | 14 ++ 4 files changed, 51 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0e821ec6/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala b/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala index db5fff5..ea2dc6c 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala @@ -278,7 +278,9 @@ object GaussianMixtureModel extends MLReadable[GaussianMixtureModel] { * While this process is generally guaranteed to converge, it is not guaranteed * to find a global optimum. * - * @note For high-dimensional data (with many features), this algorithm may perform poorly. + * @note This algorithm is limited in its number of features since it requires storing a covariance + * matrix which has size quadratic in the number of features. Even when the number of features does + * not exceed this limit, this algorithm may perform poorly on high-dimensional data. 
* This is due to high-dimensional data (a) making it difficult to cluster at all (based * on statistical/theoretical arguments) and (b) numerical issues with Gaussian distributions. */ @@ -344,6 +346,9 @@ class GaussianMixture @Since("2.0.0") ( // Extract the number of features. val numFeatures = instances.first().size +require(numFeatures < GaussianMixture.MAX_NUM_FEATURES, s"GaussianMixture cannot handle more " + + s"than ${GaussianMixture.MAX_NUM_FEATURES} features because the size of the covariance" + + s" matrix is quadratic in the number of features.") val instr = Instrumentation.create(this, instances) instr.logParams(featuresCol, predictionCol, probabilityCol, k, maxIter, seed, tol) @@ -391,8 +396,8 @@ class GaussianMixture @Since("2.0.0") ( val (ws, gs) = sc.parallelize(tuples, numPartitions).map { case (mean, cov, weight) => GaussianMixture.updateWeightsAndGaussians(mean, cov, weight, sumWeights) }.collect().unzip -
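The quadratic-memory argument is quick to check; a back-of-the-envelope sketch (the 4096 limit is the convention quoted above, and the byte counts are plain arithmetic, not code from the patch):

```python
def cov_matrix_bytes(num_features):
    # One dense covariance matrix of doubles is numFeatures^2 * 8 bytes.
    return num_features * num_features * 8

print(cov_matrix_bytes(4096) / 2.0 ** 20)   # 128.0 MiB per Gaussian at the limit
print(cov_matrix_bytes(46340) / 2.0 ** 30)  # ~16 GiB near sqrt(Int.MaxValue) features
```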
spark git commit: [SPARK-19155][ML] Make family case insensitive in GLM
Repository: spark Updated Branches: refs/heads/branch-2.1 8daf10e3f -> 1e07a7192 [SPARK-19155][ML] Make family case insensitive in GLM ## What changes were proposed in this pull request? This is a supplement to PR #16516 which did not make the value from `getFamily` case insensitive. Current tests of poisson/binomial glm with weight fail when specifying 'Poisson' or 'Binomial', because the calculation of `dispersion` and `pValue` checks the value of family retrieved from `getFamily` ``` model.getFamily == Binomial.name || model.getFamily == Poisson.name ``` ## How was this patch tested? Update existing tests for 'Poisson' and 'Binomial'. yanboliang felixcheung imatiach-msft Author: actuaryzhangCloses #16675 from actuaryzhang/family. (cherry picked from commit f067acefabebf04939d03a639a2aaa654e1bc8f9) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1e07a719 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1e07a719 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1e07a719 Branch: refs/heads/branch-2.1 Commit: 1e07a71924ef1420c96a3a0a8cb5be2f3a830037 Parents: 8daf10e Author: actuaryzhang Authored: Mon Jan 23 00:53:44 2017 -0800 Committer: Yanbo Liang Committed: Mon Jan 23 00:54:08 2017 -0800 -- .../spark/ml/regression/GeneralizedLinearRegression.scala | 6 -- .../spark/ml/regression/GeneralizedLinearRegressionSuite.scala | 4 ++-- 2 files changed, 6 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1e07a719/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala index 1e7ba91..676be61 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala @@ -1027,7 +1027,8 @@ class GeneralizedLinearRegressionSummary private[regression] ( */ @Since("2.0.0") lazy val dispersion: Double = if ( -model.getFamily == Binomial.name || model.getFamily == Poisson.name) { +model.getFamily.toLowerCase == Binomial.name || + model.getFamily.toLowerCase == Poisson.name) { 1.0 } else { val rss = pearsonResiduals.agg(sum(pow(col("pearsonResiduals"), 2.0))).first().getDouble(0) @@ -1130,7 +1131,8 @@ class GeneralizedLinearRegressionTrainingSummary private[regression] ( @Since("2.0.0") lazy val pValues: Array[Double] = { if (isNormalSolver) { - if (model.getFamily == Binomial.name || model.getFamily == Poisson.name) { + if (model.getFamily.toLowerCase == Binomial.name || +model.getFamily.toLowerCase == Poisson.name) { tValues.map { x => 2.0 * (1.0 - dist.Gaussian(0.0, 1.0).cdf(math.abs(x))) } } else { tValues.map { x => http://git-wip-us.apache.org/repos/asf/spark/blob/1e07a719/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala index 415d426..95b443d 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala @@ -757,7 +757,7 @@ class 
GeneralizedLinearRegressionSuite 0.5554219 -0.4034267 0.6567520 -0.2611382 */ val trainer = new GeneralizedLinearRegression() - .setFamily("binomial") + .setFamily("Binomial") .setWeightCol("weight") .setFitIntercept(false) @@ -874,7 +874,7 @@ class GeneralizedLinearRegressionSuite -0.4378554 0.2189277 0.1459518 -0.1094638 */ val trainer = new GeneralizedLinearRegression() - .setFamily("poisson") + .setFamily("Poisson") .setWeightCol("weight") .setFitIntercept(true) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
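A short sketch of the user-visible effect (assuming Spark 2.1.1+, where both this fix and #16516 are present; the data and column names are illustrative):

```python
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import GeneralizedLinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    (1.0, 1.0, Vectors.dense(0.0)),
    (2.0, 2.0, Vectors.dense(1.0)),
    (3.0, 1.0, Vectors.dense(2.0)),
], ["label", "weight", "features"])

# Capitalized family name: with this fix the dispersion and p-value code
# treats "Poisson" exactly like "poisson" instead of falling through to
# the non-Poisson branch.
glr = GeneralizedLinearRegression(family="Poisson", link="log",
                                  weightCol="weight")
model = glr.fit(df)
print(model.summary.pValues)
```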