spark git commit: [SPARK-23291][R][FOLLOWUP] Update SparkR migration note for
Repository: spark Updated Branches: refs/heads/master 56a52e0a5 -> 1c9c5de95 [SPARK-23291][R][FOLLOWUP] Update SparkR migration note for ## What changes were proposed in this pull request? This PR fixes the migration note for SPARK-23291 since it's going to be backported to 2.3.1. See the discussion in https://issues.apache.org/jira/browse/SPARK-23291 ## How was this patch tested? N/A Author: hyukjinkwon Closes #21249 from HyukjinKwon/SPARK-23291. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1c9c5de9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1c9c5de9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1c9c5de9 Branch: refs/heads/master Commit: 1c9c5de951ed86290bcd7d8edaab952b8cacd290 Parents: 56a52e0 Author: hyukjinkwon Authored: Mon May 7 14:52:14 2018 -0700 Committer: Yanbo Liang Committed: Mon May 7 14:52:14 2018 -0700 -- docs/sparkr.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1c9c5de9/docs/sparkr.md -- diff --git a/docs/sparkr.md b/docs/sparkr.md index 7fabab5..4faad2c 100644 --- a/docs/sparkr.md +++ b/docs/sparkr.md @@ -664,6 +664,6 @@ You can inspect the search path in R with [`search()`](https://stat.ethz.ch/R-ma - For `summary`, option for statistics to compute has been added. Its output is changed from that of `describe`. - A warning can be raised if versions of SparkR package and the Spark JVM do not match. -## Upgrading to Spark 2.4.0 +## Upgrading to SparkR 2.3.1 and above - - The `start` parameter of `substr` method was wrongly subtracted by one, previously. In other words, the index specified by `start` parameter was considered as 0-base. This can lead to inconsistent substring results and also does not match with the behaviour with `substr` in R. It has been fixed so the `start` parameter of `substr` method is now 1-base, e.g., therefore to get the same result as `substr(df$a, 2, 5)`, it should be changed to `substr(df$a, 1, 4)`. + - In SparkR 2.3.0 and earlier, the `start` parameter of the `substr` method was wrongly subtracted by one and considered as 0-based. This can lead to inconsistent substring results and also does not match the behaviour of `substr` in R. In version 2.3.1 and later, it has been fixed so the `start` parameter of the `substr` method is now 1-based. As an example, `substr(lit('abcdef'), 2, 4)` would result in `abc` in SparkR 2.3.0, and the result would be `bcd` in SparkR 2.3.1.
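For illustration, a minimal PySpark sketch of the 1-based substring semantics that the fixed SparkR `substr` now matches; PySpark's `Column.substr(startPos, length)` has always been 1-based, and the DataFrame here is a stand-in:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
df = spark.range(1).select(lit("abcdef").alias("a"))

# Column.substr(startPos, length) is 1-based: start=2, length=3 -> "bcd",
# the same result SparkR's substr(df$a, 2, 4) returns from 2.3.1 onwards
df.select(df.a.substr(2, 3).alias("s")).show()
```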
spark git commit: [SPARK-23291][SQL][R][BRANCH-2.3] R's substr should not reduce starting position by 1 when calling Scala API
Repository: spark Updated Branches: refs/heads/branch-2.3 f87785a76 -> 3a22feab4 [SPARK-23291][SQL][R][BRANCH-2.3] R's substr should not reduce starting position by 1 when calling Scala API ## What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/commit/24b5c69ee3feded439e5bb6390e4b63f503eeafe and https://github.com/apache/spark/pull/21249 There's no conflict, but I opened this just to run the tests and to be sure. See the discussion in https://issues.apache.org/jira/browse/SPARK-23291 ## How was this patch tested? Jenkins tests. Author: hyukjinkwon Author: Liang-Chi Hsieh Closes #21250 from HyukjinKwon/SPARK-23291-backport. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3a22feab Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3a22feab Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3a22feab Branch: refs/heads/branch-2.3 Commit: 3a22feab4dc9f0cffe3aaec692e27ab277666507 Parents: f87785a Author: hyukjinkwon Authored: Mon May 7 14:48:28 2018 -0700 Committer: Yanbo Liang Committed: Mon May 7 14:48:28 2018 -0700 -- R/pkg/R/column.R | 10 -- R/pkg/tests/fulltests/test_sparkSQL.R | 1 + docs/sparkr.md| 4 3 files changed, 13 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3a22feab/R/pkg/R/column.R -- diff --git a/R/pkg/R/column.R b/R/pkg/R/column.R index 3095adb..3d6d9f9 100644 --- a/R/pkg/R/column.R +++ b/R/pkg/R/column.R @@ -164,12 +164,18 @@ setMethod("alias", #' @aliases substr,Column-method #' #' @param x a Column. -#' @param start starting position. +#' @param start starting position. It should be 1-based. #' @param stop ending position. +#' @examples +#' \dontrun{ +#' df <- createDataFrame(list(list(a="abcdef"))) +#' collect(select(df, substr(df$a, 1, 4))) # the result is `abcd`. +#' collect(select(df, substr(df$a, 2, 4))) # the result is `bcd`. +#' } #' @note substr since 1.4.0 setMethod("substr", signature(x = "Column"), function(x, start, stop) { -jc <- callJMethod(x@jc, "substr", as.integer(start - 1), as.integer(stop - start + 1)) +jc <- callJMethod(x@jc, "substr", as.integer(start), as.integer(stop - start + 1)) column(jc) }) http://git-wip-us.apache.org/repos/asf/spark/blob/3a22feab/R/pkg/tests/fulltests/test_sparkSQL.R -- diff --git a/R/pkg/tests/fulltests/test_sparkSQL.R b/R/pkg/tests/fulltests/test_sparkSQL.R index 5197838..bed26ec 100644 --- a/R/pkg/tests/fulltests/test_sparkSQL.R +++ b/R/pkg/tests/fulltests/test_sparkSQL.R @@ -1649,6 +1649,7 @@ test_that("string operators", { expect_false(first(select(df, startsWith(df$name, "m")))[[1]]) expect_true(first(select(df, endsWith(df$name, "el")))[[1]]) expect_equal(first(select(df, substr(df$name, 1, 2)))[[1]], "Mi") + expect_equal(first(select(df, substr(df$name, 4, 6)))[[1]], "hae") if (as.numeric(R.version$major) >= 3 && as.numeric(R.version$minor) >= 3) { expect_true(startsWith("Hello World", "Hello")) expect_false(endsWith("Hello World", "a")) http://git-wip-us.apache.org/repos/asf/spark/blob/3a22feab/docs/sparkr.md -- diff --git a/docs/sparkr.md b/docs/sparkr.md index 6685b58..73f9424 100644 --- a/docs/sparkr.md +++ b/docs/sparkr.md @@ -663,3 +663,7 @@ You can inspect the search path in R with [`search()`](https://stat.ethz.ch/R-ma - The `stringsAsFactors` parameter was previously ignored with `collect`, for example, in `collect(createDataFrame(iris), stringsAsFactors = TRUE))`. It has been corrected. - For `summary`, option for statistics to compute has been added.
Its output is changed from that of `describe`. - A warning can be raised if versions of SparkR package and the Spark JVM do not match. + +## Upgrading to SparkR 2.3.1 and above + + - In SparkR 2.3.0 and earlier, the `start` parameter of the `substr` method was wrongly subtracted by one and considered as 0-based. This can lead to inconsistent substring results and also does not match the behaviour of `substr` in R. In version 2.3.1 and later, it has been fixed so the `start` parameter of the `substr` method is now 1-based. As an example, `substr(lit('abcdef'), 2, 4)` would result in `abc` in SparkR 2.3.0, and the result would be `bcd` in SparkR 2.3.1.
spark git commit: [SPARK-19634][SQL][ML][FOLLOW-UP] Improve interface of dataframe vectorized summarizer
Repository: spark Updated Branches: refs/heads/master 9c289a5cb -> d3ae3e1e8 [SPARK-19634][SQL][ML][FOLLOW-UP] Improve interface of dataframe vectorized summarizer ## What changes were proposed in this pull request? Make several improvements in dataframe vectorized summarizer. 1. Make the summarizer return `Vector` type for all metrics (except "count"). Previously it returned the "WrappedArray" type, which was not very convenient. 2. Make `MetricsAggregate` inherit the `ImplicitCastInputTypes` trait, so it can check and implicitly cast input values. 3. Add a "weight" parameter for all single-metric methods. 4. Update doc and improve the example code in doc. 5. Simplified test cases. ## How was this patch tested? Test added and simplified. Author: WeichenXu Closes #19156 from WeichenXu123/improve_vec_summarizer. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d3ae3e1e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d3ae3e1e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d3ae3e1e Branch: refs/heads/master Commit: d3ae3e1e894f88a8500752d9633fe9ad00da5f20 Parents: 9c289a5 Author: WeichenXu Authored: Wed Dec 20 19:53:35 2017 -0800 Committer: Yanbo Liang Committed: Wed Dec 20 19:53:35 2017 -0800 -- .../org/apache/spark/ml/stat/Summarizer.scala | 128 --- .../spark/ml/stat/JavaSummarizerSuite.java | 64 .../apache/spark/ml/stat/SummarizerSuite.scala | 362 ++- 3 files changed, 341 insertions(+), 213 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d3ae3e1e/mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala b/mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala index cae41ed..9bed74a 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala @@ -24,7 +24,7 @@ import org.apache.spark.internal.Logging import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} import org.apache.spark.sql.Column import org.apache.spark.sql.catalyst.InternalRow -import org.apache.spark.sql.catalyst.expressions.{Expression, UnsafeArrayData} +import org.apache.spark.sql.catalyst.expressions.{Expression, ImplicitCastInputTypes, UnsafeArrayData} import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, Complete, TypedImperativeAggregate} import org.apache.spark.sql.functions.lit import org.apache.spark.sql.types._ @@ -41,7 +41,7 @@ sealed abstract class SummaryBuilder { /** * Returns an aggregate object that contains the summary of the column with the requested metrics. * @param featuresCol a column that contains features Vector object. - * @param weightCol a column that contains weight value. + * @param weightCol a column that contains weight value. Default weight is 1.0. * @return an aggregate column that contains the statistics. The exact content of this * structure is determined during the creation of the builder. */ @@ -50,6 +50,7 @@ sealed abstract class SummaryBuilder { @Since("2.3.0") def summary(featuresCol: Column): Column = summary(featuresCol, lit(1.0)) + } /** @@ -60,15 +61,18 @@ sealed abstract class SummaryBuilder { * This class lets users pick the statistics they would like to extract for a given column. Here is * an example in Scala: * {{{ - * val dataframe = ...
// Some dataframe containing a feature column - * val allStats = dataframe.select(Summarizer.metrics("min", "max").summary($"features")) - * val Row(Row(min_, max_)) = allStats.first() + * import org.apache.spark.ml.linalg._ + * import org.apache.spark.sql.Row + * val dataframe = ... // Some dataframe containing a feature column and a weight column + * val multiStatsDF = dataframe.select( + * Summarizer.metrics("min", "max", "count").summary($"features", $"weight")) + * val Row(Row(minVec, maxVec, count)) = multiStatsDF.first() * }}} * * If one wants to get a single metric, shortcuts are also available: * {{{ * val meanDF = dataframe.select(Summarizer.mean($"features")) - * val Row(mean_) = meanDF.first() + * val Row(meanVec) = meanDF.first() * }}} * * Note: Currently, the performance of this interface is about 2x~3x slower than using the RDD * interface. @@ -94,8 +98,7 @@ object Summarizer extends Logging { * - min: the minimum for each coefficient. * - normL2: the Euclidean norm for each coefficient. * - normL1: the L1 norm of each coefficient (sum of the absolute values). - * @param firstMetric the metric being
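For reference, a hedged sketch of the same summarizer interface through its Python wrapper (`pyspark.ml.stat.Summarizer` is assumed to be available in your Spark release; it mirrors the Scala API shown above):

```python
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Summarizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(Vectors.dense(1.0, 2.0), 1.0), (Vectors.dense(3.0, 4.0), 2.0)],
    ["features", "weight"])

# multi-metric summary returns a single struct column
df.select(Summarizer.metrics("min", "max", "count")
          .summary(df.features, df.weight).alias("stats")).show(truncate=False)

# single-metric shortcut, with the optional weight column
df.select(Summarizer.mean(df.features, df.weight)).show()
```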
spark git commit: [SPARK-22810][ML][PYSPARK] Expose Python API for LinearRegression with huber loss.
Repository: spark Updated Branches: refs/heads/master 0114c89d0 -> fb0562f34 [SPARK-22810][ML][PYSPARK] Expose Python API for LinearRegression with huber loss. ## What changes were proposed in this pull request? Expose Python API for _LinearRegression_ with _huber_ loss. ## How was this patch tested? Unit test. Author: Yanbo LiangCloses #19994 from yanboliang/spark-22810. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fb0562f3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fb0562f3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fb0562f3 Branch: refs/heads/master Commit: fb0562f34605cd27fd39d09e6664a46e55eac327 Parents: 0114c89 Author: Yanbo Liang Authored: Wed Dec 20 17:51:42 2017 -0800 Committer: Yanbo Liang Committed: Wed Dec 20 17:51:42 2017 -0800 -- .../pyspark/ml/param/_shared_params_code_gen.py | 3 +- python/pyspark/ml/param/shared.py | 23 +++ python/pyspark/ml/regression.py | 64 +++- python/pyspark/ml/tests.py | 21 +++ 4 files changed, 96 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fb0562f3/python/pyspark/ml/param/_shared_params_code_gen.py -- diff --git a/python/pyspark/ml/param/_shared_params_code_gen.py b/python/pyspark/ml/param/_shared_params_code_gen.py index 130d1a0..d55d209 100644 --- a/python/pyspark/ml/param/_shared_params_code_gen.py +++ b/python/pyspark/ml/param/_shared_params_code_gen.py @@ -154,7 +154,8 @@ if __name__ == "__main__": ("aggregationDepth", "suggested depth for treeAggregate (>= 2).", "2", "TypeConverters.toInt"), ("parallelism", "the number of threads to use when running parallel algorithms (>= 1).", - "1", "TypeConverters.toInt")] + "1", "TypeConverters.toInt"), +("loss", "the loss function to be optimized.", None, "TypeConverters.toString")] code = [] for name, doc, defaultValueStr, typeConverter in shared: http://git-wip-us.apache.org/repos/asf/spark/blob/fb0562f3/python/pyspark/ml/param/shared.py -- diff --git a/python/pyspark/ml/param/shared.py b/python/pyspark/ml/param/shared.py index 4041d9c..e5c5ddf 100644 --- a/python/pyspark/ml/param/shared.py +++ b/python/pyspark/ml/param/shared.py @@ -632,6 +632,29 @@ class HasParallelism(Params): return self.getOrDefault(self.parallelism) +class HasLoss(Params): +""" +Mixin for param loss: the loss function to be optimized. +""" + +loss = Param(Params._dummy(), "loss", "the loss function to be optimized.", typeConverter=TypeConverters.toString) + +def __init__(self): +super(HasLoss, self).__init__() + +def setLoss(self, value): +""" +Sets the value of :py:attr:`loss`. +""" +return self._set(loss=value) + +def getLoss(self): +""" +Gets the value of loss or its default value. +""" +return self.getOrDefault(self.loss) + + class DecisionTreeParams(Params): """ Mixin for Decision Tree parameters. 
http://git-wip-us.apache.org/repos/asf/spark/blob/fb0562f3/python/pyspark/ml/regression.py -- diff --git a/python/pyspark/ml/regression.py b/python/pyspark/ml/regression.py index 9d5b768..f0812bd 100644 --- a/python/pyspark/ml/regression.py +++ b/python/pyspark/ml/regression.py @@ -39,23 +39,26 @@ __all__ = ['AFTSurvivalRegression', 'AFTSurvivalRegressionModel', @inherit_doc class LinearRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasMaxIter, HasRegParam, HasTol, HasElasticNetParam, HasFitIntercept, - HasStandardization, HasSolver, HasWeightCol, HasAggregationDepth, + HasStandardization, HasSolver, HasWeightCol, HasAggregationDepth, HasLoss, JavaMLWritable, JavaMLReadable): """ Linear regression. -The learning objective is to minimize the squared error, with regularization. -The specific squared error loss function used is: L = 1/2n ||A coefficients - y||^2^ +The learning objective is to minimize the specified loss function, with regularization. +This supports two kinds of loss: -This supports multiple types of regularization: - - * none (a.k.a. ordinary least squares) +* squaredError (a.k.a squared loss) +* huber (a hybrid of squared error for relatively small errors and absolute error for \ +relatively large ones, and we estimate the scale parameter
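A hedged usage sketch of the Python API exposed by this change; the `epsilon` knob that controls the squared-to-absolute switchover is assumed from the underlying Scala estimator, and the tiny DataFrame is only for illustration:

```python
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
train_df = spark.createDataFrame(
    [(1.0, Vectors.dense(0.0)), (2.0, Vectors.dense(1.0)),
     (3.1, Vectors.dense(2.0)), (100.0, Vectors.dense(3.0))],  # last row: outlier
    ["label", "features"])

# loss="huber" makes the fit robust to the outlier row
lr = LinearRegression(loss="huber", epsilon=1.35, maxIter=50, regParam=0.0)
model = lr.fit(train_df)
print(model.coefficients, model.intercept, model.scale)
```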
spark git commit: [SPARK-3181][ML] Implement huber loss for LinearRegression.
Repository: spark Updated Branches: refs/heads/master 2a29a60da -> 1e44dd004 [SPARK-3181][ML] Implement huber loss for LinearRegression. ## What changes were proposed in this pull request? MLlib ```LinearRegression``` supports _huber_ loss in addition to _leastSquares_ loss. The huber loss objective function is: ![image](https://user-images.githubusercontent.com/1962026/29554124-9544d198-8750-11e7-8afa-33579ec419d5.png) Refer Eq.(6) and Eq.(8) in [A robust hybrid of lasso and ridge regression](http://statweb.stanford.edu/~owen/reports/hhu.pdf). This objective is jointly convex as a function of (w, σ) ∈ R × (0, ∞), so we can use L-BFGS-B to solve it. The current implementation is a straightforward port of Python scikit-learn's [```HuberRegressor```](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.HuberRegressor.html). There are some differences: * We use mean loss (```lossSum/weightSum```), but sklearn uses total loss (```lossSum```). * We multiply the loss function and L2 regularization by 1/2. Multiplying the whole formula by a constant factor does not affect the result; we just keep consistent with the _leastSquares_ loss. So if fitting w/o regularization, MLlib and sklearn produce the same output. If fitting w/ regularization, MLlib should set ```regParam``` divided by the number of instances to match the output of sklearn. ## How was this patch tested? Unit tests. Author: Yanbo Liang Closes #19020 from yanboliang/spark-3181. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1e44dd00 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1e44dd00 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1e44dd00 Branch: refs/heads/master Commit: 1e44dd004425040912f2cf16362d2c13f12e1689 Parents: 2a29a60 Author: Yanbo Liang Authored: Wed Dec 13 21:19:14 2017 -0800 Committer: Yanbo Liang Committed: Wed Dec 13 21:19:14 2017 -0800 -- .../ml/optim/aggregator/HuberAggregator.scala | 150 ++ .../ml/param/shared/SharedParamsCodeGen.scala | 3 +- .../spark/ml/param/shared/sharedParams.scala| 17 ++ .../spark/ml/regression/LinearRegression.scala | 299 +++ .../optim/aggregator/HuberAggregatorSuite.scala | 170 +++ .../ml/regression/LinearRegressionSuite.scala | 244 ++- project/MimaExcludes.scala | 5 + 7 files changed, 823 insertions(+), 65 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1e44dd00/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala b/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala new file mode 100644 index 000..13f64d2 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala @@ -0,0 +1,150 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.optim.aggregator + +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.linalg.Vector + +/** + * HuberAggregator computes the gradient and loss for a huber loss function, + * as used in robust regression for samples in sparse or dense vector in an online fashion. + * + * The huber loss function based on: + * http://statweb.stanford.edu/~owen/reports/hhu.pdf;>Art B. Owen (2006), + * A robust hybrid of lasso and ridge regression. + * + * Two HuberAggregator can be merged together to have a summary of loss and gradient of + * the corresponding joint dataset. + * + * The huber loss function is given by + * + * + * $$ + * \begin{align} + * \min_{w, \sigma}\frac{1}{2n}{\sum_{i=1}^n\left(\sigma + + * H_m\left(\frac{X_{i}w - y_{i}}{\sigma}\right)\sigma\right) + \frac{1}{2}\lambda {||w||_2}^2} + * \end{align} + * $$ + * + * + *
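As a plain-Python illustration of the hybrid penalty H_m referenced in the formula above (not the MLlib implementation, which aggregates losses and gradients distributively):

```python
def huber_penalty(residual, m=1.35):
    # quadratic for |residual| <= m, linear beyond, with the two pieces
    # matched so the penalty and its derivative are continuous at |residual| = m
    r = abs(residual)
    return 0.5 * r * r if r <= m else m * (r - 0.5 * m)
```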
spark git commit: [SPARK-21087][ML][FOLLOWUP] Sync SharedParamsCodeGen and sharedParams.
Repository: spark Updated Branches: refs/heads/master 17cdabb88 -> b03af8b58 [SPARK-21087][ML][FOLLOWUP] Sync SharedParamsCodeGen and sharedParams. ## What changes were proposed in this pull request? #19208 modified ```sharedParams.scala``` directly, but the change was not generated by ```SharedParamsCodeGen.scala```. This introduced a mismatch between them. ## How was this patch tested? Existing test. Author: Yanbo Liang Closes #19958 from yanboliang/spark-21087. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b03af8b5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b03af8b5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b03af8b5 Branch: refs/heads/master Commit: b03af8b582b9b71b09eaf3a1c01d1b3ef5f072e8 Parents: 17cdabb Author: Yanbo Liang Authored: Tue Dec 12 17:37:01 2017 -0800 Committer: Yanbo Liang Committed: Tue Dec 12 17:37:01 2017 -0800 -- .../spark/ml/param/shared/SharedParamsCodeGen.scala | 8 .../org/apache/spark/ml/param/shared/sharedParams.scala | 10 ++ 2 files changed, 10 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b03af8b5/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala b/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala index c540629..a267bbc 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala @@ -84,10 +84,10 @@ private[shared] object SharedParamsCodeGen { ParamDesc[String]("solver", "the solver algorithm for optimization", finalFields = false), ParamDesc[Int]("aggregationDepth", "suggested depth for treeAggregate (>= 2)", Some("2"), isValid = "ParamValidators.gtEq(2)", isExpertParam = true), - ParamDesc[Boolean]("collectSubModels", "If set to false, then only the single best " + -"sub-model will be available after fitting. If set to true, then all sub-models will be " + -"available. Warning: For large models, collecting all sub-models can cause OOMs on the " + -"Spark driver.", + ParamDesc[Boolean]("collectSubModels", "whether to collect a list of sub-models trained " + "during tuning. If set to false, then only the single best sub-model will be available " + "after fitting. If set to true, then all sub-models will be available. Warning: For " + "large models, collecting all sub-models can cause OOMs on the Spark driver", Some("false"), isExpertParam = true) ) http://git-wip-us.apache.org/repos/asf/spark/blob/b03af8b5/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala b/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala index 34aa38a..0004f08 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala @@ -470,15 +470,17 @@ trait HasAggregationDepth extends Params { } /** - * Trait for shared param collectSubModels (default: false). + * Trait for shared param collectSubModels (default: false). This trait may be changed or + * removed between minor versions. */ -private[ml] trait HasCollectSubModels extends Params { +@DeveloperApi +trait HasCollectSubModels extends Params { /** - * Param for whether to collect a list of sub-models trained during tuning.
+ * Param for whether to collect a list of sub-models trained during tuning. If set to false, then only the single best sub-model will be available after fitting. If set to true, then all sub-models will be available. Warning: For large models, collecting all sub-models can cause OOMs on the Spark driver. * @group expertParam */ - final val collectSubModels: BooleanParam = new BooleanParam(this, "collectSubModels", "whether to collect a list of sub-models trained during tuning") + final val collectSubModels: BooleanParam = new BooleanParam(this, "collectSubModels", "whether to collect a list of sub-models trained during tuning. If set to false, then only the single best sub-model will be available after fitting. If set to true, then all sub-models will be available. Warning: For large models, collecting all sub-models can cause OOMs on the Spark driver") setDefault(collectSubModels, false)
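A hedged sketch of how the `collectSubModels` param is meant to be used from the tuning API, shown through the Python wrapper, which is assumed to expose the param in your Spark release; `train_df` is a placeholder DataFrame:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    collectSubModels=True)  # keeps all sub-models; can OOM the driver
cvModel = cv.fit(train_df)  # train_df is a placeholder DataFrame
print(cvModel.subModels)    # one list of sub-models per fold
```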
spark git commit: [SPARK-22289][ML] Add JSON support for Matrix parameters (LR with coefficients bound)
Repository: spark Updated Branches: refs/heads/branch-2.2 9e2d96d1d -> 00cdb38dc [SPARK-22289][ML] Add JSON support for Matrix parameters (LR with coefficients bound) ## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-22289 add JSON encoding/decoding for Param[Matrix]. The issue was reported by Nic Eggert during saving LR model with LowerBoundsOnCoefficients. There're two ways to resolve this as I see: 1. Support save/load on LogisticRegressionParams, and also adjust the save/load in LogisticRegression and LogisticRegressionModel. 2. Directly support Matrix in Param.jsonEncode, similar to what we have done for Vector. After some discussion in jira, we prefer the fix to support Matrix as a valid Param type, for simplicity and convenience for other classes. Note that in the implementation, I added a "class" field in the JSON object to match different JSON converters when loading, which is for preciseness and future extension. ## How was this patch tested? new unit test to cover the LR case and JsonMatrixConverter Author: Yuhao YangCloses #19525 from hhbyyh/lrsave. (cherry picked from commit 10c27a6559803797e89c28ced11c1087127b82eb) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/00cdb38d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/00cdb38d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/00cdb38d Branch: refs/heads/branch-2.2 Commit: 00cdb38dcd0f617de7f0559214a8b1a35e9b179c Parents: 9e2d96d Author: Yuhao Yang Authored: Tue Dec 12 11:27:01 2017 -0800 Committer: Yanbo Liang Committed: Tue Dec 12 11:27:40 2017 -0800 -- .../org/apache/spark/ml/linalg/Matrices.scala | 7 ++ .../spark/ml/linalg/JsonMatrixConverter.scala | 79 .../org/apache/spark/ml/param/params.scala | 36 +++-- .../LogisticRegressionSuite.scala | 11 +++ .../ml/linalg/JsonMatrixConverterSuite.scala| 45 +++ 5 files changed, 170 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/00cdb38d/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala -- diff --git a/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala b/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala index 07f3bc2..ed3e493 100644 --- a/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala +++ b/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala @@ -476,6 +476,9 @@ class DenseMatrix @Since("2.0.0") ( @Since("2.0.0") object DenseMatrix { + private[ml] def unapply(dm: DenseMatrix): Option[(Int, Int, Array[Double], Boolean)] = +Some((dm.numRows, dm.numCols, dm.values, dm.isTransposed)) + /** * Generate a `DenseMatrix` consisting of zeros. * @param numRows number of rows of the matrix @@ -827,6 +830,10 @@ class SparseMatrix @Since("2.0.0") ( @Since("2.0.0") object SparseMatrix { + private[ml] def unapply( + sm: SparseMatrix): Option[(Int, Int, Array[Int], Array[Int], Array[Double], Boolean)] = +Some((sm.numRows, sm.numCols, sm.colPtrs, sm.rowIndices, sm.values, sm.isTransposed)) + /** * Generate a `SparseMatrix` from Coordinate List (COO) format. Input must be an array of * (i, j, value) tuples. 
Entries that have duplicate values of i and j are http://git-wip-us.apache.org/repos/asf/spark/blob/00cdb38d/mllib/src/main/scala/org/apache/spark/ml/linalg/JsonMatrixConverter.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/linalg/JsonMatrixConverter.scala b/mllib/src/main/scala/org/apache/spark/ml/linalg/JsonMatrixConverter.scala new file mode 100644 index 000..0bee643 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/linalg/JsonMatrixConverter.scala @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific
spark git commit: [SPARK-22289][ML] Add JSON support for Matrix parameters (LR with coefficients bound)
Repository: spark Updated Branches: refs/heads/master e6dc5f280 -> 10c27a655 [SPARK-22289][ML] Add JSON support for Matrix parameters (LR with coefficients bound) ## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-22289 add JSON encoding/decoding for Param[Matrix]. The issue was reported by Nic Eggert during saving LR model with LowerBoundsOnCoefficients. There're two ways to resolve this as I see: 1. Support save/load on LogisticRegressionParams, and also adjust the save/load in LogisticRegression and LogisticRegressionModel. 2. Directly support Matrix in Param.jsonEncode, similar to what we have done for Vector. After some discussion in jira, we prefer the fix to support Matrix as a valid Param type, for simplicity and convenience for other classes. Note that in the implementation, I added a "class" field in the JSON object to match different JSON converters when loading, which is for preciseness and future extension. ## How was this patch tested? new unit test to cover the LR case and JsonMatrixConverter Author: Yuhao YangCloses #19525 from hhbyyh/lrsave. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/10c27a65 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/10c27a65 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/10c27a65 Branch: refs/heads/master Commit: 10c27a6559803797e89c28ced11c1087127b82eb Parents: e6dc5f2 Author: Yuhao Yang Authored: Tue Dec 12 11:27:01 2017 -0800 Committer: Yanbo Liang Committed: Tue Dec 12 11:27:01 2017 -0800 -- .../org/apache/spark/ml/linalg/Matrices.scala | 7 ++ .../spark/ml/linalg/JsonMatrixConverter.scala | 79 .../org/apache/spark/ml/param/params.scala | 36 +++-- .../LogisticRegressionSuite.scala | 11 +++ .../ml/linalg/JsonMatrixConverterSuite.scala| 45 +++ 5 files changed, 170 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/10c27a65/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala -- diff --git a/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala b/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala index 66c5362..14428c6 100644 --- a/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala +++ b/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala @@ -476,6 +476,9 @@ class DenseMatrix @Since("2.0.0") ( @Since("2.0.0") object DenseMatrix { + private[ml] def unapply(dm: DenseMatrix): Option[(Int, Int, Array[Double], Boolean)] = +Some((dm.numRows, dm.numCols, dm.values, dm.isTransposed)) + /** * Generate a `DenseMatrix` consisting of zeros. * @param numRows number of rows of the matrix @@ -827,6 +830,10 @@ class SparseMatrix @Since("2.0.0") ( @Since("2.0.0") object SparseMatrix { + private[ml] def unapply( + sm: SparseMatrix): Option[(Int, Int, Array[Int], Array[Int], Array[Double], Boolean)] = +Some((sm.numRows, sm.numCols, sm.colPtrs, sm.rowIndices, sm.values, sm.isTransposed)) + /** * Generate a `SparseMatrix` from Coordinate List (COO) format. Input must be an array of * (i, j, value) tuples. 
Entries that have duplicate values of i and j are http://git-wip-us.apache.org/repos/asf/spark/blob/10c27a65/mllib/src/main/scala/org/apache/spark/ml/linalg/JsonMatrixConverter.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/linalg/JsonMatrixConverter.scala b/mllib/src/main/scala/org/apache/spark/ml/linalg/JsonMatrixConverter.scala new file mode 100644 index 000..0bee643 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/linalg/JsonMatrixConverter.scala @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.linalg + +import
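A hedged sketch of the scenario this fix enables: persisting an estimator whose `lowerBoundsOnCoefficients` is a `Matrix` param. The Python wrapper delegates to the same Scala `jsonEncode` path; the feature count of 3 and the save path are assumptions for illustration:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Matrices

# binomial LR expects a 1 x numFeatures bounds matrix (3 features assumed here)
lr = LogisticRegression(
    lowerBoundsOnCoefficients=Matrices.dense(1, 3, [0.0, 0.0, 0.0]))

lr.write().overwrite().save("/tmp/bounded_lr")  # Matrix param is JSON-encoded
loaded = LogisticRegression.load("/tmp/bounded_lr")
print(loaded.getLowerBoundsOnCoefficients())
```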
spark git commit: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSuite test data to data/mllib.
Repository: spark Updated Branches: refs/heads/master 7475a9655 -> 3da3d7635 [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSuite test data to data/mllib. ## What changes were proposed in this pull request? Move ```ClusteringEvaluatorSuite``` test data(iris) to data/mllib, to prevent from re-creating a new folder. ## How was this patch tested? Existing tests. Author: Yanbo LiangCloses #19648 from yanboliang/spark-14516. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3da3d763 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3da3d763 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3da3d763 Branch: refs/heads/master Commit: 3da3d76352cc471252a54088cc55208bb4ea5b3a Parents: 7475a96 Author: Yanbo Liang Authored: Tue Nov 7 20:07:30 2017 -0800 Committer: Yanbo Liang Committed: Tue Nov 7 20:07:30 2017 -0800 -- data/mllib/iris_libsvm.txt | 150 +++ mllib/src/test/resources/test-data/iris.libsvm | 150 --- .../evaluation/ClusteringEvaluatorSuite.scala | 30 ++-- 3 files changed, 161 insertions(+), 169 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3da3d763/data/mllib/iris_libsvm.txt -- diff --git a/data/mllib/iris_libsvm.txt b/data/mllib/iris_libsvm.txt new file mode 100644 index 000..db95901 --- /dev/null +++ b/data/mllib/iris_libsvm.txt @@ -0,0 +1,150 @@ +0.0 1:5.1 2:3.5 3:1.4 4:0.2 +0.0 1:4.9 2:3.0 3:1.4 4:0.2 +0.0 1:4.7 2:3.2 3:1.3 4:0.2 +0.0 1:4.6 2:3.1 3:1.5 4:0.2 +0.0 1:5.0 2:3.6 3:1.4 4:0.2 +0.0 1:5.4 2:3.9 3:1.7 4:0.4 +0.0 1:4.6 2:3.4 3:1.4 4:0.3 +0.0 1:5.0 2:3.4 3:1.5 4:0.2 +0.0 1:4.4 2:2.9 3:1.4 4:0.2 +0.0 1:4.9 2:3.1 3:1.5 4:0.1 +0.0 1:5.4 2:3.7 3:1.5 4:0.2 +0.0 1:4.8 2:3.4 3:1.6 4:0.2 +0.0 1:4.8 2:3.0 3:1.4 4:0.1 +0.0 1:4.3 2:3.0 3:1.1 4:0.1 +0.0 1:5.8 2:4.0 3:1.2 4:0.2 +0.0 1:5.7 2:4.4 3:1.5 4:0.4 +0.0 1:5.4 2:3.9 3:1.3 4:0.4 +0.0 1:5.1 2:3.5 3:1.4 4:0.3 +0.0 1:5.7 2:3.8 3:1.7 4:0.3 +0.0 1:5.1 2:3.8 3:1.5 4:0.3 +0.0 1:5.4 2:3.4 3:1.7 4:0.2 +0.0 1:5.1 2:3.7 3:1.5 4:0.4 +0.0 1:4.6 2:3.6 3:1.0 4:0.2 +0.0 1:5.1 2:3.3 3:1.7 4:0.5 +0.0 1:4.8 2:3.4 3:1.9 4:0.2 +0.0 1:5.0 2:3.0 3:1.6 4:0.2 +0.0 1:5.0 2:3.4 3:1.6 4:0.4 +0.0 1:5.2 2:3.5 3:1.5 4:0.2 +0.0 1:5.2 2:3.4 3:1.4 4:0.2 +0.0 1:4.7 2:3.2 3:1.6 4:0.2 +0.0 1:4.8 2:3.1 3:1.6 4:0.2 +0.0 1:5.4 2:3.4 3:1.5 4:0.4 +0.0 1:5.2 2:4.1 3:1.5 4:0.1 +0.0 1:5.5 2:4.2 3:1.4 4:0.2 +0.0 1:4.9 2:3.1 3:1.5 4:0.1 +0.0 1:5.0 2:3.2 3:1.2 4:0.2 +0.0 1:5.5 2:3.5 3:1.3 4:0.2 +0.0 1:4.9 2:3.1 3:1.5 4:0.1 +0.0 1:4.4 2:3.0 3:1.3 4:0.2 +0.0 1:5.1 2:3.4 3:1.5 4:0.2 +0.0 1:5.0 2:3.5 3:1.3 4:0.3 +0.0 1:4.5 2:2.3 3:1.3 4:0.3 +0.0 1:4.4 2:3.2 3:1.3 4:0.2 +0.0 1:5.0 2:3.5 3:1.6 4:0.6 +0.0 1:5.1 2:3.8 3:1.9 4:0.4 +0.0 1:4.8 2:3.0 3:1.4 4:0.3 +0.0 1:5.1 2:3.8 3:1.6 4:0.2 +0.0 1:4.6 2:3.2 3:1.4 4:0.2 +0.0 1:5.3 2:3.7 3:1.5 4:0.2 +0.0 1:5.0 2:3.3 3:1.4 4:0.2 +1.0 1:7.0 2:3.2 3:4.7 4:1.4 +1.0 1:6.4 2:3.2 3:4.5 4:1.5 +1.0 1:6.9 2:3.1 3:4.9 4:1.5 +1.0 1:5.5 2:2.3 3:4.0 4:1.3 +1.0 1:6.5 2:2.8 3:4.6 4:1.5 +1.0 1:5.7 2:2.8 3:4.5 4:1.3 +1.0 1:6.3 2:3.3 3:4.7 4:1.6 +1.0 1:4.9 2:2.4 3:3.3 4:1.0 +1.0 1:6.6 2:2.9 3:4.6 4:1.3 +1.0 1:5.2 2:2.7 3:3.9 4:1.4 +1.0 1:5.0 2:2.0 3:3.5 4:1.0 +1.0 1:5.9 2:3.0 3:4.2 4:1.5 +1.0 1:6.0 2:2.2 3:4.0 4:1.0 +1.0 1:6.1 2:2.9 3:4.7 4:1.4 +1.0 1:5.6 2:2.9 3:3.6 4:1.3 +1.0 1:6.7 2:3.1 3:4.4 4:1.4 +1.0 1:5.6 2:3.0 3:4.5 4:1.5 +1.0 1:5.8 2:2.7 3:4.1 4:1.0 +1.0 1:6.2 2:2.2 3:4.5 4:1.5 +1.0 1:5.6 2:2.5 3:3.9 4:1.1 +1.0 1:5.9 2:3.2 3:4.8 4:1.8 +1.0 1:6.1 2:2.8 3:4.0 4:1.3 +1.0 1:6.3 2:2.5 3:4.9 4:1.5 +1.0 1:6.1 2:2.8 3:4.7 4:1.2 +1.0 1:6.4 2:2.9 3:4.3 4:1.3 
+1.0 1:6.6 2:3.0 3:4.4 4:1.4 +1.0 1:6.8 2:2.8 3:4.8 4:1.4 +1.0 1:6.7 2:3.0 3:5.0 4:1.7 +1.0 1:6.0 2:2.9 3:4.5 4:1.5 +1.0 1:5.7 2:2.6 3:3.5 4:1.0 +1.0 1:5.5 2:2.4 3:3.8 4:1.1 +1.0 1:5.5 2:2.4 3:3.7 4:1.0 +1.0 1:5.8 2:2.7 3:3.9 4:1.2 +1.0 1:6.0 2:2.7 3:5.1 4:1.6 +1.0 1:5.4 2:3.0 3:4.5 4:1.5 +1.0 1:6.0 2:3.4 3:4.5 4:1.6 +1.0 1:6.7 2:3.1 3:4.7 4:1.5 +1.0 1:6.3 2:2.3 3:4.4 4:1.3 +1.0 1:5.6 2:3.0 3:4.1 4:1.3 +1.0 1:5.5 2:2.5 3:4.0 4:1.3 +1.0 1:5.5 2:2.6 3:4.4 4:1.2 +1.0 1:6.1 2:3.0 3:4.6 4:1.4 +1.0 1:5.8 2:2.6 3:4.0 4:1.2 +1.0 1:5.0 2:2.3 3:3.3 4:1.0 +1.0 1:5.6 2:2.7 3:4.2 4:1.3 +1.0 1:5.7 2:3.0 3:4.2 4:1.2 +1.0 1:5.7 2:2.9 3:4.2 4:1.3 +1.0 1:6.2 2:2.9 3:4.3 4:1.3 +1.0 1:5.1 2:2.5 3:3.0 4:1.1 +1.0 1:5.7 2:2.8 3:4.1 4:1.3 +2.0 1:6.3 2:3.3 3:6.0 4:2.5 +2.0 1:5.8 2:2.7 3:5.1 4:1.9 +2.0 1:7.1 2:3.0 3:5.9 4:2.1 +2.0 1:6.3 2:2.9 3:5.6 4:1.8 +2.0 1:6.5 2:3.0 3:5.8 4:2.2 +2.0 1:7.6 2:3.0 3:6.6 4:2.1 +2.0 1:4.9 2:2.5 3:4.5 4:1.7 +2.0 1:7.3 2:2.9 3:6.3 4:1.8 +2.0 1:6.7 2:2.5 3:5.8 4:1.8 +2.0 1:7.2 2:3.6 3:6.1 4:2.5 +2.0 1:6.5 2:3.2 3:5.1 4:2.0 +2.0 1:6.4 2:2.7 3:5.3 4:1.9 +2.0 1:6.8 2:3.0 3:5.5
spark git commit: [SPARK-21981][PYTHON][ML] Added Python interface for ClusteringEvaluator
Repository: spark Updated Branches: refs/heads/master fedf6961b -> 5ac96854c [SPARK-21981][PYTHON][ML] Added Python interface for ClusteringEvaluator ## What changes were proposed in this pull request? Added Python interface for ClusteringEvaluator ## How was this patch tested? Manual test, eg. the example Python code in the comments. cc yanboliang Author: Marco GaidoAuthor: Marco Gaido Closes #19204 from mgaido91/SPARK-21981. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5ac96854 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5ac96854 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5ac96854 Branch: refs/heads/master Commit: 5ac96854cc6186fa2dad602d0906ff2705e3f610 Parents: fedf696 Author: Marco Gaido Authored: Fri Sep 22 13:12:33 2017 +0800 Committer: Yanbo Liang Committed: Fri Sep 22 13:12:33 2017 +0800 -- python/pyspark/ml/evaluation.py | 76 +++- 1 file changed, 74 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5ac96854/python/pyspark/ml/evaluation.py -- diff --git a/python/pyspark/ml/evaluation.py b/python/pyspark/ml/evaluation.py index 09cdf9b..aa8dbe7 100644 --- a/python/pyspark/ml/evaluation.py +++ b/python/pyspark/ml/evaluation.py @@ -20,12 +20,13 @@ from abc import abstractmethod, ABCMeta from pyspark import since, keyword_only from pyspark.ml.wrapper import JavaParams from pyspark.ml.param import Param, Params, TypeConverters -from pyspark.ml.param.shared import HasLabelCol, HasPredictionCol, HasRawPredictionCol +from pyspark.ml.param.shared import HasLabelCol, HasPredictionCol, HasRawPredictionCol, \ +HasFeaturesCol from pyspark.ml.common import inherit_doc from pyspark.ml.util import JavaMLReadable, JavaMLWritable __all__ = ['Evaluator', 'BinaryClassificationEvaluator', 'RegressionEvaluator', - 'MulticlassClassificationEvaluator'] + 'MulticlassClassificationEvaluator', 'ClusteringEvaluator'] @inherit_doc @@ -325,6 +326,77 @@ class MulticlassClassificationEvaluator(JavaEvaluator, HasLabelCol, HasPredictio kwargs = self._input_kwargs return self._set(**kwargs) + +@inherit_doc +class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol, + JavaMLReadable, JavaMLWritable): +""" +.. note:: Experimental + +Evaluator for Clustering results, which expects two input +columns: prediction and features. + +>>> from pyspark.ml.linalg import Vectors +>>> featureAndPredictions = map(lambda x: (Vectors.dense(x[0]), x[1]), +... [([0.0, 0.5], 0.0), ([0.5, 0.0], 0.0), ([10.0, 11.0], 1.0), +... ([10.5, 11.5], 1.0), ([1.0, 1.0], 0.0), ([8.0, 6.0], 1.0)]) +>>> dataset = spark.createDataFrame(featureAndPredictions, ["features", "prediction"]) +... +>>> evaluator = ClusteringEvaluator(predictionCol="prediction") +>>> evaluator.evaluate(dataset) +0.9079... +>>> ce_path = temp_path + "/ce" +>>> evaluator.save(ce_path) +>>> evaluator2 = ClusteringEvaluator.load(ce_path) +>>> str(evaluator2.getPredictionCol()) +'prediction' + +.. 
versionadded:: 2.3.0 +""" +metricName = Param(Params._dummy(), "metricName", + "metric name in evaluation (silhouette)", + typeConverter=TypeConverters.toString) + +@keyword_only +def __init__(self, predictionCol="prediction", featuresCol="features", + metricName="silhouette"): +""" +__init__(self, predictionCol="prediction", featuresCol="features", \ + metricName="silhouette") +""" +super(ClusteringEvaluator, self).__init__() +self._java_obj = self._new_java_obj( +"org.apache.spark.ml.evaluation.ClusteringEvaluator", self.uid) +self._setDefault(metricName="silhouette") +kwargs = self._input_kwargs +self._set(**kwargs) + +@since("2.3.0") +def setMetricName(self, value): +""" +Sets the value of :py:attr:`metricName`. +""" +return self._set(metricName=value) + +@since("2.3.0") +def getMetricName(self): +""" +Gets the value of metricName or its default value. +""" +return self.getOrDefault(self.metricName) + +@keyword_only +@since("2.3.0") +def setParams(self, predictionCol="prediction", featuresCol="features", + metricName="silhouette"): +""" +setParams(self, predictionCol="prediction", featuresCol="features", \ + metricName="silhouette") +
spark git commit: [MINOR][ML] Remove unnecessary default value setting for evaluators.
Repository: spark Updated Branches: refs/heads/master 8319432af -> 2f962422a [MINOR][ML] Remove unnecessary default value setting for evaluators. ## What changes were proposed in this pull request? Remove unnecessary default value setting for all evaluators, as we have set them in corresponding _HasXXX_ base classes. ## How was this patch tested? Existing tests. Author: Yanbo Liang Closes #19262 from yanboliang/evaluation. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2f962422 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2f962422 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2f962422 Branch: refs/heads/master Commit: 2f962422a25020582c915e15819f91f43c0b9d68 Parents: 8319432 Author: Yanbo Liang Authored: Tue Sep 19 22:22:35 2017 +0800 Committer: Yanbo Liang Committed: Tue Sep 19 22:22:35 2017 +0800 -- python/pyspark/ml/evaluation.py | 9 +++-- 1 file changed, 3 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2f962422/python/pyspark/ml/evaluation.py -- diff --git a/python/pyspark/ml/evaluation.py b/python/pyspark/ml/evaluation.py index 7cb8d62..09cdf9b 100644 --- a/python/pyspark/ml/evaluation.py +++ b/python/pyspark/ml/evaluation.py @@ -146,8 +146,7 @@ class BinaryClassificationEvaluator(JavaEvaluator, HasLabelCol, HasRawPrediction super(BinaryClassificationEvaluator, self).__init__() self._java_obj = self._new_java_obj( "org.apache.spark.ml.evaluation.BinaryClassificationEvaluator", self.uid) -self._setDefault(rawPredictionCol="rawPrediction", labelCol="label", - metricName="areaUnderROC") +self._setDefault(metricName="areaUnderROC") kwargs = self._input_kwargs self._set(**kwargs) @@ -224,8 +223,7 @@ class RegressionEvaluator(JavaEvaluator, HasLabelCol, HasPredictionCol, super(RegressionEvaluator, self).__init__() self._java_obj = self._new_java_obj( "org.apache.spark.ml.evaluation.RegressionEvaluator", self.uid) -self._setDefault(predictionCol="prediction", labelCol="label", - metricName="rmse") +self._setDefault(metricName="rmse") kwargs = self._input_kwargs self._set(**kwargs) @@ -297,8 +295,7 @@ class MulticlassClassificationEvaluator(JavaEvaluator, HasLabelCol, HasPredictio super(MulticlassClassificationEvaluator, self).__init__() self._java_obj = self._new_java_obj( "org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator", self.uid) -self._setDefault(predictionCol="prediction", labelCol="label", - metricName="f1") +self._setDefault(metricName="f1") kwargs = self._input_kwargs self._set(**kwargs)
spark git commit: [SPARK-18608][ML][FOLLOWUP] Fix double caching for PySpark OneVsRest.
Repository: spark Updated Branches: refs/heads/branch-2.2 3a692e355 -> 51e5a821d [SPARK-18608][ML][FOLLOWUP] Fix double caching for PySpark OneVsRest. ## What changes were proposed in this pull request? #19197 fixed double caching for MLlib algorithms, but missed PySpark ```OneVsRest```; this PR fixes it. ## How was this patch tested? Existing tests. Author: Yanbo Liang Closes #19220 from yanboliang/SPARK-18608. (cherry picked from commit c76153cc7dd25b8de5266fe119095066be7f78f5) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/51e5a821 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/51e5a821 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/51e5a821 Branch: refs/heads/branch-2.2 Commit: 51e5a821dcaa1d5f529afafc88cb8cfb4ad48e09 Parents: 3a692e3 Author: Yanbo Liang Authored: Thu Sep 14 14:09:44 2017 +0800 Committer: Yanbo Liang Committed: Thu Sep 14 14:10:10 2017 +0800 -- python/pyspark/ml/classification.py | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/51e5a821/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index 80bb054..ea6800a 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -1576,8 +1576,7 @@ class OneVsRest(Estimator, OneVsRestParams, MLReadable, MLWritable): multiclassLabeled = dataset.select(labelCol, featuresCol) # persist if underlying dataset is not persistent. -handlePersistence = \ -dataset.rdd.getStorageLevel() == StorageLevel(False, False, False, False) +handlePersistence = dataset.storageLevel == StorageLevel(False, False, False, False) if handlePersistence: multiclassLabeled.persist(StorageLevel.MEMORY_AND_DISK) @@ -1690,8 +1689,7 @@ class OneVsRestModel(Model, OneVsRestParams, MLReadable, MLWritable): newDataset = dataset.withColumn(accColName, initUDF(dataset[origCols[0]])) # persist if underlying dataset is not persistent. -handlePersistence = \ -dataset.rdd.getStorageLevel() == StorageLevel(False, False, False, False) +handlePersistence = dataset.storageLevel == StorageLevel(False, False, False, False) if handlePersistence: newDataset.persist(StorageLevel.MEMORY_AND_DISK)
spark git commit: [SPARK-18608][ML][FOLLOWUP] Fix double caching for PySpark OneVsRest.
Repository: spark Updated Branches: refs/heads/master 66cb72d7b -> c76153cc7 [SPARK-18608][ML][FOLLOWUP] Fix double caching for PySpark OneVsRest. ## What changes were proposed in this pull request? #19197 fixed double caching for MLlib algorithms, but missed PySpark ```OneVsRest```; this PR fixes it. ## How was this patch tested? Existing tests. Author: Yanbo Liang Closes #19220 from yanboliang/SPARK-18608. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c76153cc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c76153cc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c76153cc Branch: refs/heads/master Commit: c76153cc7dd25b8de5266fe119095066be7f78f5 Parents: 66cb72d Author: Yanbo Liang Authored: Thu Sep 14 14:09:44 2017 +0800 Committer: Yanbo Liang Committed: Thu Sep 14 14:09:44 2017 +0800 -- python/pyspark/ml/classification.py | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c76153cc/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index 0caafa6..27ad1e8 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -1773,8 +1773,7 @@ class OneVsRest(Estimator, OneVsRestParams, HasParallelism, JavaMLReadable, Java multiclassLabeled = dataset.select(labelCol, featuresCol) # persist if underlying dataset is not persistent. -handlePersistence = \ -dataset.rdd.getStorageLevel() == StorageLevel(False, False, False, False) +handlePersistence = dataset.storageLevel == StorageLevel(False, False, False, False) if handlePersistence: multiclassLabeled.persist(StorageLevel.MEMORY_AND_DISK) @@ -1928,8 +1927,7 @@ class OneVsRestModel(Model, OneVsRestParams, JavaMLReadable, JavaMLWritable): newDataset = dataset.withColumn(accColName, initUDF(dataset[origCols[0]])) # persist if underlying dataset is not persistent. -handlePersistence = \ -dataset.rdd.getStorageLevel() == StorageLevel(False, False, False, False) +handlePersistence = dataset.storageLevel == StorageLevel(False, False, False, False) if handlePersistence: newDataset.persist(StorageLevel.MEMORY_AND_DISK)
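The core of the fix is the caching guard: `dataset.storageLevel` sees a DataFrame-level `cache()`, while the old `dataset.rdd.getStorageLevel()` did not, so an already-cached input could be persisted a second time. A minimal sketch of the corrected pattern (the training body here is a stand-in, not the OneVsRest code):

```python
from pyspark import StorageLevel

def fit_with_guard(dataset):
    # StorageLevel(False, False, False, False) means "not persisted at all"
    handle_persistence = dataset.storageLevel == StorageLevel(False, False, False, False)
    if handle_persistence:
        dataset.persist(StorageLevel.MEMORY_AND_DISK)
    try:
        return dataset.count()  # stand-in for the per-class training loop
    finally:
        if handle_persistence:
            dataset.unpersist()
```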
spark git commit: [MINOR][DOC] Add missing call of `update()` in examples of PeriodicGraphCheckpointer & PeriodicRDDCheckpointer
Repository: spark Updated Branches: refs/heads/master 8d8641f12 -> 66cb72d7b [MINOR][DOC] Add missing call of `update()` in examples of PeriodicGraphCheckpointer & PeriodicRDDCheckpointer ## What changes were proposed in this pull request? The examples forgot to call `update()` with `graph1` & `rdd1` in the docs for `PeriodicGraphCheckpointer` & `PeriodicRDDCheckpointer`. ## How was this patch tested? Existing tests. Author: Zheng RuiFeng Closes #19198 from zhengruifeng/fix_doc_checkpointer. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/66cb72d7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/66cb72d7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/66cb72d7 Branch: refs/heads/master Commit: 66cb72d7b9178774ba253e244bb2eddb1345b21f Parents: 8d8641f Author: Zheng RuiFeng Authored: Thu Sep 14 14:04:43 2017 +0800 Committer: Yanbo Liang Committed: Thu Sep 14 14:04:43 2017 +0800 -- .../scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala | 1 + .../org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala| 1 + 2 files changed, 2 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/66cb72d7/core/src/main/scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala -- diff --git a/core/src/main/scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala b/core/src/main/scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala index ab72add..facbb83 100644 --- a/core/src/main/scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala +++ b/core/src/main/scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala @@ -50,6 +50,7 @@ import org.apache.spark.util.PeriodicCheckpointer * {{{ * val (rdd1, rdd2, rdd3, ...) = ... * val cp = new PeriodicRDDCheckpointer(2, sc) + * cp.update(rdd1) * rdd1.count(); * // persisted: rdd1 * cp.update(rdd2) http://git-wip-us.apache.org/repos/asf/spark/blob/66cb72d7/graphx/src/main/scala/org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala -- diff --git a/graphx/src/main/scala/org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala b/graphx/src/main/scala/org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala index fda501a..539b66f 100644 --- a/graphx/src/main/scala/org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala +++ b/graphx/src/main/scala/org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala @@ -50,6 +50,7 @@ import org.apache.spark.util.PeriodicCheckpointer * {{{ * val (graph1, graph2, graph3, ...) = ... * val cp = new PeriodicGraphCheckpointer(2, sc) + * cp.updateGraph(graph1) * graph1.vertices.count(); graph1.edges.count() * // persisted: graph1 * cp.updateGraph(graph2)
spark git commit: [SPARK-21854] Added LogisticRegressionTrainingSummary for MultinomialLogisticRegression in Python API
Repository: spark Updated Branches: refs/heads/master dcbb22943 -> 8d8641f12 [SPARK-21854] Added LogisticRegressionTrainingSummary for MultinomialLogisticRegression in Python API ## What changes were proposed in this pull request? Added LogisticRegressionTrainingSummary for MultinomialLogisticRegression in Python API ## How was this patch tested? Added unit test. Author: Ming Jiang Author: Ming Jiang Author: jmwdpk Closes #19185 from jmwdpk/SPARK-21854. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8d8641f1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8d8641f1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8d8641f1 Branch: refs/heads/master Commit: 8d8641f12250b0a9d370ff9354407c27af7cfcf4 Parents: dcbb229 Author: Ming Jiang Authored: Thu Sep 14 13:53:28 2017 +0800 Committer: Yanbo Liang Committed: Thu Sep 14 13:53:28 2017 +0800 -- .../LogisticRegressionSuite.scala | 12 ++ python/pyspark/ml/classification.py | 120 ++- python/pyspark/ml/tests.py | 55 - 3 files changed, 183 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8d8641f1/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala index d43c7cd..14f5508 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala @@ -2416,6 +2416,18 @@ class LogisticRegressionSuite blorSummary.recallByThreshold.collect() === sameBlorSummary.recallByThreshold.collect()) assert( blorSummary.precisionByThreshold.collect() === sameBlorSummary.precisionByThreshold.collect()) +assert(blorSummary.labels === sameBlorSummary.labels) +assert(blorSummary.truePositiveRateByLabel === sameBlorSummary.truePositiveRateByLabel) +assert(blorSummary.falsePositiveRateByLabel === sameBlorSummary.falsePositiveRateByLabel) +assert(blorSummary.precisionByLabel === sameBlorSummary.precisionByLabel) +assert(blorSummary.recallByLabel === sameBlorSummary.recallByLabel) +assert(blorSummary.fMeasureByLabel === sameBlorSummary.fMeasureByLabel) +assert(blorSummary.accuracy === sameBlorSummary.accuracy) +assert(blorSummary.weightedTruePositiveRate === sameBlorSummary.weightedTruePositiveRate) +assert(blorSummary.weightedFalsePositiveRate === sameBlorSummary.weightedFalsePositiveRate) +assert(blorSummary.weightedRecall === sameBlorSummary.weightedRecall) +assert(blorSummary.weightedPrecision === sameBlorSummary.weightedPrecision) +assert(blorSummary.weightedFMeasure === sameBlorSummary.weightedFMeasure) lr.setFamily("multinomial") val mlorModel = lr.fit(smallMultinomialDataset) http://git-wip-us.apache.org/repos/asf/spark/blob/8d8641f1/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index fbb9e7f..0caafa6 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -529,9 +529,11 @@ class LogisticRegressionModel(JavaModel, JavaClassificationModel, JavaMLWritable trained on the training set. An exception is thrown if `trainingSummary is None`.
""" if self.hasSummary: -java_blrt_summary = self._call_java("summary") -# Note: Once multiclass is added, update this to return correct summary -return BinaryLogisticRegressionTrainingSummary(java_blrt_summary) +java_lrt_summary = self._call_java("summary") +if self.numClasses <= 2: +return BinaryLogisticRegressionTrainingSummary(java_lrt_summary) +else: +return LogisticRegressionTrainingSummary(java_lrt_summary) else: raise RuntimeError("No training summary available for this %s" % self.__class__.__name__) @@ -587,6 +589,14 @@ class LogisticRegressionSummary(JavaWrapper): return self._call_java("probabilityCol") @property +@since("2.3.0") +def predictionCol(self): +""" +Field in "predictions" which gives the prediction of
spark git commit: [SPARK-21690][ML] one-pass imputer
Repository: spark Updated Branches: refs/heads/master ca00cc70d -> 0fa5b7cac [SPARK-21690][ML] one-pass imputer ## What changes were proposed in this pull request? Parallelize the computation of all columns. Performance tests:

|numColumns| Mean(Old) | Median(Old) | Mean(RDD) | Median(RDD) | Mean(DF) | Median(DF) |
|--|--|--|--|--|--|--|
|1|0.0771394713|0.0658712813|0.080779802|0.04816598149996|0.1052550987001|0.0499620203|
|10|0.723434063099|0.5954440414|0.0867935197|0.1326342865998|0.0925572488999|0.1573943635|
|100|7.3756451568|6.2196631259|0.1911931552|0.862537681701|0.5557462431|1.721683798202|

## How was this patch tested? Existing tests. Author: Zheng RuiFeng Closes #18902 from zhengruifeng/parallelize_imputer. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0fa5b7ca Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0fa5b7ca Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0fa5b7ca Branch: refs/heads/master Commit: 0fa5b7cacca4e867dd9f787cc2801616967932a4 Parents: ca00cc7 Author: Zheng RuiFeng Authored: Wed Sep 13 20:12:21 2017 +0800 Committer: Yanbo Liang Committed: Wed Sep 13 20:12:21 2017 +0800 -- .../org/apache/spark/ml/feature/Imputer.scala | 56 ++-- 1 file changed, 41 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0fa5b7ca/mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala index 9e023b9..1f36ece 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala @@ -133,23 +133,49 @@ class Imputer @Since("2.2.0") (@Since("2.2.0") override val uid: String) override def fit(dataset: Dataset[_]): ImputerModel = { transformSchema(dataset.schema, logging = true) val spark = dataset.sparkSession -import spark.implicits._ -val surrogates = $(inputCols).map { inputCol => - val ic = col(inputCol) - val filtered = dataset.select(ic.cast(DoubleType)) -.filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN) - if(filtered.take(1).length == 0) { -throw new SparkException(s"surrogate cannot be computed. " + - s"All the values in $inputCol are Null, Nan or missingValue(${$(missingValue)})") - } - val surrogate = $(strategy) match { -case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first() -case Imputer.median => filtered.stat.approxQuantile(inputCol, Array(0.5), 0.001).head - } - surrogate + +val cols = $(inputCols).map { inputCol => + when(col(inputCol).equalTo($(missingValue)), null) +.when(col(inputCol).isNaN, null) +.otherwise(col(inputCol)) +.cast("double") +.as(inputCol) +} + +val results = $(strategy) match { + case Imputer.mean => +// Function avg will ignore null automatically. +// For a column only containing null, avg will return null. +val row = dataset.select(cols.map(avg): _*).head() +Array.range(0, $(inputCols).length).map { i => + if (row.isNullAt(i)) { +Double.NaN + } else { +row.getDouble(i) + } +} + + case Imputer.median => +// Function approxQuantile will ignore null automatically. +// For a column only containing null, approxQuantile will return an empty array.
+dataset.select(cols: _*).stat.approxQuantile($(inputCols), Array(0.5), 0.001) + .map { array => +if (array.isEmpty) { + Double.NaN +} else { + array.head +} + } +} + +val emptyCols = $(inputCols).zip(results).filter(_._2.isNaN).map(_._1) +if (emptyCols.nonEmpty) { + throw new SparkException(s"surrogate cannot be computed. " + +s"All the values in ${emptyCols.mkString(",")} are Null, Nan or " + +s"missingValue(${$(missingValue)})") } -val rows = spark.sparkContext.parallelize(Seq(Row.fromSeq(surrogates))) +val rows = spark.sparkContext.parallelize(Seq(Row.fromSeq(results))) val schema = StructType($(inputCols).map(col => StructField(col, DoubleType, nullable = false))) val surrogateDF = spark.createDataFrame(rows, schema) copyValues(new ImputerModel(uid, surrogateDF).setParent(this))
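Since the change is internal to `fit()`, a usage-level PySpark sketch may help: imputing several columns at once, which after this patch costs a single pass over the data instead of one job per column (data and column names are made up):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Imputer

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (1.0, float("nan")),
    (2.0, 4.0),
    (float("nan"), 6.0),
    (4.0, 8.0),
], ["a", "b"])

# Both surrogates (here: column means, ignoring NaN) are computed together.
imputer = Imputer(strategy="mean", inputCols=["a", "b"],
                  outputCols=["a_imputed", "b_imputed"])
imputer.fit(df).transform(df).show()
```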
spark git commit: [SPARK-14516][ML] Adding ClusteringEvaluator with the implementation of Cosine silhouette and squared Euclidean silhouette.
Repository: spark Updated Branches: refs/heads/master e2ac2f1c7 -> dd7816758 [SPARK-14516][ML] Adding ClusteringEvaluator with the implementation of Cosine silhouette and squared Euclidean silhouette. ## What changes were proposed in this pull request? This PR adds the ClusteringEvaluator, an Evaluator that contains two metrics:

- **cosineSilhouette**: the Silhouette measure using the cosine distance;
- **squaredSilhouette**: the Silhouette measure using the squared Euclidean distance.

The implementation of the two metrics follows the algorithm proposed and explained [here](https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view). These algorithms were designed for a distributed, parallel environment, so they perform reasonably well, unlike a naive Silhouette implementation that follows the definition directly. ## How was this patch tested? The patch has been tested with the additional unit tests added (comparing the results with those provided by the [Python sklearn library](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html)). Author: Marco Gaido Closes #18538 from mgaido91/SPARK-14516. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dd781675 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dd781675 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dd781675 Branch: refs/heads/master Commit: dd7816758516b303d79adaac856670c3ccda11ce Parents: e2ac2f1 Author: Marco Gaido Authored: Tue Sep 12 17:59:53 2017 +0800 Committer: Yanbo Liang Committed: Tue Sep 12 17:59:53 2017 +0800 -- .../ml/evaluation/ClusteringEvaluator.scala | 436 +++ mllib/src/test/resources/test-data/iris.libsvm | 150 +++ .../evaluation/ClusteringEvaluatorSuite.scala | 89 3 files changed, 675 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dd781675/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala b/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala new file mode 100644 index 000..d6ec522 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala @@ -0,0 +1,436 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ */ + +package org.apache.spark.ml.evaluation + +import org.apache.spark.SparkContext +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol} +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils} +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.functions.{avg, col, udf} +import org.apache.spark.sql.types.DoubleType + +/** + * :: Experimental :: + * + * Evaluator for clustering results. + * The metric computes the Silhouette measure + * using the squared Euclidean distance. + * + * The Silhouette is a measure for the validation + * of the consistency within clusters. It ranges + * between 1 and -1, where a value close to 1 + * means that the points in a cluster are close + * to the other points in the same cluster and + * far from the points of the other clusters. + */ +@Experimental +@Since("2.3.0") +class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: String) + extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable { + + @Since("2.3.0") + def this() = this(Identifiable.randomUID("cluEval")) + + @Since("2.3.0") + override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap) +
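This commit adds the Scala evaluator; a PySpark wrapper followed in a later release. Assuming a version where `pyspark.ml.evaluation.ClusteringEvaluator` is available, a minimal sketch with made-up data:

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Two well-separated blobs, so the silhouette should be close to 1.
df = spark.createDataFrame(
    [(Vectors.dense(x, y),) for (x, y) in
     [(0.0, 0.0), (0.1, 0.1), (9.0, 9.0), (9.1, 9.1)]],
    ["features"])

predictions = KMeans(k=2, seed=1).fit(df).transform(df)

# Default metric: Silhouette with squared Euclidean distance.
print(ClusteringEvaluator().evaluate(predictions))
```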
spark git commit: [SPARK-21856] Add probability and rawPrediction to MLPC for Python
Repository: spark Updated Branches: refs/heads/master 828fab035 -> 4bab8f599 [SPARK-21856] Add probability and rawPrediction to MLPC for Python Probability and rawPrediction have been added to MultilayerPerceptronClassifier for Python. Added a unit test. Author: Chunsheng Ji Closes #19172 from chunshengji/SPARK-21856. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4bab8f59 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4bab8f59 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4bab8f59 Branch: refs/heads/master Commit: 4bab8f5996d94a468a40fde2961ebebafc393508 Parents: 828fab0 Author: Chunsheng Ji Authored: Mon Sep 11 16:52:48 2017 +0800 Committer: Yanbo Liang Committed: Mon Sep 11 16:52:48 2017 +0800 -- python/pyspark/ml/classification.py | 15 ++- python/pyspark/ml/tests.py | 20 2 files changed, 30 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4bab8f59/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index f0f42a3..aa747f3 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -1356,7 +1356,8 @@ class NaiveBayesModel(JavaModel, JavaClassificationModel, JavaMLWritable, JavaML @inherit_doc class MultilayerPerceptronClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasMaxIter, HasTol, HasSeed, HasStepSize, HasSolver, - JavaMLWritable, JavaMLReadable): + JavaMLWritable, JavaMLReadable, HasProbabilityCol, + HasRawPredictionCol): """ Classifier trainer based on the Multilayer Perceptron. Each layer has sigmoid activation function, output layer has softmax. @@ -1425,11 +1426,13 @@ class MultilayerPerceptronClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, @keyword_only def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, stepSize=0.03, - solver="l-bfgs", initialWeights=None): + solver="l-bfgs", initialWeights=None, probabilityCol="probability", + rawPredicitionCol="rawPrediction"): """ __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", \ maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, stepSize=0.03, \ - solver="l-bfgs", initialWeights=None) + solver="l-bfgs", initialWeights=None, probabilityCol="probability", \ + rawPredicitionCol="rawPrediction") """ super(MultilayerPerceptronClassifier, self).__init__() self._java_obj = self._new_java_obj( @@ -1442,11 +1445,13 @@ class MultilayerPerceptronClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, @since("1.6.0") def setParams(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, stepSize=0.03, - solver="l-bfgs", initialWeights=None): + solver="l-bfgs", initialWeights=None, probabilityCol="probability", + rawPredicitionCol="rawPrediction"): """ setParams(self, featuresCol="features", labelCol="label", predictionCol="prediction", \ maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, stepSize=0.03, \ - solver="l-bfgs", initialWeights=None) + solver="l-bfgs", initialWeights=None, probabilityCol="probability", \ + rawPredicitionCol="rawPrediction"): Sets params for MultilayerPerceptronClassifier.
""" kwargs = self._input_kwargs http://git-wip-us.apache.org/repos/asf/spark/blob/4bab8f59/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index 509698f..15d6c76 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -1655,6 +1655,26 @@ class LogisticRegressionTest(SparkSessionTestCase): np.allclose(model.interceptVector.toArray(), [-0.9057, -1.1392, -0.0033], atol=1E-4)) +class MultilayerPerceptronClassifierTest(SparkSessionTestCase): + +def test_raw_and_probability_prediction(self): + +data_path =
spark git commit: [SPARK-21108][ML] convert LinearSVC to aggregator framework
Repository: spark Updated Branches: refs/heads/master 05af2de0f -> f3676d639 [SPARK-21108][ML] convert LinearSVC to aggregator framework ## What changes were proposed in this pull request? Convert LinearSVC to the new aggregator framework. ## How was this patch tested? Existing unit tests. Author: Yuhao Yang Closes #18315 from hhbyyh/svcAggregator. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f3676d63 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f3676d63 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f3676d63 Branch: refs/heads/master Commit: f3676d63913e0706e071b71e1742b8d57b102fba Parents: 05af2de Author: Yuhao Yang Authored: Fri Aug 25 10:22:27 2017 +0800 Committer: Yanbo Liang Committed: Fri Aug 25 10:22:27 2017 +0800 -- .../spark/ml/classification/LinearSVC.scala | 204 ++- .../ml/optim/aggregator/HingeAggregator.scala | 105 ++ .../ml/classification/LinearSVCSuite.scala | 7 +- .../optim/aggregator/HingeAggregatorSuite.scala | 163 +++ .../aggregator/LogisticAggregatorSuite.scala| 2 - 5 files changed, 286 insertions(+), 195 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f3676d63/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala index 8d556de..3b0666c 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala @@ -25,11 +25,11 @@ import org.apache.hadoop.fs.Path import org.apache.spark.SparkException import org.apache.spark.annotation.{Experimental, Since} -import org.apache.spark.broadcast.Broadcast import org.apache.spark.internal.Logging import org.apache.spark.ml.feature.Instance import org.apache.spark.ml.linalg._ -import org.apache.spark.ml.linalg.BLAS._ +import org.apache.spark.ml.optim.aggregator.HingeAggregator +import org.apache.spark.ml.optim.loss.{L2Regularization, RDDLossFunction} import org.apache.spark.ml.param._ import org.apache.spark.ml.param.shared._ import org.apache.spark.ml.util._ @@ -214,10 +214,20 @@ class LinearSVC @Since("2.2.0") ( } val featuresStd = summarizer.variance.toArray.map(math.sqrt) + val getFeaturesStd = (j: Int) => featuresStd(j) val regParamL2 = $(regParam) val bcFeaturesStd = instances.context.broadcast(featuresStd) - val costFun = new LinearSVCCostFun(instances, $(fitIntercept), -$(standardization), bcFeaturesStd, regParamL2, $(aggregationDepth)) + val regularization = if (regParamL2 != 0.0) { +val shouldApply = (idx: Int) => idx >= 0 && idx < numFeatures +Some(new L2Regularization(regParamL2, shouldApply, + if ($(standardization)) None else Some(getFeaturesStd))) + } else { +None + } + + val getAggregatorFunc = new HingeAggregator(bcFeaturesStd, $(fitIntercept))(_) + val costFun = new RDDLossFunction(instances, getAggregatorFunc, regularization, +$(aggregationDepth)) def regParamL1Fun = (index: Int) => 0D val optimizer = new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol)) @@ -372,189 +382,3 @@ object LinearSVCModel extends MLReadable[LinearSVCModel] { } } } - -/** - * LinearSVCCostFun implements Breeze's DiffFunction[T] for hinge loss function - */ -private class LinearSVCCostFun( -instances: RDD[Instance], -fitIntercept: Boolean, -standardization: Boolean, -bcFeaturesStd: Broadcast[Array[Double]], -regParamL2: Double, -aggregationDepth:
Int) extends DiffFunction[BDV[Double]] { - - override def calculate(coefficients: BDV[Double]): (Double, BDV[Double]) = { -val coeffs = Vectors.fromBreeze(coefficients) -val bcCoeffs = instances.context.broadcast(coeffs) -val featuresStd = bcFeaturesStd.value -val numFeatures = featuresStd.length - -val svmAggregator = { - val seqOp = (c: LinearSVCAggregator, instance: Instance) => c.add(instance) - val combOp = (c1: LinearSVCAggregator, c2: LinearSVCAggregator) => c1.merge(c2) - - instances.treeAggregate( -new LinearSVCAggregator(bcCoeffs, bcFeaturesStd, fitIntercept) - )(seqOp, combOp, aggregationDepth) -} - -val totalGradientArray = svmAggregator.gradient.toArray -// regVal is the sum of coefficients squares excluding intercept for L2 regularization. -val regVal = if (regParamL2 == 0.0) { - 0.0 -} else { - var sum = 0.0 -
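To make the refactor concrete, here is a toy NumPy sketch of the aggregation pattern the new HingeAggregator plugs into RDDLossFunction: each partition accumulates hinge loss and gradient in one pass, and partial results are merged, mirroring `treeAggregate`. This illustrates the pattern only, not the Scala implementation (standardization and broadcast details are omitted):

```python
import numpy as np

def hinge_partial(w, b, X, y):
    """One partition's contribution: loss and gradient in a single pass.
    Labels y in {0, 1} are mapped to {-1, +1}, as in LinearSVC."""
    s = 2.0 * y - 1.0
    margins = s * (X @ w + b)
    active = margins < 1.0                 # hinge is zero past margin 1
    loss = np.maximum(0.0, 1.0 - margins).sum()
    grad_w = -(s[active][:, None] * X[active]).sum(axis=0)
    grad_b = -s[active].sum()
    return loss, grad_w, grad_b, len(y)

def merge(p1, p2):
    """Combine two partial aggregates (the combOp of treeAggregate)."""
    return tuple(a + b for a, b in zip(p1, p2))

# Example: two "partitions" of toy data.
rng = np.random.default_rng(0)
X1, y1 = rng.normal(size=(5, 3)), rng.integers(0, 2, 5)
X2, y2 = rng.normal(size=(5, 3)), rng.integers(0, 2, 5)
w, b = np.zeros(3), 0.0
loss, gw, gb, n = merge(hinge_partial(w, b, X1, y1),
                        hinge_partial(w, b, X2, y2))
print(loss / n, gw / n, gb / n)
```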
spark git commit: [ML][MINOR] Make sharedParams update.
Repository: spark Updated Branches: refs/heads/master 3c0c2d09c -> 342961905 [ML][MINOR] Make sharedParams update. ## What changes were proposed in this pull request? ```sharedParams.scala``` is generated by ```SharedParamsCodeGen```, but it is out of date in master. Probably someone updated ```sharedParams.scala``` manually; this PR fixes the issue. ## How was this patch tested? Offline check. Author: Yanbo Liang Closes #19011 from yanboliang/sharedParams. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/34296190 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/34296190 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/34296190 Branch: refs/heads/master Commit: 34296190558435fce73184fb7fb1e3d2ced7c3f6 Parents: 3c0c2d0 Author: Yanbo Liang Authored: Wed Aug 23 11:06:53 2017 +0800 Committer: Yanbo Liang Committed: Wed Aug 23 11:06:53 2017 +0800 -- .../main/scala/org/apache/spark/ml/param/shared/sharedParams.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/34296190/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala b/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala index 545e45e..6061d9c 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala @@ -154,7 +154,7 @@ private[ml] trait HasVarianceCol extends Params { } /** - * Trait for shared param threshold (default: 0.5). + * Trait for shared param threshold. */ private[ml] trait HasThreshold extends Params {
spark git commit: [SPARK-19762][ML][FOLLOWUP] Add necessary comments to L2Regularization.
Repository: spark Updated Branches: refs/heads/master 84b5b16ea -> c108a5d30 [SPARK-19762][ML][FOLLOWUP] Add necessary comments to L2Regularization. ## What changes were proposed in this pull request? MLlib ```LinearRegression/LogisticRegression/LinearSVC``` always standardize the data during training to improve the rate of convergence, regardless of whether _standardization_ is true or false. If _standardization_ is false, we perform reverse standardization by penalizing each component differently to get effectively the same objective function when the training dataset is not standardized. We should keep these comments in the code to let developers understand how we handle this correctly. ## How was this patch tested? Existing tests; this only adds some comments in code. Author: Yanbo Liang Closes #18992 from yanboliang/SPARK-19762. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c108a5d3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c108a5d3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c108a5d3 Branch: refs/heads/master Commit: c108a5d30e821fef23709681fca7da22bc507129 Parents: 84b5b16 Author: Yanbo Liang Authored: Tue Aug 22 08:43:18 2017 +0800 Committer: Yanbo Liang Committed: Tue Aug 22 08:43:18 2017 +0800 -- .../ml/optim/loss/DifferentiableRegularization.scala | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c108a5d3/mllib/src/main/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularization.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularization.scala b/mllib/src/main/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularization.scala index 7ac7c22..929374e 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularization.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularization.scala @@ -39,9 +39,13 @@ private[ml] trait DifferentiableRegularization[T] extends DiffFunction[T] { * * @param regParam The magnitude of the regularization. * @param shouldApply A function (Int => Boolean) indicating whether a given index should have - *regularization applied to it. + *regularization applied to it. Usually we don't apply regularization to + *the intercept. * @param applyFeaturesStd Option for a function which maps coefficient index (column major) to the - * feature standard deviation. If `None`, no standardization is applied. + * feature standard deviation. Since we always standardize the data during + * training, if `standardization` is false, we have to reverse + * standardization by penalizing each component differently by this param. + * If `standardization` is true, this should be `None`. */ private[ml] class L2Regularization( override val regParam: Double, @@ -57,6 +61,11 @@ private[ml] class L2Regularization( val coef = coefficients(j) applyFeaturesStd match { case Some(getStd) => + // If `standardization` is false, we still standardize the data + // to improve the rate of convergence; as a result, we have to + // perform this reverse standardization by penalizing each component + // differently to get effectively the same objective function when + // the training dataset is not standardized. val std = getStd(j) if (std != 0.0) { val temp = coef / (std * std) @@ -66,6 +75,7 @@ private[ml] class L2Regularization( 0.0 } case None => + // If `standardization` is true, compute L2 regularization normally.
sum += coef * coef gradient(j) = coef * regParam }
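A toy NumPy sketch of the objective term those comments describe, under the stated assumption that training always runs on standardized features; the names are illustrative and this is not Spark's API:

```python
import numpy as np

def l2_value_and_gradient(coefficients, reg_param, features_std=None):
    """L2 term and its gradient. `features_std` is supplied only when the
    user asked for standardization=False: each component is then divided
    by std_j^2 to undo the internal scaling, per the comments above."""
    grad = np.zeros_like(coefficients)
    total = 0.0
    for j, coef in enumerate(coefficients):
        if features_std is None:
            # standardization=True: the plain 0.5 * lambda * ||w||^2 term
            total += coef * coef
            grad[j] = reg_param * coef
        elif features_std[j] != 0.0:
            # standardization=False: reverse the internal standardization
            temp = coef / (features_std[j] ** 2)
            total += coef * temp
            grad[j] = reg_param * temp
    return 0.5 * reg_param * total, grad

w = np.array([0.5, -1.2, 2.0])
print(l2_value_and_gradient(w, reg_param=0.1))
print(l2_value_and_gradient(w, reg_param=0.1,
                            features_std=np.array([1.0, 2.0, 0.5])))
```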
spark git commit: [SPARK-19634][ML] Multivariate summarizer - dataframes API
Repository: spark Updated Branches: refs/heads/master 966083105 -> 07549b20a [SPARK-19634][ML] Multivariate summarizer - dataframes API ## What changes were proposed in this pull request? This patch adds the DataFrames API to the multivariate summarizer (mean, variance, etc.). In addition to all the features of MultivariateOnlineSummarizer, it also allows the user to select a subset of the metrics. ## How was this patch tested? Testcases added. ## Performance Resolved several performance issues in #17419; further optimization is pending on the SQL team's work. One of the SQL-layer performance issues related to these features has been resolved in #18712, thanks liancheng and cloud-fan. ### Performance data (tested on my laptop, with 2 partitions; tries out = 20, warm up = 10) The unit of the test results is records/millisecond (higher is better).

Vector size/records number | 1/1000 | 10/100 | 100/100 | 1000/10 | 1/1
--|--|--|--|--|--
Dataframe | 15149 | 7441 | 2118 | 224 | 21
RDD from Dataframe | 4992 | 4440 | 2328 | 320 | 33
raw RDD | 53931 | 20683 | 3966 | 528 | 53

Author: WeichenXu Closes #18798 from WeichenXu123/SPARK-19634-dataframe-summarizer. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/07549b20 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/07549b20 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/07549b20 Branch: refs/heads/master Commit: 07549b20a3fc2a282e080f76a2be075e4dd5ebc7 Parents: 9660831 Author: WeichenXu Authored: Wed Aug 16 10:41:05 2017 +0800 Committer: Yanbo Liang Committed: Wed Aug 16 10:41:05 2017 +0800 -- .../org/apache/spark/ml/linalg/VectorUDT.scala | 24 +- .../org/apache/spark/ml/stat/Summarizer.scala | 596 +++ .../apache/spark/ml/stat/SummarizerSuite.scala | 582 ++ .../sql/catalyst/expressions/Projection.scala | 6 + .../expressions/aggregate/interfaces.scala | 6 + 5 files changed, 1203 insertions(+), 11 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/07549b20/mllib/src/main/scala/org/apache/spark/ml/linalg/VectorUDT.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/linalg/VectorUDT.scala b/mllib/src/main/scala/org/apache/spark/ml/linalg/VectorUDT.scala index 9178613..37f173b 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/linalg/VectorUDT.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/linalg/VectorUDT.scala @@ -27,17 +27,7 @@ import org.apache.spark.sql.types._ */ private[spark] class VectorUDT extends UserDefinedType[Vector] { - override def sqlType: StructType = { -// type: 0 = sparse, 1 = dense -// We only use "values" for dense vectors, and "size", "indices", and "values" for sparse -// vectors. The "values" field is nullable because we might want to add binary vectors later, -// which uses "size" and "indices", but not "values".
-StructType(Seq( - StructField("type", ByteType, nullable = false), - StructField("size", IntegerType, nullable = true), - StructField("indices", ArrayType(IntegerType, containsNull = false), nullable = true), - StructField("values", ArrayType(DoubleType, containsNull = false), nullable = true))) - } + override final def sqlType: StructType = _sqlType override def serialize(obj: Vector): InternalRow = { obj match { @@ -94,4 +84,16 @@ private[spark] class VectorUDT extends UserDefinedType[Vector] { override def typeName: String = "vector" private[spark] override def asNullable: VectorUDT = this + + private[this] val _sqlType = { +// type: 0 = sparse, 1 = dense +// We only use "values" for dense vectors, and "size", "indices", and "values" for sparse +// vectors. The "values" field is nullable because we might want to add binary vectors later, +// which uses "size" and "indices", but not "values". +StructType(Seq( + StructField("type", ByteType, nullable = false), + StructField("size", IntegerType, nullable = true), + StructField("indices", ArrayType(IntegerType, containsNull = false), nullable = true), + StructField("values", ArrayType(DoubleType, containsNull = false), nullable = true))) + } } http://git-wip-us.apache.org/repos/asf/spark/blob/07549b20/mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala b/mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala new file mode 100644 index 000..7e408b9 --- /dev/null +++
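The commit adds the Scala/DataFrame API; a PySpark wrapper (`pyspark.ml.stat.Summarizer`) arrived in a later release. Assuming a version that includes it, a sketch of selecting only the metrics you need:

```python
from pyspark.sql import SparkSession
from pyspark.ml.stat import Summarizer
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(Vectors.dense(1.0, 2.0),), (Vectors.dense(3.0, 4.0),)],
    ["features"])

# Request a subset of the metrics; only these are computed.
stats = Summarizer.metrics("mean", "variance")
df.select(stats.summary(df.features)).show(truncate=False)

# Shorthand for a single metric.
df.select(Summarizer.mean(df.features)).show()
```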
spark git commit: [SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in strong Wolfe line search
Repository: spark Updated Branches: refs/heads/branch-2.2 d02331452 -> 7446be332 [SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in strong Wolfe line search ## What changes were proposed in this pull request? Update breeze to 0.13.2 for an emergency bugfix in strong Wolfe line search: https://github.com/scalanlp/breeze/pull/651 ## How was this patch tested? N/A Author: WeichenXu Closes #18797 from WeichenXu123/update-breeze. (cherry picked from commit b35660dd0e930f4b484a079d9e2516b0a7dacf1d) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7446be33 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7446be33 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7446be33 Branch: refs/heads/branch-2.2 Commit: 7446be3328ea75a5197b2587e3a8e2ca7977726b Parents: d023314 Author: WeichenXu Authored: Wed Aug 9 14:44:10 2017 +0800 Committer: Yanbo Liang Committed: Wed Aug 9 14:44:39 2017 +0800 -- dev/deps/spark-deps-hadoop-2.6| 4 ++-- dev/deps/spark-deps-hadoop-2.7| 4 ++-- .../spark/ml/regression/AFTSurvivalRegression.scala | 2 ++ .../ml/regression/AFTSurvivalRegressionSuite.scala| 1 - .../org/apache/spark/ml/util/MLTestingUtils.scala | 1 - .../apache/spark/mllib/optimization/LBFGSSuite.scala | 4 ++-- pom.xml | 2 +- python/pyspark/ml/regression.py | 14 +++--- 8 files changed, 16 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7446be33/dev/deps/spark-deps-hadoop-2.6 -- diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/deps/spark-deps-hadoop-2.6 index 9287bd4..02c0b21 100644 --- a/dev/deps/spark-deps-hadoop-2.6 +++ b/dev/deps/spark-deps-hadoop-2.6 @@ -19,8 +19,8 @@ avro-mapred-1.7.7-hadoop2.jar base64-2.3.8.jar bcprov-jdk15on-1.51.jar bonecp-0.8.0.RELEASE.jar -breeze-macros_2.11-0.13.1.jar -breeze_2.11-0.13.1.jar +breeze-macros_2.11-0.13.2.jar +breeze_2.11-0.13.2.jar calcite-avatica-1.2.0-incubating.jar calcite-core-1.2.0-incubating.jar calcite-linq4j-1.2.0-incubating.jar http://git-wip-us.apache.org/repos/asf/spark/blob/7446be33/dev/deps/spark-deps-hadoop-2.7 -- diff --git a/dev/deps/spark-deps-hadoop-2.7 b/dev/deps/spark-deps-hadoop-2.7 index ab1de3d..47e28de 100644 --- a/dev/deps/spark-deps-hadoop-2.7 +++ b/dev/deps/spark-deps-hadoop-2.7 @@ -19,8 +19,8 @@ avro-mapred-1.7.7-hadoop2.jar base64-2.3.8.jar bcprov-jdk15on-1.51.jar bonecp-0.8.0.RELEASE.jar -breeze-macros_2.11-0.13.1.jar -breeze_2.11-0.13.1.jar +breeze-macros_2.11-0.13.2.jar +breeze_2.11-0.13.2.jar calcite-avatica-1.2.0-incubating.jar calcite-core-1.2.0-incubating.jar calcite-linq4j-1.2.0-incubating.jar http://git-wip-us.apache.org/repos/asf/spark/blob/7446be33/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala index 094853b..0891994 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala @@ -553,6 +553,8 @@ private class AFTAggregator( val ti = data.label val delta = data.censor +require(ti > 0.0, "The lifetime or label should be greater than 0.") + val localFeaturesStd = bcFeaturesStd.value val margin = { http://git-wip-us.apache.org/repos/asf/spark/blob/7446be33/mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala -- diff --git
a/mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala index fb39e50..02e5c6d 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala @@ -364,7 +364,6 @@ class AFTSurvivalRegressionSuite test("should support all NumericType censors, and not support other types") { val df = spark.createDataFrame(Seq( - (0, Vectors.dense(0)), (1, Vectors.dense(1)), (2, Vectors.dense(2)), (3, Vectors.dense(3)),
spark git commit: [SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in strong Wolfe line search
Repository: spark Updated Branches: refs/heads/master ae8a2b149 -> b35660dd0 [SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in strong Wolfe line search ## What changes were proposed in this pull request? Update breeze to 0.13.2 for an emergency bugfix in strong Wolfe line search: https://github.com/scalanlp/breeze/pull/651 ## How was this patch tested? N/A Author: WeichenXu Closes #18797 from WeichenXu123/update-breeze. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b35660dd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b35660dd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b35660dd Branch: refs/heads/master Commit: b35660dd0e930f4b484a079d9e2516b0a7dacf1d Parents: ae8a2b1 Author: WeichenXu Authored: Wed Aug 9 14:44:10 2017 +0800 Committer: Yanbo Liang Committed: Wed Aug 9 14:44:10 2017 +0800 -- dev/deps/spark-deps-hadoop-2.6| 4 ++-- dev/deps/spark-deps-hadoop-2.7| 4 ++-- .../spark/ml/regression/AFTSurvivalRegression.scala | 2 ++ .../ml/regression/AFTSurvivalRegressionSuite.scala| 1 - .../org/apache/spark/ml/util/MLTestingUtils.scala | 1 - .../apache/spark/mllib/optimization/LBFGSSuite.scala | 4 ++-- pom.xml | 2 +- python/pyspark/ml/regression.py | 14 +++--- 8 files changed, 16 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b35660dd/dev/deps/spark-deps-hadoop-2.6 -- diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/deps/spark-deps-hadoop-2.6 index a41183a..d7587fb 100644 --- a/dev/deps/spark-deps-hadoop-2.6 +++ b/dev/deps/spark-deps-hadoop-2.6 @@ -22,8 +22,8 @@ avro-mapred-1.7.7-hadoop2.jar base64-2.3.8.jar bcprov-jdk15on-1.51.jar bonecp-0.8.0.RELEASE.jar -breeze-macros_2.11-0.13.1.jar -breeze_2.11-0.13.1.jar +breeze-macros_2.11-0.13.2.jar +breeze_2.11-0.13.2.jar calcite-avatica-1.2.0-incubating.jar calcite-core-1.2.0-incubating.jar calcite-linq4j-1.2.0-incubating.jar http://git-wip-us.apache.org/repos/asf/spark/blob/b35660dd/dev/deps/spark-deps-hadoop-2.7 -- diff --git a/dev/deps/spark-deps-hadoop-2.7 b/dev/deps/spark-deps-hadoop-2.7 index 5e1321b..887eeca 100644 --- a/dev/deps/spark-deps-hadoop-2.7 +++ b/dev/deps/spark-deps-hadoop-2.7 @@ -22,8 +22,8 @@ avro-mapred-1.7.7-hadoop2.jar base64-2.3.8.jar bcprov-jdk15on-1.51.jar bonecp-0.8.0.RELEASE.jar -breeze-macros_2.11-0.13.1.jar -breeze_2.11-0.13.1.jar +breeze-macros_2.11-0.13.2.jar +breeze_2.11-0.13.2.jar calcite-avatica-1.2.0-incubating.jar calcite-core-1.2.0-incubating.jar calcite-linq4j-1.2.0-incubating.jar http://git-wip-us.apache.org/repos/asf/spark/blob/b35660dd/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala index 094853b..0891994 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala @@ -553,6 +553,8 @@ private class AFTAggregator( val ti = data.label val delta = data.censor +require(ti > 0.0, "The lifetime or label should be greater than 0.") + val localFeaturesStd = bcFeaturesStd.value val margin = { http://git-wip-us.apache.org/repos/asf/spark/blob/b35660dd/mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala
b/mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala index fb39e50..02e5c6d 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala @@ -364,7 +364,6 @@ class AFTSurvivalRegressionSuite test("should support all NumericType censors, and not support other types") { val df = spark.createDataFrame(Seq( - (0, Vectors.dense(0)), (1, Vectors.dense(1)), (2, Vectors.dense(2)), (3, Vectors.dense(3)), http://git-wip-us.apache.org/repos/asf/spark/blob/b35660dd/mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala
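Besides the dependency bump, note the new `require(ti > 0.0, ...)` guard above: AFT labels are lifetimes and must now be strictly positive. A PySpark sketch with valid data (the rows come from the standard AFT example; values are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.ml.regression import AFTSurvivalRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Every label (lifetime) is > 0; a 0.0 label would now fail fit() with
# "The lifetime or label should be greater than 0."
df = spark.createDataFrame([
    (1.218, 1.0, Vectors.dense(1.560, -0.605)),
    (2.949, 0.0, Vectors.dense(0.346, 2.158)),
    (3.627, 0.0, Vectors.dense(1.380, 0.231)),
    (0.273, 1.0, Vectors.dense(0.520, 1.151)),
], ["label", "censor", "features"])

model = AFTSurvivalRegression().fit(df)
print(model.coefficients, model.intercept, model.scale)
```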
spark git commit: [SPARK-21306][ML] For branch 2.0, OneVsRest should support setWeightCol
Repository: spark Updated Branches: refs/heads/branch-2.0 c27a01aec -> 9f670ce5d [SPARK-21306][ML] For branch 2.0, OneVsRest should support setWeightCol The PR is related to #18554, and is modified for branch 2.0. ## What changes were proposed in this pull request? Add a `setWeightCol` method for OneVsRest. `weightCol` is ignored if the classifier doesn't inherit the HasWeightCol trait. ## How was this patch tested? + [x] added a unit test. Author: Yan Facai (颜发才) Closes #18764 from facaiy/BUG/branch-2.0_OneVsRest_support_setWeightCol. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9f670ce5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9f670ce5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9f670ce5 Branch: refs/heads/branch-2.0 Commit: 9f670ce5d1aeef737226185d78f07147f0cc2693 Parents: c27a01a Author: Yan Facai (颜发才) Authored: Tue Aug 8 11:18:15 2017 +0800 Committer: Yanbo Liang Committed: Tue Aug 8 11:18:15 2017 +0800 -- .../spark/ml/classification/OneVsRest.scala | 39 ++-- .../ml/classification/OneVsRestSuite.scala | 11 ++ python/pyspark/ml/classification.py | 27 +++--- python/pyspark/ml/tests.py | 14 +++ 4 files changed, 82 insertions(+), 9 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9f670ce5/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala index f4ab0a0..770d5db 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala @@ -34,6 +34,7 @@ import org.apache.spark.ml._ import org.apache.spark.ml.attribute._ import org.apache.spark.ml.linalg.Vector import org.apache.spark.ml.param.{Param, ParamMap, ParamPair, Params} +import org.apache.spark.ml.param.shared.HasWeightCol import org.apache.spark.ml.util._ import org.apache.spark.sql.{DataFrame, Dataset, Row} import org.apache.spark.sql.functions._ @@ -53,7 +54,8 @@ private[ml] trait ClassifierTypeTrait { /** * Params for [[OneVsRest]]. */ -private[ml] trait OneVsRestParams extends PredictorParams with ClassifierTypeTrait { +private[ml] trait OneVsRestParams extends PredictorParams + with ClassifierTypeTrait with HasWeightCol { /** * param for the base binary classifier that we reduce multiclass classification into. @@ -290,6 +292,18 @@ final class OneVsRest @Since("1.4.0") ( @Since("1.5.0") def setPredictionCol(value: String): this.type = set(predictionCol, value) + /** + * Sets the value of param [[weightCol]]. + * + * This is ignored if weight is not supported by [[classifier]]. + * If this is not set or empty, we treat all instance weights as 1.0. + * Default is not set, so all instances have weight one.
+ * + * @group setParam + */ + @Since("2.3.0") + def setWeightCol(value: String): this.type = set(weightCol, value) + @Since("1.4.0") override def transformSchema(schema: StructType): StructType = { validateAndTransformSchema(schema, fitting = true, getClassifier.featuresDataType) @@ -308,7 +322,20 @@ final class OneVsRest @Since("1.4.0") ( } val numClasses = MetadataUtils.getNumClasses(labelSchema).fold(computeNumClasses())(identity) -val multiclassLabeled = dataset.select($(labelCol), $(featuresCol)) +val weightColIsUsed = isDefined(weightCol) && $(weightCol).nonEmpty && { + getClassifier match { +case _: HasWeightCol => true +case c => + logWarning(s"weightCol is ignored, as it is not supported by $c now.") + false + } +} + +val multiclassLabeled = if (weightColIsUsed) { + dataset.select($(labelCol), $(featuresCol), $(weightCol)) +} else { + dataset.select($(labelCol), $(featuresCol)) +} // persist if underlying dataset is not persistent. val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE @@ -328,7 +355,13 @@ final class OneVsRest @Since("1.4.0") ( paramMap.put(classifier.labelCol -> labelColName) paramMap.put(classifier.featuresCol -> getFeaturesCol) paramMap.put(classifier.predictionCol -> getPredictionCol) - classifier.fit(trainingDataset, paramMap) + if (weightColIsUsed) { +val classifier_ = classifier.asInstanceOf[ClassifierType with HasWeightCol] +paramMap.put(classifier_.weightCol -> getWeightCol) +classifier_.fit(trainingDataset, paramMap) + }
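A PySpark sketch of the new knob, assuming a build with this backport; the data and weight values are illustrative. LogisticRegression mixes in HasWeightCol, so the weights flow through to each binary sub-model:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (0.0, 1.0, Vectors.dense(0.0, 0.2)),
    (1.0, 2.0, Vectors.dense(1.0, 0.8)),
    (2.0, 1.0, Vectors.dense(2.0, 0.1)),
    (0.0, 3.0, Vectors.dense(0.1, 0.3)),
], ["label", "weight", "features"])

# For a base classifier without weight support, weightCol would be
# ignored with a warning instead.
ovr = OneVsRest(classifier=LogisticRegression(maxIter=10),
                weightCol="weight")
ovr.fit(df).transform(df).show()
```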
spark git commit: [SPARK-21306][ML] For branch 2.1, OneVsRest should support setWeightCol
Repository: spark Updated Branches: refs/heads/branch-2.1 444cca14d -> 9b749b6ce [SPARK-21306][ML] For branch 2.1, OneVsRest should support setWeightCol The PR is related to #18554, and is modified for branch 2.1. ## What changes were proposed in this pull request? Add a `setWeightCol` method for OneVsRest. `weightCol` is ignored if the classifier doesn't inherit the HasWeightCol trait. ## How was this patch tested? + [x] added a unit test. Author: Yan Facai (颜发才) Closes #18763 from facaiy/BUG/branch-2.1_OneVsRest_support_setWeightCol. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9b749b6c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9b749b6c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9b749b6c Branch: refs/heads/branch-2.1 Commit: 9b749b6ce6b86caf8a73d6993490fc140b9ad282 Parents: 444cca1 Author: Yan Facai (颜发才) Authored: Tue Aug 8 11:05:36 2017 +0800 Committer: Yanbo Liang Committed: Tue Aug 8 11:05:36 2017 +0800 -- .../spark/ml/classification/OneVsRest.scala | 39 ++-- .../ml/classification/OneVsRestSuite.scala | 10 + python/pyspark/ml/classification.py | 27 +++--- python/pyspark/ml/tests.py | 14 +++ 4 files changed, 81 insertions(+), 9 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9b749b6c/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala index e58b30d..c4a8f1f 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala @@ -34,6 +34,7 @@ import org.apache.spark.ml._ import org.apache.spark.ml.attribute._ import org.apache.spark.ml.linalg.Vector import org.apache.spark.ml.param.{Param, ParamMap, ParamPair, Params} +import org.apache.spark.ml.param.shared.HasWeightCol import org.apache.spark.ml.util._ import org.apache.spark.sql.{DataFrame, Dataset, Row} import org.apache.spark.sql.functions._ @@ -53,7 +54,8 @@ private[ml] trait ClassifierTypeTrait { /** * Params for [[OneVsRest]]. */ -private[ml] trait OneVsRestParams extends PredictorParams with ClassifierTypeTrait { +private[ml] trait OneVsRestParams extends PredictorParams + with ClassifierTypeTrait with HasWeightCol { /** * param for the base binary classifier that we reduce multiclass classification into. @@ -299,6 +301,18 @@ final class OneVsRest @Since("1.4.0") ( @Since("1.5.0") def setPredictionCol(value: String): this.type = set(predictionCol, value) + /** + * Sets the value of param [[weightCol]]. + * + * This is ignored if weight is not supported by [[classifier]]. + * If this is not set or empty, we treat all instance weights as 1.0. + * Default is not set, so all instances have weight one.
+ * + * @group setParam + */ + @Since("2.3.0") + def setWeightCol(value: String): this.type = set(weightCol, value) + @Since("1.4.0") override def transformSchema(schema: StructType): StructType = { validateAndTransformSchema(schema, fitting = true, getClassifier.featuresDataType) @@ -317,7 +331,20 @@ final class OneVsRest @Since("1.4.0") ( } val numClasses = MetadataUtils.getNumClasses(labelSchema).fold(computeNumClasses())(identity) -val multiclassLabeled = dataset.select($(labelCol), $(featuresCol)) +val weightColIsUsed = isDefined(weightCol) && $(weightCol).nonEmpty && { + getClassifier match { +case _: HasWeightCol => true +case c => + logWarning(s"weightCol is ignored, as it is not supported by $c now.") + false + } +} + +val multiclassLabeled = if (weightColIsUsed) { + dataset.select($(labelCol), $(featuresCol), $(weightCol)) +} else { + dataset.select($(labelCol), $(featuresCol)) +} // persist if underlying dataset is not persistent. val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE @@ -337,7 +364,13 @@ final class OneVsRest @Since("1.4.0") ( paramMap.put(classifier.labelCol -> labelColName) paramMap.put(classifier.featuresCol -> getFeaturesCol) paramMap.put(classifier.predictionCol -> getPredictionCol) - classifier.fit(trainingDataset, paramMap) + if (weightColIsUsed) { +val classifier_ = classifier.asInstanceOf[ClassifierType with HasWeightCol] +paramMap.put(classifier_.weightCol -> getWeightCol) +classifier_.fit(trainingDataset, paramMap) + }
spark git commit: [SPARK-19270][FOLLOW-UP][ML] PySpark GLR model.summary should return a printable representation.
Repository: spark Updated Branches: refs/heads/master fdcee028a -> f763d8464 [SPARK-19270][FOLLOW-UP][ML] PySpark GLR model.summary should return a printable representation. ## What changes were proposed in this pull request? PySpark GLR ```model.summary``` should return a printable representation by calling Scala ```toString```. ## How was this patch tested?

```
from pyspark.ml.regression import GeneralizedLinearRegression
dataset = spark.read.format("libsvm").load("data/mllib/sample_linear_regression_data.txt")
glr = GeneralizedLinearRegression(family="gaussian", link="identity", maxIter=10, regParam=0.3)
model = glr.fit(dataset)
model.summary
```

Before this PR: ![image](https://user-images.githubusercontent.com/1962026/29021059-e221633e-7b96-11e7-8d77-5d53f89c81a9.png) After this PR: ![image](https://user-images.githubusercontent.com/1962026/29021097-fce80fa6-7b96-11e7-8ab4-7e113d447d5d.png) Author: Yanbo Liang Closes #18870 from yanboliang/spark-19270. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f763d846 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f763d846 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f763d846 Branch: refs/heads/master Commit: f763d8464b32852d7fd33e962e5476a7f03bc6c6 Parents: fdcee02 Author: Yanbo Liang Authored: Tue Aug 8 08:43:58 2017 +0800 Committer: Yanbo Liang Committed: Tue Aug 8 08:43:58 2017 +0800 -- python/pyspark/ml/regression.py | 3 +++ 1 file changed, 3 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f763d846/python/pyspark/ml/regression.py -- diff --git a/python/pyspark/ml/regression.py b/python/pyspark/ml/regression.py index 2cc6234..72374ac 100644 --- a/python/pyspark/ml/regression.py +++ b/python/pyspark/ml/regression.py @@ -1745,6 +1745,9 @@ class GeneralizedLinearRegressionTrainingSummary(GeneralizedLinearRegressionSumm """ return self._call_java("pValues") +def __repr__(self): +return self._call_java("toString") + if __name__ == "__main__": import doctest
spark git commit: [SPARK-20601][ML] Python API for Constrained Logistic Regression
Repository: spark Updated Branches: refs/heads/master 14e75758a -> 845c039ce [SPARK-20601][ML] Python API for Constrained Logistic Regression ## What changes were proposed in this pull request? Python API for Constrained Logistic Regression based on #17922; thanks for the original contribution from zero323. ## How was this patch tested? Unit tests. Author: zero323 Author: Yanbo Liang Closes #18759 from yanboliang/SPARK-20601. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/845c039c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/845c039c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/845c039c Branch: refs/heads/master Commit: 845c039ceb1662632a97631b110e875e934894ad Parents: 14e7575 Author: zero323 Authored: Wed Aug 2 18:10:26 2017 +0800 Committer: Yanbo Liang Committed: Wed Aug 2 18:10:26 2017 +0800 -- python/pyspark/ml/classification.py | 105 +-- python/pyspark/ml/param/__init__.py | 11 +++- python/pyspark/ml/tests.py | 37 +++ 3 files changed, 148 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/845c039c/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index ab1617b..bccf8e7 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -252,18 +252,55 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti "be used in the model. Supported options: auto, binomial, multinomial", typeConverter=TypeConverters.toString) +lowerBoundsOnCoefficients = Param(Params._dummy(), "lowerBoundsOnCoefficients", + "The lower bounds on coefficients if fitting under bound " + "constrained optimization. The bound matrix must be " + "compatible with the shape " + "(1, number of features) for binomial regression, or " + "(number of classes, number of features) " + "for multinomial regression.", + typeConverter=TypeConverters.toMatrix) + +upperBoundsOnCoefficients = Param(Params._dummy(), "upperBoundsOnCoefficients", + "The upper bounds on coefficients if fitting under bound " + "constrained optimization. The bound matrix must be " + "compatible with the shape " + "(1, number of features) for binomial regression, or " + "(number of classes, number of features) " + "for multinomial regression.", + typeConverter=TypeConverters.toMatrix) + +lowerBoundsOnIntercepts = Param(Params._dummy(), "lowerBoundsOnIntercepts", +"The lower bounds on intercepts if fitting under bound " +"constrained optimization. The bounds vector size must be " +"equal with 1 for binomial regression, or the number of " +"classes for multinomial regression.", +typeConverter=TypeConverters.toVector) + +upperBoundsOnIntercepts = Param(Params._dummy(), "upperBoundsOnIntercepts", +"The upper bounds on intercepts if fitting under bound " +"constrained optimization. The bound vector size must be " +"equal with 1 for binomial regression, or the number of " +"classes for multinomial regression.", +typeConverter=TypeConverters.toVector) + @keyword_only def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxIter=100, regParam=0.0, elasticNetParam=0.0, tol=1e-6, fitIntercept=True, threshold=0.5, thresholds=None, probabilityCol="probability", rawPredictionCol="rawPrediction", standardization=True, weightCol=None, - aggregationDepth=2, family="auto"): + aggregationDepth=2, family="auto", + lowerBoundsOnCoefficients=None, upperBoundsOnCoefficients=None, +
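A minimal PySpark sketch of the new box constraints, assuming a build with this patch (Spark 2.3+); data and bound values are illustrative. For the binomial case the coefficient bounds are (1, numFeatures) matrices and the intercept bounds are length-1 vectors:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Matrices, Vectors

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (0.0, Vectors.dense(0.0, 1.0)),
    (1.0, Vectors.dense(1.0, 0.0)),
    (0.0, Vectors.dense(0.5, 1.5)),
    (1.0, Vectors.dense(1.5, 0.5)),
], ["label", "features"])

# Force non-negative coefficients and a bounded intercept.
blr = LogisticRegression(
    lowerBoundsOnCoefficients=Matrices.dense(1, 2, [0.0, 0.0]),
    upperBoundsOnCoefficients=Matrices.dense(1, 2, [10.0, 10.0]),
    lowerBoundsOnIntercepts=Vectors.dense(-5.0),
    upperBoundsOnIntercepts=Vectors.dense(5.0))
model = blr.fit(df)
print(model.coefficients, model.intercept)
```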
spark git commit: [SPARK-21388][ML][PYSPARK] GBTs inherit from HasStepSize & LinearSVC from HasThreshold
Repository: spark Updated Branches: refs/heads/master 5fd0294ff -> 253a07e43 [SPARK-21388][ML][PYSPARK] GBTs inherit from HasStepSize & LinearSVC from HasThreshold ## What changes were proposed in this pull request? GBTs inherit from HasStepSize & LinearSVC/Binarizer from HasThreshold ## How was this patch tested? Existing tests. Author: Zheng RuiFeng Author: Ruifeng Zheng Closes #18612 from zhengruifeng/override_HasXXX. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/253a07e4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/253a07e4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/253a07e4 Branch: refs/heads/master Commit: 253a07e43a35f3494aa5e5ead9f4997c653325aa Parents: 5fd0294 Author: Zheng RuiFeng Authored: Tue Aug 1 21:34:26 2017 +0800 Committer: Yanbo Liang Committed: Tue Aug 1 21:34:26 2017 +0800 -- .../spark/ml/classification/LinearSVC.scala | 7 ++- .../ml/classification/LogisticRegression.scala | 1 + .../org/apache/spark/ml/feature/Word2Vec.scala | 1 - .../ml/param/shared/SharedParamsCodeGen.scala| 6 +++--- .../spark/ml/param/shared/sharedParams.scala | 6 ++ .../org/apache/spark/ml/tree/treeParams.scala| 7 ++- python/pyspark/ml/classification.py | 19 ++- python/pyspark/ml/regression.py | 5 + 8 files changed, 21 insertions(+), 31 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/253a07e4/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala index d6ed6a4..8d556de 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala @@ -42,7 +42,7 @@ import org.apache.spark.sql.functions.{col, lit} /** Params for linear SVM Classifier. */ private[classification] trait LinearSVCParams extends ClassifierParams with HasRegParam with HasMaxIter with HasFitIntercept with HasTol with HasStandardization with HasWeightCol - with HasAggregationDepth { + with HasAggregationDepth with HasThreshold { /** * Param for threshold in binary classification prediction.
@@ -53,11 +53,8 @@ private[classification] trait LinearSVCParams extends ClassifierParams with HasR * * @group param */ - final val threshold: DoubleParam = new DoubleParam(this, "threshold", + final override val threshold: DoubleParam = new DoubleParam(this, "threshold", "threshold in binary classification prediction applied to rawPrediction") - - /** @group getParam */ - def getThreshold: Double = $(threshold) } /** http://git-wip-us.apache.org/repos/asf/spark/blob/253a07e4/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala index 6bba7f9..21957d9 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala @@ -366,6 +366,7 @@ class LogisticRegression @Since("1.2.0") ( @Since("1.5.0") override def setThreshold(value: Double): this.type = super.setThreshold(value) + setDefault(threshold -> 0.5) @Since("1.5.0") override def getThreshold: Double = super.getThreshold http://git-wip-us.apache.org/repos/asf/spark/blob/253a07e4/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala index b6909b3..d4c8e4b 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala @@ -19,7 +19,6 @@ package org.apache.spark.ml.feature import org.apache.hadoop.fs.Path -import org.apache.spark.SparkContext import org.apache.spark.annotation.Since import org.apache.spark.ml.{Estimator, Model} import org.apache.spark.ml.linalg.{BLAS, Vector, Vectors, VectorUDT} http://git-wip-us.apache.org/repos/asf/spark/blob/253a07e4/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala
spark git commit: [SPARK-21575][SPARKR] Eliminate needless synchronization in java-R serialization
Repository: spark Updated Branches: refs/heads/master 44e501ace -> 106eaa9b9 [SPARK-21575][SPARKR] Eliminate needless synchronization in java-R serialization ## What changes were proposed in this pull request? Remove surplus synchronized blocks. ## How was this patch tested? Unit tests run OK. Author: iurii.ant Closes #18775 from SereneAnt/eliminate_unnecessary_synchronization_in_java-R_serialization. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/106eaa9b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/106eaa9b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/106eaa9b Branch: refs/heads/master Commit: 106eaa9b95192f0cdbb382c11efdcb85032e679b Parents: 44e501a Author: iurii.ant Authored: Mon Jul 31 10:42:09 2017 +0800 Committer: Yanbo Liang Committed: Mon Jul 31 10:42:09 2017 +0800 -- .../org/apache/spark/api/r/JVMObjectTracker.scala | 16 ++-- 1 file changed, 2 insertions(+), 14 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/106eaa9b/core/src/main/scala/org/apache/spark/api/r/JVMObjectTracker.scala -- diff --git a/core/src/main/scala/org/apache/spark/api/r/JVMObjectTracker.scala b/core/src/main/scala/org/apache/spark/api/r/JVMObjectTracker.scala index 3432700..fe7438a 100644 --- a/core/src/main/scala/org/apache/spark/api/r/JVMObjectTracker.scala +++ b/core/src/main/scala/org/apache/spark/api/r/JVMObjectTracker.scala @@ -37,13 +37,7 @@ private[r] class JVMObjectTracker { /** * Returns the JVM object associated with the input key or None if not found. */ - final def get(id: JVMObjectId): Option[Object] = this.synchronized { -if (objMap.containsKey(id)) { - Some(objMap.get(id)) -} else { - None -} - } + final def get(id: JVMObjectId): Option[Object] = Option(objMap.get(id)) /** * Returns the JVM object associated with the input key or throws an exception if not found. @@ -67,13 +61,7 @@ private[r] class JVMObjectTracker { /** * Removes and returns a JVM object with the specific ID from the tracker, or None if not found. */ - final def remove(id: JVMObjectId): Option[Object] = this.synchronized { -if (objMap.containsKey(id)) { - Some(objMap.remove(id)) -} else { - None -} - } + final def remove(id: JVMObjectId): Option[Object] = Option(objMap.remove(id)) /** * Number of JVM objects being tracked.
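The design point, sketched in Python for illustration (the real code relies on ConcurrentHashMap, whose `get`/`remove` are already atomic): a single lookup makes the old check-then-act pattern, and the lock that guarded it, unnecessary.

```python
import threading

objs = {}                  # stands in for the tracker's concurrent map
lock = threading.Lock()

# Before: check-then-act needs a lock, because another thread could
# remove the key between the membership test and the lookup.
def get_with_lock(key):
    with lock:
        if key in objs:
            return objs[key]
        return None

# After: one atomic lookup, no lock. dict.get (like ConcurrentHashMap.get)
# returns a sentinel for missing keys, so the two-step pattern collapses
# into a single step.
def get_lock_free(key):
    return objs.get(key)
```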
spark git commit: Revert "[SPARK-21306][ML] OneVsRest should support setWeightCol"
Repository: spark Updated Branches: refs/heads/branch-2.1 8520d7c6d -> 258ca40cf Revert "[SPARK-21306][ML] OneVsRest should support setWeightCol" This reverts commit 8520d7c6d5e880dea3c1a8a874148c07222b4b4b. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/258ca40c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/258ca40c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/258ca40c Branch: refs/heads/branch-2.1 Commit: 258ca40cf43eedae59b014a41fc6197df9bde299 Parents: 8520d7c Author: Yanbo LiangAuthored: Fri Jul 28 20:24:54 2017 +0800 Committer: Yanbo Liang Committed: Fri Jul 28 20:24:54 2017 +0800 -- .../spark/ml/classification/OneVsRest.scala | 39 ++-- .../ml/classification/OneVsRestSuite.scala | 10 - python/pyspark/ml/classification.py | 27 +++--- python/pyspark/ml/tests.py | 14 --- 4 files changed, 9 insertions(+), 81 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/258ca40c/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala index c4a8f1f..e58b30d 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala @@ -34,7 +34,6 @@ import org.apache.spark.ml._ import org.apache.spark.ml.attribute._ import org.apache.spark.ml.linalg.Vector import org.apache.spark.ml.param.{Param, ParamMap, ParamPair, Params} -import org.apache.spark.ml.param.shared.HasWeightCol import org.apache.spark.ml.util._ import org.apache.spark.sql.{DataFrame, Dataset, Row} import org.apache.spark.sql.functions._ @@ -54,8 +53,7 @@ private[ml] trait ClassifierTypeTrait { /** * Params for [[OneVsRest]]. */ -private[ml] trait OneVsRestParams extends PredictorParams - with ClassifierTypeTrait with HasWeightCol { +private[ml] trait OneVsRestParams extends PredictorParams with ClassifierTypeTrait { /** * param for the base binary classifier that we reduce multiclass classification into. @@ -301,18 +299,6 @@ final class OneVsRest @Since("1.4.0") ( @Since("1.5.0") def setPredictionCol(value: String): this.type = set(predictionCol, value) - /** - * Sets the value of param [[weightCol]]. - * - * This is ignored if weight is not supported by [[classifier]]. - * If this is not set or empty, we treat all instance weights as 1.0. - * Default is not set, so all instances have weight one. - * - * @group setParam - */ - @Since("2.3.0") - def setWeightCol(value: String): this.type = set(weightCol, value) - @Since("1.4.0") override def transformSchema(schema: StructType): StructType = { validateAndTransformSchema(schema, fitting = true, getClassifier.featuresDataType) @@ -331,20 +317,7 @@ final class OneVsRest @Since("1.4.0") ( } val numClasses = MetadataUtils.getNumClasses(labelSchema).fold(computeNumClasses())(identity) -val weightColIsUsed = isDefined(weightCol) && $(weightCol).nonEmpty && { - getClassifier match { -case _: HasWeightCol => true -case c => - logWarning(s"weightCol is ignored, as it is not supported by $c now.") - false - } -} - -val multiclassLabeled = if (weightColIsUsed) { - dataset.select($(labelCol), $(featuresCol), $(weightCol)) -} else { - dataset.select($(labelCol), $(featuresCol)) -} +val multiclassLabeled = dataset.select($(labelCol), $(featuresCol)) // persist if underlying dataset is not persistent. 
val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE @@ -364,13 +337,7 @@ final class OneVsRest @Since("1.4.0") ( paramMap.put(classifier.labelCol -> labelColName) paramMap.put(classifier.featuresCol -> getFeaturesCol) paramMap.put(classifier.predictionCol -> getPredictionCol) - if (weightColIsUsed) { -val classifier_ = classifier.asInstanceOf[ClassifierType with HasWeightCol] -paramMap.put(classifier_.weightCol -> getWeightCol) -classifier_.fit(trainingDataset, paramMap) - } else { -classifier.fit(trainingDataset, paramMap) - } + classifier.fit(trainingDataset, paramMap) }.toArray[ClassificationModel[_, _]] if (handlePersistence) { http://git-wip-us.apache.org/repos/asf/spark/blob/258ca40c/mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala -- diff --git
spark git commit: Revert "[SPARK-21306][ML] OneVsRest should support setWeightCol"
Repository: spark Updated Branches: refs/heads/branch-2.0 ccb827224 -> f8ae2bdd2 Revert "[SPARK-21306][ML] OneVsRest should support setWeightCol" This reverts commit ccb82722450c20c9cdea2b2c68783943213a5aa1. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f8ae2bdd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f8ae2bdd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f8ae2bdd Branch: refs/heads/branch-2.0 Commit: f8ae2bdd2112780ec2b1104119bac2b718a55413 Parents: ccb8272 Author: Yanbo LiangAuthored: Fri Jul 28 19:45:14 2017 +0800 Committer: Yanbo Liang Committed: Fri Jul 28 19:45:14 2017 +0800 -- .../spark/ml/classification/OneVsRest.scala | 39 ++-- .../ml/classification/OneVsRestSuite.scala | 10 - python/pyspark/ml/classification.py | 27 +++--- python/pyspark/ml/tests.py | 14 --- 4 files changed, 9 insertions(+), 81 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f8ae2bdd/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala index 770d5db..f4ab0a0 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala @@ -34,7 +34,6 @@ import org.apache.spark.ml._ import org.apache.spark.ml.attribute._ import org.apache.spark.ml.linalg.Vector import org.apache.spark.ml.param.{Param, ParamMap, ParamPair, Params} -import org.apache.spark.ml.param.shared.HasWeightCol import org.apache.spark.ml.util._ import org.apache.spark.sql.{DataFrame, Dataset, Row} import org.apache.spark.sql.functions._ @@ -54,8 +53,7 @@ private[ml] trait ClassifierTypeTrait { /** * Params for [[OneVsRest]]. */ -private[ml] trait OneVsRestParams extends PredictorParams - with ClassifierTypeTrait with HasWeightCol { +private[ml] trait OneVsRestParams extends PredictorParams with ClassifierTypeTrait { /** * param for the base binary classifier that we reduce multiclass classification into. @@ -292,18 +290,6 @@ final class OneVsRest @Since("1.4.0") ( @Since("1.5.0") def setPredictionCol(value: String): this.type = set(predictionCol, value) - /** - * Sets the value of param [[weightCol]]. - * - * This is ignored if weight is not supported by [[classifier]]. - * If this is not set or empty, we treat all instance weights as 1.0. - * Default is not set, so all instances have weight one. - * - * @group setParam - */ - @Since("2.3.0") - def setWeightCol(value: String): this.type = set(weightCol, value) - @Since("1.4.0") override def transformSchema(schema: StructType): StructType = { validateAndTransformSchema(schema, fitting = true, getClassifier.featuresDataType) @@ -322,20 +308,7 @@ final class OneVsRest @Since("1.4.0") ( } val numClasses = MetadataUtils.getNumClasses(labelSchema).fold(computeNumClasses())(identity) -val weightColIsUsed = isDefined(weightCol) && $(weightCol).nonEmpty && { - getClassifier match { -case _: HasWeightCol => true -case c => - logWarning(s"weightCol is ignored, as it is not supported by $c now.") - false - } -} - -val multiclassLabeled = if (weightColIsUsed) { - dataset.select($(labelCol), $(featuresCol), $(weightCol)) -} else { - dataset.select($(labelCol), $(featuresCol)) -} +val multiclassLabeled = dataset.select($(labelCol), $(featuresCol)) // persist if underlying dataset is not persistent. 
val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE @@ -355,13 +328,7 @@ final class OneVsRest @Since("1.4.0") ( paramMap.put(classifier.labelCol -> labelColName) paramMap.put(classifier.featuresCol -> getFeaturesCol) paramMap.put(classifier.predictionCol -> getPredictionCol) - if (weightColIsUsed) { -val classifier_ = classifier.asInstanceOf[ClassifierType with HasWeightCol] -paramMap.put(classifier_.weightCol -> getWeightCol) -classifier_.fit(trainingDataset, paramMap) - } else { -classifier.fit(trainingDataset, paramMap) - } + classifier.fit(trainingDataset, paramMap) }.toArray[ClassificationModel[_, _]] if (handlePersistence) { http://git-wip-us.apache.org/repos/asf/spark/blob/f8ae2bdd/mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala -- diff --git
spark git commit: [SPARK-21306][ML] OneVsRest should support setWeightCol
Repository: spark Updated Branches: refs/heads/branch-2.0 d7b9d6235 -> ccb827224 [SPARK-21306][ML] OneVsRest should support setWeightCol ## What changes were proposed in this pull request? add `setWeightCol` method for OneVsRest. `weightCol` is ignored if classifier doesn't inherit HasWeightCol trait. ## How was this patch tested? + [x] add an unit test. Author: Yan Facai (é¢åæ)Closes #18554 from facaiy/BUG/oneVsRest_missing_weightCol. (cherry picked from commit a5a3189974ea4628e9489eb50099a5432174e80c) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ccb82722 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ccb82722 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ccb82722 Branch: refs/heads/branch-2.0 Commit: ccb82722450c20c9cdea2b2c68783943213a5aa1 Parents: d7b9d62 Author: Yan Facai (é¢åæ) Authored: Fri Jul 28 10:10:35 2017 +0800 Committer: Yanbo Liang Committed: Fri Jul 28 10:20:27 2017 +0800 -- .../spark/ml/classification/OneVsRest.scala | 39 ++-- .../ml/classification/OneVsRestSuite.scala | 10 + python/pyspark/ml/classification.py | 27 +++--- python/pyspark/ml/tests.py | 14 +++ 4 files changed, 81 insertions(+), 9 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ccb82722/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala index f4ab0a0..770d5db 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala @@ -34,6 +34,7 @@ import org.apache.spark.ml._ import org.apache.spark.ml.attribute._ import org.apache.spark.ml.linalg.Vector import org.apache.spark.ml.param.{Param, ParamMap, ParamPair, Params} +import org.apache.spark.ml.param.shared.HasWeightCol import org.apache.spark.ml.util._ import org.apache.spark.sql.{DataFrame, Dataset, Row} import org.apache.spark.sql.functions._ @@ -53,7 +54,8 @@ private[ml] trait ClassifierTypeTrait { /** * Params for [[OneVsRest]]. */ -private[ml] trait OneVsRestParams extends PredictorParams with ClassifierTypeTrait { +private[ml] trait OneVsRestParams extends PredictorParams + with ClassifierTypeTrait with HasWeightCol { /** * param for the base binary classifier that we reduce multiclass classification into. @@ -290,6 +292,18 @@ final class OneVsRest @Since("1.4.0") ( @Since("1.5.0") def setPredictionCol(value: String): this.type = set(predictionCol, value) + /** + * Sets the value of param [[weightCol]]. + * + * This is ignored if weight is not supported by [[classifier]]. + * If this is not set or empty, we treat all instance weights as 1.0. + * Default is not set, so all instances have weight one. 
+ * + * @group setParam + */ + @Since("2.3.0") + def setWeightCol(value: String): this.type = set(weightCol, value) + @Since("1.4.0") override def transformSchema(schema: StructType): StructType = { validateAndTransformSchema(schema, fitting = true, getClassifier.featuresDataType) @@ -308,7 +322,20 @@ final class OneVsRest @Since("1.4.0") ( } val numClasses = MetadataUtils.getNumClasses(labelSchema).fold(computeNumClasses())(identity) -val multiclassLabeled = dataset.select($(labelCol), $(featuresCol)) +val weightColIsUsed = isDefined(weightCol) && $(weightCol).nonEmpty && { + getClassifier match { +case _: HasWeightCol => true +case c => + logWarning(s"weightCol is ignored, as it is not supported by $c now.") + false + } +} + +val multiclassLabeled = if (weightColIsUsed) { + dataset.select($(labelCol), $(featuresCol), $(weightCol)) +} else { + dataset.select($(labelCol), $(featuresCol)) +} // persist if underlying dataset is not persistent. val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE @@ -328,7 +355,13 @@ final class OneVsRest @Since("1.4.0") ( paramMap.put(classifier.labelCol -> labelColName) paramMap.put(classifier.featuresCol -> getFeaturesCol) paramMap.put(classifier.predictionCol -> getPredictionCol) - classifier.fit(trainingDataset, paramMap) + if (weightColIsUsed) { +val classifier_ = classifier.asInstanceOf[ClassifierType with HasWeightCol] +paramMap.put(classifier_.weightCol -> getWeightCol) +
spark git commit: [SPARK-21306][ML] OneVsRest should support setWeightCol
Repository: spark Updated Branches: refs/heads/branch-2.1 94987987a -> 8520d7c6d [SPARK-21306][ML] OneVsRest should support setWeightCol ## What changes were proposed in this pull request? add `setWeightCol` method for OneVsRest. `weightCol` is ignored if classifier doesn't inherit HasWeightCol trait. ## How was this patch tested? + [x] add an unit test. Author: Yan Facai (é¢åæ)Closes #18554 from facaiy/BUG/oneVsRest_missing_weightCol. (cherry picked from commit a5a3189974ea4628e9489eb50099a5432174e80c) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8520d7c6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8520d7c6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8520d7c6 Branch: refs/heads/branch-2.1 Commit: 8520d7c6d5e880dea3c1a8a874148c07222b4b4b Parents: 9498798 Author: Yan Facai (é¢åæ) Authored: Fri Jul 28 10:10:35 2017 +0800 Committer: Yanbo Liang Committed: Fri Jul 28 10:15:59 2017 +0800 -- .../spark/ml/classification/OneVsRest.scala | 39 ++-- .../ml/classification/OneVsRestSuite.scala | 10 + python/pyspark/ml/classification.py | 27 +++--- python/pyspark/ml/tests.py | 14 +++ 4 files changed, 81 insertions(+), 9 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8520d7c6/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala index e58b30d..c4a8f1f 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala @@ -34,6 +34,7 @@ import org.apache.spark.ml._ import org.apache.spark.ml.attribute._ import org.apache.spark.ml.linalg.Vector import org.apache.spark.ml.param.{Param, ParamMap, ParamPair, Params} +import org.apache.spark.ml.param.shared.HasWeightCol import org.apache.spark.ml.util._ import org.apache.spark.sql.{DataFrame, Dataset, Row} import org.apache.spark.sql.functions._ @@ -53,7 +54,8 @@ private[ml] trait ClassifierTypeTrait { /** * Params for [[OneVsRest]]. */ -private[ml] trait OneVsRestParams extends PredictorParams with ClassifierTypeTrait { +private[ml] trait OneVsRestParams extends PredictorParams + with ClassifierTypeTrait with HasWeightCol { /** * param for the base binary classifier that we reduce multiclass classification into. @@ -299,6 +301,18 @@ final class OneVsRest @Since("1.4.0") ( @Since("1.5.0") def setPredictionCol(value: String): this.type = set(predictionCol, value) + /** + * Sets the value of param [[weightCol]]. + * + * This is ignored if weight is not supported by [[classifier]]. + * If this is not set or empty, we treat all instance weights as 1.0. + * Default is not set, so all instances have weight one. 
+ * + * @group setParam + */ + @Since("2.3.0") + def setWeightCol(value: String): this.type = set(weightCol, value) + @Since("1.4.0") override def transformSchema(schema: StructType): StructType = { validateAndTransformSchema(schema, fitting = true, getClassifier.featuresDataType) @@ -317,7 +331,20 @@ final class OneVsRest @Since("1.4.0") ( } val numClasses = MetadataUtils.getNumClasses(labelSchema).fold(computeNumClasses())(identity) -val multiclassLabeled = dataset.select($(labelCol), $(featuresCol)) +val weightColIsUsed = isDefined(weightCol) && $(weightCol).nonEmpty && { + getClassifier match { +case _: HasWeightCol => true +case c => + logWarning(s"weightCol is ignored, as it is not supported by $c now.") + false + } +} + +val multiclassLabeled = if (weightColIsUsed) { + dataset.select($(labelCol), $(featuresCol), $(weightCol)) +} else { + dataset.select($(labelCol), $(featuresCol)) +} // persist if underlying dataset is not persistent. val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE @@ -337,7 +364,13 @@ final class OneVsRest @Since("1.4.0") ( paramMap.put(classifier.labelCol -> labelColName) paramMap.put(classifier.featuresCol -> getFeaturesCol) paramMap.put(classifier.predictionCol -> getPredictionCol) - classifier.fit(trainingDataset, paramMap) + if (weightColIsUsed) { +val classifier_ = classifier.asInstanceOf[ClassifierType with HasWeightCol] +paramMap.put(classifier_.weightCol -> getWeightCol) +
spark git commit: [SPARK-21306][ML] OneVsRest should support setWeightCol
Repository: spark Updated Branches: refs/heads/master f44ead89f -> a5a318997 [SPARK-21306][ML] OneVsRest should support setWeightCol ## What changes were proposed in this pull request? add `setWeightCol` method for OneVsRest. `weightCol` is ignored if classifier doesn't inherit HasWeightCol trait. ## How was this patch tested? + [x] add an unit test. Author: Yan Facai (é¢åæ)Closes #18554 from facaiy/BUG/oneVsRest_missing_weightCol. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a5a31899 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a5a31899 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a5a31899 Branch: refs/heads/master Commit: a5a3189974ea4628e9489eb50099a5432174e80c Parents: f44ead8 Author: Yan Facai (é¢åæ) Authored: Fri Jul 28 10:10:35 2017 +0800 Committer: Yanbo Liang Committed: Fri Jul 28 10:10:35 2017 +0800 -- .../spark/ml/classification/OneVsRest.scala | 39 ++-- .../ml/classification/OneVsRestSuite.scala | 10 + python/pyspark/ml/classification.py | 27 +++--- python/pyspark/ml/tests.py | 14 +++ 4 files changed, 81 insertions(+), 9 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a5a31899/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala index 7cbcccf..05b8c3a 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala @@ -34,6 +34,7 @@ import org.apache.spark.ml._ import org.apache.spark.ml.attribute._ import org.apache.spark.ml.linalg.Vector import org.apache.spark.ml.param.{Param, ParamMap, ParamPair, Params} +import org.apache.spark.ml.param.shared.HasWeightCol import org.apache.spark.ml.util._ import org.apache.spark.sql.{DataFrame, Dataset, Row} import org.apache.spark.sql.functions._ @@ -53,7 +54,8 @@ private[ml] trait ClassifierTypeTrait { /** * Params for [[OneVsRest]]. */ -private[ml] trait OneVsRestParams extends PredictorParams with ClassifierTypeTrait { +private[ml] trait OneVsRestParams extends PredictorParams + with ClassifierTypeTrait with HasWeightCol { /** * param for the base binary classifier that we reduce multiclass classification into. @@ -294,6 +296,18 @@ final class OneVsRest @Since("1.4.0") ( @Since("1.5.0") def setPredictionCol(value: String): this.type = set(predictionCol, value) + /** + * Sets the value of param [[weightCol]]. + * + * This is ignored if weight is not supported by [[classifier]]. + * If this is not set or empty, we treat all instance weights as 1.0. + * Default is not set, so all instances have weight one. 
+ * + * @group setParam + */ + @Since("2.3.0") + def setWeightCol(value: String): this.type = set(weightCol, value) + @Since("1.4.0") override def transformSchema(schema: StructType): StructType = { validateAndTransformSchema(schema, fitting = true, getClassifier.featuresDataType) @@ -317,7 +331,20 @@ final class OneVsRest @Since("1.4.0") ( val numClasses = MetadataUtils.getNumClasses(labelSchema).fold(computeNumClasses())(identity) instr.logNumClasses(numClasses) -val multiclassLabeled = dataset.select($(labelCol), $(featuresCol)) +val weightColIsUsed = isDefined(weightCol) && $(weightCol).nonEmpty && { + getClassifier match { +case _: HasWeightCol => true +case c => + logWarning(s"weightCol is ignored, as it is not supported by $c now.") + false + } +} + +val multiclassLabeled = if (weightColIsUsed) { + dataset.select($(labelCol), $(featuresCol), $(weightCol)) +} else { + dataset.select($(labelCol), $(featuresCol)) +} // persist if underlying dataset is not persistent. val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE @@ -337,7 +364,13 @@ final class OneVsRest @Since("1.4.0") ( paramMap.put(classifier.labelCol -> labelColName) paramMap.put(classifier.featuresCol -> getFeaturesCol) paramMap.put(classifier.predictionCol -> getPredictionCol) - classifier.fit(trainingDataset, paramMap) + if (weightColIsUsed) { +val classifier_ = classifier.asInstanceOf[ClassifierType with HasWeightCol] +paramMap.put(classifier_.weightCol -> getWeightCol) +classifier_.fit(trainingDataset, paramMap) + } else { +classifier.fit(trainingDataset, paramMap) + }
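A usage sketch for the new param (the names `training` and `weight` are assumed, not from the patch). Per the logic above, the weight column is forwarded to each binary sub-classifier only when that classifier mixes in HasWeightCol; otherwise it is ignored with a warning.

import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

val ovr = new OneVsRest()
  .setClassifier(new LogisticRegression())  // LogisticRegression supports weightCol
  .setWeightCol("weight")                   // assumed column of instance weights
// val ovrModel = ovr.fit(training)         // `training` is an assumed DataFrame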
spark git commit: [SPARK-19270][ML] Add summary table to GLM summary
Repository: spark Updated Branches: refs/heads/master 2ff35a057 -> ddcd2e826 [SPARK-19270][ML] Add summary table to GLM summary ## What changes were proposed in this pull request? Add R-like summary table to GLM summary, which includes feature name (if exist), parameter estimate, standard error, t-stat and p-value. This allows scala users to easily gather these commonly used inference results. srowen yanboliang felixcheung ## How was this patch tested? New tests. One for testing feature Name, and one for testing the summary Table. Author: actuaryzhangAuthor: Wayne Zhang Author: Yanbo Liang Closes #16630 from actuaryzhang/glmTable. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ddcd2e82 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ddcd2e82 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ddcd2e82 Branch: refs/heads/master Commit: ddcd2e8269db36e4b43edd5cee921d4b12def203 Parents: 2ff35a0 Author: actuaryzhang Authored: Thu Jul 27 22:00:59 2017 +0800 Committer: Yanbo Liang Committed: Thu Jul 27 22:00:59 2017 +0800 -- .../r/GeneralizedLinearRegressionWrapper.scala | 39 ++- .../GeneralizedLinearRegression.scala | 111 ++- .../GeneralizedLinearRegressionSuite.scala | 83 +- 3 files changed, 199 insertions(+), 34 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ddcd2e82/mllib/src/main/scala/org/apache/spark/ml/r/GeneralizedLinearRegressionWrapper.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/r/GeneralizedLinearRegressionWrapper.scala b/mllib/src/main/scala/org/apache/spark/ml/r/GeneralizedLinearRegressionWrapper.scala index ee1fc9b..176a6cf 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/r/GeneralizedLinearRegressionWrapper.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/r/GeneralizedLinearRegressionWrapper.scala @@ -83,11 +83,7 @@ private[r] object GeneralizedLinearRegressionWrapper .setStringIndexerOrderType(stringIndexerOrderType) checkDataColumns(rFormula, data) val rFormulaModel = rFormula.fit(data) -// get labels and feature names from output schema -val schema = rFormulaModel.transform(data).schema -val featureAttrs = AttributeGroup.fromStructField(schema(rFormula.getFeaturesCol)) - .attributes.get -val features = featureAttrs.map(_.name.get) + // assemble and fit the pipeline val glr = new GeneralizedLinearRegression() .setFamily(family) @@ -113,37 +109,16 @@ private[r] object GeneralizedLinearRegressionWrapper val summary = glm.summary val rFeatures: Array[String] = if (glm.getFitIntercept) { - Array("(Intercept)") ++ features + Array("(Intercept)") ++ summary.featureNames } else { - features + summary.featureNames } val rCoefficients: Array[Double] = if (summary.isNormalSolver) { - val rCoefficientStandardErrors = if (glm.getFitIntercept) { -Array(summary.coefficientStandardErrors.last) ++ - summary.coefficientStandardErrors.dropRight(1) - } else { -summary.coefficientStandardErrors - } - - val rTValues = if (glm.getFitIntercept) { -Array(summary.tValues.last) ++ summary.tValues.dropRight(1) - } else { -summary.tValues - } - - val rPValues = if (glm.getFitIntercept) { -Array(summary.pValues.last) ++ summary.pValues.dropRight(1) - } else { -summary.pValues - } - - if (glm.getFitIntercept) { -Array(glm.intercept) ++ glm.coefficients.toArray ++ - rCoefficientStandardErrors ++ rTValues ++ rPValues - } else { -glm.coefficients.toArray ++ rCoefficientStandardErrors ++ rTValues ++ rPValues - } + summary.coefficientsWithStatistics.map(_._2) ++ 
+summary.coefficientsWithStatistics.map(_._3) ++ +summary.coefficientsWithStatistics.map(_._4) ++ +summary.coefficientsWithStatistics.map(_._5) } else { if (glm.getFitIntercept) { Array(glm.intercept) ++ glm.coefficients.toArray http://git-wip-us.apache.org/repos/asf/spark/blob/ddcd2e82/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala index 815607f..917a4d2 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
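A sketch of how the new summary accessors are consumed. The names featureNames and coefficientsWithStatistics come from the wrapper diff above; depending on the Spark version they may be package-private, so treat this as an illustration rather than guaranteed public API, and `dataset` is an assumed DataFrame.

import org.apache.spark.ml.regression.GeneralizedLinearRegression

val glr = new GeneralizedLinearRegression()
  .setFamily("poisson")
  .setLink("log")
// val model = glr.fit(dataset)
// val summary = model.summary
// summary.featureNames                // e.g. Array("(Intercept)", "V4", "V5")
// summary.coefficientsWithStatistics  // (name, estimate, std error, t-stat, p-value)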
spark git commit: [MINOR][ML] Reorg RFormula params.
Repository: spark Updated Branches: refs/heads/master 256358f66 -> 5d1850d4b [MINOR][ML] Reorg RFormula params. ## What changes were proposed in this pull request? There are mainly two reasons for this reorg: * Some params are placed in ```RFormulaBase```, while others are placed in ```RFormula```, this is disordered. * ```RFormulaModel``` should have params ```handleInvalid```, ```formula``` and ```forceIndexLabel```, that users can get invalid values handling policy, formula or whether to force index label if they only have a ```RFormulaModel```. So we need move these params to ```RFormulaBase``` which is also inherited by ```RFormulaModel```. * ```RFormulaModel``` should support set different ```handleInvalid``` when cross validation. ## How was this patch tested? Existing tests. Author: Yanbo LiangCloses #18681 from yanboliang/rformula-reorg. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5d1850d4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5d1850d4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5d1850d4 Branch: refs/heads/master Commit: 5d1850d4b541a8108c934a174097f3c7e10b5315 Parents: 256358f Author: Yanbo Liang Authored: Thu Jul 20 20:07:16 2017 +0800 Committer: Yanbo Liang Committed: Thu Jul 20 20:07:16 2017 +0800 -- .../org/apache/spark/ml/feature/RFormula.scala | 95 ++-- 1 file changed, 47 insertions(+), 48 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5d1850d4/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala index c224454..7da3339 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala @@ -35,7 +35,51 @@ import org.apache.spark.sql.types._ /** * Base trait for [[RFormula]] and [[RFormulaModel]]. */ -private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol { +private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol with HasHandleInvalid { + + /** + * R formula parameter. The formula is provided in string form. + * @group param + */ + @Since("1.5.0") + val formula: Param[String] = new Param(this, "formula", "R model formula") + + /** @group getParam */ + @Since("1.5.0") + def getFormula: String = $(formula) + + /** + * Force to index label whether it is numeric or string type. + * Usually we index label only when it is string type. + * If the formula was used by classification algorithms, + * we can force to index label even it is numeric type by setting this param with true. + * Default: false. + * @group param + */ + @Since("2.1.0") + val forceIndexLabel: BooleanParam = new BooleanParam(this, "forceIndexLabel", +"Force to index label whether it is numeric or string") + setDefault(forceIndexLabel -> false) + + /** @group getParam */ + @Since("2.1.0") + def getForceIndexLabel: Boolean = $(forceIndexLabel) + + /** + * Param for how to handle invalid data (unseen or NULL values) in features and label column + * of string type. Options are 'skip' (filter out rows with invalid data), + * 'error' (throw an error), or 'keep' (put invalid data in a special additional + * bucket, at index numLabels). 
+ * Default: "error" + * @group param + */ + @Since("2.3.0") + final override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", +"How to handle invalid data (unseen or NULL values) in features and label column of string " + +"type. Options are 'skip' (filter out rows with invalid data), error (throw an error), " + +"or 'keep' (put invalid data in a special additional bucket, at index numLabels).", +ParamValidators.inArray(StringIndexer.supportedHandleInvalids)) + setDefault(handleInvalid, StringIndexer.ERROR_INVALID) /** * Param for how to order categories of a string FEATURE column used by `StringIndexer`. @@ -68,6 +112,7 @@ private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol { "The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', " + "RFormula drops the same category as R when encoding strings.", ParamValidators.inArray(StringIndexer.supportedStringOrderType)) + setDefault(stringIndexerOrderType, StringIndexer.frequencyDesc) /** @group getParam */ @Since("2.3.0") @@ -108,20 +153,12 @@ private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol { @Experimental @Since("1.5.0") class RFormula
spark git commit: [SPARK-20307][ML][SPARKR][FOLLOW-UP] RFormula should handle invalid for both features and label column.
Repository: spark Updated Branches: refs/heads/master 74ac1fb08 -> 69e5282d3 [SPARK-20307][ML][SPARKR][FOLLOW-UP] RFormula should handle invalid for both features and label column. ## What changes were proposed in this pull request? ```RFormula``` should handle invalid for both features and label column. #18496 only handle invalid values in features column. This PR add handling invalid values for label column and test cases. ## How was this patch tested? Add test cases. Author: Yanbo LiangCloses #18613 from yanboliang/spark-20307. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/69e5282d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/69e5282d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/69e5282d Branch: refs/heads/master Commit: 69e5282d3c2998611680d3e10f2830d4e9c5f750 Parents: 74ac1fb Author: Yanbo Liang Authored: Sat Jul 15 20:56:38 2017 +0800 Committer: Yanbo Liang Committed: Sat Jul 15 20:56:38 2017 +0800 -- R/pkg/tests/fulltests/test_mllib_tree.R | 2 +- .../org/apache/spark/ml/feature/RFormula.scala | 9 ++-- .../apache/spark/ml/feature/RFormulaSuite.scala | 49 +++- python/pyspark/ml/feature.py| 5 +- 4 files changed, 57 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/69e5282d/R/pkg/tests/fulltests/test_mllib_tree.R -- diff --git a/R/pkg/tests/fulltests/test_mllib_tree.R b/R/pkg/tests/fulltests/test_mllib_tree.R index 66a0693..e31a65f 100644 --- a/R/pkg/tests/fulltests/test_mllib_tree.R +++ b/R/pkg/tests/fulltests/test_mllib_tree.R @@ -225,7 +225,7 @@ test_that("spark.randomForest", { expect_error(collect(predictions)) model <- spark.randomForest(traindf, clicked ~ ., type = "classification", maxDepth = 10, maxBins = 10, numTrees = 10, - handleInvalid = "skip") + handleInvalid = "keep") predictions <- predict(model, testdf) expect_equal(class(collect(predictions)$clicked[1]), "character") http://git-wip-us.apache.org/repos/asf/spark/blob/69e5282d/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala index bb7acaf..c224454 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala @@ -134,16 +134,16 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String) def getFormula: String = $(formula) /** - * Param for how to handle invalid data (unseen labels or NULL values). - * Options are 'skip' (filter out rows with invalid data), + * Param for how to handle invalid data (unseen or NULL values) in features and label column + * of string type. Options are 'skip' (filter out rows with invalid data), * 'error' (throw an error), or 'keep' (put invalid data in a special additional * bucket, at index numLabels). * Default: "error" * @group param */ @Since("2.3.0") - override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", -"How to handle invalid data (unseen labels or NULL values). " + + override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", "How to " + +"handle invalid data (unseen or NULL values) in features and label column of string type. 
" + "Options are 'skip' (filter out rows with invalid data), error (throw an error), " + "or 'keep' (put invalid data in a special additional bucket, at index numLabels).", ParamValidators.inArray(StringIndexer.supportedHandleInvalids)) @@ -265,6 +265,7 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String) encoderStages += new StringIndexer() .setInputCol(resolvedFormula.label) .setOutputCol($(labelCol)) +.setHandleInvalid($(handleInvalid)) } val pipelineModel = new Pipeline(uid).setStages(encoderStages.toArray).fit(dataset) http://git-wip-us.apache.org/repos/asf/spark/blob/69e5282d/mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala index 23570d6..5d09c90 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala +++
spark git commit: [SPARK-18619][ML] Make QuantileDiscretizer/Bucketizer/StringIndexer/RFormula inherit from HasHandleInvalid
Repository: spark Updated Branches: refs/heads/master aaad34dc2 -> d2d2a5de1 [SPARK-18619][ML] Make QuantileDiscretizer/Bucketizer/StringIndexer/RFormula inherit from HasHandleInvalid ## What changes were proposed in this pull request? 1, HasHandleInvaild support override 2, Make QuantileDiscretizer/Bucketizer/StringIndexer/RFormula inherit from HasHandleInvalid ## How was this patch tested? existing tests [JIRA](https://issues.apache.org/jira/browse/SPARK-18619) Author: Zheng RuiFengCloses #18582 from zhengruifeng/heritate_HasHandleInvalid. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d2d2a5de Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d2d2a5de Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d2d2a5de Branch: refs/heads/master Commit: d2d2a5de186ddf381d0bdb353b23d64ff0224e7f Parents: aaad34d Author: Zheng RuiFeng Authored: Wed Jul 12 22:09:03 2017 +0800 Committer: Yanbo Liang Committed: Wed Jul 12 22:09:03 2017 +0800 -- .../apache/spark/ml/feature/Bucketizer.scala| 14 ++--- .../spark/ml/feature/QuantileDiscretizer.scala | 13 ++--- .../org/apache/spark/ml/feature/RFormula.scala | 13 ++--- .../apache/spark/ml/feature/StringIndexer.scala | 13 ++--- .../ml/param/shared/SharedParamsCodeGen.scala | 2 +- .../spark/ml/param/shared/sharedParams.scala| 2 +- .../GeneralizedLinearRegression.scala | 2 +- .../spark/ml/regression/LinearRegression.scala | 14 ++--- python/pyspark/ml/feature.py| 60 9 files changed, 53 insertions(+), 80 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d2d2a5de/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala index 46b512f..6a11a75 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala @@ -24,7 +24,7 @@ import org.apache.spark.annotation.Since import org.apache.spark.ml.Model import org.apache.spark.ml.attribute.NominalAttribute import org.apache.spark.ml.param._ -import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, HasOutputCol} import org.apache.spark.ml.util._ import org.apache.spark.sql._ import org.apache.spark.sql.expressions.UserDefinedFunction @@ -36,7 +36,8 @@ import org.apache.spark.sql.types.{DoubleType, StructField, StructType} */ @Since("1.4.0") final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String) - extends Model[Bucketizer] with HasInputCol with HasOutputCol with DefaultParamsWritable { + extends Model[Bucketizer] with HasHandleInvalid with HasInputCol with HasOutputCol +with DefaultParamsWritable { @Since("1.4.0") def this() = this(Identifiable.randomUID("bucketizer")) @@ -84,17 +85,12 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String * Default: "error" * @group param */ - // TODO: SPARK-18619 Make Bucketizer inherit from HasHandleInvalid. @Since("2.1.0") - val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", "how to handle " + -"invalid entries. Options are skip (filter out rows with invalid values), " + + override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", +"how to handle invalid entries. 
Options are skip (filter out rows with invalid values), " + "error (throw an error), or keep (keep invalid values in a special additional bucket).", ParamValidators.inArray(Bucketizer.supportedHandleInvalids)) - /** @group getParam */ - @Since("2.1.0") - def getHandleInvalid: String = $(handleInvalid) - /** @group setParam */ @Since("2.1.0") def setHandleInvalid(value: String): this.type = set(handleInvalid, value) http://git-wip-us.apache.org/repos/asf/spark/blob/d2d2a5de/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala index feceeba..95e8830 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala @@ -22,7 +22,7 @@ import org.apache.spark.internal.Logging import org.apache.spark.ml._ import
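From the caller's side nothing changes, since handleInvalid is now inherited from HasHandleInvalid rather than redeclared per transformer. A minimal Bucketizer sketch (column names hypothetical), where "keep" places NaN inputs in an extra bucket:

import org.apache.spark.ml.feature.Bucketizer

val bucketizer = new Bucketizer()
  .setInputCol("hour")       // hypothetical column
  .setOutputCol("bucket")
  .setSplits(Array(Double.NegativeInfinity, 6.0, 12.0, 18.0, Double.PositiveInfinity))
  .setHandleInvalid("keep")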
spark git commit: [SPARK-21285][ML] VectorAssembler reports the column name of unsupported data type
Repository: spark Updated Branches: refs/heads/master 7fcbb9b57 -> 56536e999 [SPARK-21285][ML] VectorAssembler reports the column name of unsupported data type ## What changes were proposed in this pull request? add the column name in the exception which is raised by unsupported data type. ## How was this patch tested? + [x] pass all tests. Author: Yan Facai (é¢åæ)Closes #18523 from facaiy/ENH/vectorassembler_add_col. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/56536e99 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/56536e99 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/56536e99 Branch: refs/heads/master Commit: 56536e9992ac4ea771758463962e49bba410e896 Parents: 7fcbb9b Author: Yan Facai (é¢åæ) Authored: Fri Jul 7 18:32:01 2017 +0800 Committer: Yanbo Liang Committed: Fri Jul 7 18:32:01 2017 +0800 -- .../apache/spark/ml/feature/VectorAssembler.scala| 15 +-- .../spark/ml/feature/VectorAssemblerSuite.scala | 5 - 2 files changed, 13 insertions(+), 7 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/56536e99/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala index ca90053..73f27d1 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala @@ -113,12 +113,15 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String) override def transformSchema(schema: StructType): StructType = { val inputColNames = $(inputCols) val outputColName = $(outputCol) -val inputDataTypes = inputColNames.map(name => schema(name).dataType) -inputDataTypes.foreach { - case _: NumericType | BooleanType => - case t if t.isInstanceOf[VectorUDT] => - case other => -throw new IllegalArgumentException(s"Data type $other is not supported.") +val incorrectColumns = inputColNames.flatMap { name => + schema(name).dataType match { +case _: NumericType | BooleanType => None +case t if t.isInstanceOf[VectorUDT] => None +case other => Some(s"Data type $other of column $name is not supported.") + } +} +if (incorrectColumns.nonEmpty) { + throw new IllegalArgumentException(incorrectColumns.mkString("\n")) } if (schema.fieldNames.contains(outputColName)) { throw new IllegalArgumentException(s"Output column $outputColName already exists.") http://git-wip-us.apache.org/repos/asf/spark/blob/56536e99/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala index 46cced3..6aef1c6 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala @@ -79,7 +79,10 @@ class VectorAssemblerSuite val thrown = intercept[IllegalArgumentException] { assembler.transform(df) } -assert(thrown.getMessage contains "Data type StringType is not supported") +assert(thrown.getMessage contains + "Data type StringType of column a is not supported.\n" + + "Data type StringType of column b is not supported.\n" + + "Data type StringType of column c is not supported.") } test("ML attributes") { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional 
commands, e-mail: commits-h...@spark.apache.org
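A sketch of the improved failure mode, using the string columns a, b, c from the test above: transformSchema now reports every offending column by name, one message per line, instead of naming only the type.

import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("a", "b", "c"))
  .setOutputCol("features")
// assembler.transform(df) now throws IllegalArgumentException with messages like:
//   Data type StringType of column a is not supported.
// where `df` is the assumed DataFrame from the test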
spark git commit: [SPARK-21310][ML][PYSPARK] Expose offset in PySpark
Repository: spark Updated Branches: refs/heads/master a38643256 -> 4852b7d44 [SPARK-21310][ML][PYSPARK] Expose offset in PySpark ## What changes were proposed in this pull request? Add offset to PySpark in GLM as in #16699. ## How was this patch tested? Python test Author: actuaryzhangCloses #18534 from actuaryzhang/pythonOffset. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4852b7d4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4852b7d4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4852b7d4 Branch: refs/heads/master Commit: 4852b7d447e872079c2c81428354adc825a87b27 Parents: a386432 Author: actuaryzhang Authored: Wed Jul 5 18:41:00 2017 +0800 Committer: Yanbo Liang Committed: Wed Jul 5 18:41:00 2017 +0800 -- python/pyspark/ml/regression.py | 25 + python/pyspark/ml/tests.py | 14 ++ 2 files changed, 35 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4852b7d4/python/pyspark/ml/regression.py -- diff --git a/python/pyspark/ml/regression.py b/python/pyspark/ml/regression.py index 84d8433..f0ff7a5 100644 --- a/python/pyspark/ml/regression.py +++ b/python/pyspark/ml/regression.py @@ -1376,17 +1376,20 @@ class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, Ha typeConverter=TypeConverters.toFloat) solver = Param(Params._dummy(), "solver", "The solver algorithm for optimization. Supported " + "options: irls.", typeConverter=TypeConverters.toString) +offsetCol = Param(Params._dummy(), "offsetCol", "The offset column name. If this is not set " + + "or empty, we treat all instance offsets as 0.0", + typeConverter=TypeConverters.toString) @keyword_only def __init__(self, labelCol="label", featuresCol="features", predictionCol="prediction", family="gaussian", link=None, fitIntercept=True, maxIter=25, tol=1e-6, regParam=0.0, weightCol=None, solver="irls", linkPredictionCol=None, - variancePower=0.0, linkPower=None): + variancePower=0.0, linkPower=None, offsetCol=None): """ __init__(self, labelCol="label", featuresCol="features", predictionCol="prediction", \ family="gaussian", link=None, fitIntercept=True, maxIter=25, tol=1e-6, \ regParam=0.0, weightCol=None, solver="irls", linkPredictionCol=None, \ - variancePower=0.0, linkPower=None) + variancePower=0.0, linkPower=None, offsetCol=None) """ super(GeneralizedLinearRegression, self).__init__() self._java_obj = self._new_java_obj( @@ -1402,12 +1405,12 @@ class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, Ha def setParams(self, labelCol="label", featuresCol="features", predictionCol="prediction", family="gaussian", link=None, fitIntercept=True, maxIter=25, tol=1e-6, regParam=0.0, weightCol=None, solver="irls", linkPredictionCol=None, - variancePower=0.0, linkPower=None): + variancePower=0.0, linkPower=None, offsetCol=None): """ setParams(self, labelCol="label", featuresCol="features", predictionCol="prediction", \ family="gaussian", link=None, fitIntercept=True, maxIter=25, tol=1e-6, \ regParam=0.0, weightCol=None, solver="irls", linkPredictionCol=None, \ - variancePower=0.0, linkPower=None) + variancePower=0.0, linkPower=None, offsetCol=None) Sets params for generalized linear regression. """ kwargs = self._input_kwargs @@ -1486,6 +1489,20 @@ class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, Ha """ return self.getOrDefault(self.linkPower) +@since("2.3.0") +def setOffsetCol(self, value): +""" +Sets the value of :py:attr:`offsetCol`. 
+""" +return self._set(offsetCol=value) + +@since("2.3.0") +def getOffsetCol(self): +""" +Gets the value of offsetCol or its default value. +""" +return self.getOrDefault(self.offsetCol) + class GeneralizedLinearRegressionModel(JavaModel, JavaPredictionModel, JavaMLWritable, JavaMLReadable): http://git-wip-us.apache.org/repos/asf/spark/blob/4852b7d4/python/pyspark/ml/tests.py -- diff --git
spark git commit: [SPARK-19852][PYSPARK][ML] Python StringIndexer supports 'keep' to handle invalid data
Repository: spark Updated Branches: refs/heads/master c605fee01 -> c19680be1 [SPARK-19852][PYSPARK][ML] Python StringIndexer supports 'keep' to handle invalid data ## What changes were proposed in this pull request? This PR is to maintain API parity with changes made in SPARK-17498 to support a new option 'keep' in StringIndexer to handle unseen labels or NULL values with PySpark. Note: This is updated version of #17237 , the primary author of this PR is VinceShieh . ## How was this patch tested? Unit tests. Author: VinceShiehAuthor: Yanbo Liang Closes #18453 from yanboliang/spark-19852. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c19680be Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c19680be Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c19680be Branch: refs/heads/master Commit: c19680be1c532dded1e70edce7a981ba28af09ad Parents: c605fee Author: Yanbo Liang Authored: Sun Jul 2 16:17:03 2017 +0800 Committer: Yanbo Liang Committed: Sun Jul 2 16:17:03 2017 +0800 -- python/pyspark/ml/feature.py | 6 ++ python/pyspark/ml/tests.py | 21 + 2 files changed, 27 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c19680be/python/pyspark/ml/feature.py -- diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py index 77de1cc..25ad06f 100755 --- a/python/pyspark/ml/feature.py +++ b/python/pyspark/ml/feature.py @@ -2132,6 +2132,12 @@ class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid, "frequencyDesc, frequencyAsc, alphabetDesc, alphabetAsc.", typeConverter=TypeConverters.toString) +handleInvalid = Param(Params._dummy(), "handleInvalid", "how to handle invalid data (unseen " + + "labels or NULL values). Options are 'skip' (filter out rows with " + + "invalid data), error (throw an error), or 'keep' (put invalid data " + + "in a special additional bucket, at index numLabels).", + typeConverter=TypeConverters.toString) + @keyword_only def __init__(self, inputCol=None, outputCol=None, handleInvalid="error", stringOrderType="frequencyDesc"): http://git-wip-us.apache.org/repos/asf/spark/blob/c19680be/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index 17a3947..ffb8b0a 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -551,6 +551,27 @@ class FeatureTests(SparkSessionTestCase): for i in range(0, len(expected)): self.assertTrue(all(observed[i]["features"].toArray() == expected[i])) +def test_string_indexer_handle_invalid(self): +df = self.spark.createDataFrame([ +(0, "a"), +(1, "d"), +(2, None)], ["id", "label"]) + +si1 = StringIndexer(inputCol="label", outputCol="indexed", handleInvalid="keep", +stringOrderType="alphabetAsc") +model1 = si1.fit(df) +td1 = model1.transform(df) +actual1 = td1.select("id", "indexed").collect() +expected1 = [Row(id=0, indexed=0.0), Row(id=1, indexed=1.0), Row(id=2, indexed=2.0)] +self.assertEqual(actual1, expected1) + +si2 = si1.setHandleInvalid("skip") +model2 = si2.fit(df) +td2 = model2.transform(df) +actual2 = td2.select("id", "indexed").collect() +expected2 = [Row(id=0, indexed=0.0), Row(id=1, indexed=1.0)] +self.assertEqual(actual2, expected2) + class HasInducedError(Params): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
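The Scala side of this option already existed (SPARK-17498); a sketch mirroring the Python test above: "keep" sends unseen labels and NULLs to the extra bucket at index numLabels, while "skip" filters those rows out, exactly as the two assertions verify.

import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexed")
  .setHandleInvalid("keep")
// val model = indexer.fit(df)   // `df` is the assumed DataFrame from the test
// model.transform(df)           // NULL label maps to index numLabels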
spark git commit: [SPARK-18518][ML] HasSolver supports override
Repository: spark Updated Branches: refs/heads/master 37ef32e51 -> e0b047eaf [SPARK-18518][ML] HasSolver supports override ## What changes were proposed in this pull request? 1, make param support non-final with `finalFields` option 2, generate `HasSolver` with `finalFields = false` 3, override `solver` in LiR, GLR, and make MLPC inherit `HasSolver` ## How was this patch tested? existing tests Author: Ruifeng ZhengAuthor: Zheng RuiFeng Closes #16028 from zhengruifeng/param_non_final. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e0b047ea Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e0b047ea Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e0b047ea Branch: refs/heads/master Commit: e0b047eafed92eadf6842a9df964438095e12d41 Parents: 37ef32e Author: Ruifeng Zheng Authored: Sat Jul 1 15:37:41 2017 +0800 Committer: Yanbo Liang Committed: Sat Jul 1 15:37:41 2017 +0800 -- .../MultilayerPerceptronClassifier.scala| 19 .../ml/param/shared/SharedParamsCodeGen.scala | 11 +++-- .../spark/ml/param/shared/sharedParams.scala| 8 ++-- .../GeneralizedLinearRegression.scala | 21 - .../spark/ml/regression/LinearRegression.scala | 46 +++- python/pyspark/ml/classification.py | 18 +--- python/pyspark/ml/regression.py | 5 +++ 7 files changed, 82 insertions(+), 46 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e0b047ea/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala index ec39f96..ceba11e 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala @@ -27,13 +27,16 @@ import org.apache.spark.ml.ann.{FeedForwardTopology, FeedForwardTrainer} import org.apache.spark.ml.feature.LabeledPoint import org.apache.spark.ml.linalg.{Vector, Vectors} import org.apache.spark.ml.param._ -import org.apache.spark.ml.param.shared.{HasMaxIter, HasSeed, HasStepSize, HasTol} +import org.apache.spark.ml.param.shared._ import org.apache.spark.ml.util._ import org.apache.spark.sql.Dataset /** Params for Multilayer Perceptron. */ private[classification] trait MultilayerPerceptronParams extends PredictorParams - with HasSeed with HasMaxIter with HasTol with HasStepSize { + with HasSeed with HasMaxIter with HasTol with HasStepSize with HasSolver { + + import MultilayerPerceptronClassifier._ + /** * Layer sizes including input size and output size. * @@ -78,14 +81,10 @@ private[classification] trait MultilayerPerceptronParams extends PredictorParams * @group expertParam */ @Since("2.0.0") - final val solver: Param[String] = new Param[String](this, "solver", + final override val solver: Param[String] = new Param[String](this, "solver", "The solver algorithm for optimization. Supported options: " + - s"${MultilayerPerceptronClassifier.supportedSolvers.mkString(", ")}. (Default l-bfgs)", - ParamValidators.inArray[String](MultilayerPerceptronClassifier.supportedSolvers)) - - /** @group expertGetParam */ - @Since("2.0.0") - final def getSolver: String = $(solver) + s"${supportedSolvers.mkString(", ")}. (Default l-bfgs)", +ParamValidators.inArray[String](supportedSolvers)) /** * The initial weights of the model. 
@@ -101,7 +100,7 @@ private[classification] trait MultilayerPerceptronParams extends PredictorParams final def getInitialWeights: Vector = $(initialWeights) setDefault(maxIter -> 100, tol -> 1e-6, blockSize -> 128, -solver -> MultilayerPerceptronClassifier.LBFGS, stepSize -> 0.03) +solver -> LBFGS, stepSize -> 0.03) } /** Label to vector converter. */ http://git-wip-us.apache.org/repos/asf/spark/blob/e0b047ea/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala b/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala index 013817a..23e0d45 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala @@ -80,8 +80,7 @@
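Caller-facing behaviour is unchanged by making solver a shared, overridable param; each estimator still constrains its own value set. A minimal sketch with LinearRegression, which accepts "auto", "normal", or "l-bfgs":

import org.apache.spark.ml.regression.LinearRegression

val lir = new LinearRegression().setSolver("normal")  // exact normal-equation solve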
spark git commit: [SPARK-21275][ML] Update GLM test to use supportedFamilyNames
Repository: spark Updated Branches: refs/heads/master b1d719e7c -> 37ef32e51 [SPARK-21275][ML] Update GLM test to use supportedFamilyNames ## What changes were proposed in this pull request? Update GLM test to use supportedFamilyNames as suggested here: https://github.com/apache/spark/pull/16699#discussion-diff-100574976R855 Author: actuaryzhangCloses #18495 from actuaryzhang/mlGlmTest2. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/37ef32e5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/37ef32e5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/37ef32e5 Branch: refs/heads/master Commit: 37ef32e515ea071afe63b56ba0d4299bb76e8a75 Parents: b1d719e Author: actuaryzhang Authored: Sat Jul 1 14:57:57 2017 +0800 Committer: Yanbo Liang Committed: Sat Jul 1 14:57:57 2017 +0800 -- .../GeneralizedLinearRegressionSuite.scala | 33 ++-- 1 file changed, 16 insertions(+), 17 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/37ef32e5/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala index 83f1344..a47bd17 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala @@ -749,15 +749,15 @@ class GeneralizedLinearRegressionSuite library(statmod) y <- c(1.0, 0.5, 0.7, 0.3) w <- c(1, 2, 3, 4) - for (fam in list(gaussian(), poisson(), binomial(), Gamma(), tweedie(1.6))) { + for (fam in list(binomial(), Gamma(), gaussian(), poisson(), tweedie(1.6))) { model1 <- glm(y ~ 1, family = fam) model2 <- glm(y ~ 1, family = fam, weights = w) print(as.vector(c(coef(model1), coef(model2 } - [1] 0.625 0.530 - [1] -0.4700036 -0.6348783 [1] 0.5108256 0.1201443 [1] 1.60 1.886792 + [1] 0.625 0.530 + [1] -0.4700036 -0.6348783 [1] 1.325782 1.463641 */ @@ -768,13 +768,13 @@ class GeneralizedLinearRegressionSuite Instance(0.3, 4.0, Vectors.zeros(0)) ).toDF() -val expected = Seq(0.625, 0.530, -0.4700036, -0.6348783, 0.5108256, 0.1201443, - 1.60, 1.886792, 1.325782, 1.463641) +val expected = Seq(0.5108256, 0.1201443, 1.60, 1.886792, 0.625, 0.530, + -0.4700036, -0.6348783, 1.325782, 1.463641) import GeneralizedLinearRegression._ var idx = 0 -for (family <- Seq("gaussian", "poisson", "binomial", "gamma", "tweedie")) { +for (family <- GeneralizedLinearRegression.supportedFamilyNames.sortWith(_ < _)) { for (useWeight <- Seq(false, true)) { val trainer = new GeneralizedLinearRegression().setFamily(family) if (useWeight) trainer.setWeightCol("weight") @@ -807,7 +807,7 @@ class GeneralizedLinearRegressionSuite 0.5, 2.1, 0.5, 1.0, 2.0, 0.9, 0.4, 1.0, 2.0, 1.0, 0.7, 0.7, 0.0, 3.0, 3.0), 4, 5, byrow = TRUE)) - families <- list(gaussian, binomial, poisson, Gamma, tweedie(1.5)) + families <- list(binomial, Gamma, gaussian, poisson, tweedie(1.5)) f1 <- V1 ~ -1 + V4 + V5 f2 <- V1 ~ V4 + V5 for (f in c(f1, f2)) { @@ -816,15 +816,15 @@ class GeneralizedLinearRegressionSuite print(as.vector(coef(model))) } } - [1] 0.5169222 -0.334 [1] 0.9419107 -0.6864404 - [1] 0.1812436 -0.6568422 [1] -0.2869094 0.7857710 + [1] 0.5169222 -0.334 + [1] 0.1812436 -0.6568422 [1] 0.1055254 0.2979113 - [1] -0.05990345 0.53188982 -0.32118415 [1] -0.2147117 0.9911750 -0.6356096 - [1] -1.5616130 
0.6646470 -0.3192581 [1] 0.3390397 -0.3406099 0.6870259 + [1] -0.05990345 0.53188982 -0.32118415 + [1] -1.5616130 0.6646470 -0.3192581 [1] 0.3665034 0.1039416 0.1484616 */ val dataset = Seq( @@ -835,23 +835,22 @@ class GeneralizedLinearRegressionSuite ).toDF() val expected = Seq( - Vectors.dense(0, 0.5169222, -0.334), Vectors.dense(0, 0.9419107, -0.6864404), - Vectors.dense(0, 0.1812436, -0.6568422), Vectors.dense(0, -0.2869094, 0.785771), + Vectors.dense(0, 0.5169222, -0.334), + Vectors.dense(0, 0.1812436, -0.6568422), Vectors.dense(0, 0.1055254, 0.2979113), - Vectors.dense(-0.05990345, 0.53188982, -0.32118415),
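The test above replaces a hand-written family list with `GeneralizedLinearRegression.supportedFamilyNames` sorted alphabetically, so the expected coefficients must be listed in that same order. A minimal PySpark sketch of the loop's logic (toy DataFrame `df` with `label` and `weight` columns assumed; `supportedFamilyNames` itself is not exposed to Python, so the sorted list is spelled out):

```python
from pyspark.ml.regression import GeneralizedLinearRegression

# Iterate the supported families in a fixed alphabetical order,
# fitting intercept-only models with and without instance weights.
families = sorted(["binomial", "gamma", "gaussian", "poisson", "tweedie"])
for family in families:
    for use_weight in (False, True):
        glr = GeneralizedLinearRegression(family=family)
        if family == "tweedie":
            glr.setVariancePower(1.6)  # tweedie requires an explicit variance power
        if use_weight:
            glr.setWeightCol("weight")
        # model = glr.fit(df)  # expected values are consumed in the same sorted order
```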
spark git commit: [ML] Fix scala-2.10 build failure of GeneralizedLinearRegressionSuite.
Repository: spark Updated Branches: refs/heads/master 3c2fc19d4 -> 528c9281a [ML] Fix scala-2.10 build failure of GeneralizedLinearRegressionSuite. ## What changes were proposed in this pull request? Fix scala-2.10 build failure of ```GeneralizedLinearRegressionSuite```. ## How was this patch tested? Build with scala-2.10. Author: Yanbo Liang. Closes #18489 from yanboliang/glr. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/528c9281 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/528c9281 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/528c9281 Branch: refs/heads/master Commit: 528c9281aecc49e9bff204dd303962c705c6f237 Parents: 3c2fc19 Author: Yanbo Liang Authored: Fri Jun 30 23:25:14 2017 +0800 Committer: Yanbo Liang Committed: Fri Jun 30 23:25:14 2017 +0800 -- .../ml/regression/GeneralizedLinearRegressionSuite.scala | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/528c9281/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala index cfaa573..83f1344 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala @@ -1075,7 +1075,7 @@ class GeneralizedLinearRegressionSuite val seCoefR = Array(1.23439, 0.9669, 3.56866) val tValsR = Array(0.80297, -0.65737, -0.06017) val pValsR = Array(0.42199, 0.51094, 0.95202) -val dispersionR = 1 +val dispersionR = 1.0 val nullDevianceR = 2.17561 val residualDevianceR = 0.00018 val residualDegreeOfFreedomNullR = 3 @@ -1114,7 +1114,7 @@ class GeneralizedLinearRegressionSuite assert(x._1 ~== x._2 absTol 1E-3) } summary.tValues.zip(tValsR).foreach{ x => assert(x._1 ~== x._2 absTol 1E-3) } summary.pValues.zip(pValsR).foreach{ x => assert(x._1 ~== x._2 absTol 1E-3) } -assert(summary.dispersion ~== dispersionR absTol 1E-3) +assert(summary.dispersion === dispersionR) assert(summary.nullDeviance ~== nullDevianceR absTol 1E-3) assert(summary.deviance ~== residualDevianceR absTol 1E-3) assert(summary.residualDegreeOfFreedom === residualDegreeOfFreedomR) @@ -1190,7 +1190,7 @@ class GeneralizedLinearRegressionSuite val seCoefR = Array(1.16826, 0.41703, 1.96249) val tValsR = Array(-2.46387, 2.12428, -2.32757) val pValsR = Array(0.01374, 0.03365, 0.01993) -val dispersionR = 1 +val dispersionR = 1.0 val nullDevianceR = 22.55853 val residualDevianceR = 9.5622 val residualDegreeOfFreedomNullR = 3 @@ -1229,7 +1229,7 @@ class GeneralizedLinearRegressionSuite assert(x._1 ~== x._2 absTol 1E-3) } summary.tValues.zip(tValsR).foreach{ x => assert(x._1 ~== x._2 absTol 1E-3) } summary.pValues.zip(pValsR).foreach{ x => assert(x._1 ~== x._2 absTol 1E-3) } -assert(summary.dispersion ~== dispersionR absTol 1E-3) +assert(summary.dispersion === dispersionR) assert(summary.nullDeviance ~== nullDevianceR absTol 1E-3) assert(summary.deviance ~== residualDevianceR absTol 1E-3) assert(summary.residualDegreeOfFreedom === residualDegreeOfFreedomR)
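The fix pins `dispersionR` to the `Double` literal `1.0` (the mixed `Int`/`Double` tolerance comparison is what broke the Scala 2.10 build) and switches to exact equality, which is appropriate because dispersion is fixed at 1.0 by definition for binomial and poisson families rather than estimated. The same testing principle in a small Python sketch (illustrative values):

```python
import math

dispersion = 1.0
assert dispersion == 1.0                            # exact check: the value is definitional
assert math.isclose(2.17561, 2.1756, abs_tol=1e-3)  # tolerant check: the value is estimated
```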
spark git commit: [SPARK-18710][ML] Add offset in GLM
Repository: spark Updated Branches: refs/heads/master 52981715b -> 49d767d83 [SPARK-18710][ML] Add offset in GLM ## What changes were proposed in this pull request? Add support for offset in GLM. This is useful for at least two reasons: 1. Account for exposure: e.g., when modeling the number of accidents, we may need to use miles driven as an offset to access factors on frequency. 2. Test incremental effects of new variables: we can use predictions from the existing model as offset and run a much smaller model on only new variables. This avoids re-estimating the large model with all variables (old + new) and can be very important for efficient large-scaled analysis. ## How was this patch tested? New test. yanboliang srowen felixcheung sethah Author: actuaryzhangCloses #16699 from actuaryzhang/offset. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/49d767d8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/49d767d8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/49d767d8 Branch: refs/heads/master Commit: 49d767d838691fc7d964be2c4349662f5500ff2b Parents: 5298171 Author: actuaryzhang Authored: Fri Jun 30 20:02:15 2017 +0800 Committer: Yanbo Liang Committed: Fri Jun 30 20:02:15 2017 +0800 -- .../org/apache/spark/ml/feature/Instance.scala | 21 + .../IterativelyReweightedLeastSquares.scala | 14 +- .../spark/ml/optim/WeightedLeastSquares.scala | 2 +- .../GeneralizedLinearRegression.scala | 184 -- ...IterativelyReweightedLeastSquaresSuite.scala | 40 +- .../GeneralizedLinearRegressionSuite.scala | 634 +++ 6 files changed, 534 insertions(+), 361 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/49d767d8/mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala index cce3ca4..dd56fbb 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala @@ -27,3 +27,24 @@ import org.apache.spark.ml.linalg.Vector * @param features The vector of features for this data point. */ private[ml] case class Instance(label: Double, weight: Double, features: Vector) + +/** + * Case class that represents an instance of data point with + * label, weight, offset and features. + * This is mainly used in GeneralizedLinearRegression currently. + * + * @param label Label for this data point. + * @param weight The weight of this instance. + * @param offset The offset used for this data point. + * @param features The vector of features for this data point. + */ +private[ml] case class OffsetInstance( +label: Double, +weight: Double, +offset: Double, +features: Vector) { + + /** Converts to an [[Instance]] object by leaving out the offset. 
*/ + def toInstance: Instance = Instance(label, weight, features) + +} http://git-wip-us.apache.org/repos/asf/spark/blob/49d767d8/mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala b/mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala index 9c49551..6961b45 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala @@ -18,7 +18,7 @@ package org.apache.spark.ml.optim import org.apache.spark.internal.Logging -import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.feature.{Instance, OffsetInstance} import org.apache.spark.ml.linalg._ import org.apache.spark.rdd.RDD @@ -43,7 +43,7 @@ private[ml] class IterativelyReweightedLeastSquaresModel( * find M-estimator in robust regression and other optimization problems. * * @param initialModel the initial guess model. - * @param reweightFunc the reweight function which is used to update offsets and weights + * @param reweightFunc the reweight function which is used to update working labels and weights * at each iteration. * @param fitIntercept whether to fit intercept. * @param regParam L2 regularization parameter used by WLS. @@ -57,13 +57,13 @@ private[ml] class IterativelyReweightedLeastSquaresModel( */ private[ml] class IterativelyReweightedLeastSquares( val initialModel: WeightedLeastSquaresModel, -val reweightFunc:
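The offset enters the linear predictor with a fixed coefficient of one, which is exactly what the exposure use case needs. A minimal sketch of case 1 on the Python side (column names assumed, and assuming the Python wrapper exposes the same `offsetCol` param as the Scala estimator):

```python
from pyspark.ml.regression import GeneralizedLinearRegression

# Model accident counts with log(miles driven) as the offset:
# eta = X*beta + offset, so the exposure scales the expected count.
glr = (GeneralizedLinearRegression()
       .setFamily("poisson")
       .setLink("log")
       .setOffsetCol("logExposure"))
# model = glr.fit(claims_df)  # claims_df: label, features, logExposure columns
```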
spark git commit: [SPARK-14657][SPARKR][ML] RFormula w/o intercept should output reference category when encoding string terms
Repository: spark Updated Branches: refs/heads/master 376d90d55 -> 0c8444cf6 [SPARK-14657][SPARKR][ML] RFormula w/o intercept should output reference category when encoding string terms ## What changes were proposed in this pull request? Please see [SPARK-14657](https://issues.apache.org/jira/browse/SPARK-14657) for detail of this bug. I searched online and test some other cases, found when we fit R glm model(or other models powered by R formula) w/o intercept on a dataset including string/category features, one of the categories in the first category feature is being used as reference category, we will not drop any category for that feature. I think we should keep consistent semantics between Spark RFormula and R formula. ## How was this patch tested? Add standard unit tests. cc mengxr Author: Yanbo LiangCloses #12414 from yanboliang/spark-14657. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0c8444cf Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0c8444cf Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0c8444cf Branch: refs/heads/master Commit: 0c8444cf6d0620cd219ddcf5f50b12ff648639e9 Parents: 376d90d Author: Yanbo Liang Authored: Thu Jun 29 10:32:32 2017 +0800 Committer: Yanbo Liang Committed: Thu Jun 29 10:32:32 2017 +0800 -- .../org/apache/spark/ml/feature/RFormula.scala | 10 ++- .../apache/spark/ml/feature/RFormulaSuite.scala | 83 2 files changed, 92 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0c8444cf/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala index 1fad0a6..4b44878 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala @@ -205,12 +205,20 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String) }.toMap // Then we handle one-hot encoding and interactions between terms. +var keepReferenceCategory = false val encodedTerms = resolvedFormula.terms.map { case Seq(term) if dataset.schema(term).dataType == StringType => val encodedCol = tmpColumn("onehot") -encoderStages += new OneHotEncoder() +var encoder = new OneHotEncoder() .setInputCol(indexed(term)) .setOutputCol(encodedCol) +// Formula w/o intercept, one of the categories in the first category feature is +// being used as reference category, we will not drop any category for that feature. 
+if (!hasIntercept && !keepReferenceCategory) { + encoder = encoder.setDropLast(false) + keepReferenceCategory = true +} +encoderStages += encoder prefixesToRewrite(encodedCol + "_") = term + "_" encodedCol case Seq(term) => http://git-wip-us.apache.org/repos/asf/spark/blob/0c8444cf/mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala index 41d0062..23570d6 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala @@ -213,6 +213,89 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul assert(result.collect() === expected.collect()) } + test("formula w/o intercept, we should output reference category when encoding string terms") { +/* + R code: + + df <- data.frame(id = c(1, 2, 3, 4), + a = c("foo", "bar", "bar", "baz"), + b = c("zq", "zz", "zz", "zz"), + c = c(4, 4, 5, 5)) + model.matrix(id ~ a + b + c - 1, df) + + abar abaz afoo bzz c + 1001 0 4 + 2100 1 4 + 3100 1 5 + 4010 1 5 + + model.matrix(id ~ a:b + c - 1, df) + + c abar:bzq abaz:bzq afoo:bzq abar:bzz abaz:bzz afoo:bzz + 1 4001000 + 2 4000100 + 3 5000100 + 4 5000010 +*/ +val original = Seq((1, "foo", "zq", 4), (2, "bar", "zz", 4), (3, "bar", "zz", 5), + (4, "baz", "zz", 5)).toDF("id", "a", "b",
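A minimal PySpark counterpart of the R `model.matrix` checks above (assuming a DataFrame `df` with the same id/a/b/c columns as the R snippet): with `- 1` in the formula, the first string feature now keeps its reference category, matching R.

```python
from pyspark.ml.feature import RFormula

formula = RFormula(formula="id ~ a + b + c - 1")
encoded = formula.fit(df).transform(df)
# With the fix, the encoded vectors line up with R's
# model.matrix(id ~ a + b + c - 1, df) column for column.
encoded.select("features", "label").show(truncate=False)
```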
spark git commit: [SPARK-20899][PYSPARK] PySpark supports stringIndexerOrderType in RFormula
Repository: spark Updated Branches: refs/heads/master 35b644bd0 -> ff5676b01 [SPARK-20899][PYSPARK] PySpark supports stringIndexerOrderType in RFormula ## What changes were proposed in this pull request? PySpark supports stringIndexerOrderType in RFormula as in #17967. ## How was this patch tested? docstring test Author: actuaryzhangCloses #18122 from actuaryzhang/PythonRFormula. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ff5676b0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ff5676b0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ff5676b0 Branch: refs/heads/master Commit: ff5676b01ffd8adfe753cb749582579cbd496e7f Parents: 35b644b Author: actuaryzhang Authored: Wed May 31 01:02:19 2017 +0800 Committer: Yanbo Liang Committed: Wed May 31 01:02:19 2017 +0800 -- python/pyspark/ml/feature.py | 33 - python/pyspark/ml/tests.py | 13 + 2 files changed, 41 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ff5676b0/python/pyspark/ml/feature.py -- diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py index 955bc97..77de1cc 100755 --- a/python/pyspark/ml/feature.py +++ b/python/pyspark/ml/feature.py @@ -3043,26 +3043,35 @@ class RFormula(JavaEstimator, HasFeaturesCol, HasLabelCol, JavaMLReadable, JavaM "Force to index label whether it is numeric or string", typeConverter=TypeConverters.toBoolean) +stringIndexerOrderType = Param(Params._dummy(), "stringIndexerOrderType", + "How to order categories of a string feature column used by " + + "StringIndexer. The last category after ordering is dropped " + + "when encoding strings. Supported options: frequencyDesc, " + + "frequencyAsc, alphabetDesc, alphabetAsc. The default value " + + "is frequencyDesc. When the ordering is set to alphabetDesc, " + + "RFormula drops the same category as R when encoding strings.", + typeConverter=TypeConverters.toString) + @keyword_only def __init__(self, formula=None, featuresCol="features", labelCol="label", - forceIndexLabel=False): + forceIndexLabel=False, stringIndexerOrderType="frequencyDesc"): """ __init__(self, formula=None, featuresCol="features", labelCol="label", \ - forceIndexLabel=False) + forceIndexLabel=False, stringIndexerOrderType="frequencyDesc") """ super(RFormula, self).__init__() self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.RFormula", self.uid) -self._setDefault(forceIndexLabel=False) +self._setDefault(forceIndexLabel=False, stringIndexerOrderType="frequencyDesc") kwargs = self._input_kwargs self.setParams(**kwargs) @keyword_only @since("1.5.0") def setParams(self, formula=None, featuresCol="features", labelCol="label", - forceIndexLabel=False): + forceIndexLabel=False, stringIndexerOrderType="frequencyDesc"): """ setParams(self, formula=None, featuresCol="features", labelCol="label", \ - forceIndexLabel=False) + forceIndexLabel=False, stringIndexerOrderType="frequencyDesc") Sets params for RFormula. """ kwargs = self._input_kwargs @@ -3096,6 +3105,20 @@ class RFormula(JavaEstimator, HasFeaturesCol, HasLabelCol, JavaMLReadable, JavaM """ return self.getOrDefault(self.forceIndexLabel) +@since("2.3.0") +def setStringIndexerOrderType(self, value): +""" +Sets the value of :py:attr:`stringIndexerOrderType`. +""" +return self._set(stringIndexerOrderType=value) + +@since("2.3.0") +def getStringIndexerOrderType(self): +""" +Gets the value of :py:attr:`stringIndexerOrderType` or its default value 'frequencyDesc'. 
+""" +return self.getOrDefault(self.stringIndexerOrderType) + def _create_model(self, java_model): return RFormulaModel(java_model) http://git-wip-us.apache.org/repos/asf/spark/blob/ff5676b0/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index 0daf29d..17a3947 100755 ---
spark git commit: [SPARK-14659][ML] RFormula consistent with R when handling strings
Repository: spark Updated Branches: refs/heads/master 2dbe0c528 -> f47700c9c [SPARK-14659][ML] RFormula consistent with R when handling strings ## What changes were proposed in this pull request? When handling strings, the category dropped by RFormula and R are different: - RFormula drops the least frequent level - R drops the first level after ascending alphabetical ordering This PR supports different string ordering types in StringIndexer #17879 so that RFormula can drop the same level as R when handling strings using`stringOrderType = "alphabetDesc"`. ## How was this patch tested? new tests Author: Wayne ZhangCloses #17967 from actuaryzhang/RFormula. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f47700c9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f47700c9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f47700c9 Branch: refs/heads/master Commit: f47700c9cadd72a2495f97f250790449705f631f Parents: 2dbe0c5 Author: Wayne Zhang Authored: Fri May 26 10:44:40 2017 +0800 Committer: Yanbo Liang Committed: Fri May 26 10:44:40 2017 +0800 -- .../org/apache/spark/ml/feature/RFormula.scala | 44 +- .../apache/spark/ml/feature/StringIndexer.scala | 4 +- .../apache/spark/ml/feature/RFormulaSuite.scala | 84 3 files changed, 129 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f47700c9/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala index 5a3e292..1fad0a6 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala @@ -26,7 +26,7 @@ import org.apache.spark.annotation.{Experimental, Since} import org.apache.spark.ml.{Estimator, Model, Pipeline, PipelineModel, PipelineStage, Transformer} import org.apache.spark.ml.attribute.AttributeGroup import org.apache.spark.ml.linalg.VectorUDT -import org.apache.spark.ml.param.{BooleanParam, Param, ParamMap} +import org.apache.spark.ml.param.{BooleanParam, Param, ParamMap, ParamValidators} import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} import org.apache.spark.ml.util._ import org.apache.spark.sql.{DataFrame, Dataset} @@ -37,6 +37,42 @@ import org.apache.spark.sql.types._ */ private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol { + /** + * Param for how to order categories of a string FEATURE column used by `StringIndexer`. + * The last category after ordering is dropped when encoding strings. + * Supported options: 'frequencyDesc', 'frequencyAsc', 'alphabetDesc', 'alphabetAsc'. + * The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', `RFormula` + * drops the same category as R when encoding strings. 
+ * + * The options are explained using an example `'b', 'a', 'b', 'a', 'c', 'b'`: + * {{{ + * +-+---+--+ + * | Option | Category mapped to 0 by StringIndexer | Category dropped by RFormula| + * +-+---+--+ + * | 'frequencyDesc' | most frequent category ('b') | least frequent category ('c')| + * | 'frequencyAsc' | least frequent category ('c') | most frequent category ('b') | + * | 'alphabetDesc' | last alphabetical category ('c') | first alphabetical category ('a')| + * | 'alphabetAsc' | first alphabetical category ('a') | last alphabetical category ('c') | + * +-+---+--+ + * }}} + * Note that this ordering option is NOT used for the label column. When the label column is + * indexed, it uses the default descending frequency ordering in `StringIndexer`. + * + * @group param + */ + @Since("2.3.0") + final val stringIndexerOrderType: Param[String] = new Param(this, "stringIndexerOrderType", +"How to order categories of a string FEATURE column used by StringIndexer. " + +"The last category after ordering is dropped when encoding strings. " + +s"Supported options: ${StringIndexer.supportedStringOrderType.mkString(", ")}. " + +"The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', " + +"RFormula drops the same category as R when encoding strings.", +
spark git commit: [SPARK-20768][PYSPARK][ML] Expose numPartitions (expert) param of PySpark FPGrowth.
Repository: spark Updated Branches: refs/heads/branch-2.2 9cbf39f1c -> e01f1f222 [SPARK-20768][PYSPARK][ML] Expose numPartitions (expert) param of PySpark FPGrowth. ## What changes were proposed in this pull request? Expose numPartitions (expert) param of PySpark FPGrowth. ## How was this patch tested? + [x] Pass all unit tests. Author: Yan Facai (颜发才). Closes #18058 from facaiy/ENH/pyspark_fpg_add_num_partition. (cherry picked from commit 139da116f130ed21481d3e9bdee5df4b8d7760ac) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e01f1f22 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e01f1f22 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e01f1f22 Branch: refs/heads/branch-2.2 Commit: e01f1f222bcb7c469b1e1595e9338ed478d99894 Parents: 9cbf39f Author: Yan Facai (颜发才) Authored: Thu May 25 21:40:39 2017 +0800 Committer: Yanbo Liang Committed: Thu May 25 21:40:52 2017 +0800 -- python/pyspark/ml/fpm.py | 30 +- 1 file changed, 29 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e01f1f22/python/pyspark/ml/fpm.py -- diff --git a/python/pyspark/ml/fpm.py b/python/pyspark/ml/fpm.py index 6ff7d2c..dd7dda5 100644 --- a/python/pyspark/ml/fpm.py +++ b/python/pyspark/ml/fpm.py @@ -49,6 +49,32 @@ class HasMinSupport(Params): return self.getOrDefault(self.minSupport) +class HasNumPartitions(Params): +""" +Mixin for param numPartitions: Number of partitions (at least 1) used by parallel FP-growth. +""" + +numPartitions = Param( +Params._dummy(), +"numPartitions", +"Number of partitions (at least 1) used by parallel FP-growth. " + +"By default the param is not set, " + +"and partition number of the input dataset is used.", +typeConverter=TypeConverters.toInt) + +def setNumPartitions(self, value): +""" +Sets the value of :py:attr:`numPartitions`. +""" +return self._set(numPartitions=value) + +def getNumPartitions(self): +""" +Gets the value of :py:attr:`numPartitions` or its default value. +""" +return self.getOrDefault(self.numPartitions) + + class HasMinConfidence(Params): """ Mixin for param minConfidence. @@ -127,7 +153,9 @@ class FPGrowthModel(JavaModel, JavaMLWritable, JavaMLReadable): class FPGrowth(JavaEstimator, HasItemsCol, HasPredictionCol, - HasMinSupport, HasMinConfidence, JavaMLWritable, JavaMLReadable): + HasMinSupport, HasNumPartitions, HasMinConfidence, + JavaMLWritable, JavaMLReadable): + """ .. note:: Experimental
spark git commit: [SPARK-20768][PYSPARK][ML] Expose numPartitions (expert) param of PySpark FPGrowth.
Repository: spark Updated Branches: refs/heads/master 913a6bfe4 -> 139da116f [SPARK-20768][PYSPARK][ML] Expose numPartitions (expert) param of PySpark FPGrowth. ## What changes were proposed in this pull request? Expose numPartitions (expert) param of PySpark FPGrowth. ## How was this patch tested? + [x] Pass all unit tests. Author: Yan Facai (颜发才). Closes #18058 from facaiy/ENH/pyspark_fpg_add_num_partition. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/139da116 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/139da116 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/139da116 Branch: refs/heads/master Commit: 139da116f130ed21481d3e9bdee5df4b8d7760ac Parents: 913a6bf Author: Yan Facai (颜发才) Authored: Thu May 25 21:40:39 2017 +0800 Committer: Yanbo Liang Committed: Thu May 25 21:40:39 2017 +0800 -- python/pyspark/ml/fpm.py | 30 +- 1 file changed, 29 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/139da116/python/pyspark/ml/fpm.py -- diff --git a/python/pyspark/ml/fpm.py b/python/pyspark/ml/fpm.py index 6ff7d2c..dd7dda5 100644 --- a/python/pyspark/ml/fpm.py +++ b/python/pyspark/ml/fpm.py @@ -49,6 +49,32 @@ class HasMinSupport(Params): return self.getOrDefault(self.minSupport) +class HasNumPartitions(Params): +""" +Mixin for param numPartitions: Number of partitions (at least 1) used by parallel FP-growth. +""" + +numPartitions = Param( +Params._dummy(), +"numPartitions", +"Number of partitions (at least 1) used by parallel FP-growth. " + +"By default the param is not set, " + +"and partition number of the input dataset is used.", +typeConverter=TypeConverters.toInt) + +def setNumPartitions(self, value): +""" +Sets the value of :py:attr:`numPartitions`. +""" +return self._set(numPartitions=value) + +def getNumPartitions(self): +""" +Gets the value of :py:attr:`numPartitions` or its default value. +""" +return self.getOrDefault(self.numPartitions) + + class HasMinConfidence(Params): """ Mixin for param minConfidence. @@ -127,7 +153,9 @@ class FPGrowthModel(JavaModel, JavaMLWritable, JavaMLReadable): class FPGrowth(JavaEstimator, HasItemsCol, HasPredictionCol, - HasMinSupport, HasMinConfidence, JavaMLWritable, JavaMLReadable): + HasMinSupport, HasNumPartitions, HasMinConfidence, + JavaMLWritable, JavaMLReadable): + """ .. note:: Experimental
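A minimal sketch of the newly exposed expert param (toy itemsets assumed); when `numPartitions` is left unset, FP-growth inherits the input DataFrame's partitioning:

```python
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0, ["a", "b"]), (1, ["a", "c"]), (2, ["a"])], ["id", "items"])

# Force the parallel FP-growth mining step onto 4 partitions.
fp = FPGrowth(itemsCol="items", minSupport=0.5, numPartitions=4)
model = fp.fit(df)
model.freqItemsets.show()
```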
spark git commit: [SPARK-19281][FOLLOWUP][ML] Minor fix for PySpark FPGrowth.
Repository: spark Updated Branches: refs/heads/branch-2.2 8896c4ee9 -> 9cbf39f1c [SPARK-19281][FOLLOWUP][ML] Minor fix for PySpark FPGrowth. ## What changes were proposed in this pull request? Follow-up for #17218, some minor fix for PySpark ```FPGrowth```. ## How was this patch tested? Existing UT. Author: Yanbo Liang. Closes #18089 from yanboliang/spark-19281. (cherry picked from commit 913a6bfe4b0eb6b80a03b858ab4b2767194103de) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9cbf39f1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9cbf39f1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9cbf39f1 Branch: refs/heads/branch-2.2 Commit: 9cbf39f1c74f16483865cd93d6ffc3c521e878a7 Parents: 8896c4e Author: Yanbo Liang Authored: Thu May 25 20:15:15 2017 +0800 Committer: Yanbo Liang Committed: Thu May 25 20:15:38 2017 +0800 -- python/pyspark/ml/fpm.py | 21 +++-- 1 file changed, 11 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9cbf39f1/python/pyspark/ml/fpm.py -- diff --git a/python/pyspark/ml/fpm.py b/python/pyspark/ml/fpm.py index b30d4ed..6ff7d2c 100644 --- a/python/pyspark/ml/fpm.py +++ b/python/pyspark/ml/fpm.py @@ -23,17 +23,17 @@ from pyspark.ml.param.shared import * __all__ = ["FPGrowth", "FPGrowthModel"] -class HasSupport(Params): +class HasMinSupport(Params): """ -Mixin for param support. +Mixin for param minSupport. """ minSupport = Param( Params._dummy(), "minSupport", -"""Minimal support level of the frequent pattern. [0.0, 1.0]. -Any pattern that appears more than (minSupport * size-of-the-dataset) -times will be output""", +"Minimal support level of the frequent pattern. [0.0, 1.0]. " + +"Any pattern that appears more than (minSupport * size-of-the-dataset) " + +"times will be output in the frequent itemsets.", typeConverter=TypeConverters.toFloat) def setMinSupport(self, value): @@ -49,16 +49,17 @@ class HasSupport(Params): return self.getOrDefault(self.minSupport) -class HasConfidence(Params): +class HasMinConfidence(Params): """ -Mixin for param confidence. +Mixin for param minConfidence. """ minConfidence = Param( Params._dummy(), "minConfidence", -"""Minimal confidence for generating Association Rule. [0.0, 1.0] -Note that minConfidence has no effect during fitting.""", +"Minimal confidence for generating Association Rule. [0.0, 1.0]. " + +"minConfidence will not affect the mining for frequent itemsets, " + +"but will affect the association rules generation.", typeConverter=TypeConverters.toFloat) def setMinConfidence(self, value): @@ -126,7 +127,7 @@ class FPGrowthModel(JavaModel, JavaMLWritable, JavaMLReadable): class FPGrowth(JavaEstimator, HasItemsCol, HasPredictionCol, - HasSupport, HasConfidence, JavaMLWritable, JavaMLReadable): + HasMinSupport, HasMinConfidence, JavaMLWritable, JavaMLReadable): """ .. note:: Experimental
spark git commit: [SPARK-19281][FOLLOWUP][ML] Minor fix for PySpark FPGrowth.
Repository: spark Updated Branches: refs/heads/master 3f94e64aa -> 913a6bfe4 [SPARK-19281][FOLLOWUP][ML] Minor fix for PySpark FPGrowth. ## What changes were proposed in this pull request? Follow-up for #17218, some minor fix for PySpark ```FPGrowth```. ## How was this patch tested? Existing UT. Author: Yanbo Liang. Closes #18089 from yanboliang/spark-19281. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/913a6bfe Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/913a6bfe Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/913a6bfe Branch: refs/heads/master Commit: 913a6bfe4b0eb6b80a03b858ab4b2767194103de Parents: 3f94e64 Author: Yanbo Liang Authored: Thu May 25 20:15:15 2017 +0800 Committer: Yanbo Liang Committed: Thu May 25 20:15:15 2017 +0800 -- python/pyspark/ml/fpm.py | 21 +++-- 1 file changed, 11 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/913a6bfe/python/pyspark/ml/fpm.py -- diff --git a/python/pyspark/ml/fpm.py b/python/pyspark/ml/fpm.py index b30d4ed..6ff7d2c 100644 --- a/python/pyspark/ml/fpm.py +++ b/python/pyspark/ml/fpm.py @@ -23,17 +23,17 @@ from pyspark.ml.param.shared import * __all__ = ["FPGrowth", "FPGrowthModel"] -class HasSupport(Params): +class HasMinSupport(Params): """ -Mixin for param support. +Mixin for param minSupport. """ minSupport = Param( Params._dummy(), "minSupport", -"""Minimal support level of the frequent pattern. [0.0, 1.0]. -Any pattern that appears more than (minSupport * size-of-the-dataset) -times will be output""", +"Minimal support level of the frequent pattern. [0.0, 1.0]. " + +"Any pattern that appears more than (minSupport * size-of-the-dataset) " + +"times will be output in the frequent itemsets.", typeConverter=TypeConverters.toFloat) def setMinSupport(self, value): @@ -49,16 +49,17 @@ class HasSupport(Params): return self.getOrDefault(self.minSupport) -class HasConfidence(Params): +class HasMinConfidence(Params): """ -Mixin for param confidence. +Mixin for param minConfidence. """ minConfidence = Param( Params._dummy(), "minConfidence", -"""Minimal confidence for generating Association Rule. [0.0, 1.0] -Note that minConfidence has no effect during fitting.""", +"Minimal confidence for generating Association Rule. [0.0, 1.0]. " + +"minConfidence will not affect the mining for frequent itemsets, " + +"but will affect the association rules generation.", typeConverter=TypeConverters.toFloat) def setMinConfidence(self, value): @@ -126,7 +127,7 @@ class FPGrowthModel(JavaModel, JavaMLWritable, JavaMLReadable): class FPGrowth(JavaEstimator, HasItemsCol, HasPredictionCol, - HasSupport, HasConfidence, JavaMLWritable, JavaMLReadable): + HasMinSupport, HasMinConfidence, JavaMLWritable, JavaMLReadable): """ .. note:: Experimental
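As the reworded docstrings spell out, `minConfidence` never changes which itemsets are mined, only which association rules are generated; a quick sketch (reusing the toy `df` from the example above):

```python
from pyspark.ml.fpm import FPGrowth

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.9)
model = fp.fit(df)
model.freqItemsets.show()      # identical for any minConfidence value
model.associationRules.show()  # only rules with confidence >= 0.9 survive
```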
spark git commit: [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel
Repository: spark Updated Branches: refs/heads/branch-2.0 4dd34d004 -> 72e1f83d7 [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel ## What changes were proposed in this pull request? Fixed TypeError with python3 and numpy 1.12.1. Numpy's `reshape` no longer takes floats as arguments as of 1.12. Also, python3 uses float division for `/`; we should be using `//` to ensure that `_dataWithBiasSize` doesn't get set to a float. ## How was this patch tested? Existing tests run using python3 and numpy 1.12. Author: Bago Amirbekian. Closes #18081 from MrBago/BF-py3floatbug. (cherry picked from commit bc66a77bbe2120cc21bd8da25194efca4cde13c3) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/72e1f83d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/72e1f83d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/72e1f83d Branch: refs/heads/branch-2.0 Commit: 72e1f83d78e51b53c104d1cd101c10bbe557c047 Parents: 4dd34d0 Author: Bago Amirbekian Authored: Wed May 24 22:55:38 2017 +0800 Committer: Yanbo Liang Committed: Wed May 24 23:00:01 2017 +0800 -- python/pyspark/mllib/classification.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/72e1f83d/python/pyspark/mllib/classification.py -- diff --git a/python/pyspark/mllib/classification.py b/python/pyspark/mllib/classification.py index 9f53ed0..e04eeb2 100644 --- a/python/pyspark/mllib/classification.py +++ b/python/pyspark/mllib/classification.py @@ -171,7 +171,7 @@ class LogisticRegressionModel(LinearClassificationModel): self._dataWithBiasSize = None self._weightsMatrix = None else: -self._dataWithBiasSize = self._coeff.size / (self._numClasses - 1) +self._dataWithBiasSize = self._coeff.size // (self._numClasses - 1) self._weightsMatrix = self._coeff.toArray().reshape(self._numClasses - 1, self._dataWithBiasSize)
spark git commit: [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel
Repository: spark Updated Branches: refs/heads/branch-2.1 f4538c95f -> 13adc0fc0 [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel ## What changes were proposed in this pull request? Fixed TypeError with python3 and numpy 1.12.1. Numpy's `reshape` no longer takes floats as arguments as of 1.12. Also, python3 uses float division for `/`; we should be using `//` to ensure that `_dataWithBiasSize` doesn't get set to a float. ## How was this patch tested? Existing tests run using python3 and numpy 1.12. Author: Bago Amirbekian. Closes #18081 from MrBago/BF-py3floatbug. (cherry picked from commit bc66a77bbe2120cc21bd8da25194efca4cde13c3) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/13adc0fc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/13adc0fc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/13adc0fc Branch: refs/heads/branch-2.1 Commit: 13adc0fc0e940a4ea8b703241666440357a597e3 Parents: f4538c9 Author: Bago Amirbekian Authored: Wed May 24 22:55:38 2017 +0800 Committer: Yanbo Liang Committed: Wed May 24 22:58:16 2017 +0800 -- python/pyspark/mllib/classification.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/13adc0fc/python/pyspark/mllib/classification.py -- diff --git a/python/pyspark/mllib/classification.py b/python/pyspark/mllib/classification.py index 9f53ed0..e04eeb2 100644 --- a/python/pyspark/mllib/classification.py +++ b/python/pyspark/mllib/classification.py @@ -171,7 +171,7 @@ class LogisticRegressionModel(LinearClassificationModel): self._dataWithBiasSize = None self._weightsMatrix = None else: -self._dataWithBiasSize = self._coeff.size / (self._numClasses - 1) +self._dataWithBiasSize = self._coeff.size // (self._numClasses - 1) self._weightsMatrix = self._coeff.toArray().reshape(self._numClasses - 1, self._dataWithBiasSize)
spark git commit: [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel
Repository: spark Updated Branches: refs/heads/branch-2.2 1d107242f -> 83aeac9e0 [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel ## What changes were proposed in this pull request? Fixed TypeError with python3 and numpy 1.12.1. Numpy's `reshape` no longer takes floats as arguments as of 1.12. Also, python3 uses float division for `/`; we should be using `//` to ensure that `_dataWithBiasSize` doesn't get set to a float. ## How was this patch tested? Existing tests run using python3 and numpy 1.12. Author: Bago Amirbekian. Closes #18081 from MrBago/BF-py3floatbug. (cherry picked from commit bc66a77bbe2120cc21bd8da25194efca4cde13c3) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/83aeac9e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/83aeac9e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/83aeac9e Branch: refs/heads/branch-2.2 Commit: 83aeac9e0590e99010d0af8e067822d0ed0971fe Parents: 1d10724 Author: Bago Amirbekian Authored: Wed May 24 22:55:38 2017 +0800 Committer: Yanbo Liang Committed: Wed May 24 22:56:28 2017 +0800 -- python/pyspark/mllib/classification.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/83aeac9e/python/pyspark/mllib/classification.py -- diff --git a/python/pyspark/mllib/classification.py b/python/pyspark/mllib/classification.py index 9f53ed0..e04eeb2 100644 --- a/python/pyspark/mllib/classification.py +++ b/python/pyspark/mllib/classification.py @@ -171,7 +171,7 @@ class LogisticRegressionModel(LinearClassificationModel): self._dataWithBiasSize = None self._weightsMatrix = None else: -self._dataWithBiasSize = self._coeff.size / (self._numClasses - 1) +self._dataWithBiasSize = self._coeff.size // (self._numClasses - 1) self._weightsMatrix = self._coeff.toArray().reshape(self._numClasses - 1, self._dataWithBiasSize)
spark git commit: [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel
Repository: spark Updated Branches: refs/heads/master 1816eb3be -> bc66a77bb [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel ## What changes were proposed in this pull request? Fixed TypeError with python3 and numpy 1.12.1. Numpy's `reshape` no longer takes floats as arguments as of 1.12. Also, python3 uses float division for `/`; we should be using `//` to ensure that `_dataWithBiasSize` doesn't get set to a float. ## How was this patch tested? Existing tests run using python3 and numpy 1.12. Author: Bago Amirbekian. Closes #18081 from MrBago/BF-py3floatbug. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bc66a77b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bc66a77b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bc66a77b Branch: refs/heads/master Commit: bc66a77bbe2120cc21bd8da25194efca4cde13c3 Parents: 1816eb3 Author: Bago Amirbekian Authored: Wed May 24 22:55:38 2017 +0800 Committer: Yanbo Liang Committed: Wed May 24 22:55:38 2017 +0800 -- python/pyspark/mllib/classification.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/bc66a77b/python/pyspark/mllib/classification.py -- diff --git a/python/pyspark/mllib/classification.py b/python/pyspark/mllib/classification.py index 9f53ed0..e04eeb2 100644 --- a/python/pyspark/mllib/classification.py +++ b/python/pyspark/mllib/classification.py @@ -171,7 +171,7 @@ class LogisticRegressionModel(LinearClassificationModel): self._dataWithBiasSize = None self._weightsMatrix = None else: -self._dataWithBiasSize = self._coeff.size / (self._numClasses - 1) +self._dataWithBiasSize = self._coeff.size // (self._numClasses - 1) self._weightsMatrix = self._coeff.toArray().reshape(self._numClasses - 1, self._dataWithBiasSize)
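The one-character change is easiest to see in isolation; a standalone sketch of the failure mode (illustrative sizes, not the patched code):

```python
import numpy as np

coeff = np.arange(6.0)  # stand-in for the flattened coefficient vector
num_classes = 3

bad_rows = coeff.size / (num_classes - 1)    # Python 3 true division -> 3.0, a float
# coeff.reshape(num_classes - 1, bad_rows)   # NumPy >= 1.12 raises TypeError here

rows = coeff.size // (num_classes - 1)       # floor division -> 3, an int
mat = coeff.reshape(num_classes - 1, rows)   # OK: shape (2, 3)
```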
spark git commit: [SPARK-20631][FOLLOW-UP] Fix incorrect tests.
Repository: spark Updated Branches: refs/heads/branch-2.2 e936a96ba -> 1d107242f [SPARK-20631][FOLLOW-UP] Fix incorrect tests. ## What changes were proposed in this pull request? - Fix incorrect tests for `_check_thresholds`. - Move test to `ParamTests`. ## How was this patch tested? Unit tests. Author: zero323. Closes #18085 from zero323/SPARK-20631-FOLLOW-UP. (cherry picked from commit 1816eb3bef930407dc9e083de08f5105725c55d1) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1d107242 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1d107242 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1d107242 Branch: refs/heads/branch-2.2 Commit: 1d107242f8ec842c009e0b427f6e4a8313d99aa2 Parents: e936a96 Author: zero323 Authored: Wed May 24 19:57:44 2017 +0800 Committer: Yanbo Liang Committed: Wed May 24 19:58:40 2017 +0800 -- python/pyspark/ml/tests.py | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1d107242/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index a3393c6..0daf29d 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -404,6 +404,18 @@ class ParamTests(PySparkTestCase): self.assertEqual(tp._paramMap, copied_no_extra) self.assertEqual(tp._defaultParamMap, tp_copy._defaultParamMap) +def test_logistic_regression_check_thresholds(self): +self.assertIsInstance( +LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]), +LogisticRegression +) + +self.assertRaisesRegexp( +ValueError, +"Logistic Regression getThreshold found inconsistent.*$", +LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5] +) + class EvaluatorTests(SparkSessionTestCase): @@ -807,18 +819,6 @@ class PersistenceTest(SparkSessionTestCase): except OSError: pass -def logistic_regression_check_thresholds(self): -self.assertIsInstance( -LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]), -LogisticRegressionModel -) - -self.assertRaisesRegexp( -ValueError, -"Logistic Regression getThreshold found inconsistent.*$", -LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5] -) - def _compare_params(self, m1, m2, param): """ Compare 2 ML Params instances for the given param, and assert both have the same param value
spark git commit: [SPARK-20631][FOLLOW-UP] Fix incorrect tests.
Repository: spark Updated Branches: refs/heads/master 9afcf127d -> 1816eb3be [SPARK-20631][FOLLOW-UP] Fix incorrect tests. ## What changes were proposed in this pull request? - Fix incorrect tests for `_check_thresholds`. - Move test to `ParamTests`. ## How was this patch tested? Unit tests. Author: zero323. Closes #18085 from zero323/SPARK-20631-FOLLOW-UP. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1816eb3b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1816eb3b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1816eb3b Branch: refs/heads/master Commit: 1816eb3bef930407dc9e083de08f5105725c55d1 Parents: 9afcf12 Author: zero323 Authored: Wed May 24 19:57:44 2017 +0800 Committer: Yanbo Liang Committed: Wed May 24 19:57:44 2017 +0800 -- python/pyspark/ml/tests.py | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1816eb3b/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index a3393c6..0daf29d 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -404,6 +404,18 @@ class ParamTests(PySparkTestCase): self.assertEqual(tp._paramMap, copied_no_extra) self.assertEqual(tp._defaultParamMap, tp_copy._defaultParamMap) +def test_logistic_regression_check_thresholds(self): +self.assertIsInstance( +LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]), +LogisticRegression +) + +self.assertRaisesRegexp( +ValueError, +"Logistic Regression getThreshold found inconsistent.*$", +LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5] +) + class EvaluatorTests(SparkSessionTestCase): @@ -807,18 +819,6 @@ class PersistenceTest(SparkSessionTestCase): except OSError: pass -def logistic_regression_check_thresholds(self): -self.assertIsInstance( -LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]), -LogisticRegressionModel -) - -self.assertRaisesRegexp( -ValueError, -"Logistic Regression getThreshold found inconsistent.*$", -LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5] -) - def _compare_params(self, m1, m2, param): """ Compare 2 ML Params instances for the given param, and assert both have the same param value
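The rename is more than cosmetic: `unittest` only collects methods whose names start with `test`, so the old method body, wrong assertion and all, was silently skipped. A minimal illustration:

```python
import unittest

class Example(unittest.TestCase):
    def check_thresholds(self):       # no "test" prefix: never collected, never fails
        self.fail("unreachable under the default test loader")

    def test_check_thresholds(self):  # collected and executed by the runner
        self.assertTrue(True)

if __name__ == "__main__":
    unittest.main()
```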
spark git commit: [SPARK-20764][ML][PYSPARK][FOLLOWUP] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version
Repository: spark Updated Branches: refs/heads/branch-2.2 ee9d5975e -> e936a96ba [SPARK-20764][ML][PYSPARK][FOLLOWUP] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version ## What changes were proposed in this pull request? Add test cases for PR-18062 ## How was this patch tested? The existing UT Author: Peng. Closes #18068 from mpjlu/moreTest. (cherry picked from commit 9afcf127d31b5477a539dde6e5f01861532a1c4c) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e936a96b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e936a96b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e936a96b Branch: refs/heads/branch-2.2 Commit: e936a96badfeeb2051ee35dc4b0fbecefa9bf4cb Parents: ee9d597 Author: Peng Authored: Wed May 24 19:54:17 2017 +0800 Committer: Yanbo Liang Committed: Wed May 24 19:54:58 2017 +0800 -- python/pyspark/ml/tests.py | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e936a96b/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index 51a3e8e..a3393c6 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -1066,6 +1066,7 @@ class TrainingSummaryTest(SparkSessionTestCase): self.assertAlmostEqual(s.r2, 1.0, 2) self.assertTrue(isinstance(s.residuals, DataFrame)) self.assertEqual(s.numInstances, 2) +self.assertEqual(s.degreesOfFreedom, 1) devResiduals = s.devianceResiduals self.assertTrue(isinstance(devResiduals, list) and isinstance(devResiduals[0], float)) coefStdErr = s.coefficientStandardErrors @@ -1075,7 +1076,8 @@ class TrainingSummaryTest(SparkSessionTestCase): pValues = s.pValues self.assertTrue(isinstance(pValues, list) and isinstance(pValues[0], float)) # test evaluation (with training dataset) produces a summary with same values -# one check is enough to verify a summary is returned, Scala version runs full test +# one check is enough to verify a summary is returned +# The child class LinearRegressionTrainingSummary runs full test sameSummary = model.evaluate(df) self.assertAlmostEqual(sameSummary.explainedVariance, s.explainedVariance) @@ -1093,6 +1095,7 @@ class TrainingSummaryTest(SparkSessionTestCase): self.assertEqual(s.numIterations, 1) # this should default to a single iteration of WLS self.assertTrue(isinstance(s.predictions, DataFrame)) self.assertEqual(s.predictionCol, "prediction") +self.assertEqual(s.numInstances, 2) self.assertTrue(isinstance(s.residuals(), DataFrame)) self.assertTrue(isinstance(s.residuals("pearson"), DataFrame)) coefStdErr = s.coefficientStandardErrors @@ -1111,7 +1114,8 @@ class TrainingSummaryTest(SparkSessionTestCase): self.assertTrue(isinstance(s.nullDeviance, float)) self.assertTrue(isinstance(s.dispersion, float)) # test evaluation (with training dataset) produces a summary with same values -# one check is enough to verify a summary is returned, Scala version runs full test +# one check is enough to verify a summary is returned +# The child class GeneralizedLinearRegressionTrainingSummary runs full test sameSummary = model.evaluate(df) self.assertAlmostEqual(sameSummary.deviance, s.deviance)
spark git commit: [SPARK-20764][ML][PYSPARK][FOLLOWUP] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version
Repository: spark Updated Branches: refs/heads/master d76633e3c -> 9afcf127d [SPARK-20764][ML][PYSPARK][FOLLOWUP] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version ## What changes were proposed in this pull request? Add test cases for PR-18062 ## How was this patch tested? The existing UT Author: Peng. Closes #18068 from mpjlu/moreTest. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9afcf127 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9afcf127 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9afcf127 Branch: refs/heads/master Commit: 9afcf127d31b5477a539dde6e5f01861532a1c4c Parents: d76633e Author: Peng Authored: Wed May 24 19:54:17 2017 +0800 Committer: Yanbo Liang Committed: Wed May 24 19:54:17 2017 +0800 -- python/pyspark/ml/tests.py | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9afcf127/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index 51a3e8e..a3393c6 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -1066,6 +1066,7 @@ class TrainingSummaryTest(SparkSessionTestCase): self.assertAlmostEqual(s.r2, 1.0, 2) self.assertTrue(isinstance(s.residuals, DataFrame)) self.assertEqual(s.numInstances, 2) +self.assertEqual(s.degreesOfFreedom, 1) devResiduals = s.devianceResiduals self.assertTrue(isinstance(devResiduals, list) and isinstance(devResiduals[0], float)) coefStdErr = s.coefficientStandardErrors @@ -1075,7 +1076,8 @@ class TrainingSummaryTest(SparkSessionTestCase): pValues = s.pValues self.assertTrue(isinstance(pValues, list) and isinstance(pValues[0], float)) # test evaluation (with training dataset) produces a summary with same values -# one check is enough to verify a summary is returned, Scala version runs full test +# one check is enough to verify a summary is returned +# The child class LinearRegressionTrainingSummary runs full test sameSummary = model.evaluate(df) self.assertAlmostEqual(sameSummary.explainedVariance, s.explainedVariance) @@ -1093,6 +1095,7 @@ class TrainingSummaryTest(SparkSessionTestCase): self.assertEqual(s.numIterations, 1) # this should default to a single iteration of WLS self.assertTrue(isinstance(s.predictions, DataFrame)) self.assertEqual(s.predictionCol, "prediction") +self.assertEqual(s.numInstances, 2) self.assertTrue(isinstance(s.residuals(), DataFrame)) self.assertTrue(isinstance(s.residuals("pearson"), DataFrame)) coefStdErr = s.coefficientStandardErrors @@ -1111,7 +1114,8 @@ class TrainingSummaryTest(SparkSessionTestCase): self.assertTrue(isinstance(s.nullDeviance, float)) self.assertTrue(isinstance(s.dispersion, float)) # test evaluation (with training dataset) produces a summary with same values -# one check is enough to verify a summary is returned, Scala version runs full test +# one check is enough to verify a summary is returned +# The child class GeneralizedLinearRegressionTrainingSummary runs full test sameSummary = model.evaluate(df) self.assertAlmostEqual(sameSummary.deviance, s.deviance)
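A sketch of the summary fields the new assertions exercise (a training DataFrame `df` with label, features, and weight columns is assumed):

```python
from pyspark.ml.regression import GeneralizedLinearRegression, LinearRegression

lr_summary = LinearRegression(weightCol="weight").fit(df).summary
lr_summary.numInstances      # now asserted on the Python side...
lr_summary.degreesOfFreedom  # ...together with the degrees of freedom

glr_summary = GeneralizedLinearRegression().fit(df).summary
glr_summary.numInstances     # exposed on the GLR training summary as well
```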
spark git commit: [MINOR][SPARKR][ML] Joint coefficients with intercept for SparkR linear SVM summary.
Repository: spark Updated Branches: refs/heads/master 442287ae2 -> ad09e4ca0 [MINOR][SPARKR][ML] Joint coefficients with intercept for SparkR linear SVM summary. ## What changes were proposed in this pull request? Joint coefficients with intercept for SparkR linear SVM summary. ## How was this patch tested? Existing tests. Author: Yanbo LiangCloses #18035 from yanboliang/svm-r. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ad09e4ca Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ad09e4ca Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ad09e4ca Branch: refs/heads/master Commit: ad09e4ca045715d053a672c2ba23f598f06085d8 Parents: 442287a Author: Yanbo Liang Authored: Tue May 23 16:16:14 2017 +0800 Committer: Yanbo Liang Committed: Tue May 23 16:16:14 2017 +0800 -- R/pkg/R/mllib_classification.R | 38 .../tests/testthat/test_mllib_classification.R | 3 +- .../apache/spark/ml/r/LinearSVCWrapper.scala| 12 +-- 3 files changed, 26 insertions(+), 27 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ad09e4ca/R/pkg/R/mllib_classification.R -- diff --git a/R/pkg/R/mllib_classification.R b/R/pkg/R/mllib_classification.R index 4db9cc3..306a9b8 100644 --- a/R/pkg/R/mllib_classification.R +++ b/R/pkg/R/mllib_classification.R @@ -46,15 +46,16 @@ setClass("MultilayerPerceptronClassificationModel", representation(jobj = "jobj" #' @note NaiveBayesModel since 2.0.0 setClass("NaiveBayesModel", representation(jobj = "jobj")) -#' linear SVM Model +#' Linear SVM Model #' -#' Fits an linear SVM model against a SparkDataFrame. It is a binary classifier, similar to svm in glmnet package +#' Fits a linear SVM model against a SparkDataFrame, similar to svm in e1071 package. +#' Currently only supports binary classification model with linear kernel. #' Users can print, make predictions on the produced model and save the model to the input path. #' #' @param data SparkDataFrame for training. #' @param formula A symbolic description of the model to be fitted. Currently only a few formula #'operators are supported, including '~', '.', ':', '+', and '-'. -#' @param regParam The regularization parameter. +#' @param regParam The regularization parameter. Only supports L2 regularization currently. #' @param maxIter Maximum iteration number. #' @param tol Convergence tolerance of iterations. #' @param standardization Whether to standardize the training features before fitting the model. The coefficients @@ -111,10 +112,10 @@ setMethod("spark.svmLinear", signature(data = "SparkDataFrame", formula = "formu new("LinearSVCModel", jobj = jobj) }) -# Predicted values based on an LinearSVCModel model +# Predicted values based on a LinearSVCModel model #' @param newData a SparkDataFrame for testing. -#' @return \code{predict} returns the predicted values based on an LinearSVCModel. +#' @return \code{predict} returns the predicted values based on a LinearSVCModel. #' @rdname spark.svmLinear #' @aliases predict,LinearSVCModel,SparkDataFrame-method #' @export @@ -124,13 +125,12 @@ setMethod("predict", signature(object = "LinearSVCModel"), predict_internal(object, newData) }) -# Get the summary of an LinearSVCModel +# Get the summary of a LinearSVCModel -#' @param object an LinearSVCModel fitted by \code{spark.svmLinear}. +#' @param object a LinearSVCModel fitted by \code{spark.svmLinear}. #' @return \code{summary} returns summary information of the fitted model, which is a list. 
#' The list includes \code{coefficients} (coefficients of the fitted model), -#' \code{intercept} (intercept of the fitted model), \code{numClasses} (number of classes), -#' \code{numFeatures} (number of features). +#' \code{numClasses} (number of classes), \code{numFeatures} (number of features). #' @rdname spark.svmLinear #' @aliases summary,LinearSVCModel-method #' @export @@ -138,22 +138,14 @@ setMethod("predict", signature(object = "LinearSVCModel"), setMethod("summary", signature(object = "LinearSVCModel"), function(object) { jobj <- object@jobj -features <- callJMethod(jobj, "features") -labels <- callJMethod(jobj, "labels") -coefficients <- callJMethod(jobj, "coefficients") -nCol <- length(coefficients) / length(features) -coefficients <- matrix(unlist(coefficients), ncol = nCol) -intercept <- callJMethod(jobj, "intercept") +features <- callJMethod(jobj, "rFeatures") +coefficients
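For comparison, the analogous PySpark `LinearSVC` (also new in 2.2) keeps `coefficients` and `intercept` as separate model attributes; the SparkR change above only affects presentation, folding the intercept into the named coefficients matrix returned by `summary()`. A minimal sketch with made-up data and column names (editor's illustration, not part of the patch):

```python
from pyspark.ml.classification import LinearSVC
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy binary-classification data; column names are illustrative assumptions.
df = spark.createDataFrame([
    (1.0, Vectors.dense(0.0, 1.1)),
    (0.0, Vectors.dense(2.0, 1.0)),
    (1.0, Vectors.dense(0.1, 1.3)),
], ["label", "features"])

model = LinearSVC(regParam=0.01, maxIter=10).fit(df)

# PySpark keeps these separate; SparkR's summary() now joins them into
# one coefficients matrix with the intercept as its first row.
print(model.coefficients)
print(model.intercept)
```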
spark git commit: [MINOR][SPARKR][ML] Joint coefficients with intercept for SparkR linear SVM summary.
Repository: spark Updated Branches: refs/heads/branch-2.2 06c985c1b -> dbb068f4f [MINOR][SPARKR][ML] Joint coefficients with intercept for SparkR linear SVM summary. ## What changes were proposed in this pull request? Joint coefficients with intercept for SparkR linear SVM summary. ## How was this patch tested? Existing tests. Author: Yanbo Liang. Closes #18035 from yanboliang/svm-r. (cherry picked from commit ad09e4ca045715d053a672c2ba23f598f06085d8) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dbb068f4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dbb068f4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dbb068f4 Branch: refs/heads/branch-2.2 Commit: dbb068f4f280fd48c991302f9e9728378926b1a2 Parents: 06c985c Author: Yanbo Liang Authored: Tue May 23 16:16:14 2017 +0800 Committer: Yanbo Liang Committed: Tue May 23 16:16:29 2017 +0800 -- R/pkg/R/mllib_classification.R | 38  .../tests/testthat/test_mllib_classification.R | 3 +- .../apache/spark/ml/r/LinearSVCWrapper.scala| 12 +-- 3 files changed, 26 insertions(+), 27 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dbb068f4/R/pkg/R/mllib_classification.R -- diff --git a/R/pkg/R/mllib_classification.R b/R/pkg/R/mllib_classification.R index 4db9cc3..306a9b8 100644 --- a/R/pkg/R/mllib_classification.R +++ b/R/pkg/R/mllib_classification.R @@ -46,15 +46,16 @@ setClass("MultilayerPerceptronClassificationModel", representation(jobj = "jobj" #' @note NaiveBayesModel since 2.0.0 setClass("NaiveBayesModel", representation(jobj = "jobj")) -#' linear SVM Model +#' Linear SVM Model #' -#' Fits an linear SVM model against a SparkDataFrame. It is a binary classifier, similar to svm in glmnet package +#' Fits a linear SVM model against a SparkDataFrame, similar to svm in e1071 package. +#' Currently only supports binary classification model with linear kernel. #' Users can print, make predictions on the produced model and save the model to the input path. #' #' @param data SparkDataFrame for training. #' @param formula A symbolic description of the model to be fitted. Currently only a few formula #'operators are supported, including '~', '.', ':', '+', and '-'. -#' @param regParam The regularization parameter. +#' @param regParam The regularization parameter. Only supports L2 regularization currently. #' @param maxIter Maximum iteration number. #' @param tol Convergence tolerance of iterations. #' @param standardization Whether to standardize the training features before fitting the model. The coefficients @@ -111,10 +112,10 @@ setMethod("spark.svmLinear", signature(data = "SparkDataFrame", formula = "formu new("LinearSVCModel", jobj = jobj) }) -# Predicted values based on an LinearSVCModel model +# Predicted values based on a LinearSVCModel model #' @param newData a SparkDataFrame for testing. -#' @return \code{predict} returns the predicted values based on an LinearSVCModel. +#' @return \code{predict} returns the predicted values based on a LinearSVCModel. #' @rdname spark.svmLinear #' @aliases predict,LinearSVCModel,SparkDataFrame-method #' @export @@ -124,13 +125,12 @@ setMethod("predict", signature(object = "LinearSVCModel"), predict_internal(object, newData) }) -# Get the summary of an LinearSVCModel +# Get the summary of a LinearSVCModel -#' @param object an LinearSVCModel fitted by \code{spark.svmLinear}. +#' @param object a LinearSVCModel fitted by \code{spark.svmLinear}. 
#' @return \code{summary} returns summary information of the fitted model, which is a list. #' The list includes \code{coefficients} (coefficients of the fitted model), -#' \code{intercept} (intercept of the fitted model), \code{numClasses} (number of classes), -#' \code{numFeatures} (number of features). +#' \code{numClasses} (number of classes), \code{numFeatures} (number of features). #' @rdname spark.svmLinear #' @aliases summary,LinearSVCModel-method #' @export @@ -138,22 +138,14 @@ setMethod("predict", signature(object = "LinearSVCModel"), setMethod("summary", signature(object = "LinearSVCModel"), function(object) { jobj <- object@jobj -features <- callJMethod(jobj, "features") -labels <- callJMethod(jobj, "labels") -coefficients <- callJMethod(jobj, "coefficients") -nCol <- length(coefficients) / length(features) -coefficients <- matrix(unlist(coefficients), ncol = nCol) -
spark git commit: [SPARK-20764][ML][PYSPARK] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version
Repository: spark Updated Branches: refs/heads/branch-2.2 a57553279 -> a0bf5c47c [SPARK-20764][ML][PYSPARK] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version ## What changes were proposed in this pull request? SPARK-20097 exposed degreesOfFreedom in LinearRegressionSummary and numInstances in GeneralizedLinearRegressionSummary. Python API should be updated to reflect these changes. ## How was this patch tested? The existing UT Author: Peng. Closes #18062 from mpjlu/spark-20764. (cherry picked from commit cfca01136bd7443c1d9daf8e8e256635eec20ddc) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a0bf5c47 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a0bf5c47 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a0bf5c47 Branch: refs/heads/branch-2.2 Commit: a0bf5c47cb9c72d73616f876a4521ae80e2e4ecb Parents: a575532 Author: Peng Authored: Mon May 22 22:42:37 2017 +0800 Committer: Yanbo Liang Committed: Mon May 22 22:42:56 2017 +0800 -- python/pyspark/ml/regression.py | 16 1 file changed, 16 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a0bf5c47/python/pyspark/ml/regression.py -- diff --git a/python/pyspark/ml/regression.py b/python/pyspark/ml/regression.py index 3c3fcc8..2d17f95 100644 --- a/python/pyspark/ml/regression.py +++ b/python/pyspark/ml/regression.py @@ -324,6 +324,14 @@ class LinearRegressionSummary(JavaWrapper): return self._call_java("numInstances") @property +@since("2.2.0") +def degreesOfFreedom(self): +""" +Degrees of freedom. +""" +return self._call_java("degreesOfFreedom") + +@property @since("2.0.0") def devianceResiduals(self): """ @@ -1566,6 +1574,14 @@ class GeneralizedLinearRegressionSummary(JavaWrapper): return self._call_java("predictionCol") @property +@since("2.2.0") +def numInstances(self): +""" +Number of instances in DataFrame predictions. +""" +return self._call_java("numInstances") + +@property @since("2.0.0") def rank(self): """
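A minimal PySpark sketch of the two properties this patch exposes (editor's illustration with made-up data and column names, not part of the patch):

```python
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import GeneralizedLinearRegression, LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy regression data; "label"/"features" column names are assumptions.
df = spark.createDataFrame([
    (1.0, Vectors.dense(0.0)),
    (2.0, Vectors.dense(1.0)),
    (3.0, Vectors.dense(2.0)),
    (5.0, Vectors.dense(3.0)),
], ["label", "features"])

# degreesOfFreedom on the linear regression summary, newly visible in Python.
print(LinearRegression(maxIter=5).fit(df).summary.degreesOfFreedom)

# numInstances on the GLR summary, newly visible in Python.
print(GeneralizedLinearRegression(family="gaussian").fit(df).summary.numInstances)
```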
spark git commit: [SPARK-20764][ML][PYSPARK] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version
Repository: spark Updated Branches: refs/heads/master f3ed62a38 -> cfca01136 [SPARK-20764][ML][PYSPARK] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version ## What changes were proposed in this pull request? SPARK-20097 exposed degreesOfFreedom in LinearRegressionSummary and numInstances in GeneralizedLinearRegressionSummary. Python API should be updated to reflect these changes. ## How was this patch tested? The existing UT Author: Peng. Closes #18062 from mpjlu/spark-20764. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cfca0113 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cfca0113 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cfca0113 Branch: refs/heads/master Commit: cfca01136bd7443c1d9daf8e8e256635eec20ddc Parents: f3ed62a Author: Peng Authored: Mon May 22 22:42:37 2017 +0800 Committer: Yanbo Liang Committed: Mon May 22 22:42:37 2017 +0800 -- python/pyspark/ml/regression.py | 16 1 file changed, 16 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cfca0113/python/pyspark/ml/regression.py -- diff --git a/python/pyspark/ml/regression.py b/python/pyspark/ml/regression.py index 3c3fcc8..2d17f95 100644 --- a/python/pyspark/ml/regression.py +++ b/python/pyspark/ml/regression.py @@ -324,6 +324,14 @@ class LinearRegressionSummary(JavaWrapper): return self._call_java("numInstances") @property +@since("2.2.0") +def degreesOfFreedom(self): +""" +Degrees of freedom. +""" +return self._call_java("degreesOfFreedom") + +@property @since("2.0.0") def devianceResiduals(self): """ @@ -1566,6 +1574,14 @@ class GeneralizedLinearRegressionSummary(JavaWrapper): return self._call_java("predictionCol") @property +@since("2.2.0") +def numInstances(self): +""" +Number of instances in DataFrame predictions. +""" +return self._call_java("numInstances") + +@property @since("2.0.0") def rank(self): """
spark git commit: [SPARK-20505][ML] Add docs and examples for ml.stat.Correlation and ml.stat.ChiSquareTest.
Repository: spark Updated Branches: refs/heads/branch-2.2 b8fa79cec -> ba0117c27 [SPARK-20505][ML] Add docs and examples for ml.stat.Correlation and ml.stat.ChiSquareTest. ## What changes were proposed in this pull request? Add docs and examples for ```ml.stat.Correlation``` and ```ml.stat.ChiSquareTest```. ## How was this patch tested? Generated docs and ran examples manually, successfully. Author: Yanbo Liang. Closes #17994 from yanboliang/spark-20505. (cherry picked from commit 697a5e5517e32c5ef44c273e3b26662d0eb70f24) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ba0117c2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ba0117c2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ba0117c2 Branch: refs/heads/branch-2.2 Commit: ba0117c2716a6a3b9810bc17b67f9f502c49fa9b Parents: b8fa79c Author: Yanbo Liang Authored: Thu May 18 11:54:09 2017 +0800 Committer: Yanbo Liang Committed: Thu May 18 11:54:21 2017 +0800 -- docs/_data/menu-ml.yaml | 2 + docs/ml-statistics.md | 92 .../examples/ml/JavaChiSquareTestExample.java | 75 .../examples/ml/JavaCorrelationExample.java | 72 +++ .../main/python/ml/chi_square_test_example.py | 52 +++ .../src/main/python/ml/correlation_example.py | 51 +++ .../examples/ml/ChiSquareTestExample.scala | 63 ++ .../spark/examples/ml/CorrelationExample.scala | 63 ++ 8 files changed, 470 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ba0117c2/docs/_data/menu-ml.yaml -- diff --git a/docs/_data/menu-ml.yaml b/docs/_data/menu-ml.yaml index 047423f..b5a6641 100644 --- a/docs/_data/menu-ml.yaml +++ b/docs/_data/menu-ml.yaml @@ -1,3 +1,5 @@ +- text: Basic statistics + url: ml-statistics.html - text: Pipelines url: ml-pipeline.html - text: Extracting, transforming and selecting features http://git-wip-us.apache.org/repos/asf/spark/blob/ba0117c2/docs/ml-statistics.md -- diff --git a/docs/ml-statistics.md b/docs/ml-statistics.md new file mode 100644 index 000..abfb3ca --- /dev/null +++ b/docs/ml-statistics.md @@ -0,0 +1,92 @@ +--- +layout: global +title: Basic Statistics +displayTitle: Basic Statistics +--- + + +`\[ +\newcommand{\R}{\mathbb{R}} +\newcommand{\E}{\mathbb{E}} +\newcommand{\x}{\mathbf{x}} +\newcommand{\y}{\mathbf{y}} +\newcommand{\wv}{\mathbf{w}} +\newcommand{\av}{\mathbf{\alpha}} +\newcommand{\bv}{\mathbf{b}} +\newcommand{\N}{\mathbb{N}} +\newcommand{\id}{\mathbf{I}} +\newcommand{\ind}{\mathbf{1}} +\newcommand{\0}{\mathbf{0}} +\newcommand{\unit}{\mathbf{e}} +\newcommand{\one}{\mathbf{1}} +\newcommand{\zero}{\mathbf{0}} +\]` + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). +{:toc} + +## Correlation + +Calculating the correlation between two series of data is a common operation in Statistics. In `spark.ml` +we provide the flexibility to calculate pairwise correlations among many series. The supported +correlation methods are currently Pearson's and Spearman's correlation. + + + +[`Correlation`](api/scala/index.html#org.apache.spark.ml.stat.Correlation$) +computes the correlation matrix for the input Dataset of Vectors using the specified method. +The output will be a DataFrame that contains the correlation matrix of the column of vectors. + +{% include_example scala/org/apache/spark/examples/ml/CorrelationExample.scala %} + + + +[`Correlation`](api/java/org/apache/spark/ml/stat/Correlation.html) +computes the correlation matrix for the input Dataset of Vectors using the specified method. 
+The output will be a DataFrame that contains the correlation matrix of the column of vectors. + +{% include_example java/org/apache/spark/examples/ml/JavaCorrelationExample.java %} + + + +[`Correlation`](api/python/pyspark.ml.html#pyspark.ml.stat.Correlation$) +computes the correlation matrix for the input Dataset of Vectors using the specified method. +The output will be a DataFrame that contains the correlation matrix of the column of vectors. + +{% include_example python/ml/correlation_example.py %} + + + + +## Hypothesis testing + +Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically +significant, whether this result occurred by chance or not. `spark.ml` currently supports Pearson's +Chi-squared ( $\chi^2$) tests for independence. + +`ChiSquareTest` conducts Pearson's independence test for every feature against the label. +For
spark git commit: [SPARK-20505][ML] Add docs and examples for ml.stat.Correlation and ml.stat.ChiSquareTest.
Repository: spark Updated Branches: refs/heads/master 324a904d8 -> 697a5e551 [SPARK-20505][ML] Add docs and examples for ml.stat.Correlation and ml.stat.ChiSquareTest. ## What changes were proposed in this pull request? Add docs and examples for ```ml.stat.Correlation``` and ```ml.stat.ChiSquareTest```. ## How was this patch tested? Generated docs and ran examples manually, successfully. Author: Yanbo Liang. Closes #17994 from yanboliang/spark-20505. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/697a5e55 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/697a5e55 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/697a5e55 Branch: refs/heads/master Commit: 697a5e5517e32c5ef44c273e3b26662d0eb70f24 Parents: 324a904 Author: Yanbo Liang Authored: Thu May 18 11:54:09 2017 +0800 Committer: Yanbo Liang Committed: Thu May 18 11:54:09 2017 +0800 -- docs/_data/menu-ml.yaml | 2 + docs/ml-statistics.md | 92 .../examples/ml/JavaChiSquareTestExample.java | 75 .../examples/ml/JavaCorrelationExample.java | 72 +++ .../main/python/ml/chi_square_test_example.py | 52 +++ .../src/main/python/ml/correlation_example.py | 51 +++ .../examples/ml/ChiSquareTestExample.scala | 63 ++ .../spark/examples/ml/CorrelationExample.scala | 63 ++ 8 files changed, 470 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/697a5e55/docs/_data/menu-ml.yaml -- diff --git a/docs/_data/menu-ml.yaml b/docs/_data/menu-ml.yaml index 047423f..b5a6641 100644 --- a/docs/_data/menu-ml.yaml +++ b/docs/_data/menu-ml.yaml @@ -1,3 +1,5 @@ +- text: Basic statistics + url: ml-statistics.html - text: Pipelines url: ml-pipeline.html - text: Extracting, transforming and selecting features http://git-wip-us.apache.org/repos/asf/spark/blob/697a5e55/docs/ml-statistics.md -- diff --git a/docs/ml-statistics.md b/docs/ml-statistics.md new file mode 100644 index 000..abfb3ca --- /dev/null +++ b/docs/ml-statistics.md @@ -0,0 +1,92 @@ +--- +layout: global +title: Basic Statistics +displayTitle: Basic Statistics +--- + + +`\[ +\newcommand{\R}{\mathbb{R}} +\newcommand{\E}{\mathbb{E}} +\newcommand{\x}{\mathbf{x}} +\newcommand{\y}{\mathbf{y}} +\newcommand{\wv}{\mathbf{w}} +\newcommand{\av}{\mathbf{\alpha}} +\newcommand{\bv}{\mathbf{b}} +\newcommand{\N}{\mathbb{N}} +\newcommand{\id}{\mathbf{I}} +\newcommand{\ind}{\mathbf{1}} +\newcommand{\0}{\mathbf{0}} +\newcommand{\unit}{\mathbf{e}} +\newcommand{\one}{\mathbf{1}} +\newcommand{\zero}{\mathbf{0}} +\]` + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). +{:toc} + +## Correlation + +Calculating the correlation between two series of data is a common operation in Statistics. In `spark.ml` +we provide the flexibility to calculate pairwise correlations among many series. The supported +correlation methods are currently Pearson's and Spearman's correlation. + + + +[`Correlation`](api/scala/index.html#org.apache.spark.ml.stat.Correlation$) +computes the correlation matrix for the input Dataset of Vectors using the specified method. +The output will be a DataFrame that contains the correlation matrix of the column of vectors. + +{% include_example scala/org/apache/spark/examples/ml/CorrelationExample.scala %} + + + +[`Correlation`](api/java/org/apache/spark/ml/stat/Correlation.html) +computes the correlation matrix for the input Dataset of Vectors using the specified method. +The output will be a DataFrame that contains the correlation matrix of the column of vectors. 
+ +{% include_example java/org/apache/spark/examples/ml/JavaCorrelationExample.java %} + + + +[`Correlation`](api/python/pyspark.ml.html#pyspark.ml.stat.Correlation$) +computes the correlation matrix for the input Dataset of Vectors using the specified method. +The output will be a DataFrame that contains the correlation matrix of the column of vectors. + +{% include_example python/ml/correlation_example.py %} + + + + +## Hypothesis testing + +Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically +significant, whether this result occurred by chance or not. `spark.ml` currently supports Pearson's +Chi-squared ( $\chi^2$) tests for independence. + +`ChiSquareTest` conducts Pearson's independence test for every feature against the label. +For each feature, the (feature, label) pairs are converted into a contingency matrix for which +the Chi-squared statistic is
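A condensed sketch in the spirit of the `correlation_example.py` and `chi_square_test_example.py` files this patch adds (editor's illustration; the data is made up):

```python
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import ChiSquareTest, Correlation
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A vector column for pairwise correlations; values are illustrative.
df = spark.createDataFrame([
    (Vectors.dense(1.0, 0.0, 3.0),),
    (Vectors.dense(2.0, 5.0, 1.0),),
    (Vectors.dense(4.0, 2.0, 8.0),),
], ["features"])

print(Correlation.corr(df, "features").head())              # Pearson (default)
print(Correlation.corr(df, "features", "spearman").head())  # Spearman

# ChiSquareTest needs a label column as well.
labeled = spark.createDataFrame(
    [(0.0, Vectors.dense(0.5, 10.0)), (1.0, Vectors.dense(1.5, 20.0))],
    ["label", "features"])
print(ChiSquareTest.test(labeled, "features", "label").head())
```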
spark git commit: [SPARK-20707][ML] ML deprecated APIs should be removed in major release.
Repository: spark Updated Branches: refs/heads/branch-2.2 10e599f69 -> a869e8bfd [SPARK-20707][ML] ML deprecated APIs should be removed in major release. ## What changes were proposed in this pull request? Before 2.2, MLlib kept removing APIs deprecated in the last feature/minor release. But from Spark 2.2, we decided to remove deprecated APIs only in a major release, so we need to change the corresponding annotations to tell users those will be removed in 3.0. Meanwhile, this fixes bugs in the ML documents: the original ML docs couldn't show deprecation annotations in the ```MLWriter``` and ```MLReader``` related classes; we correct that in this PR. Before: ![image](https://cloud.githubusercontent.com/assets/1962026/25939889/f8c55f20-3666-11e7-9fa2-0605bfb3ed06.png) After: ![image](https://cloud.githubusercontent.com/assets/1962026/25939870/e9b0d5be-3666-11e7-9765-5e04885e4b32.png) ## How was this patch tested? Existing tests. Author: Yanbo Liang. Closes #17946 from yanboliang/spark-20707. (cherry picked from commit d4022d49514cc1f8ffc5bfe243186ec3748df475) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a869e8bf Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a869e8bf Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a869e8bf Branch: refs/heads/branch-2.2 Commit: a869e8bfdc23b9e3796a7c4d51f91902b5a067d2 Parents: 10e599f Author: Yanbo Liang Authored: Tue May 16 10:08:23 2017 +0800 Committer: Yanbo Liang Committed: Tue May 16 10:08:35 2017 +0800 -- .../org/apache/spark/ml/tree/treeParams.scala | 60 ++-- .../org/apache/spark/ml/util/ReadWrite.scala| 4 +- python/docs/pyspark.ml.rst | 8 +++ python/pyspark/ml/util.py | 16 -- 4 files changed, 51 insertions(+), 37 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a869e8bf/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala b/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala index cd1950b..3fc3ac5 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala @@ -110,77 +110,77 @@ private[ml] trait DecisionTreeParams extends PredictorParams maxMemoryInMB -> 256, cacheNodeIds -> false, checkpointInterval -> 10) /** - * @deprecated This method is deprecated and will be removed in 2.2.0. + * @deprecated This method is deprecated and will be removed in 3.0.0. * @group setParam */ - @deprecated("This method is deprecated and will be removed in 2.2.0.", "2.1.0") + @deprecated("This method is deprecated and will be removed in 3.0.0.", "2.1.0") def setMaxDepth(value: Int): this.type = set(maxDepth, value) /** @group getParam */ final def getMaxDepth: Int = $(maxDepth) /** - * @deprecated This method is deprecated and will be removed in 2.2.0. + * @deprecated This method is deprecated and will be removed in 3.0.0. * @group setParam */ - @deprecated("This method is deprecated and will be removed in 2.2.0.", "2.1.0") + @deprecated("This method is deprecated and will be removed in 3.0.0.", "2.1.0") def setMaxBins(value: Int): this.type = set(maxBins, value) /** @group getParam */ final def getMaxBins: Int = $(maxBins) /** - * @deprecated This method is deprecated and will be removed in 2.2.0. + * @deprecated This method is deprecated and will be removed in 3.0.0. 
* @group setParam */ - @deprecated("This method is deprecated and will be removed in 2.2.0.", "2.1.0") + @deprecated("This method is deprecated and will be removed in 3.0.0.", "2.1.0") def setMinInstancesPerNode(value: Int): this.type = set(minInstancesPerNode, value) /** @group getParam */ final def getMinInstancesPerNode: Int = $(minInstancesPerNode) /** - * @deprecated This method is deprecated and will be removed in 2.2.0. + * @deprecated This method is deprecated and will be removed in 3.0.0. * @group setParam */ - @deprecated("This method is deprecated and will be removed in 2.2.0.", "2.1.0") + @deprecated("This method is deprecated and will be removed in 3.0.0.", "2.1.0") def setMinInfoGain(value: Double): this.type = set(minInfoGain, value) /** @group getParam */ final def getMinInfoGain: Double = $(minInfoGain) /** - * @deprecated This method is deprecated and will be removed in 2.2.0. + * @deprecated This method is deprecated and will be removed in 3.0.0. * @group setParam */ - @deprecated("This
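On the Python side, deprecation notes like these usually surface at runtime through the `warnings` module; a hypothetical sketch of the pattern only (the function name is invented, not from this patch):

```python
import warnings

def set_max_depth(value):
    """Hypothetical setter illustrating the deprecation pattern."""
    warnings.warn(
        "This method is deprecated and will be removed in 3.0.0.",
        DeprecationWarning)
    # ... delegate to the supported code path here ...
```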
spark git commit: [SPARK-20669][ML] LoR.family and LDA.optimizer should be case insensitive
Repository: spark Updated Branches: refs/heads/master b0888d1ac -> 9970aa096 [SPARK-20669][ML] LoR.family and LDA.optimizer should be case insensitive ## What changes were proposed in this pull request? Make param `family` in LoR and `optimizer` in LDA case-insensitive. ## How was this patch tested? Updated tests. yanboliang Author: Zheng RuiFeng. Closes #17910 from zhengruifeng/lr_family_lowercase. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9970aa09 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9970aa09 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9970aa09 Branch: refs/heads/master Commit: 9970aa0962ec253a6e838aea26a627689dc5b011 Parents: b0888d1 Author: Zheng RuiFeng Authored: Mon May 15 23:21:44 2017 +0800 Committer: Yanbo Liang Committed: Mon May 15 23:21:44 2017 +0800 -- .../ml/classification/LogisticRegression.scala | 4 +-- .../org/apache/spark/ml/clustering/LDA.scala| 30 ++-- .../LogisticRegressionSuite.scala | 11 +++ .../apache/spark/ml/clustering/LDASuite.scala | 10 +++ 4 files changed, 38 insertions(+), 17 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9970aa09/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala index 42dc7fb..0534872 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala @@ -94,7 +94,7 @@ private[classification] trait LogisticRegressionParams extends ProbabilisticClas final val family: Param[String] = new Param(this, "family", "The name of family which is a description of the label distribution to be used in the " + s"model. Supported options: ${supportedFamilyNames.mkString(", ")}.", -ParamValidators.inArray[String](supportedFamilyNames)) +(value: String) => supportedFamilyNames.contains(value.toLowerCase(Locale.ROOT))) /** @group getParam */ @Since("2.1.0") @@ -526,7 +526,7 @@ class LogisticRegression @Since("1.2.0") ( case None => histogram.length } -val isMultinomial = $(family) match { +val isMultinomial = getFamily.toLowerCase(Locale.ROOT) match { case "binomial" => require(numClasses == 1 || numClasses == 2, s"Binomial family only supports 1 or 2 " + s"outcome classes but found $numClasses.") http://git-wip-us.apache.org/repos/asf/spark/blob/9970aa09/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala b/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala index e3026c8..3da29b1 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala @@ -174,8 +174,7 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM @Since("1.6.0") final val optimizer = new Param[String](this, "optimizer", "Optimizer or inference" + " algorithm used to estimate the LDA model. 
Supported: " + supportedOptimizers.mkString(", "), -(o: String) => - ParamValidators.inArray(supportedOptimizers).apply(o.toLowerCase(Locale.ROOT))) +(value: String) => supportedOptimizers.contains(value.toLowerCase(Locale.ROOT))) /** @group getParam */ @Since("1.6.0") @@ -325,7 +324,7 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM s" ${getDocConcentration.length}, but k = $getK. docConcentration must be an array of" + s" length either 1 (scalar) or k (num topics).") } - getOptimizer match { + getOptimizer.toLowerCase(Locale.ROOT) match { case "online" => require(getDocConcentration.forall(_ >= 0), "For Online LDA optimizer, docConcentration values must be >= 0. Found values: " + @@ -337,7 +336,7 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM } } if (isSet(topicConcentration)) { - getOptimizer match { + getOptimizer.toLowerCase(Locale.ROOT) match { case "online" => require(getTopicConcentration >= 0, s"For Online LDA optimizer, topicConcentration" + s" must be >= 0. Found value: $getTopicConcentration") @@ -350,17 +349,18 @@ private[clustering] trait
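Since the validation happens JVM-side, the fix flows through to the Python and R APIs as well; a hedged PySpark sketch (editor's illustration, not part of the patch):

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.clustering import LDA

# After this patch the validators lowercase the value before matching,
# so mixed-case spellings are accepted at fit time.
lor = LogisticRegression(family="Binomial")  # matches "binomial"
lda = LDA(optimizer="Online")                # matches "online"
print(lor.getFamily(), lda.getOptimizer())
```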
spark git commit: [SPARK-20606][ML] Revert "[] ML 2.2 QA: Remove deprecated methods for ML"
Repository: spark Updated Branches: refs/heads/branch-2.2 3eb0ee06a -> 80a57fa90 [SPARK-20606][ML] Revert "[] ML 2.2 QA: Remove deprecated methods for ML" This reverts commit b8733e0ad9f5a700f385e210450fd2c10137293e. Author: Yanbo Liang. Closes #17944 from yanboliang/spark-20606-revert. (cherry picked from commit 0698e6c88ca11fdfd6e5498cab784cf6dbcdfacb) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/80a57fa9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/80a57fa9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/80a57fa9 Branch: refs/heads/branch-2.2 Commit: 80a57fa90be8dca4340345c09b4ea28fbf11a516 Parents: 3eb0ee0 Author: Yanbo Liang Authored: Thu May 11 14:48:13 2017 +0800 Committer: Yanbo Liang Committed: Thu May 11 14:48:26 2017 +0800 -- .../classification/DecisionTreeClassifier.scala | 18 ++-- .../spark/ml/classification/GBTClassifier.scala | 24 ++--- .../classification/RandomForestClassifier.scala | 24 ++--- .../ml/regression/DecisionTreeRegressor.scala | 18 ++-- .../spark/ml/regression/GBTRegressor.scala | 24 ++--- .../ml/regression/RandomForestRegressor.scala | 24 ++--- .../org/apache/spark/ml/tree/treeParams.scala | 105 +++ .../org/apache/spark/ml/util/ReadWrite.scala| 16 +++ project/MimaExcludes.scala | 68 python/pyspark/ml/util.py | 32 ++ 10 files changed, 219 insertions(+), 134 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/80a57fa9/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala index 5fb105c..9f60f08 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala @@ -54,27 +54,27 @@ class DecisionTreeClassifier @Since("1.4.0") ( /** @group setParam */ @Since("1.4.0") - def setMaxDepth(value: Int): this.type = set(maxDepth, value) + override def setMaxDepth(value: Int): this.type = set(maxDepth, value) /** @group setParam */ @Since("1.4.0") - def setMaxBins(value: Int): this.type = set(maxBins, value) + override def setMaxBins(value: Int): this.type = set(maxBins, value) /** @group setParam */ @Since("1.4.0") - def setMinInstancesPerNode(value: Int): this.type = set(minInstancesPerNode, value) + override def setMinInstancesPerNode(value: Int): this.type = set(minInstancesPerNode, value) /** @group setParam */ @Since("1.4.0") - def setMinInfoGain(value: Double): this.type = set(minInfoGain, value) + override def setMinInfoGain(value: Double): this.type = set(minInfoGain, value) /** @group expertSetParam */ @Since("1.4.0") - def setMaxMemoryInMB(value: Int): this.type = set(maxMemoryInMB, value) + override def setMaxMemoryInMB(value: Int): this.type = set(maxMemoryInMB, value) /** @group expertSetParam */ @Since("1.4.0") - def setCacheNodeIds(value: Boolean): this.type = set(cacheNodeIds, value) + override def setCacheNodeIds(value: Boolean): this.type = set(cacheNodeIds, value) /** * Specifies how often to checkpoint the cached node IDs. 
@@ -86,15 +86,15 @@ class DecisionTreeClassifier @Since("1.4.0") ( * @group setParam */ @Since("1.4.0") - def setCheckpointInterval(value: Int): this.type = set(checkpointInterval, value) + override def setCheckpointInterval(value: Int): this.type = set(checkpointInterval, value) /** @group setParam */ @Since("1.4.0") - def setImpurity(value: String): this.type = set(impurity, value) + override def setImpurity(value: String): this.type = set(impurity, value) /** @group setParam */ @Since("1.6.0") - def setSeed(value: Long): this.type = set(seed, value) + override def setSeed(value: Long): this.type = set(seed, value) override protected def train(dataset: Dataset[_]): DecisionTreeClassificationModel = { val categoricalFeatures: Map[Int, Int] = http://git-wip-us.apache.org/repos/asf/spark/blob/80a57fa9/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala index 263ed10..ade0960 100644 ---
spark git commit: [SPARK-20606][ML] Revert "[] ML 2.2 QA: Remove deprecated methods for ML"
Repository: spark Updated Branches: refs/heads/master 8ddbc431d -> 0698e6c88 [SPARK-20606][ML] Revert "[] ML 2.2 QA: Remove deprecated methods for ML" This reverts commit b8733e0ad9f5a700f385e210450fd2c10137293e. Author: Yanbo Liang. Closes #17944 from yanboliang/spark-20606-revert. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0698e6c8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0698e6c8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0698e6c8 Branch: refs/heads/master Commit: 0698e6c88ca11fdfd6e5498cab784cf6dbcdfacb Parents: 8ddbc43 Author: Yanbo Liang Authored: Thu May 11 14:48:13 2017 +0800 Committer: Yanbo Liang Committed: Thu May 11 14:48:13 2017 +0800 -- .../classification/DecisionTreeClassifier.scala | 18 ++-- .../spark/ml/classification/GBTClassifier.scala | 24 ++--- .../classification/RandomForestClassifier.scala | 24 ++--- .../ml/regression/DecisionTreeRegressor.scala | 18 ++-- .../spark/ml/regression/GBTRegressor.scala | 24 ++--- .../ml/regression/RandomForestRegressor.scala | 24 ++--- .../org/apache/spark/ml/tree/treeParams.scala | 105 +++ .../org/apache/spark/ml/util/ReadWrite.scala| 16 +++ project/MimaExcludes.scala | 68 python/pyspark/ml/util.py | 32 ++ 10 files changed, 219 insertions(+), 134 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0698e6c8/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala index 5fb105c..9f60f08 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala @@ -54,27 +54,27 @@ class DecisionTreeClassifier @Since("1.4.0") ( /** @group setParam */ @Since("1.4.0") - def setMaxDepth(value: Int): this.type = set(maxDepth, value) + override def setMaxDepth(value: Int): this.type = set(maxDepth, value) /** @group setParam */ @Since("1.4.0") - def setMaxBins(value: Int): this.type = set(maxBins, value) + override def setMaxBins(value: Int): this.type = set(maxBins, value) /** @group setParam */ @Since("1.4.0") - def setMinInstancesPerNode(value: Int): this.type = set(minInstancesPerNode, value) + override def setMinInstancesPerNode(value: Int): this.type = set(minInstancesPerNode, value) /** @group setParam */ @Since("1.4.0") - def setMinInfoGain(value: Double): this.type = set(minInfoGain, value) + override def setMinInfoGain(value: Double): this.type = set(minInfoGain, value) /** @group expertSetParam */ @Since("1.4.0") - def setMaxMemoryInMB(value: Int): this.type = set(maxMemoryInMB, value) + override def setMaxMemoryInMB(value: Int): this.type = set(maxMemoryInMB, value) /** @group expertSetParam */ @Since("1.4.0") - def setCacheNodeIds(value: Boolean): this.type = set(cacheNodeIds, value) + override def setCacheNodeIds(value: Boolean): this.type = set(cacheNodeIds, value) /** * Specifies how often to checkpoint the cached node IDs. 
@@ -86,15 +86,15 @@ class DecisionTreeClassifier @Since("1.4.0") ( * @group setParam */ @Since("1.4.0") - def setCheckpointInterval(value: Int): this.type = set(checkpointInterval, value) + override def setCheckpointInterval(value: Int): this.type = set(checkpointInterval, value) /** @group setParam */ @Since("1.4.0") - def setImpurity(value: String): this.type = set(impurity, value) + override def setImpurity(value: String): this.type = set(impurity, value) /** @group setParam */ @Since("1.6.0") - def setSeed(value: Long): this.type = set(seed, value) + override def setSeed(value: Long): this.type = set(seed, value) override protected def train(dataset: Dataset[_]): DecisionTreeClassificationModel = { val categoricalFeatures: Map[Int, Int] = http://git-wip-us.apache.org/repos/asf/spark/blob/0698e6c8/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala index 263ed10..ade0960 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala +++
spark git commit: [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params
Repository: spark Updated Branches: refs/heads/branch-2.0 46659974e -> d86dae8fe [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params ## What changes were proposed in this pull request? - Replace `getParam` calls with `getOrDefault` calls. - Fix exception message to avoid unintended `TypeError`. - Add unit tests ## How was this patch tested? New unit tests. Author: zero323. Closes #17891 from zero323/SPARK-20631. (cherry picked from commit 804949c6bf00b8e26c39d48bbcc4d0470ee84e47) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d86dae8f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d86dae8f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d86dae8f Branch: refs/heads/branch-2.0 Commit: d86dae8feec5e9bf77dd5ba0cf9caa1b955de020 Parents: 4665997 Author: zero323 Authored: Wed May 10 16:57:52 2017 +0800 Committer: Yanbo Liang Committed: Wed May 10 17:00:22 2017 +0800 -- python/pyspark/ml/classification.py | 6 +++--- python/pyspark/ml/tests.py | 12 2 files changed, 15 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d86dae8f/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index bfeda7c..0a30321 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -200,13 +200,13 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti def _checkThresholdConsistency(self): if self.isSet(self.threshold) and self.isSet(self.thresholds): -ts = self.getParam(self.thresholds) +ts = self.getOrDefault(self.thresholds) if len(ts) != 2: raise ValueError("Logistic Regression getThreshold only applies to" + " binary classification, but thresholds has length != 2." + - " thresholds: " + ",".join(ts)) + " thresholds: {0}".format(str(ts))) t = 1.0/(1.0 + ts[0]/ts[1]) -t2 = self.getParam(self.threshold) +t2 = self.getOrDefault(self.threshold) if abs(t2 - t) >= 1E-5: raise ValueError("Logistic Regression getThreshold found inconsistent values for" + " threshold (%g) and thresholds (equivalent to %g)" % (t2, t)) http://git-wip-us.apache.org/repos/asf/spark/blob/d86dae8f/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index 3c346b9..87f0aff 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -765,6 +765,18 @@ class PersistenceTest(SparkSessionTestCase): except OSError: pass +def logistic_regression_check_thresholds(self): +self.assertIsInstance( +LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]), +LogisticRegressionModel +) + +self.assertRaisesRegexp( +ValueError, +"Logistic Regression getThreshold found inconsistent.*$", +LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5] +) + def _compare_params(self, m1, m2, param): """ Compare 2 ML Params instances for the given param, and assert both have the same param value
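The underlying bug: `getParam` returns the `Param` object itself rather than its value, so the consistency check blew up with an unintended `TypeError` before it could raise the intended `ValueError`. (Note that the added test method lacks the `test_` prefix, so a standard unittest runner will not collect it.) A hedged sketch of the fixed behavior (editor's illustration, not part of the patch):

```python
from pyspark.ml.classification import LogisticRegression

# Consistent threshold/thresholds: construction succeeds.
LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5])

# Inconsistent values: with getOrDefault the check reads the actual
# values and raises a clear ValueError instead of a TypeError.
try:
    LogisticRegression(threshold=0.42, thresholds=[0.5, 0.5])
except ValueError as e:
    print(e)
```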
spark git commit: [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params
Repository: spark Updated Branches: refs/heads/master 0ef16bd4b -> 804949c6b [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params ## What changes were proposed in this pull request? - Replace `getParam` calls with `getOrDefault` calls. - Fix exception message to avoid unintended `TypeError`. - Add unit tests ## How was this patch tested? New unit tests. Author: zero323. Closes #17891 from zero323/SPARK-20631. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/804949c6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/804949c6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/804949c6 Branch: refs/heads/master Commit: 804949c6bf00b8e26c39d48bbcc4d0470ee84e47 Parents: 0ef16bd Author: zero323 Authored: Wed May 10 16:57:52 2017 +0800 Committer: Yanbo Liang Committed: Wed May 10 16:57:52 2017 +0800 -- python/pyspark/ml/classification.py | 6 +++--- python/pyspark/ml/tests.py | 12 2 files changed, 15 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/804949c6/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index a9756ea..dcc12d9 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -349,13 +349,13 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti def _checkThresholdConsistency(self): if self.isSet(self.threshold) and self.isSet(self.thresholds): -ts = self.getParam(self.thresholds) +ts = self.getOrDefault(self.thresholds) if len(ts) != 2: raise ValueError("Logistic Regression getThreshold only applies to" + " binary classification, but thresholds has length != 2." + - " thresholds: " + ",".join(ts)) + " thresholds: {0}".format(str(ts))) t = 1.0/(1.0 + ts[0]/ts[1]) -t2 = self.getParam(self.threshold) +t2 = self.getOrDefault(self.threshold) if abs(t2 - t) >= 1E-5: raise ValueError("Logistic Regression getThreshold found inconsistent values for" + " threshold (%g) and thresholds (equivalent to %g)" % (t2, t)) http://git-wip-us.apache.org/repos/asf/spark/blob/804949c6/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index 571ac4b..51a3e8e 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -807,6 +807,18 @@ class PersistenceTest(SparkSessionTestCase): except OSError: pass +def logistic_regression_check_thresholds(self): +self.assertIsInstance( +LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]), +LogisticRegressionModel +) + +self.assertRaisesRegexp( +ValueError, +"Logistic Regression getThreshold found inconsistent.*$", +LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5] +) + def _compare_params(self, m1, m2, param): """ Compare 2 ML Params instances for the given param, and assert both have the same param value
spark git commit: [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params
Repository: spark Updated Branches: refs/heads/branch-2.1 8e097890a -> 69786ea3a [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params ## What changes were proposed in this pull request? - Replace `getParam` calls with `getOrDefault` calls. - Fix exception message to avoid unintended `TypeError`. - Add unit tests ## How was this patch tested? New unit tests. Author: zero323. Closes #17891 from zero323/SPARK-20631. (cherry picked from commit 804949c6bf00b8e26c39d48bbcc4d0470ee84e47) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/69786ea3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/69786ea3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/69786ea3 Branch: refs/heads/branch-2.1 Commit: 69786ea3a972af1b29a332dc11ac507ed4368cc6 Parents: 8e09789 Author: zero323 Authored: Wed May 10 16:57:52 2017 +0800 Committer: Yanbo Liang Committed: Wed May 10 16:58:34 2017 +0800 -- python/pyspark/ml/classification.py | 6 +++--- python/pyspark/ml/tests.py | 12 2 files changed, 15 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/69786ea3/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index 570a414..2b47c40 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -238,13 +238,13 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti def _checkThresholdConsistency(self): if self.isSet(self.threshold) and self.isSet(self.thresholds): -ts = self.getParam(self.thresholds) +ts = self.getOrDefault(self.thresholds) if len(ts) != 2: raise ValueError("Logistic Regression getThreshold only applies to" + " binary classification, but thresholds has length != 2." + - " thresholds: " + ",".join(ts)) + " thresholds: {0}".format(str(ts))) t = 1.0/(1.0 + ts[0]/ts[1]) -t2 = self.getParam(self.threshold) +t2 = self.getOrDefault(self.threshold) if abs(t2 - t) >= 1E-5: raise ValueError("Logistic Regression getThreshold found inconsistent values for" + " threshold (%g) and thresholds (equivalent to %g)" % (t2, t)) http://git-wip-us.apache.org/repos/asf/spark/blob/69786ea3/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index 70e0c6d..7152036 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -808,6 +808,18 @@ class PersistenceTest(SparkSessionTestCase): except OSError: pass +def logistic_regression_check_thresholds(self): +self.assertIsInstance( +LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]), +LogisticRegressionModel +) + +self.assertRaisesRegexp( +ValueError, +"Logistic Regression getThreshold found inconsistent.*$", +LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5] +) + def _compare_params(self, m1, m2, param): """ Compare 2 ML Params instances for the given param, and assert both have the same param value
spark git commit: [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params
Repository: spark Updated Branches: refs/heads/branch-2.2 ef50a9548 -> 3ed2f4d51 [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params ## What changes were proposed in this pull request? - Replace `getParam` calls with `getOrDefault` calls. - Fix exception message to avoid unintended `TypeError`. - Add unit tests ## How was this patch tested? New unit tests. Author: zero323. Closes #17891 from zero323/SPARK-20631. (cherry picked from commit 804949c6bf00b8e26c39d48bbcc4d0470ee84e47) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3ed2f4d5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3ed2f4d5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3ed2f4d5 Branch: refs/heads/branch-2.2 Commit: 3ed2f4d516ce02dfef929195778f8214703913d8 Parents: ef50a95 Author: zero323 Authored: Wed May 10 16:57:52 2017 +0800 Committer: Yanbo Liang Committed: Wed May 10 16:58:08 2017 +0800 -- python/pyspark/ml/classification.py | 6 +++--- python/pyspark/ml/tests.py | 12 2 files changed, 15 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3ed2f4d5/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index a9756ea..dcc12d9 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -349,13 +349,13 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti def _checkThresholdConsistency(self): if self.isSet(self.threshold) and self.isSet(self.thresholds): -ts = self.getParam(self.thresholds) +ts = self.getOrDefault(self.thresholds) if len(ts) != 2: raise ValueError("Logistic Regression getThreshold only applies to" + " binary classification, but thresholds has length != 2." + - " thresholds: " + ",".join(ts)) + " thresholds: {0}".format(str(ts))) t = 1.0/(1.0 + ts[0]/ts[1]) -t2 = self.getParam(self.threshold) +t2 = self.getOrDefault(self.threshold) if abs(t2 - t) >= 1E-5: raise ValueError("Logistic Regression getThreshold found inconsistent values for" + " threshold (%g) and thresholds (equivalent to %g)" % (t2, t)) http://git-wip-us.apache.org/repos/asf/spark/blob/3ed2f4d5/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index 571ac4b..51a3e8e 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -807,6 +807,18 @@ class PersistenceTest(SparkSessionTestCase): except OSError: pass +def logistic_regression_check_thresholds(self): +self.assertIsInstance( +LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]), +LogisticRegressionModel +) + +self.assertRaisesRegexp( +ValueError, +"Logistic Regression getThreshold found inconsistent.*$", +LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5] +) + def _compare_params(self, m1, m2, param): """ Compare 2 ML Params instances for the given param, and assert both have the same param value
spark git commit: [SPARK-20606][ML] ML 2.2 QA: Remove deprecated methods for ML
Repository: spark Updated Branches: refs/heads/branch-2.2 4bbfad44e -> 4b7aa0b1d [SPARK-20606][ML] ML 2.2 QA: Remove deprecated methods for ML ## What changes were proposed in this pull request? Remove ML methods we deprecated in 2.1. ## How was this patch tested? Existing tests. Author: Yanbo Liang. Closes #17867 from yanboliang/spark-20606. (cherry picked from commit b8733e0ad9f5a700f385e210450fd2c10137293e) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4b7aa0b1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4b7aa0b1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4b7aa0b1 Branch: refs/heads/branch-2.2 Commit: 4b7aa0b1dbd85e2238acba45e8f94c097358fb72 Parents: 4bbfad4 Author: Yanbo Liang Authored: Tue May 9 17:30:37 2017 +0800 Committer: Yanbo Liang Committed: Tue May 9 17:30:50 2017 +0800 -- .../classification/DecisionTreeClassifier.scala | 18 ++-- .../spark/ml/classification/GBTClassifier.scala | 24 ++--- .../classification/RandomForestClassifier.scala | 24 ++--- .../ml/regression/DecisionTreeRegressor.scala | 18 ++-- .../spark/ml/regression/GBTRegressor.scala | 24 ++--- .../ml/regression/RandomForestRegressor.scala | 24 ++--- .../org/apache/spark/ml/tree/treeParams.scala | 105 --- .../org/apache/spark/ml/util/ReadWrite.scala| 16 --- project/MimaExcludes.scala | 68 python/pyspark/ml/util.py | 32 -- 10 files changed, 134 insertions(+), 219 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4b7aa0b1/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala index 9f60f08..5fb105c 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala @@ -54,27 +54,27 @@ class DecisionTreeClassifier @Since("1.4.0") ( /** @group setParam */ @Since("1.4.0") - override def setMaxDepth(value: Int): this.type = set(maxDepth, value) + def setMaxDepth(value: Int): this.type = set(maxDepth, value) /** @group setParam */ @Since("1.4.0") - override def setMaxBins(value: Int): this.type = set(maxBins, value) + def setMaxBins(value: Int): this.type = set(maxBins, value) /** @group setParam */ @Since("1.4.0") - override def setMinInstancesPerNode(value: Int): this.type = set(minInstancesPerNode, value) + def setMinInstancesPerNode(value: Int): this.type = set(minInstancesPerNode, value) /** @group setParam */ @Since("1.4.0") - override def setMinInfoGain(value: Double): this.type = set(minInfoGain, value) + def setMinInfoGain(value: Double): this.type = set(minInfoGain, value) /** @group expertSetParam */ @Since("1.4.0") - override def setMaxMemoryInMB(value: Int): this.type = set(maxMemoryInMB, value) + def setMaxMemoryInMB(value: Int): this.type = set(maxMemoryInMB, value) /** @group expertSetParam */ @Since("1.4.0") - override def setCacheNodeIds(value: Boolean): this.type = set(cacheNodeIds, value) + def setCacheNodeIds(value: Boolean): this.type = set(cacheNodeIds, value) /** * Specifies how often to checkpoint the cached node IDs. 
@@ -86,15 +86,15 @@ class DecisionTreeClassifier @Since("1.4.0") ( * @group setParam */ @Since("1.4.0") - override def setCheckpointInterval(value: Int): this.type = set(checkpointInterval, value) + def setCheckpointInterval(value: Int): this.type = set(checkpointInterval, value) /** @group setParam */ @Since("1.4.0") - override def setImpurity(value: String): this.type = set(impurity, value) + def setImpurity(value: String): this.type = set(impurity, value) /** @group setParam */ @Since("1.6.0") - override def setSeed(value: Long): this.type = set(seed, value) + def setSeed(value: Long): this.type = set(seed, value) override protected def train(dataset: Dataset[_]): DecisionTreeClassificationModel = { val categoricalFeatures: Map[Int, Int] = http://git-wip-us.apache.org/repos/asf/spark/blob/4b7aa0b1/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
spark git commit: [SPARK-20606][ML] ML 2.2 QA: Remove deprecated methods for ML
Repository: spark Updated Branches: refs/heads/master be53a7835 -> b8733e0ad [SPARK-20606][ML] ML 2.2 QA: Remove deprecated methods for ML ## What changes were proposed in this pull request? Remove ML methods we deprecated in 2.1. ## How was this patch tested? Existing tests. Author: Yanbo Liang. Closes #17867 from yanboliang/spark-20606. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b8733e0a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b8733e0a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b8733e0a Branch: refs/heads/master Commit: b8733e0ad9f5a700f385e210450fd2c10137293e Parents: be53a78 Author: Yanbo Liang Authored: Tue May 9 17:30:37 2017 +0800 Committer: Yanbo Liang Committed: Tue May 9 17:30:37 2017 +0800 -- .../classification/DecisionTreeClassifier.scala | 18 ++-- .../spark/ml/classification/GBTClassifier.scala | 24 ++--- .../classification/RandomForestClassifier.scala | 24 ++--- .../ml/regression/DecisionTreeRegressor.scala | 18 ++-- .../spark/ml/regression/GBTRegressor.scala | 24 ++--- .../ml/regression/RandomForestRegressor.scala | 24 ++--- .../org/apache/spark/ml/tree/treeParams.scala | 105 --- .../org/apache/spark/ml/util/ReadWrite.scala| 16 --- project/MimaExcludes.scala | 68 python/pyspark/ml/util.py | 32 -- 10 files changed, 134 insertions(+), 219 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b8733e0a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala index 9f60f08..5fb105c 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala @@ -54,27 +54,27 @@ class DecisionTreeClassifier @Since("1.4.0") ( /** @group setParam */ @Since("1.4.0") - override def setMaxDepth(value: Int): this.type = set(maxDepth, value) + def setMaxDepth(value: Int): this.type = set(maxDepth, value) /** @group setParam */ @Since("1.4.0") - override def setMaxBins(value: Int): this.type = set(maxBins, value) + def setMaxBins(value: Int): this.type = set(maxBins, value) /** @group setParam */ @Since("1.4.0") - override def setMinInstancesPerNode(value: Int): this.type = set(minInstancesPerNode, value) + def setMinInstancesPerNode(value: Int): this.type = set(minInstancesPerNode, value) /** @group setParam */ @Since("1.4.0") - override def setMinInfoGain(value: Double): this.type = set(minInfoGain, value) + def setMinInfoGain(value: Double): this.type = set(minInfoGain, value) /** @group expertSetParam */ @Since("1.4.0") - override def setMaxMemoryInMB(value: Int): this.type = set(maxMemoryInMB, value) + def setMaxMemoryInMB(value: Int): this.type = set(maxMemoryInMB, value) /** @group expertSetParam */ @Since("1.4.0") - override def setCacheNodeIds(value: Boolean): this.type = set(cacheNodeIds, value) + def setCacheNodeIds(value: Boolean): this.type = set(cacheNodeIds, value) /** * Specifies how often to checkpoint the cached node IDs. 
@@ -86,15 +86,15 @@ class DecisionTreeClassifier @Since("1.4.0") ( * @group setParam */ @Since("1.4.0") - override def setCheckpointInterval(value: Int): this.type = set(checkpointInterval, value) + def setCheckpointInterval(value: Int): this.type = set(checkpointInterval, value) /** @group setParam */ @Since("1.4.0") - override def setImpurity(value: String): this.type = set(impurity, value) + def setImpurity(value: String): this.type = set(impurity, value) /** @group setParam */ @Since("1.6.0") - override def setSeed(value: Long): this.type = set(seed, value) + def setSeed(value: Long): this.type = set(seed, value) override protected def train(dataset: Dataset[_]): DecisionTreeClassificationModel = { val categoricalFeatures: Map[Int, Int] = http://git-wip-us.apache.org/repos/asf/spark/blob/b8733e0a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala index ade0960..263ed10 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala +++
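A note on what the `override` removals above mean in practice: the deprecated setter definitions lived in the shared tree params traits (see the large deletion in `org/apache/spark/ml/tree/treeParams.scala` in the file list), so once they are gone the concrete estimators define the setters themselves and the `override` modifier has to go. Caller code is unaffected, since each setter still returns `this.type` and chains as before. A minimal sketch (the column names are placeholder assumptions, not part of this commit):

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier

// Chained setters keep working exactly as before the cleanup; only the
// place where they are defined (estimator vs. deprecated trait) changed.
val dtc = new DecisionTreeClassifier()
  .setLabelCol("label")        // assumed column name
  .setFeaturesCol("features")  // assumed column name
  .setMaxDepth(5)
  .setMaxBins(32)
  .setImpurity("gini")
  .setSeed(42L)
```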
spark git commit: [SPARK-20574][ML] Allow Bucketizer to handle non-Double numeric column
Repository: spark Updated Branches: refs/heads/master bfc8c79c8 -> 0d16faab9 [SPARK-20574][ML] Allow Bucketizer to handle non-Double numeric column ## What changes were proposed in this pull request? Bucketizer currently requires the input column to be Double, but the logic should work on any numeric data type. Many practical problems have integer/float data types, and it could get very tedious to manually cast them into Double before calling Bucketizer. This PR extends Bucketizer to handle all numeric types. ## How was this patch tested? New test. Author: Wayne Zhang Closes #17840 from actuaryzhang/bucketizer. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0d16faab Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0d16faab Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0d16faab Branch: refs/heads/master Commit: 0d16faab90e4cd1f73c5b749dbda7bc2a400b26f Parents: bfc8c79 Author: Wayne Zhang Authored: Fri May 5 10:23:58 2017 +0800 Committer: Yanbo Liang Committed: Fri May 5 10:23:58 2017 +0800 -- .../apache/spark/ml/feature/Bucketizer.scala| 4 ++-- .../spark/ml/feature/BucketizerSuite.scala | 25 2 files changed, 27 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0d16faab/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala index d1f3b2a..bb8f2a3 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala @@ -116,7 +116,7 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String Bucketizer.binarySearchForBuckets($(splits), feature, keepInvalid) } -val newCol = bucketizer(filteredDataset($(inputCol))) +val newCol = bucketizer(filteredDataset($(inputCol)).cast(DoubleType)) val newField = prepOutputField(filteredDataset.schema) filteredDataset.withColumn($(outputCol), newCol, newField.metadata) } @@ -130,7 +130,7 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String @Since("1.4.0") override def transformSchema(schema: StructType): StructType = { -SchemaUtils.checkColumnType(schema, $(inputCol), DoubleType) +SchemaUtils.checkNumericType(schema, $(inputCol)) SchemaUtils.appendColumn(schema, prepOutputField(schema)) } http://git-wip-us.apache.org/repos/asf/spark/blob/0d16faab/mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala index aac2913..420fb17 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala @@ -26,6 +26,8 @@ import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils} import org.apache.spark.ml.util.TestingUtils._ import org.apache.spark.mllib.util.MLlibTestSparkContext import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { @@ -162,6 +164,29 @@ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa .setSplits(Array(0.1, 0.8, 0.9)) testDefaultReadWrite(t) } + + test("Bucket numeric
features") { +val splits = Array(-3.0, 0.0, 3.0) +val data = Array(-2.0, -1.0, 0.0, 1.0, 2.0) +val expectedBuckets = Array(0.0, 0.0, 1.0, 1.0, 1.0) +val dataFrame: DataFrame = data.zip(expectedBuckets).toSeq.toDF("feature", "expected") + +val bucketizer: Bucketizer = new Bucketizer() + .setInputCol("feature") + .setOutputCol("result") + .setSplits(splits) + +val types = Seq(ShortType, IntegerType, LongType, FloatType, DoubleType, + ByteType, DecimalType(10, 0)) +for (mType <- types) { + val df = dataFrame.withColumn("feature", col("feature").cast(mType)) + bucketizer.transform(df).select("result", "expected").collect().foreach { +case Row(x: Double, y: Double) => + assert(x === y, "The result is not correct after bucketing in type " + +mType.toString + ". " + s"Expected $y but found $x.") + } +} + } } private object BucketizerSuite extends
spark git commit: [SPARK-20574][ML] Allow Bucketizer to handle non-Double numeric column
Repository: spark Updated Branches: refs/heads/branch-2.2 425ed26d2 -> c8756288d [SPARK-20574][ML] Allow Bucketizer to handle non-Double numeric column ## What changes were proposed in this pull request? Bucketizer currently requires the input column to be Double, but the logic should work on any numeric data type. Many practical problems have integer/float data types, and it could get very tedious to manually cast them into Double before calling Bucketizer. This PR extends Bucketizer to handle all numeric types. ## How was this patch tested? New test. Author: Wayne Zhang Closes #17840 from actuaryzhang/bucketizer. (cherry picked from commit 0d16faab90e4cd1f73c5b749dbda7bc2a400b26f) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c8756288 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c8756288 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c8756288 Branch: refs/heads/branch-2.2 Commit: c8756288de12cfd9528d8d3ff73ff600909d657a Parents: 425ed26 Author: Wayne Zhang Authored: Fri May 5 10:23:58 2017 +0800 Committer: Yanbo Liang Committed: Fri May 5 10:24:12 2017 +0800 -- .../apache/spark/ml/feature/Bucketizer.scala| 4 ++-- .../spark/ml/feature/BucketizerSuite.scala | 25 2 files changed, 27 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c8756288/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala index d1f3b2a..bb8f2a3 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala @@ -116,7 +116,7 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String Bucketizer.binarySearchForBuckets($(splits), feature, keepInvalid) } -val newCol = bucketizer(filteredDataset($(inputCol))) +val newCol = bucketizer(filteredDataset($(inputCol)).cast(DoubleType)) val newField = prepOutputField(filteredDataset.schema) filteredDataset.withColumn($(outputCol), newCol, newField.metadata) } @@ -130,7 +130,7 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String @Since("1.4.0") override def transformSchema(schema: StructType): StructType = { -SchemaUtils.checkColumnType(schema, $(inputCol), DoubleType) +SchemaUtils.checkNumericType(schema, $(inputCol)) SchemaUtils.appendColumn(schema, prepOutputField(schema)) } http://git-wip-us.apache.org/repos/asf/spark/blob/c8756288/mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala index aac2913..420fb17 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala @@ -26,6 +26,8 @@ import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils} import org.apache.spark.ml.util.TestingUtils._ import org.apache.spark.mllib.util.MLlibTestSparkContext import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { @@ -162,6 +164,29 @@ class BucketizerSuite extends SparkFunSuite with
MLlibTestSparkContext with Defa .setSplits(Array(0.1, 0.8, 0.9)) testDefaultReadWrite(t) } + + test("Bucket numeric features") { +val splits = Array(-3.0, 0.0, 3.0) +val data = Array(-2.0, -1.0, 0.0, 1.0, 2.0) +val expectedBuckets = Array(0.0, 0.0, 1.0, 1.0, 1.0) +val dataFrame: DataFrame = data.zip(expectedBuckets).toSeq.toDF("feature", "expected") + +val bucketizer: Bucketizer = new Bucketizer() + .setInputCol("feature") + .setOutputCol("result") + .setSplits(splits) + +val types = Seq(ShortType, IntegerType, LongType, FloatType, DoubleType, + ByteType, DecimalType(10, 0)) +for (mType <- types) { + val df = dataFrame.withColumn("feature", col("feature").cast(mType)) + bucketizer.transform(df).select("result", "expected").collect().foreach { +case Row(x: Double, y: Double) => + assert(x === y, "The result is not correct after bucketing in type " + +
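A hedged sketch of what the patch enables, mirroring the expectations in the new test above (assumes a local `SparkSession` named `spark`; the app and column names are illustrative): an integer input column can now be bucketized directly, because `transformSchema` accepts any numeric type and the value is cast to `DoubleType` inside `transform`.

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("bucketizer-sketch").getOrCreate()
import spark.implicits._

// IntegerType column; previously this required a manual cast to Double.
val df = Seq(-2, -1, 0, 1, 2).toDF("feature")

val bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(Array(-3.0, 0.0, 3.0))

// Negative values fall into bucket 0.0, the rest into bucket 1.0,
// matching the expected buckets in the new test.
bucketizer.transform(df).show()
```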
spark git commit: [SPARK-20047][FOLLOWUP][ML] Constrained Logistic Regression follow up
Repository: spark Updated Branches: refs/heads/branch-2.2 b6727795f -> 425ed26d2 [SPARK-20047][FOLLOWUP][ML] Constrained Logistic Regression follow up ## What changes were proposed in this pull request? Address some minor comments for #17715: * Put bound-constrained optimization params under expertParams. * Update some docs. ## How was this patch tested? Existing tests. Author: Yanbo LiangCloses #17829 from yanboliang/spark-20047-followup. (cherry picked from commit c5dceb8c65545169bc96628140b5acdaa85dd226) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/425ed26d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/425ed26d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/425ed26d Branch: refs/heads/branch-2.2 Commit: 425ed26d2a0f6d3308bdb4fcbf0cedc6ef12612e Parents: b672779 Author: Yanbo Liang Authored: Thu May 4 17:56:43 2017 +0800 Committer: Yanbo Liang Committed: Thu May 4 17:57:08 2017 +0800 -- .../ml/classification/LogisticRegression.scala | 54 +--- 1 file changed, 35 insertions(+), 19 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/425ed26d/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala index d7dde32..42dc7fb 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala @@ -183,14 +183,15 @@ private[classification] trait LogisticRegressionParams extends ProbabilisticClas * The bound matrix must be compatible with the shape (1, number of features) for binomial * regression, or (number of classes, number of features) for multinomial regression. * Otherwise, it throws exception. + * Default is none. * - * @group param + * @group expertParam */ @Since("2.2.0") val lowerBoundsOnCoefficients: Param[Matrix] = new Param(this, "lowerBoundsOnCoefficients", "The lower bounds on coefficients if fitting under bound constrained optimization.") - /** @group getParam */ + /** @group expertGetParam */ @Since("2.2.0") def getLowerBoundsOnCoefficients: Matrix = $(lowerBoundsOnCoefficients) @@ -199,14 +200,15 @@ private[classification] trait LogisticRegressionParams extends ProbabilisticClas * The bound matrix must be compatible with the shape (1, number of features) for binomial * regression, or (number of classes, number of features) for multinomial regression. * Otherwise, it throws exception. + * Default is none. * - * @group param + * @group expertParam */ @Since("2.2.0") val upperBoundsOnCoefficients: Param[Matrix] = new Param(this, "upperBoundsOnCoefficients", "The upper bounds on coefficients if fitting under bound constrained optimization.") - /** @group getParam */ + /** @group expertGetParam */ @Since("2.2.0") def getUpperBoundsOnCoefficients: Matrix = $(upperBoundsOnCoefficients) @@ -214,14 +216,15 @@ private[classification] trait LogisticRegressionParams extends ProbabilisticClas * The lower bounds on intercepts if fitting under bound constrained optimization. * The bounds vector size must be equal with 1 for binomial regression, or the number * of classes for multinomial regression. Otherwise, it throws exception. + * Default is none. 
* - * @group param + * @group expertParam */ @Since("2.2.0") val lowerBoundsOnIntercepts: Param[Vector] = new Param(this, "lowerBoundsOnIntercepts", "The lower bounds on intercepts if fitting under bound constrained optimization.") - /** @group getParam */ + /** @group expertGetParam */ @Since("2.2.0") def getLowerBoundsOnIntercepts: Vector = $(lowerBoundsOnIntercepts) @@ -229,14 +232,15 @@ private[classification] trait LogisticRegressionParams extends ProbabilisticClas * The upper bounds on intercepts if fitting under bound constrained optimization. * The bound vector size must be equal with 1 for binomial regression, or the number * of classes for multinomial regression. Otherwise, it throws exception. + * Default is none. * - * @group param + * @group expertParam */ @Since("2.2.0") val upperBoundsOnIntercepts: Param[Vector] = new Param(this, "upperBoundsOnIntercepts", "The upper bounds on intercepts if fitting under bound constrained optimization.") - /** @group getParam */ + /** @group
spark git commit: [SPARK-20047][FOLLOWUP][ML] Constrained Logistic Regression follow up
Repository: spark Updated Branches: refs/heads/master 57b64703e -> c5dceb8c6 [SPARK-20047][FOLLOWUP][ML] Constrained Logistic Regression follow up ## What changes were proposed in this pull request? Address some minor comments for #17715: * Put bound-constrained optimization params under expertParams. * Update some docs. ## How was this patch tested? Existing tests. Author: Yanbo LiangCloses #17829 from yanboliang/spark-20047-followup. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c5dceb8c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c5dceb8c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c5dceb8c Branch: refs/heads/master Commit: c5dceb8c65545169bc96628140b5acdaa85dd226 Parents: 57b6470 Author: Yanbo Liang Authored: Thu May 4 17:56:43 2017 +0800 Committer: Yanbo Liang Committed: Thu May 4 17:56:43 2017 +0800 -- .../ml/classification/LogisticRegression.scala | 54 +--- 1 file changed, 35 insertions(+), 19 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c5dceb8c/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala index d7dde32..42dc7fb 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala @@ -183,14 +183,15 @@ private[classification] trait LogisticRegressionParams extends ProbabilisticClas * The bound matrix must be compatible with the shape (1, number of features) for binomial * regression, or (number of classes, number of features) for multinomial regression. * Otherwise, it throws exception. + * Default is none. * - * @group param + * @group expertParam */ @Since("2.2.0") val lowerBoundsOnCoefficients: Param[Matrix] = new Param(this, "lowerBoundsOnCoefficients", "The lower bounds on coefficients if fitting under bound constrained optimization.") - /** @group getParam */ + /** @group expertGetParam */ @Since("2.2.0") def getLowerBoundsOnCoefficients: Matrix = $(lowerBoundsOnCoefficients) @@ -199,14 +200,15 @@ private[classification] trait LogisticRegressionParams extends ProbabilisticClas * The bound matrix must be compatible with the shape (1, number of features) for binomial * regression, or (number of classes, number of features) for multinomial regression. * Otherwise, it throws exception. + * Default is none. * - * @group param + * @group expertParam */ @Since("2.2.0") val upperBoundsOnCoefficients: Param[Matrix] = new Param(this, "upperBoundsOnCoefficients", "The upper bounds on coefficients if fitting under bound constrained optimization.") - /** @group getParam */ + /** @group expertGetParam */ @Since("2.2.0") def getUpperBoundsOnCoefficients: Matrix = $(upperBoundsOnCoefficients) @@ -214,14 +216,15 @@ private[classification] trait LogisticRegressionParams extends ProbabilisticClas * The lower bounds on intercepts if fitting under bound constrained optimization. * The bounds vector size must be equal with 1 for binomial regression, or the number * of classes for multinomial regression. Otherwise, it throws exception. + * Default is none. 
* - * @group param + * @group expertParam */ @Since("2.2.0") val lowerBoundsOnIntercepts: Param[Vector] = new Param(this, "lowerBoundsOnIntercepts", "The lower bounds on intercepts if fitting under bound constrained optimization.") - /** @group getParam */ + /** @group expertGetParam */ @Since("2.2.0") def getLowerBoundsOnIntercepts: Vector = $(lowerBoundsOnIntercepts) @@ -229,14 +232,15 @@ private[classification] trait LogisticRegressionParams extends ProbabilisticClas * The upper bounds on intercepts if fitting under bound constrained optimization. * The bound vector size must be equal with 1 for binomial regression, or the number * of classes for multinomial regression. Otherwise, it throws exception. + * Default is none. * - * @group param + * @group expertParam */ @Since("2.2.0") val upperBoundsOnIntercepts: Param[Vector] = new Param(this, "upperBoundsOnIntercepts", "The upper bounds on intercepts if fitting under bound constrained optimization.") - /** @group getParam */ + /** @group expertGetParam */ @Since("2.2.0") def getUpperBoundsOnIntercepts: Vector = $(upperBoundsOnIntercepts) @@ -256,7 +260,7 @@
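For context, a hedged sketch of how the expert params documented above are used (the training DataFrame `training` and its three features are assumptions, not part of this commit): for binomial regression the coefficient bound matrix must have shape (1, number of features) and the intercept bound vector must have length 1, as the param docs state.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Matrices, Vectors}

// Constrain a binomial model to non-negative coefficients and intercept.
val lr = new LogisticRegression()
  .setFitIntercept(true)
  .setLowerBoundsOnCoefficients(Matrices.dense(1, 3, Array(0.0, 0.0, 0.0)))
  .setLowerBoundsOnIntercepts(Vectors.dense(0.0))
// val model = lr.fit(training)  // `training` is a placeholder DataFrame
```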
spark git commit: [MINOR][ML] Fix some PySpark & SparkR flaky tests
Repository: spark Updated Branches: refs/heads/branch-2.2 612952251 -> 34dec68d7 [MINOR][ML] Fix some PySpark & SparkR flaky tests ## What changes were proposed in this pull request? Some PySpark & SparkR tests run with a tiny dataset and a tiny ```maxIter```, which means they have not converged. I don't think checking intermediate results during iteration makes sense, and these intermediate results may be fragile and unstable, so we should switch to checking the converged results. We hit this issue at #17746 when we upgraded breeze to 0.13.1. ## How was this patch tested? Existing tests. Author: Yanbo Liang Closes #17757 from yanboliang/flaky-test. (cherry picked from commit dbb06c689c157502cb081421baecce411832aad8) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/34dec68d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/34dec68d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/34dec68d Branch: refs/heads/branch-2.2 Commit: 34dec68d7eb647d997fdb27fe65d579c74b39e58 Parents: 6129522 Author: Yanbo Liang Authored: Wed Apr 26 21:34:18 2017 +0800 Committer: Yanbo Liang Committed: Wed Apr 26 21:34:35 2017 +0800 -- .../tests/testthat/test_mllib_classification.R | 17 + python/pyspark/ml/classification.py | 71 ++-- 2 files changed, 38 insertions(+), 50 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/34dec68d/R/pkg/inst/tests/testthat/test_mllib_classification.R -- diff --git a/R/pkg/inst/tests/testthat/test_mllib_classification.R b/R/pkg/inst/tests/testthat/test_mllib_classification.R index af7cbdc..cbc7087 100644 --- a/R/pkg/inst/tests/testthat/test_mllib_classification.R +++ b/R/pkg/inst/tests/testthat/test_mllib_classification.R @@ -284,22 +284,11 @@ test_that("spark.mlp", { c("1.0", "1.0", "1.0", "1.0", "0.0", "1.0", "2.0", "2.0", "1.0", "0.0")) # test initialWeights - model <- spark.mlp(df, label ~ features, layers = c(4, 3), maxIter = 2, initialWeights = + model <- spark.mlp(df, label ~ features, layers = c(4, 3), initialWeights = c(0, 0, 0, 0, 0, 5, 5, 5, 5, 5, 9, 9, 9, 9, 9)) mlpPredictions <- collect(select(predict(model, mlpTestDF), "prediction")) expect_equal(head(mlpPredictions$prediction, 10), - c("1.0", "1.0", "2.0", "1.0", "2.0", "1.0", "2.0", "2.0", "1.0", "0.0")) - - model <- spark.mlp(df, label ~ features, layers = c(4, 3), maxIter = 2, initialWeights = -c(0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0, 5.0, 9.0, 9.0, 9.0, 9.0, 9.0)) - mlpPredictions <- collect(select(predict(model, mlpTestDF), "prediction")) - expect_equal(head(mlpPredictions$prediction, 10), - c("1.0", "1.0", "2.0", "1.0", "2.0", "1.0", "2.0", "2.0", "1.0", "0.0")) - - model <- spark.mlp(df, label ~ features, layers = c(4, 3), maxIter = 2) - mlpPredictions <- collect(select(predict(model, mlpTestDF), "prediction")) - expect_equal(head(mlpPredictions$prediction, 10), - c("1.0", "1.0", "1.0", "1.0", "0.0", "1.0", "0.0", "0.0", "1.0", "0.0")) + c("1.0", "1.0", "1.0", "1.0", "0.0", "1.0", "2.0", "2.0", "1.0", "0.0")) # Test formula works well df <- suppressWarnings(createDataFrame(iris)) @@ -310,8 +299,6 @@ test_that("spark.mlp", { expect_equal(summary$numOfOutputs, 3) expect_equal(summary$layers, c(4, 3)) expect_equal(length(summary$weights), 15) - expect_equal(head(summary$weights, 5), list(-0.5793153, -4.652961, 6.216155, -6.649478, - -10.51147), tolerance = 1e-3) }) test_that("spark.naiveBayes", { http://git-wip-us.apache.org/repos/asf/spark/blob/34dec68d/python/pyspark/ml/classification.py -- diff
--git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index 8649683..a9756ea 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -185,34 +185,33 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti >>> from pyspark.sql import Row >>> from pyspark.ml.linalg import Vectors >>> bdf = sc.parallelize([ -... Row(label=1.0, weight=2.0, features=Vectors.dense(1.0)), -... Row(label=0.0, weight=2.0, features=Vectors.sparse(1, [], []))]).toDF() ->>> blor = LogisticRegression(maxIter=5, regParam=0.01, weightCol="weight") +... Row(label=1.0, weight=1.0, features=Vectors.dense(0.0, 5.0)), +... Row(label=0.0, weight=2.0, features=Vectors.dense(1.0, 2.0)), +... Row(label=1.0, weight=3.0, features=Vectors.dense(2.0, 1.0)), +
spark git commit: [MINOR][ML] Fix some PySpark & SparkR flaky tests
Repository: spark Updated Branches: refs/heads/master 7fecf5130 -> dbb06c689 [MINOR][ML] Fix some PySpark & SparkR flaky tests ## What changes were proposed in this pull request? Some PySpark & SparkR tests run with a tiny dataset and a tiny ```maxIter```, which means they have not converged. I don't think checking intermediate results during iteration makes sense, and these intermediate results may be fragile and unstable, so we should switch to checking the converged results. We hit this issue at #17746 when we upgraded breeze to 0.13.1. ## How was this patch tested? Existing tests. Author: Yanbo Liang Closes #17757 from yanboliang/flaky-test. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dbb06c68 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dbb06c68 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dbb06c68 Branch: refs/heads/master Commit: dbb06c689c157502cb081421baecce411832aad8 Parents: 7fecf51 Author: Yanbo Liang Authored: Wed Apr 26 21:34:18 2017 +0800 Committer: Yanbo Liang Committed: Wed Apr 26 21:34:18 2017 +0800 -- .../tests/testthat/test_mllib_classification.R | 17 + python/pyspark/ml/classification.py | 71 ++-- 2 files changed, 38 insertions(+), 50 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dbb06c68/R/pkg/inst/tests/testthat/test_mllib_classification.R -- diff --git a/R/pkg/inst/tests/testthat/test_mllib_classification.R b/R/pkg/inst/tests/testthat/test_mllib_classification.R index af7cbdc..cbc7087 100644 --- a/R/pkg/inst/tests/testthat/test_mllib_classification.R +++ b/R/pkg/inst/tests/testthat/test_mllib_classification.R @@ -284,22 +284,11 @@ test_that("spark.mlp", { c("1.0", "1.0", "1.0", "1.0", "0.0", "1.0", "2.0", "2.0", "1.0", "0.0")) # test initialWeights - model <- spark.mlp(df, label ~ features, layers = c(4, 3), maxIter = 2, initialWeights = + model <- spark.mlp(df, label ~ features, layers = c(4, 3), initialWeights = c(0, 0, 0, 0, 0, 5, 5, 5, 5, 5, 9, 9, 9, 9, 9)) mlpPredictions <- collect(select(predict(model, mlpTestDF), "prediction")) expect_equal(head(mlpPredictions$prediction, 10), - c("1.0", "1.0", "2.0", "1.0", "2.0", "1.0", "2.0", "2.0", "1.0", "0.0")) - - model <- spark.mlp(df, label ~ features, layers = c(4, 3), maxIter = 2, initialWeights = -c(0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0, 5.0, 9.0, 9.0, 9.0, 9.0, 9.0)) - mlpPredictions <- collect(select(predict(model, mlpTestDF), "prediction")) - expect_equal(head(mlpPredictions$prediction, 10), - c("1.0", "1.0", "2.0", "1.0", "2.0", "1.0", "2.0", "2.0", "1.0", "0.0")) - - model <- spark.mlp(df, label ~ features, layers = c(4, 3), maxIter = 2) - mlpPredictions <- collect(select(predict(model, mlpTestDF), "prediction")) - expect_equal(head(mlpPredictions$prediction, 10), - c("1.0", "1.0", "1.0", "1.0", "0.0", "1.0", "0.0", "0.0", "1.0", "0.0")) + c("1.0", "1.0", "1.0", "1.0", "0.0", "1.0", "2.0", "2.0", "1.0", "0.0")) # Test formula works well df <- suppressWarnings(createDataFrame(iris)) @@ -310,8 +299,6 @@ test_that("spark.mlp", { expect_equal(summary$numOfOutputs, 3) expect_equal(summary$layers, c(4, 3)) expect_equal(length(summary$weights), 15) - expect_equal(head(summary$weights, 5), list(-0.5793153, -4.652961, 6.216155, -6.649478, - -10.51147), tolerance = 1e-3) }) test_that("spark.naiveBayes", { http://git-wip-us.apache.org/repos/asf/spark/blob/dbb06c68/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index 8649683..a9756ea
100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -185,34 +185,33 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti >>> from pyspark.sql import Row >>> from pyspark.ml.linalg import Vectors >>> bdf = sc.parallelize([ -... Row(label=1.0, weight=2.0, features=Vectors.dense(1.0)), -... Row(label=0.0, weight=2.0, features=Vectors.sparse(1, [], []))]).toDF() ->>> blor = LogisticRegression(maxIter=5, regParam=0.01, weightCol="weight") +... Row(label=1.0, weight=1.0, features=Vectors.dense(0.0, 5.0)), +... Row(label=0.0, weight=2.0, features=Vectors.dense(1.0, 2.0)), +... Row(label=1.0, weight=3.0, features=Vectors.dense(2.0, 1.0)), +... Row(label=0.0, weight=4.0, features=Vectors.dense(3.0, 3.0))]).toDF() +>>> blor =
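The principle behind these test changes, sketched in Scala (the DataFrame `df` is a placeholder): assert on a converged model rather than on coefficients captured after a couple of iterations, since only the converged solution is stable across numerics upgrades such as the breeze 0.13.1 bump.

```scala
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setMaxIter(100)  // generous cap so the tolerance, not the cap, stops training
  .setTol(1e-6)
// val model = lr.fit(df)                       // `df` is an assumed dataset
// assert(model.summary.totalIterations < 100)  // i.e. training actually converged
// ... assertions on model.coefficients are now stable across releases ...
```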
spark git commit: [SPARK-18901][FOLLOWUP][ML] Require in LR LogisticAggregator is redundant
Repository: spark Updated Branches: refs/heads/branch-2.2 b62ebd91b -> e2591c6d7 [SPARK-18901][FOLLOWUP][ML] Require in LR LogisticAggregator is redundant ## What changes were proposed in this pull request? This is a follow-up PR of #17478. ## How was this patch tested? Existing tests Author: wangmiao1981Closes #17754 from wangmiao1981/followup. (cherry picked from commit 387565cf14b490810f9479ff3adbf776e2edecdc) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e2591c6d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e2591c6d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e2591c6d Branch: refs/heads/branch-2.2 Commit: e2591c6d74081e9edad2e8982c0125a4f1d21437 Parents: b62ebd9 Author: wangmiao1981 Authored: Tue Apr 25 16:30:36 2017 +0800 Committer: Yanbo Liang Committed: Tue Apr 25 16:30:53 2017 +0800 -- .../scala/org/apache/spark/ml/classification/LinearSVC.scala| 5 ++--- .../scala/org/apache/spark/ml/regression/LinearRegression.scala | 5 - 2 files changed, 2 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e2591c6d/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala index f76b14e..7507c75 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala @@ -458,9 +458,7 @@ private class LinearSVCAggregator( */ def add(instance: Instance): this.type = { instance match { case Instance(label, weight, features) => - require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0") - require(numFeatures == features.size, s"Dimensions mismatch when adding new instance." + -s" Expecting $numFeatures but got ${features.size}.") + if (weight == 0.0) return this val localFeaturesStd = bcFeaturesStd.value val localCoefficients = coefficientsArray @@ -512,6 +510,7 @@ private class LinearSVCAggregator( * @return This LinearSVCAggregator object. */ def merge(other: LinearSVCAggregator): this.type = { + if (other.weightSum != 0.0) { weightSum += other.weightSum lossSum += other.lossSum http://git-wip-us.apache.org/repos/asf/spark/blob/e2591c6d/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala index f7e3c8f..eaad549 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala @@ -971,9 +971,6 @@ private class LeastSquaresAggregator( */ def add(instance: Instance): this.type = { instance match { case Instance(label, weight, features) => - require(dim == features.size, s"Dimensions mismatch when adding new sample." + -s" Expecting $dim but got ${features.size}.") - require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0") if (weight == 0.0) return this @@ -1005,8 +1002,6 @@ private class LeastSquaresAggregator( * @return This LeastSquaresAggregator object. */ def merge(other: LeastSquaresAggregator): this.type = { -require(dim == other.dim, s"Dimensions mismatch when merging with another " + - s"LeastSquaresAggregator. 
Expecting $dim but got ${other.dim}.") if (other.weightSum != 0) { totalCnt += other.totalCnt - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-18901][FOLLOWUP][ML] Require in LR LogisticAggregator is redundant
Repository: spark Updated Branches: refs/heads/master 0bc7a9021 -> 387565cf1 [SPARK-18901][FOLLOWUP][ML] Require in LR LogisticAggregator is redundant ## What changes were proposed in this pull request? This is a follow-up PR of #17478. ## How was this patch tested? Existing tests Author: wangmiao1981Closes #17754 from wangmiao1981/followup. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/387565cf Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/387565cf Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/387565cf Branch: refs/heads/master Commit: 387565cf14b490810f9479ff3adbf776e2edecdc Parents: 0bc7a90 Author: wangmiao1981 Authored: Tue Apr 25 16:30:36 2017 +0800 Committer: Yanbo Liang Committed: Tue Apr 25 16:30:36 2017 +0800 -- .../scala/org/apache/spark/ml/classification/LinearSVC.scala| 5 ++--- .../scala/org/apache/spark/ml/regression/LinearRegression.scala | 5 - 2 files changed, 2 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/387565cf/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala index f76b14e..7507c75 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala @@ -458,9 +458,7 @@ private class LinearSVCAggregator( */ def add(instance: Instance): this.type = { instance match { case Instance(label, weight, features) => - require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0") - require(numFeatures == features.size, s"Dimensions mismatch when adding new instance." + -s" Expecting $numFeatures but got ${features.size}.") + if (weight == 0.0) return this val localFeaturesStd = bcFeaturesStd.value val localCoefficients = coefficientsArray @@ -512,6 +510,7 @@ private class LinearSVCAggregator( * @return This LinearSVCAggregator object. */ def merge(other: LinearSVCAggregator): this.type = { + if (other.weightSum != 0.0) { weightSum += other.weightSum lossSum += other.lossSum http://git-wip-us.apache.org/repos/asf/spark/blob/387565cf/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala index f7e3c8f..eaad549 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala @@ -971,9 +971,6 @@ private class LeastSquaresAggregator( */ def add(instance: Instance): this.type = { instance match { case Instance(label, weight, features) => - require(dim == features.size, s"Dimensions mismatch when adding new sample." + -s" Expecting $dim but got ${features.size}.") - require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0") if (weight == 0.0) return this @@ -1005,8 +1002,6 @@ private class LeastSquaresAggregator( * @return This LeastSquaresAggregator object. */ def merge(other: LeastSquaresAggregator): this.type = { -require(dim == other.dim, s"Dimensions mismatch when merging with another " + - s"LeastSquaresAggregator. 
Expecting $dim but got ${other.dim}.") if (other.weightSum != 0) { totalCnt += other.totalCnt - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-18901][ML] Require in LR LogisticAggregator is redundant
Repository: spark Updated Branches: refs/heads/branch-2.2 2bef01f64 -> cf16c3250 [SPARK-18901][ML] Require in LR LogisticAggregator is redundant ## What changes were proposed in this pull request? In MultivariateOnlineSummarizer, `add` and `merge` have checks for weights and feature sizes. The corresponding checks in LR are redundant and are removed in this PR. ## How was this patch tested? Existing tests. Author: wm...@hotmail.com Closes #17478 from wangmiao1981/logit. (cherry picked from commit 90264aced7cfdf265636517b91e5d1324fe60112) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cf16c325 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cf16c325 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cf16c325 Branch: refs/heads/branch-2.2 Commit: cf16c3250e946c4f89edc999d8764e8fa3dfb056 Parents: 2bef01f Author: wm...@hotmail.com Authored: Mon Apr 24 23:43:06 2017 +0800 Committer: Yanbo Liang Committed: Mon Apr 24 23:43:23 2017 +0800 -- .../org/apache/spark/ml/classification/LogisticRegression.scala | 5 - 1 file changed, 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cf16c325/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala index bc81546..44b3478 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala @@ -1571,9 +1571,6 @@ private class LogisticAggregator( */ def add(instance: Instance): this.type = { instance match { case Instance(label, weight, features) => - require(numFeatures == features.size, s"Dimensions mismatch when adding new instance." + -s" Expecting $numFeatures but got ${features.size}.") - require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0") if (weight == 0.0) return this @@ -1596,8 +1593,6 @@ private class LogisticAggregator( * @return This LogisticAggregator object. */ def merge(other: LogisticAggregator): this.type = { -require(numFeatures == other.numFeatures, s"Dimensions mismatch when merging with another " + - s"LogisticAggregator. Expecting $numFeatures but got ${other.numFeatures}.") if (other.weightSum != 0.0) { weightSum += other.weightSum - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-18901][ML] Require in LR LogisticAggregator is redundant
Repository: spark Updated Branches: refs/heads/master 776a2c0e9 -> 90264aced [SPARK-18901][ML] Require in LR LogisticAggregator is redundant ## What changes were proposed in this pull request? In MultivariateOnlineSummarizer, `add` and `merge` have checks for weights and feature sizes. The corresponding checks in LR are redundant and are removed in this PR. ## How was this patch tested? Existing tests. Author: wm...@hotmail.com Closes #17478 from wangmiao1981/logit. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/90264ace Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/90264ace Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/90264ace Branch: refs/heads/master Commit: 90264aced7cfdf265636517b91e5d1324fe60112 Parents: 776a2c0 Author: wm...@hotmail.com Authored: Mon Apr 24 23:43:06 2017 +0800 Committer: Yanbo Liang Committed: Mon Apr 24 23:43:06 2017 +0800 -- .../org/apache/spark/ml/classification/LogisticRegression.scala | 5 - 1 file changed, 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/90264ace/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala index bc81546..44b3478 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala @@ -1571,9 +1571,6 @@ private class LogisticAggregator( */ def add(instance: Instance): this.type = { instance match { case Instance(label, weight, features) => - require(numFeatures == features.size, s"Dimensions mismatch when adding new instance." + -s" Expecting $numFeatures but got ${features.size}.") - require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0") if (weight == 0.0) return this @@ -1596,8 +1593,6 @@ private class LogisticAggregator( * @return This LogisticAggregator object. */ def merge(other: LogisticAggregator): this.type = { -require(numFeatures == other.numFeatures, s"Dimensions mismatch when merging with another " + - s"LogisticAggregator. Expecting $numFeatures but got ${other.numFeatures}.") if (other.weightSum != 0.0) { weightSum += other.weightSum - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
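To make the reasoning concrete, an illustrative skeleton (not Spark's actual class) of the add/merge shape left behind once the redundant requires are gone: per the PR description, `MultivariateOnlineSummarizer` already rejects negative weights and mismatched dimensions upstream, so the per-instance hot path keeps only the zero-weight short-circuit, and `merge` skips empty partitions.

```scala
import org.apache.spark.ml.linalg.Vector

// Local stand-in for Spark's private[ml] Instance case class.
case class Instance(label: Double, weight: Double, features: Vector)

class SketchAggregator extends Serializable {
  private var weightSum = 0.0
  private var lossSum = 0.0

  def add(instance: Instance): this.type = instance match {
    case Instance(label, weight, features) =>
      if (weight == 0.0) return this  // only cheap check left on the hot path
      // ... accumulate gradient and loss from (label, features) here ...
      weightSum += weight
      this
  }

  def merge(other: SketchAggregator): this.type = {
    if (other.weightSum != 0.0) {  // partitions that saw no data contribute nothing
      weightSum += other.weightSum
      lossSum += other.lossSum
    }
    this
  }
}
```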
spark git commit: [MINOR][SPARKR] Move 'Data type mapping between R and Spark' to right place in SparkR doc.
Repository: spark Updated Branches: refs/heads/master 3fada2f50 -> 1d00761b9 [MINOR][SPARKR] Move 'Data type mapping between R and Spark' to right place in SparkR doc. The section ```Data type mapping between R and Spark``` is currently in the wrong place in the SparkR doc; this change moves it to a separate section. ## What changes were proposed in this pull request? Before this PR: ![image](https://cloud.githubusercontent.com/assets/1962026/24340911/bc01a532-126a-11e7-9a08-0d60d13a547c.png) After this PR: ![image](https://cloud.githubusercontent.com/assets/1962026/24340938/d9d32a9a-126a-11e7-8891-d2f5b46e0c71.png) Author: Yanbo Liang Closes #17440 from yanboliang/sparkr-doc. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1d00761b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1d00761b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1d00761b Branch: refs/heads/master Commit: 1d00761b9176a1f42976057ca78638c5b0763abc Parents: 3fada2f Author: Yanbo Liang Authored: Mon Mar 27 17:37:24 2017 -0700 Committer: Yanbo Liang Committed: Mon Mar 27 17:37:24 2017 -0700 -- docs/sparkr.md | 138 ++-- 1 file changed, 69 insertions(+), 69 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1d00761b/docs/sparkr.md -- diff --git a/docs/sparkr.md b/docs/sparkr.md index d7ffd9b..a1a35a7 100644 --- a/docs/sparkr.md +++ b/docs/sparkr.md @@ -394,75 +394,6 @@ head(result[order(result$max_eruption, decreasing = TRUE), ]) {% endhighlight %} - Data type mapping between R and Spark - -RSpark - - byte - byte - - - integer - integer - - - float - float - - - double - double - - - numeric - double - - - character - string - - - string - string - - - binary - binary - - - raw - binary - - - logical - boolean - - - https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html;>POSIXct - timestamp - - - https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html;>POSIXlt - timestamp - - - https://stat.ethz.ch/R-manual/R-devel/library/base/html/Dates.html;>Date - date - - - array - array - - - list - array - - - env - map - - - Run local R functions distributed using `spark.lapply` # spark.lapply @@ -557,6 +488,75 @@ SparkR supports a subset of the available R formula operators for model fitting, The following example shows how to save/load a MLlib model by SparkR. {% include_example read_write r/ml/ml.R %} +# Data type mapping between R and Spark +RSpark + + byte + byte + + + integer + integer + + + float + float + + + double + double + + + numeric + double + + + character + string + + + string + string + + + binary + binary + + + raw + binary + + + logical + boolean + + + https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html;>POSIXct + timestamp + + + https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html;>POSIXlt + timestamp + + + https://stat.ethz.ch/R-manual/R-devel/library/base/html/Dates.html;>Date + date + + + array + array + + + list + array + + + env + map + + + # R Function Name Conflicts When loading and attaching a new package in R, it is possible to have a name [conflict](https://stat.ethz.ch/R-manual/R-devel/library/base/html/library.html), where a - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-19925][SPARKR] Fix SparkR spark.getSparkFiles fails when it was called on executors.
Repository: spark Updated Branches: refs/heads/branch-2.1 c4d2b8338 -> 277ed375b [SPARK-19925][SPARKR] Fix SparkR spark.getSparkFiles fails when it was called on executors. ## What changes were proposed in this pull request? SparkR ```spark.getSparkFiles``` fails when called on executors; see details at [SPARK-19925](https://issues.apache.org/jira/browse/SPARK-19925). ## How was this patch tested? Add unit tests, and verify this fix on standalone and YARN clusters. Author: Yanbo Liang Closes #17274 from yanboliang/spark-19925. (cherry picked from commit 478fbc866fbfdb4439788583281863ecea14e8af) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/277ed375 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/277ed375 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/277ed375 Branch: refs/heads/branch-2.1 Commit: 277ed375b0af3e8fe2a8b9dee62997dcf16d5872 Parents: c4d2b83 Author: Yanbo Liang Authored: Tue Mar 21 21:50:54 2017 -0700 Committer: Yanbo Liang Committed: Tue Mar 21 22:12:55 2017 -0700 -- R/pkg/R/context.R | 16 ++-- R/pkg/inst/tests/testthat/test_context.R| 7 +++ .../main/scala/org/apache/spark/api/r/RRunner.scala | 2 ++ 3 files changed, 23 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/277ed375/R/pkg/R/context.R -- diff --git a/R/pkg/R/context.R b/R/pkg/R/context.R index 1a0dd65..634bdcb 100644 --- a/R/pkg/R/context.R +++ b/R/pkg/R/context.R @@ -330,7 +330,13 @@ spark.addFile <- function(path, recursive = FALSE) { #'} #' @note spark.getSparkFilesRootDirectory since 2.1.0 spark.getSparkFilesRootDirectory <- function() { - callJStatic("org.apache.spark.SparkFiles", "getRootDirectory") + if (Sys.getenv("SPARKR_IS_RUNNING_ON_WORKER") == "") { +# Running on driver. +callJStatic("org.apache.spark.SparkFiles", "getRootDirectory") + } else { +# Running on worker. +Sys.getenv("SPARKR_SPARKFILES_ROOT_DIR") + } } #' Get the absolute path of a file added through spark.addFile. @@ -345,7 +351,13 @@ spark.getSparkFilesRootDirectory <- function() { #'} #' @note spark.getSparkFiles since 2.1.0 spark.getSparkFiles <- function(fileName) { - callJStatic("org.apache.spark.SparkFiles", "get", as.character(fileName)) + if (Sys.getenv("SPARKR_IS_RUNNING_ON_WORKER") == "") { +# Running on driver. +callJStatic("org.apache.spark.SparkFiles", "get", as.character(fileName)) + } else { +# Running on worker. +file.path(spark.getSparkFilesRootDirectory(), as.character(fileName)) + } } #' Run a function over a list of elements, distributing the computations with Spark http://git-wip-us.apache.org/repos/asf/spark/blob/277ed375/R/pkg/inst/tests/testthat/test_context.R -- diff --git a/R/pkg/inst/tests/testthat/test_context.R b/R/pkg/inst/tests/testthat/test_context.R index caca069..c847113 100644 --- a/R/pkg/inst/tests/testthat/test_context.R +++ b/R/pkg/inst/tests/testthat/test_context.R @@ -177,6 +177,13 @@ test_that("add and get file to be downloaded with Spark job on every node", { spark.addFile(path) download_path <- spark.getSparkFiles(filename) expect_equal(readLines(download_path), words) + + # Test spark.getSparkFiles works well on executors. + seq <- seq(from = 1, to = 10, length.out = 5) + f <- function(seq) { spark.getSparkFiles(filename) } + results <- spark.lapply(seq, f) + for (i in 1:5) { expect_equal(basename(results[[i]]), filename) } + unlink(path) # Test add directory recursively.
http://git-wip-us.apache.org/repos/asf/spark/blob/277ed375/core/src/main/scala/org/apache/spark/api/r/RRunner.scala -- diff --git a/core/src/main/scala/org/apache/spark/api/r/RRunner.scala b/core/src/main/scala/org/apache/spark/api/r/RRunner.scala index 29e21b3..8811839 100644 --- a/core/src/main/scala/org/apache/spark/api/r/RRunner.scala +++ b/core/src/main/scala/org/apache/spark/api/r/RRunner.scala @@ -347,6 +347,8 @@ private[r] object RRunner { pb.environment().put("SPARKR_RLIBDIR", rLibDir.mkString(",")) pb.environment().put("SPARKR_WORKER_PORT", port.toString) pb.environment().put("SPARKR_BACKEND_CONNECTION_TIMEOUT", rConnectionTimeout.toString) +pb.environment().put("SPARKR_SPARKFILES_ROOT_DIR", SparkFiles.getRootDirectory()) +pb.environment().put("SPARKR_IS_RUNNING_ON_WORKER", "TRUE") pb.redirectErrorStream(true) // redirect stderr into stdout val proc = pb.start() val
spark git commit: [SPARK-19925][SPARKR] Fix SparkR spark.getSparkFiles fails when it was called on executors.
Repository: spark Updated Branches: refs/heads/master c1e87e384 -> 478fbc866 [SPARK-19925][SPARKR] Fix SparkR spark.getSparkFiles fails when it was called on executors. ## What changes were proposed in this pull request? SparkR ```spark.getSparkFiles``` fails when called on executors; see details at [SPARK-19925](https://issues.apache.org/jira/browse/SPARK-19925). ## How was this patch tested? Add unit tests, and verify this fix on standalone and YARN clusters. Author: Yanbo Liang Closes #17274 from yanboliang/spark-19925. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/478fbc86 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/478fbc86 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/478fbc86 Branch: refs/heads/master Commit: 478fbc866fbfdb4439788583281863ecea14e8af Parents: c1e87e3 Author: Yanbo Liang Authored: Tue Mar 21 21:50:54 2017 -0700 Committer: Yanbo Liang Committed: Tue Mar 21 21:50:54 2017 -0700 -- R/pkg/R/context.R | 16 ++-- R/pkg/inst/tests/testthat/test_context.R| 7 +++ .../main/scala/org/apache/spark/api/r/RRunner.scala | 2 ++ 3 files changed, 23 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/478fbc86/R/pkg/R/context.R -- diff --git a/R/pkg/R/context.R b/R/pkg/R/context.R index 1ca573e..50856e3 100644 --- a/R/pkg/R/context.R +++ b/R/pkg/R/context.R @@ -330,7 +330,13 @@ spark.addFile <- function(path, recursive = FALSE) { #'} #' @note spark.getSparkFilesRootDirectory since 2.1.0 spark.getSparkFilesRootDirectory <- function() { - callJStatic("org.apache.spark.SparkFiles", "getRootDirectory") + if (Sys.getenv("SPARKR_IS_RUNNING_ON_WORKER") == "") { +# Running on driver. +callJStatic("org.apache.spark.SparkFiles", "getRootDirectory") + } else { +# Running on worker. +Sys.getenv("SPARKR_SPARKFILES_ROOT_DIR") + } } #' Get the absolute path of a file added through spark.addFile. @@ -345,7 +351,13 @@ spark.getSparkFilesRootDirectory <- function() { #'} #' @note spark.getSparkFiles since 2.1.0 spark.getSparkFiles <- function(fileName) { - callJStatic("org.apache.spark.SparkFiles", "get", as.character(fileName)) + if (Sys.getenv("SPARKR_IS_RUNNING_ON_WORKER") == "") { +# Running on driver. +callJStatic("org.apache.spark.SparkFiles", "get", as.character(fileName)) + } else { +# Running on worker. +file.path(spark.getSparkFilesRootDirectory(), as.character(fileName)) + } } #' Run a function over a list of elements, distributing the computations with Spark http://git-wip-us.apache.org/repos/asf/spark/blob/478fbc86/R/pkg/inst/tests/testthat/test_context.R -- diff --git a/R/pkg/inst/tests/testthat/test_context.R b/R/pkg/inst/tests/testthat/test_context.R index caca069..c847113 100644 --- a/R/pkg/inst/tests/testthat/test_context.R +++ b/R/pkg/inst/tests/testthat/test_context.R @@ -177,6 +177,13 @@ test_that("add and get file to be downloaded with Spark job on every node", { spark.addFile(path) download_path <- spark.getSparkFiles(filename) expect_equal(readLines(download_path), words) + + # Test spark.getSparkFiles works well on executors. + seq <- seq(from = 1, to = 10, length.out = 5) + f <- function(seq) { spark.getSparkFiles(filename) } + results <- spark.lapply(seq, f) + for (i in 1:5) { expect_equal(basename(results[[i]]), filename) } + unlink(path) # Test add directory recursively.
http://git-wip-us.apache.org/repos/asf/spark/blob/478fbc86/core/src/main/scala/org/apache/spark/api/r/RRunner.scala -- diff --git a/core/src/main/scala/org/apache/spark/api/r/RRunner.scala b/core/src/main/scala/org/apache/spark/api/r/RRunner.scala index 29e21b3..8811839 100644 --- a/core/src/main/scala/org/apache/spark/api/r/RRunner.scala +++ b/core/src/main/scala/org/apache/spark/api/r/RRunner.scala @@ -347,6 +347,8 @@ private[r] object RRunner { pb.environment().put("SPARKR_RLIBDIR", rLibDir.mkString(",")) pb.environment().put("SPARKR_WORKER_PORT", port.toString) pb.environment().put("SPARKR_BACKEND_CONNECTION_TIMEOUT", rConnectionTimeout.toString) +pb.environment().put("SPARKR_SPARKFILES_ROOT_DIR", SparkFiles.getRootDirectory()) +pb.environment().put("SPARKR_IS_RUNNING_ON_WORKER", "TRUE") pb.redirectErrorStream(true) // redirect stderr into stdout val proc = pb.start() val errThread = startStdoutThread(proc) - To unsubscribe,
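For comparison, the JVM-side behavior that this R fix mirrors, as a hedged Scala sketch (assumes an active `SparkContext` named `sc` and a local file `data.txt`; both are placeholders): `SparkFiles.get` resolves against the driver's root directory on the driver and against the per-executor download directory inside tasks, which is the same driver/worker split the new `SPARKR_IS_RUNNING_ON_WORKER` check reproduces for R workers.

```scala
import org.apache.spark.SparkFiles

sc.addFile("data.txt")  // ship the file to every node

// Inside a task, SparkFiles.get resolves to the executor-local copy.
val executorPaths = sc.parallelize(1 to 2)
  .map(_ => SparkFiles.get("data.txt"))
  .collect()

// On the driver, the same call resolves against the driver's root directory.
val driverPath = SparkFiles.get("data.txt")
```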
spark git commit: [SPARK-19806][ML][PYSPARK] PySpark GeneralizedLinearRegression supports tweedie distribution.
Repository: spark Updated Branches: refs/heads/master 1fa58868b -> 81303f7ca [SPARK-19806][ML][PYSPARK] PySpark GeneralizedLinearRegression supports tweedie distribution. ## What changes were proposed in this pull request? PySpark ```GeneralizedLinearRegression``` supports tweedie distribution. ## How was this patch tested? Add unit tests. Author: Yanbo LiangCloses #17146 from yanboliang/spark-19806. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/81303f7c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/81303f7c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/81303f7c Branch: refs/heads/master Commit: 81303f7ca7808d51229411dce8feeed8c23dbe15 Parents: 1fa5886 Author: Yanbo Liang Authored: Wed Mar 8 02:09:36 2017 -0800 Committer: Yanbo Liang Committed: Wed Mar 8 02:09:36 2017 -0800 -- .../GeneralizedLinearRegression.scala | 8 +-- python/pyspark/ml/regression.py | 61 +--- python/pyspark/ml/tests.py | 20 +++ 3 files changed, 77 insertions(+), 12 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/81303f7c/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala index 110764d..3be8b53 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala @@ -66,7 +66,7 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam /** * Param for the power in the variance function of the Tweedie distribution which provides * the relationship between the variance and mean of the distribution. - * Only applicable for the Tweedie family. + * Only applicable to the Tweedie family. * (see https://en.wikipedia.org/wiki/Tweedie_distribution;> * Tweedie Distribution (Wikipedia)) * Supported values: 0 and [1, Inf). @@ -79,7 +79,7 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam final val variancePower: DoubleParam = new DoubleParam(this, "variancePower", "The power in the variance function of the Tweedie distribution which characterizes " + "the relationship between the variance and mean of the distribution. " + -"Only applicable for the Tweedie family. Supported values: 0 and [1, Inf).", +"Only applicable to the Tweedie family. Supported values: 0 and [1, Inf).", (x: Double) => x >= 1.0 || x == 0.0) /** @group getParam */ @@ -106,7 +106,7 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam def getLink: String = $(link) /** - * Param for the index in the power link function. Only applicable for the Tweedie family. + * Param for the index in the power link function. Only applicable to the Tweedie family. * Note that link power 0, 1, -1 or 0.5 corresponds to the Log, Identity, Inverse or Sqrt * link, respectively. * When not set, this value defaults to 1 - [[variancePower]], which matches the R "statmod" @@ -116,7 +116,7 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam */ @Since("2.2.0") final val linkPower: DoubleParam = new DoubleParam(this, "linkPower", -"The index in the power link function. Only applicable for the Tweedie family.") +"The index in the power link function. 
Only applicable to the Tweedie family.") /** @group getParam */ @Since("2.2.0") http://git-wip-us.apache.org/repos/asf/spark/blob/81303f7c/python/pyspark/ml/regression.py -- diff --git a/python/pyspark/ml/regression.py b/python/pyspark/ml/regression.py index b199bf2..3c3fcc8 100644 --- a/python/pyspark/ml/regression.py +++ b/python/pyspark/ml/regression.py @@ -1294,8 +1294,8 @@ class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, Ha Fit a Generalized Linear Model specified by giving a symbolic description of the linear predictor (link function) and a description of the error distribution (family). It supports -"gaussian", "binomial", "poisson" and "gamma" as family. Valid link functions for each family -is listed below. The first link function of each family is the default one. +"gaussian", "binomial", "poisson", "gamma" and "tweedie" as family. Valid link functions for +each family is listed below.
spark git commit: [SPARK-19745][ML] SVCAggregator captures coefficients in its closure
Repository: spark Updated Branches: refs/heads/master 8417a7ae6 -> 93ae176e8 [SPARK-19745][ML] SVCAggregator captures coefficients in its closure ## What changes were proposed in this pull request? JIRA: [SPARK-19745](https://issues.apache.org/jira/browse/SPARK-19745) Reorganize SVCAggregator to avoid serializing coefficients. This patch also makes the gradient array a `lazy val` which will avoid materializing a large array on the driver before shipping the class to the executors. This improvement stems from https://github.com/apache/spark/pull/16037. Actually, probably all ML aggregators can benefit from this. We can either: a.) separate the gradient improvement into another patch b.) keep what's here _plus_ add the lazy evaluation to all other aggregators in this patch or c.) keep it as is. ## How was this patch tested? This is an interesting question! I don't know of a reasonable way to test this right now. Ideally, we could perform an optimization and look at the shuffle write data for each task, and we could compare the size to what we know it should be: `numCoefficients * 8 bytes`. Not sure if there is a good way to do that right now? We could discuss this here or in another JIRA, but I suspect it would be a significant undertaking. Author: sethah Closes #17076 from sethah/svc_agg. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/93ae176e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/93ae176e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/93ae176e Branch: refs/heads/master Commit: 93ae176e8943d6b346c80deea778bffd188366a1 Parents: 8417a7a Author: sethah Authored: Thu Mar 2 19:38:25 2017 -0800 Committer: Yanbo Liang Committed: Thu Mar 2 19:38:25 2017 -0800 -- .../spark/ml/classification/LinearSVC.scala | 29  .../ml/classification/LogisticRegression.scala | 2 +- .../spark/ml/clustering/GaussianMixture.scala | 6 ++-- .../ml/regression/AFTSurvivalRegression.scala | 2 +- .../spark/ml/regression/LinearRegression.scala | 2 +- .../ml/classification/LinearSVCSuite.scala | 17 +++- 6 files changed, 34 insertions(+), 24 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/93ae176e/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala index bf6e76d..f76b14e 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala @@ -440,19 +440,14 @@ private class LinearSVCAggregator( private val numFeatures: Int = bcFeaturesStd.value.length private val numFeaturesPlusIntercept: Int = if (fitIntercept) numFeatures + 1 else numFeatures - private val coefficients: Vector = bcCoefficients.value private var weightSum: Double = 0.0 private var lossSum: Double = 0.0 - require(numFeaturesPlusIntercept == coefficients.size, s"Dimension mismatch. 
Coefficients " + -s"length ${coefficients.size}, FeaturesStd length ${numFeatures}, fitIntercept: $fitIntercept") - - private val coefficientsArray = coefficients match { -case dv: DenseVector => dv.values -case _ => - throw new IllegalArgumentException( -s"coefficients only supports dense vector but got type ${coefficients.getClass}.") + @transient private lazy val coefficientsArray = bcCoefficients.value match { +case DenseVector(values) => values +case _ => throw new IllegalArgumentException(s"coefficients only supports dense vector" + + s" but got type ${bcCoefficients.value.getClass}.") } - private val gradientSumArray = Array.fill[Double](coefficientsArray.length)(0) + private lazy val gradientSumArray = new Array[Double](numFeaturesPlusIntercept) /** * Add a new training instance to this LinearSVCAggregator, and update the loss and gradient @@ -463,6 +458,9 @@ private class LinearSVCAggregator( */ def add(instance: Instance): this.type = { instance match { case Instance(label, weight, features) => + require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0") + require(numFeatures == features.size, s"Dimensions mismatch when adding new instance." + +s" Expecting $numFeatures but got ${features.size}.") if (weight == 0.0) return this val localFeaturesStd = bcFeaturesStd.value val localCoefficients = coefficientsArray @@ -530,18 +528,15 @@ private class LinearSVCAggregator( this } - def loss:
spark git commit: [SPARK-19734][PYTHON][ML] Correct OneHotEncoder doc string to say dropLast
Repository: spark Updated Branches: refs/heads/master 3bd8ddf7c -> d2a879762 [SPARK-19734][PYTHON][ML] Correct OneHotEncoder doc string to say dropLast ## What changes were proposed in this pull request? Updates the doc string to match up with the code i.e. say dropLast instead of includeFirst ## How was this patch tested? Not much, since it's a doc-like change. Will run unit tests via Jenkins job. Author: Mark GroverCloses #17127 from markgrover/spark_19734. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d2a87976 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d2a87976 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d2a87976 Branch: refs/heads/master Commit: d2a879762a2b4f3c4d703cc183275af12b3c7de1 Parents: 3bd8ddf Author: Mark Grover Authored: Wed Mar 1 22:57:34 2017 -0800 Committer: Yanbo Liang Committed: Wed Mar 1 22:57:34 2017 -0800 -- python/pyspark/ml/feature.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d2a87976/python/pyspark/ml/feature.py -- diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py index 67c12d8..83cf763 100755 --- a/python/pyspark/ml/feature.py +++ b/python/pyspark/ml/feature.py @@ -1363,7 +1363,7 @@ class OneHotEncoder(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadable, @keyword_only def __init__(self, dropLast=True, inputCol=None, outputCol=None): """ -__init__(self, includeFirst=True, inputCol=None, outputCol=None) +__init__(self, dropLast=True, inputCol=None, outputCol=None) """ super(OneHotEncoder, self).__init__() self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.OneHotEncoder", self.uid) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
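For context, a small sketch of the corrected signature in use (toy data; assumes Spark 2.x, where `OneHotEncoder` is a plain `Transformer` with no fit step):

```python
from pyspark.ml.feature import OneHotEncoder
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0.0,), (1.0,), (2.0,)], ["categoryIndex"])

# dropLast=True (the default) omits the last category: a three-level column
# becomes a 2-dimensional vector, and index 2.0 encodes as all zeros.
encoder = OneHotEncoder(dropLast=True, inputCol="categoryIndex",
                        outputCol="categoryVec")
encoder.transform(df).show(truncate=False)
```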
spark git commit: [MINOR][ML] Fix comments in LSH Examples and Python API
Repository: spark Updated Branches: refs/heads/master de2b53df4 -> 3bd8ddf7c [MINOR][ML] Fix comments in LSH Examples and Python API ## What changes were proposed in this pull request? Remove `org.apache.spark.examples.` from the `Run with` comments in the LSH examples. Add a slash in one of the Python docs. ## How was this patch tested? Run examples using the commands in the comments. Author: Yun Ni Closes #17104 from Yunni/yunn_minor. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3bd8ddf7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3bd8ddf7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3bd8ddf7 Branch: refs/heads/master Commit: 3bd8ddf7c34be35e5adeb802d6e63120f9f11713 Parents: de2b53d Author: Yun Ni Authored: Wed Mar 1 22:55:13 2017 -0800 Committer: Yanbo Liang Committed: Wed Mar 1 22:55:13 2017 -0800 -- .../spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java | 2 +- .../java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java | 2 +- .../spark/examples/ml/BucketedRandomProjectionLSHExample.scala | 2 +- .../scala/org/apache/spark/examples/ml/MinHashLSHExample.scala | 2 +- python/pyspark/ml/feature.py | 2 +- 5 files changed, 5 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3bd8ddf7/examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java -- diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java index 4594e34..ff917b7 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java @@ -42,7 +42,7 @@ import static org.apache.spark.sql.functions.col; /** * An example demonstrating BucketedRandomProjectionLSH. * Run with: - * bin/run-example org.apache.spark.examples.ml.JavaBucketedRandomProjectionLSHExample + * bin/run-example ml.JavaBucketedRandomProjectionLSHExample */ public class JavaBucketedRandomProjectionLSHExample { public static void main(String[] args) { http://git-wip-us.apache.org/repos/asf/spark/blob/3bd8ddf7/examples/src/main/java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java -- diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java index 0aace46..e164598 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java @@ -42,7 +42,7 @@ import static org.apache.spark.sql.functions.col; /** * An example demonstrating MinHashLSH. 
* Run with: - * bin/run-example org.apache.spark.examples.ml.JavaMinHashLSHExample + * bin/run-example ml.JavaMinHashLSHExample */ public class JavaMinHashLSHExample { public static void main(String[] args) { http://git-wip-us.apache.org/repos/asf/spark/blob/3bd8ddf7/examples/src/main/scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala -- diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala index 654535c..16da4fa 100644 --- a/examples/src/main/scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala +++ b/examples/src/main/scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala @@ -28,7 +28,7 @@ import org.apache.spark.sql.SparkSession /** * An example demonstrating BucketedRandomProjectionLSH. * Run with: - * bin/run-example org.apache.spark.examples.ml.BucketedRandomProjectionLSHExample + * bin/run-example ml.BucketedRandomProjectionLSHExample */ object BucketedRandomProjectionLSHExample { def main(args: Array[String]): Unit = { http://git-wip-us.apache.org/repos/asf/spark/blob/3bd8ddf7/examples/src/main/scala/org/apache/spark/examples/ml/MinHashLSHExample.scala -- diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/MinHashLSHExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/MinHashLSHExample.scala index 6c1e222..b94ab9b 100644 ---
spark git commit: [MINOR][ML][DOC] Document default value for GeneralizedLinearRegression.linkPower
Repository: spark Updated Branches: refs/heads/master 410392ed7 -> 6ab60542e [MINOR][ML][DOC] Document default value for GeneralizedLinearRegression.linkPower Add Scaladoc for GeneralizedLinearRegression.linkPower default value Follow-up to https://github.com/apache/spark/pull/16344 Author: Joseph K. BradleyCloses #17069 from jkbradley/tweedie-comment. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6ab60542 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6ab60542 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6ab60542 Branch: refs/heads/master Commit: 6ab60542e8e803b1d91371a92f4aaef6a64106f6 Parents: 410392e Author: Joseph K. Bradley Authored: Sat Feb 25 22:24:08 2017 -0800 Committer: Yanbo Liang Committed: Sat Feb 25 22:24:08 2017 -0800 -- .../apache/spark/ml/regression/GeneralizedLinearRegression.scala | 2 ++ 1 file changed, 2 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6ab60542/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala index fdeadaf..110764d 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala @@ -109,6 +109,8 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam * Param for the index in the power link function. Only applicable for the Tweedie family. * Note that link power 0, 1, -1 or 0.5 corresponds to the Log, Identity, Inverse or Sqrt * link, respectively. + * When not set, this value defaults to 1 - [[variancePower]], which matches the R "statmod" + * package. * * @group param */ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
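The documented default is simple arithmetic; a tiny illustrative snippet spells out the rule (the helper below is hypothetical, written only for this note):

```python
# When linkPower is not set, GLM uses 1 - variancePower (matching R "statmod").
def default_link_power(variance_power):  # hypothetical helper, for illustration
    return 1.0 - variance_power

for p, family in [(0.0, "gaussian"), (1.0, "poisson"), (2.0, "gamma")]:
    print("variancePower=%.1f (%s) -> default linkPower=%.1f"
          % (p, family, default_link_power(p)))
# linkPower 1.0 is the identity link, 0.0 the log link, -1.0 the inverse link.
```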
spark git commit: [SPARK-18285][SPARKR] SparkR approxQuantile supports input multiple columns
Repository: spark Updated Branches: refs/heads/master 1a3f5f8c5 -> b40659838 [SPARK-18285][SPARKR] SparkR approxQuantile supports input multiple columns ## What changes were proposed in this pull request? SparkR ```approxQuantile``` supports input multiple columns. ## How was this patch tested? Unit test. Author: Yanbo LiangCloses #16951 from yanboliang/spark-19619. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b4065983 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b4065983 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b4065983 Branch: refs/heads/master Commit: b40659838245ecaefb4e83d2ec6155f3f23a6675 Parents: 1a3f5f8 Author: Yanbo Liang Authored: Fri Feb 17 11:58:39 2017 -0800 Committer: Yanbo Liang Committed: Fri Feb 17 11:58:39 2017 -0800 -- R/pkg/R/generics.R| 2 +- R/pkg/R/stats.R | 25 + R/pkg/inst/tests/testthat/test_sparkSQL.R | 18 +- 3 files changed, 31 insertions(+), 14 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b4065983/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index 68864e6..11940d3 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -66,7 +66,7 @@ setGeneric("freqItems", function(x, cols, support = 0.01) { standardGeneric("fre # @rdname approxQuantile # @export setGeneric("approxQuantile", - function(x, col, probabilities, relativeError) { + function(x, cols, probabilities, relativeError) { standardGeneric("approxQuantile") }) http://git-wip-us.apache.org/repos/asf/spark/blob/b4065983/R/pkg/R/stats.R -- diff --git a/R/pkg/R/stats.R b/R/pkg/R/stats.R index dcd7198..8d1d165 100644 --- a/R/pkg/R/stats.R +++ b/R/pkg/R/stats.R @@ -138,9 +138,9 @@ setMethod("freqItems", signature(x = "SparkDataFrame", cols = "character"), collect(dataFrame(sct)) }) -#' Calculates the approximate quantiles of a numerical column of a SparkDataFrame +#' Calculates the approximate quantiles of numerical columns of a SparkDataFrame #' -#' Calculates the approximate quantiles of a numerical column of a SparkDataFrame. +#' Calculates the approximate quantiles of numerical columns of a SparkDataFrame. #' The result of this algorithm has the following deterministic bound: #' If the SparkDataFrame has N elements and if we request the quantile at probability p up to #' error err, then the algorithm will return a sample x from the SparkDataFrame so that the @@ -149,15 +149,19 @@ setMethod("freqItems", signature(x = "SparkDataFrame", cols = "character"), #' This method implements a variation of the Greenwald-Khanna algorithm (with some speed #' optimizations). The algorithm was first present in [[http://dx.doi.org/10.1145/375663.375670 #' Space-efficient Online Computation of Quantile Summaries]] by Greenwald and Khanna. +#' Note that rows containing any NA values will be removed before calculation. #' #' @param x A SparkDataFrame. -#' @param col The name of the numerical column. +#' @param cols A single column name, or a list of names for multiple columns. #' @param probabilities A list of quantile probabilities. Each number must belong to [0, 1]. #' For example 0 is the minimum, 0.5 is the median, 1 is the maximum. #' @param relativeError The relative target precision to achieve (>= 0). If set to zero, #' the exact quantiles are computed, which could be very expensive. #' Note that values greater than 1 are accepted but give the same result as 1. -#' @return The approximate quantiles at the given probabilities. 
+#' @return The approximate quantiles at the given probabilities. If the input is a single column name, +#' the output is a list of approximate quantiles in that column; If the input is +#' multiple column names, the output should be a list, and each element in it is a list of +#' numeric values which represents the approximate quantiles in corresponding column. #' #' @rdname approxQuantile #' @name approxQuantile @@ -171,12 +175,17 @@ setMethod("freqItems", signature(x = "SparkDataFrame", cols = "character"), #' } #' @note approxQuantile since 2.0.0 setMethod("approxQuantile", - signature(x = "SparkDataFrame", col = "character", + signature(x = "SparkDataFrame", cols = "character", probabilities = "numeric", relativeError = "numeric"), - function(x, col, probabilities,
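For comparison, PySpark's `DataFrame.approxQuantile` has the matching single-versus-multiple column behavior (one assumption worth noting: this requires a Spark version where the Python API also accepts a list of columns, i.e. 2.2+):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0), (4.0, 40.0)], ["a", "b"])

# Single column: a flat list of quantiles.
print(df.approxQuantile("a", [0.25, 0.5, 0.75], 0.0))
# Multiple columns: a list containing one list of quantiles per column.
print(df.approxQuantile(["a", "b"], [0.25, 0.5, 0.75], 0.0))
```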
spark git commit: [SPARK-18080][ML][PYTHON] Python API & Examples for Locality Sensitive Hashing
Repository: spark Updated Branches: refs/heads/master 21b4ba2d6 -> 08c1972a0 [SPARK-18080][ML][PYTHON] Python API & Examples for Locality Sensitive Hashing ## What changes were proposed in this pull request? This pull request includes python API and examples for LSH. The API changes was based on yanboliang 's PR #15768 and resolved conflicts and API changes on the Scala API. The examples are consistent with Scala examples of MinHashLSH and BucketedRandomProjectionLSH. ## How was this patch tested? API and examples are tested using spark-submit: `bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py` `bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py` User guide changes are generated and manually inspected: `SKIP_API=1 jekyll build` Author: Yun NiAuthor: Yanbo Liang Author: Yunni Closes #16715 from Yunni/spark-18080. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/08c1972a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/08c1972a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/08c1972a Branch: refs/heads/master Commit: 08c1972a0661d42f300520cc6e5fb31023de093b Parents: 21b4ba2 Author: Yun Ni Authored: Wed Feb 15 16:26:05 2017 -0800 Committer: Yanbo Liang Committed: Wed Feb 15 16:26:05 2017 -0800 -- docs/ml-features.md | 17 ++ .../JavaBucketedRandomProjectionLSHExample.java | 38 ++- .../examples/ml/JavaMinHashLSHExample.java | 57 +++- .../bucketed_random_projection_lsh_example.py | 81 ++ .../src/main/python/ml/min_hash_lsh_example.py | 81 ++ .../ml/BucketedRandomProjectionLSHExample.scala | 39 ++- .../spark/examples/ml/MinHashLSHExample.scala | 43 ++- .../scala/org/apache/spark/ml/feature/LSH.scala | 7 +- python/pyspark/ml/feature.py| 291 +++ 9 files changed, 601 insertions(+), 53 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/08c1972a/docs/ml-features.md -- diff --git a/docs/ml-features.md b/docs/ml-features.md index 13d97a2..57605ba 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -1558,6 +1558,15 @@ for more details on the API. {% include_example java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java %} + + + +Refer to the [BucketedRandomProjectionLSH Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.BucketedRandomProjectionLSH) +for more details on the API. + +{% include_example python/ml/bucketed_random_projection_lsh_example.py %} + + ### MinHash for Jaccard Distance @@ -1590,4 +1599,12 @@ for more details on the API. {% include_example java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java %} + + + +Refer to the [MinHashLSH Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.MinHashLSH) +for more details on the API. 
+ +{% include_example python/ml/min_hash_lsh_example.py %} + http://git-wip-us.apache.org/repos/asf/spark/blob/08c1972a/examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java -- diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java index ca3ee5a..4594e34 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java @@ -35,8 +35,15 @@ import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; + +import static org.apache.spark.sql.functions.col; // $example off$ +/** + * An example demonstrating BucketedRandomProjectionLSH. + * Run with: + * bin/run-example org.apache.spark.examples.ml.JavaBucketedRandomProjectionLSHExample + */ public class JavaBucketedRandomProjectionLSHExample { public static void main(String[] args) { SparkSession spark = SparkSession @@ -61,7 +68,7 @@ public class JavaBucketedRandomProjectionLSHExample { StructType schema = new StructType(new StructField[]{ new StructField("id", DataTypes.IntegerType, false, Metadata.empty()), - new StructField("keys", new VectorUDT(), false, Metadata.empty()) + new StructField("features", new VectorUDT(), false, Metadata.empty()) }); Dataset dfA = spark.createDataFrame(dataA, schema); Dataset dfB =
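A condensed sketch of the new Python usage (toy data and the 0.8 threshold are illustrative, loosely following the bundled `min_hash_lsh_example.py`):

```python
from pyspark.ml.feature import MinHashLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dfA = spark.createDataFrame([
    (0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0])),
    (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]))], ["id", "features"])
dfB = spark.createDataFrame([
    (3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0])),
    (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]))], ["id", "features"])

mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
model = mh.fit(dfA)

# Approximate join of the two datasets on Jaccard distance below 0.8.
model.approxSimilarityJoin(dfA, dfB, 0.8, distCol="JaccardDistance") \
    .select("datasetA.id", "datasetB.id", "JaccardDistance").show()
```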
spark git commit: [SPARK-18929][ML] Add Tweedie distribution in GLM
Repository: spark Updated Branches: refs/heads/master 90817a6cd -> 4172ff80d [SPARK-18929][ML] Add Tweedie distribution in GLM ## What changes were proposed in this pull request? I propose to add the full Tweedie family into the GeneralizedLinearRegression model. The Tweedie family is characterized by a power variance function. Currently supported distributions such as Gaussian, Poisson and Gamma families are a special case of the Tweedie https://en.wikipedia.org/wiki/Tweedie_distribution. yanboliang srowen sethah Author: actuaryzhangAuthor: Wayne Zhang Closes #16344 from actuaryzhang/tweedie. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4172ff80 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4172ff80 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4172ff80 Branch: refs/heads/master Commit: 4172ff80dd9ca9cde4f310953bfc386cbfc62ba4 Parents: 90817a6 Author: actuaryzhang Authored: Thu Jan 26 23:01:13 2017 -0800 Committer: Yanbo Liang Committed: Thu Jan 26 23:01:13 2017 -0800 -- .../GeneralizedLinearRegression.scala | 359 +++ .../GeneralizedLinearRegressionSuite.scala | 291 ++- 2 files changed, 567 insertions(+), 83 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4172ff80/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala index 3ffed39..c4f41d0 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala @@ -48,7 +48,7 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam /** * Param for the name of family which is a description of the error distribution * to be used in the model. - * Supported options: "gaussian", "binomial", "poisson" and "gamma". + * Supported options: "gaussian", "binomial", "poisson", "gamma" and "tweedie". * Default is "gaussian". * * @group param @@ -64,9 +64,34 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam def getFamily: String = $(family) /** + * Param for the power in the variance function of the Tweedie distribution which provides + * the relationship between the variance and mean of the distribution. + * Only applicable for the Tweedie family. + * (see https://en.wikipedia.org/wiki/Tweedie_distribution;> + * Tweedie Distribution (Wikipedia)) + * Supported values: 0 and [1, Inf). + * Note that variance power 0, 1, or 2 corresponds to the Gaussian, Poisson or Gamma + * family, respectively. + * + * @group param + */ + @Since("2.2.0") + final val variancePower: DoubleParam = new DoubleParam(this, "variancePower", +"The power in the variance function of the Tweedie distribution which characterizes " + +"the relationship between the variance and mean of the distribution. " + +"Only applicable for the Tweedie family. Supported values: 0 and [1, Inf).", +(x: Double) => x >= 1.0 || x == 0.0) + + /** @group getParam */ + @Since("2.2.0") + def getVariancePower: Double = $(variancePower) + + /** * Param for the name of link function which provides the relationship * between the linear predictor and the mean of the distribution function. * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt". 
+ * This is used only when family is not "tweedie". The link function for the "tweedie" family + * must be specified through [[linkPower]]. * * @group param */ @@ -81,6 +106,21 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam def getLink: String = $(link) /** + * Param for the index in the power link function. Only applicable for the Tweedie family. + * Note that link power 0, 1, -1 or 0.5 corresponds to the Log, Identity, Inverse or Sqrt + * link, respectively. + * + * @group param + */ + @Since("2.2.0") + final val linkPower: DoubleParam = new DoubleParam(this, "linkPower", +"The index in the power link function. Only applicable for the Tweedie family.") + + /** @group getParam */ + @Since("2.2.0") + def getLinkPower: Double = $(linkPower) + + /** * Param for link prediction (linear predictor) column name. * Default is not set, which means we do not output link prediction. * @@
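For reference, the "power variance function" named above can be stated compactly: the Tweedie family assumes

    Var(Y) = \phi \, \mu^{p}

for mean \mu, dispersion \phi and variance power p, so p = 0, 1 and 2 recover the Gaussian, Poisson and Gamma variance functions, while 1 < p < 2 gives the compound Poisson-Gamma distributions often used for insurance claims data.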
spark git commit: [SPARK-19313][ML][MLLIB] GaussianMixture should limit the number of features
Repository: spark Updated Branches: refs/heads/master 76db394f2 -> 0e821ec6f [SPARK-19313][ML][MLLIB] GaussianMixture should limit the number of features ## What changes were proposed in this pull request? The following test will fail on current master scala test("gmm fails on high dimensional data") { val ctx = spark.sqlContext import ctx.implicits._ val df = Seq( Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(0, 4), Array(3.0, 8.0)), Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(1, 5), Array(4.0, 9.0))) .map(Tuple1.apply).toDF("features") val gm = new GaussianMixture() intercept[IllegalArgumentException] { gm.fit(df) } } Instead, you'll get an `ArrayIndexOutOfBoundsException` or something similar for MLlib. That's because the covariance matrix allocates an array of `numFeatures * numFeatures`, and in this case we get integer overflow. While there is currently a warning that the algorithm does not perform well for high number of features, we should perform an appropriate check to communicate this limitation to users. This patch adds a `require(numFeatures < GaussianMixture.MAX_NUM_FEATURES)` check to ML and MLlib algorithms. For the feature limitation, we can limit it such that we do not get numerical overflow to something like `math.sqrt(Integer.MaxValue).toInt` (about 46k) which eliminates the cryptic error. However in, for example WLS, we need to collect an array on the order of `numFeatures * numFeatures` to the driver and we therefore limit to 4096 features. We may want to keep that convention here for consistency. ## How was this patch tested? Unit tests in ML and MLlib. Author: sethahCloses #16661 from sethah/gmm_high_dim. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0e821ec6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0e821ec6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0e821ec6 Branch: refs/heads/master Commit: 0e821ec6fa98f4b0aa6e2eb6fecd18cc1ee6f3f2 Parents: 76db394 Author: sethah Authored: Wed Jan 25 07:12:25 2017 -0800 Committer: Yanbo Liang Committed: Wed Jan 25 07:12:25 2017 -0800 -- .../apache/spark/ml/clustering/GaussianMixture.scala | 14 +++--- .../spark/mllib/clustering/GaussianMixture.scala | 15 --- .../spark/ml/clustering/GaussianMixtureSuite.scala | 14 ++ .../mllib/clustering/GaussianMixtureSuite.scala | 14 ++ 4 files changed, 51 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0e821ec6/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala b/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala index db5fff5..ea2dc6c 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala @@ -278,7 +278,9 @@ object GaussianMixtureModel extends MLReadable[GaussianMixtureModel] { * While this process is generally guaranteed to converge, it is not guaranteed * to find a global optimum. * - * @note For high-dimensional data (with many features), this algorithm may perform poorly. + * @note This algorithm is limited in its number of features since it requires storing a covariance + * matrix which has size quadratic in the number of features. Even when the number of features does + * not exceed this limit, this algorithm may perform poorly on high-dimensional data. 
* This is due to high-dimensional data (a) making it difficult to cluster at all (based * on statistical/theoretical arguments) and (b) numerical issues with Gaussian distributions. */ @@ -344,6 +346,9 @@ class GaussianMixture @Since("2.0.0") ( // Extract the number of features. val numFeatures = instances.first().size +require(numFeatures < GaussianMixture.MAX_NUM_FEATURES, s"GaussianMixture cannot handle more " + + s"than ${GaussianMixture.MAX_NUM_FEATURES} features because the size of the covariance" + + s" matrix is quadratic in the number of features.") val instr = Instrumentation.create(this, instances) instr.logParams(featuresCol, predictionCol, probabilityCol, k, maxIter, seed, tol) @@ -391,8 +396,8 @@ class GaussianMixture @Since("2.0.0") ( val (ws, gs) = sc.parallelize(tuples, numPartitions).map { case (mean, cov, weight) => GaussianMixture.updateWeightsAndGaussians(mean, cov, weight, sumWeights) }.collect().unzip -
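The quadratic-memory argument is quick to check; a back-of-the-envelope sketch (the 4096 limit is the convention quoted above, and the byte counts are plain arithmetic, not code from the patch):

```python
def cov_matrix_bytes(num_features):
    # One dense covariance matrix of doubles is numFeatures^2 * 8 bytes.
    return num_features * num_features * 8

print(cov_matrix_bytes(4096) / 2.0 ** 20)   # 128.0 MiB per Gaussian at the limit
print(cov_matrix_bytes(46340) / 2.0 ** 30)  # ~16 GiB near sqrt(Int.MaxValue) features
```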
spark git commit: [SPARK-19155][ML] Make family case insensitive in GLM
Repository: spark Updated Branches: refs/heads/branch-2.1 8daf10e3f -> 1e07a7192 [SPARK-19155][ML] Make family case insensitive in GLM ## What changes were proposed in this pull request? This is a supplement to PR #16516 which did not make the value from `getFamily` case insensitive. Current tests of poisson/binomial glm with weight fail when specifying 'Poisson' or 'Binomial', because the calculation of `dispersion` and `pValue` checks the value of family retrieved from `getFamily` ``` model.getFamily == Binomial.name || model.getFamily == Poisson.name ``` ## How was this patch tested? Update existing tests for 'Poisson' and 'Binomial'. yanboliang felixcheung imatiach-msft Author: actuaryzhangCloses #16675 from actuaryzhang/family. (cherry picked from commit f067acefabebf04939d03a639a2aaa654e1bc8f9) Signed-off-by: Yanbo Liang Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1e07a719 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1e07a719 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1e07a719 Branch: refs/heads/branch-2.1 Commit: 1e07a71924ef1420c96a3a0a8cb5be2f3a830037 Parents: 8daf10e Author: actuaryzhang Authored: Mon Jan 23 00:53:44 2017 -0800 Committer: Yanbo Liang Committed: Mon Jan 23 00:54:08 2017 -0800 -- .../spark/ml/regression/GeneralizedLinearRegression.scala | 6 -- .../spark/ml/regression/GeneralizedLinearRegressionSuite.scala | 4 ++-- 2 files changed, 6 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1e07a719/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala index 1e7ba91..676be61 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala @@ -1027,7 +1027,8 @@ class GeneralizedLinearRegressionSummary private[regression] ( */ @Since("2.0.0") lazy val dispersion: Double = if ( -model.getFamily == Binomial.name || model.getFamily == Poisson.name) { +model.getFamily.toLowerCase == Binomial.name || + model.getFamily.toLowerCase == Poisson.name) { 1.0 } else { val rss = pearsonResiduals.agg(sum(pow(col("pearsonResiduals"), 2.0))).first().getDouble(0) @@ -1130,7 +1131,8 @@ class GeneralizedLinearRegressionTrainingSummary private[regression] ( @Since("2.0.0") lazy val pValues: Array[Double] = { if (isNormalSolver) { - if (model.getFamily == Binomial.name || model.getFamily == Poisson.name) { + if (model.getFamily.toLowerCase == Binomial.name || +model.getFamily.toLowerCase == Poisson.name) { tValues.map { x => 2.0 * (1.0 - dist.Gaussian(0.0, 1.0).cdf(math.abs(x))) } } else { tValues.map { x => http://git-wip-us.apache.org/repos/asf/spark/blob/1e07a719/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala index 415d426..95b443d 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala @@ -757,7 +757,7 @@ class 
GeneralizedLinearRegressionSuite 0.5554219 -0.4034267 0.6567520 -0.2611382 */ val trainer = new GeneralizedLinearRegression() - .setFamily("binomial") + .setFamily("Binomial") .setWeightCol("weight") .setFitIntercept(false) @@ -874,7 +874,7 @@ class GeneralizedLinearRegressionSuite -0.4378554 0.2189277 0.1459518 -0.1094638 */ val trainer = new GeneralizedLinearRegression() - .setFamily("poisson") + .setFamily("Poisson") .setWeightCol("weight") .setFitIntercept(true) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
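A short sketch of the user-visible effect (assuming Spark 2.1.1+, where both this fix and #16516 are present; the data and column names are illustrative):

```python
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import GeneralizedLinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    (1.0, 1.0, Vectors.dense(0.0)),
    (2.0, 2.0, Vectors.dense(1.0)),
    (3.0, 1.0, Vectors.dense(2.0)),
], ["label", "weight", "features"])

# Capitalized family name: with this fix the dispersion and p-value code
# treats "Poisson" exactly like "poisson" instead of falling through to
# the non-Poisson branch.
glr = GeneralizedLinearRegression(family="Poisson", link="log",
                                  weightCol="weight")
model = glr.fit(df)
print(model.summary.pValues)
```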