[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/11136 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-190806584 LGTM. Merged into master. Thanks!
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-190607485 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52227/ Test PASSed.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-190607484 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-190607102 **[Test build #52227 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52227/consoleFull)** for PR 11136 at commit [`007a4ec`](https://github.com/apache/spark/commit/007a4ec324db273c048ed65fe8942daba0c9d844). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-190591444 **[Test build #52227 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52227/consoleFull)** for PR 11136 at commit [`007a4ec`](https://github.com/apache/spark/commit/007a4ec324db273c048ed65fe8942daba0c9d844).
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r54524989
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -0,0 +1,577 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionBase extends PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * Default is "gaussian".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+    "The name of family which is a description of the error distribution to be used in the " +
+      "model. Supported options: gaussian(default), binomial, poisson and gamma.",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilyNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of link function which provides the relationship
+   * between the linear predictor and the mean of the distribution function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "The name of link function " +
+    "which provides the relationship between the linear predictor and the mean of the " +
+    "distribution function. Supported options: identity, log, inverse, logit, probit, " +
+    "cloglog and sqrt.",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinkNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  import GeneralizedLinearRegression._
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+    if ($(solver) == "irls") {
+      setDefault(maxIter -> 25)
+    }
+    if (isDefined(link)) {
+      require(supportedFamilyAndLinkPairs.contains(
+        Family.fromName($(family)) -> Link.fromName($(link))), "Generalized Linear Regression " +
+        s"with ${$(family)} family does not support ${$(link)} link function.")
+    }
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]])
+ * specified by giving a symbolic description of the linear predictor (link function) and
+ * a description of the error distribution (family).
+ * It supports "gaussian", "binomial", "poisson" and "gamma" as family.
+ * Valid link functions for each family is listed below. The first link function of each family
+ * is the default one.
+ *  - "gaussian" -> "identity", "log", "inverse"
+ *  - "binomial" -> "logit", "probit", "cloglog"
+ *  - "poisson"  -> "log", "identity", "sqrt"
+ *  - "gamma"    -> "inverse", "identity", "log"
+ */
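Editorial note on the diff above: the family/link compatibility rule described in the Scaladoc can be pictured with a small standalone sketch. This is not the PR's code and has no Spark dependency. `FamilyLinkSketch`, `supportedLinks`, `defaultLink`, and `resolveLink` are hypothetical names introduced here for illustration; the table is copied from the quoted Scaladoc, and falling back to the first listed link when none is given is an assumption based on the sentence "The first link function of each family is the default one."

```scala
// Standalone sketch; all names here are hypothetical, not taken from the PR.
object FamilyLinkSketch {
  // Family -> supported links, in the order listed in the quoted Scaladoc.
  // The first link of each family is treated as its default (assumption).
  val supportedLinks: Map[String, Seq[String]] = Map(
    "gaussian" -> Seq("identity", "log", "inverse"),
    "binomial" -> Seq("logit", "probit", "cloglog"),
    "poisson"  -> Seq("log", "identity", "sqrt"),
    "gamma"    -> Seq("inverse", "identity", "log")
  )

  // Default link for a family; fails on an unknown family name.
  def defaultLink(family: String): String = {
    require(supportedLinks.contains(family), s"Unsupported family: $family")
    supportedLinks(family).head
  }

  // Returns the link to use, rejecting family/link pairs outside the table.
  def resolveLink(family: String, link: Option[String]): String = link match {
    case None => defaultLink(family)
    case Some(l) =>
      require(supportedLinks.contains(family), s"Unsupported family: $family")
      require(supportedLinks(family).contains(l),
        s"Generalized Linear Regression with $family family does not support $l link function.")
      l
  }
}
```

Under these assumptions, `FamilyLinkSketch.resolveLink("poisson", None)` yields `"log"`, while `resolveLink("binomial", Some("sqrt"))` throws an `IllegalArgumentException` from `require`, mirroring the failure mode of the `require` call in the quoted `validateParams`.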
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-190538874 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52216/ Test FAILed.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-190538867 **[Test build #52216 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52216/consoleFull)** for PR 11136 at commit [`31a912c`](https://github.com/apache/spark/commit/31a912cd74cf3dffbf8cc0af8c57b777d49579eb). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-190538871 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-190537367 **[Test build #52216 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52216/consoleFull)** for PR 11136 at commit [`31a912c`](https://github.com/apache/spark/commit/31a912cd74cf3dffbf8cc0af8c57b777d49579eb).
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r54521794
--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
@@ -0,0 +1,499 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import scala.util.Random
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.param.ParamsSuite
+import org.apache.spark.ml.util.MLTestingUtils
+import org.apache.spark.mllib.classification.LogisticRegressionSuite._
+import org.apache.spark.mllib.linalg.{BLAS, DenseVector, Vectors}
+import org.apache.spark.mllib.random._
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.sql.{DataFrame, Row}
+
+class GeneralizedLinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
+
+  private val seed: Int = 42
+  @transient var datasetGaussianIdentity: DataFrame = _
+  @transient var datasetGaussianLog: DataFrame = _
+  @transient var datasetGaussianInverse: DataFrame = _
+  @transient var datasetBinomial: DataFrame = _
+  @transient var datasetPoissonLog: DataFrame = _
+  @transient var datasetPoissonIdentity: DataFrame = _
+  @transient var datasetPoissonSqrt: DataFrame = _
+  @transient var datasetGammaInverse: DataFrame = _
+  @transient var datasetGammaIdentity: DataFrame = _
+  @transient var datasetGammaLog: DataFrame = _
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+
+    import GeneralizedLinearRegressionSuite._
+
+    datasetGaussianIdentity = sqlContext.createDataFrame(
+      sc.parallelize(generateGeneralizedLinearRegressionInput(
+        intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5),
+        xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+        family = "gaussian", link = "identity"), 2))
+
+    datasetGaussianLog = sqlContext.createDataFrame(
+      sc.parallelize(generateGeneralizedLinearRegressionInput(
+        intercept = 0.25, coefficients = Array(0.22, 0.06), xMean = Array(2.9, 10.5),
+        xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+        family = "gaussian", link = "log"), 2))
+
+    datasetGaussianInverse = sqlContext.createDataFrame(
+      sc.parallelize(generateGeneralizedLinearRegressionInput(
+        intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5),
+        xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+        family = "gaussian", link = "inverse"), 2))
+
+    datasetBinomial = {
+      val nPoints = 1
+      val coefficients = Array(-0.57997, 0.912083, -0.371077, -0.819866, 2.688191)
+      val xMean = Array(5.843, 3.057, 3.758, 1.199)
+      val xVariance = Array(0.6856, 0.1899, 3.116, 0.581)
+
+      val testData =
+        generateMultinomialLogisticInput(coefficients, xMean, xVariance, true, nPoints, seed)
+
+      sqlContext.createDataFrame(sc.parallelize(testData, 4))
+    }
+
+    datasetPoissonLog = sqlContext.createDataFrame(
+      sc.parallelize(generateGeneralizedLinearRegressionInput(
+        intercept = 0.25, coefficients = Array(0.22, 0.06), xMean = Array(2.9, 10.5),
+        xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+        family = "poisson", link = "log"), 2))
+
+    datasetPoissonIdentity = sqlContext.createDataFrame(
+      sc.parallelize(generateGeneralizedLinearRegressionInput(
+        intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5),
+        xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+        family = "poisson", link = "identity"), 2))
+
+    datasetPoissonSqrt = sqlContext.createDataFrame(
+
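Editorial note on the test setup above: the body of `generateGeneralizedLinearRegressionInput` is not shown in this thread, so the following is only a guess at the general shape of such a generator, restricted to the Gaussian family. `GlmDataSketch`, `inverseLink`, and `generateGaussian` are hypothetical names. The sampling scheme (features drawn from a normal distribution with the given mean and variance, label equal to the inverse link of the linear predictor plus `eps`-scaled standard normal noise) is an assumption, not the PR's implementation.

```scala
import scala.util.Random

// Hypothetical sketch; not the PR's generateGeneralizedLinearRegressionInput.
object GlmDataSketch {
  // Inverse link: maps the linear predictor eta back to the mean mu.
  def inverseLink(link: String, eta: Double): Double = link match {
    case "identity" => eta
    case "log"      => math.exp(eta)
    case "inverse"  => 1.0 / eta
    case "sqrt"     => eta * eta
    case other      =>
      throw new IllegalArgumentException(s"Link not covered by this sketch: $other")
  }

  // Gaussian-family labels only: y = inverseLink(eta) + eps * N(0, 1).
  def generateGaussian(
      intercept: Double,
      coefficients: Array[Double],
      xMean: Array[Double],
      xVariance: Array[Double],
      nPoints: Int,
      seed: Int,
      eps: Double,
      link: String): Seq[(Double, Array[Double])] = {
    require(coefficients.length == xMean.length && xMean.length == xVariance.length,
      "coefficients, xMean and xVariance must have the same length")
    val rnd = new Random(seed)
    Seq.fill(nPoints) {
      // Assumed feature distribution: x_j ~ N(xMean_j, xVariance_j).
      val x = Array.tabulate(coefficients.length) { j =>
        xMean(j) + math.sqrt(xVariance(j)) * rnd.nextGaussian()
      }
      // Linear predictor eta = intercept + coefficients . x
      val eta = intercept + coefficients.zip(x).map { case (c, v) => c * v }.sum
      val y = inverseLink(link, eta) + eps * rnd.nextGaussian()
      (y, x)
    }
  }
}
```

Because the generator is seeded, two calls with the same arguments return the same labels, which is the property a fixed `seed` in a test suite relies on. The other families (binomial, poisson, gamma) would need their own samplers and are left out of this sketch.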
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-190530641 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-190530637 **[Test build #52215 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52215/consoleFull)** for PR 11136 at commit [`314b562`](https://github.com/apache/spark/commit/314b562f315723a7117851289c8f5b6e1b16a6ac). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-190530642 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52215/ Test FAILed.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-190529580 **[Test build #52215 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52215/consoleFull)** for PR 11136 at commit [`314b562`](https://github.com/apache/spark/commit/314b562f315723a7117851289c8f5b6e1b16a6ac).
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-190528993 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-190527136 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52214/ Test FAILed.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-190527127 **[Test build #52214 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52214/consoleFull)** for PR 11136 at commit [`314b562`](https://github.com/apache/spark/commit/314b562f315723a7117851289c8f5b6e1b16a6ac). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-190527134 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r54519291 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -0,0 +1,499 @@
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-190526388 **[Test build #52214 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52214/consoleFull)** for PR 11136 at commit [`314b562`](https://github.com/apache/spark/commit/314b562f315723a7117851289c8f5b6e1b16a6ac).
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-190526311 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r54519070 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -0,0 +1,499 @@
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-190522461 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-190522465 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52211/ Test FAILed.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-190522453 **[Test build #52211 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52211/consoleFull)** for PR 11136 at commit [`314b562`](https://github.com/apache/spark/commit/314b562f315723a7117851289c8f5b6e1b16a6ac). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-190520555 **[Test build #52211 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52211/consoleFull)** for PR 11136 at commit [`314b562`](https://github.com/apache/spark/commit/314b562f315723a7117851289c8f5b6e1b16a6ac).
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r54518547 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -0,0 +1,499 @@
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-190477077 I made one pass on the tests, only some minor comments.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r54508437 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -0,0 +1,499 @@
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r54508443 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -0,0 +1,499 @@
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r54508387 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -0,0 +1,499 @@
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r54508365 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -0,0 +1,499 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.regression + +import scala.util.Random + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.param.ParamsSuite +import org.apache.spark.ml.util.MLTestingUtils +import org.apache.spark.mllib.classification.LogisticRegressionSuite._ +import org.apache.spark.mllib.linalg.{BLAS, DenseVector, Vectors} +import org.apache.spark.mllib.random._ +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ +import org.apache.spark.sql.{DataFrame, Row} + +class GeneralizedLinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext { + + private val seed: Int = 42 + @transient var datasetGaussianIdentity: DataFrame = _ + @transient var datasetGaussianLog: DataFrame = _ + @transient var datasetGaussianInverse: DataFrame = _ + @transient var datasetBinomial: DataFrame = _ + @transient var datasetPoissonLog: DataFrame = _ + @transient var datasetPoissonIdentity: DataFrame = _ + @transient var datasetPoissonSqrt: DataFrame = _ + @transient var datasetGammaInverse: DataFrame = _ + @transient var datasetGammaIdentity: DataFrame = _ + @transient var datasetGammaLog: DataFrame = _ + + override def beforeAll(): Unit = { +super.beforeAll() + +import GeneralizedLinearRegressionSuite._ + +datasetGaussianIdentity = sqlContext.createDataFrame( + sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family = "gaussian", link = "identity"), 2)) + +datasetGaussianLog = sqlContext.createDataFrame( + sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 0.25, coefficients = Array(0.22, 0.06), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family = "gaussian", link = "log"), 2)) + +datasetGaussianInverse = sqlContext.createDataFrame( + 
sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family = "gaussian", link = "inverse"), 2)) + +datasetBinomial = { + val nPoints = 1 + val coefficients = Array(-0.57997, 0.912083, -0.371077, -0.819866, 2.688191) + val xMean = Array(5.843, 3.057, 3.758, 1.199) + val xVariance = Array(0.6856, 0.1899, 3.116, 0.581) + + val testData = +generateMultinomialLogisticInput(coefficients, xMean, xVariance, true, nPoints, seed) --- End diff -- it would be good to say `addIntercept = true` instead of just `true`.
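The suggestion above, naming a bare Boolean argument at the call site, can be sketched without Spark. This is only an illustration under stated assumptions: `NamedArgumentSketch` and `generateInput` are hypothetical stand-ins invented here, not the `generateMultinomialLogisticInput` helper from the diff, and the parameter list is simplified. The only name taken from the review is `addIntercept`.

```scala
import scala.util.Random

object NamedArgumentSketch {
  // Hypothetical stand-in for a test-data generator; NOT Spark's implementation.
  // When addIntercept is true, the last coefficient is treated as the intercept.
  def generateInput(
      coefficients: Array[Double],
      addIntercept: Boolean,
      nPoints: Int,
      seed: Int): Seq[(Double, Array[Double])] = {
    val rnd = new Random(seed)
    val nFeatures = if (addIntercept) coefficients.length - 1 else coefficients.length
    Seq.fill(nPoints) {
      val x = Array.fill(nFeatures)(rnd.nextGaussian())
      // zip stops at the shorter array, so the intercept slot is skipped here
      val linear = x.zip(coefficients).map { case (xi, ci) => xi * ci }.sum
      val margin = linear + (if (addIntercept) coefficients.last else 0.0)
      (margin, x)
    }
  }

  def main(args: Array[String]): Unit = {
    val coefficients = Array(0.5, -1.0, 2.0)
    // Positional: the reader must look up what the bare `true` means.
    val positional = generateInput(coefficients, true, 5, 42)
    // Named: the call site documents itself; behaviour is unchanged.
    val named = generateInput(coefficients, addIntercept = true, nPoints = 5, seed = 42)
    assert(positional.map(_._1) == named.map(_._1))
    println(s"generated ${named.length} points with ${named.head._2.length} features each")
  }
}
```

Both calls compile to the same invocation; the named form only changes how the call reads, which is why it is a low-risk change to make in a test suite.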
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r54508381 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -0,0 +1,499 @@
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r54508392 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -0,0 +1,499 @@
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r54508370 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -0,0 +1,499 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.regression + +import scala.util.Random + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.param.ParamsSuite +import org.apache.spark.ml.util.MLTestingUtils +import org.apache.spark.mllib.classification.LogisticRegressionSuite._ +import org.apache.spark.mllib.linalg.{BLAS, DenseVector, Vectors} +import org.apache.spark.mllib.random._ +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ +import org.apache.spark.sql.{DataFrame, Row} + +class GeneralizedLinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext { + + private val seed: Int = 42 + @transient var datasetGaussianIdentity: DataFrame = _ + @transient var datasetGaussianLog: DataFrame = _ + @transient var datasetGaussianInverse: DataFrame = _ + @transient var datasetBinomial: DataFrame = _ + @transient var datasetPoissonLog: DataFrame = _ + @transient var datasetPoissonIdentity: DataFrame = _ + @transient var datasetPoissonSqrt: DataFrame = _ + @transient var datasetGammaInverse: DataFrame = _ + @transient var datasetGammaIdentity: DataFrame = _ + @transient var datasetGammaLog: DataFrame = _ + + override def beforeAll(): Unit = { +super.beforeAll() + +import GeneralizedLinearRegressionSuite._ + +datasetGaussianIdentity = sqlContext.createDataFrame( + sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family = "gaussian", link = "identity"), 2)) + +datasetGaussianLog = sqlContext.createDataFrame( + sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 0.25, coefficients = Array(0.22, 0.06), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family = "gaussian", link = "log"), 2)) + +datasetGaussianInverse = sqlContext.createDataFrame( + 
sc.parallelize(generateGeneralizedLinearRegressionInput(
+intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5),
+xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+family = "gaussian", link = "inverse"), 2))
+
+datasetBinomial = {
+ val nPoints = 1
+ val coefficients = Array(-0.57997, 0.912083, -0.371077, -0.819866, 2.688191)
+ val xMean = Array(5.843, 3.057, 3.758, 1.199)
+ val xVariance = Array(0.6856, 0.1899, 3.116, 0.581)
+
+ val testData =
+generateMultinomialLogisticInput(coefficients, xMean, xVariance, true, nPoints, seed)
+
+ sqlContext.createDataFrame(sc.parallelize(testData, 4))
--- End diff --
Why use 4 partitions here instead of 2, as in the other datasets?
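As a side note on the partition question above: the second argument to `sc.parallelize` only controls how many slices the local collection is cut into, not which rows end up in the DataFrame. The sketch below is plain Scala with no Spark dependency. `SliceSketch`, `slicePositions` and `sliceSizes` are hypothetical names, and the index arithmetic is an assumption about how an evenly sized split might be computed, not a copy of Spark's code.

```scala
// Sketch only: an index-based split of `length` items into `numSlices` slices.
// `SliceSketch` and its methods are hypothetical names, not Spark API.
object SliceSketch {
  /** Half-open (start, end) index ranges, one per slice. */
  def slicePositions(length: Int, numSlices: Int): Seq[(Int, Int)] = {
    require(numSlices >= 1, "numSlices must be positive")
    (0 until numSlices).map { i =>
      // Long arithmetic avoids overflow for large collections.
      val start = ((i * length.toLong) / numSlices).toInt
      val end = (((i + 1) * length.toLong) / numSlices).toInt
      (start, end)
    }
  }

  /** Number of items that land in each slice. */
  def sliceSizes(length: Int, numSlices: Int): Seq[Int] =
    slicePositions(length, numSlices).map { case (start, end) => end - start }
}
```

Under this scheme, going from 2 to 4 slices changes only how the same rows are grouped, which is presumably why the reviewer asks for a consistent value across datasets rather than flagging a correctness problem.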
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r54508383 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -0,0 +1,499 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.regression + +import scala.util.Random + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.param.ParamsSuite +import org.apache.spark.ml.util.MLTestingUtils +import org.apache.spark.mllib.classification.LogisticRegressionSuite._ +import org.apache.spark.mllib.linalg.{BLAS, DenseVector, Vectors} +import org.apache.spark.mllib.random._ +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ +import org.apache.spark.sql.{DataFrame, Row} + +class GeneralizedLinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext { + + private val seed: Int = 42 + @transient var datasetGaussianIdentity: DataFrame = _ + @transient var datasetGaussianLog: DataFrame = _ + @transient var datasetGaussianInverse: DataFrame = _ + @transient var datasetBinomial: DataFrame = _ + @transient var datasetPoissonLog: DataFrame = _ + @transient var datasetPoissonIdentity: DataFrame = _ + @transient var datasetPoissonSqrt: DataFrame = _ + @transient var datasetGammaInverse: DataFrame = _ + @transient var datasetGammaIdentity: DataFrame = _ + @transient var datasetGammaLog: DataFrame = _ + + override def beforeAll(): Unit = { +super.beforeAll() + +import GeneralizedLinearRegressionSuite._ + +datasetGaussianIdentity = sqlContext.createDataFrame( + sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family = "gaussian", link = "identity"), 2)) + +datasetGaussianLog = sqlContext.createDataFrame( + sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 0.25, coefficients = Array(0.22, 0.06), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family = "gaussian", link = "log"), 2)) + +datasetGaussianInverse = sqlContext.createDataFrame( + 
sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family = "gaussian", link = "inverse"), 2)) + +datasetBinomial = { + val nPoints = 1 + val coefficients = Array(-0.57997, 0.912083, -0.371077, -0.819866, 2.688191) + val xMean = Array(5.843, 3.057, 3.758, 1.199) + val xVariance = Array(0.6856, 0.1899, 3.116, 0.581) + + val testData = +generateMultinomialLogisticInput(coefficients, xMean, xVariance, true, nPoints, seed) + + sqlContext.createDataFrame(sc.parallelize(testData, 4)) +} + +datasetPoissonLog = sqlContext.createDataFrame( + sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 0.25, coefficients = Array(0.22, 0.06), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family = "poisson", link = "log"), 2)) + +datasetPoissonIdentity = sqlContext.createDataFrame( + sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family = "poisson", link = "identity"), 2)) + +datasetPoissonSqrt = sqlContext.createDataFrame( +
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-189211004 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52044/
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-189211002 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-189210653
**[Test build #52044 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52044/consoleFull)** for PR 11136 at commit [`c05a948`](https://github.com/apache/spark/commit/c05a94899c39bfa9ede9071bd6db6a937b33cf83).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-189197945 **[Test build #52044 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52044/consoleFull)** for PR 11136 at commit [`c05a948`](https://github.com/apache/spark/commit/c05a94899c39bfa9ede9071bd6db6a937b33cf83).
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-189192045 Gonna do another detailed pass over the code tomorrow.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r54206474
--- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
@@ -157,6 +157,12 @@ private[ml] class WeightedLeastSquares(
 private[ml] object WeightedLeastSquares {
 /**
+ * In order to take the normal equation approach efficiently, [[WeightedLeastSquares]]
+ * only supports the number of features is no more than 4096.
+ */
+ val MaxNumFeatures: Int = 4096
--- End diff --
OK, I will update it to `MAX_NUM_FEATURES` after collecting more comments. Thanks!
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r54205705
--- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
@@ -157,6 +157,12 @@ private[ml] class WeightedLeastSquares(
 private[ml] object WeightedLeastSquares {
 /**
+ * In order to take the normal equation approach efficiently, [[WeightedLeastSquares]]
+ * only supports the number of features is no more than 4096.
+ */
+ val MaxNumFeatures: Int = 4096
--- End diff --
This is not specified in the Spark code style guide, and the Scala style guide recommends `MaxNumFeatures`. But I do like `MAX_NUM_FEATURES` better.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r54141380
--- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
@@ -157,6 +157,12 @@ private[ml] class WeightedLeastSquares(
 private[ml] object WeightedLeastSquares {
 /**
+ * In order to take the normal equation approach efficiently, [[WeightedLeastSquares]]
+ * only supports the number of features is no more than 4096.
+ */
+ val MaxNumFeatures: Int = 4096
--- End diff --
Do we have a naming convention for constants, like `MAX_NUM_FEATURES`?
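To make the three messages above concrete, here is a minimal sketch of what the renamed constant and a guard around it could look like. Only the value 4096 and the name `MAX_NUM_FEATURES` come from the thread. `FeatureLimitSketch`, `checkNumFeatures` and `packedTriangleEntries` are hypothetical, and the assumption that the normal-equation solver keeps a packed upper triangle of the d x d Gram matrix is mine, used only to show why a cap on the feature count is plausible.

```scala
// Sketch only: hypothetical object, not the actual WeightedLeastSquares code.
object FeatureLimitSketch {
  /** Upper bound on the number of features (value taken from the quoted diff). */
  val MAX_NUM_FEATURES: Int = 4096

  /** Entries in a packed upper triangle of a d x d symmetric matrix (assumed storage). */
  def packedTriangleEntries(numFeatures: Int): Long =
    numFeatures.toLong * (numFeatures + 1) / 2

  /** Fails fast when the feature count exceeds the cap. */
  def checkNumFeatures(numFeatures: Int): Unit =
    require(numFeatures <= MAX_NUM_FEATURES,
      s"Got $numFeatures features, but at most $MAX_NUM_FEATURES are supported.")
}
```

At the cap, the packed triangle holds 8,390,656 doubles, roughly 64 MiB per copy, so memory and the cost of the local solve grow quadratically or worse beyond it.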
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-188745905 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51963/
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-188745899 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-188745464
**[Test build #51963 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51963/consoleFull)** for PR 11136 at commit [`2ebcef7`](https://github.com/apache/spark/commit/2ebcef728315ad6b03d48c7f7e7f504e5e193748).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-188725749 **[Test build #51963 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51963/consoleFull)** for PR 11136 at commit [`2ebcef7`](https://github.com/apache/spark/commit/2ebcef728315ad6b03d48c7f7e7f504e5e193748).
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-188662047 Only some minor comments on the implementation. I will make a pass on the tests tomorrow. @dbtsai It would be great if you can make a pass too.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r54059562 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -0,0 +1,565 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.regression + +import breeze.stats.distributions.{Gaussian => GD} + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.PredictorParams +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.optim._ +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.linalg.{BLAS, Vector} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ + +/** + * Params for Generalized Linear Regression. 
+ */ +private[regression] trait GeneralizedLinearRegressionBase extends PredictorParams + with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol + with HasSolver with Logging { + + /** + * Param for the name of family which is a description of the error distribution + * to be used in the model. + * Supported options: "gaussian", "binomial", "poisson" and "gamma". + * Default is "gaussian". + * @group param + */ + @Since("2.0.0") + final val family: Param[String] = new Param(this, "family", +"The name of family which is a description of the error distribution to be used in the " + + "model. Supported options: gaussian(default), binomial, poisson and gamma.", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilyNames.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getFamily: String = $(family) + + /** + * Param for the name of link function which provides the relationship + * between the linear predictor and the mean of the distribution function. + * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt". + * @group param + */ + @Since("2.0.0") + final val link: Param[String] = new Param(this, "link", "The name of link function " + +"which provides the relationship between the linear predictor and the mean of the " + +"distribution function. 
Supported options: identity, log, inverse, logit, probit, " + +"cloglog and sqrt.", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinkNames.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getLink: String = $(link) + + import GeneralizedLinearRegression._ + protected lazy val familyObj = Family.fromName($(family)) + protected lazy val linkObj = if (isDefined(link)) { +Link.fromName($(link)) + } else { +familyObj.defaultLink + } + protected lazy val familyAndLink = new FamilyAndLink(familyObj, linkObj) + + @Since("2.0.0") + override def validateParams(): Unit = { +if ($(solver) == "irls") { + setDefault(maxIter -> 25) +} +if (isDefined(link)) { + require(GeneralizedLinearRegression.supportedFamilyAndLinkParis.contains( +familyObj -> linkObj), s"Generalized Linear Regression with ${$(family)} family " + +s"does not support ${$(link)} link function.") +} + } +} + +/** + * :: Experimental :: + * + * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]]) + * specified by giving a symbolic description of the linear predictor (link function) and + * a description of the error distribution (family). + * It supports "gaussian", "binomial", "poisson" and "gamma" as family. + * Valid link functions for each family is listed below. The
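For readers unfamiliar with the `link` param documented in the quoted diff, a link function g maps the mean mu of the response to the linear predictor eta = g(mu), and its inverse maps eta back to mu. The sketch below writes out a few of the listed links as plain Scala functions. `LinkSketch`, the `Link` case class and its `link`/`unlink` field names are hypothetical here and are not taken from the patch; probit and cloglog are left out to keep the example short.

```scala
// Sketch only: hypothetical types illustrating link functions and their inverses.
object LinkSketch {
  /** `link` maps mean -> linear predictor; `unlink` maps back. */
  final case class Link(name: String, link: Double => Double, unlink: Double => Double)

  val identityLink: Link = Link("identity", mu => mu, eta => eta)
  val logLink: Link = Link("log", mu => math.log(mu), eta => math.exp(eta))
  val inverseLink: Link = Link("inverse", mu => 1.0 / mu, eta => 1.0 / eta)
  val sqrtLink: Link = Link("sqrt", mu => math.sqrt(mu), eta => eta * eta)
  val logitLink: Link =
    Link("logit", mu => math.log(mu / (1.0 - mu)), eta => 1.0 / (1.0 + math.exp(-eta)))

  val all: Seq[Link] = Seq(identityLink, logLink, inverseLink, sqrtLink, logitLink)
}
```

Applying `unlink` after `link` should return the original mean up to floating-point error, which is a convenient sanity check for any link added later.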
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r54059544 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -0,0 +1,565 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.regression + +import breeze.stats.distributions.{Gaussian => GD} + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.PredictorParams +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.optim._ +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.linalg.{BLAS, Vector} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ + +/** + * Params for Generalized Linear Regression. 
+ */ +private[regression] trait GeneralizedLinearRegressionBase extends PredictorParams + with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol + with HasSolver with Logging { + + /** + * Param for the name of family which is a description of the error distribution + * to be used in the model. + * Supported options: "gaussian", "binomial", "poisson" and "gamma". + * Default is "gaussian". + * @group param + */ + @Since("2.0.0") + final val family: Param[String] = new Param(this, "family", +"The name of family which is a description of the error distribution to be used in the " + + "model. Supported options: gaussian(default), binomial, poisson and gamma.", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilyNames.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getFamily: String = $(family) + + /** + * Param for the name of link function which provides the relationship + * between the linear predictor and the mean of the distribution function. + * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt". + * @group param + */ + @Since("2.0.0") + final val link: Param[String] = new Param(this, "link", "The name of link function " + +"which provides the relationship between the linear predictor and the mean of the " + +"distribution function. 
Supported options: identity, log, inverse, logit, probit, " + +"cloglog and sqrt.", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinkNames.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getLink: String = $(link) + + import GeneralizedLinearRegression._ + protected lazy val familyObj = Family.fromName($(family)) + protected lazy val linkObj = if (isDefined(link)) { +Link.fromName($(link)) + } else { +familyObj.defaultLink + } + protected lazy val familyAndLink = new FamilyAndLink(familyObj, linkObj) + + @Since("2.0.0") + override def validateParams(): Unit = { +if ($(solver) == "irls") { + setDefault(maxIter -> 25) +} +if (isDefined(link)) { + require(GeneralizedLinearRegression.supportedFamilyAndLinkParis.contains( +familyObj -> linkObj), s"Generalized Linear Regression with ${$(family)} family " + +s"does not support ${$(link)} link function.") +} + } +} + +/** + * :: Experimental :: + * + * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]]) + * specified by giving a symbolic description of the linear predictor (link function) and + * a description of the error distribution (family). + * It supports "gaussian", "binomial", "poisson" and "gamma" as family. + * Valid link functions for each family is listed below. The
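The `validateParams` logic quoted above falls back to the family's default link when none is set and rejects unsupported family/link pairs. A minimal standalone sketch of that idea follows. `FamilyLinkSketch` and `resolveLink` are hypothetical names. The pairings for gaussian, poisson and gamma follow the test datasets quoted elsewhere in this thread, while the binomial links and the choice of each family's canonical link as its default are assumptions on my part.

```scala
// Sketch only: hypothetical lookup table, not the patch's FamilyAndLink class.
object FamilyLinkSketch {
  /** Supported links per family; by convention here, the first entry is the default. */
  val supportedLinks: Map[String, Seq[String]] = Map(
    "gaussian" -> Seq("identity", "log", "inverse"),
    "binomial" -> Seq("logit", "probit", "cloglog"), // assumed, see lead-in
    "poisson" -> Seq("log", "identity", "sqrt"),
    "gamma" -> Seq("inverse", "identity", "log")
  )

  /** Returns the link to use, or throws IllegalArgumentException for a bad combination. */
  def resolveLink(family: String, link: Option[String]): String = {
    val links = supportedLinks.getOrElse(family,
      throw new IllegalArgumentException(s"Unsupported family: $family"))
    link match {
      case None => links.head // no explicit link: use the family's default
      case Some(name) =>
        require(links.contains(name),
          s"Generalized Linear Regression with $family family does not support $name link function.")
        name
    }
  }
}
```

For example, `FamilyLinkSketch.resolveLink("poisson", None)` picks the log link, while asking for a logit link with the gaussian family fails with the same kind of message the quoted `require` produces.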
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r54059585 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r54059572 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r54059552 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11136#discussion_r54059534

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -0,0 +1,565 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.regression
    +
    +import breeze.stats.distributions.{Gaussian => GD}
    +
    +import org.apache.spark.{Logging, SparkException}
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.ml.PredictorParams
    +import org.apache.spark.ml.feature.Instance
    +import org.apache.spark.ml.optim._
    +import org.apache.spark.ml.param._
    +import org.apache.spark.ml.param.shared._
    +import org.apache.spark.ml.util.Identifiable
    +import org.apache.spark.mllib.linalg.{BLAS, Vector}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.{DataFrame, Row}
    +import org.apache.spark.sql.functions._
    +
    +/**
    + * Params for Generalized Linear Regression.
    + */
    +private[regression] trait GeneralizedLinearRegressionBase extends PredictorParams
    +  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol
    +  with HasSolver with Logging {
    +
    +  /**
    +   * Param for the name of family which is a description of the error distribution
    +   * to be used in the model.
    +   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
    +   * Default is "gaussian".
    +   * @group param
    +   */
    +  @Since("2.0.0")
    +  final val family: Param[String] = new Param(this, "family",
    +    "The name of family which is a description of the error distribution to be used in the " +
    +      "model. Supported options: gaussian(default), binomial, poisson and gamma.",
    +    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilyNames.toArray))
    +
    +  /** @group getParam */
    +  @Since("2.0.0")
    +  def getFamily: String = $(family)
    +
    +  /**
    +   * Param for the name of link function which provides the relationship
    +   * between the linear predictor and the mean of the distribution function.
    +   * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt".
    +   * @group param
    +   */
    +  @Since("2.0.0")
    +  final val link: Param[String] = new Param(this, "link", "The name of link function " +
    +    "which provides the relationship between the linear predictor and the mean of the " +
    +    "distribution function. Supported options: identity, log, inverse, logit, probit, " +
    +    "cloglog and sqrt.",
    +    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinkNames.toArray))
    +
    +  /** @group getParam */
    +  @Since("2.0.0")
    +  def getLink: String = $(link)
    +
    +  import GeneralizedLinearRegression._
    +  protected lazy val familyObj = Family.fromName($(family))
    --- End diff --

    This cannot be a member `val`. Users can set param `family` multiple times. We should move it and `linkObj` to `fit/train`.
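To illustrate the concern raised in the comment above, here is a minimal sketch in plain Scala. `CachedEstimator` and `FreshEstimator` are hypothetical stand-ins, not Spark classes, and `toUpperCase` merely stands in for a lookup such as `Family.fromName`. A member `lazy val` is evaluated once, on first access, so a later `setFamily` call has no effect on it; resolving the value inside `fit` picks up the current param on every call.

```scala
// Hypothetical stand-ins for illustration only; none of these are Spark classes.
object LazyParamDemo {

  // Caches the resolved family in a member lazy val.
  class CachedEstimator {
    private var family: String = "gaussian"
    def setFamily(value: String): this.type = { family = value; this }
    // Evaluated once, on first access; later setFamily calls do not change it.
    lazy val familyObj: String = family.toUpperCase
    def fit(): String = familyObj
  }

  // Resolves the family inside fit, so every call sees the current param.
  class FreshEstimator {
    private var family: String = "gaussian"
    def setFamily(value: String): this.type = { family = value; this }
    def fit(): String = {
      val familyObj = family.toUpperCase // local to this fit call
      familyObj
    }
  }

  def main(args: Array[String]): Unit = {
    val cached = new CachedEstimator()
    println(cached.fit())                      // GAUSSIAN
    println(cached.setFamily("poisson").fit()) // GAUSSIAN (stale cached value)

    val fresh = new FreshEstimator()
    println(fresh.fit())                       // GAUSSIAN
    println(fresh.setFamily("poisson").fit())  // POISSON
  }
}
```

Moving the lookup into `fit`/`train`, as the comment suggests, keeps the resolved objects local to a single training run, so reusing one estimator with different settings behaves as expected.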
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r54059489 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -0,0 +1,499 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.regression + +import scala.util.Random + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.param.ParamsSuite +import org.apache.spark.ml.util.MLTestingUtils +import org.apache.spark.mllib.classification.LogisticRegressionSuite._ +import org.apache.spark.mllib.linalg.{BLAS, DenseVector, Vectors} +import org.apache.spark.mllib.random._ +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ +import org.apache.spark.sql.{DataFrame, Row} + +class GeneralizedLinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext { + + private val seed: Int = 42 + @transient var datasetGaussianIdentity: DataFrame = _ + @transient var datasetGaussianLog: DataFrame = _ + @transient var datasetGaussianInverse: DataFrame = _ + @transient var datasetBinomial: DataFrame = _ + @transient var datasetPoissonLog: DataFrame = _ + @transient var datasetPoissonIdentity: DataFrame = _ + @transient var datasetPoissonSqrt: DataFrame = _ + @transient var datasetGammaInverse: DataFrame = _ + @transient var datasetGammaIdentity: DataFrame = _ + @transient var datasetGammaLog: DataFrame = _ + + override def beforeAll(): Unit = { +super.beforeAll() + +import GeneralizedLinearRegressionSuite._ + +datasetGaussianIdentity = sqlContext.createDataFrame( + sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family = "gaussian", link = "identity"), 2)) + +datasetGaussianLog = sqlContext.createDataFrame( + sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 0.25, coefficients = Array(0.22, 0.06), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family = "gaussian", link = "log"), 2)) + +datasetGaussianInverse = sqlContext.createDataFrame( + 
sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family = "gaussian", link = "inverse"), 2)) + +datasetBinomial = { + val nPoints = 1 + val coefficients = Array(-0.57997, 0.912083, -0.371077, -0.819866, 2.688191) + val xMean = Array(5.843, 3.057, 3.758, 1.199) + val xVariance = Array(0.6856, 0.1899, 3.116, 0.581) + + val testData = +generateMultinomialLogisticInput(coefficients, xMean, xVariance, true, nPoints, seed) + + sqlContext.createDataFrame(sc.parallelize(testData, 4)) +} + +datasetPoissonLog = sqlContext.createDataFrame( + sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 0.25, coefficients = Array(0.22, 0.06), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family = "poisson", link = "log"), 2)) + +datasetPoissonIdentity = sqlContext.createDataFrame( + sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family = "poisson", link = "identity"), 2)) + +datasetPoissonSqrt = sqlContext.createDataFrame( +
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11136#discussion_r54059524

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
    @@ -156,6 +156,8 @@ private[ml] class WeightedLeastSquares(

     private[ml] object WeightedLeastSquares {

    +  val MaxNumFeatures: Int = 4096
    --- End diff --

    add doc
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-188165176 @mengxr This PR is ready for another pass. Thanks!
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53913710 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53911596 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53911268 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -0,0 +1,499 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.regression + +import scala.util.Random + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.param.ParamsSuite +import org.apache.spark.ml.util.MLTestingUtils +import org.apache.spark.mllib.classification.LogisticRegressionSuite._ +import org.apache.spark.mllib.linalg.{BLAS, DenseVector, Vectors} +import org.apache.spark.mllib.random._ +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ +import org.apache.spark.sql.{DataFrame, Row} + +class GeneralizedLinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext { + + private val seed: Int = 42 + @transient var datasetGaussianIdentity: DataFrame = _ + @transient var datasetGaussianLog: DataFrame = _ + @transient var datasetGaussianInverse: DataFrame = _ + @transient var datasetBinomial: DataFrame = _ + @transient var datasetPoissonLog: DataFrame = _ + @transient var datasetPoissonIdentity: DataFrame = _ + @transient var datasetPoissonSqrt: DataFrame = _ + @transient var datasetGammaInverse: DataFrame = _ + @transient var datasetGammaIdentity: DataFrame = _ + @transient var datasetGammaLog: DataFrame = _ + + override def beforeAll(): Unit = { +super.beforeAll() + +import GeneralizedLinearRegressionSuite._ + +datasetGaussianIdentity = sqlContext.createDataFrame( + sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family = "gaussian", link = "identity"), 2)) + +datasetGaussianLog = sqlContext.createDataFrame( + sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 0.25, coefficients = Array(0.22, 0.06), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family = "gaussian", link = "log"), 2)) + +datasetGaussianInverse = sqlContext.createDataFrame( + 
sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family = "gaussian", link = "inverse"), 2)) + +datasetBinomial = { + val nPoints = 1 + val coefficients = Array(-0.57997, 0.912083, -0.371077, -0.819866, 2.688191) + val xMean = Array(5.843, 3.057, 3.758, 1.199) + val xVariance = Array(0.6856, 0.1899, 3.116, 0.581) + + val testData = +generateMultinomialLogisticInput(coefficients, xMean, xVariance, true, nPoints, seed) + + sqlContext.createDataFrame(sc.parallelize(testData, 4)) +} + +datasetPoissonLog = sqlContext.createDataFrame( + sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 0.25, coefficients = Array(0.22, 0.06), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family = "poisson", link = "log"), 2)) + +datasetPoissonIdentity = sqlContext.createDataFrame( + sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family = "poisson", link = "identity"), 2)) + +datasetPoissonSqrt = sqlContext.createDataFrame( +
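The body of `generateGeneralizedLinearRegressionInput` is not shown in this diff excerpt, so the following is only a guess at the general idea for the gaussian family: draw features around `xMean`, form the linear predictor, map it through the inverse link, and add small noise scaled by `eps`. `GlmDataSketch`, `inverseLink` and `generate` are hypothetical names, and the noise model is an assumption rather than the helper's actual behavior.

```scala
import scala.util.Random

object GlmDataSketch {
  // Inverse link: maps the linear predictor eta back to the mean mu.
  def inverseLink(link: String)(eta: Double): Double = link match {
    case "identity" => eta
    case "log"      => math.exp(eta)
    case "inverse"  => 1.0 / eta
    case other      => throw new IllegalArgumentException(s"Unsupported link: $other")
  }

  /** Generates (label, features) pairs for a gaussian-family model.
    * Assumed noise model: label = mu + eps * N(0, 1). */
  def generate(
      intercept: Double,
      coefficients: Array[Double],
      xMean: Array[Double],
      xVariance: Array[Double],
      nPoints: Int,
      seed: Int,
      eps: Double,
      link: String): Seq[(Double, Array[Double])] = {
    val rnd = new Random(seed)
    Seq.fill(nPoints) {
      // Features: independent normal draws with the given mean and variance.
      val x = Array.tabulate(coefficients.length) { i =>
        xMean(i) + math.sqrt(xVariance(i)) * rnd.nextGaussian()
      }
      // Linear predictor, then the mean on the response scale.
      val eta = intercept + coefficients.zip(x).map { case (c, v) => c * v }.sum
      val mu = inverseLink(link)(eta)
      (mu + eps * rnd.nextGaussian(), x)
    }
  }
}
```

Because the generator is seeded, repeated calls with the same arguments return the same points, which is what lets a test suite compare fitted coefficients against fixed reference values.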
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53909986 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -0,0 +1,565 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.regression + +import breeze.stats.distributions.{Gaussian => GD} + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.PredictorParams +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.optim._ +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.linalg.{BLAS, Vector} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ + +/** + * Params for Generalized Linear Regression. 
+ */ +private[regression] trait GeneralizedLinearRegressionBase extends PredictorParams + with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol + with HasSolver with Logging { + + /** + * Param for the name of family which is a description of the error distribution + * to be used in the model. + * Supported options: "gaussian", "binomial", "poisson" and "gamma". + * Default is "gaussian". + * @group param + */ + @Since("2.0.0") + final val family: Param[String] = new Param(this, "family", +"The name of family which is a description of the error distribution to be used in the " + + "model. Supported options: gaussian(default), binomial, poisson and gamma.", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilyNames.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getFamily: String = $(family) + + /** + * Param for the name of link function which provides the relationship + * between the linear predictor and the mean of the distribution function. + * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt". + * @group param + */ + @Since("2.0.0") + final val link: Param[String] = new Param(this, "link", "The name of link function " + +"which provides the relationship between the linear predictor and the mean of the " + +"distribution function. 
Supported options: identity, log, inverse, logit, probit, " + +"cloglog and sqrt.", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinkNames.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getLink: String = $(link) + + import GeneralizedLinearRegression._ + protected lazy val familyObj = Family.fromName($(family)) + protected lazy val linkObj = if (isDefined(link)) { +Link.fromName($(link)) + } else { +familyObj.defaultLink + } + protected lazy val familyAndLink = new FamilyAndLink(familyObj, linkObj) + + @Since("2.0.0") + override def validateParams(): Unit = { +if ($(solver) == "irls") { + setDefault(maxIter -> 25) +} +if (isDefined(link)) { + require(GeneralizedLinearRegression.supportedFamilyAndLinkParis.contains( +familyObj -> linkObj), s"Generalized Linear Regression with ${$(family)} family " + +s"does not support ${$(link)} link function.") +} + } +} + +/** + * :: Experimental :: + * + * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]]) + * specified by giving a symbolic description of the linear predictor (link function) and + * a description of the error distribution (family). + * It supports "gaussian", "binomial", "poisson" and "gamma" as family. + * Valid link functions for each family is listed below.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53909458 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -0,0 +1,565 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.regression + +import breeze.stats.distributions.{Gaussian => GD} + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.PredictorParams +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.optim._ +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.linalg.{BLAS, Vector} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ + +/** + * Params for Generalized Linear Regression. 
+ */ +private[regression] trait GeneralizedLinearRegressionBase extends PredictorParams + with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol + with HasSolver with Logging { + + /** + * Param for the name of family which is a description of the error distribution + * to be used in the model. + * Supported options: "gaussian", "binomial", "poisson" and "gamma". + * Default is "gaussian". + * @group param + */ + @Since("2.0.0") + final val family: Param[String] = new Param(this, "family", +"The name of family which is a description of the error distribution to be used in the " + + "model. Supported options: gaussian(default), binomial, poisson and gamma.", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilyNames.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getFamily: String = $(family) + + /** + * Param for the name of link function which provides the relationship + * between the linear predictor and the mean of the distribution function. + * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt". + * @group param + */ + @Since("2.0.0") + final val link: Param[String] = new Param(this, "link", "The name of link function " + +"which provides the relationship between the linear predictor and the mean of the " + +"distribution function. 
Supported options: identity, log, inverse, logit, probit, " + +"cloglog and sqrt.", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinkNames.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getLink: String = $(link) + + import GeneralizedLinearRegression._ + protected lazy val familyObj = Family.fromName($(family)) + protected lazy val linkObj = if (isDefined(link)) { +Link.fromName($(link)) + } else { +familyObj.defaultLink + } + protected lazy val familyAndLink = new FamilyAndLink(familyObj, linkObj) + + @Since("2.0.0") + override def validateParams(): Unit = { +if ($(solver) == "irls") { + setDefault(maxIter -> 25) +} +if (isDefined(link)) { + require(GeneralizedLinearRegression.supportedFamilyAndLinkParis.contains( +familyObj -> linkObj), s"Generalized Linear Regression with ${$(family)} family " + +s"does not support ${$(link)} link function.") +} + } +} + +/** + * :: Experimental :: + * + * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]]) + * specified by giving a symbolic description of the linear predictor (link function) and + * a description of the error distribution (family). + * It supports "gaussian", "binomial", "poisson" and "gamma" as family. + * Valid link functions for each family is listed below.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-188138657 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-188138718 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51857/
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-188138376 **[Test build #51857 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51857/consoleFull)** for PR 11136 at commit [`aa89fdc`](https://github.com/apache/spark/commit/aa89fdcb837e12481bed23f781925ea6e8f6acbe). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-188129136 **[Test build #51857 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51857/consoleFull)** for PR 11136 at commit [`aa89fdc`](https://github.com/apache/spark/commit/aa89fdcb837e12481bed23f781925ea6e8f6acbe).
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53722773 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -0,0 +1,547 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.regression + +import breeze.stats.distributions.{Gaussian => GD} + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.PredictorParams +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.optim._ +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.linalg.{BLAS, Vector} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ + +/** + * Params for Generalized Linear Regression. + */ +private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams --- End diff -- We can call it `GeneralizedLinearRegressionBase` because it also implements other functions. 
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53722712 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -0,0 +1,547 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.regression + +import breeze.stats.distributions.{Gaussian => GD} + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.PredictorParams +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.optim._ +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.linalg.{BLAS, Vector} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ + +/** + * Params for Generalized Linear Regression. 
+ */ +private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams + with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol + with HasSolver with Logging { + + /** + * Param for the name of family which is a description of the error distribution + * to be used in the model. + * Supported options: "gaussian", "binomial", "poisson" and "gamma". + * @group param + */ + @Since("2.0.0") + final val family: Param[String] = new Param(this, "family", +"the name of family which is a description of the error distribution to be used in the model", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getFamily: String = $(family) + + /** + * Param for the name of the model link function. + * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt". + * @group param + */ + @Since("2.0.0") + final val link: Param[String] = new Param(this, "link", "the name of the model link function", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getLink: String = $(link) + + @Since("2.0.0") + override def validateParams(): Unit = { +if (isDefined(link)) { + require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) -> $(link)), +s"Generalized Linear Regression with ${$(family)} family does not support ${$(link)} " + + s"link function.") +} + } +} + +/** + * :: Experimental :: + * + * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]]) + * specified by giving a symbolic description of the linear predictor and + * a description of the error distribution. 
+ */ +@Experimental +@Since("2.0.0") +class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val uid: String) + extends Regressor[Vector, GeneralizedLinearRegression, GeneralizedLinearRegressionModel] + with GeneralizedLinearRegressionParams with Logging { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("genLinReg")) + + /** + * Set the name of family which is a description of the error distribution + * to be used in the model. + * @group setParam + */ + @Since("2.0.0") + def setFamily(value: String): this.type = set(family, value) + + /** + * Set the name of the model link function. + * @group setParam + */ + @Since("2.0.0") + def setLink(value: String): this.type = set(link, value) + + /** + * Set if we should fit the intercept. + * Default is true. + * @group setParam + */ + @Since("2.0.0") + def setFitIntercept(value:
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53722694 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -0,0 +1,547 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.regression + +import breeze.stats.distributions.{Gaussian => GD} + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.PredictorParams +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.optim._ +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.linalg.{BLAS, Vector} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ + +/** + * Params for Generalized Linear Regression. 
+ */ +private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams + with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol + with HasSolver with Logging { + + /** + * Param for the name of family which is a description of the error distribution + * to be used in the model. + * Supported options: "gaussian", "binomial", "poisson" and "gamma". + * @group param + */ + @Since("2.0.0") + final val family: Param[String] = new Param(this, "family", +"the name of family which is a description of the error distribution to be used in the model", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getFamily: String = $(family) + + /** + * Param for the name of the model link function. + * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt". + * @group param + */ + @Since("2.0.0") + final val link: Param[String] = new Param(this, "link", "the name of the model link function", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getLink: String = $(link) + + @Since("2.0.0") + override def validateParams(): Unit = { +if (isDefined(link)) { + require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) -> $(link)), +s"Generalized Linear Regression with ${$(family)} family does not support ${$(link)} " + + s"link function.") +} + } +} + +/** + * :: Experimental :: + * + * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]]) + * specified by giving a symbolic description of the linear predictor and + * a description of the error distribution. 
+ */ +@Experimental +@Since("2.0.0") +class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val uid: String) + extends Regressor[Vector, GeneralizedLinearRegression, GeneralizedLinearRegressionModel] + with GeneralizedLinearRegressionParams with Logging { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("genLinReg")) + + /** + * Set the name of family which is a description of the error distribution + * to be used in the model. + * @group setParam + */ + @Since("2.0.0") + def setFamily(value: String): this.type = set(family, value) + + /** + * Set the name of the model link function. + * @group setParam + */ + @Since("2.0.0") + def setLink(value: String): this.type = set(link, value) + + /** + * Set if we should fit the intercept. + * Default is true. + * @group setParam + */ + @Since("2.0.0") + def setFitIntercept(value:
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53722714 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -0,0 +1,547 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.regression + +import breeze.stats.distributions.{Gaussian => GD} + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.PredictorParams +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.optim._ +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.linalg.{BLAS, Vector} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ + +/** + * Params for Generalized Linear Regression. 
+ */ +private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams + with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol + with HasSolver with Logging { + + /** + * Param for the name of family which is a description of the error distribution + * to be used in the model. + * Supported options: "gaussian", "binomial", "poisson" and "gamma". + * @group param + */ + @Since("2.0.0") + final val family: Param[String] = new Param(this, "family", +"the name of family which is a description of the error distribution to be used in the model", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getFamily: String = $(family) + + /** + * Param for the name of the model link function. + * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt". + * @group param + */ + @Since("2.0.0") + final val link: Param[String] = new Param(this, "link", "the name of the model link function", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getLink: String = $(link) + + @Since("2.0.0") + override def validateParams(): Unit = { +if (isDefined(link)) { + require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) -> $(link)), +s"Generalized Linear Regression with ${$(family)} family does not support ${$(link)} " + + s"link function.") +} + } +} + +/** + * :: Experimental :: + * + * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]]) + * specified by giving a symbolic description of the linear predictor and + * a description of the error distribution. 
+ */ +@Experimental +@Since("2.0.0") +class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val uid: String) + extends Regressor[Vector, GeneralizedLinearRegression, GeneralizedLinearRegressionModel] + with GeneralizedLinearRegressionParams with Logging { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("genLinReg")) + + /** + * Set the name of family which is a description of the error distribution + * to be used in the model. + * @group setParam + */ + @Since("2.0.0") + def setFamily(value: String): this.type = set(family, value) + + /** + * Set the name of the model link function. + * @group setParam + */ + @Since("2.0.0") + def setLink(value: String): this.type = set(link, value) + + /** + * Set if we should fit the intercept. + * Default is true. + * @group setParam + */ + @Since("2.0.0") + def setFitIntercept(value:
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/11136#discussion_r53722703
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -0,0 +1,547 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+    "the name of family which is a description of the error distribution to be used in the model",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of the model link function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "the name of the model link function",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+    if (isDefined(link)) {
+      require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) -> $(link)),
+        s"Generalized Linear Regression with ${$(family)} family does not support ${$(link)} " +
+          s"link function.")
+    }
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]])
+ * specified by giving a symbolic description of the linear predictor and
+ * a description of the error distribution.
+ */
+@Experimental
+@Since("2.0.0")
+class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val uid: String)
+  extends Regressor[Vector, GeneralizedLinearRegression, GeneralizedLinearRegressionModel]
+  with GeneralizedLinearRegressionParams with Logging {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("genLinReg"))
+
+  /**
+   * Set the name of family which is a description of the error distribution
+   * to be used in the model.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setFamily(value: String): this.type = set(family, value)
+
+  /**
+   * Set the name of the model link function.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setLink(value: String): this.type = set(link, value)
+
+  /**
+   * Set if we should fit the intercept.
+   * Default is true.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setFitIntercept(value:
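
For readers following the review, the family/link check in the quoted validateParams can be sketched outside Spark roughly as below. This is only a sketch under assumptions: the object name FamilyLinkCheck and the contents of the pair set are hypothetical (the set lists just each family's canonical link; the real supportedFamilyLinkPairs lives in the companion object, which this quoted diff does not show), and an Option stands in for the isDefined(link) test.

```scala
// Standalone sketch of the family/link compatibility check; not the Spark code.
object FamilyLinkCheck {

  // Hypothetical, illustrative subset: each family paired with its canonical link only.
  // The actual set is defined in the GeneralizedLinearRegression companion object.
  val supportedFamilyLinkPairs: Set[(String, String)] = Set(
    "gaussian" -> "identity",
    "binomial" -> "logit",
    "poisson" -> "log",
    "gamma" -> "inverse"
  )

  // Same shape as validateParams: an unset link (None) is accepted as-is,
  // a set link must form a supported pair with the family.
  def validate(family: String, link: Option[String]): Unit = {
    link.foreach { l =>
      require(supportedFamilyLinkPairs.contains(family -> l),
        s"Generalized Linear Regression with $family family does not support $l link function.")
    }
  }
}
```

With this sketch, FamilyLinkCheck.validate("gaussian", Some("identity")) and FamilyLinkCheck.validate("poisson", None) return normally, while FamilyLinkCheck.validate("gaussian", Some("logit")) throws an IllegalArgumentException, because require fails on a pair that is not in the set.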
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53722678 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53722697 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53722686 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53722637 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53722623 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53722641 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11136#discussion_r53722631

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -0,0 +1,547 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.regression
    +
    +import breeze.stats.distributions.{Gaussian => GD}
    +
    +import org.apache.spark.{Logging, SparkException}
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.ml.PredictorParams
    +import org.apache.spark.ml.feature.Instance
    +import org.apache.spark.ml.optim._
    +import org.apache.spark.ml.param._
    +import org.apache.spark.ml.param.shared._
    +import org.apache.spark.ml.util.Identifiable
    +import org.apache.spark.mllib.linalg.{BLAS, Vector}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.{DataFrame, Row}
    +import org.apache.spark.sql.functions._
    +
    +/**
    + * Params for Generalized Linear Regression.
    + */
    +private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams
    +  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol
    +  with HasSolver with Logging {
    +
    +  /**
    +   * Param for the name of family which is a description of the error distribution
    +   * to be used in the model.
    +   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
    +   * @group param
    +   */
    +  @Since("2.0.0")
    +  final val family: Param[String] = new Param(this, "family",
    +    "the name of family which is a description of the error distribution to be used in the model",
    +    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray))
    +
    +  /** @group getParam */
    +  @Since("2.0.0")
    +  def getFamily: String = $(family)
    +
    +  /**
    +   * Param for the name of the model link function.
    +   * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt".
    +   * @group param
    +   */
    +  @Since("2.0.0")
    +  final val link: Param[String] = new Param(this, "link", "the name of the model link function",
    +    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray))
    +
    +  /** @group getParam */
    +  @Since("2.0.0")
    +  def getLink: String = $(link)
    +
    +  @Since("2.0.0")
    +  override def validateParams(): Unit = {
    +    if (isDefined(link)) {
    +      require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) -> $(link)),
    +        s"Generalized Linear Regression with ${$(family)} family does not support ${$(link)} " +
    +          s"link function.")
    +    }
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + *
    + * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]])
    + * specified by giving a symbolic description of the linear predictor and
    + * a description of the error distribution.
    + */
    +@Experimental
    +@Since("2.0.0")
    +class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val uid: String)
    +  extends Regressor[Vector, GeneralizedLinearRegression, GeneralizedLinearRegressionModel]
    +  with GeneralizedLinearRegressionParams with Logging {
    +
    +  @Since("2.0.0")
    +  def this() = this(Identifiable.randomUID("genLinReg"))
    +
    +  /**
    +   * Set the name of family which is a description of the error distribution
    +   * to be used in the model.
    +   * @group setParam
    +   */
    +  @Since("2.0.0")
    +  def setFamily(value: String): this.type = set(family, value)
    +
    +  /**
    +   * Set the name of the model link function.
    +   * @group setParam
    +   */
    +  @Since("2.0.0")
    +  def setLink(value: String): this.type = set(link, value)
    +
    +  /**
    +   * Set if we should fit the intercept.
    +   * Default is true.
    +   * @group setParam
    +   */
    +  @Since("2.0.0")
    +  def setFitIntercept(value:
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53722634 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53722629 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53722508 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53722512 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53722503 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53722484 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53722437 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53722423 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -0,0 +1,547 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.regression + +import breeze.stats.distributions.{Gaussian => GD} + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.PredictorParams +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.optim._ +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.linalg.{BLAS, Vector} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ + +/** + * Params for Generalized Linear Regression. 
+ */ +private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams + with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol + with HasSolver with Logging { + + /** + * Param for the name of family which is a description of the error distribution + * to be used in the model. + * Supported options: "gaussian", "binomial", "poisson" and "gamma". + * @group param + */ + @Since("2.0.0") + final val family: Param[String] = new Param(this, "family", +"the name of family which is a description of the error distribution to be used in the model", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getFamily: String = $(family) + + /** + * Param for the name of the model link function. + * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt". + * @group param + */ + @Since("2.0.0") + final val link: Param[String] = new Param(this, "link", "the name of the model link function", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getLink: String = $(link) + + @Since("2.0.0") + override def validateParams(): Unit = { +if (isDefined(link)) { + require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) -> $(link)), +s"Generalized Linear Regression with ${$(family)} family does not support ${$(link)} " + + s"link function.") +} + } +} + +/** + * :: Experimental :: + * + * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]]) + * specified by giving a symbolic description of the linear predictor and + * a description of the error distribution. 
+ */ +@Experimental +@Since("2.0.0") +class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val uid: String) + extends Regressor[Vector, GeneralizedLinearRegression, GeneralizedLinearRegressionModel] + with GeneralizedLinearRegressionParams with Logging { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("genLinReg")) + + /** + * Set the name of family which is a description of the error distribution + * to be used in the model. + * @group setParam + */ + @Since("2.0.0") + def setFamily(value: String): this.type = set(family, value) --- End diff -- Set the default to "gaussian"?
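For illustration only, here is a minimal plain-Scala sketch of what defaulting the family to "gaussian" could look like. It deliberately does not use Spark's Param API; `FamilyParamSketch`, `defaultFamily` and `resolveFamily` are hypothetical names, and the set of families is copied from the ScalaDoc quoted in the diff above.

```scala
// Hypothetical, Spark-free sketch: fall back to "gaussian" when no family is set.
object FamilyParamSketch {
  // Families listed in the ScalaDoc quoted in the diff above.
  val supportedFamilies: Set[String] = Set("gaussian", "binomial", "poisson", "gamma")

  // ASSUMPTION: the default proposed in the review comment, not a confirmed value.
  val defaultFamily: String = "gaussian"

  // Returns the user-supplied family if present, otherwise the default;
  // rejects anything outside the supported set.
  def resolveFamily(userValue: Option[String]): String = {
    val family = userValue.getOrElse(defaultFamily)
    require(supportedFamilies.contains(family), s"Unsupported family: $family")
    family
  }
}
```

In the estimator itself, the equivalent would presumably be a `setDefault(family -> "gaussian")` call in the class body, so that `$(family)` resolves even when `setFamily` was never called.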
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53722443 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -0,0 +1,547 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.regression + +import breeze.stats.distributions.{Gaussian => GD} + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.PredictorParams +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.optim._ +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.linalg.{BLAS, Vector} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ + +/** + * Params for Generalized Linear Regression. 
+ */ +private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams + with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol + with HasSolver with Logging { + + /** + * Param for the name of family which is a description of the error distribution + * to be used in the model. + * Supported options: "gaussian", "binomial", "poisson" and "gamma". + * @group param + */ + @Since("2.0.0") + final val family: Param[String] = new Param(this, "family", +"the name of family which is a description of the error distribution to be used in the model", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getFamily: String = $(family) + + /** + * Param for the name of the model link function. + * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt". + * @group param + */ + @Since("2.0.0") + final val link: Param[String] = new Param(this, "link", "the name of the model link function", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getLink: String = $(link) + + @Since("2.0.0") + override def validateParams(): Unit = { +if (isDefined(link)) { + require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) -> $(link)), +s"Generalized Linear Regression with ${$(family)} family does not support ${$(link)} " + + s"link function.") +} + } +} + +/** + * :: Experimental :: + * + * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]]) + * specified by giving a symbolic description of the linear predictor and + * a description of the error distribution. 
+ */ +@Experimental +@Since("2.0.0") +class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val uid: String) + extends Regressor[Vector, GeneralizedLinearRegression, GeneralizedLinearRegressionModel] + with GeneralizedLinearRegressionParams with Logging { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("genLinReg")) + + /** + * Set the name of family which is a description of the error distribution + * to be used in the model. + * @group setParam + */ + @Since("2.0.0") + def setFamily(value: String): this.type = set(family, value) + + /** + * Set the name of the model link function. + * @group setParam + */ + @Since("2.0.0") + def setLink(value: String): this.type = set(link, value) + + /** + * Set if we should fit the intercept. + * Default is true. + * @group setParam + */ + @Since("2.0.0") + def setFitIntercept(value:
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53722430 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -0,0 +1,547 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.regression + +import breeze.stats.distributions.{Gaussian => GD} + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.PredictorParams +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.optim._ +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.linalg.{BLAS, Vector} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ + +/** + * Params for Generalized Linear Regression. 
+ */ +private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams + with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol + with HasSolver with Logging { + + /** + * Param for the name of family which is a description of the error distribution + * to be used in the model. + * Supported options: "gaussian", "binomial", "poisson" and "gamma". + * @group param + */ + @Since("2.0.0") + final val family: Param[String] = new Param(this, "family", +"the name of family which is a description of the error distribution to be used in the model", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getFamily: String = $(family) + + /** + * Param for the name of the model link function. + * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt". + * @group param + */ + @Since("2.0.0") + final val link: Param[String] = new Param(this, "link", "the name of the model link function", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getLink: String = $(link) + + @Since("2.0.0") + override def validateParams(): Unit = { +if (isDefined(link)) { + require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) -> $(link)), +s"Generalized Linear Regression with ${$(family)} family does not support ${$(link)} " + + s"link function.") +} + } +} + +/** + * :: Experimental :: + * + * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]]) + * specified by giving a symbolic description of the linear predictor and + * a description of the error distribution. 
+ */ +@Experimental +@Since("2.0.0") +class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val uid: String) + extends Regressor[Vector, GeneralizedLinearRegression, GeneralizedLinearRegressionModel] + with GeneralizedLinearRegressionParams with Logging { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("genLinReg")) + + /** + * Set the name of family which is a description of the error distribution + * to be used in the model. + * @group setParam + */ + @Since("2.0.0") + def setFamily(value: String): this.type = set(family, value) + + /** + * Set the name of the model link function. + * @group setParam + */ + @Since("2.0.0") + def setLink(value: String): this.type = set(link, value) + + /** + * Set if we should fit the intercept. + * Default is true. + * @group setParam + */ + @Since("2.0.0") + def setFitIntercept(value:
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53722396 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -0,0 +1,547 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.regression + +import breeze.stats.distributions.{Gaussian => GD} + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.PredictorParams +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.optim._ +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.linalg.{BLAS, Vector} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ + +/** + * Params for Generalized Linear Regression. 
+ */ +private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams + with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol + with HasSolver with Logging { + + /** + * Param for the name of family which is a description of the error distribution + * to be used in the model. + * Supported options: "gaussian", "binomial", "poisson" and "gamma". + * @group param + */ + @Since("2.0.0") + final val family: Param[String] = new Param(this, "family", +"the name of family which is a description of the error distribution to be used in the model", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getFamily: String = $(family) + + /** + * Param for the name of the model link function. + * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt". + * @group param + */ + @Since("2.0.0") + final val link: Param[String] = new Param(this, "link", "the name of the model link function", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getLink: String = $(link) + + @Since("2.0.0") + override def validateParams(): Unit = { +if (isDefined(link)) { + require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) -> $(link)), +s"Generalized Linear Regression with ${$(family)} family does not support ${$(link)} " + + s"link function.") +} + } +} + +/** + * :: Experimental :: + * + * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]]) + * specified by giving a symbolic description of the linear predictor and + * a description of the error distribution. 
+ */ +@Experimental +@Since("2.0.0") +class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val uid: String) + extends Regressor[Vector, GeneralizedLinearRegression, GeneralizedLinearRegressionModel] + with GeneralizedLinearRegressionParams with Logging { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("genLinReg")) --- End diff -- "glm", which is quite common
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53722393 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -0,0 +1,547 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.regression + +import breeze.stats.distributions.{Gaussian => GD} + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.PredictorParams +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.optim._ +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.linalg.{BLAS, Vector} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ + +/** + * Params for Generalized Linear Regression. 
+ */ +private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams + with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol + with HasSolver with Logging { + + /** + * Param for the name of family which is a description of the error distribution + * to be used in the model. + * Supported options: "gaussian", "binomial", "poisson" and "gamma". + * @group param + */ + @Since("2.0.0") + final val family: Param[String] = new Param(this, "family", +"the name of family which is a description of the error distribution to be used in the model", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getFamily: String = $(family) + + /** + * Param for the name of the model link function. + * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt". + * @group param + */ + @Since("2.0.0") + final val link: Param[String] = new Param(this, "link", "the name of the model link function", --- End diff -- * Include supported options and the default behavior in the param doc and the ScalaDoc. * Mention the list of valid (family, link) combinations somewhere in the public doc.
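As a rough sketch of the (family, link) check discussed here, the following Spark-free Scala shows one way to keep the valid combinations in a single set that both the validation and the public documentation could draw on. The object and method names are hypothetical, and the concrete pairs are an assumption based on conventional GLM pairings; the contents of the PR's `supportedFamilyLinkPairs` are not shown in this thread.

```scala
// Hypothetical, Spark-free sketch of validating (family, link) combinations.
object FamilyLinkSketch {
  // ASSUMPTION: conventional GLM pairings, not necessarily the PR's exact set.
  val supportedFamilyLinkPairs: Set[(String, String)] = Set(
    "gaussian" -> "identity", "gaussian" -> "log", "gaussian" -> "inverse",
    "binomial" -> "logit", "binomial" -> "probit", "binomial" -> "cloglog",
    "poisson" -> "log", "poisson" -> "identity", "poisson" -> "sqrt",
    "gamma" -> "inverse", "gamma" -> "identity", "gamma" -> "log"
  )

  // Canonical link of each family; ASSUMED here to be the default when no link is set.
  val canonicalLink: Map[String, String] = Map(
    "gaussian" -> "identity", "binomial" -> "logit",
    "poisson" -> "log", "gamma" -> "inverse")

  // True if the combination is in the supported set.
  def isSupported(family: String, link: String): Boolean =
    supportedFamilyLinkPairs.contains(family -> link)

  // Uses the given link if any, otherwise the family's canonical link,
  // and fails with a descriptive message on an unsupported combination.
  def resolveLink(family: String, link: Option[String]): String = {
    val resolved = link.getOrElse(canonicalLink(family))
    require(isSupported(family, resolved),
      s"Generalized Linear Regression with $family family does not support $resolved link function.")
    resolved
  }

  // Renders the pairs grouped by family, e.g. for pasting into a doc comment.
  def describe: String =
    supportedFamilyLinkPairs.groupBy(_._1).toSeq.sortBy(_._1)
      .map { case (f, pairs) => s"$f: ${pairs.map(_._2).toSeq.sorted.mkString(", ")}" }
      .mkString("; ")
}
```

Generating the documentation string from the same set that drives validation is one way to keep the two from drifting apart; whether the PR does this is not stated in the thread.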
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53722399 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -0,0 +1,547 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.regression + +import breeze.stats.distributions.{Gaussian => GD} + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.PredictorParams +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.optim._ +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.linalg.{BLAS, Vector} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ + +/** + * Params for Generalized Linear Regression. 
+ */ +private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams + with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol + with HasSolver with Logging { + + /** + * Param for the name of family which is a description of the error distribution + * to be used in the model. + * Supported options: "gaussian", "binomial", "poisson" and "gamma". + * @group param + */ + @Since("2.0.0") + final val family: Param[String] = new Param(this, "family", +"the name of family which is a description of the error distribution to be used in the model", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getFamily: String = $(family) + + /** + * Param for the name of the model link function. + * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt". + * @group param + */ + @Since("2.0.0") + final val link: Param[String] = new Param(this, "link", "the name of the model link function", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getLink: String = $(link) + + @Since("2.0.0") + override def validateParams(): Unit = { +if (isDefined(link)) { + require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) -> $(link)), +s"Generalized Linear Regression with ${$(family)} family does not support ${$(link)} " + + s"link function.") +} + } +} + +/** + * :: Experimental :: + * + * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]]) + * specified by giving a symbolic description of the linear predictor and + * a description of the error distribution. 
+ */ +@Experimental +@Since("2.0.0") +class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val uid: String) + extends Regressor[Vector, GeneralizedLinearRegression, GeneralizedLinearRegressionModel] + with GeneralizedLinearRegressionParams with Logging { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("genLinReg")) + + /** + * Set the name of family which is a description of the error distribution --- End diff -- * `Set` -> `Sets` * I don't think it is necessary to repeat the param doc. Maybe we can simply say `Sets the value of param [[family]].`
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53722391 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -0,0 +1,547 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.regression + +import breeze.stats.distributions.{Gaussian => GD} + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.PredictorParams +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.optim._ +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.linalg.{BLAS, Vector} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ + +/** + * Params for Generalized Linear Regression. 
+ */ +private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams + with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol + with HasSolver with Logging { + + /** + * Param for the name of family which is a description of the error distribution + * to be used in the model. + * Supported options: "gaussian", "binomial", "poisson" and "gamma". + * @group param + */ + @Since("2.0.0") + final val family: Param[String] = new Param(this, "family", +"the name of family which is a description of the error distribution to be used in the model", --- End diff -- * Include supported options and the default value in the param doc (and the ScalaDoc). * Shall we make "gaussian" the default?
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-187443051 I'm making a pass.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53592447

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -0,0 +1,547 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+    "the name of family which is a description of the error distribution to be used in the model",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of the model link function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "the name of the model link function",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+    if (isDefined(link)) {
+      require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) -> $(link)),
+        s"Generalized Linear Regression with ${$(family)} family does not support ${$(link)} " +
+          s"link function.")
+    }
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]])
+ * specified by giving a symbolic description of the linear predictor and
+ * a description of the error distribution.
+ */
+@Experimental
+@Since("2.0.0")
+class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val uid: String)
+  extends Regressor[Vector, GeneralizedLinearRegression, GeneralizedLinearRegressionModel]
+  with GeneralizedLinearRegressionParams with Logging {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("genLinReg"))
+
+  /**
+   * Set the name of family which is a description of the error distribution
+   * to be used in the model.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setFamily(value: String): this.type = set(family, value)
+
+  /**
+   * Set the name of the model link function.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setLink(value: String): this.type = set(link, value)
+
+  /**
+   * Set if we should fit the intercept.
+   * Default is true.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setFitIntercept(value:
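
For readers following the quoted diff, the family/link compatibility check performed in `validateParams` can be sketched on its own, outside Spark. This is a minimal standalone sketch, not the patch's code: the object name `FamilyLinkCheck`, the `validate` helper, and in particular the contents of the pair table are assumptions made for illustration, since the excerpt does not show how `supportedFamilyLinkPairs` is defined.

```scala
// Illustrative sketch only: models the family/link check without any Spark dependency.
object FamilyLinkCheck {

  // ASSUMPTION: this table is a plausible set of (family, link) pairs built from the
  // option names listed in the diff's doc comments; it is not copied from the patch.
  val supportedFamilyLinkPairs: Set[(String, String)] = Set(
    "gaussian" -> "identity", "gaussian" -> "log", "gaussian" -> "inverse",
    "binomial" -> "logit", "binomial" -> "probit", "binomial" -> "cloglog",
    "poisson" -> "log", "poisson" -> "identity", "poisson" -> "sqrt",
    "gamma" -> "inverse", "gamma" -> "identity", "gamma" -> "log")

  // Mirrors the shape of validateParams: the check only runs when a link is set.
  // `link` is an Option here to stand in for Spark's isDefined(link).
  def validate(family: String, link: Option[String]): Unit = {
    link.foreach { l =>
      require(supportedFamilyLinkPairs.contains(family -> l),
        s"Generalized Linear Regression with $family family does not support $l link function.")
    }
  }
}
```

Calling `FamilyLinkCheck.validate("gaussian", Some("logit"))` fails with an `IllegalArgumentException` under the assumed table, while `FamilyLinkCheck.validate("poisson", None)` passes because no link was set.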
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-184175086 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-184175089 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51306/ Test PASSed.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-184174776 **[Test build #51306 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51306/consoleFull)** for PR 11136 at commit [`4a27970`](https://github.com/apache/spark/commit/4a27970486091b6359b9d73ba477044d04c20875). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.