[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-03-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/11136


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-03-01 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-190806584
  
LGTM. Merged into master. Thanks!





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-03-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-190607485
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52227/
Test PASSed.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-03-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-190607484
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-03-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-190607102
  
**[Test build #52227 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52227/consoleFull)** for PR 11136 at commit [`007a4ec`](https://github.com/apache/spark/commit/007a4ec324db273c048ed65fe8942daba0c9d844).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-190591444
  
**[Test build #52227 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52227/consoleFull)** for PR 11136 at commit [`007a4ec`](https://github.com/apache/spark/commit/007a4ec324db273c048ed65fe8942daba0c9d844).





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54524989
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,577 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionBase extends PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * Default is "gaussian".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+    "The name of family which is a description of the error distribution to be used in the " +
+      "model. Supported options: gaussian(default), binomial, poisson and gamma.",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilyNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of link function which provides the relationship
+   * between the linear predictor and the mean of the distribution function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "The name of link function " +
+    "which provides the relationship between the linear predictor and the mean of the " +
+    "distribution function. Supported options: identity, log, inverse, logit, probit, " +
+    "cloglog and sqrt.",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinkNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  import GeneralizedLinearRegression._
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+    if ($(solver) == "irls") {
+      setDefault(maxIter -> 25)
+    }
+    if (isDefined(link)) {
+      require(supportedFamilyAndLinkPairs.contains(
+        Family.fromName($(family)) -> Link.fromName($(link))), "Generalized Linear Regression " +
+        s"with ${$(family)} family does not support ${$(link)} link function.")
+    }
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]])
+ * specified by giving a symbolic description of the linear predictor (link function) and
+ * a description of the error distribution (family).
+ * It supports "gaussian", "binomial", "poisson" and "gamma" as family.
+ * Valid link functions for each family are listed below. The first link function of each family
+ * is the default one.
+ *  - "gaussian" -> "identity", "log", "inverse"
+ *  - "binomial" -> "logit", "probit", "cloglog"
+ *  - "poisson"  -> "log", "identity", "sqrt"
+ *  - "gamma"    -> "inverse", "identity", "log"
+ */
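
For readers following the family/link rules in the quoted Scaladoc, here is a minimal, Spark-free sketch of how the "first link is the default" convention and the pair validation could be expressed. The object and method names (`FamilyLinkSketch`, `defaultLink`, `isSupported`, `resolveLink`) are hypothetical and do not appear in the patch; only the family/link table and the wording of the error message are taken from the diff above.

```scala
// Illustrative sketch only: these names are assumptions, not the PR's API.
object FamilyLinkSketch {
  // Valid links per family, in the order given in the Scaladoc.
  // By the documented convention, the first entry is the default link.
  val supportedLinks: Map[String, Seq[String]] = Map(
    "gaussian" -> Seq("identity", "log", "inverse"),
    "binomial" -> Seq("logit", "probit", "cloglog"),
    "poisson"  -> Seq("log", "identity", "sqrt"),
    "gamma"    -> Seq("inverse", "identity", "log")
  )

  // Default link for a family, or None if the family is unknown.
  def defaultLink(family: String): Option[String] =
    supportedLinks.get(family).flatMap(_.headOption)

  // True when the (family, link) pair appears in the table above.
  def isSupported(family: String, link: String): Boolean =
    supportedLinks.get(family).exists(_.contains(link))

  // Returns the explicit link if it is valid for the family,
  // otherwise falls back to the family's default link.
  def resolveLink(family: String, link: Option[String]): String = {
    val links = supportedLinks.getOrElse(family,
      throw new IllegalArgumentException(s"Unsupported family: $family"))
    link match {
      case Some(l) =>
        require(links.contains(l),
          s"Generalized Linear Regression with $family family does not support $l link function.")
        l
      case None => links.head
    }
  }
}
```

Under these assumptions, `FamilyLinkSketch.resolveLink("poisson", None)` falls back to the first listed link for the poisson family, while `FamilyLinkSketch.resolveLink("binomial", Some("sqrt"))` fails the `require` check, mirroring the check in `validateParams`.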

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-190538874
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52216/
Test FAILed.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-190538867
  
**[Test build #52216 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52216/consoleFull)** for PR 11136 at commit [`31a912c`](https://github.com/apache/spark/commit/31a912cd74cf3dffbf8cc0af8c57b777d49579eb).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-190538871
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-190537367
  
**[Test build #52216 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52216/consoleFull)** for PR 11136 at commit [`31a912c`](https://github.com/apache/spark/commit/31a912cd74cf3dffbf8cc0af8c57b777d49579eb).





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54521794
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -0,0 +1,499 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import scala.util.Random
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.param.ParamsSuite
+import org.apache.spark.ml.util.MLTestingUtils
+import org.apache.spark.mllib.classification.LogisticRegressionSuite._
+import org.apache.spark.mllib.linalg.{BLAS, DenseVector, Vectors}
+import org.apache.spark.mllib.random._
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.sql.{DataFrame, Row}
+
+class GeneralizedLinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
+
+  private val seed: Int = 42
+  @transient var datasetGaussianIdentity: DataFrame = _
+  @transient var datasetGaussianLog: DataFrame = _
+  @transient var datasetGaussianInverse: DataFrame = _
+  @transient var datasetBinomial: DataFrame = _
+  @transient var datasetPoissonLog: DataFrame = _
+  @transient var datasetPoissonIdentity: DataFrame = _
+  @transient var datasetPoissonSqrt: DataFrame = _
+  @transient var datasetGammaInverse: DataFrame = _
+  @transient var datasetGammaIdentity: DataFrame = _
+  @transient var datasetGammaLog: DataFrame = _
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+
+    import GeneralizedLinearRegressionSuite._
+
+    datasetGaussianIdentity = sqlContext.createDataFrame(
+      sc.parallelize(generateGeneralizedLinearRegressionInput(
+        intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5),
+        xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+        family = "gaussian", link = "identity"), 2))
+
+    datasetGaussianLog = sqlContext.createDataFrame(
+      sc.parallelize(generateGeneralizedLinearRegressionInput(
+        intercept = 0.25, coefficients = Array(0.22, 0.06), xMean = Array(2.9, 10.5),
+        xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+        family = "gaussian", link = "log"), 2))
+
+    datasetGaussianInverse = sqlContext.createDataFrame(
+      sc.parallelize(generateGeneralizedLinearRegressionInput(
+        intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5),
+        xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+        family = "gaussian", link = "inverse"), 2))
+
+    datasetBinomial = {
+      val nPoints = 1
+      val coefficients = Array(-0.57997, 0.912083, -0.371077, -0.819866, 2.688191)
+      val xMean = Array(5.843, 3.057, 3.758, 1.199)
+      val xVariance = Array(0.6856, 0.1899, 3.116, 0.581)
+
+      val testData =
+        generateMultinomialLogisticInput(coefficients, xMean, xVariance, true, nPoints, seed)
+
+      sqlContext.createDataFrame(sc.parallelize(testData, 4))
+    }
+
+    datasetPoissonLog = sqlContext.createDataFrame(
+      sc.parallelize(generateGeneralizedLinearRegressionInput(
+        intercept = 0.25, coefficients = Array(0.22, 0.06), xMean = Array(2.9, 10.5),
+        xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+        family = "poisson", link = "log"), 2))
+
+    datasetPoissonIdentity = sqlContext.createDataFrame(
+      sc.parallelize(generateGeneralizedLinearRegressionInput(
+        intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5),
+        xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+        family = "poisson", link = "identity"), 2))
+
+    datasetPoissonSqrt = sqlContext.createDataFrame(
+  

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-190530641
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-190530637
  
**[Test build #52215 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52215/consoleFull)** for PR 11136 at commit [`314b562`](https://github.com/apache/spark/commit/314b562f315723a7117851289c8f5b6e1b16a6ac).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-190530642
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52215/
Test FAILed.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-190529580
  
**[Test build #52215 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52215/consoleFull)** for PR 11136 at commit [`314b562`](https://github.com/apache/spark/commit/314b562f315723a7117851289c8f5b6e1b16a6ac).





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-190528993
  
Jenkins, test this please.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-190527136
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52214/
Test FAILed.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-190527127
  
**[Test build #52214 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52214/consoleFull)** for PR 11136 at commit [`314b562`](https://github.com/apache/spark/commit/314b562f315723a7117851289c8f5b6e1b16a6ac).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-190527134
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54519291
  

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-190526388
  
**[Test build #52214 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52214/consoleFull)** for PR 11136 at commit [`314b562`](https://github.com/apache/spark/commit/314b562f315723a7117851289c8f5b6e1b16a6ac).





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-190526311
  
Jenkins, test this please.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54519070
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -0,0 +1,499 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import scala.util.Random
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.param.ParamsSuite
+import org.apache.spark.ml.util.MLTestingUtils
+import org.apache.spark.mllib.classification.LogisticRegressionSuite._
+import org.apache.spark.mllib.linalg.{BLAS, DenseVector, Vectors}
+import org.apache.spark.mllib.random._
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.sql.{DataFrame, Row}
+
+class GeneralizedLinearRegressionSuite extends SparkFunSuite with 
MLlibTestSparkContext {
+
+  private val seed: Int = 42
+  @transient var datasetGaussianIdentity: DataFrame = _
+  @transient var datasetGaussianLog: DataFrame = _
+  @transient var datasetGaussianInverse: DataFrame = _
+  @transient var datasetBinomial: DataFrame = _
+  @transient var datasetPoissonLog: DataFrame = _
+  @transient var datasetPoissonIdentity: DataFrame = _
+  @transient var datasetPoissonSqrt: DataFrame = _
+  @transient var datasetGammaInverse: DataFrame = _
+  @transient var datasetGammaIdentity: DataFrame = _
+  @transient var datasetGammaLog: DataFrame = _
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+
+import GeneralizedLinearRegressionSuite._
+
+datasetGaussianIdentity = sqlContext.createDataFrame(
+  sc.parallelize(generateGeneralizedLinearRegressionInput(
+intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = 
Array(2.9, 10.5),
+xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+family = "gaussian", link = "identity"), 2))
+
+datasetGaussianLog = sqlContext.createDataFrame(
+  sc.parallelize(generateGeneralizedLinearRegressionInput(
+intercept = 0.25, coefficients = Array(0.22, 0.06), xMean = 
Array(2.9, 10.5),
+xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+family = "gaussian", link = "log"), 2))
+
+datasetGaussianInverse = sqlContext.createDataFrame(
+  sc.parallelize(generateGeneralizedLinearRegressionInput(
+intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = 
Array(2.9, 10.5),
+xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+family = "gaussian", link = "inverse"), 2))
+
+datasetBinomial = {
+  val nPoints = 1
+  val coefficients = Array(-0.57997, 0.912083, -0.371077, -0.819866, 
2.688191)
+  val xMean = Array(5.843, 3.057, 3.758, 1.199)
+  val xVariance = Array(0.6856, 0.1899, 3.116, 0.581)
+
+  val testData =
+generateMultinomialLogisticInput(coefficients, xMean, xVariance, 
true, nPoints, seed)
+
+  sqlContext.createDataFrame(sc.parallelize(testData, 4))
+}
+
+datasetPoissonLog = sqlContext.createDataFrame(
+  sc.parallelize(generateGeneralizedLinearRegressionInput(
+intercept = 0.25, coefficients = Array(0.22, 0.06), xMean = 
Array(2.9, 10.5),
+xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+family = "poisson", link = "log"), 2))
+
+datasetPoissonIdentity = sqlContext.createDataFrame(
+  sc.parallelize(generateGeneralizedLinearRegressionInput(
+intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = 
Array(2.9, 10.5),
+xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+family = "poisson", link = "identity"), 2))
+
+datasetPoissonSqrt = sqlContext.createDataFrame(
+  

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-190522461
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-190522465
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52211/





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-190522453
  
**[Test build #52211 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52211/consoleFull)** for PR 11136 at commit [`314b562`](https://github.com/apache/spark/commit/314b562f315723a7117851289c8f5b6e1b16a6ac).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-190520555
  
**[Test build #52211 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52211/consoleFull)** for PR 11136 at commit [`314b562`](https://github.com/apache/spark/commit/314b562f315723a7117851289c8f5b6e1b16a6ac).





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54518547
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
@@ -0,0 +1,499 @@

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-190477077
  
I made one pass over the tests and have only some minor comments.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54508437
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
@@ -0,0 +1,499 @@

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54508443
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
@@ -0,0 +1,499 @@

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54508387
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
@@ -0,0 +1,499 @@

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54508365
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
@@ -0,0 +1,499 @@
+    datasetBinomial = {
+      val nPoints = 1
+      val coefficients = Array(-0.57997, 0.912083, -0.371077, -0.819866, 2.688191)
+      val xMean = Array(5.843, 3.057, 3.758, 1.199)
+      val xVariance = Array(0.6856, 0.1899, 3.116, 0.581)
+
+      val testData =
+        generateMultinomialLogisticInput(coefficients, xMean, xVariance, true, nPoints, seed)
--- End diff --

It would be good to say `addIntercept = true` instead of just `true`.
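
For illustration only, here is a minimal sketch of why the named form reads better. `generateInput` is a hypothetical stand-in written for this example, not the real `generateMultinomialLogisticInput`; its parameter names, return type, the intercept handling, and the `nPoints = 100` value are all assumptions.

```scala
// Hypothetical stand-in for a test-data generator. The name, signature,
// and return value are assumptions made for this sketch only.
object NamedArgumentSketch {
  def generateInput(
      coefficients: Array[Double],
      xMean: Array[Double],
      xVariance: Array[Double],
      addIntercept: Boolean,
      nPoints: Int,
      seed: Int): String = {
    // Assumption: with an intercept, the last coefficient is the intercept term.
    val numFeatures = if (addIntercept) coefficients.length - 1 else coefficients.length
    require(xMean.length == numFeatures && xVariance.length == numFeatures,
      "xMean and xVariance must have one entry per feature")
    s"features=$numFeatures, addIntercept=$addIntercept, nPoints=$nPoints, seed=$seed"
  }

  val coefficients = Array(-0.57997, 0.912083, -0.371077, -0.819866, 2.688191)
  val xMean = Array(5.843, 3.057, 3.758, 1.199)
  val xVariance = Array(0.6856, 0.1899, 3.116, 0.581)

  // Positional: the reader has to look up what the bare `true` means.
  val positional = generateInput(coefficients, xMean, xVariance, true, 100, 42)

  // Named: the call site documents itself; the behaviour is identical.
  val named = generateInput(coefficients, xMean, xVariance,
    addIntercept = true, nPoints = 100, seed = 42)
}
```

Both calls return the same value; only the readability of the call site changes.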





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54508381
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
@@ -0,0 +1,499 @@

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54508392
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -0,0 +1,499 @@

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54508370
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -0,0 +1,499 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import scala.util.Random
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.param.ParamsSuite
+import org.apache.spark.ml.util.MLTestingUtils
+import org.apache.spark.mllib.classification.LogisticRegressionSuite._
+import org.apache.spark.mllib.linalg.{BLAS, DenseVector, Vectors}
+import org.apache.spark.mllib.random._
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.sql.{DataFrame, Row}
+
+class GeneralizedLinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
+
+  private val seed: Int = 42
+  @transient var datasetGaussianIdentity: DataFrame = _
+  @transient var datasetGaussianLog: DataFrame = _
+  @transient var datasetGaussianInverse: DataFrame = _
+  @transient var datasetBinomial: DataFrame = _
+  @transient var datasetPoissonLog: DataFrame = _
+  @transient var datasetPoissonIdentity: DataFrame = _
+  @transient var datasetPoissonSqrt: DataFrame = _
+  @transient var datasetGammaInverse: DataFrame = _
+  @transient var datasetGammaIdentity: DataFrame = _
+  @transient var datasetGammaLog: DataFrame = _
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+
+    import GeneralizedLinearRegressionSuite._
+
+    datasetGaussianIdentity = sqlContext.createDataFrame(
+      sc.parallelize(generateGeneralizedLinearRegressionInput(
+        intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5),
+        xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+        family = "gaussian", link = "identity"), 2))
+
+    datasetGaussianLog = sqlContext.createDataFrame(
+      sc.parallelize(generateGeneralizedLinearRegressionInput(
+        intercept = 0.25, coefficients = Array(0.22, 0.06), xMean = Array(2.9, 10.5),
+        xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+        family = "gaussian", link = "log"), 2))
+
+    datasetGaussianInverse = sqlContext.createDataFrame(
+      sc.parallelize(generateGeneralizedLinearRegressionInput(
+        intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5),
+        xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+        family = "gaussian", link = "inverse"), 2))
+
+    datasetBinomial = {
+      val nPoints = 1
+      val coefficients = Array(-0.57997, 0.912083, -0.371077, -0.819866, 2.688191)
+      val xMean = Array(5.843, 3.057, 3.758, 1.199)
+      val xVariance = Array(0.6856, 0.1899, 3.116, 0.581)
+
+      val testData =
+        generateMultinomialLogisticInput(coefficients, xMean, xVariance, true, nPoints, seed)
+
+      sqlContext.createDataFrame(sc.parallelize(testData, 4))
--- End diff --

Why use 4 partitions here instead of 2, as the other datasets do?





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-29 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54508383
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -0,0 +1,499 @@

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-189211004
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52044/





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-189211002
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-189210653
  
**[Test build #52044 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52044/consoleFull)** for PR 11136 at commit [`c05a948`](https://github.com/apache/spark/commit/c05a94899c39bfa9ede9071bd6db6a937b33cf83).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-189197945
  
**[Test build #52044 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52044/consoleFull)** for PR 11136 at commit [`c05a948`](https://github.com/apache/spark/commit/c05a94899c39bfa9ede9071bd6db6a937b33cf83).





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-26 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-189192045
  
I'm going to do another detailed pass over the code tomorrow.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-25 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54206474
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
@@ -157,6 +157,12 @@ private[ml] class WeightedLeastSquares(
 private[ml] object WeightedLeastSquares {
 
   /**
+   * In order to take the normal equation approach efficiently, [[WeightedLeastSquares]]
+   * only supports the number of features is no more than 4096.
+   */
+  val MaxNumFeatures: Int = 4096
--- End diff --

OK, I will update it to `MAX_NUM_FEATURES` after collecting more comments. Thanks!





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-25 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54205705
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
@@ -157,6 +157,12 @@ private[ml] class WeightedLeastSquares(
 private[ml] object WeightedLeastSquares {
 
   /**
+   * In order to take the normal equation approach efficiently, [[WeightedLeastSquares]]
+   * only supports the number of features is no more than 4096.
+   */
+  val MaxNumFeatures: Int = 4096
--- End diff --

This is not specified in the Spark code style guide, and the Scala style guide
recommends `MaxNumFeatures`. But I do like `MAX_NUM_FEATURES` better.
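
A minimal sketch of the two spellings being compared, in plain Scala with no Spark dependency. The object names `LimitsCamel`, `LimitsUpper` and `NamingSketch` and the helper `checkNumFeatures` are hypothetical and not part of the patch; only the two constant names and the value 4096 come from this thread.

```scala
// Hypothetical holder objects; only the constant names and the value 4096
// are taken from the review thread.
object LimitsCamel {
  // Upper camel case, the spelling the Scala style guide recommends for constants.
  val MaxNumFeatures: Int = 4096
}

object LimitsUpper {
  // Java-style upper snake case, the spelling proposed in review.
  val MAX_NUM_FEATURES: Int = 4096
}

object NamingSketch {
  // Hypothetical guard: rejects inputs wider than the normal-equation limit.
  def checkNumFeatures(numFeatures: Int): Unit =
    require(numFeatures <= LimitsUpper.MAX_NUM_FEATURES,
      s"Got $numFeatures features; at most ${LimitsUpper.MAX_NUM_FEATURES} are supported.")
}
```

Both names begin with an upper-case letter, so either one can be used as a stable identifier in a pattern match; the choice between them is purely a matter of convention.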





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-25 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54141380
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
@@ -157,6 +157,12 @@ private[ml] class WeightedLeastSquares(
 private[ml] object WeightedLeastSquares {
 
   /**
+   * In order to take the normal equation approach efficiently, [[WeightedLeastSquares]]
+   * only supports the number of features is no more than 4096.
+   */
+  val MaxNumFeatures: Int = 4096
--- End diff --

Do we have a naming convention for constants? Something like `MAX_NUM_FEATURES`?





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-188745905
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51963/





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-188745899
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-188745464
  
**[Test build #51963 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51963/consoleFull)** for PR 11136 at commit [`2ebcef7`](https://github.com/apache/spark/commit/2ebcef728315ad6b03d48c7f7e7f504e5e193748).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-188725749
  
**[Test build #51963 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51963/consoleFull)** for PR 11136 at commit [`2ebcef7`](https://github.com/apache/spark/commit/2ebcef728315ad6b03d48c7f7e7f504e5e193748).





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-25 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-188662047
  
I only have some minor comments on the implementation. I will make a pass over
the tests tomorrow. @dbtsai, it would be great if you could make a pass too.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-24 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54059562
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,565 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionBase extends PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * Default is "gaussian".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+    "The name of family which is a description of the error distribution to be used in the " +
+      "model. Supported options: gaussian(default), binomial, poisson and gamma.",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilyNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of link function which provides the relationship
+   * between the linear predictor and the mean of the distribution function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "The name of link function " +
+    "which provides the relationship between the linear predictor and the mean of the " +
+    "distribution function. Supported options: identity, log, inverse, logit, probit, " +
+    "cloglog and sqrt.",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinkNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  import GeneralizedLinearRegression._
+  protected lazy val familyObj = Family.fromName($(family))
+  protected lazy val linkObj = if (isDefined(link)) {
+    Link.fromName($(link))
+  } else {
+    familyObj.defaultLink
+  }
+  protected lazy val familyAndLink = new FamilyAndLink(familyObj, linkObj)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+    if ($(solver) == "irls") {
+      setDefault(maxIter -> 25)
+    }
+    if (isDefined(link)) {
+      require(GeneralizedLinearRegression.supportedFamilyAndLinkParis.contains(
+        familyObj -> linkObj), s"Generalized Linear Regression with ${$(family)} family " +
+        s"does not support ${$(link)} link function.")
+    }
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]])
+ * specified by giving a symbolic description of the linear predictor (link function) and
+ * a description of the error distribution (family).
+ * It supports "gaussian", "binomial", "poisson" and "gamma" as family.
+ * Valid link functions for each family is listed below. The 
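
The link resolution and validation in the quoted trait can be sketched in plain Scala without Spark. This is a simplified illustration under stated assumptions, not the patch's implementation: `LinkSketch`, `defaultLink`, `supported` and `resolveLink` are hypothetical names. The default table holds each family's canonical link from standard GLM theory, and the set of allowed pairs is an illustrative subset inferred from the dataset names in the test suite quoted in this thread; the patch's actual table is cut off in the quote and may be larger.

```scala
object LinkSketch {
  // Canonical link of each family (standard GLM theory); hypothetical table name.
  val defaultLink: Map[String, String] = Map(
    "gaussian" -> "identity",
    "binomial" -> "logit",
    "poisson" -> "log",
    "gamma" -> "inverse")

  // Illustrative subset of allowed (family, link) pairs, inferred from the
  // dataset names in the quoted test suite; an assumption, not the real table.
  val supported: Set[(String, String)] = Set(
    "gaussian" -> "identity", "gaussian" -> "log", "gaussian" -> "inverse",
    "binomial" -> "logit",
    "poisson" -> "log", "poisson" -> "identity", "poisson" -> "sqrt",
    "gamma" -> "inverse", "gamma" -> "identity", "gamma" -> "log")

  // Use the explicit link when one is given, otherwise the family default,
  // then check that the resulting pair is allowed.
  def resolveLink(family: String, link: Option[String]): String = {
    require(defaultLink.contains(family), s"Unsupported family: $family")
    val resolved = link.getOrElse(defaultLink(family))
    require(supported.contains(family -> resolved),
      s"Generalized Linear Regression with $family family " +
        s"does not support $resolved link function.")
    resolved
  }
}
```

For example, `LinkSketch.resolveLink("poisson", None)` falls back to the family default, while an explicit link that is not in the table for that family fails the `require` check.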

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-24 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54059544
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,565 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionBase extends 
PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with 
HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error 
distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * Default is "gaussian".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+"The name of family which is a description of the error distribution 
to be used in the " +
+  "model. Supported options: gaussian(default), binomial, poisson and 
gamma.",
+
ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilyNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of link function which provides the relationship
+   * between the linear predictor and the mean of the distribution 
function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", 
"cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "The name of 
link function " +
+"which provides the relationship between the linear predictor and the 
mean of the " +
+"distribution function. Supported options: identity, log, inverse, 
logit, probit, " +
+"cloglog and sqrt.",
+
ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinkNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  import GeneralizedLinearRegression._
+  protected lazy val familyObj = Family.fromName($(family))
+  protected lazy val linkObj = if (isDefined(link)) {
+Link.fromName($(link))
+  } else {
+familyObj.defaultLink
+  }
+  protected lazy val familyAndLink = new FamilyAndLink(familyObj, linkObj)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+if ($(solver) == "irls") {
+  setDefault(maxIter -> 25)
+}
+if (isDefined(link)) {
+  
require(GeneralizedLinearRegression.supportedFamilyAndLinkParis.contains(
+familyObj -> linkObj), s"Generalized Linear Regression with 
${$(family)} family " +
+s"does not support ${$(link)} link function.")
+}
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model 
([[https://en.wikipedia.org/wiki/Generalized_linear_model]])
+ * specified by giving a symbolic description of the linear predictor 
(link function) and
+ * a description of the error distribution (family).
+ * It supports "gaussian", "binomial", "poisson" and "gamma" as family.
+ * Valid link functions for each family is listed below. The 

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-24 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54059585
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,565 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionBase extends 
PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with 
HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error 
distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * Default is "gaussian".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+"The name of family which is a description of the error distribution 
to be used in the " +
+  "model. Supported options: gaussian(default), binomial, poisson and 
gamma.",
+
ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilyNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of link function which provides the relationship
+   * between the linear predictor and the mean of the distribution 
function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", 
"cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "The name of 
link function " +
+"which provides the relationship between the linear predictor and the 
mean of the " +
+"distribution function. Supported options: identity, log, inverse, 
logit, probit, " +
+"cloglog and sqrt.",
+
ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinkNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  import GeneralizedLinearRegression._
+  protected lazy val familyObj = Family.fromName($(family))
+  protected lazy val linkObj = if (isDefined(link)) {
+Link.fromName($(link))
+  } else {
+familyObj.defaultLink
+  }
+  protected lazy val familyAndLink = new FamilyAndLink(familyObj, linkObj)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+if ($(solver) == "irls") {
+  setDefault(maxIter -> 25)
+}
+if (isDefined(link)) {
+  
require(GeneralizedLinearRegression.supportedFamilyAndLinkParis.contains(
+familyObj -> linkObj), s"Generalized Linear Regression with 
${$(family)} family " +
+s"does not support ${$(link)} link function.")
+}
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model 
([[https://en.wikipedia.org/wiki/Generalized_linear_model]])
+ * specified by giving a symbolic description of the linear predictor 
(link function) and
+ * a description of the error distribution (family).
+ * It supports "gaussian", "binomial", "poisson" and "gamma" as family.
+ * Valid link functions for each family is listed below. The 

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-24 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54059572
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,565 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionBase extends 
PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with 
HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error 
distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * Default is "gaussian".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+"The name of family which is a description of the error distribution 
to be used in the " +
+  "model. Supported options: gaussian(default), binomial, poisson and 
gamma.",
+
ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilyNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of link function which provides the relationship
+   * between the linear predictor and the mean of the distribution 
function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", 
"cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "The name of 
link function " +
+"which provides the relationship between the linear predictor and the 
mean of the " +
+"distribution function. Supported options: identity, log, inverse, 
logit, probit, " +
+"cloglog and sqrt.",
+
ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinkNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  import GeneralizedLinearRegression._
+  protected lazy val familyObj = Family.fromName($(family))
+  protected lazy val linkObj = if (isDefined(link)) {
+Link.fromName($(link))
+  } else {
+familyObj.defaultLink
+  }
+  protected lazy val familyAndLink = new FamilyAndLink(familyObj, linkObj)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+if ($(solver) == "irls") {
+  setDefault(maxIter -> 25)
+}
+if (isDefined(link)) {
+  
require(GeneralizedLinearRegression.supportedFamilyAndLinkParis.contains(
+familyObj -> linkObj), s"Generalized Linear Regression with 
${$(family)} family " +
+s"does not support ${$(link)} link function.")
+}
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model 
([[https://en.wikipedia.org/wiki/Generalized_linear_model]])
+ * specified by giving a symbolic description of the linear predictor 
(link function) and
+ * a description of the error distribution (family).
+ * It supports "gaussian", "binomial", "poisson" and "gamma" as family.
+ * Valid link functions for each family is listed below. The 

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-24 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54059552
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,565 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionBase extends 
PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with 
HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error 
distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * Default is "gaussian".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+"The name of family which is a description of the error distribution 
to be used in the " +
+  "model. Supported options: gaussian(default), binomial, poisson and 
gamma.",
+
ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilyNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of link function which provides the relationship
+   * between the linear predictor and the mean of the distribution 
function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", 
"cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "The name of 
link function " +
+"which provides the relationship between the linear predictor and the 
mean of the " +
+"distribution function. Supported options: identity, log, inverse, 
logit, probit, " +
+"cloglog and sqrt.",
+
ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinkNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  import GeneralizedLinearRegression._
+  protected lazy val familyObj = Family.fromName($(family))
+  protected lazy val linkObj = if (isDefined(link)) {
+Link.fromName($(link))
+  } else {
+familyObj.defaultLink
+  }
+  protected lazy val familyAndLink = new FamilyAndLink(familyObj, linkObj)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+if ($(solver) == "irls") {
+  setDefault(maxIter -> 25)
+}
+if (isDefined(link)) {
+  
require(GeneralizedLinearRegression.supportedFamilyAndLinkParis.contains(
+familyObj -> linkObj), s"Generalized Linear Regression with 
${$(family)} family " +
+s"does not support ${$(link)} link function.")
+}
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model 
([[https://en.wikipedia.org/wiki/Generalized_linear_model]])
+ * specified by giving a symbolic description of the linear predictor 
(link function) and
+ * a description of the error distribution (family).
+ * It supports "gaussian", "binomial", "poisson" and "gamma" as family.
+ * Valid link functions for each family is listed below. The 

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-24 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54059534
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -0,0 +1,565 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionBase extends PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * Default is "gaussian".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+"The name of family which is a description of the error distribution to be used in the " +
+  "model. Supported options: gaussian(default), binomial, poisson and gamma.",
+ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilyNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of link function which provides the relationship
+   * between the linear predictor and the mean of the distribution function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "The name of link function " +
+"which provides the relationship between the linear predictor and the mean of the " +
+"distribution function. Supported options: identity, log, inverse, logit, probit, " +
+"cloglog and sqrt.",
+ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinkNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  import GeneralizedLinearRegression._
+  protected lazy val familyObj = Family.fromName($(family))
--- End diff --

This cannot be a member `val`. Users can set param `family` multiple times, but a `lazy val` is evaluated only once, so later changes to the param would be ignored. We should move it and `linkObj` to `fit/train`.
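
To illustrate the concern, here is a minimal, self-contained sketch. It does not use Spark; `StaleEstimator`, `FreshEstimator`, and the string-valued `familyObj` are hypothetical stand-ins that only mimic the shape of the params in this diff. It shows that a member `lazy val` keeps the value resolved on first access, while a local `val` inside `fit` picks up the current setting.

```scala
// Hypothetical stand-ins, not Spark APIs: a plain var plays the role of the `family` param.
final class StaleEstimator {
  private var family: String = "gaussian"
  def setFamily(value: String): this.type = { family = value; this }

  // Evaluated once, on first access, then cached for the life of the instance.
  lazy val familyObj: String = s"Family($family)"

  def fit(): String = familyObj
}

final class FreshEstimator {
  private var family: String = "gaussian"
  def setFamily(value: String): this.type = { family = value; this }

  // Resolved inside fit(), so every call sees the current param value.
  def fit(): String = {
    val familyObj = s"Family($family)"
    familyObj
  }
}

object LazyValDemo {
  def main(args: Array[String]): Unit = {
    val stale = new StaleEstimator
    stale.fit()                                // first access caches Family(gaussian)
    println(stale.setFamily("poisson").fit())  // still Family(gaussian)

    val fresh = new FreshEstimator
    fresh.fit()
    println(fresh.setFamily("poisson").fit())  // Family(poisson)
  }
}
```

Resolving the objects inside `fit/train` costs one extra lookup per call, which seems negligible next to the fitting work itself.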





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-24 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54059489
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -0,0 +1,499 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import scala.util.Random
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.param.ParamsSuite
+import org.apache.spark.ml.util.MLTestingUtils
+import org.apache.spark.mllib.classification.LogisticRegressionSuite._
+import org.apache.spark.mllib.linalg.{BLAS, DenseVector, Vectors}
+import org.apache.spark.mllib.random._
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.sql.{DataFrame, Row}
+
+class GeneralizedLinearRegressionSuite extends SparkFunSuite with 
MLlibTestSparkContext {
+
+  private val seed: Int = 42
+  @transient var datasetGaussianIdentity: DataFrame = _
+  @transient var datasetGaussianLog: DataFrame = _
+  @transient var datasetGaussianInverse: DataFrame = _
+  @transient var datasetBinomial: DataFrame = _
+  @transient var datasetPoissonLog: DataFrame = _
+  @transient var datasetPoissonIdentity: DataFrame = _
+  @transient var datasetPoissonSqrt: DataFrame = _
+  @transient var datasetGammaInverse: DataFrame = _
+  @transient var datasetGammaIdentity: DataFrame = _
+  @transient var datasetGammaLog: DataFrame = _
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+
+import GeneralizedLinearRegressionSuite._
+
+datasetGaussianIdentity = sqlContext.createDataFrame(
+  sc.parallelize(generateGeneralizedLinearRegressionInput(
+intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = 
Array(2.9, 10.5),
+xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+family = "gaussian", link = "identity"), 2))
+
+datasetGaussianLog = sqlContext.createDataFrame(
+  sc.parallelize(generateGeneralizedLinearRegressionInput(
+intercept = 0.25, coefficients = Array(0.22, 0.06), xMean = 
Array(2.9, 10.5),
+xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+family = "gaussian", link = "log"), 2))
+
+datasetGaussianInverse = sqlContext.createDataFrame(
+  sc.parallelize(generateGeneralizedLinearRegressionInput(
+intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = 
Array(2.9, 10.5),
+xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+family = "gaussian", link = "inverse"), 2))
+
+datasetBinomial = {
+  val nPoints = 1
+  val coefficients = Array(-0.57997, 0.912083, -0.371077, -0.819866, 
2.688191)
+  val xMean = Array(5.843, 3.057, 3.758, 1.199)
+  val xVariance = Array(0.6856, 0.1899, 3.116, 0.581)
+
+  val testData =
+generateMultinomialLogisticInput(coefficients, xMean, xVariance, 
true, nPoints, seed)
+
+  sqlContext.createDataFrame(sc.parallelize(testData, 4))
+}
+
+datasetPoissonLog = sqlContext.createDataFrame(
+  sc.parallelize(generateGeneralizedLinearRegressionInput(
+intercept = 0.25, coefficients = Array(0.22, 0.06), xMean = 
Array(2.9, 10.5),
+xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+family = "poisson", link = "log"), 2))
+
+datasetPoissonIdentity = sqlContext.createDataFrame(
+  sc.parallelize(generateGeneralizedLinearRegressionInput(
+intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = 
Array(2.9, 10.5),
+xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+family = "poisson", link = "identity"), 2))
+
+datasetPoissonSqrt = sqlContext.createDataFrame(
+  

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-24 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54059524
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
@@ -156,6 +156,8 @@ private[ml] class WeightedLeastSquares(
 
 private[ml] object WeightedLeastSquares {
 
+  val MaxNumFeatures: Int = 4096
--- End diff --

Please add a doc comment for this constant.
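
A possible shape for such a comment, as a sketch only. The object name `WeightedLeastSquaresSketch` and the helper `checkNumFeatures` are hypothetical, and the stated rationale is an assumption (the diff itself does not say why 4096 was chosen):

```scala
object WeightedLeastSquaresSketch {

  /**
   * Maximum number of features this solver accepts.
   * (Assumed rationale, not stated in the diff: the solver works on a
   * numFeatures x numFeatures matrix, so cost grows quadratically.)
   */
  val MaxNumFeatures: Int = 4096

  /** Hypothetical guard that fails fast when the limit is exceeded. */
  def checkNumFeatures(numFeatures: Int): Unit =
    require(numFeatures <= MaxNumFeatures,
      s"Expected at most $MaxNumFeatures features but got $numFeatures.")
}
```

Keeping the limit as a documented constant lets callers reference it in their own checks instead of repeating the literal.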





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-24 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-188165176
  
@mengxr This PR is ready for another pass. Thanks!





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-24 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53913710
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,565 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionBase extends PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * Default is "gaussian".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+    "The name of family which is a description of the error distribution to be used in the " +
+      "model. Supported options: gaussian(default), binomial, poisson and gamma.",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilyNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of link function which provides the relationship
+   * between the linear predictor and the mean of the distribution function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "The name of link function " +
+    "which provides the relationship between the linear predictor and the mean of the " +
+    "distribution function. Supported options: identity, log, inverse, logit, probit, " +
+    "cloglog and sqrt.",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinkNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  import GeneralizedLinearRegression._
+  protected lazy val familyObj = Family.fromName($(family))
+  protected lazy val linkObj = if (isDefined(link)) {
+    Link.fromName($(link))
+  } else {
+    familyObj.defaultLink
+  }
+  protected lazy val familyAndLink = new FamilyAndLink(familyObj, linkObj)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+    if ($(solver) == "irls") {
+      setDefault(maxIter -> 25)
+    }
+    if (isDefined(link)) {
+      require(GeneralizedLinearRegression.supportedFamilyAndLinkParis.contains(
+        familyObj -> linkObj), s"Generalized Linear Regression with ${$(family)} family " +
+        s"does not support ${$(link)} link function.")
+    }
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]])
+ * specified by giving a symbolic description of the linear predictor (link function) and
+ * a description of the error distribution (family).
+ * It supports "gaussian", "binomial", "poisson" and "gamma" as family.
+ * Valid link functions for each family is listed below. 

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-24 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53911596
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,565 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionBase extends PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * Default is "gaussian".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+    "The name of family which is a description of the error distribution to be used in the " +
+      "model. Supported options: gaussian(default), binomial, poisson and gamma.",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilyNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of link function which provides the relationship
+   * between the linear predictor and the mean of the distribution function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "The name of link function " +
+    "which provides the relationship between the linear predictor and the mean of the " +
+    "distribution function. Supported options: identity, log, inverse, logit, probit, " +
+    "cloglog and sqrt.",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinkNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  import GeneralizedLinearRegression._
+  protected lazy val familyObj = Family.fromName($(family))
+  protected lazy val linkObj = if (isDefined(link)) {
+    Link.fromName($(link))
+  } else {
+    familyObj.defaultLink
+  }
+  protected lazy val familyAndLink = new FamilyAndLink(familyObj, linkObj)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+    if ($(solver) == "irls") {
+      setDefault(maxIter -> 25)
+    }
+    if (isDefined(link)) {
+      require(GeneralizedLinearRegression.supportedFamilyAndLinkParis.contains(
+        familyObj -> linkObj), s"Generalized Linear Regression with ${$(family)} family " +
+        s"does not support ${$(link)} link function.")
+    }
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]])
+ * specified by giving a symbolic description of the linear predictor (link function) and
+ * a description of the error distribution (family).
+ * It supports "gaussian", "binomial", "poisson" and "gamma" as family.
+ * Valid link functions for each family is listed below. 

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-24 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53911268
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -0,0 +1,499 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import scala.util.Random
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.param.ParamsSuite
+import org.apache.spark.ml.util.MLTestingUtils
+import org.apache.spark.mllib.classification.LogisticRegressionSuite._
+import org.apache.spark.mllib.linalg.{BLAS, DenseVector, Vectors}
+import org.apache.spark.mllib.random._
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.sql.{DataFrame, Row}
+
+class GeneralizedLinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
+
+  private val seed: Int = 42
+  @transient var datasetGaussianIdentity: DataFrame = _
+  @transient var datasetGaussianLog: DataFrame = _
+  @transient var datasetGaussianInverse: DataFrame = _
+  @transient var datasetBinomial: DataFrame = _
+  @transient var datasetPoissonLog: DataFrame = _
+  @transient var datasetPoissonIdentity: DataFrame = _
+  @transient var datasetPoissonSqrt: DataFrame = _
+  @transient var datasetGammaInverse: DataFrame = _
+  @transient var datasetGammaIdentity: DataFrame = _
+  @transient var datasetGammaLog: DataFrame = _
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+
+    import GeneralizedLinearRegressionSuite._
+
+    datasetGaussianIdentity = sqlContext.createDataFrame(
+      sc.parallelize(generateGeneralizedLinearRegressionInput(
+        intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5),
+        xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+        family = "gaussian", link = "identity"), 2))
+
+    datasetGaussianLog = sqlContext.createDataFrame(
+      sc.parallelize(generateGeneralizedLinearRegressionInput(
+        intercept = 0.25, coefficients = Array(0.22, 0.06), xMean = Array(2.9, 10.5),
+        xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+        family = "gaussian", link = "log"), 2))
+
+    datasetGaussianInverse = sqlContext.createDataFrame(
+      sc.parallelize(generateGeneralizedLinearRegressionInput(
+        intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5),
+        xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+        family = "gaussian", link = "inverse"), 2))
+
+    datasetBinomial = {
+      val nPoints = 1
+      val coefficients = Array(-0.57997, 0.912083, -0.371077, -0.819866, 2.688191)
+      val xMean = Array(5.843, 3.057, 3.758, 1.199)
+      val xVariance = Array(0.6856, 0.1899, 3.116, 0.581)
+
+      val testData =
+        generateMultinomialLogisticInput(coefficients, xMean, xVariance, true, nPoints, seed)
+
+      sqlContext.createDataFrame(sc.parallelize(testData, 4))
+    }
+
+    datasetPoissonLog = sqlContext.createDataFrame(
+      sc.parallelize(generateGeneralizedLinearRegressionInput(
+        intercept = 0.25, coefficients = Array(0.22, 0.06), xMean = Array(2.9, 10.5),
+        xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+        family = "poisson", link = "log"), 2))
+
+    datasetPoissonIdentity = sqlContext.createDataFrame(
+      sc.parallelize(generateGeneralizedLinearRegressionInput(
+        intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5),
+        xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+        family = "poisson", link = "identity"), 2))
+
+    datasetPoissonSqrt = sqlContext.createDataFrame(
+  

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-24 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53909986
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,565 @@

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-24 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53909458
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,565 @@

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-188138657
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-188138718
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51857/





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-188138376
  
**[Test build #51857 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51857/consoleFull)** for PR 11136 at commit [`aa89fdc`](https://github.com/apache/spark/commit/aa89fdcb837e12481bed23f781925ea6e8f6acbe).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-188129136
  
**[Test build #51857 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51857/consoleFull)** for PR 11136 at commit [`aa89fdc`](https://github.com/apache/spark/commit/aa89fdcb837e12481bed23f781925ea6e8f6acbe).





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722773
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,547 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams
--- End diff --

We can call it `GeneralizedLinearRegressionBase` because it also implements 
other functions.
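
As a rough illustration of the naming point (a standalone sketch, not code from this PR): a trait that only declares params would be a `...Params` trait, while one that also resolves a default link and validates the family/link combination carries behaviour, which is what a `...Base` name signals. Everything below (`GlmParamsSketch`, `GlmBase`, `GlmConfig`, `resolvedLink`) is a hypothetical stand-in that does not use Spark's `Param` machinery. The family-to-link table is an assumption inferred from the supported option names and the test dataset names, with the first link of each family treated as its default.

```scala
// Illustrative sketch only: hypothetical names, no Spark dependency.
object GlmParamsSketch {

  // Assumed family -> allowed links table; the first entry is treated as the default link.
  val supportedLinks: Map[String, Seq[String]] = Map(
    "gaussian" -> Seq("identity", "log", "inverse"),
    "binomial" -> Seq("logit", "probit", "cloglog"),
    "poisson"  -> Seq("log", "identity", "sqrt"),
    "gamma"    -> Seq("inverse", "identity", "log"))

  // A "Base"-style trait: it declares the parameters and also implements
  // logic derived from them (default resolution and validation).
  trait GlmBase {
    def family: String
    def link: Option[String]

    // Use the explicitly set link if there is one, otherwise the family's default.
    def resolvedLink: String = link.getOrElse(supportedLinks(family).head)

    // Reject unknown families and family/link pairs that are not in the table.
    def validateParams(): Unit = {
      require(supportedLinks.contains(family), s"Unsupported family: $family")
      link.foreach { l =>
        require(supportedLinks(family).contains(l),
          s"Generalized Linear Regression with $family family does not support $l link function.")
      }
    }
  }

  // Minimal concrete holder for the two parameters.
  final case class GlmConfig(family: String = "gaussian", link: Option[String] = None)
    extends GlmBase
}
```

With this sketch, `GlmConfig("poisson").resolvedLink` falls back to the assumed default for that family, and `GlmConfig("binomial", Some("sqrt")).validateParams()` throws an `IllegalArgumentException` from `require`.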





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722712
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,547 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+    "the name of family which is a description of the error distribution to be used in the model",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of the model link function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "the name of the model link function",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+    if (isDefined(link)) {
+      require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) -> $(link)),
+        s"Generalized Linear Regression with ${$(family)} family does not support ${$(link)} " +
+          s"link function.")
+    }
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]])
+ * specified by giving a symbolic description of the linear predictor and
+ * a description of the error distribution.
+ */
+@Experimental
+@Since("2.0.0")
+class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val uid: String)
+  extends Regressor[Vector, GeneralizedLinearRegression, GeneralizedLinearRegressionModel]
+  with GeneralizedLinearRegressionParams with Logging {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("genLinReg"))
+
+  /**
+   * Set the name of family which is a description of the error distribution
+   * to be used in the model.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setFamily(value: String): this.type = set(family, value)
+
+  /**
+   * Set the name of the model link function.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setLink(value: String): this.type = set(link, value)
+
+  /**
+   * Set if we should fit the intercept.
+   * Default is true.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setFitIntercept(value: 

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722694
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,547 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+    "the name of family which is a description of the error distribution to be used in the model",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of the model link function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "the name of the model link function",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+    if (isDefined(link)) {
+      require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) -> $(link)),
+        s"Generalized Linear Regression with ${$(family)} family does not support ${$(link)} " +
+          s"link function.")
+    }
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]])
+ * specified by giving a symbolic description of the linear predictor and
+ * a description of the error distribution.
+ */
+@Experimental
+@Since("2.0.0")
+class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val uid: String)
+  extends Regressor[Vector, GeneralizedLinearRegression, GeneralizedLinearRegressionModel]
+  with GeneralizedLinearRegressionParams with Logging {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("genLinReg"))
+
+  /**
+   * Set the name of family which is a description of the error distribution
+   * to be used in the model.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setFamily(value: String): this.type = set(family, value)
+
+  /**
+   * Set the name of the model link function.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setLink(value: String): this.type = set(link, value)
+
+  /**
+   * Set if we should fit the intercept.
+   * Default is true.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setFitIntercept(value: 
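
For readers following the quoted hunk, the family/link check in validateParams can be sketched outside Spark as below. This is only an illustration: the object name FamilyLinkCheck and the contents of the pair table are assumptions, since the quoted diff is cut off before supportedFamilyLinkPairs is defined. The shape of the check (skip when no link is set, otherwise require the pair to be in the table) follows the code above.

```scala
// Standalone sketch with no Spark dependency.
// ASSUMPTION: the pair table is illustrative; the quoted diff does not
// show the actual contents of supportedFamilyLinkPairs.
object FamilyLinkCheck {
  // Hypothetical table of allowed (family, link) combinations.
  val supportedPairs: Set[(String, String)] = Set(
    "gaussian" -> "identity", "gaussian" -> "log", "gaussian" -> "inverse",
    "binomial" -> "logit", "binomial" -> "probit", "binomial" -> "cloglog",
    "poisson" -> "log", "poisson" -> "identity", "poisson" -> "sqrt",
    "gamma" -> "inverse", "gamma" -> "identity", "gamma" -> "log")

  // True when the (family, link) pair is in the table.
  def isSupported(family: String, link: String): Boolean =
    supportedPairs.contains(family -> link)

  // Same shape as validateParams: only check when a link has been set.
  // require throws IllegalArgumentException when the pair is not allowed.
  def validate(family: String, link: Option[String]): Unit =
    link.foreach { l =>
      require(isSupported(family, l),
        s"Generalized Linear Regression with $family family does not support $l " +
          s"link function.")
    }
}
```

With this sketch, FamilyLinkCheck.validate("poisson", None) returns without checking anything, while a pair missing from the table, such as ("gamma", Some("logit")), fails the require.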

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722714
  

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722703
  

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722678
  

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722697
  

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722686
  

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722637
  

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722623
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,547 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+    "the name of family which is a description of the error distribution to be used in the model",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of the model link function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "the name of the model link function",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+    if (isDefined(link)) {
+      require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) -> $(link)),
+        s"Generalized Linear Regression with ${$(family)} family does not support ${$(link)} " +
+          s"link function.")
+    }
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]])
+ * specified by giving a symbolic description of the linear predictor and
+ * a description of the error distribution.
+ */
+@Experimental
+@Since("2.0.0")
+class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val uid: String)
+  extends Regressor[Vector, GeneralizedLinearRegression, GeneralizedLinearRegressionModel]
+  with GeneralizedLinearRegressionParams with Logging {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("genLinReg"))
+
+  /**
+   * Set the name of family which is a description of the error distribution
+   * to be used in the model.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setFamily(value: String): this.type = set(family, value)
+
+  /**
+   * Set the name of the model link function.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setLink(value: String): this.type = set(link, value)
+
+  /**
+   * Set if we should fit the intercept.
+   * Default is true.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setFitIntercept(value: 

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722641
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722631
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722634
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722629
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722508
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722512
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722503
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,547 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error 
distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+"the name of family which is a description of the error distribution 
to be used in the model",
+
ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of the model link function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", 
"cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "the name of the 
model link function",
+
ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+if (isDefined(link)) {
+  
require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) 
-> $(link)),
+s"Generalized Linear Regression with ${$(family)} family does not 
support ${$(link)} " +
+  s"link function.")
+}
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]])
+ * specified by giving a symbolic description of the linear predictor and
+ * a description of the error distribution.
+ */
+@Experimental
+@Since("2.0.0")
+class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val uid: String)
+  extends Regressor[Vector, GeneralizedLinearRegression, GeneralizedLinearRegressionModel]
+  with GeneralizedLinearRegressionParams with Logging {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("genLinReg"))
+
+  /**
+   * Set the name of family which is a description of the error distribution
+   * to be used in the model.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setFamily(value: String): this.type = set(family, value)
+
+  /**
+   * Set the name of the model link function.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setLink(value: String): this.type = set(link, value)
+
+  /**
+   * Set if we should fit the intercept.
+   * Default is true.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setFitIntercept(value: 

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722484
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,547 @@

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722437
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,547 @@

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722423
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,547 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+    "the name of family which is a description of the error distribution to be used in the model",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of the model link function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "the name of the model link function",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+    if (isDefined(link)) {
+      require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) -> $(link)),
+        s"Generalized Linear Regression with ${$(family)} family does not support ${$(link)} " +
+          s"link function.")
+    }
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]])
+ * specified by giving a symbolic description of the linear predictor and
+ * a description of the error distribution.
+ */
+@Experimental
+@Since("2.0.0")
+class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val uid: String)
+  extends Regressor[Vector, GeneralizedLinearRegression, GeneralizedLinearRegressionModel]
+  with GeneralizedLinearRegressionParams with Logging {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("genLinReg"))
+
+  /**
+   * Set the name of family which is a description of the error distribution
+   * to be used in the model.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setFamily(value: String): this.type = set(family, value)
--- End diff --

Set the default to "gaussian"?
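
A minimal sketch of what defaulting the family could look like, written in plain Scala rather than against Spark's `Params` API. `FamilyParam`, `FamilyParamDemo` and their members are hypothetical names used only for illustration. In the PR itself the equivalent would presumably be a `setDefault(family -> "gaussian")` call in the estimator, but that is an assumption about how this comment might be addressed, not something shown in the diff.

```scala
// Hypothetical stand-in for a string param with a default value.
// This is NOT Spark's Param/Params API; all names are illustrative.
final class FamilyParam(default: String, supported: Set[String]) {
  require(supported.contains(default), s"unsupported default family: $default")

  // Holds the value only when the user set one explicitly.
  private var explicit: Option[String] = None

  // Mirrors a fluent setter such as setFamily(value).
  def set(value: String): this.type = {
    require(supported.contains(value), s"unsupported family: $value")
    explicit = Some(value)
    this
  }

  // Falls back to the default when set() was never called.
  def get: String = explicit.getOrElse(default)
}

object FamilyParamDemo {
  // Family names taken from the ScalaDoc in the diff above.
  val supportedFamilies: Set[String] = Set("gaussian", "binomial", "poisson", "gamma")

  def main(args: Array[String]): Unit = {
    val family = new FamilyParam("gaussian", supportedFamilies)
    println(family.get)                // default applies
    println(family.set("poisson").get) // explicit value wins
  }
}
```

With a default in place, a getter like `getFamily` is safe to call before any setter runs, which also simplifies checks such as `validateParams()` that read `$(family)`.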



[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722443
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,547 @@

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722430
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,547 @@

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722396
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,547 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+    "the name of family which is a description of the error distribution to be used in the model",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of the model link function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "the name of the model link function",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+    if (isDefined(link)) {
+      require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) -> $(link)),
+        s"Generalized Linear Regression with ${$(family)} family does not support ${$(link)} " +
+          s"link function.")
+    }
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]])
+ * specified by giving a symbolic description of the linear predictor and
+ * a description of the error distribution.
+ */
+@Experimental
+@Since("2.0.0")
+class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val uid: String)
+  extends Regressor[Vector, GeneralizedLinearRegression, GeneralizedLinearRegressionModel]
+  with GeneralizedLinearRegressionParams with Logging {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("genLinReg"))
--- End diff --

"glm", which is quite common





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722393
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,547 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+    "the name of family which is a description of the error distribution to be used in the model",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of the model link function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "the name of the model link function",
--- End diff --

* Include supported options and the default behavior in the param doc and the ScalaDoc.
* Mention the list of valid (family, link) combinations somewhere in the public doc.
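
One way the requested (family, link) list could be kept in a single place and reused for both validation and documentation is sketched below. This is only an illustration: the object name `FamilyLinkSketch` and the helper `linksFor` are hypothetical, and the pairs listed are an assumption based on the conventional GLM pairings, since the contents of the PR's `supportedFamilyLinkPairs` are not shown in this excerpt.

```scala
object FamilyLinkSketch {
  // ASSUMPTION: these pairs follow the conventional GLM family/link pairings.
  // The actual contents of supportedFamilyLinkPairs are not visible in the quoted diff.
  val supportedFamilyLinkPairs: Set[(String, String)] = Set(
    "gaussian" -> "identity", "gaussian" -> "log", "gaussian" -> "inverse",
    "binomial" -> "logit", "binomial" -> "probit", "binomial" -> "cloglog",
    "poisson" -> "log", "poisson" -> "identity", "poisson" -> "sqrt",
    "gamma" -> "inverse", "gamma" -> "identity", "gamma" -> "log")

  // Same membership test that validateParams() performs on ($(family) -> $(link)).
  def isSupported(family: String, link: String): Boolean =
    supportedFamilyLinkPairs.contains(family -> link)

  // Hypothetical helper: the links allowed for one family, sorted so the
  // result can be pasted into a param doc or the public ScalaDoc.
  def linksFor(family: String): Seq[String] =
    supportedFamilyLinkPairs.collect { case (`family`, l) => l }.toSeq.sorted
}
```

For example, `FamilyLinkSketch.linksFor("binomial")` yields the three binomial links in alphabetical order, which could be interpolated into the doc string instead of being typed by hand.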


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722399
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -0,0 +1,547 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+    "the name of family which is a description of the error distribution to be used in the model",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of the model link function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "the name of the model link function",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+    if (isDefined(link)) {
+      require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) -> $(link)),
+        s"Generalized Linear Regression with ${$(family)} family does not support ${$(link)} " +
+          s"link function.")
+    }
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]])
+ * specified by giving a symbolic description of the linear predictor and
+ * a description of the error distribution.
+ */
+@Experimental
+@Since("2.0.0")
+class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val uid: String)
+  extends Regressor[Vector, GeneralizedLinearRegression, GeneralizedLinearRegressionModel]
+  with GeneralizedLinearRegressionParams with Logging {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("genLinReg"))
+
+  /**
+   * Set the name of family which is a description of the error distribution
--- End diff --

* `Set` -> `Sets`
* I don't think it is necessary to repeat the param doc. Maybe we can simply say `Sets the value of param [[family]].`
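
A minimal sketch of how the shortened setter doc might read is given below. The class `SetterDocSketch` is a hypothetical stand-in that stores the value in a plain field; it does not use Spark's `Params` machinery, and only the ScalaDoc wording is taken from the comment above.

```scala
// Hypothetical stand-in class: it keeps the value in a plain Option field
// rather than going through Spark's Params/set() machinery.
class SetterDocSketch {
  private var familyValue: Option[String] = None

  /** The currently set family, if any (stand-in for the `family` param). */
  def family: Option[String] = familyValue

  /**
   * Sets the value of param [[family]].
   * @group setParam
   */
  def setFamily(value: String): this.type = {
    familyValue = Some(value)
    this
  }
}
```

The one-line doc defers to the param's own documentation, so the description of supported options lives in one place only.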




[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53722391
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -0,0 +1,547 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+    "the name of family which is a description of the error distribution to be used in the model",
--- End diff --

* Include supported options and the default value in the param doc (and the ScalaDoc).
* Shall we make "gaussian" the default?
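
The two suggestions above could look roughly like the following sketch, which lists the supported options and the default in the doc string and falls back to "gaussian" when nothing is set. The class `FamilyDefaultSketch` and its fields are hypothetical stand-ins that avoid Spark's `Param`/`setDefault` API; whether "gaussian" actually becomes the default is the open question in the review, so it is treated here as an assumption.

```scala
// Hypothetical stand-in for the family param; does not use Spark's Param API.
class FamilyDefaultSketch {
  // Supported options as listed in the quoted ScalaDoc.
  val supportedFamilies: Seq[String] = Seq("gaussian", "binomial", "poisson", "gamma")

  // ASSUMPTION: "gaussian" as the default is only proposed in the review comment.
  val defaultFamily: String = "gaussian"

  // Param doc that spells out both the options and the default.
  val familyDoc: String =
    "the name of family which is a description of the error distribution to be used in " +
      s"the model. Supported options: ${supportedFamilies.mkString(", ")}. " +
      s"Default: $defaultFamily."

  private var explicitFamily: Option[String] = None

  /** Sets the family, rejecting values outside the supported options. */
  def setFamily(value: String): this.type = {
    require(supportedFamilies.contains(value), s"Unsupported family: $value")
    explicitFamily = Some(value)
    this
  }

  /** Returns the explicitly set family, or the default when none was set. */
  def getFamily: String = explicitFamily.getOrElse(defaultFamily)
}
```

Building the doc string from `supportedFamilies` keeps the documented options from drifting away from the validated ones.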





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-22 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-187443051
  
I'm making a pass.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-21 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53592447
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -0,0 +1,547 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+    "the name of family which is a description of the error distribution to be used in the model",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of the model link function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "the name of the model link function",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+    if (isDefined(link)) {
+      require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) -> $(link)),
+        s"Generalized Linear Regression with ${$(family)} family does not support ${$(link)} " +
+          s"link function.")
+    }
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]])
+ * specified by giving a symbolic description of the linear predictor and
+ * a description of the error distribution.
+ */
+@Experimental
+@Since("2.0.0")
+class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val uid: String)
+  extends Regressor[Vector, GeneralizedLinearRegression, GeneralizedLinearRegressionModel]
+  with GeneralizedLinearRegressionParams with Logging {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("genLinReg"))
+
+  /**
+   * Set the name of family which is a description of the error distribution
+   * to be used in the model.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setFamily(value: String): this.type = set(family, value)
+
+  /**
+   * Set the name of the model link function.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setLink(value: String): this.type = set(link, value)
+
+  /**
+   * Set if we should fit the intercept.
+   * Default is true.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setFitIntercept(value: 

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-184175086
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-184175089
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51306/
Test PASSed.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-184174776
  
**[Test build #51306 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51306/consoleFull)** for PR 11136 at commit [`4a27970`](https://github.com/apache/spark/commit/4a27970486091b6359b9d73ba477044d04c20875).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.




