[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-15 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/11694


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-15 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/11694#issuecomment-197159610
  
LGTM. Merged into master. Thanks!





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11694#issuecomment-197152626
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53269/
Test PASSed.





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11694#issuecomment-197152624
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11694#issuecomment-197152557
  
**[Test build #53269 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53269/consoleFull)**
 for PR 11694 at commit 
[`f89cdf0`](https://github.com/apache/spark/commit/f89cdf01d94faab7d6f9372df35033afe695f8aa).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11694#issuecomment-197143000
  
**[Test build #53269 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53269/consoleFull)**
 for PR 11694 at commit 
[`f89cdf0`](https://github.com/apache/spark/commit/f89cdf01d94faab7d6f9372df35033afe695f8aa).





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/11694#issuecomment-196481131
  
I made one pass and left some minor inline comments. This looks great overall!





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11694#discussion_r56058258
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -348,7 +376,20 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
 override def initialize(y: Double, weight: Double): Double = y
 
-def variance(mu: Double): Double = 1.0
+override def variance(mu: Double): Double = 1.0
+
+override def deviance(y: Double, mu: Double, weight: Double): Double = 
{
+  weight * math.pow(y - mu, 2.0)
--- End diff --

`(y - mu) * (y - mu)`, which is faster
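
To make the suggestion concrete, here is a minimal sketch of the Gaussian deviance written both ways. The object and method names are hypothetical and exist only for illustration; only the two expressions come from the diff and the comment above.

```scala
// Hypothetical stand-alone sketch; not the code in GeneralizedLinearRegression.scala.
object GaussianDevianceSketch {
  // Form used in the diff: squares the residual with math.pow.
  def devianceWithPow(y: Double, mu: Double, weight: Double): Double =
    weight * math.pow(y - mu, 2.0)

  // Form suggested in the review: squares the residual with a plain multiplication.
  def devianceWithMultiply(y: Double, mu: Double, weight: Double): Double = {
    val r = y - mu
    weight * (r * r)
  }
}
```

Both forms compute the same weighted squared residual; the second simply avoids the general-purpose `math.pow` call.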





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11694#discussion_r56058139
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -466,6 +468,461 @@ class GeneralizedLinearRegressionSuite
 }
   }
 
+  test("glm summary: gaussian family with weight") {
+/*
+   R code:
+
+   A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
+   b <- c(17, 19, 23, 29)
+   w <- c(1, 2, 3, 4)
+   df <- as.data.frame(cbind(A, b))
+ */
+val datasetWithWeight = sqlContext.createDataFrame(sc.parallelize(Seq(
+  Instance(17.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
+  Instance(19.0, 2.0, Vectors.dense(1.0, 7.0)),
+  Instance(23.0, 3.0, Vectors.dense(2.0, 11.0)),
+  Instance(29.0, 4.0, Vectors.dense(3.0, 13.0))
+), 2))
+/*
+   R code:
+
+   model <- glm(formula = "b ~ .", family="gaussian", data = df, 
weights = w)
+   summary(model)
+
+   Deviance Residuals:
+   1   2   3   4
+   1.920  -1.358  -1.109   0.960
+
+   Coefficients:
+               Estimate Std. Error t value Pr(>|t|)
+   (Intercept)   18.080      9.608   1.882    0.311
+   V1             6.080      5.556   1.094    0.471
+   V2            -0.600      1.960  -0.306    0.811
+
+   (Dispersion parameter for gaussian family taken to be 7.68)
+
+   Null deviance: 202.00  on 3  degrees of freedom
+   Residual deviance:   7.68  on 1  degrees of freedom
+   AIC: 18.783
+
+   Number of Fisher Scoring iterations: 2
+
+   residuals(model, type="pearson")
+  1 2 3 4
+   1.92 -1.357645 -1.108513  0.96
+
+   residuals(model, type="working")
+  1 2 3 4
+   1.92 -0.96 -0.64  0.48
+
+   residuals(model, type="response")
+  1 2 3 4
+   1.92 -0.96 -0.64  0.48
+ */
+val trainer = new GeneralizedLinearRegression()
+  .setWeightCol("weight")
+
+val model = trainer.fit(datasetWithWeight)
+
+val coefficientsR = Vectors.dense(Array(6.080, -0.600))
+val interceptR = 18.080
+val devianceResidualsR = Array(1.920, -1.358, -1.109, 0.960)
+val pearsonResidualsR = Array(1.92, -1.357645, -1.108513, 0.96)
+val workingResidualsR = Array(1.92, -0.96, -0.64, 0.48)
+val responseResidualsR = Array(1.92, -0.96, -0.64, 0.48)
+val seCoefR = Array(5.556, 1.960, 9.608)
+val tValsR = Array(1.094, -0.306, 1.882)
+val pValsR = Array(0.471, 0.811, 0.311)
+val dispersionR = 7.68
+val nullDevianceR = 202.00
+val residualDevianceR = 7.68
+val residualDegreeOfFreedomNullR = 3
+val residualDegreeOfFreedomR = 1
+val aicR = 18.783
+
+val summary = model.summary
+
+val devianceResiduals = summary.residuals()
+  .select(col("devianceResiduals"))
+  .collect()
+  .map(_.getAs[Double](0))
--- End diff --

`_.getDouble(0)`
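
As a sanity check on the R reference values quoted in this test, the Pearson residuals, residual deviance, and null deviance for the Gaussian family can be recomputed by hand from the labels, weights, and response residuals shown above. This is only a sketch: the object name is hypothetical, and it uses plain arrays rather than the DataFrame-based summary in the PR.

```scala
// Hypothetical cross-check of the R output quoted in the test; no Spark involved.
object GaussianSummaryCheck {
  val y = Array(17.0, 19.0, 23.0, 29.0)                   // labels b
  val w = Array(1.0, 2.0, 3.0, 4.0)                       // weights w
  val responseResiduals = Array(1.92, -0.96, -0.64, 0.48) // y - mu, as printed by R

  // Gaussian family has variance(mu) = 1.0, so Pearson residual = (y - mu) * sqrt(w).
  val pearsonResiduals: Array[Double] =
    responseResiduals.zip(w).map { case (r, wi) => r * math.sqrt(wi) }

  // Residual deviance: sum of weight * (y - mu)^2.
  val residualDeviance: Double =
    responseResiduals.zip(w).map { case (r, wi) => wi * r * r }.sum

  // Null model with intercept: mu is the weighted mean of the labels.
  val weightedMean: Double = y.zip(w).map { case (yi, wi) => yi * wi }.sum / w.sum

  val nullDeviance: Double =
    y.zip(w).map { case (yi, wi) => wi * (yi - weightedMean) * (yi - weightedMean) }.sum
}
```

Up to rounding, these reproduce the values in the R output above: a residual deviance of 7.68, a null deviance of 202.00, and Pearson residuals of 1.92, -1.357645, -1.108513, and 0.96.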





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11694#discussion_r56058081
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -633,3 +755,179 @@ object GeneralizedLinearRegressionModel extends 
MLReadable[GeneralizedLinearRegr
 }
   }
 }
+
+/**
+ * :: Experimental ::
+ * Summarizing Generalized Linear regression Fits.
+ *
+ * @param predictions predictions outputted by the model's `transform` 
method
+ * @param predictionCol field in "predictions" which gives the prediction 
value of each instance
+ * @param family the family object of the model
+ * @param link the link object of the model
+ * @param model the model that should be summarized
+ * @param diagInvAtWA diagonal of matrix (A^T * W * A)^-1 in the last 
iteration
+ * @param numIterations number of iterations
+ */
+@Since("2.0.0")
+@Experimental
+class GeneralizedLinearRegressionSummary private[regression] (
+@transient val predictions: DataFrame,
+val predictionCol: String,
+val family: GeneralizedLinearRegression.Family,
+val link: GeneralizedLinearRegression.Link,
+val model: GeneralizedLinearRegressionModel,
+private val diagInvAtWA: Array[Double],
+val numIterations: Int) extends Serializable {
+
+  import GeneralizedLinearRegression._
+
+  /** Number of instances in DataFrame predictions */
+  lazy val numInstances: Long = predictions.count()
+
+  /** The numeric rank of the fitted linear model */
+  lazy val rank: Long = if (model.getFitIntercept) {
+model.coefficients.size + 1
+  } else {
+model.coefficients.size
+  }
+
+  /** Degrees of freedom */
+  lazy val degreesOfFreedom: Long = {
+numInstances - rank
+  }
+
+  /** The residual degrees of freedom */
+  lazy val residualDegreeOfFreedom: Long = degreesOfFreedom
+
+  /** The residual degrees of freedom for the null model */
+  lazy val residualDegreeOfFreedomNull: Long = if (model.getFitIntercept) {
+numInstances - 1
+  } else {
+numInstances
+  }
+
+  private lazy val devianceResiduals: DataFrame = {
+val drUDF = udf { (y: Double, mu: Double, weight: Double) =>
+  val r = math.sqrt(math.max(family.deviance(y, mu, weight), 0.0))
+  if (y > mu) r else -1.0 * r
+}
+val w = if (model.getWeightCol.isEmpty) lit(1.0) else 
col(model.getWeightCol)
+predictions.select(
+  drUDF(col(model.getLabelCol), col(predictionCol), 
w).as("devianceResiduals"))
+  }
+
+  private lazy val pearsonResiduals: DataFrame = {
+val prUDF = udf { mu: Double => family.variance(mu) }
+val w = if (model.getWeightCol.isEmpty) lit(1.0) else 
col(model.getWeightCol)
+predictions.select(col(model.getLabelCol).minus(col(predictionCol))
+  .multiply(sqrt(w)).divide(sqrt(prUDF(col(predictionCol)))).as("pearsonResiduals"))
+  }
+
+  private lazy val workingResiduals: DataFrame = {
+val wrUDF = udf { (y: Double, mu: Double) => (y - mu) * link.deriv(mu) 
}
+predictions.select(wrUDF(col(model.getLabelCol), 
col(predictionCol)).as("workingResiduals"))
+  }
+
+  private lazy val responseResiduals: DataFrame = {
+
predictions.select(col(model.getLabelCol).minus(col(predictionCol)).as("responseResiduals"))
+  }
+
+  /**
+   * Get the residuals of the fitted model by type.
+   * @param residualsType The type of residuals which should be returned.
+   *  Supported options: deviance(default), pearson, 
working and response.
+   */
+  def residuals(residualsType: String = "deviance"): DataFrame = {
+residualsType match {
+  case "deviance" => devianceResiduals
+  case "pearson" => pearsonResiduals
+  case "working" => workingResiduals
+  case "response" => responseResiduals
+  case other => throw new UnsupportedOperationException(
+s"The residuals type $other is not supported by Generalized Linear 
Regression.")
+}
+  }
+
+  /**
+   * The deviance for the null model.
+   */
+  lazy val nullDeviance: Double = {
+val w = if (model.getWeightCol.isEmpty) lit(1.0) else 
col(model.getWeightCol)
+val wtdmu: Double = if (model.getFitIntercept) {
+  val agg = predictions.agg(sum(w.multiply(col(model.getLabelCol))), 
sum(w)).first()
+  agg.getDouble(0) / agg.getDouble(1)
+} else {
+  link.unlink(0.0)
+}
+predictions.select(col(model.getLabelCol), w).rdd.map {
+  case Row(y: Double, weight: Double) =>
+family.deviance(y, wtdmu, weight)
+}.sum()
+  }
+
+  /**
+   * The dev

[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11694#discussion_r56057691
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -633,3 +755,179 @@ object GeneralizedLinearRegressionModel extends 
MLReadable[GeneralizedLinearRegr
 }
   }
 }
+
+/**
+ * :: Experimental ::
+ * Summarizing Generalized Linear regression Fits.
+ *
+ * @param predictions predictions outputted by the model's `transform` 
method
+ * @param predictionCol field in "predictions" which gives the prediction 
value of each instance
+ * @param family the family object of the model
+ * @param link the link object of the model
+ * @param model the model that should be summarized
+ * @param diagInvAtWA diagonal of matrix (A^T * W * A)^-1 in the last 
iteration
+ * @param numIterations number of iterations
+ */
+@Since("2.0.0")
+@Experimental
+class GeneralizedLinearRegressionSummary private[regression] (
+@transient val predictions: DataFrame,
+val predictionCol: String,
+val family: GeneralizedLinearRegression.Family,
+val link: GeneralizedLinearRegression.Link,
+val model: GeneralizedLinearRegressionModel,
+private val diagInvAtWA: Array[Double],
+val numIterations: Int) extends Serializable {
+
+  import GeneralizedLinearRegression._
+
+  /** Number of instances in DataFrame predictions */
+  lazy val numInstances: Long = predictions.count()
+
+  /** The numeric rank of the fitted linear model */
+  lazy val rank: Long = if (model.getFitIntercept) {
+model.coefficients.size + 1
+  } else {
+model.coefficients.size
+  }
+
+  /** Degrees of freedom */
+  lazy val degreesOfFreedom: Long = {
+numInstances - rank
+  }
+
+  /** The residual degrees of freedom */
+  lazy val residualDegreeOfFreedom: Long = degreesOfFreedom
+
+  /** The residual degrees of freedom for the null model */
+  lazy val residualDegreeOfFreedomNull: Long = if (model.getFitIntercept) {
+numInstances - 1
+  } else {
+numInstances
+  }
+
+  private lazy val devianceResiduals: DataFrame = {
+val drUDF = udf { (y: Double, mu: Double, weight: Double) =>
+  val r = math.sqrt(math.max(family.deviance(y, mu, weight), 0.0))
+  if (y > mu) r else -1.0 * r
+}
+val w = if (model.getWeightCol.isEmpty) lit(1.0) else 
col(model.getWeightCol)
+predictions.select(
+  drUDF(col(model.getLabelCol), col(predictionCol), 
w).as("devianceResiduals"))
+  }
+
+  private lazy val pearsonResiduals: DataFrame = {
+val prUDF = udf { mu: Double => family.variance(mu) }
+val w = if (model.getWeightCol.isEmpty) lit(1.0) else 
col(model.getWeightCol)
+predictions.select(col(model.getLabelCol).minus(col(predictionCol))
+  .multiply(sqrt(w)).divide(sqrt(prUDF(col(predictionCol)))).as("pearsonResiduals"))
+  }
+
+  private lazy val workingResiduals: DataFrame = {
+val wrUDF = udf { (y: Double, mu: Double) => (y - mu) * link.deriv(mu) 
}
+predictions.select(wrUDF(col(model.getLabelCol), 
col(predictionCol)).as("workingResiduals"))
+  }
+
+  private lazy val responseResiduals: DataFrame = {
+
predictions.select(col(model.getLabelCol).minus(col(predictionCol)).as("responseResiduals"))
+  }
+
+  /**
+   * Get the residuals of the fitted model by type.
+   * @param residualsType The type of residuals which should be returned.
+   *  Supported options: deviance(default), pearson, 
working and response.
+   */
+  def residuals(residualsType: String = "deviance"): DataFrame = {
--- End diff --

We shall not use default values for Java compatibility. Overload 
`residuals` instead.
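
A minimal sketch of what the overload could look like. The class below is a hypothetical stand-in that returns plain sequences instead of DataFrames; only the method name `residuals`, the `"deviance"` default, and the error message come from the diff. Scala default arguments are not directly callable from Java, whereas a no-argument overload is an ordinary method there.

```scala
// Hypothetical stand-in for the summary class; residuals are plain sequences here.
class ResidualsSketch(deviance: Seq[Double], pearson: Seq[Double]) {

  /** No-argument overload: returns the deviance residuals. */
  def residuals(): Seq[Double] = residuals("deviance")

  /** Returns the residuals of the requested type. */
  def residuals(residualsType: String): Seq[Double] = residualsType match {
    case "deviance" => deviance
    case "pearson" => pearson
    case other => throw new UnsupportedOperationException(
      s"The residuals type $other is not supported by Generalized Linear Regression.")
  }
}
```

Callers in either language can then write `summary.residuals()` or `summary.residuals("pearson")`.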





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11694#discussion_r56057316
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -432,6 +503,22 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
 override def variance(mu: Double): Double = math.pow(mu, 2.0)
 
+override def deviance(y: Double, mu: Double, weight: Double): Double = 
{
+  val x = if (y == 0.0) 1.0 else y / mu
--- End diff --

When would `y == 0.0` happen? If this is not feasible in Gamma family, we 
should throw an error. In the current setting, this method returns `-2.0 * 
weight` when `y == 0.0`.
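
To illustrate the concern, here is a sketch of both behaviours. The quoted diff shows only the `val x` line, so the rest of the formula is an assumption: it is the standard Gamma unit deviance, chosen because it reproduces the `-2.0 * weight` result mentioned above. The object and method names are hypothetical.

```scala
// Hypothetical sketch; the formula after `val x` is assumed, not taken from the diff.
object GammaDevianceSketch {

  // Lenient form: clamps the ratio to 1.0 when y == 0.0, as in the diff.
  def lenient(y: Double, mu: Double, weight: Double): Double = {
    val x = if (y == 0.0) 1.0 else y / mu
    -2.0 * weight * (math.log(x) - (y - mu) / mu)
  }

  // Strict form: rejects non-positive labels instead of returning a value.
  def strict(y: Double, mu: Double, weight: Double): Double = {
    require(y > 0.0, s"The Gamma family requires a positive label, but got $y.")
    -2.0 * weight * (math.log(y / mu) - (y - mu) / mu)
  }
}
```

With `y == 0.0` the lenient form evaluates to `-2.0 * weight * (0.0 - (-1.0))`, i.e. `-2.0 * weight`, a negative deviance. The strict form throws an `IllegalArgumentException` instead.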





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11694#discussion_r56056935
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -405,6 +462,20 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 
 override def variance(mu: Double): Double = mu
 
+override def deviance(y: Double, mu: Double, weight: Double): Double = 
{
+  2 * weight * (y * math.log(y / mu) - (y - mu))
--- End diff --

`2.0` (just to be consistent)





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11694#discussion_r56056640
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -573,6 +660,41 @@ class GeneralizedLinearRegressionModel private[ml] (
 familyAndLink.fitted(eta)
   }
 
+  private var trainingSummary: Option[GeneralizedLinearRegressionSummary] 
= None
+
+  /**
+   * Gets R-like summary of model on training set. An exception is
+   * thrown if `trainingSummary == None`.
+   */
+  @Since("2.0.0")
+  def summary: GeneralizedLinearRegressionSummary = trainingSummary match {
+case Some(summ) => summ
+case None =>
+  throw new SparkException(
+"No training summary available for this 
GeneralizedLinearRegressionModel",
+new NullPointerException())
+  }
+
+  private[regression] def setSummary(summary: 
GeneralizedLinearRegressionSummary): this.type = {
+this.trainingSummary = Some(summary)
+this
+  }
+
+  /**
+   * If the prediction column is set returns the current model and 
prediction column,
+   * otherwise generates a new column and sets it as the prediction column 
on a new copy
+   * of the current model.
+   */
+  private[regression] def findSummaryModelAndPredictionCol(): (
--- End diff --

move `: (` to the next line





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11694#discussion_r56056645
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -633,3 +755,179 @@ object GeneralizedLinearRegressionModel extends 
MLReadable[GeneralizedLinearRegr
 }
   }
 }
+
+/**
+ * :: Experimental ::
+ * Summarizing Generalized Linear regression Fits.
+ *
+ * @param predictions predictions outputted by the model's `transform` 
method
+ * @param predictionCol field in "predictions" which gives the prediction 
value of each instance
+ * @param family the family object of the model
+ * @param link the link object of the model
+ * @param model the model that should be summarized
+ * @param diagInvAtWA diagonal of matrix (A^T * W * A)^-1 in the last 
iteration
+ * @param numIterations number of iterations
+ */
+@Since("2.0.0")
+@Experimental
+class GeneralizedLinearRegressionSummary private[regression] (
+@transient val predictions: DataFrame,
+val predictionCol: String,
+val family: GeneralizedLinearRegression.Family,
--- End diff --

This exposes private APIs in a public member. We should use `String` 
instead.





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11694#discussion_r56056648
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -633,3 +755,179 @@ object GeneralizedLinearRegressionModel extends 
MLReadable[GeneralizedLinearRegr
 }
   }
 }
+
+/**
+ * :: Experimental ::
+ * Summarizing Generalized Linear regression Fits.
+ *
+ * @param predictions predictions outputted by the model's `transform` 
method
+ * @param predictionCol field in "predictions" which gives the prediction 
value of each instance
+ * @param family the family object of the model
+ * @param link the link object of the model
+ * @param model the model that should be summarized
+ * @param diagInvAtWA diagonal of matrix (A^T * W * A)^-1 in the last 
iteration
+ * @param numIterations number of iterations
+ */
+@Since("2.0.0")
+@Experimental
+class GeneralizedLinearRegressionSummary private[regression] (
+@transient val predictions: DataFrame,
+val predictionCol: String,
+val family: GeneralizedLinearRegression.Family,
+val link: GeneralizedLinearRegression.Link,
--- End diff --

ditto





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11694#discussion_r56056465
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -573,6 +660,41 @@ class GeneralizedLinearRegressionModel private[ml] (
 familyAndLink.fitted(eta)
   }
 
+  private var trainingSummary: Option[GeneralizedLinearRegressionSummary] 
= None
+
+  /**
+   * Gets R-like summary of model on training set. An exception is
+   * thrown if `trainingSummary == None`.
+   */
+  @Since("2.0.0")
+  def summary: GeneralizedLinearRegressionSummary = trainingSummary match {
+case Some(summ) => summ
--- End diff --

~~~scala
trainingSummary.getOrElse {
  throw new Exception(...)
}
~~~





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11694#discussion_r56056435
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -318,6 +339,13 @@ object GeneralizedLinearRegression extends 
DefaultParamsReadable[GeneralizedLine
 /** The variance of the endogenous variable's mean, given the value 
mu. */
 def variance(mu: Double): Double
 
+/** Deviance of (y, mu) pair. */
+def deviance(y: Double, mu: Double, weight: Double): Double
+
+/** Akaike's 'An Information Criterion'(AIC) value of the family. */
+def aic(predictions: RDD[(Double, Double, Double)], deviance: Double,
--- End diff --

* document params, especially `predictions` (y, mu, weight)
* chop down args





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11694#discussion_r56056490
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -573,6 +660,41 @@ class GeneralizedLinearRegressionModel private[ml] (
 familyAndLink.fitted(eta)
   }
 
+  private var trainingSummary: Option[GeneralizedLinearRegressionSummary] 
= None
+
+  /**
+   * Gets R-like summary of model on training set. An exception is
+   * thrown if `trainingSummary == None`.
+   */
+  @Since("2.0.0")
+  def summary: GeneralizedLinearRegressionSummary = trainingSummary match {
+case Some(summ) => summ
+case None =>
+  throw new SparkException(
+"No training summary available for this 
GeneralizedLinearRegressionModel",
+new NullPointerException())
--- End diff --

This is not a `NullPointerException`. We can just leave it as a 
`RuntimeException` for now.
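
One way the accessor might look with a plain `RuntimeException` and no wrapped cause. This is a sketch under assumptions: the class is a hypothetical stand-in that stores a `String` rather than a `GeneralizedLinearRegressionSummary`, and it uses `getOrElse` in place of the pattern match in the diff.

```scala
// Hypothetical stand-in for the model; the summary is a plain String here.
class SummaryHolderSketch {
  private var trainingSummary: Option[String] = None

  def setSummary(summary: String): this.type = {
    this.trainingSummary = Some(summary)
    this
  }

  // Returns the stored summary, or throws if none has been set.
  def summary: String = trainingSummary.getOrElse {
    throw new RuntimeException("No training summary available for this model")
  }
}
```

Because `throw` has type `Nothing`, the `getOrElse` block type-checks against any summary type.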





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11694#discussion_r56056414
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala
 ---
@@ -103,6 +107,7 @@ private[ml] class IterativelyReweightedLeastSquares(
 
 }
 
-new IterativelyReweightedLeastSquaresModel(model.coefficients, 
model.intercept)
+new IterativelyReweightedLeastSquaresModel(model.coefficients,
+  model.intercept, model.diagInvAtWA, iter)
--- End diff --

~~~scala
new IterativelyReweightedLeastSquaresModel(
  model.coefficients, model.intercept, model.diagInvAtWA, iter)
~~~





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11694#discussion_r56056445
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -378,6 +419,22 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
 
 override def variance(mu: Double): Double = mu * (1.0 - mu)
 
+override def deviance(y: Double, mu: Double, weight: Double): Double = {
+  val my = 1.0 - y
+  2.0 * weight * (y * math.log(math.max(y, 1.0) / mu) +
+my * math.log(math.max(my, 1.0) / (1.0 - mu)))
+}
+
+override def aic(
+predictions: RDD[(Double, Double, Double)],
+deviance: Double,
+numInstances: Double,
+weightSum: Double): Double = {
+  -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
+weight * breeze.stats.distributions.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
--- End diff --

import `distributions` as `dist` and use `dist.Binomial` here?
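
The two quantities in the diff above can be sketched without Breeze or Spark. This is a minimal standard-library illustration, not the code from the PR: `BinomialAicSketch`, `logProb`, and `minusTwoLogLik` are hypothetical names, and a local `Seq` stands in for the `RDD`. It assumes every label is 0 or 1, in which case the log-probability under `Binomial(1, mu)` reduces to `log(mu)` or `log(1 - mu)`.

```scala
object BinomialAicSketch {
  // Log-probability of k in {0, 1} under Bernoulli(mu), i.e. Binomial(1, mu).
  // Assumption: this hand-written form replaces the Breeze call in the diff.
  def logProb(k: Int, mu: Double): Double =
    if (k == 1) math.log(mu) else math.log(1.0 - mu)

  // -2 * weighted log-likelihood over (y, mu, weight) triples.
  // A local Seq is used here in place of the RDD from the diff.
  def minusTwoLogLik(predictions: Seq[(Double, Double, Double)]): Double =
    -2.0 * predictions.map { case (y, mu, weight) =>
      weight * logProb(math.round(y).toInt, mu)
    }.sum

  // Weighted binomial unit deviance, transcribed from the quoted diff.
  def deviance(y: Double, mu: Double, weight: Double): Double = {
    val my = 1.0 - y
    2.0 * weight * (y * math.log(math.max(y, 1.0) / mu) +
      my * math.log(math.max(my, 1.0) / (1.0 - mu)))
  }
}
```

For 0/1 labels the saturated model has log-likelihood zero, so the summed deviance and `minusTwoLogLik` agree up to rounding.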





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11694#discussion_r56056415
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -17,7 +17,7 @@
 
 package org.apache.spark.ml.regression
 
-import breeze.stats.distributions.{Gaussian => GD}
+import breeze.stats.distributions.{Gaussian => GD, StudentsT}
--- End diff --

We can import `distributions` as `dist` then use `dist.Abc` in the code. 
See comments below.
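
The suggestion relies on Scala's rename-on-import syntax, which works for packages as well as classes. A minimal sketch follows, with `scala.collection.mutable` standing in for `breeze.stats.distributions` so that it runs without Breeze on the classpath. `ImportAliasSketch` and `squares` are hypothetical names; the Breeze form would presumably be `import breeze.stats.{distributions => dist}` followed by `dist.Binomial(...)`.

```scala
// Rename a package on import, then reach its members through the alias.
// Assumption: `mutable` is only a stand-in for the Breeze package here.
import scala.collection.{mutable => mut}

object ImportAliasSketch {
  // Builds the first n squares using the aliased package name.
  def squares(n: Int): mut.ArrayBuffer[Int] = {
    val buf = mut.ArrayBuffer.empty[Int]
    (1 to n).foreach(i => buf += i * i)
    buf
  }
}
```

The alias keeps the import line short and makes the origin of each distribution explicit at the call site.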





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11694#issuecomment-196244057
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11694#issuecomment-196244058
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53067/
Test PASSed.





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11694#issuecomment-196243699
  
**[Test build #53067 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53067/consoleFull)** for PR 11694 at commit [`fba1112`](https://github.com/apache/spark/commit/fba11123984af8b0e2482281e39c60df59170306).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11694#issuecomment-196228660
  
**[Test build #53067 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53067/consoleFull)** for PR 11694 at commit [`fba1112`](https://github.com/apache/spark/commit/fba11123984af8b0e2482281e39c60df59170306).





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11694#issuecomment-196223229
  
**[Test build #53066 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53066/consoleFull)** for PR 11694 at commit [`5d4c87b`](https://github.com/apache/spark/commit/5d4c87b9ba9a88f2f8aee985e5f918bf02600ae9).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11694#issuecomment-196223237
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11694#issuecomment-196223240
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53066/
Test FAILed.





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11694#issuecomment-196222428
  
**[Test build #53066 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53066/consoleFull)** for PR 11694 at commit [`5d4c87b`](https://github.com/apache/spark/commit/5d4c87b9ba9a88f2f8aee985e5f918bf02600ae9).





[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...

2016-03-14 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/11694

[SPARK-9837] [ML] R-like summary statistics for GLMs via iteratively 
reweighted least squares

## What changes were proposed in this pull request?
Provide R-like summary statistics for GLMs via iteratively reweighted least 
squares.
## How was this patch tested?
unit tests.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark spark-9837

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11694.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11694


commit 35063ab150c46989a8f1a78732fa9a25a98eb9f9
Author: Yanbo Liang 
Date:   2016-03-10T09:53:47Z

Initial version of GeneralizedLinearRegressionSummary

commit 6bb7cbe2dfb38220c4912eb333c6ae88c932754d
Author: Yanbo Liang 
Date:   2016-03-11T10:19:05Z

add test cases & update API doc

commit 5d4c87b9ba9a88f2f8aee985e5f918bf02600ae9
Author: Yanbo Liang 
Date:   2016-03-14T08:39:54Z

update AIC calculation



