[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/11694

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11694#issuecomment-197159610

    LGTM. Merged into master. Thanks!
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11694#issuecomment-197152626

    Test PASSed.
    Refer to this link for build results (access rights to CI server needed):
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53269/
    Test PASSed.
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11694#issuecomment-197152624

    Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11694#issuecomment-197152557

    **[Test build #53269 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53269/consoleFull)** for PR 11694 at commit [`f89cdf0`](https://github.com/apache/spark/commit/f89cdf01d94faab7d6f9372df35033afe695f8aa).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11694#issuecomment-197143000

    **[Test build #53269 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53269/consoleFull)** for PR 11694 at commit [`f89cdf0`](https://github.com/apache/spark/commit/f89cdf01d94faab7d6f9372df35033afe695f8aa).
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11694#issuecomment-196481131

    I made one pass and left some minor comments inline. This looks great overall!
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11694#discussion_r56058258

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -348,7 +376,20 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine

         override def initialize(y: Double, weight: Double): Double = y

    -    def variance(mu: Double): Double = 1.0
    +    override def variance(mu: Double): Double = 1.0
    +
    +    override def deviance(y: Double, mu: Double, weight: Double): Double = {
    +      weight * math.pow(y - mu, 2.0)
    --- End diff --

    `(y - mu) * (y - mu)`, which is faster
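A minimal sketch of the suggestion above, for illustration only: it computes the weighted squared error once with `math.pow` (as in the quoted diff) and once with a plain multiplication (as the reviewer proposes). The object and method names are hypothetical and not part of the pull request; the speed claim is the reviewer's and is not measured here.

```scala
// Illustrative sketch; `GaussianDevianceSketch` is a hypothetical name,
// not a class from the pull request.
object GaussianDevianceSketch {

  /** Weighted squared error using math.pow, as in the quoted diff. */
  def withPow(y: Double, mu: Double, weight: Double): Double =
    weight * math.pow(y - mu, 2.0)

  /** The same quantity using a plain multiplication, as the reviewer suggests. */
  def withMultiply(y: Double, mu: Double, weight: Double): Double = {
    val diff = y - mu
    weight * (diff * diff)
  }
}
```

Both methods are meant to return the same value for the same inputs; only the way the square is computed differs.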
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11694#discussion_r56058139

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -466,6 +468,461 @@ class GeneralizedLinearRegressionSuite
         }
       }

    +  test("glm summary: gaussian family with weight") {
    +    /*
    +       R code:
    +
    +       A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
    +       b <- c(17, 19, 23, 29)
    +       w <- c(1, 2, 3, 4)
    +       df <- as.data.frame(cbind(A, b))
    +     */
    +    val datasetWithWeight = sqlContext.createDataFrame(sc.parallelize(Seq(
    +      Instance(17.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
    +      Instance(19.0, 2.0, Vectors.dense(1.0, 7.0)),
    +      Instance(23.0, 3.0, Vectors.dense(2.0, 11.0)),
    +      Instance(29.0, 4.0, Vectors.dense(3.0, 13.0))
    +    ), 2))
    +    /*
    +       R code:
    +
    +       model <- glm(formula = "b ~ .", family="gaussian", data = df, weights = w)
    +       summary(model)
    +
    +       Deviance Residuals:
    +            1       2       3       4
    +        1.920  -1.358  -1.109   0.960
    +
    +       Coefficients:
    +                   Estimate Std. Error t value Pr(>|t|)
    +       (Intercept)   18.080      9.608   1.882    0.311
    +       V1             6.080      5.556   1.094    0.471
    +       V2            -0.600      1.960  -0.306    0.811
    +
    +       (Dispersion parameter for gaussian family taken to be 7.68)
    +
    +           Null deviance: 202.00 on 3 degrees of freedom
    +       Residual deviance: 7.68 on 1 degrees of freedom
    +       AIC: 18.783
    +
    +       Number of Fisher Scoring iterations: 2
    +
    +       residuals(model, type="pearson")
    +           1          2          3          4
    +        1.92  -1.357645  -1.108513       0.96
    +
    +       residuals(model, type="working")
    +           1      2      3      4
    +        1.92  -0.96  -0.64   0.48
    +
    +       residuals(model, type="response")
    +           1      2      3      4
    +        1.92  -0.96  -0.64   0.48
    +     */
    +    val trainer = new GeneralizedLinearRegression()
    +      .setWeightCol("weight")
    +
    +    val model = trainer.fit(datasetWithWeight)
    +
    +    val coefficientsR = Vectors.dense(Array(6.080, -0.600))
    +    val interceptR = 18.080
    +    val devianceResidualsR = Array(1.920, -1.358, -1.109, 0.960)
    +    val pearsonResidualsR = Array(1.92, -1.357645, -1.108513, 0.96)
    +    val workingResidualsR = Array(1.92, -0.96, -0.64, 0.48)
    +    val responseResidualsR = Array(1.92, -0.96, -0.64, 0.48)
    +    val seCoefR = Array(5.556, 1.960, 9.608)
    +    val tValsR = Array(1.094, -0.306, 1.882)
    +    val pValsR = Array(0.471, 0.811, 0.311)
    +    val dispersionR = 7.68
    +    val nullDevianceR = 202.00
    +    val residualDevianceR = 7.68
    +    val residualDegreeOfFreedomNullR = 3
    +    val residualDegreeOfFreedomR = 1
    +    val aicR = 18.783
    +
    +    val summary = model.summary
    +
    +    val devianceResiduals = summary.residuals()
    +      .select(col("devianceResiduals"))
    +      .collect()
    +      .map(_.getAs[Double](0))
    --- End diff --

    `_.getDouble(0)`
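As a sanity check on the R output quoted in the test above, the Pearson residuals and the residual deviance can be recomputed from the response residuals and the weights `w <- c(1, 2, 3, 4)`. This sketch assumes the Gaussian family, whose variance function is the constant 1.0, so the Pearson residual reduces to `(y - mu) * sqrt(w)` and each deviance contribution to `w * (y - mu)^2`. `ResidualCheck` is a hypothetical name and the code does not use Spark.

```scala
// Illustrative sketch; recomputes values from the R output quoted in the test.
object ResidualCheck {
  // Weights and response residuals (y - mu) copied from the quoted R session.
  val weights: Array[Double] = Array(1.0, 2.0, 3.0, 4.0)
  val response: Array[Double] = Array(1.92, -0.96, -0.64, 0.48)

  // Gaussian family: variance(mu) = 1.0, so pearson = (y - mu) * sqrt(w).
  val pearson: Array[Double] =
    response.zip(weights).map { case (r, w) => r * math.sqrt(w) }

  // Residual deviance: sum over instances of w * (y - mu)^2.
  val residualDeviance: Double =
    response.zip(weights).map { case (r, w) => w * r * r }.sum
}
```

Under these assumptions the recomputed values agree with the quoted `residuals(model, type="pearson")` row and with the reported residual deviance of 7.68, up to the rounding shown in the R output.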
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11694#discussion_r56058081

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -633,3 +755,179 @@ object GeneralizedLinearRegressionModel extends MLReadable[GeneralizedLinearRegr
         }
       }
     }
    +
    +/**
    + * :: Experimental ::
    + * Summarizing Generalized Linear regression Fits.
    + *
    + * @param predictions predictions outputted by the model's `transform` method
    + * @param predictionCol field in "predictions" which gives the prediction value of each instance
    + * @param family the family object of the model
    + * @param link the link object of the model
    + * @param model the model that should be summarized
    + * @param diagInvAtWA diagonal of matrix (A^T * W * A)^-1 in the last iteration
    + * @param numIterations number of iterations
    + */
    +@Since("2.0.0")
    +@Experimental
    +class GeneralizedLinearRegressionSummary private[regression] (
    +    @transient val predictions: DataFrame,
    +    val predictionCol: String,
    +    val family: GeneralizedLinearRegression.Family,
    +    val link: GeneralizedLinearRegression.Link,
    +    val model: GeneralizedLinearRegressionModel,
    +    private val diagInvAtWA: Array[Double],
    +    val numIterations: Int) extends Serializable {
    +
    +  import GeneralizedLinearRegression._
    +
    +  /** Number of instances in DataFrame predictions */
    +  lazy val numInstances: Long = predictions.count()
    +
    +  /** The numeric rank of the fitted linear model */
    +  lazy val rank: Long = if (model.getFitIntercept) {
    +    model.coefficients.size + 1
    +  } else {
    +    model.coefficients.size
    +  }
    +
    +  /** Degrees of freedom */
    +  lazy val degreesOfFreedom: Long = {
    +    numInstances - rank
    +  }
    +
    +  /** The residual degrees of freedom */
    +  lazy val residualDegreeOfFreedom: Long = degreesOfFreedom
    +
    +  /** The residual degrees of freedom for the null model */
    +  lazy val residualDegreeOfFreedomNull: Long = if (model.getFitIntercept) {
    +    numInstances - 1
    +  } else {
    +    numInstances
    +  }
    +
    +  private lazy val devianceResiduals: DataFrame = {
    +    val drUDF = udf { (y: Double, mu: Double, weight: Double) =>
    +      val r = math.sqrt(math.max(family.deviance(y, mu, weight), 0.0))
    +      if (y > mu) r else -1.0 * r
    +    }
    +    val w = if (model.getWeightCol.isEmpty) lit(1.0) else col(model.getWeightCol)
    +    predictions.select(
    +      drUDF(col(model.getLabelCol), col(predictionCol), w).as("devianceResiduals"))
    +  }
    +
    +  private lazy val pearsonResiduals: DataFrame = {
    +    val prUDF = udf { mu: Double => family.variance(mu) }
    +    val w = if (model.getWeightCol.isEmpty) lit(1.0) else col(model.getWeightCol)
    +    predictions.select(col(model.getLabelCol).minus(col(predictionCol))
    +      .multiply(sqrt(w)).divide(sqrt(prUDF(col(predictionCol)))).as("pearsonResiduals"))
    +  }
    +
    +  private lazy val workingResiduals: DataFrame = {
    +    val wrUDF = udf { (y: Double, mu: Double) => (y - mu) * link.deriv(mu) }
    +    predictions.select(wrUDF(col(model.getLabelCol), col(predictionCol)).as("workingResiduals"))
    +  }
    +
    +  private lazy val responseResiduals: DataFrame = {
    +    predictions.select(col(model.getLabelCol).minus(col(predictionCol)).as("responseResiduals"))
    +  }
    +
    +  /**
    +   * Get the residuals of the fitted model by type.
    +   * @param residualsType The type of residuals which should be returned.
    +   *                      Supported options: deviance(default), pearson, working and response.
    +   */
    +  def residuals(residualsType: String = "deviance"): DataFrame = {
    +    residualsType match {
    +      case "deviance" => devianceResiduals
    +      case "pearson" => pearsonResiduals
    +      case "working" => workingResiduals
    +      case "response" => responseResiduals
    +      case other => throw new UnsupportedOperationException(
    +        s"The residuals type $other is not supported by Generalized Linear Regression.")
    +    }
    +  }
    +
    +  /**
    +   * The deviance for the null model.
    +   */
    +  lazy val nullDeviance: Double = {
    +    val w = if (model.getWeightCol.isEmpty) lit(1.0) else col(model.getWeightCol)
    +    val wtdmu: Double = if (model.getFitIntercept) {
    +      val agg = predictions.agg(sum(w.multiply(col(model.getLabelCol))), sum(w)).first()
    +      agg.getDouble(0) / agg.getDouble(1)
    +    } else {
    +      link.unlink(0.0)
    +    }
    +    predictions.select(col(model.getLabelCol), w).rdd.map {
    +      case Row(y: Double, weight: Double) =>
    +        family.deviance(y, wtdmu, weight)
    +    }.sum()
    +  }
    +
    +  /**
    +   * The dev
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11694#discussion_r56057691

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -633,3 +755,179 @@ object GeneralizedLinearRegressionModel extends MLReadable[GeneralizedLinearRegr
         }
       }
     }
    +
    +/**
    + * :: Experimental ::
    + * Summarizing Generalized Linear regression Fits.
    + *
    + * @param predictions predictions outputted by the model's `transform` method
    + * @param predictionCol field in "predictions" which gives the prediction value of each instance
    + * @param family the family object of the model
    + * @param link the link object of the model
    + * @param model the model that should be summarized
    + * @param diagInvAtWA diagonal of matrix (A^T * W * A)^-1 in the last iteration
    + * @param numIterations number of iterations
    + */
    +@Since("2.0.0")
    +@Experimental
    +class GeneralizedLinearRegressionSummary private[regression] (
    +    @transient val predictions: DataFrame,
    +    val predictionCol: String,
    +    val family: GeneralizedLinearRegression.Family,
    +    val link: GeneralizedLinearRegression.Link,
    +    val model: GeneralizedLinearRegressionModel,
    +    private val diagInvAtWA: Array[Double],
    +    val numIterations: Int) extends Serializable {
    +
    +  import GeneralizedLinearRegression._
    +
    +  /** Number of instances in DataFrame predictions */
    +  lazy val numInstances: Long = predictions.count()
    +
    +  /** The numeric rank of the fitted linear model */
    +  lazy val rank: Long = if (model.getFitIntercept) {
    +    model.coefficients.size + 1
    +  } else {
    +    model.coefficients.size
    +  }
    +
    +  /** Degrees of freedom */
    +  lazy val degreesOfFreedom: Long = {
    +    numInstances - rank
    +  }
    +
    +  /** The residual degrees of freedom */
    +  lazy val residualDegreeOfFreedom: Long = degreesOfFreedom
    +
    +  /** The residual degrees of freedom for the null model */
    +  lazy val residualDegreeOfFreedomNull: Long = if (model.getFitIntercept) {
    +    numInstances - 1
    +  } else {
    +    numInstances
    +  }
    +
    +  private lazy val devianceResiduals: DataFrame = {
    +    val drUDF = udf { (y: Double, mu: Double, weight: Double) =>
    +      val r = math.sqrt(math.max(family.deviance(y, mu, weight), 0.0))
    +      if (y > mu) r else -1.0 * r
    +    }
    +    val w = if (model.getWeightCol.isEmpty) lit(1.0) else col(model.getWeightCol)
    +    predictions.select(
    +      drUDF(col(model.getLabelCol), col(predictionCol), w).as("devianceResiduals"))
    +  }
    +
    +  private lazy val pearsonResiduals: DataFrame = {
    +    val prUDF = udf { mu: Double => family.variance(mu) }
    +    val w = if (model.getWeightCol.isEmpty) lit(1.0) else col(model.getWeightCol)
    +    predictions.select(col(model.getLabelCol).minus(col(predictionCol))
    +      .multiply(sqrt(w)).divide(sqrt(prUDF(col(predictionCol)))).as("pearsonResiduals"))
    +  }
    +
    +  private lazy val workingResiduals: DataFrame = {
    +    val wrUDF = udf { (y: Double, mu: Double) => (y - mu) * link.deriv(mu) }
    +    predictions.select(wrUDF(col(model.getLabelCol), col(predictionCol)).as("workingResiduals"))
    +  }
    +
    +  private lazy val responseResiduals: DataFrame = {
    +    predictions.select(col(model.getLabelCol).minus(col(predictionCol)).as("responseResiduals"))
    +  }
    +
    +  /**
    +   * Get the residuals of the fitted model by type.
    +   * @param residualsType The type of residuals which should be returned.
    +   *                      Supported options: deviance(default), pearson, working and response.
    +   */
    +  def residuals(residualsType: String = "deviance"): DataFrame = {
    --- End diff --

    We shall not use default values for Java compatibility. Overload `residuals` instead.
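The overload the reviewer asks for can be sketched roughly as follows. Scala compiles a default argument into a synthetic helper method, so a Java caller cannot simply omit the argument; an explicit no-argument overload gives Java the same convenience. `ResidualsSketch` is a hypothetical stand-in that returns strings rather than DataFrames, so it runs without Spark; it is not the code that was merged.

```scala
// Illustrative sketch; `ResidualsSketch` is a hypothetical stand-in for the
// summary class and returns a label instead of a DataFrame.
class ResidualsSketch {

  /** No-argument overload: replaces the default value "deviance". */
  def residuals(): String = residuals("deviance")

  /** Explicit overload taking the residuals type. */
  def residuals(residualsType: String): String = residualsType match {
    case "deviance" | "pearson" | "working" | "response" =>
      s"$residualsType residuals"
    case other =>
      throw new UnsupportedOperationException(
        s"The residuals type $other is not supported.")
  }
}
```

With this shape both `summary.residuals()` and `summary.residuals("pearson")` are ordinary method calls from Scala and from Java.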
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11694#discussion_r56057316

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -432,6 +503,22 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine

         override def variance(mu: Double): Double = math.pow(mu, 2.0)

    +    override def deviance(y: Double, mu: Double, weight: Double): Double = {
    +      val x = if (y == 0.0) 1.0 else y / mu
    --- End diff --

    When would `y == 0.0` happen? If this is not feasible in Gamma family, we should throw an error. In the current setting, this method returns `-2.0 * weight` when `y == 0.0`.
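To make the reviewer's arithmetic concrete, here is a small sketch. The quoted diff is cut off after the first line of the method, so the second line of `devianceAsQuoted` is an assumption: it uses the textbook Gamma unit deviance, `-2 * weight * (log(y / mu) - (y - mu) / mu)`, which is consistent with the reviewer's remark that the result is `-2.0 * weight` at `y == 0.0`. `devianceStrict` shows one hypothetical way to "throw an error" instead; the use of `require` is a choice made for this sketch.

```scala
// Illustrative sketch; the formula after the first line is assumed, since the
// quoted diff is truncated there.
object GammaDevianceSketch {

  /** First line as in the quoted diff; the rest is the assumed Gamma deviance. */
  def devianceAsQuoted(y: Double, mu: Double, weight: Double): Double = {
    val x = if (y == 0.0) 1.0 else y / mu
    // With y == 0.0: log(1.0) = 0.0 and (0.0 - mu) / mu = -1.0,
    // so the result is -2.0 * weight * (0.0 + 1.0) = -2.0 * weight.
    -2.0 * weight * (math.log(x) - (y - mu) / mu)
  }

  /** Hypothetical alternative that rejects non-positive labels up front. */
  def devianceStrict(y: Double, mu: Double, weight: Double): Double = {
    require(y > 0.0, s"Gamma family expects a positive label but got $y")
    -2.0 * weight * (math.log(y / mu) - (y - mu) / mu)
  }
}
```

A negative deviance contribution is what makes the special case suspicious: a deviance term is expected to be non-negative, which is presumably why the reviewer prefers failing fast.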
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11694#discussion_r56056935

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -405,6 +462,20 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine

         override def variance(mu: Double): Double = mu

    +    override def deviance(y: Double, mu: Double, weight: Double): Double = {
    +      2 * weight * (y * math.log(y / mu) - (y - mu))
    --- End diff --

    `2.0` (just to be consistent)
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11694#discussion_r56056640

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -573,6 +660,41 @@ class GeneralizedLinearRegressionModel private[ml] (
         familyAndLink.fitted(eta)
       }

    +  private var trainingSummary: Option[GeneralizedLinearRegressionSummary] = None
    +
    +  /**
    +   * Gets R-like summary of model on training set. An exception is
    +   * thrown if `trainingSummary == None`.
    +   */
    +  @Since("2.0.0")
    +  def summary: GeneralizedLinearRegressionSummary = trainingSummary match {
    +    case Some(summ) => summ
    +    case None =>
    +      throw new SparkException(
    +        "No training summary available for this GeneralizedLinearRegressionModel",
    +        new NullPointerException())
    +  }
    +
    +  private[regression] def setSummary(summary: GeneralizedLinearRegressionSummary): this.type = {
    +    this.trainingSummary = Some(summary)
    +    this
    +  }
    +
    +  /**
    +   * If the prediction column is set returns the current model and prediction column,
    +   * otherwise generates a new column and sets it as the prediction column on a new copy
    +   * of the current model.
    +   */
    +  private[regression] def findSummaryModelAndPredictionCol(): (
    --- End diff --

    move `: (` to the next line
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11694#discussion_r56056645

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -633,3 +755,179 @@ object GeneralizedLinearRegressionModel extends MLReadable[GeneralizedLinearRegr
         }
       }
     }
    +
    +/**
    + * :: Experimental ::
    + * Summarizing Generalized Linear regression Fits.
    + *
    + * @param predictions predictions outputted by the model's `transform` method
    + * @param predictionCol field in "predictions" which gives the prediction value of each instance
    + * @param family the family object of the model
    + * @param link the link object of the model
    + * @param model the model that should be summarized
    + * @param diagInvAtWA diagonal of matrix (A^T * W * A)^-1 in the last iteration
    + * @param numIterations number of iterations
    + */
    +@Since("2.0.0")
    +@Experimental
    +class GeneralizedLinearRegressionSummary private[regression] (
    +    @transient val predictions: DataFrame,
    +    val predictionCol: String,
    +    val family: GeneralizedLinearRegression.Family,
    --- End diff --

    This exposes private APIs in a public member. We should use `String` instead.
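One way to read this suggestion, sketched under assumptions: the public constructor parameter carries only the family name as a `String`, and the internal family object is resolved privately from that name. Every name below (`SummarySketch`, `Family`, `fromName`, `describe`) is hypothetical and the two families listed are only examples; this is not the code from the pull request.

```scala
// Illustrative sketch; all names here are hypothetical.
object SummarySketch {
  // Stand-in for an internal hierarchy that should not leak into public members.
  private sealed trait Family { def name: String }
  private case object Gaussian extends Family { val name = "gaussian" }
  private case object Poisson extends Family { val name = "poisson" }

  private def fromName(name: String): Family = name match {
    case "gaussian" => Gaussian
    case "poisson" => Poisson
    case other => throw new IllegalArgumentException(s"Unknown family: $other")
  }

  /** Public class: callers see the family name, never the internal type. */
  final class Summary(val family: String) {
    // The internal object is resolved lazily and stays private.
    private lazy val familyObj: Family = fromName(family)

    def describe: String = s"family=${familyObj.name}"
  }
}
```

The public surface then consists only of `String` values, so the internal hierarchy can change without affecting callers.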
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11694#discussion_r56056648

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -633,3 +755,179 @@ object GeneralizedLinearRegressionModel extends MLReadable[GeneralizedLinearRegr
         }
       }
     }
    +
    +/**
    + * :: Experimental ::
    + * Summarizing Generalized Linear regression Fits.
    + *
    + * @param predictions predictions outputted by the model's `transform` method
    + * @param predictionCol field in "predictions" which gives the prediction value of each instance
    + * @param family the family object of the model
    + * @param link the link object of the model
    + * @param model the model that should be summarized
    + * @param diagInvAtWA diagonal of matrix (A^T * W * A)^-1 in the last iteration
    + * @param numIterations number of iterations
    + */
    +@Since("2.0.0")
    +@Experimental
    +class GeneralizedLinearRegressionSummary private[regression] (
    +    @transient val predictions: DataFrame,
    +    val predictionCol: String,
    +    val family: GeneralizedLinearRegression.Family,
    +    val link: GeneralizedLinearRegression.Link,
    --- End diff --

    ditto
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11694#discussion_r56056465

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -573,6 +660,41 @@ class GeneralizedLinearRegressionModel private[ml] (
         familyAndLink.fitted(eta)
       }

    +  private var trainingSummary: Option[GeneralizedLinearRegressionSummary] = None
    +
    +  /**
    +   * Gets R-like summary of model on training set. An exception is
    +   * thrown if `trainingSummary == None`.
    +   */
    +  @Since("2.0.0")
    +  def summary: GeneralizedLinearRegressionSummary = trainingSummary match {
    +    case Some(summ) => summ
    --- End diff --

    ~~~scala
    trainingSummary.getOrElse {
      throw new Exception(...)
    }
    ~~~
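A runnable sketch of the `getOrElse` pattern suggested above. `ModelSketch` is a hypothetical stand-in whose summary is a plain `String`, and the choice of `RuntimeException` is an assumption made for this sketch, since the reviewer's snippet leaves the exception unspecified.

```scala
// Illustrative sketch; `ModelSketch` is a hypothetical stand-in for the model class.
class ModelSketch {
  private var trainingSummary: Option[String] = None

  /** Stores the summary and returns this instance for chaining. */
  def setSummary(summary: String): this.type = {
    trainingSummary = Some(summary)
    this
  }

  /** Returns the stored summary, or throws if none has been set. */
  def summary: String = trainingSummary.getOrElse {
    throw new RuntimeException("No training summary available for this model")
  }
}
```

Because `throw` has type `Nothing`, the block passed to `getOrElse` type-checks as a `String`, and the explicit `match` with `Some` and `None` cases becomes unnecessary.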
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11694#discussion_r56056435

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -318,6 +339,13 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
         /** The variance of the endogenous variable's mean, given the value mu. */
         def variance(mu: Double): Double

    +    /** Deviance of (y, mu) pair. */
    +    def deviance(y: Double, mu: Double, weight: Double): Double
    +
    +    /** Akaike's 'An Information Criterion'(AIC) value of the family. */
    +    def aic(predictions: RDD[(Double, Double, Double)], deviance: Double,
    --- End diff --

    * document params, especially `predictions` (y, mu, weight)
    * chop down args
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11694#discussion_r56056490

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -573,6 +660,41 @@ class GeneralizedLinearRegressionModel private[ml] (
         familyAndLink.fitted(eta)
       }

    +  private var trainingSummary: Option[GeneralizedLinearRegressionSummary] = None
    +
    +  /**
    +   * Gets R-like summary of model on training set. An exception is
    +   * thrown if `trainingSummary == None`.
    +   */
    +  @Since("2.0.0")
    +  def summary: GeneralizedLinearRegressionSummary = trainingSummary match {
    +    case Some(summ) => summ
    +    case None =>
    +      throw new SparkException(
    +        "No training summary available for this GeneralizedLinearRegressionModel",
    +        new NullPointerException())
    --- End diff --

    This is not a `NullPointerException`. We can just leave it as a `RuntimeException` for now.
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11694#discussion_r56056414

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala ---
    @@ -103,6 +107,7 @@ private[ml] class IterativelyReweightedLeastSquares(

         }

    -    new IterativelyReweightedLeastSquaresModel(model.coefficients, model.intercept)
    +    new IterativelyReweightedLeastSquaresModel(model.coefficients,
    +      model.intercept, model.diagInvAtWA, iter)
    --- End diff --

    ~~~scala
    new IterativelyReweightedLeastSquaresModel(
      model.coefficients, model.intercept, model.diagInvAtWA, iter)
    ~~~
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11694#discussion_r56056445

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -378,6 +419,22 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine

         override def variance(mu: Double): Double = mu * (1.0 - mu)

    +    override def deviance(y: Double, mu: Double, weight: Double): Double = {
    +      val my = 1.0 - y
    +      2.0 * weight * (y * math.log(math.max(y, 1.0) / mu) +
    +        my * math.log(math.max(my, 1.0) / (1.0 - mu)))
    +    }
    +
    +    override def aic(
    +        predictions: RDD[(Double, Double, Double)],
    +        deviance: Double,
    +        numInstances: Double,
    +        weightSum: Double): Double = {
    +      -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
    +        weight * breeze.stats.distributions.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
    --- End diff --

    import `distributions` as `dist` and use `dist.Binomial` here?
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/11694#discussion_r56056415

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -17,7 +17,7 @@
 package org.apache.spark.ml.regression

-import breeze.stats.distributions.{Gaussian => GD}
+import breeze.stats.distributions.{Gaussian => GD, StudentsT}
--- End diff --

We can import `distributions` as `dist` then use `dist.Abc` in the code. See comments below.
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11694#issuecomment-196244057 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11694#issuecomment-196244058 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53067/ Test PASSed.
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11694#issuecomment-196243699

**[Test build #53067 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53067/consoleFull)** for PR 11694 at commit [`fba1112`](https://github.com/apache/spark/commit/fba11123984af8b0e2482281e39c60df59170306).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11694#issuecomment-196228660 **[Test build #53067 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53067/consoleFull)** for PR 11694 at commit [`fba1112`](https://github.com/apache/spark/commit/fba11123984af8b0e2482281e39c60df59170306).
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11694#issuecomment-196223229

**[Test build #53066 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53066/consoleFull)** for PR 11694 at commit [`5d4c87b`](https://github.com/apache/spark/commit/5d4c87b9ba9a88f2f8aee985e5f918bf02600ae9).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11694#issuecomment-196223237 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11694#issuecomment-196223240 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53066/ Test FAILed.
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11694#issuecomment-196222428 **[Test build #53066 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53066/consoleFull)** for PR 11694 at commit [`5d4c87b`](https://github.com/apache/spark/commit/5d4c87b9ba9a88f2f8aee985e5f918bf02600ae9).
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
GitHub user yanboliang opened a pull request:

    https://github.com/apache/spark/pull/11694

    [SPARK-9837] [ML] R-like summary statistics for GLMs via iteratively reweighted least squares

    ## What changes were proposed in this pull request?
    Provide R-like summary statistics for GLMs via iteratively reweighted least squares.

    ## How was this patch tested?
    unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanboliang/spark spark-9837

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11694.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11694

commit 35063ab150c46989a8f1a78732fa9a25a98eb9f9
Author: Yanbo Liang
Date:   2016-03-10T09:53:47Z

    Initial version of GeneralizedLinearRegressionSummary

commit 6bb7cbe2dfb38220c4912eb333c6ae88c932754d
Author: Yanbo Liang
Date:   2016-03-11T10:19:05Z

    add test cases & update API doc

commit 5d4c87b9ba9a88f2f8aee985e5f918bf02600ae9
Author: Yanbo Liang
Date:   2016-03-14T08:39:54Z

    update AIC calculation