[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15435 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r134628661 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1324,90 +1350,136 @@ private[ml] class MultiClassSummarizer extends Serializable { } /** - * Abstraction for multinomial Logistic Regression Training results. - * Currently, the training summary ignores the training weights except - * for the objective trace. - */ -sealed trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary { - - /** objective function (scaled loss + regularization) at each iteration. */ - def objectiveHistory: Array[Double] - - /** Number of training iterations until termination */ - def totalIterations: Int = objectiveHistory.length - -} - -/** - * Abstraction for Logistic Regression Results for a given model. + * Abstraction for logistic regression results for a given model. */ sealed trait LogisticRegressionSummary extends Serializable { /** * Dataframe output by the model's `transform` method. */ + @Since("2.3.0") --- End diff -- I see; I think this should work. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r134627545 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1324,90 +1354,147 @@ private[ml] class MultiClassSummarizer extends Serializable { } /** - * Abstraction for multinomial Logistic Regression Training results. - * Currently, the training summary ignores the training weights except - * for the objective trace. - */ -sealed trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary { - - /** objective function (scaled loss + regularization) at each iteration. */ - def objectiveHistory: Array[Double] - - /** Number of training iterations until termination */ - def totalIterations: Int = objectiveHistory.length - -} - -/** - * Abstraction for Logistic Regression Results for a given model. + * Abstraction for logistic regression results for a given model. */ sealed trait LogisticRegressionSummary extends Serializable { /** * Dataframe output by the model's `transform` method. */ + @Since("1.5.0") def predictions: DataFrame /** Field in "predictions" which gives the probability of each class as a vector. */ + @Since("1.5.0") def probabilityCol: String + /** Field in "predictions" which gives the prediction of each class. */ + @Since("2.3.0") + def predictionCol: String + /** Field in "predictions" which gives the true label of each instance (if available). */ + @Since("1.5.0") def labelCol: String /** Field in "predictions" which gives the features of each instance as a vector. */ + @Since("1.6.0") def featuresCol: String + @transient private val multiclassMetrics = { +new MulticlassMetrics( + predictions.select( +col(predictionCol), +col(labelCol).cast(DoubleType)) +.rdd.map { case Row(prediction: Double, label: Double) => (prediction, label) }) + } + + /** + * Returns the sequence of labels in ascending order --- End diff -- Clarify: "Returns the sequence of labels in ascending order. This order matches the order used in metrics which are specified as arrays over labels, e.g., truePositiveRateByLabel." --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r133895488 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1574,18 +1588,17 @@ sealed trait BinaryLogisticRegressionTrainingSummary extends BinaryLogisticRegre * @param objectiveHistory objective function (scaled loss + regularization) at each iteration. */ private class LogisticRegressionTrainingSummaryImpl( -override val predictions: DataFrame, -override val probabilityCol: String, -override val predictionCol: String, -override val labelCol: String, -override val featuresCol: String, -val objectiveHistory: Array[Double]) +predictions: DataFrame, +probabilityCol: String, +predictionCol: String, +labelCol: String, +featuresCol: String, +override val objectiveHistory: Array[Double]) --- End diff -- This `val` cannot be removed because in base class it is a `def` so override here is required. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r133895254 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1357,23 +1361,23 @@ sealed trait LogisticRegressionSummary extends Serializable { /** * Dataframe output by the model's `transform` method. */ - @Since("2.3.0") + @Since("1.5.0") def predictions: DataFrame /** Field in "predictions" which gives the probability of each class as a vector. */ - @Since("2.3.0") + @Since("1.5.0") def probabilityCol: String /** Field in "predictions" which gives the prediction of each class. */ @Since("2.3.0") def predictionCol: String /** Field in "predictions" which gives the true label of each instance (if available). */ - @Since("2.3.0") + @Since("1.5.0") def labelCol: String /** Field in "predictions" which gives the features of each instance as a vector. */ - @Since("2.3.0") + @Since("1.6.0") --- End diff -- current master code featuresCol is marked "1.6.0". I do not remove the `@since` into concrete impl class because they are all private. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r133883361 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1324,90 +1350,136 @@ private[ml] class MultiClassSummarizer extends Serializable { } /** - * Abstraction for multinomial Logistic Regression Training results. - * Currently, the training summary ignores the training weights except - * for the objective trace. - */ -sealed trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary { - - /** objective function (scaled loss + regularization) at each iteration. */ - def objectiveHistory: Array[Double] - - /** Number of training iterations until termination */ - def totalIterations: Int = objectiveHistory.length - -} - -/** - * Abstraction for Logistic Regression Results for a given model. + * Abstraction for logistic regression results for a given model. */ sealed trait LogisticRegressionSummary extends Serializable { /** * Dataframe output by the model's `transform` method. */ + @Since("2.3.0") --- End diff -- @jkbradley I think about this, currently the concrete impl class are marked as private, so it seems add @Since in `trait` is better (this is discussed with @sethah before) So what do you think about this: ``` sealed trait LogisticRegressionSummary extends Serializable { /** * Dataframe output by the model's `transform` method. */ @Since("1.5.0") def predictions: DataFrame /** Field in "predictions" which gives the probability of each class as a vector. */ @Since("1.5.0") def probabilityCol: String /** Field in "predictions" which gives the prediction of each class. */ @Since("2.3.0") def predictionCol: String /** Field in "predictions" which gives the true label of each instance (if available). */ @Since("1.5.0") def labelCol: String /** Field in "predictions" which gives the features of each instance as a vector. */ @Since("1.6.0") def featuresCol: String ... } ``` I also check current master code and `featuresCol` is marked "1.6.0", others marked "1.5.0", and "predictionCol" is new added so mark it "2.3.0" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r133844018 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1324,90 +1350,136 @@ private[ml] class MultiClassSummarizer extends Serializable { } /** - * Abstraction for multinomial Logistic Regression Training results. - * Currently, the training summary ignores the training weights except - * for the objective trace. - */ -sealed trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary { - - /** objective function (scaled loss + regularization) at each iteration. */ - def objectiveHistory: Array[Double] - - /** Number of training iterations until termination */ - def totalIterations: Int = objectiveHistory.length - -} - -/** - * Abstraction for Logistic Regression Results for a given model. + * Abstraction for logistic regression results for a given model. */ sealed trait LogisticRegressionSummary extends Serializable { /** * Dataframe output by the model's `transform` method. */ + @Since("2.3.0") --- End diff -- I confirmed: These annotations are inherited, so users may be confused about why some Since annotations have changed. E.g., BinaryLogisticRegressionSummary.featuresCol will change from 1.5.0 to 2.3.0. I'd just remove the ones in the trait. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r133828976 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1484,3 +1556,96 @@ class BinaryLogisticRegressionSummary private[classification] ( binaryMetrics.recallByThreshold().toDF("threshold", "recall") } } + +sealed trait BinaryLogisticRegressionTrainingSummary extends BinaryLogisticRegressionSummary + with LogisticRegressionTrainingSummary + +/** + * :: Experimental :: + * Multiclass logistic regression training results. + * + * @param predictions dataframe output by the model's `transform` method. + * @param probabilityCol field in "predictions" which gives the probability of + * each class as a vector. + * @param predictionCol field in "predictions" which gives the prediction for a data instance as a + * double. + * @param labelCol field in "predictions" which gives the true label of each instance. + * @param featuresCol field in "predictions" which gives the features of each instance as a vector. + * @param objectiveHistory objective function (scaled loss + regularization) at each iteration. + */ +private class LogisticRegressionTrainingSummaryImpl( +override val predictions: DataFrame, --- End diff -- I think you can leave out the "override val" since the value is passed to LogisticRegressionSummaryImpl, which provides it as a val. Same for the other vals provided by LogisticRegressionSummaryImpl. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r133771590 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -882,21 +882,28 @@ class LogisticRegression @Since("1.2.0") ( val model = copyValues(new LogisticRegressionModel(uid, coefficientMatrix, interceptVector, numClasses, isMultinomial)) -// TODO: implement summary model for multinomial case -val m = if (!isMultinomial) { - val (summaryModel, probabilityColName) = model.findSummaryModelAndProbabilityCol() - val logRegSummary = new BinaryLogisticRegressionTrainingSummary( + +val (summaryModel, probabilityColName, predictionColName) = model.findSummaryModel() +val logRegSummary = if (numClasses <=2) { --- End diff -- nit: space before "2" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r133772559 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1324,90 +1350,136 @@ private[ml] class MultiClassSummarizer extends Serializable { } /** - * Abstraction for multinomial Logistic Regression Training results. - * Currently, the training summary ignores the training weights except - * for the objective trace. - */ -sealed trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary { - - /** objective function (scaled loss + regularization) at each iteration. */ - def objectiveHistory: Array[Double] - - /** Number of training iterations until termination */ - def totalIterations: Int = objectiveHistory.length - -} - -/** - * Abstraction for Logistic Regression Results for a given model. + * Abstraction for logistic regression results for a given model. */ sealed trait LogisticRegressionSummary extends Serializable { /** * Dataframe output by the model's `transform` method. */ + @Since("2.3.0") --- End diff -- These are not new for this trait, so I'd remove the Since annotations. When in doubt, just put Since annotations in final or concrete classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r133830867 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -199,15 +199,57 @@ class LogisticRegressionSuite } } - test("empty probabilityCol") { -val lr = new LogisticRegression().setProbabilityCol("") -val model = lr.fit(smallBinaryDataset) -assert(model.hasSummary) -// Validate that we re-insert a probability column for evaluation -val fieldNames = model.summary.predictions.schema.fieldNames -assert(smallBinaryDataset.schema.fieldNames.toSet.subsetOf( - fieldNames.toSet)) -assert(fieldNames.exists(s => s.startsWith("probability_"))) + test("empty probabilityCol or predictionCol") { +val lr = new LogisticRegression().setMaxIter(1) +val datasetFieldNames = smallBinaryDataset.schema.fieldNames.toSet +def checkSummarySchema(model: LogisticRegressionModel, columns: Seq[String]): Unit = { + val fieldNames = model.summary.predictions.schema.fieldNames + assert(model.hasSummary) + assert(datasetFieldNames.subsetOf(fieldNames.toSet)) + columns.foreach { c => assert(fieldNames.exists(_.startsWith(c))) } +} +// check that the summary model adds the appropriate columns +Seq(("binomial", smallBinaryDataset), ("multinomial", smallMultinomialDataset)).foreach { + case (family, dataset) => +lr.setFamily(family) +lr.setProbabilityCol("").setPredictionCol("prediction") +val modelNoProb = lr.fit(smallBinaryDataset) +checkSummarySchema(modelNoProb, Seq("probability_")) + +lr.setProbabilityCol("probability").setPredictionCol("") +val modelNoPred = lr.fit(smallBinaryDataset) +checkSummarySchema(modelNoPred, Seq("prediction_")) + +lr.setProbabilityCol("").setPredictionCol("") +val modelNoPredNoProb = lr.fit(smallBinaryDataset) +checkSummarySchema(modelNoPredNoProb, Seq("prediction_", "probability_")) +} + } + + test("check summary types for binary and multiclass") { +val lr = new LogisticRegression() + .setFamily("binomial") + +val blorModel = lr.fit(smallBinaryDataset) + assert(blorModel.summary.isInstanceOf[BinaryLogisticRegressionTrainingSummaryImpl]) --- End diff -- nit: You can test for the public trait BinaryLogisticRegressionTrainingSummary instead of the private implementation class. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r133824486 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1324,90 +1350,136 @@ private[ml] class MultiClassSummarizer extends Serializable { } /** - * Abstraction for multinomial Logistic Regression Training results. - * Currently, the training summary ignores the training weights except - * for the objective trace. - */ -sealed trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary { - - /** objective function (scaled loss + regularization) at each iteration. */ - def objectiveHistory: Array[Double] - - /** Number of training iterations until termination */ - def totalIterations: Int = objectiveHistory.length - -} - -/** - * Abstraction for Logistic Regression Results for a given model. + * Abstraction for logistic regression results for a given model. */ sealed trait LogisticRegressionSummary extends Serializable { /** * Dataframe output by the model's `transform` method. */ + @Since("2.3.0") def predictions: DataFrame /** Field in "predictions" which gives the probability of each class as a vector. */ + @Since("2.3.0") def probabilityCol: String + /** Field in "predictions" which gives the prediction of each class. */ + @Since("2.3.0") + def predictionCol: String + /** Field in "predictions" which gives the true label of each instance (if available). */ + @Since("2.3.0") def labelCol: String /** Field in "predictions" which gives the features of each instance as a vector. */ + @Since("2.3.0") def featuresCol: String + @transient private val multiclassMetrics = { --- End diff -- MulticlassMetrics provides a ```labels``` field which returns the list of labels. In most cases, this will be values ```{0.0, 1.0, ..., numClasses-1}```. However, if the training set is missing a label, then all of the arrays over labels (e.g., from ```truePositiveRateByLabel```) will be of length numClasses-1 instead of the expected numClasses. In the future, it'd be nice to fix this by having them always be of length numClasses. For now, how about we provide the labels field with this kind of explanation? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r133829306 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1484,3 +1556,96 @@ class BinaryLogisticRegressionSummary private[classification] ( binaryMetrics.recallByThreshold().toDF("threshold", "recall") } } + +sealed trait BinaryLogisticRegressionTrainingSummary extends BinaryLogisticRegressionSummary + with LogisticRegressionTrainingSummary + +/** + * :: Experimental :: + * Multiclass logistic regression training results. + * + * @param predictions dataframe output by the model's `transform` method. + * @param probabilityCol field in "predictions" which gives the probability of + * each class as a vector. + * @param predictionCol field in "predictions" which gives the prediction for a data instance as a + * double. + * @param labelCol field in "predictions" which gives the true label of each instance. + * @param featuresCol field in "predictions" which gives the features of each instance as a vector. + * @param objectiveHistory objective function (scaled loss + regularization) at each iteration. + */ +private class LogisticRegressionTrainingSummaryImpl( +override val predictions: DataFrame, --- End diff -- Same for BinaryLogisticRegressionTrainingSummaryImpl --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r133830662 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -199,15 +199,57 @@ class LogisticRegressionSuite } } - test("empty probabilityCol") { -val lr = new LogisticRegression().setProbabilityCol("") -val model = lr.fit(smallBinaryDataset) -assert(model.hasSummary) -// Validate that we re-insert a probability column for evaluation -val fieldNames = model.summary.predictions.schema.fieldNames -assert(smallBinaryDataset.schema.fieldNames.toSet.subsetOf( - fieldNames.toSet)) -assert(fieldNames.exists(s => s.startsWith("probability_"))) + test("empty probabilityCol or predictionCol") { +val lr = new LogisticRegression().setMaxIter(1) +val datasetFieldNames = smallBinaryDataset.schema.fieldNames.toSet +def checkSummarySchema(model: LogisticRegressionModel, columns: Seq[String]): Unit = { + val fieldNames = model.summary.predictions.schema.fieldNames + assert(model.hasSummary) + assert(datasetFieldNames.subsetOf(fieldNames.toSet)) + columns.foreach { c => assert(fieldNames.exists(_.startsWith(c))) } +} +// check that the summary model adds the appropriate columns +Seq(("binomial", smallBinaryDataset), ("multinomial", smallMultinomialDataset)).foreach { + case (family, dataset) => +lr.setFamily(family) +lr.setProbabilityCol("").setPredictionCol("prediction") +val modelNoProb = lr.fit(smallBinaryDataset) +checkSummarySchema(modelNoProb, Seq("probability_")) + +lr.setProbabilityCol("probability").setPredictionCol("") +val modelNoPred = lr.fit(smallBinaryDataset) +checkSummarySchema(modelNoPred, Seq("prediction_")) + +lr.setProbabilityCol("").setPredictionCol("") +val modelNoPredNoProb = lr.fit(smallBinaryDataset) +checkSummarySchema(modelNoPredNoProb, Seq("prediction_", "probability_")) +} + } + + test("check summary types for binary and multiclass") { +val lr = new LogisticRegression() --- End diff -- Set maxIter = 1 for speed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r133771738 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1018,19 +1025,33 @@ class LogisticRegressionModel private[spark] ( throw new SparkException("No training summary available for this LogisticRegressionModel") } + @Since("2.2.0") --- End diff -- update to 2.3 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r133828620 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1484,3 +1556,96 @@ class BinaryLogisticRegressionSummary private[classification] ( binaryMetrics.recallByThreshold().toDF("threshold", "recall") } } + +sealed trait BinaryLogisticRegressionTrainingSummary extends BinaryLogisticRegressionSummary + with LogisticRegressionTrainingSummary + +/** + * :: Experimental :: --- End diff -- No need for Experimental tags on private classes --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r133772374 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1018,19 +1025,33 @@ class LogisticRegressionModel private[spark] ( throw new SparkException("No training summary available for this LogisticRegressionModel") } + @Since("2.2.0") --- End diff -- Also, add Scala doc indicating that this will throw an exception if it is a multiclass model. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r133830499 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -199,15 +199,57 @@ class LogisticRegressionSuite } } - test("empty probabilityCol") { -val lr = new LogisticRegression().setProbabilityCol("") -val model = lr.fit(smallBinaryDataset) -assert(model.hasSummary) -// Validate that we re-insert a probability column for evaluation -val fieldNames = model.summary.predictions.schema.fieldNames -assert(smallBinaryDataset.schema.fieldNames.toSet.subsetOf( - fieldNames.toSet)) -assert(fieldNames.exists(s => s.startsWith("probability_"))) + test("empty probabilityCol or predictionCol") { +val lr = new LogisticRegression().setMaxIter(1) +val datasetFieldNames = smallBinaryDataset.schema.fieldNames.toSet +def checkSummarySchema(model: LogisticRegressionModel, columns: Seq[String]): Unit = { + val fieldNames = model.summary.predictions.schema.fieldNames + assert(model.hasSummary) + assert(datasetFieldNames.subsetOf(fieldNames.toSet)) + columns.foreach { c => assert(fieldNames.exists(_.startsWith(c))) } +} +// check that the summary model adds the appropriate columns +Seq(("binomial", smallBinaryDataset), ("multinomial", smallMultinomialDataset)).foreach { + case (family, dataset) => --- End diff -- You are not using "dataset" in this. However, this logic should be the same for both binomial and multinomial families, so I'm OK if you just test the binomial case here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r133636149 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -882,21 +882,28 @@ class LogisticRegression @Since("1.2.0") ( val model = copyValues(new LogisticRegressionModel(uid, coefficientMatrix, interceptVector, numClasses, isMultinomial)) -// TODO: implement summary model for multinomial case -val m = if (!isMultinomial) { - val (summaryModel, probabilityColName) = model.findSummaryModelAndProbabilityCol() - val logRegSummary = new BinaryLogisticRegressionTrainingSummary( + +val (summaryModel, probabilityColName, predictionColName) = model.findSummaryModel() +val logRegSummary = if (!isMultinomial || (isMultinomial && numClasses <= 2)) { --- End diff -- Oh sorry! I didn't think about this case where a model is specified as multinomial but has 2 classes. What you had before is better: ```numClasses <=2``` (without using ```isMultinomial```). Feel free to question my suggestions :P --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r133636062 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -882,21 +882,28 @@ class LogisticRegression @Since("1.2.0") ( val model = copyValues(new LogisticRegressionModel(uid, coefficientMatrix, interceptVector, numClasses, isMultinomial)) -// TODO: implement summary model for multinomial case -val m = if (!isMultinomial) { - val (summaryModel, probabilityColName) = model.findSummaryModelAndProbabilityCol() - val logRegSummary = new BinaryLogisticRegressionTrainingSummary( + +val (summaryModel, probabilityColName, predictionColName) = model.findSummaryModel() +val logRegSummary = if (numClasses <= 2) { --- End diff -- @jkbradley The logic here isn't `if (!isMultinomial) ... else ...` it is `if (!isMultinomial || (isMultinomial && numClasses <= 2)) ... else ...` see testcase `test("check summary types for binary and multiclass")` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r133589180 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -882,21 +882,28 @@ class LogisticRegression @Since("1.2.0") ( val model = copyValues(new LogisticRegressionModel(uid, coefficientMatrix, interceptVector, numClasses, isMultinomial)) -// TODO: implement summary model for multinomial case -val m = if (!isMultinomial) { - val (summaryModel, probabilityColName) = model.findSummaryModelAndProbabilityCol() - val logRegSummary = new BinaryLogisticRegressionTrainingSummary( + +val (summaryModel, probabilityColName, predictionColName) = model.findSummaryModel() +val logRegSummary = if (numClasses <= 2) { --- End diff -- Use isMultinomial since it is easier to read. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r133598714 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1324,90 +1350,130 @@ private[ml] class MultiClassSummarizer extends Serializable { } /** - * Abstraction for multinomial Logistic Regression Training results. - * Currently, the training summary ignores the training weights except - * for the objective trace. - */ -sealed trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary { - - /** objective function (scaled loss + regularization) at each iteration. */ - def objectiveHistory: Array[Double] - - /** Number of training iterations until termination */ - def totalIterations: Int = objectiveHistory.length - -} - -/** - * Abstraction for Logistic Regression Results for a given model. + * Abstraction for logistic regression results for a given model. */ sealed trait LogisticRegressionSummary extends Serializable { /** * Dataframe output by the model's `transform` method. */ + @Since("2.3.0") def predictions: DataFrame /** Field in "predictions" which gives the probability of each class as a vector. */ + @Since("2.3.0") def probabilityCol: String + /** Field in "predictions" which gives the prediction of each class. */ + @Since("2.3.0") + def predictionCol: String + /** Field in "predictions" which gives the true label of each instance (if available). */ + @Since("2.3.0") def labelCol: String /** Field in "predictions" which gives the features of each instance as a vector. */ + @Since("2.3.0") def featuresCol: String + @transient private val multiclassMetrics = { +new MulticlassMetrics( + predictions.select( +col(predictionCol), +col(labelCol).cast(DoubleType)) +.rdd.map { case Row(prediction: Double, label: Double) => (prediction, label) }) + } + + /** Returns true positive rate for each label. */ + @Since("2.3.0") + def truePositiveRateByLabel: Array[Double] = recallByLabel + + /** Returns false positive rate for each label. */ + @Since("2.3.0") + def falsePositiveRateByLabel: Array[Double] = { +multiclassMetrics.labels.map(label => multiclassMetrics.falsePositiveRate(label)) + } + + /** Returns precision for each label. */ + @Since("2.3.0") + def precisionByLabel: Array[Double] = { +multiclassMetrics.labels.map(label => multiclassMetrics.precision(label)) + } + + /** Returns recall for each label. */ + @Since("2.3.0") + def recallByLabel: Array[Double] = { +multiclassMetrics.labels.map(label => multiclassMetrics.recall(label)) + } + + /** + * Returns f-measure for each label. + */ + @Since("2.3.0") + def fMeasureByLabel(beta: Double): Array[Double] = { +multiclassMetrics.labels.map(label => multiclassMetrics.fMeasure(label, beta)) + } + + /** Returns f1-measure for each label. */ + @Since("2.3.0") + def fMeasureByLabel: Array[Double] = fMeasureByLabel(1.0) + + /** Returns accuracy. */ + @Since("2.3.0") + def accuracy: Double = multiclassMetrics.accuracy + + /** Returns weighted true positive rate. */ + @Since("2.3.0") + def weightedTruePositiveRate: Double = weightedRecall + + /** Returns weighted false positive rate. */ + @Since("2.3.0") + def weightedFalsePositiveRate: Double = multiclassMetrics.weightedFalsePositiveRate + + /** Returns weighted averaged recall. */ + @Since("2.3.0") + def weightedRecall: Double = multiclassMetrics.weightedRecall --- End diff -- For all of these, can you please make sure to copy all of the info from the MulticlassMetrics Scala doc? There is some info missing, and users cannot be expected to find it on their own. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r133598792 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1324,90 +1350,130 @@ private[ml] class MultiClassSummarizer extends Serializable { } /** - * Abstraction for multinomial Logistic Regression Training results. - * Currently, the training summary ignores the training weights except - * for the objective trace. - */ -sealed trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary { - - /** objective function (scaled loss + regularization) at each iteration. */ - def objectiveHistory: Array[Double] - - /** Number of training iterations until termination */ - def totalIterations: Int = objectiveHistory.length - -} - -/** - * Abstraction for Logistic Regression Results for a given model. + * Abstraction for logistic regression results for a given model. */ sealed trait LogisticRegressionSummary extends Serializable { /** * Dataframe output by the model's `transform` method. */ + @Since("2.3.0") def predictions: DataFrame /** Field in "predictions" which gives the probability of each class as a vector. */ + @Since("2.3.0") def probabilityCol: String + /** Field in "predictions" which gives the prediction of each class. */ + @Since("2.3.0") + def predictionCol: String + /** Field in "predictions" which gives the true label of each instance (if available). */ + @Since("2.3.0") def labelCol: String /** Field in "predictions" which gives the features of each instance as a vector. */ + @Since("2.3.0") def featuresCol: String + @transient private val multiclassMetrics = { +new MulticlassMetrics( + predictions.select( +col(predictionCol), +col(labelCol).cast(DoubleType)) +.rdd.map { case Row(prediction: Double, label: Double) => (prediction, label) }) + } + + /** Returns true positive rate for each label. */ + @Since("2.3.0") + def truePositiveRateByLabel: Array[Double] = recallByLabel + + /** Returns false positive rate for each label. */ + @Since("2.3.0") + def falsePositiveRateByLabel: Array[Double] = { +multiclassMetrics.labels.map(label => multiclassMetrics.falsePositiveRate(label)) + } + + /** Returns precision for each label. */ + @Since("2.3.0") + def precisionByLabel: Array[Double] = { +multiclassMetrics.labels.map(label => multiclassMetrics.precision(label)) + } + + /** Returns recall for each label. */ + @Since("2.3.0") + def recallByLabel: Array[Double] = { +multiclassMetrics.labels.map(label => multiclassMetrics.recall(label)) + } + + /** + * Returns f-measure for each label. + */ + @Since("2.3.0") + def fMeasureByLabel(beta: Double): Array[Double] = { +multiclassMetrics.labels.map(label => multiclassMetrics.fMeasure(label, beta)) + } + + /** Returns f1-measure for each label. */ + @Since("2.3.0") + def fMeasureByLabel: Array[Double] = fMeasureByLabel(1.0) + + /** Returns accuracy. */ + @Since("2.3.0") + def accuracy: Double = multiclassMetrics.accuracy + + /** Returns weighted true positive rate. */ + @Since("2.3.0") + def weightedTruePositiveRate: Double = weightedRecall + + /** Returns weighted false positive rate. */ + @Since("2.3.0") + def weightedFalsePositiveRate: Double = multiclassMetrics.weightedFalsePositiveRate + + /** Returns weighted averaged recall. */ + @Since("2.3.0") + def weightedRecall: Double = multiclassMetrics.weightedRecall + + /** Returns weighted averaged precision. */ + @Since("2.3.0") + def weightedPrecision: Double = multiclassMetrics.weightedPrecision + + /** + * Returns weighted averaged f-measure. + */ + @Since("2.3.0") + def weightedFMeasure(beta: Double): Double = multiclassMetrics.weightedFMeasure(beta) + + /** Returns weighted averaged f1-measure. */ + @Since("2.3.0") + def weightedFMeasure: Double = multiclassMetrics.weightedFMeasure(1.0) } /** - * :: Experimental :: - * Logistic regression training results. - * - * @param predictions dataframe output by the model's `transform` method. - * @param probabilityCol field in "predictions" which gives the probability of - * each class as a vector. - * @param labelCol field in "predictions" which gives the true label of each instance. - * @param featuresCol field in "predictions" which gives the features of each instance
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r114818175 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -982,19 +989,33 @@ class LogisticRegressionModel private[spark] ( throw new SparkException("No training summary available for this LogisticRegressionModel") } + @Since("2.2.0") + def binarySummary: BinaryLogisticRegressionTrainingSummary = summary match { --- End diff -- For convenience. Otherwise, users need to call `model.summary.asInstanceOf[BinaryLogisticRegressionTrainingSummary]` in order to access the extra, binary-specific methods. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r114816222 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1449,6 +1513,107 @@ class BinaryLogisticRegressionSummary private[classification] ( } } +sealed trait BinaryLogisticRegressionTrainingSummary extends BinaryLogisticRegressionSummary + with LogisticRegressionTrainingSummary + +/** + * :: Experimental :: + * Multiclass logistic regression training results. + * + * @param predictions dataframe output by the model's `transform` method. + * @param probabilityCol field in "predictions" which gives the probability of + * each class as a vector. + * @param predictionCol field in "predictions" which gives the prediction for a data instance as a + * double. + * @param labelCol field in "predictions" which gives the true label of each instance. + * @param featuresCol field in "predictions" which gives the features of each instance as a vector. + * @param objectiveHistory objective function (scaled loss + regularization) at each iteration. + */ +@Experimental +@Since("2.2.0") --- End diff -- Remove ```Experimental``` and ```Since``` tag for private class. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r114816720 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -982,19 +989,33 @@ class LogisticRegressionModel private[spark] ( throw new SparkException("No training summary available for this LogisticRegressionModel") } + @Since("2.2.0") + def binarySummary: BinaryLogisticRegressionTrainingSummary = summary match { --- End diff -- Could I ask why we need a separate ```binarySummary```? If we get a binary classification model, the ```trainingSummary ``` should be the corresponding binary summary. Is that correct? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r114816288 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1449,6 +1513,107 @@ class BinaryLogisticRegressionSummary private[classification] ( } } +sealed trait BinaryLogisticRegressionTrainingSummary extends BinaryLogisticRegressionSummary + with LogisticRegressionTrainingSummary + +/** + * :: Experimental :: + * Multiclass logistic regression training results. + * + * @param predictions dataframe output by the model's `transform` method. + * @param probabilityCol field in "predictions" which gives the probability of + * each class as a vector. + * @param predictionCol field in "predictions" which gives the prediction for a data instance as a + * double. + * @param labelCol field in "predictions" which gives the true label of each instance. + * @param featuresCol field in "predictions" which gives the features of each instance as a vector. + * @param objectiveHistory objective function (scaled loss + regularization) at each iteration. + */ +@Experimental +@Since("2.2.0") +private class LogisticRegressionTrainingSummaryImpl( +override val predictions: DataFrame, +override val probabilityCol: String, +override val predictionCol: String, +override val labelCol: String, +override val featuresCol: String, +@Since("1.5.0") val objectiveHistory: Array[Double]) + extends LogisticRegressionSummaryImpl( +predictions, probabilityCol, predictionCol, labelCol, featuresCol) + with LogisticRegressionTrainingSummary + +/** + * :: Experimental :: + * Multiclass logistic regression results for a given model. + * + * @param predictions dataframe output by the model's `transform` method. + * @param probabilityCol field in "predictions" which gives the probability of + * each class as a vector. + * @param predictionCol field in "predictions" which gives the prediction for a data instance as a + * double. + * @param labelCol field in "predictions" which gives the true label of each instance. + * @param featuresCol field in "predictions" which gives the features of each instance as a vector. + */ +@Experimental +@Since("2.2.0") +private class LogisticRegressionSummaryImpl( --- End diff -- Ditto. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r114043519 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1231,6 +1295,109 @@ class BinaryLogisticRegressionSummary private[classification] ( } } +sealed trait BinaryLogisticRegressionTrainingSummary extends BinaryLogisticRegressionSummary + with LogisticRegressionTrainingSummary + +/** + * :: Experimental :: + * Multiclass logistic regression training results. + * + * @param predictions dataframe output by the model's `transform` method. + * @param probabilityCol field in "predictions" which gives the probability of + * each class as a vector. + * @param predictionCol field in "predictions" which gives the prediction for a data instance as a + * double. + * @param labelCol field in "predictions" which gives the true label of each instance. + * @param featuresCol field in "predictions" which gives the features of each instance as a vector. + * @param objectiveHistory objective function (scaled loss + regularization) at each iteration. + */ +@Experimental +@Since("2.2.0") +private class LogisticRegressionTrainingSummaryImpl( +override val predictions: DataFrame, +override val probabilityCol: String, +override val predictionCol: String, +override val labelCol: String, +override val featuresCol: String, +@Since("1.5.0") val objectiveHistory: Array[Double]) + extends LogisticRegressionSummaryImpl( +predictions, probabilityCol, predictionCol, labelCol, featuresCol) + with LogisticRegressionTrainingSummary { + +} + +/** + * :: Experimental :: + * Multiclass Logistic regression results for a given model. --- End diff -- yeah, I already rebase it. (If I do not squash the commit to fix conflicts , the rebase always fail, maybe there are some better way?) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r113950702 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1231,6 +1295,109 @@ class BinaryLogisticRegressionSummary private[classification] ( } } +sealed trait BinaryLogisticRegressionTrainingSummary extends BinaryLogisticRegressionSummary + with LogisticRegressionTrainingSummary + +/** + * :: Experimental :: + * Multiclass logistic regression training results. + * + * @param predictions dataframe output by the model's `transform` method. + * @param probabilityCol field in "predictions" which gives the probability of + * each class as a vector. + * @param predictionCol field in "predictions" which gives the prediction for a data instance as a + * double. + * @param labelCol field in "predictions" which gives the true label of each instance. + * @param featuresCol field in "predictions" which gives the features of each instance as a vector. + * @param objectiveHistory objective function (scaled loss + regularization) at each iteration. + */ +@Experimental +@Since("2.2.0") +private class LogisticRegressionTrainingSummaryImpl( +override val predictions: DataFrame, +override val probabilityCol: String, +override val predictionCol: String, +override val labelCol: String, +override val featuresCol: String, +@Since("1.5.0") val objectiveHistory: Array[Double]) + extends LogisticRegressionSummaryImpl( +predictions, probabilityCol, predictionCol, labelCol, featuresCol) + with LogisticRegressionTrainingSummary { + +} + +/** + * :: Experimental :: + * Multiclass Logistic regression results for a given model. --- End diff -- have you tried just rebasing? `git pull --rebase upstream master`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r113867765 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1231,6 +1295,109 @@ class BinaryLogisticRegressionSummary private[classification] ( } } +sealed trait BinaryLogisticRegressionTrainingSummary extends BinaryLogisticRegressionSummary + with LogisticRegressionTrainingSummary + +/** + * :: Experimental :: + * Multiclass logistic regression training results. + * + * @param predictions dataframe output by the model's `transform` method. + * @param probabilityCol field in "predictions" which gives the probability of + * each class as a vector. + * @param predictionCol field in "predictions" which gives the prediction for a data instance as a + * double. + * @param labelCol field in "predictions" which gives the true label of each instance. + * @param featuresCol field in "predictions" which gives the features of each instance as a vector. + * @param objectiveHistory objective function (scaled loss + regularization) at each iteration. + */ +@Experimental +@Since("2.2.0") +private class LogisticRegressionTrainingSummaryImpl( +override val predictions: DataFrame, +override val probabilityCol: String, +override val predictionCol: String, +override val labelCol: String, +override val featuresCol: String, +@Since("1.5.0") val objectiveHistory: Array[Double]) + extends LogisticRegressionSummaryImpl( +predictions, probabilityCol, predictionCol, labelCol, featuresCol) + with LogisticRegressionTrainingSummary { + +} + +/** + * :: Experimental :: + * Multiclass Logistic regression results for a given model. --- End diff -- @sethah I have a small problem, if I do not squash commits, it seems cannot solve the conflicts... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r113857158 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1231,6 +1295,109 @@ class BinaryLogisticRegressionSummary private[classification] ( } } +sealed trait BinaryLogisticRegressionTrainingSummary extends BinaryLogisticRegressionSummary + with LogisticRegressionTrainingSummary + +/** + * :: Experimental :: + * Multiclass logistic regression training results. + * + * @param predictions dataframe output by the model's `transform` method. + * @param probabilityCol field in "predictions" which gives the probability of + * each class as a vector. + * @param predictionCol field in "predictions" which gives the prediction for a data instance as a + * double. + * @param labelCol field in "predictions" which gives the true label of each instance. + * @param featuresCol field in "predictions" which gives the features of each instance as a vector. + * @param objectiveHistory objective function (scaled loss + regularization) at each iteration. + */ +@Experimental +@Since("2.2.0") +private class LogisticRegressionTrainingSummaryImpl( +override val predictions: DataFrame, +override val probabilityCol: String, +override val predictionCol: String, +override val labelCol: String, +override val featuresCol: String, +@Since("1.5.0") val objectiveHistory: Array[Double]) + extends LogisticRegressionSummaryImpl( +predictions, probabilityCol, predictionCol, labelCol, featuresCol) + with LogisticRegressionTrainingSummary { + +} --- End diff -- remove braces --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r113857182 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1231,6 +1295,109 @@ class BinaryLogisticRegressionSummary private[classification] ( } } +sealed trait BinaryLogisticRegressionTrainingSummary extends BinaryLogisticRegressionSummary + with LogisticRegressionTrainingSummary + +/** + * :: Experimental :: + * Multiclass logistic regression training results. + * + * @param predictions dataframe output by the model's `transform` method. + * @param probabilityCol field in "predictions" which gives the probability of + * each class as a vector. + * @param predictionCol field in "predictions" which gives the prediction for a data instance as a + * double. + * @param labelCol field in "predictions" which gives the true label of each instance. + * @param featuresCol field in "predictions" which gives the features of each instance as a vector. + * @param objectiveHistory objective function (scaled loss + regularization) at each iteration. + */ +@Experimental +@Since("2.2.0") +private class LogisticRegressionTrainingSummaryImpl( +override val predictions: DataFrame, +override val probabilityCol: String, +override val predictionCol: String, +override val labelCol: String, +override val featuresCol: String, +@Since("1.5.0") val objectiveHistory: Array[Double]) + extends LogisticRegressionSummaryImpl( +predictions, probabilityCol, predictionCol, labelCol, featuresCol) + with LogisticRegressionTrainingSummary { + +} + +/** + * :: Experimental :: + * Multiclass Logistic regression results for a given model. --- End diff -- "Multiclass Logistic" -> "Multiclass logistic" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r113856908 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1070,90 +1096,128 @@ private[classification] class MultiClassSummarizer extends Serializable { } /** - * Abstraction for multinomial Logistic Regression Training results. - * Currently, the training summary ignores the training weights except - * for the objective trace. - */ -sealed trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary { - - /** objective function (scaled loss + regularization) at each iteration. */ - def objectiveHistory: Array[Double] - - /** Number of training iterations until termination */ - def totalIterations: Int = objectiveHistory.length - -} - -/** - * Abstraction for Logistic Regression Results for a given model. + * Abstraction for logistic regression results for a given model. */ sealed trait LogisticRegressionSummary extends Serializable { /** * Dataframe output by the model's `transform` method. */ + @Since("2.2.0") --- End diff -- We have avoided putting since tags on methods inside traits. But since this is sealed and I don't think we intend to add to it, it should be ok --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r107961224 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -644,21 +644,29 @@ class LogisticRegression @Since("1.2.0") ( val model = copyValues(new LogisticRegressionModel(uid, coefficientMatrix, interceptVector, numClasses, isMultinomial)) -// TODO: implement summary model for multinomial case -val m = if (!isMultinomial) { - val (summaryModel, probabilityColName) = model.findSummaryModelAndProbabilityCol() - val logRegSummary = new BinaryLogisticRegressionTrainingSummary( + +val (summaryModel, probabilityColName, predictionColName) + = model.findSummaryModel() +val logRegSummary = if (!isMultinomial) { --- End diff -- It doesn't appear you added any tests for this? Also, it is better for reviewers if you do not squash your commits since we can then see what has changed since the last pass was made. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r107326057 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -1786,51 +1793,98 @@ class LogisticRegressionSuite } test("evaluate on test set") { -// TODO: add for multiclass when model summary becomes available // Evaluate on test set should be same as that of the transformed training data. val lr = new LogisticRegression() .setMaxIter(10) .setRegParam(1.0) .setThreshold(0.6) -val model = lr.fit(smallBinaryDataset) -val summary = model.summary.asInstanceOf[BinaryLogisticRegressionSummary] - -val sameSummary = - model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary] -assert(summary.areaUnderROC === sameSummary.areaUnderROC) -assert(summary.roc.collect() === sameSummary.roc.collect()) -assert(summary.pr.collect === sameSummary.pr.collect()) + .setFamily("binomial") +val blorModel = lr.fit(smallBinaryDataset) +val blorSummary = blorModel.binarySummary + +val sameBlorSummary = + blorModel.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary] +assert(blorSummary.areaUnderROC === sameBlorSummary.areaUnderROC) +assert(blorSummary.roc.collect() === sameBlorSummary.roc.collect()) +assert(blorSummary.pr.collect === sameBlorSummary.pr.collect()) assert( - summary.fMeasureByThreshold.collect() === sameSummary.fMeasureByThreshold.collect()) -assert(summary.recallByThreshold.collect() === sameSummary.recallByThreshold.collect()) + blorSummary.fMeasureByThreshold.collect() === sameBlorSummary.fMeasureByThreshold.collect()) assert( - summary.precisionByThreshold.collect() === sameSummary.precisionByThreshold.collect()) + blorSummary.recallByThreshold.collect() === sameBlorSummary.recallByThreshold.collect()) +assert( + blorSummary.precisionByThreshold.collect() +=== sameBlorSummary.precisionByThreshold.collect()) + +lr.setFamily("multinomial") +val mlorModel = lr.fit(smallMultinomialDataset) +val mlorSummary = mlorModel.summary + +val mlorSameSummary = mlorModel.evaluate(smallMultinomialDataset) + +assert(mlorSummary.truePositiveRateByLabel === mlorSameSummary.truePositiveRateByLabel) +assert(mlorSummary.falsePositiveRateByLabel === mlorSameSummary.falsePositiveRateByLabel) +assert(mlorSummary.precisionByLabel === mlorSameSummary.precisionByLabel) +assert(mlorSummary.recallByLabel === mlorSameSummary.recallByLabel) +assert(mlorSummary.fMeasureByLabel === mlorSameSummary.fMeasureByLabel) +assert(mlorSummary.accuracy === mlorSameSummary.accuracy) +assert(mlorSummary.weightedTruePositiveRate === mlorSameSummary.weightedTruePositiveRate) +assert(mlorSummary.weightedFalsePositiveRate === mlorSameSummary.weightedFalsePositiveRate) +assert(mlorSummary.weightedPrecision === mlorSameSummary.weightedPrecision) +assert(mlorSummary.weightedRecall === mlorSameSummary.weightedRecall) +assert(mlorSummary.weightedFMeasure === mlorSameSummary.weightedFMeasure) } test("evaluate with labels that are not doubles") { // Evaluate a test set with Label that is a numeric type other than Double -val lr = new LogisticRegression() +val blor = new LogisticRegression() .setMaxIter(1) .setRegParam(1.0) -val model = lr.fit(smallBinaryDataset) -val summary = model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary] +val blorModel = blor.fit(smallBinaryDataset) +val blorSummary = blorModel.evaluate(smallBinaryDataset) + .asInstanceOf[BinaryLogisticRegressionSummary] + +val blorLongLabelData = smallBinaryDataset.select(col(blorModel.getLabelCol).cast(LongType), + col(blorModel.getFeaturesCol)) +val blorLongSummary = blorModel.evaluate(blorLongLabelData) + .asInstanceOf[BinaryLogisticRegressionSummary] --- End diff -- OK. I agree to do it in separate PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r107325971 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -194,15 +207,9 @@ class LogisticRegressionSuite // thresholds and threshold must be consistent: values withClue("fit with ParamMap should throw error if threshold, thresholds do not match.") { intercept[IllegalArgumentException] { -lr2.fit(smallBinaryDataset, - lr2.thresholds -> Array(0.3, 0.7), lr2.threshold -> (expectedThreshold / 2.0)) - } -} -withClue("fit with ParamMap should throw error if threshold, thresholds do not match.") { - intercept[IllegalArgumentException] { -val lr2model = lr2.fit(smallBinaryDataset, - lr2.thresholds -> Array(0.3, 0.7), lr2.threshold -> (expectedThreshold / 2.0)) -lr2model.getThreshold +val lr3 = lr2.copy( --- End diff -- I will revert this. (previously has a bug which cause this test case fail but now fixed) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r106725381 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -644,21 +644,29 @@ class LogisticRegression @Since("1.2.0") ( val model = copyValues(new LogisticRegressionModel(uid, coefficientMatrix, interceptVector, numClasses, isMultinomial)) -// TODO: implement summary model for multinomial case -val m = if (!isMultinomial) { - val (summaryModel, probabilityColName) = model.findSummaryModelAndProbabilityCol() - val logRegSummary = new BinaryLogisticRegressionTrainingSummary( + +val (summaryModel, probabilityColName, predictionColName) + = model.findSummaryModel() +val logRegSummary = if (!isMultinomial) { --- End diff -- We should change this to `if (numClasses <= 2)` since you can train a multinomial model on 2 classes, and we should still get a binary summary in that case. And of course, please add tests. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r106492335 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1086,83 +1115,124 @@ private[classification] class MultiClassSummarizer extends Serializable { } /** - * Abstraction for multinomial Logistic Regression Training results. - * Currently, the training summary ignores the training weights except - * for the objective trace. - */ -sealed trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary { - - /** objective function (scaled loss + regularization) at each iteration. */ - def objectiveHistory: Array[Double] - - /** Number of training iterations until termination */ - def totalIterations: Int = objectiveHistory.length - -} - -/** * Abstraction for Logistic Regression Results for a given model. */ sealed trait LogisticRegressionSummary extends Serializable { /** * Dataframe output by the model's `transform` method. */ + @Since("2.2.0") def predictions: DataFrame /** Field in "predictions" which gives the probability of each class as a vector. */ + @Since("2.2.0") def probabilityCol: String + /** Field in "predictions" which gives the prediction of each class as a vector. */ --- End diff -- "as a Double". Please check this, there are several cases where the prediction is said to be a vector, but predictions are actually double values --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r106726649 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -813,9 +835,16 @@ class LogisticRegressionModel private[spark] ( @Since("2.0.0") def evaluate(dataset: Dataset[_]): LogisticRegressionSummary = { // Handle possible missing or invalid prediction columns -val (summaryModel, probabilityColName) = findSummaryModelAndProbabilityCol() -new BinaryLogisticRegressionSummary(summaryModel.transform(dataset), - probabilityColName, $(labelCol), $(featuresCol)) +val (summaryModel, probabilityColName, predictionColName) --- End diff -- put this on a single line. also, it happens above so do it there as well --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r106727119 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -1786,51 +1793,98 @@ class LogisticRegressionSuite } test("evaluate on test set") { -// TODO: add for multiclass when model summary becomes available // Evaluate on test set should be same as that of the transformed training data. val lr = new LogisticRegression() .setMaxIter(10) .setRegParam(1.0) .setThreshold(0.6) -val model = lr.fit(smallBinaryDataset) -val summary = model.summary.asInstanceOf[BinaryLogisticRegressionSummary] - -val sameSummary = - model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary] -assert(summary.areaUnderROC === sameSummary.areaUnderROC) -assert(summary.roc.collect() === sameSummary.roc.collect()) -assert(summary.pr.collect === sameSummary.pr.collect()) + .setFamily("binomial") +val blorModel = lr.fit(smallBinaryDataset) +val blorSummary = blorModel.binarySummary + +val sameBlorSummary = + blorModel.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary] +assert(blorSummary.areaUnderROC === sameBlorSummary.areaUnderROC) +assert(blorSummary.roc.collect() === sameBlorSummary.roc.collect()) +assert(blorSummary.pr.collect === sameBlorSummary.pr.collect()) assert( - summary.fMeasureByThreshold.collect() === sameSummary.fMeasureByThreshold.collect()) -assert(summary.recallByThreshold.collect() === sameSummary.recallByThreshold.collect()) + blorSummary.fMeasureByThreshold.collect() === sameBlorSummary.fMeasureByThreshold.collect()) assert( - summary.precisionByThreshold.collect() === sameSummary.precisionByThreshold.collect()) + blorSummary.recallByThreshold.collect() === sameBlorSummary.recallByThreshold.collect()) +assert( + blorSummary.precisionByThreshold.collect() +=== sameBlorSummary.precisionByThreshold.collect()) + +lr.setFamily("multinomial") +val mlorModel = lr.fit(smallMultinomialDataset) +val mlorSummary = mlorModel.summary + +val mlorSameSummary = mlorModel.evaluate(smallMultinomialDataset) + +assert(mlorSummary.truePositiveRateByLabel === mlorSameSummary.truePositiveRateByLabel) +assert(mlorSummary.falsePositiveRateByLabel === mlorSameSummary.falsePositiveRateByLabel) +assert(mlorSummary.precisionByLabel === mlorSameSummary.precisionByLabel) +assert(mlorSummary.recallByLabel === mlorSameSummary.recallByLabel) +assert(mlorSummary.fMeasureByLabel === mlorSameSummary.fMeasureByLabel) +assert(mlorSummary.accuracy === mlorSameSummary.accuracy) +assert(mlorSummary.weightedTruePositiveRate === mlorSameSummary.weightedTruePositiveRate) +assert(mlorSummary.weightedFalsePositiveRate === mlorSameSummary.weightedFalsePositiveRate) +assert(mlorSummary.weightedPrecision === mlorSameSummary.weightedPrecision) +assert(mlorSummary.weightedRecall === mlorSameSummary.weightedRecall) +assert(mlorSummary.weightedFMeasure === mlorSameSummary.weightedFMeasure) } test("evaluate with labels that are not doubles") { // Evaluate a test set with Label that is a numeric type other than Double -val lr = new LogisticRegression() +val blor = new LogisticRegression() .setMaxIter(1) .setRegParam(1.0) -val model = lr.fit(smallBinaryDataset) -val summary = model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary] +val blorModel = blor.fit(smallBinaryDataset) +val blorSummary = blorModel.evaluate(smallBinaryDataset) + .asInstanceOf[BinaryLogisticRegressionSummary] + +val blorLongLabelData = smallBinaryDataset.select(col(blorModel.getLabelCol).cast(LongType), + col(blorModel.getFeaturesCol)) +val blorLongSummary = blorModel.evaluate(blorLongLabelData) + .asInstanceOf[BinaryLogisticRegressionSummary] --- End diff -- This is quite annoying. I wonder if we should add a `asBinary` method to `LogisticRegressionSummary` that casts it or throws an error to make it easier. Probably to do in a separate PR anyway. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r106522957 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -194,15 +207,9 @@ class LogisticRegressionSuite // thresholds and threshold must be consistent: values withClue("fit with ParamMap should throw error if threshold, thresholds do not match.") { intercept[IllegalArgumentException] { -lr2.fit(smallBinaryDataset, - lr2.thresholds -> Array(0.3, 0.7), lr2.threshold -> (expectedThreshold / 2.0)) - } -} -withClue("fit with ParamMap should throw error if threshold, thresholds do not match.") { - intercept[IllegalArgumentException] { -val lr2model = lr2.fit(smallBinaryDataset, - lr2.thresholds -> Array(0.3, 0.7), lr2.threshold -> (expectedThreshold / 2.0)) -lr2model.getThreshold +val lr3 = lr2.copy( --- End diff -- why is this changed? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r106726508 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1394,13 +1575,13 @@ class BinaryLogisticRegressionSummary private[classification] ( *$$ * * + * --- End diff -- revert this one and the next one --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r106727965 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -780,19 +788,33 @@ class LogisticRegressionModel private[spark] ( throw new SparkException("No training summary available for this LogisticRegressionModel") } + @Since("2.2.0") + def binarySummary: BinaryLogisticRegressionTrainingSummary = summary match { +case b: BinaryLogisticRegressionTrainingSummary => b +case _ => + throw new RuntimeException("Cannot create a binarySummary for a non-binary model" + +s"(numClasses=${numClasses}), use summary instead.") + } + /** - * If the probability column is set returns the current model and probability column, - * otherwise generates a new column and sets it as the probability column on a new copy - * of the current model. + * If the probability column and prediction column is set returns the current model, otherwise --- End diff -- "If the probability and prediction columns are set, this method returns the current model, otherwise it generates new columns for them and sets them as columns on a new copy of the current model." --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r106533356 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1242,11 +1312,122 @@ class BinaryLogisticRegressionSummary private[classification] ( * This will change in later Spark versions. */ @Since("1.5.0") - @transient lazy val recallByThreshold: DataFrame = { + def recallByThreshold: DataFrame = { binaryMetrics.recallByThreshold().toDF("threshold", "recall") } } +sealed trait BinaryLogisticRegressionTrainingSummary extends BinaryLogisticRegressionSummary + with LogisticRegressionTrainingSummary { + +} + +/** + * :: Experimental :: + * Multinomial logistic regression training results. + * + * @param predictions dataframe output by the model's `transform` method. + * @param probabilityCol field in "predictions" which gives the probability of + * each class as a vector. + * @param predictionCol field in "predictions" which gives the prediction for a data instance as a + * double. + * @param labelCol field in "predictions" which gives the true label of each instance. + * @param featuresCol field in "predictions" which gives the features of each instance as a vector. + * @param objectiveHistory objective function (scaled loss + regularization) at each iteration. + */ +@Experimental +@Since("1.5.0") +class LogisticRegressionTrainingSummaryImpl private[classification]( --- End diff -- These impl classes can be private I think. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r106726296 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -780,19 +788,33 @@ class LogisticRegressionModel private[spark] ( throw new SparkException("No training summary available for this LogisticRegressionModel") } + @Since("2.2.0") + def binarySummary: BinaryLogisticRegressionTrainingSummary = summary match { --- End diff -- The behavior we are implementing is non-trivial. We need to add tests to ensure that everything happens as expected. This is an example: scala test("binary and multiclass summary") { val lr = new LogisticRegression() .setFamily("binomial") val blorModel = lr.fit(smallBinaryDataset) assert(blorModel.summary.isInstanceOf[BinaryLogisticRegressionTrainingSummaryImpl]) assert(blorModel.binarySummary.isInstanceOf[BinaryLogisticRegressionTrainingSummaryImpl]) val mlorModel = lr.setFamily("multinomial").fit(smallMultinomialDataset) assert(mlorModel.summary.isInstanceOf[LogisticRegressionTrainingSummaryImpl]) withClue("cannot get binary summary for multiclass model") { intercept[RuntimeException] { mlorModel.binarySummary } } val blorSummary = blorModel.evaluate(smallBinaryDataset) val mlorSummary = mlorModel.evaluate(smallMultinomialDataset) assert(blorSummary.isInstanceOf[BinaryLogisticRegressionSummaryImpl]) assert(mlorSummary.isInstanceOf[LogisticRegressionSummaryImpl]) } --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r106727670 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1086,83 +1115,124 @@ private[classification] class MultiClassSummarizer extends Serializable { } /** - * Abstraction for multinomial Logistic Regression Training results. - * Currently, the training summary ignores the training weights except - * for the objective trace. - */ -sealed trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary { - - /** objective function (scaled loss + regularization) at each iteration. */ - def objectiveHistory: Array[Double] - - /** Number of training iterations until termination */ - def totalIterations: Int = objectiveHistory.length - -} - -/** * Abstraction for Logistic Regression Results for a given model. */ sealed trait LogisticRegressionSummary extends Serializable { /** * Dataframe output by the model's `transform` method. */ + @Since("2.2.0") def predictions: DataFrame /** Field in "predictions" which gives the probability of each class as a vector. */ + @Since("2.2.0") def probabilityCol: String + /** Field in "predictions" which gives the prediction of each class as a vector. */ + @Since("2.2.0") + def predictionCol: String + /** Field in "predictions" which gives the true label of each instance (if available). */ + @Since("2.2.0") def labelCol: String /** Field in "predictions" which gives the features of each instance as a vector. */ + @Since("2.2.0") def featuresCol: String + @transient private val multiclassMetrics = { +new MulticlassMetrics( + predictions.select( +col(predictionCol), +col(labelCol).cast(DoubleType)) +.rdd.map { case Row(prediction: Double, label: Double) => (prediction, label) }) + } + + /** Returns true positive rate for each label. */ + @Since("2.2.0") + def truePositiveRateByLabel: Array[Double] = recallByLabel + + /** Returns false positive rate for each label. */ + @Since("2.2.0") + def falsePositiveRateByLabel: Array[Double] = { +multiclassMetrics.labels.map(label => multiclassMetrics.falsePositiveRate(label)) + } + + /** Returns precision for each label. */ + @Since("2.2.0") + def precisionByLabel: Array[Double] = { +multiclassMetrics.labels.map(label => multiclassMetrics.precision(label)) + } + + /** Returns recall for each label. */ + @Since("2.2.0") + def recallByLabel: Array[Double] = { +multiclassMetrics.labels.map(label => multiclassMetrics.recall(label)) + } + + /** + * Returns f-measure for each label. + */ + @Since("2.2.0") + def fMeasureByLabel(beta: Double): Array[Double] = { +multiclassMetrics.labels.map(label => multiclassMetrics.fMeasure(label, beta)) + } + + /** Returns f1-measure for each label. */ + @Since("2.2.0") + def fMeasureByLabel: Array[Double] = fMeasureByLabel(1.0) + + /** Returns accuracy. */ + @Since("2.2.0") + def accuracy: Double = multiclassMetrics.accuracy + + /** Returns weighted true positive rate. */ + @Since("2.2.0") + def weightedTruePositiveRate: Double = weightedRecall + + /** Returns weighted false positive rate. */ + @Since("2.2.0") + def weightedFalsePositiveRate: Double = multiclassMetrics.weightedFalsePositiveRate + + /** Returns weighted averaged recall. */ + @Since("2.2.0") + def weightedRecall: Double = multiclassMetrics.weightedRecall + + /** Returns weighted averaged precision. */ + @Since("2.2.0") + def weightedPrecision: Double = multiclassMetrics.weightedPrecision + + /** + * Returns weighted averaged f-measure. + */ + @Since("2.2.0") + def weightedFMeasure(beta: Double): Double = multiclassMetrics.weightedFMeasure(beta) + + /** Returns weighted averaged f1-measure. */ + @Since("2.2.0") + def weightedFMeasure: Double = multiclassMetrics.weightedFMeasure(1.0) } /** - * :: Experimental :: - * Logistic regression training results. - * - * @param predictions dataframe output by the model's `transform` method. - * @param probabilityCol field in "predictions" which gives the probability of - * each class as a vector. - * @param labelCol field in "predictions" which gives the true label of each instance. - * @param featuresCol field in "predictions" which gives the features of each instance as a vector. - * @param objectiveHistory
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r106534308 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1197,7 +1267,7 @@ class BinaryLogisticRegressionSummary private[classification] ( * This will change in later Spark versions. */ @Since("1.5.0") - lazy val areaUnderROC: Double = binaryMetrics.areaUnderROC() + def areaUnderROC: Double = binaryMetrics.areaUnderROC() --- End diff -- Why are these defs now? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r106725472 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -813,9 +835,16 @@ class LogisticRegressionModel private[spark] ( @Since("2.0.0") def evaluate(dataset: Dataset[_]): LogisticRegressionSummary = { // Handle possible missing or invalid prediction columns -val (summaryModel, probabilityColName) = findSummaryModelAndProbabilityCol() -new BinaryLogisticRegressionSummary(summaryModel.transform(dataset), - probabilityColName, $(labelCol), $(featuresCol)) +val (summaryModel, probabilityColName, predictionColName) + = findSummaryModel() +if (isMultinomial) { --- End diff -- same as above comment, check number of classes, not family --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r106728259 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -780,19 +788,33 @@ class LogisticRegressionModel private[spark] ( throw new SparkException("No training summary available for this LogisticRegressionModel") } + @Since("2.2.0") + def binarySummary: BinaryLogisticRegressionTrainingSummary = summary match { +case b: BinaryLogisticRegressionTrainingSummary => b +case _ => + throw new RuntimeException("Cannot create a binarySummary for a non-binary model" + --- End diff -- "binarySummary" -> "binary summary" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r106726431 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1242,11 +1312,122 @@ class BinaryLogisticRegressionSummary private[classification] ( * This will change in later Spark versions. */ @Since("1.5.0") - @transient lazy val recallByThreshold: DataFrame = { + def recallByThreshold: DataFrame = { binaryMetrics.recallByThreshold().toDF("threshold", "recall") } } +sealed trait BinaryLogisticRegressionTrainingSummary extends BinaryLogisticRegressionSummary + with LogisticRegressionTrainingSummary { + +} + +/** + * :: Experimental :: + * Multinomial logistic regression training results. + * + * @param predictions dataframe output by the model's `transform` method. + * @param probabilityCol field in "predictions" which gives the probability of + * each class as a vector. + * @param predictionCol field in "predictions" which gives the prediction for a data instance as a + * double. + * @param labelCol field in "predictions" which gives the true label of each instance. + * @param featuresCol field in "predictions" which gives the features of each instance as a vector. + * @param objectiveHistory objective function (scaled loss + regularization) at each iteration. + */ +@Experimental +@Since("1.5.0") +class LogisticRegressionTrainingSummaryImpl private[classification]( +predictions: DataFrame, +probabilityCol: String, +predictionCol: String, +labelCol: String, +featuresCol: String, +@Since("1.5.0") val objectiveHistory: Array[Double]) + extends LogisticRegressionSummaryImpl( +predictions, probabilityCol, predictionCol, labelCol, featuresCol) + with LogisticRegressionTrainingSummary { + +} + +/** + * :: Experimental :: + * Multinomial Logistic regression results for a given model. + * + * @param predictions dataframe output by the model's `transform` method. + * @param probabilityCol field in "predictions" which gives the probability of + * each class as a vector. + * @param predictionCol field in "predictions" which gives the prediction for a data instance as a + * double. + * @param labelCol field in "predictions" which gives the true label of each instance. + * @param featuresCol field in "predictions" which gives the features of each instance as a vector. + */ +@Experimental +@Since("2.2.0") +class LogisticRegressionSummaryImpl private[classification]( +@Since("2.2.0") @transient override val predictions: DataFrame, +@Since("2.2.0") override val probabilityCol: String, +@Since("2.2.0") override val predictionCol: String, +@Since("2.2.0") override val labelCol: String, +@Since("2.2.0") override val featuresCol: String) + extends LogisticRegressionSummary { + +} + +/** + * :: Experimental :: + * Binary Logistic regression training results. + * + * @param predictions dataframe output by the model's `transform` method. + * @param probabilityCol field in "predictions" which gives the probability of + * each class as a vector. + * @param predictionCol field in "predictions" which gives the prediction for a data instance as a + * double. + * @param labelCol field in "predictions" which gives the true label of each instance. + * @param featuresCol field in "predictions" which gives the features of each instance as a vector. + * @param objectiveHistory objective function (scaled loss + regularization) at each iteration. + */ +@Experimental +@Since("2.2.0") +class BinaryLogisticRegressionTrainingSummaryImpl private[classification] ( +predictions: DataFrame, +probabilityCol: String, +predictionCol: String, +labelCol: String, +featuresCol: String, +@Since("1.5.0") override val objectiveHistory: Array[Double]) + extends BinaryLogisticRegressionSummaryImpl( +predictions, probabilityCol, predictionCol, labelCol, featuresCol) + with BinaryLogisticRegressionTrainingSummary { + +} + +/** + * :: Experimental :: + * Binary Logistic regression results for a given model. + * + * @param predictions dataframe output by the model's `transform` method. + * @param probabilityCol field in "predictions" which gives the probability of + * each class as a vector. + * @param predictionCol field in "predictions" which gives the prediction of + * each class as a
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r94643188 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -2006,7 +2075,7 @@ class LogisticRegressionSuite assert(prob === Vectors.dense(Array(0.0, 0.0, 0.0, 0.0, 1.0, 0.0))) assert(pred === 4.0) } -// TODO: check num iters is zero when it become available in the model +require(modelWithMetadata.summary.totalIterations == 0) --- End diff -- Also, add `assert(model.summary.totalIterations === 0` after the first of these three tests and add `modelZeroLabel.summary.totalIterations > 0` after the second of these tests. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r94640294 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -1756,55 +1765,105 @@ class LogisticRegressionSuite } test("evaluate on test set") { -// TODO: add for multiclass when model summary becomes available // Evaluate on test set should be same as that of the transformed training data. -val lr = new LogisticRegression() +val blor = new LogisticRegression() .setMaxIter(10) .setRegParam(1.0) .setThreshold(0.6) -val model = lr.fit(smallBinaryDataset) -val summary = model.summary.asInstanceOf[BinaryLogisticRegressionSummary] - -val sameSummary = - model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary] -assert(summary.areaUnderROC === sameSummary.areaUnderROC) -assert(summary.roc.collect() === sameSummary.roc.collect()) -assert(summary.pr.collect === sameSummary.pr.collect()) +val blorModel = blor.fit(smallBinaryDataset) +val blorSummary = blorModel.binarySummary + +val blorSameSummary = + blorModel.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary] +assert(blorSummary.areaUnderROC === blorSameSummary.areaUnderROC) +assert(blorSummary.roc.collect() === blorSameSummary.roc.collect()) +assert(blorSummary.pr.collect === blorSameSummary.pr.collect()) assert( - summary.fMeasureByThreshold.collect() === sameSummary.fMeasureByThreshold.collect()) -assert(summary.recallByThreshold.collect() === sameSummary.recallByThreshold.collect()) + blorSummary.fMeasureByThreshold.collect() === blorSameSummary.fMeasureByThreshold.collect()) +assert(blorSummary.recallByThreshold.collect() + === blorSameSummary.recallByThreshold.collect()) assert( - summary.precisionByThreshold.collect() === sameSummary.precisionByThreshold.collect()) + blorSummary.precisionByThreshold.collect() +=== blorSameSummary.precisionByThreshold.collect()) + +val mlor = new LogisticRegression() + .setMaxIter(10) + .setRegParam(1.0) + .setFamily("multinomial") +val mlorModel = mlor.fit(smallMultinomialDataset) +val mlorSummary = mlorModel.multinomialSummary + +val mlorSameSummary = mlorModel.evaluate(smallMultinomialDataset) +.asInstanceOf[MultinomialLogisticRegressionSummary] + +assert(mlorSummary.labels === mlorSameSummary.labels) +assert(mlorSummary.falsePositiveRateByLabel === mlorSameSummary.falsePositiveRateByLabel) +assert(mlorSummary.precisionByLabel === mlorSameSummary.precisionByLabel) +assert(mlorSummary.recallByLabel === mlorSameSummary.recallByLabel) +assert(mlorSummary.fMeasureByLabel === mlorSameSummary.fMeasureByLabel) +assert(mlorSummary.accuracy === mlorSameSummary.accuracy) +assert(mlorSummary.weightedFalsePositiveRate === mlorSameSummary.weightedFalsePositiveRate) +assert(mlorSummary.weightedPrecision === mlorSameSummary.weightedPrecision) +assert(mlorSummary.weightedRecall === mlorSameSummary.weightedRecall) +assert(mlorSummary.weightedFMeasure === mlorSameSummary.weightedFMeasure) } test("evaluate with labels that are not doubles") { // Evaluate a test set with Label that is a numeric type other than Double -val lr = new LogisticRegression() +val blor = new LogisticRegression() .setMaxIter(1) .setRegParam(1.0) -val model = lr.fit(smallBinaryDataset) -val summary = model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary] +val blorModel = blor.fit(smallBinaryDataset) +val blorSummary = blorModel.evaluate(smallBinaryDataset) + .asInstanceOf[BinaryLogisticRegressionSummary] -val longLabelData = smallBinaryDataset.select(col(model.getLabelCol).cast(LongType), - col(model.getFeaturesCol)) -val longSummary = model.evaluate(longLabelData).asInstanceOf[BinaryLogisticRegressionSummary] +val blorLongLabelData = smallBinaryDataset.select(col(blorModel.getLabelCol).cast(LongType), --- End diff -- Well, from a completeness standpoint I agree that it's better to test all the types that it's intended to work for. However, since it's just calling `cast` under the hood, it does seem a bit redundant. I'm ok leaving it as is, but I don't feel strongly about it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r94614489 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -652,18 +652,27 @@ class LogisticRegression @Since("1.2.0") ( val model = copyValues(new LogisticRegressionModel(uid, coefficientMatrix, interceptVector, numClasses, isMultinomial)) -// TODO: implement summary model for multinomial case + +val (summaryModel, probabilityColName, predictionColName) + = model.findSummaryModelWithProbabilityColAndPredictionCol() val m = if (!isMultinomial) { --- End diff -- How about scala val logRegSummary = if (!isMultinomial) { new BinaryLogisticRegressionTrainingSummary(...) } else { ... } model.setSummary(logRegSummary) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r94620487 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1120,21 +1239,129 @@ sealed trait LogisticRegressionSummary extends Serializable { /** Field in "predictions" which gives the probability of each class as a vector. */ def probabilityCol: String + /** Field in "predictions" which gives the prediction of each class as a vector. */ + def predictionCol: String + /** Field in "predictions" which gives the true label of each instance (if available). */ def labelCol: String /** Field in "predictions" which gives the features of each instance as a vector. */ def featuresCol: String + /** Returns false positive rate for each label. */ + @Since("2.1.0") + def falsePositiveRateByLabel: Array[Double] + + /** Returns precision for each label. */ + @Since("2.1.0") + def precisionByLabel: Array[Double] + + /** Returns recall for each label. */ + @Since("2.1.0") + def recallByLabel: Array[Double] + + /** + * Returns f-measure for each label. + * @param beta the beta parameter. + */ + @Since("2.1.0") + def fMeasureByLabel(beta: Double): Array[Double] + + /** Returns f1-measure for each label. */ + @Since("2.1.0") + def fMeasureByLabel: Array[Double] + + /** Returns accuracy. */ + @Since("2.1.0") + def accuracy: Double + + /** Returns weighted false positive rate. */ + @Since("2.1.0") + def weightedFalsePositiveRate: Double + + /** Returns weighted averaged recall. */ + @Since("2.1.0") + def weightedRecall: Double + + /** Returns weighted averaged precision. */ + @Since("2.1.0") + def weightedPrecision: Double + + /** + * Returns weighted averaged f-measure. + * @param beta the beta parameter. + */ + @Since("2.1.0") + def weightedFMeasure(beta: Double): Double + + /** Returns weighted averaged f1-measure. */ + @Since("2.1.0") + def weightedFMeasure: Double } /** * :: Experimental :: - * Logistic regression training results. + * Multinomial Logistic regression training results. * * @param predictions dataframe output by the model's `transform` method. * @param probabilityCol field in "predictions" which gives the probability of * each class as a vector. + * @param predictionCol field in "predictions" which gives the prediction of + * each class as a vector. + * @param labelCol field in "predictions" which gives the true label of each instance. + * @param featuresCol field in "predictions" which gives the features of each instance as a vector. + * @param objectiveHistory objective function (scaled loss + regularization) at each iteration. + */ +@Experimental +@Since("1.5.0") +class MultinomialLogisticRegressionTrainingSummary private[classification] ( +predictions: DataFrame, +probabilityCol: String, +predictionCol: String, +labelCol: String, +featuresCol: String, +@Since("1.5.0") val objectiveHistory: Array[Double]) + extends MultinomialLogisticRegressionSummary( +predictions, probabilityCol, predictionCol, labelCol, featuresCol) +with LogisticRegressionTrainingSummary { + +} + +/** + * :: Experimental :: + * Multinomial Logistic regression results for a given model. + * + * @param predictions dataframe output by the model's `transform` method. + * @param probabilityCol field in "predictions" which gives the probability of + * each class as a vector. + * @param predictionCol field in "predictions" which gives the prediction of + * each class as a vector. + * @param labelCol field in "predictions" which gives the true label of each instance. + * @param featuresCol field in "predictions" which gives the features of each instance as a vector. + */ +@Experimental +@Since("1.5.0") +class MultinomialLogisticRegressionSummary private[classification]( +@Since("2.1.0") @transient override val predictions: DataFrame, +@Since("2.1.0") override val probabilityCol: String, +@Since("2.1.0") override val predictionCol: String, +@Since("2.1.0") override val labelCol: String, +@Since("2.1.0") override val featuresCol: String) + extends MulticlassSummary ( +predictions, predictionCol, labelCol) + with LogisticRegressionSummary { + +} + +/** + * :: Experimental :: + * Binary Logistic regression training results. + * + * @param
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r94520863 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1120,21 +1239,129 @@ sealed trait LogisticRegressionSummary extends Serializable { /** Field in "predictions" which gives the probability of each class as a vector. */ def probabilityCol: String + /** Field in "predictions" which gives the prediction of each class as a vector. */ + def predictionCol: String + /** Field in "predictions" which gives the true label of each instance (if available). */ def labelCol: String /** Field in "predictions" which gives the features of each instance as a vector. */ def featuresCol: String + /** Returns false positive rate for each label. */ + @Since("2.1.0") + def falsePositiveRateByLabel: Array[Double] + + /** Returns precision for each label. */ + @Since("2.1.0") + def precisionByLabel: Array[Double] + + /** Returns recall for each label. */ + @Since("2.1.0") + def recallByLabel: Array[Double] + + /** + * Returns f-measure for each label. + * @param beta the beta parameter. + */ + @Since("2.1.0") + def fMeasureByLabel(beta: Double): Array[Double] + + /** Returns f1-measure for each label. */ + @Since("2.1.0") + def fMeasureByLabel: Array[Double] + + /** Returns accuracy. */ + @Since("2.1.0") + def accuracy: Double + + /** Returns weighted false positive rate. */ + @Since("2.1.0") + def weightedFalsePositiveRate: Double + + /** Returns weighted averaged recall. */ + @Since("2.1.0") + def weightedRecall: Double + + /** Returns weighted averaged precision. */ + @Since("2.1.0") + def weightedPrecision: Double + + /** + * Returns weighted averaged f-measure. + * @param beta the beta parameter. + */ + @Since("2.1.0") + def weightedFMeasure(beta: Double): Double + + /** Returns weighted averaged f1-measure. */ + @Since("2.1.0") + def weightedFMeasure: Double } /** * :: Experimental :: - * Logistic regression training results. + * Multinomial Logistic regression training results. * * @param predictions dataframe output by the model's `transform` method. * @param probabilityCol field in "predictions" which gives the probability of * each class as a vector. + * @param predictionCol field in "predictions" which gives the prediction of + * each class as a vector. + * @param labelCol field in "predictions" which gives the true label of each instance. + * @param featuresCol field in "predictions" which gives the features of each instance as a vector. + * @param objectiveHistory objective function (scaled loss + regularization) at each iteration. + */ +@Experimental +@Since("1.5.0") +class MultinomialLogisticRegressionTrainingSummary private[classification] ( --- End diff -- I prefer `MulticlassLogisticRegressionTrainingSummary`. "Multinomial" refers to the GLM family used to train the model, whereas "multiclass" refers to the type of classification. Since we already use "binary" and not "binomial" I think "multiclass" is best here and elsewhere. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r94640068 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -1762,51 +1781,101 @@ class LogisticRegressionSuite } test("evaluate on test set") { -// TODO: add for multiclass when model summary becomes available // Evaluate on test set should be same as that of the transformed training data. -val lr = new LogisticRegression() +val blor = new LogisticRegression() .setMaxIter(10) .setRegParam(1.0) .setThreshold(0.6) -val model = lr.fit(smallBinaryDataset) -val summary = model.summary.asInstanceOf[BinaryLogisticRegressionSummary] - -val sameSummary = - model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary] -assert(summary.areaUnderROC === sameSummary.areaUnderROC) -assert(summary.roc.collect() === sameSummary.roc.collect()) -assert(summary.pr.collect === sameSummary.pr.collect()) +val blorModel = blor.fit(smallBinaryDataset) +val blorSummary = blorModel.binarySummary + +val sameBlorSummary = + blorModel.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary] +assert(blorSummary.areaUnderROC === sameBlorSummary.areaUnderROC) +assert(blorSummary.roc.collect() === sameBlorSummary.roc.collect()) +assert(blorSummary.pr.collect === sameBlorSummary.pr.collect()) assert( - summary.fMeasureByThreshold.collect() === sameSummary.fMeasureByThreshold.collect()) -assert(summary.recallByThreshold.collect() === sameSummary.recallByThreshold.collect()) + blorSummary.fMeasureByThreshold.collect() === sameBlorSummary.fMeasureByThreshold.collect()) assert( - summary.precisionByThreshold.collect() === sameSummary.precisionByThreshold.collect()) + blorSummary.recallByThreshold.collect() === sameBlorSummary.recallByThreshold.collect()) +assert( + blorSummary.precisionByThreshold.collect() +=== sameBlorSummary.precisionByThreshold.collect()) + +val mlor = new LogisticRegression() + .setMaxIter(10) + .setRegParam(1.0) + .setFamily("multinomial") +val mlorModel = blor.fit(smallMultinomialDataset) +val mlorSummary = mlorModel.summary + +val mlorSameSummary = mlorModel.evaluate(smallMultinomialDataset) + .asInstanceOf[MultinomialLogisticRegressionSummary] + +assert(mlorSummary.falsePositiveRateByLabel === mlorSameSummary.falsePositiveRateByLabel) +assert(mlorSummary.precisionByLabel === mlorSameSummary.precisionByLabel) +assert(mlorSummary.recallByLabel === mlorSameSummary.recallByLabel) +assert(mlorSummary.fMeasureByLabel === mlorSameSummary.fMeasureByLabel) +assert(mlorSummary.accuracy === mlorSameSummary.accuracy) +assert(mlorSummary.weightedFalsePositiveRate === mlorSameSummary.weightedFalsePositiveRate) +assert(mlorSummary.weightedPrecision === mlorSameSummary.weightedPrecision) +assert(mlorSummary.weightedRecall === mlorSameSummary.weightedRecall) +assert(mlorSummary.weightedFMeasure === mlorSameSummary.weightedFMeasure) } test("evaluate with labels that are not doubles") { // Evaluate a test set with Label that is a numeric type other than Double -val lr = new LogisticRegression() +val blor = new LogisticRegression() .setMaxIter(1) .setRegParam(1.0) --- End diff -- add `.setFamily("binomial")` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r94638575 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -1762,51 +1781,101 @@ class LogisticRegressionSuite } test("evaluate on test set") { -// TODO: add for multiclass when model summary becomes available // Evaluate on test set should be same as that of the transformed training data. -val lr = new LogisticRegression() +val blor = new LogisticRegression() .setMaxIter(10) .setRegParam(1.0) .setThreshold(0.6) -val model = lr.fit(smallBinaryDataset) -val summary = model.summary.asInstanceOf[BinaryLogisticRegressionSummary] - -val sameSummary = - model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary] -assert(summary.areaUnderROC === sameSummary.areaUnderROC) -assert(summary.roc.collect() === sameSummary.roc.collect()) -assert(summary.pr.collect === sameSummary.pr.collect()) +val blorModel = blor.fit(smallBinaryDataset) +val blorSummary = blorModel.binarySummary + +val sameBlorSummary = + blorModel.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary] +assert(blorSummary.areaUnderROC === sameBlorSummary.areaUnderROC) +assert(blorSummary.roc.collect() === sameBlorSummary.roc.collect()) +assert(blorSummary.pr.collect === sameBlorSummary.pr.collect()) assert( - summary.fMeasureByThreshold.collect() === sameSummary.fMeasureByThreshold.collect()) -assert(summary.recallByThreshold.collect() === sameSummary.recallByThreshold.collect()) + blorSummary.fMeasureByThreshold.collect() === sameBlorSummary.fMeasureByThreshold.collect()) assert( - summary.precisionByThreshold.collect() === sameSummary.precisionByThreshold.collect()) + blorSummary.recallByThreshold.collect() === sameBlorSummary.recallByThreshold.collect()) +assert( + blorSummary.precisionByThreshold.collect() +=== sameBlorSummary.precisionByThreshold.collect()) + +val mlor = new LogisticRegression() --- End diff -- This is never used since you fit with `blor` below. We can probably make a single estimator called `lr` and then just reuse it for both tests. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r94616573 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1120,21 +1239,129 @@ sealed trait LogisticRegressionSummary extends Serializable { /** Field in "predictions" which gives the probability of each class as a vector. */ def probabilityCol: String + /** Field in "predictions" which gives the prediction of each class as a vector. */ + def predictionCol: String + /** Field in "predictions" which gives the true label of each instance (if available). */ def labelCol: String /** Field in "predictions" which gives the features of each instance as a vector. */ def featuresCol: String + /** Returns false positive rate for each label. */ + @Since("2.1.0") + def falsePositiveRateByLabel: Array[Double] + + /** Returns precision for each label. */ + @Since("2.1.0") + def precisionByLabel: Array[Double] + + /** Returns recall for each label. */ + @Since("2.1.0") + def recallByLabel: Array[Double] + + /** + * Returns f-measure for each label. + * @param beta the beta parameter. + */ + @Since("2.1.0") + def fMeasureByLabel(beta: Double): Array[Double] + + /** Returns f1-measure for each label. */ + @Since("2.1.0") + def fMeasureByLabel: Array[Double] + + /** Returns accuracy. */ + @Since("2.1.0") + def accuracy: Double + + /** Returns weighted false positive rate. */ + @Since("2.1.0") + def weightedFalsePositiveRate: Double + + /** Returns weighted averaged recall. */ + @Since("2.1.0") + def weightedRecall: Double + + /** Returns weighted averaged precision. */ + @Since("2.1.0") + def weightedPrecision: Double + + /** + * Returns weighted averaged f-measure. + * @param beta the beta parameter. + */ + @Since("2.1.0") + def weightedFMeasure(beta: Double): Double + + /** Returns weighted averaged f1-measure. */ + @Since("2.1.0") + def weightedFMeasure: Double } /** * :: Experimental :: - * Logistic regression training results. + * Multinomial Logistic regression training results. --- End diff -- nit: don't capitalize logistic here and elsewhere --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r94520673 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -788,19 +797,39 @@ class LogisticRegressionModel private[spark] ( throw new SparkException("No training summary available for this LogisticRegressionModel") } + @Since("2.1.0") + def binarySummary: BinaryLogisticRegressionTrainingSummary = summary match { +case b: BinaryLogisticRegressionTrainingSummary => b +case _ => + throw new RuntimeException("Cannot create a binarySummary for a non-binary model" + +s"(numClasses=${numClasses}), use multinomialSummary instead.") + } + /** * If the probability column is set returns the current model and probability column, --- End diff -- These lines should be deleted I think. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r94615507 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1095,6 +1131,89 @@ private[classification] class MultiClassSummarizer extends Serializable { } /** + * :: Experimental :: + * Summary of multi-classification algorithms. + * + * @param predictions [[DataFrame]] produced by model.transform(). + * @param predictionCol Name for column of prediction in `predictions`. + * @param labelCol Name for column of label in `predictions`. + */ +@Experimental +@Since("2.1.0") +class MulticlassSummary private[ml] ( +@transient val predictions: DataFrame, +val predictionCol: String, +val labelCol: String) extends Serializable { + + @transient private val multinomialMetrics = { --- End diff -- "multiclassMetrics" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r94613945 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -788,19 +797,39 @@ class LogisticRegressionModel private[spark] ( throw new SparkException("No training summary available for this LogisticRegressionModel") } + @Since("2.1.0") + def binarySummary: BinaryLogisticRegressionTrainingSummary = summary match { +case b: BinaryLogisticRegressionTrainingSummary => b +case _ => + throw new RuntimeException("Cannot create a binarySummary for a non-binary model" + +s"(numClasses=${numClasses}), use multinomialSummary instead.") --- End diff -- "multinomialSummary" => "summary" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r94616155 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1095,6 +1131,89 @@ private[classification] class MultiClassSummarizer extends Serializable { } /** + * :: Experimental :: + * Summary of multi-classification algorithms. + * + * @param predictions [[DataFrame]] produced by model.transform(). + * @param predictionCol Name for column of prediction in `predictions`. + * @param labelCol Name for column of label in `predictions`. + */ +@Experimental +@Since("2.1.0") +class MulticlassSummary private[ml] ( +@transient val predictions: DataFrame, +val predictionCol: String, +val labelCol: String) extends Serializable { + + @transient private val multinomialMetrics = { +new MulticlassMetrics( + predictions.select( +col(predictionCol), +col(labelCol).cast(DoubleType)) +.rdd.map { case Row(prediction: Double, label: Double) => (prediction, label) }) + } + + /** Returns false positive rate for each label. */ + @Since("2.1.0") + @transient lazy val falsePositiveRateByLabel: Array[Double] = { +multinomialMetrics.labels.map(label => multinomialMetrics.falsePositiveRate(label)) + } + + /** Returns precision for each label. */ + @Since("2.1.0") + @transient lazy val precisionByLabel: Array[Double] = { +multinomialMetrics.labels.map(label => multinomialMetrics.precision(label)) + } + + /** Returns recall for each label. */ + @Since("2.1.0") + @transient lazy val recallByLabel: Array[Double] = { +multinomialMetrics.labels.map(label => multinomialMetrics.recall(label)) + } + + /** + * Returns f-measure for each label. + * @param beta the beta parameter. + */ + @Since("2.1.0") + def fMeasureByLabel(beta: Double): Array[Double] = { +multinomialMetrics.labels.map(label => multinomialMetrics.fMeasure(label, beta)) + } + + /** Returns f1-measure for each label. */ + @Since("2.1.0") + @transient lazy val fMeasureByLabel: Array[Double] = fMeasureByLabel(1.0) + + /** Returns accuracy. */ + @Since("2.1.0") + @transient lazy val accuracy: Double = multinomialMetrics.accuracy + + /** Returns weighted false positive rate. */ + @Since("2.1.0") + @transient lazy val weightedFalsePositiveRate: Double += multinomialMetrics.weightedFalsePositiveRate + + /** Returns weighted averaged recall. */ + @Since("2.1.0") + @transient lazy val weightedRecall: Double = multinomialMetrics.weightedRecall + + /** Returns weighted averaged precision. */ + @Since("2.1.0") + @transient lazy val weightedPrecision: Double = multinomialMetrics.weightedPrecision + + /** + * Returns weighted averaged f-measure. + * @param beta the beta parameter. --- End diff -- Same as above. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r94615417 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1095,6 +1131,89 @@ private[classification] class MultiClassSummarizer extends Serializable { } /** + * :: Experimental :: + * Summary of multi-classification algorithms. --- End diff -- "Summary of multiclass classification results." --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r94617024 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1159,17 +1388,23 @@ class BinaryLogisticRegressionTrainingSummary private[classification] ( * @param predictions dataframe output by the model's `transform` method. * @param probabilityCol field in "predictions" which gives the probability of * each class as a vector. + * @param predictionCol field in "predictions" which gives the prediction of + * each class as a vector. --- End diff -- nit: indentation --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r94521336 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1095,6 +1131,89 @@ private[classification] class MultiClassSummarizer extends Serializable { } /** + * :: Experimental :: + * Summary of multi-classification algorithms. + * + * @param predictions [[DataFrame]] produced by model.transform(). + * @param predictionCol Name for column of prediction in `predictions`. + * @param labelCol Name for column of label in `predictions`. + */ +@Experimental +@Since("2.1.0") +class MulticlassSummary private[ml] ( +@transient val predictions: DataFrame, +val predictionCol: String, +val labelCol: String) extends Serializable { + + @transient private val multinomialMetrics = { +new MulticlassMetrics( + predictions.select( +col(predictionCol), +col(labelCol).cast(DoubleType)) +.rdd.map { case Row(prediction: Double, label: Double) => (prediction, label) }) + } + + /** Returns false positive rate for each label. */ + @Since("2.1.0") + @transient lazy val falsePositiveRateByLabel: Array[Double] = { +multinomialMetrics.labels.map(label => multinomialMetrics.falsePositiveRate(label)) + } + + /** Returns precision for each label. */ + @Since("2.1.0") + @transient lazy val precisionByLabel: Array[Double] = { +multinomialMetrics.labels.map(label => multinomialMetrics.precision(label)) + } + + /** Returns recall for each label. */ + @Since("2.1.0") + @transient lazy val recallByLabel: Array[Double] = { +multinomialMetrics.labels.map(label => multinomialMetrics.recall(label)) + } + + /** + * Returns f-measure for each label. + * @param beta the beta parameter. --- End diff -- This description is not helpful. Let's either put nothing, or say something like "parameter which controls the balance between precision and recall in the f-measure" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r94520622 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -788,19 +797,39 @@ class LogisticRegressionModel private[spark] ( throw new SparkException("No training summary available for this LogisticRegressionModel") } + @Since("2.1.0") + def binarySummary: BinaryLogisticRegressionTrainingSummary = summary match { +case b: BinaryLogisticRegressionTrainingSummary => b +case _ => + throw new RuntimeException("Cannot create a binarySummary for a non-binary model" + +s"(numClasses=${numClasses}), use multinomialSummary instead.") + } + /** * If the probability column is set returns the current model and probability column, * otherwise generates a new column and sets it as the probability column on a new copy * of the current model. + * If the probability column and prediction column is set returns the current model, otherwise + * generate new columns for them and set them as probability or prediction column on a new copy + * of the current model */ - private[classification] def findSummaryModelAndProbabilityCol(): - (LogisticRegressionModel, String) = { -$(probabilityCol) match { - case "" => -val probabilityColName = "probability_" + java.util.UUID.randomUUID.toString -(copy(ParamMap.empty).setProbabilityCol(probabilityColName), probabilityColName) - case p => (this, p) + private[classification] def findSummaryModelWithProbabilityColAndPredictionCol(): --- End diff -- Maybe we can just call it `findSummaryModel` and get rid of the vars: scala private[classification] def findSummaryModel(): (LogisticRegressionModel, String, String) = { val model = if ($(probabilityCol).isEmpty && $(predictionCol).isEmpty) { copy(ParamMap.empty) .setProbabilityCol("probability_" + java.util.UUID.randomUUID.toString) .setPredictionCol("prediction_" + java.util.UUID.randomUUID.toString) } else if ($(probabilityCol).isEmpty) { copy(ParamMap.empty).setProbabilityCol("probability_" + java.util.UUID.randomUUID.toString) } else if ($(predictionCol).isEmpty) { copy(ParamMap.empty).setPredictionCol("prediction_" + java.util.UUID.randomUUID.toString) } else { this } (model, model.getProbabilityCol, model.getPredictionCol) } --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r94521006 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1095,6 +1131,89 @@ private[classification] class MultiClassSummarizer extends Serializable { } /** + * :: Experimental :: + * Summary of multi-classification algorithms. + * + * @param predictions [[DataFrame]] produced by model.transform(). + * @param predictionCol Name for column of prediction in `predictions`. + * @param labelCol Name for column of label in `predictions`. + */ +@Experimental +@Since("2.1.0") +class MulticlassSummary private[ml] ( --- End diff -- I prefer `MulticlassClassificationSummary` especially since we have a `MulticlassSummarizer` in the same file :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r94618477 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -2006,7 +2075,7 @@ class LogisticRegressionSuite assert(prob === Vectors.dense(Array(0.0, 0.0, 0.0, 0.0, 1.0, 0.0))) assert(pred === 4.0) } -// TODO: check num iters is zero when it become available in the model +require(modelWithMetadata.summary.totalIterations == 0) --- End diff -- assert with `===` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r94638095 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -149,15 +149,34 @@ class LogisticRegressionSuite assert(copiedModel.hasSummary) } - test("empty probabilityCol") { -val lr = new LogisticRegression().setProbabilityCol("") -val model = lr.fit(smallBinaryDataset) + test("empty probabilityCol or predictionCol") { +var lr = new LogisticRegression().setProbabilityCol("") --- End diff -- I think we can get rid of the vars and shorten this up. Also, we should probably test both binary and multiclass. How does this look?: scala val lr = new LogisticRegression().setMaxIter(1) val datasetFieldNames = smallBinaryDataset.schema.fieldNames.toSet def checkSummarySchema(model: LogisticRegressionModel, columns: Seq[String]): Unit = { val fieldNames = model.summary.predictions.schema.fieldNames assert(model.hasSummary) assert(datasetFieldNames.subsetOf(fieldNames.toSet)) columns.foreach { c => assert(fieldNames.exists(_.startsWith(c))) } } Seq("binomial", "multinomial").foreach { family => lr.setProbabilityCol("") val modelNoProb = lr.fit(smallBinaryDataset) checkSummarySchema(modelNoProb, Seq("probability_")) lr.setProbabilityCol("probability").setPredictionCol("") val modelNoPred = lr.fit(smallBinaryDataset) checkSummarySchema(modelNoPred, Seq("prediction_")) lr.setProbabilityCol("").setPredictionCol("") val modelNoPredNoProb = lr.fit(smallBinaryDataset) checkSummarySchema(modelNoPredNoProb, Seq("prediction_", "probability_")) } --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r94621426 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -1762,51 +1781,101 @@ class LogisticRegressionSuite } test("evaluate on test set") { -// TODO: add for multiclass when model summary becomes available // Evaluate on test set should be same as that of the transformed training data. -val lr = new LogisticRegression() +val blor = new LogisticRegression() .setMaxIter(10) .setRegParam(1.0) .setThreshold(0.6) -val model = lr.fit(smallBinaryDataset) -val summary = model.summary.asInstanceOf[BinaryLogisticRegressionSummary] - -val sameSummary = - model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary] -assert(summary.areaUnderROC === sameSummary.areaUnderROC) -assert(summary.roc.collect() === sameSummary.roc.collect()) -assert(summary.pr.collect === sameSummary.pr.collect()) +val blorModel = blor.fit(smallBinaryDataset) +val blorSummary = blorModel.binarySummary + +val sameBlorSummary = + blorModel.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary] +assert(blorSummary.areaUnderROC === sameBlorSummary.areaUnderROC) +assert(blorSummary.roc.collect() === sameBlorSummary.roc.collect()) +assert(blorSummary.pr.collect === sameBlorSummary.pr.collect()) assert( - summary.fMeasureByThreshold.collect() === sameSummary.fMeasureByThreshold.collect()) -assert(summary.recallByThreshold.collect() === sameSummary.recallByThreshold.collect()) + blorSummary.fMeasureByThreshold.collect() === sameBlorSummary.fMeasureByThreshold.collect()) assert( - summary.precisionByThreshold.collect() === sameSummary.precisionByThreshold.collect()) + blorSummary.recallByThreshold.collect() === sameBlorSummary.recallByThreshold.collect()) +assert( + blorSummary.precisionByThreshold.collect() +=== sameBlorSummary.precisionByThreshold.collect()) + +val mlor = new LogisticRegression() + .setMaxIter(10) + .setRegParam(1.0) + .setFamily("multinomial") +val mlorModel = blor.fit(smallMultinomialDataset) +val mlorSummary = mlorModel.summary + +val mlorSameSummary = mlorModel.evaluate(smallMultinomialDataset) + .asInstanceOf[MultinomialLogisticRegressionSummary] + +assert(mlorSummary.falsePositiveRateByLabel === mlorSameSummary.falsePositiveRateByLabel) +assert(mlorSummary.precisionByLabel === mlorSameSummary.precisionByLabel) +assert(mlorSummary.recallByLabel === mlorSameSummary.recallByLabel) +assert(mlorSummary.fMeasureByLabel === mlorSameSummary.fMeasureByLabel) +assert(mlorSummary.accuracy === mlorSameSummary.accuracy) +assert(mlorSummary.weightedFalsePositiveRate === mlorSameSummary.weightedFalsePositiveRate) +assert(mlorSummary.weightedPrecision === mlorSameSummary.weightedPrecision) +assert(mlorSummary.weightedRecall === mlorSameSummary.weightedRecall) +assert(mlorSummary.weightedFMeasure === mlorSameSummary.weightedFMeasure) } test("evaluate with labels that are not doubles") { // Evaluate a test set with Label that is a numeric type other than Double -val lr = new LogisticRegression() +val blor = new LogisticRegression() .setMaxIter(1) .setRegParam(1.0) -val model = lr.fit(smallBinaryDataset) -val summary = model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary] +val blorModel = blor.fit(smallBinaryDataset) +val blorSummary = blorModel.evaluate(smallBinaryDataset) + .asInstanceOf[BinaryLogisticRegressionSummary] -val longLabelData = smallBinaryDataset.select(col(model.getLabelCol).cast(LongType), - col(model.getFeaturesCol)) -val longSummary = model.evaluate(longLabelData).asInstanceOf[BinaryLogisticRegressionSummary] +val blorLongLabelData = smallBinaryDataset.select(col(blorModel.getLabelCol).cast(LongType), + col(blorModel.getFeaturesCol)) +val blorLongSummary = blorModel.evaluate(blorLongLabelData) + .asInstanceOf[BinaryLogisticRegressionSummary] -assert(summary.areaUnderROC === longSummary.areaUnderROC) +assert(blorSummary.areaUnderROC === blorLongSummary.areaUnderROC) + +val mlor = new LogisticRegression() + .setMaxIter(1) + .setRegParam(1.0) + .setFamily("multinomial") +val mlorModel = mlor.fit(smallMultinomialDataset) +val mlorSummary =
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r94615848 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1095,6 +1131,89 @@ private[classification] class MultiClassSummarizer extends Serializable { } /** + * :: Experimental :: + * Summary of multi-classification algorithms. + * + * @param predictions [[DataFrame]] produced by model.transform(). + * @param predictionCol Name for column of prediction in `predictions`. + * @param labelCol Name for column of label in `predictions`. + */ +@Experimental +@Since("2.1.0") +class MulticlassSummary private[ml] ( +@transient val predictions: DataFrame, +val predictionCol: String, +val labelCol: String) extends Serializable { + + @transient private val multinomialMetrics = { +new MulticlassMetrics( + predictions.select( +col(predictionCol), +col(labelCol).cast(DoubleType)) +.rdd.map { case Row(prediction: Double, label: Double) => (prediction, label) }) + } + + /** Returns false positive rate for each label. */ + @Since("2.1.0") + @transient lazy val falsePositiveRateByLabel: Array[Double] = { +multinomialMetrics.labels.map(label => multinomialMetrics.falsePositiveRate(label)) + } + + /** Returns precision for each label. */ + @Since("2.1.0") --- End diff -- Update all since tags --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r85220020 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -777,6 +787,20 @@ class LogisticRegressionModel private[spark] ( throw new SparkException("No training summary available for this LogisticRegressionModel") } + @Since("2.1.0") + def binarySummary: BinaryLogisticRegressionTrainingSummary = summary match { +case b: BinaryLogisticRegressionTrainingSummary => b +case _ => throw new SparkException("Summary type dismatch, " + + "BinaryLogisticRegressionTrainingSummary expected.") + } + + @Since("2.1.0") + def multinomialSummary: MultinomialLogisticRegressionTrainingSummary = summary match { +case m: MultinomialLogisticRegressionTrainingSummary => m +case _ => throw new SparkException("Summary type dismatch, " + --- End diff -- We actually could return a valid multinomial summary, even for binary problems. I'd recommend we do that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r85220068 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1141,6 +1174,21 @@ class BinaryLogisticRegressionTrainingSummary private[classification] ( } +@Experimental --- End diff -- Add scala doc --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r85218915 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -777,6 +787,20 @@ class LogisticRegressionModel private[spark] ( throw new SparkException("No training summary available for this LogisticRegressionModel") } + @Since("2.1.0") + def binarySummary: BinaryLogisticRegressionTrainingSummary = summary match { +case b: BinaryLogisticRegressionTrainingSummary => b +case _ => throw new SparkException("Summary type dismatch, " + --- End diff -- Don't use SparkException here; that's primarily for failures within Spark jobs AFAIK. I guess I'd use RuntimeException. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r85220866 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1390,7 +1505,8 @@ class BinaryLogisticRegressionSummary private[classification] ( *$$ * * - * @param bcCoefficients The broadcast coefficients corresponding to the features. + * + * @param bcCoefficients The broadcast coefficients corresponding to the features. --- End diff -- fix indentation --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r85220059 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -919,6 +948,10 @@ class LogisticRegressionModel private[spark] ( } } + private[classification] val probability2predictionUdf = udf { --- End diff -- This UDF might pull this entire class into the closure and thus into the summaries. I believe we can eliminate this UDF; I'll comment more below. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r85218235 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -721,7 +730,8 @@ class LogisticRegressionModel private[spark] ( /** * The model intercept for "binomial" logistic regression. If this model was fit with the * "multinomial" family then an exception is thrown. - * @return Double +* --- End diff -- same here: remove these 2 lines since they don't say anything useful. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r85218200 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -698,7 +706,8 @@ class LogisticRegressionModel private[spark] ( /** * A vector of model coefficients for "binomial" logistic regression. If this model was trained * using the "multinomial" family then an exception is thrown. - * @return Vector +* --- End diff -- I'd actually just remove these 2 lines since they don't say anything useful. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r85219325 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -777,6 +787,20 @@ class LogisticRegressionModel private[spark] ( throw new SparkException("No training summary available for this LogisticRegressionModel") } + @Since("2.1.0") + def binarySummary: BinaryLogisticRegressionTrainingSummary = summary match { +case b: BinaryLogisticRegressionTrainingSummary => b +case _ => throw new SparkException("Summary type dismatch, " + --- End diff -- Also, the current error doesn't really help the user. How about "Cannot create a binarySummary for a non-binary model (with numClasses = $numClasses). Use multinomialSummary instead." --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r85218415 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -777,6 +787,20 @@ class LogisticRegressionModel private[spark] ( throw new SparkException("No training summary available for this LogisticRegressionModel") } + @Since("2.1.0") --- End diff -- Add Scala doc, especially saying that this will throw an exception when numClasses > 2. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r85220713 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1243,6 +1291,73 @@ class BinaryLogisticRegressionSummary private[classification] ( } } +@Experimental +@Since("2.1.0") +class MultinomialLogisticRegressionSummary private[classification] ( +@Since("1.5.0") @transient override val predictions: DataFrame, +@Since("2.1.0") override val probabilityCol: String, +@Since("1.5.0") override val labelCol: String, +@Since("1.6.0") override val featuresCol: String, +private val probability2PredictionUdf: UserDefinedFunction) extends LogisticRegressionSummary { + + private val sparkSession = predictions.sparkSession + import sparkSession.implicits._ + + @transient private val multinomialMetrics = { +new MulticlassMetrics( + predictions.select( +probability2PredictionUdf(col(probabilityCol)), +col(labelCol).cast(DoubleType)) +.rdd.map { case Row(prediction: Double, label: Double) => (prediction, label) }) + } + + @Since("2.1.0") + @transient lazy val labels: Array[Double] = multinomialMetrics.labels --- End diff -- Add scala doc for all of these --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r85218947 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -777,6 +787,20 @@ class LogisticRegressionModel private[spark] ( throw new SparkException("No training summary available for this LogisticRegressionModel") } + @Since("2.1.0") + def binarySummary: BinaryLogisticRegressionTrainingSummary = summary match { +case b: BinaryLogisticRegressionTrainingSummary => b +case _ => throw new SparkException("Summary type dismatch, " + + "BinaryLogisticRegressionTrainingSummary expected.") + } + + @Since("2.1.0") --- End diff -- Add doc here too --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r85220564 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1243,6 +1291,73 @@ class BinaryLogisticRegressionSummary private[classification] ( } } +@Experimental +@Since("2.1.0") +class MultinomialLogisticRegressionSummary private[classification] ( +@Since("1.5.0") @transient override val predictions: DataFrame, +@Since("2.1.0") override val probabilityCol: String, +@Since("1.5.0") override val labelCol: String, +@Since("1.6.0") override val featuresCol: String, +private val probability2PredictionUdf: UserDefinedFunction) extends LogisticRegressionSummary { + + private val sparkSession = predictions.sparkSession --- End diff -- Do you need these 2 lines? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r85220449 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1243,6 +1291,73 @@ class BinaryLogisticRegressionSummary private[classification] ( } } +@Experimental +@Since("2.1.0") +class MultinomialLogisticRegressionSummary private[classification] ( +@Since("1.5.0") @transient override val predictions: DataFrame, --- End diff -- These since tags should all be 2.1.0. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r85220375 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1243,6 +1291,73 @@ class BinaryLogisticRegressionSummary private[classification] ( } } +@Experimental +@Since("2.1.0") +class MultinomialLogisticRegressionSummary private[classification] ( +@Since("1.5.0") @transient override val predictions: DataFrame, +@Since("2.1.0") override val probabilityCol: String, +@Since("1.5.0") override val labelCol: String, +@Since("1.6.0") override val featuresCol: String, +private val probability2PredictionUdf: UserDefinedFunction) extends LogisticRegressionSummary { + + private val sparkSession = predictions.sparkSession + import sparkSession.implicits._ + + @transient private val multinomialMetrics = { +new MulticlassMetrics( + predictions.select( +probability2PredictionUdf(col(probabilityCol)), --- End diff -- This is the only place where you need the probability2predictionUDF. Rather than passing a UDF or the model, I'd like us to modify findSummaryModelAndProbabilityCol to also set predictionCol if needed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r84468260 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -1756,55 +1765,105 @@ class LogisticRegressionSuite } test("evaluate on test set") { -// TODO: add for multiclass when model summary becomes available // Evaluate on test set should be same as that of the transformed training data. -val lr = new LogisticRegression() +val blor = new LogisticRegression() .setMaxIter(10) .setRegParam(1.0) .setThreshold(0.6) -val model = lr.fit(smallBinaryDataset) -val summary = model.summary.asInstanceOf[BinaryLogisticRegressionSummary] - -val sameSummary = - model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary] -assert(summary.areaUnderROC === sameSummary.areaUnderROC) -assert(summary.roc.collect() === sameSummary.roc.collect()) -assert(summary.pr.collect === sameSummary.pr.collect()) +val blorModel = blor.fit(smallBinaryDataset) +val blorSummary = blorModel.binarySummary + +val blorSameSummary = + blorModel.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary] +assert(blorSummary.areaUnderROC === blorSameSummary.areaUnderROC) +assert(blorSummary.roc.collect() === blorSameSummary.roc.collect()) +assert(blorSummary.pr.collect === blorSameSummary.pr.collect()) assert( - summary.fMeasureByThreshold.collect() === sameSummary.fMeasureByThreshold.collect()) -assert(summary.recallByThreshold.collect() === sameSummary.recallByThreshold.collect()) + blorSummary.fMeasureByThreshold.collect() === blorSameSummary.fMeasureByThreshold.collect()) +assert(blorSummary.recallByThreshold.collect() + === blorSameSummary.recallByThreshold.collect()) assert( - summary.precisionByThreshold.collect() === sameSummary.precisionByThreshold.collect()) + blorSummary.precisionByThreshold.collect() +=== blorSameSummary.precisionByThreshold.collect()) + +val mlor = new LogisticRegression() + .setMaxIter(10) + .setRegParam(1.0) + .setFamily("multinomial") +val mlorModel = mlor.fit(smallMultinomialDataset) +val mlorSummary = mlorModel.multinomialSummary + +val mlorSameSummary = mlorModel.evaluate(smallMultinomialDataset) +.asInstanceOf[MultinomialLogisticRegressionSummary] + +assert(mlorSummary.labels === mlorSameSummary.labels) +assert(mlorSummary.falsePositiveRateByLabel === mlorSameSummary.falsePositiveRateByLabel) +assert(mlorSummary.precisionByLabel === mlorSameSummary.precisionByLabel) +assert(mlorSummary.recallByLabel === mlorSameSummary.recallByLabel) +assert(mlorSummary.fMeasureByLabel === mlorSameSummary.fMeasureByLabel) +assert(mlorSummary.accuracy === mlorSameSummary.accuracy) +assert(mlorSummary.weightedFalsePositiveRate === mlorSameSummary.weightedFalsePositiveRate) +assert(mlorSummary.weightedPrecision === mlorSameSummary.weightedPrecision) +assert(mlorSummary.weightedRecall === mlorSameSummary.weightedRecall) +assert(mlorSummary.weightedFMeasure === mlorSameSummary.weightedFMeasure) } test("evaluate with labels that are not doubles") { // Evaluate a test set with Label that is a numeric type other than Double -val lr = new LogisticRegression() +val blor = new LogisticRegression() .setMaxIter(1) .setRegParam(1.0) -val model = lr.fit(smallBinaryDataset) -val summary = model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary] +val blorModel = blor.fit(smallBinaryDataset) +val blorSummary = blorModel.evaluate(smallBinaryDataset) + .asInstanceOf[BinaryLogisticRegressionSummary] -val longLabelData = smallBinaryDataset.select(col(model.getLabelCol).cast(LongType), - col(model.getFeaturesCol)) -val longSummary = model.evaluate(longLabelData).asInstanceOf[BinaryLogisticRegressionSummary] +val blorLongLabelData = smallBinaryDataset.select(col(blorModel.getLabelCol).cast(LongType), --- End diff -- Non-double numeric datatypes other than LongType maybe also needed to test. Any thoughts? @sethah --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/15435 [SPARK-17139][ML] Add model summary for MultinomialLogisticRegression ## What changes were proposed in this pull request? Add model summary for MultinomialLogisticRegression. currently, using `MulticlassMetrics` to do the summary. because multi-class case has some differences with binary case, I just expose the `MulticlassMetrics` interface about summary for now. And I add `findSummaryModelAndPredictionCol` method because the `MulticlassMetrics` need the `predictionCol` value. Although the `predictionCol` can be calculated from `probabilityCol`, put some code here to do this thing will cause repeated code, so I do not use that way. ## How was this patch tested? Existing tests. (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) You can merge this pull request into a Git repository by running: $ git pull https://github.com/WeichenXu123/spark mlor_summary Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15435.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15435 commit e93740ead908251cebee22cc00059571a6e22a3f Author: WeichenXuDate: 2016-10-11T00:25:45Z update --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org