[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-28 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15435


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-22 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r134628661
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1324,90 +1350,136 @@ private[ml] class MultiClassSummarizer extends 
Serializable {
 }
 
 /**
- * Abstraction for multinomial Logistic Regression Training results.
- * Currently, the training summary ignores the training weights except
- * for the objective trace.
- */
-sealed trait LogisticRegressionTrainingSummary extends 
LogisticRegressionSummary {
-
-  /** objective function (scaled loss + regularization) at each iteration. 
*/
-  def objectiveHistory: Array[Double]
-
-  /** Number of training iterations until termination */
-  def totalIterations: Int = objectiveHistory.length
-
-}
-
-/**
- * Abstraction for Logistic Regression Results for a given model.
+ * Abstraction for logistic regression results for a given model.
  */
 sealed trait LogisticRegressionSummary extends Serializable {
 
   /**
* Dataframe output by the model's `transform` method.
*/
+  @Since("2.3.0")
--- End diff --

I see; I think this should work.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-22 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r134627545
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1324,90 +1354,147 @@ private[ml] class MultiClassSummarizer extends 
Serializable {
 }
 
 /**
- * Abstraction for multinomial Logistic Regression Training results.
- * Currently, the training summary ignores the training weights except
- * for the objective trace.
- */
-sealed trait LogisticRegressionTrainingSummary extends 
LogisticRegressionSummary {
-
-  /** objective function (scaled loss + regularization) at each iteration. 
*/
-  def objectiveHistory: Array[Double]
-
-  /** Number of training iterations until termination */
-  def totalIterations: Int = objectiveHistory.length
-
-}
-
-/**
- * Abstraction for Logistic Regression Results for a given model.
+ * Abstraction for logistic regression results for a given model.
  */
 sealed trait LogisticRegressionSummary extends Serializable {
 
   /**
* Dataframe output by the model's `transform` method.
*/
+  @Since("1.5.0")
   def predictions: DataFrame
 
   /** Field in "predictions" which gives the probability of each class as 
a vector. */
+  @Since("1.5.0")
   def probabilityCol: String
 
+  /** Field in "predictions" which gives the prediction of each class. */
+  @Since("2.3.0")
+  def predictionCol: String
+
   /** Field in "predictions" which gives the true label of each instance 
(if available). */
+  @Since("1.5.0")
   def labelCol: String
 
   /** Field in "predictions" which gives the features of each instance as 
a vector. */
+  @Since("1.6.0")
   def featuresCol: String
 
+  @transient private val multiclassMetrics = {
+new MulticlassMetrics(
+  predictions.select(
+col(predictionCol),
+col(labelCol).cast(DoubleType))
+.rdd.map { case Row(prediction: Double, label: Double) => 
(prediction, label) })
+  }
+
+  /**
+   * Returns the sequence of labels in ascending order
--- End diff --

Clarify: "Returns the sequence of labels in ascending order.  This order 
matches the order used in metrics which are specified as arrays over labels, 
e.g., truePositiveRateByLabel."


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-18 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133895488
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1574,18 +1588,17 @@ sealed trait 
BinaryLogisticRegressionTrainingSummary extends BinaryLogisticRegre
  * @param objectiveHistory objective function (scaled loss + 
regularization) at each iteration.
  */
 private class LogisticRegressionTrainingSummaryImpl(
-override val predictions: DataFrame,
-override val probabilityCol: String,
-override val predictionCol: String,
-override val labelCol: String,
-override val featuresCol: String,
-val objectiveHistory: Array[Double])
+predictions: DataFrame,
+probabilityCol: String,
+predictionCol: String,
+labelCol: String,
+featuresCol: String,
+override val objectiveHistory: Array[Double])
--- End diff --

This `val` cannot be removed because in base class it is a `def` so 
override here is required.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-18 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133895254
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1357,23 +1361,23 @@ sealed trait LogisticRegressionSummary extends 
Serializable {
   /**
* Dataframe output by the model's `transform` method.
*/
-  @Since("2.3.0")
+  @Since("1.5.0")
   def predictions: DataFrame
 
   /** Field in "predictions" which gives the probability of each class as 
a vector. */
-  @Since("2.3.0")
+  @Since("1.5.0")
   def probabilityCol: String
 
   /** Field in "predictions" which gives the prediction of each class. */
   @Since("2.3.0")
   def predictionCol: String
 
   /** Field in "predictions" which gives the true label of each instance 
(if available). */
-  @Since("2.3.0")
+  @Since("1.5.0")
   def labelCol: String
 
   /** Field in "predictions" which gives the features of each instance as 
a vector. */
-  @Since("2.3.0")
+  @Since("1.6.0")
--- End diff --

current master code featuresCol is marked "1.6.0".
I do not remove the `@since` into concrete impl class because they are all 
private.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133883361
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1324,90 +1350,136 @@ private[ml] class MultiClassSummarizer extends 
Serializable {
 }
 
 /**
- * Abstraction for multinomial Logistic Regression Training results.
- * Currently, the training summary ignores the training weights except
- * for the objective trace.
- */
-sealed trait LogisticRegressionTrainingSummary extends 
LogisticRegressionSummary {
-
-  /** objective function (scaled loss + regularization) at each iteration. 
*/
-  def objectiveHistory: Array[Double]
-
-  /** Number of training iterations until termination */
-  def totalIterations: Int = objectiveHistory.length
-
-}
-
-/**
- * Abstraction for Logistic Regression Results for a given model.
+ * Abstraction for logistic regression results for a given model.
  */
 sealed trait LogisticRegressionSummary extends Serializable {
 
   /**
* Dataframe output by the model's `transform` method.
*/
+  @Since("2.3.0")
--- End diff --

@jkbradley 
I think about this, currently the concrete impl class are marked as 
private, so it seems add @Since in `trait` is better (this is discussed with 
@sethah before) So what do you think about  this:
```
sealed trait LogisticRegressionSummary extends Serializable {

  /**
   * Dataframe output by the model's `transform` method.
   */
  @Since("1.5.0")
  def predictions: DataFrame

  /** Field in "predictions" which gives the probability of each class as a 
vector. */
  @Since("1.5.0")
  def probabilityCol: String

  /** Field in "predictions" which gives the prediction of each class. */
  @Since("2.3.0")
  def predictionCol: String

  /** Field in "predictions" which gives the true label of each instance 
(if available). */
  @Since("1.5.0")
  def labelCol: String

  /** Field in "predictions" which gives the features of each instance as a 
vector. */
  @Since("1.6.0")
  def featuresCol: String
  ...
}
```
I also check current master code and `featuresCol` is marked "1.6.0", 
others marked "1.5.0", and "predictionCol" is new added so mark it "2.3.0"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133844018
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1324,90 +1350,136 @@ private[ml] class MultiClassSummarizer extends 
Serializable {
 }
 
 /**
- * Abstraction for multinomial Logistic Regression Training results.
- * Currently, the training summary ignores the training weights except
- * for the objective trace.
- */
-sealed trait LogisticRegressionTrainingSummary extends 
LogisticRegressionSummary {
-
-  /** objective function (scaled loss + regularization) at each iteration. 
*/
-  def objectiveHistory: Array[Double]
-
-  /** Number of training iterations until termination */
-  def totalIterations: Int = objectiveHistory.length
-
-}
-
-/**
- * Abstraction for Logistic Regression Results for a given model.
+ * Abstraction for logistic regression results for a given model.
  */
 sealed trait LogisticRegressionSummary extends Serializable {
 
   /**
* Dataframe output by the model's `transform` method.
*/
+  @Since("2.3.0")
--- End diff --

I confirmed: These annotations are inherited, so users may be confused 
about why some Since annotations have changed.  E.g., 
BinaryLogisticRegressionSummary.featuresCol will change from 1.5.0 to 2.3.0.  
I'd just remove the ones in the trait.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133828976
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1484,3 +1556,96 @@ class BinaryLogisticRegressionSummary 
private[classification] (
 binaryMetrics.recallByThreshold().toDF("threshold", "recall")
   }
 }
+
+sealed trait BinaryLogisticRegressionTrainingSummary extends 
BinaryLogisticRegressionSummary
+  with LogisticRegressionTrainingSummary
+
+/**
+ * :: Experimental ::
+ * Multiclass logistic regression training results.
+ *
+ * @param predictions dataframe output by the model's `transform` method.
+ * @param probabilityCol field in "predictions" which gives the 
probability of
+ *   each class as a vector.
+ * @param predictionCol field in "predictions" which gives the prediction 
for a data instance as a
+ *  double.
+ * @param labelCol field in "predictions" which gives the true label of 
each instance.
+ * @param featuresCol field in "predictions" which gives the features of 
each instance as a vector.
+ * @param objectiveHistory objective function (scaled loss + 
regularization) at each iteration.
+ */
+private class LogisticRegressionTrainingSummaryImpl(
+override val predictions: DataFrame,
--- End diff --

I think you can leave out the "override val" since the value is passed to 
LogisticRegressionSummaryImpl, which provides it as a val.  Same for the other 
vals provided by LogisticRegressionSummaryImpl.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133771590
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -882,21 +882,28 @@ class LogisticRegression @Since("1.2.0") (
 
 val model = copyValues(new LogisticRegressionModel(uid, 
coefficientMatrix, interceptVector,
   numClasses, isMultinomial))
-// TODO: implement summary model for multinomial case
-val m = if (!isMultinomial) {
-  val (summaryModel, probabilityColName) = 
model.findSummaryModelAndProbabilityCol()
-  val logRegSummary = new BinaryLogisticRegressionTrainingSummary(
+
+val (summaryModel, probabilityColName, predictionColName) = 
model.findSummaryModel()
+val logRegSummary = if (numClasses <=2) {
--- End diff --

nit: space before "2"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133772559
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1324,90 +1350,136 @@ private[ml] class MultiClassSummarizer extends 
Serializable {
 }
 
 /**
- * Abstraction for multinomial Logistic Regression Training results.
- * Currently, the training summary ignores the training weights except
- * for the objective trace.
- */
-sealed trait LogisticRegressionTrainingSummary extends 
LogisticRegressionSummary {
-
-  /** objective function (scaled loss + regularization) at each iteration. 
*/
-  def objectiveHistory: Array[Double]
-
-  /** Number of training iterations until termination */
-  def totalIterations: Int = objectiveHistory.length
-
-}
-
-/**
- * Abstraction for Logistic Regression Results for a given model.
+ * Abstraction for logistic regression results for a given model.
  */
 sealed trait LogisticRegressionSummary extends Serializable {
 
   /**
* Dataframe output by the model's `transform` method.
*/
+  @Since("2.3.0")
--- End diff --

These are not new for this trait, so I'd remove the Since annotations.  
When in doubt, just put Since annotations in final or concrete classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133830867
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -199,15 +199,57 @@ class LogisticRegressionSuite
 }
   }
 
-  test("empty probabilityCol") {
-val lr = new LogisticRegression().setProbabilityCol("")
-val model = lr.fit(smallBinaryDataset)
-assert(model.hasSummary)
-// Validate that we re-insert a probability column for evaluation
-val fieldNames = model.summary.predictions.schema.fieldNames
-assert(smallBinaryDataset.schema.fieldNames.toSet.subsetOf(
-  fieldNames.toSet))
-assert(fieldNames.exists(s => s.startsWith("probability_")))
+  test("empty probabilityCol or predictionCol") {
+val lr = new LogisticRegression().setMaxIter(1)
+val datasetFieldNames = smallBinaryDataset.schema.fieldNames.toSet
+def checkSummarySchema(model: LogisticRegressionModel, columns: 
Seq[String]): Unit = {
+  val fieldNames = model.summary.predictions.schema.fieldNames
+  assert(model.hasSummary)
+  assert(datasetFieldNames.subsetOf(fieldNames.toSet))
+  columns.foreach { c => assert(fieldNames.exists(_.startsWith(c))) }
+}
+// check that the summary model adds the appropriate columns
+Seq(("binomial", smallBinaryDataset), ("multinomial", 
smallMultinomialDataset)).foreach {
+  case (family, dataset) =>
+lr.setFamily(family)
+lr.setProbabilityCol("").setPredictionCol("prediction")
+val modelNoProb = lr.fit(smallBinaryDataset)
+checkSummarySchema(modelNoProb, Seq("probability_"))
+
+lr.setProbabilityCol("probability").setPredictionCol("")
+val modelNoPred = lr.fit(smallBinaryDataset)
+checkSummarySchema(modelNoPred, Seq("prediction_"))
+
+lr.setProbabilityCol("").setPredictionCol("")
+val modelNoPredNoProb = lr.fit(smallBinaryDataset)
+checkSummarySchema(modelNoPredNoProb, Seq("prediction_", 
"probability_"))
+}
+  }
+
+  test("check summary types for binary and multiclass") {
+val lr = new LogisticRegression()
+  .setFamily("binomial")
+
+val blorModel = lr.fit(smallBinaryDataset)
+
assert(blorModel.summary.isInstanceOf[BinaryLogisticRegressionTrainingSummaryImpl])
--- End diff --

nit: You can test for the public trait 
BinaryLogisticRegressionTrainingSummary instead of the private implementation 
class.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133824486
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1324,90 +1350,136 @@ private[ml] class MultiClassSummarizer extends 
Serializable {
 }
 
 /**
- * Abstraction for multinomial Logistic Regression Training results.
- * Currently, the training summary ignores the training weights except
- * for the objective trace.
- */
-sealed trait LogisticRegressionTrainingSummary extends 
LogisticRegressionSummary {
-
-  /** objective function (scaled loss + regularization) at each iteration. 
*/
-  def objectiveHistory: Array[Double]
-
-  /** Number of training iterations until termination */
-  def totalIterations: Int = objectiveHistory.length
-
-}
-
-/**
- * Abstraction for Logistic Regression Results for a given model.
+ * Abstraction for logistic regression results for a given model.
  */
 sealed trait LogisticRegressionSummary extends Serializable {
 
   /**
* Dataframe output by the model's `transform` method.
*/
+  @Since("2.3.0")
   def predictions: DataFrame
 
   /** Field in "predictions" which gives the probability of each class as 
a vector. */
+  @Since("2.3.0")
   def probabilityCol: String
 
+  /** Field in "predictions" which gives the prediction of each class. */
+  @Since("2.3.0")
+  def predictionCol: String
+
   /** Field in "predictions" which gives the true label of each instance 
(if available). */
+  @Since("2.3.0")
   def labelCol: String
 
   /** Field in "predictions" which gives the features of each instance as 
a vector. */
+  @Since("2.3.0")
   def featuresCol: String
 
+  @transient private val multiclassMetrics = {
--- End diff --

MulticlassMetrics provides a ```labels``` field which returns the list of 
labels.  In most cases, this will be values ```{0.0, 1.0, ..., 
numClasses-1}```.  However, if the training set is missing a label, then all of 
the arrays over labels (e.g., from ```truePositiveRateByLabel```) will be of 
length numClasses-1 instead of the expected numClasses.  In the future, it'd be 
nice to fix this by having them always be of length numClasses.  For now, how 
about we provide the labels field with this kind of explanation?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133829306
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1484,3 +1556,96 @@ class BinaryLogisticRegressionSummary 
private[classification] (
 binaryMetrics.recallByThreshold().toDF("threshold", "recall")
   }
 }
+
+sealed trait BinaryLogisticRegressionTrainingSummary extends 
BinaryLogisticRegressionSummary
+  with LogisticRegressionTrainingSummary
+
+/**
+ * :: Experimental ::
+ * Multiclass logistic regression training results.
+ *
+ * @param predictions dataframe output by the model's `transform` method.
+ * @param probabilityCol field in "predictions" which gives the 
probability of
+ *   each class as a vector.
+ * @param predictionCol field in "predictions" which gives the prediction 
for a data instance as a
+ *  double.
+ * @param labelCol field in "predictions" which gives the true label of 
each instance.
+ * @param featuresCol field in "predictions" which gives the features of 
each instance as a vector.
+ * @param objectiveHistory objective function (scaled loss + 
regularization) at each iteration.
+ */
+private class LogisticRegressionTrainingSummaryImpl(
+override val predictions: DataFrame,
--- End diff --

Same for BinaryLogisticRegressionTrainingSummaryImpl


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133830662
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -199,15 +199,57 @@ class LogisticRegressionSuite
 }
   }
 
-  test("empty probabilityCol") {
-val lr = new LogisticRegression().setProbabilityCol("")
-val model = lr.fit(smallBinaryDataset)
-assert(model.hasSummary)
-// Validate that we re-insert a probability column for evaluation
-val fieldNames = model.summary.predictions.schema.fieldNames
-assert(smallBinaryDataset.schema.fieldNames.toSet.subsetOf(
-  fieldNames.toSet))
-assert(fieldNames.exists(s => s.startsWith("probability_")))
+  test("empty probabilityCol or predictionCol") {
+val lr = new LogisticRegression().setMaxIter(1)
+val datasetFieldNames = smallBinaryDataset.schema.fieldNames.toSet
+def checkSummarySchema(model: LogisticRegressionModel, columns: 
Seq[String]): Unit = {
+  val fieldNames = model.summary.predictions.schema.fieldNames
+  assert(model.hasSummary)
+  assert(datasetFieldNames.subsetOf(fieldNames.toSet))
+  columns.foreach { c => assert(fieldNames.exists(_.startsWith(c))) }
+}
+// check that the summary model adds the appropriate columns
+Seq(("binomial", smallBinaryDataset), ("multinomial", 
smallMultinomialDataset)).foreach {
+  case (family, dataset) =>
+lr.setFamily(family)
+lr.setProbabilityCol("").setPredictionCol("prediction")
+val modelNoProb = lr.fit(smallBinaryDataset)
+checkSummarySchema(modelNoProb, Seq("probability_"))
+
+lr.setProbabilityCol("probability").setPredictionCol("")
+val modelNoPred = lr.fit(smallBinaryDataset)
+checkSummarySchema(modelNoPred, Seq("prediction_"))
+
+lr.setProbabilityCol("").setPredictionCol("")
+val modelNoPredNoProb = lr.fit(smallBinaryDataset)
+checkSummarySchema(modelNoPredNoProb, Seq("prediction_", 
"probability_"))
+}
+  }
+
+  test("check summary types for binary and multiclass") {
+val lr = new LogisticRegression()
--- End diff --

Set maxIter = 1 for speed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133771738
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1018,19 +1025,33 @@ class LogisticRegressionModel private[spark] (
 throw new SparkException("No training summary available for this 
LogisticRegressionModel")
   }
 
+  @Since("2.2.0")
--- End diff --

update to 2.3


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133828620
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1484,3 +1556,96 @@ class BinaryLogisticRegressionSummary 
private[classification] (
 binaryMetrics.recallByThreshold().toDF("threshold", "recall")
   }
 }
+
+sealed trait BinaryLogisticRegressionTrainingSummary extends 
BinaryLogisticRegressionSummary
+  with LogisticRegressionTrainingSummary
+
+/**
+ * :: Experimental ::
--- End diff --

No need for Experimental tags on private classes


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133772374
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1018,19 +1025,33 @@ class LogisticRegressionModel private[spark] (
 throw new SparkException("No training summary available for this 
LogisticRegressionModel")
   }
 
+  @Since("2.2.0")
--- End diff --

Also, add Scala doc indicating that this will throw an exception if it is a 
multiclass model.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133830499
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -199,15 +199,57 @@ class LogisticRegressionSuite
 }
   }
 
-  test("empty probabilityCol") {
-val lr = new LogisticRegression().setProbabilityCol("")
-val model = lr.fit(smallBinaryDataset)
-assert(model.hasSummary)
-// Validate that we re-insert a probability column for evaluation
-val fieldNames = model.summary.predictions.schema.fieldNames
-assert(smallBinaryDataset.schema.fieldNames.toSet.subsetOf(
-  fieldNames.toSet))
-assert(fieldNames.exists(s => s.startsWith("probability_")))
+  test("empty probabilityCol or predictionCol") {
+val lr = new LogisticRegression().setMaxIter(1)
+val datasetFieldNames = smallBinaryDataset.schema.fieldNames.toSet
+def checkSummarySchema(model: LogisticRegressionModel, columns: 
Seq[String]): Unit = {
+  val fieldNames = model.summary.predictions.schema.fieldNames
+  assert(model.hasSummary)
+  assert(datasetFieldNames.subsetOf(fieldNames.toSet))
+  columns.foreach { c => assert(fieldNames.exists(_.startsWith(c))) }
+}
+// check that the summary model adds the appropriate columns
+Seq(("binomial", smallBinaryDataset), ("multinomial", 
smallMultinomialDataset)).foreach {
+  case (family, dataset) =>
--- End diff --

You are not using "dataset" in this.  However, this logic should be the 
same for both binomial and multinomial families, so I'm OK if you just test the 
binomial case here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133636149
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -882,21 +882,28 @@ class LogisticRegression @Since("1.2.0") (
 
 val model = copyValues(new LogisticRegressionModel(uid, 
coefficientMatrix, interceptVector,
   numClasses, isMultinomial))
-// TODO: implement summary model for multinomial case
-val m = if (!isMultinomial) {
-  val (summaryModel, probabilityColName) = 
model.findSummaryModelAndProbabilityCol()
-  val logRegSummary = new BinaryLogisticRegressionTrainingSummary(
+
+val (summaryModel, probabilityColName, predictionColName) = 
model.findSummaryModel()
+val logRegSummary = if (!isMultinomial || (isMultinomial && numClasses 
<= 2)) {
--- End diff --

Oh sorry!  I didn't think about this case where a model is specified as 
multinomial but has 2 classes.  What you had before is better: ```numClasses 
<=2``` (without using ```isMultinomial```).  Feel free to question my 
suggestions :P 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133636062
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -882,21 +882,28 @@ class LogisticRegression @Since("1.2.0") (
 
 val model = copyValues(new LogisticRegressionModel(uid, 
coefficientMatrix, interceptVector,
   numClasses, isMultinomial))
-// TODO: implement summary model for multinomial case
-val m = if (!isMultinomial) {
-  val (summaryModel, probabilityColName) = 
model.findSummaryModelAndProbabilityCol()
-  val logRegSummary = new BinaryLogisticRegressionTrainingSummary(
+
+val (summaryModel, probabilityColName, predictionColName) = 
model.findSummaryModel()
+val logRegSummary = if (numClasses <= 2) {
--- End diff --

@jkbradley The logic here isn't `if (!isMultinomial) ... else ...`
it is `if (!isMultinomial || (isMultinomial && numClasses <= 2)) ... else 
...`
see testcase `test("check summary types for binary and multiclass")`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133589180
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -882,21 +882,28 @@ class LogisticRegression @Since("1.2.0") (
 
 val model = copyValues(new LogisticRegressionModel(uid, 
coefficientMatrix, interceptVector,
   numClasses, isMultinomial))
-// TODO: implement summary model for multinomial case
-val m = if (!isMultinomial) {
-  val (summaryModel, probabilityColName) = 
model.findSummaryModelAndProbabilityCol()
-  val logRegSummary = new BinaryLogisticRegressionTrainingSummary(
+
+val (summaryModel, probabilityColName, predictionColName) = 
model.findSummaryModel()
+val logRegSummary = if (numClasses <= 2) {
--- End diff --

Use isMultinomial since it is easier to read.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133598714
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1324,90 +1350,130 @@ private[ml] class MultiClassSummarizer extends 
Serializable {
 }
 
 /**
- * Abstraction for multinomial Logistic Regression Training results.
- * Currently, the training summary ignores the training weights except
- * for the objective trace.
- */
-sealed trait LogisticRegressionTrainingSummary extends 
LogisticRegressionSummary {
-
-  /** objective function (scaled loss + regularization) at each iteration. 
*/
-  def objectiveHistory: Array[Double]
-
-  /** Number of training iterations until termination */
-  def totalIterations: Int = objectiveHistory.length
-
-}
-
-/**
- * Abstraction for Logistic Regression Results for a given model.
+ * Abstraction for logistic regression results for a given model.
  */
 sealed trait LogisticRegressionSummary extends Serializable {
 
   /**
* Dataframe output by the model's `transform` method.
*/
+  @Since("2.3.0")
   def predictions: DataFrame
 
   /** Field in "predictions" which gives the probability of each class as 
a vector. */
+  @Since("2.3.0")
   def probabilityCol: String
 
+  /** Field in "predictions" which gives the prediction of each class. */
+  @Since("2.3.0")
+  def predictionCol: String
+
   /** Field in "predictions" which gives the true label of each instance 
(if available). */
+  @Since("2.3.0")
   def labelCol: String
 
   /** Field in "predictions" which gives the features of each instance as 
a vector. */
+  @Since("2.3.0")
   def featuresCol: String
 
+  @transient private val multiclassMetrics = {
+new MulticlassMetrics(
+  predictions.select(
+col(predictionCol),
+col(labelCol).cast(DoubleType))
+.rdd.map { case Row(prediction: Double, label: Double) => 
(prediction, label) })
+  }
+
+  /** Returns true positive rate for each label. */
+  @Since("2.3.0")
+  def truePositiveRateByLabel: Array[Double] = recallByLabel
+
+  /** Returns false positive rate for each label. */
+  @Since("2.3.0")
+  def falsePositiveRateByLabel: Array[Double] = {
+multiclassMetrics.labels.map(label => 
multiclassMetrics.falsePositiveRate(label))
+  }
+
+  /** Returns precision for each label. */
+  @Since("2.3.0")
+  def precisionByLabel: Array[Double] = {
+multiclassMetrics.labels.map(label => 
multiclassMetrics.precision(label))
+  }
+
+  /** Returns recall for each label. */
+  @Since("2.3.0")
+  def recallByLabel: Array[Double] = {
+multiclassMetrics.labels.map(label => multiclassMetrics.recall(label))
+  }
+
+  /**
+   * Returns f-measure for each label.
+   */
+  @Since("2.3.0")
+  def fMeasureByLabel(beta: Double): Array[Double] = {
+multiclassMetrics.labels.map(label => 
multiclassMetrics.fMeasure(label, beta))
+  }
+
+  /** Returns f1-measure for each label. */
+  @Since("2.3.0")
+  def fMeasureByLabel: Array[Double] = fMeasureByLabel(1.0)
+
+  /** Returns accuracy. */
+  @Since("2.3.0")
+  def accuracy: Double = multiclassMetrics.accuracy
+
+  /** Returns weighted true positive rate. */
+  @Since("2.3.0")
+  def weightedTruePositiveRate: Double = weightedRecall
+
+  /** Returns weighted false positive rate. */
+  @Since("2.3.0")
+  def weightedFalsePositiveRate: Double = 
multiclassMetrics.weightedFalsePositiveRate
+
+  /** Returns weighted averaged recall. */
+  @Since("2.3.0")
+  def weightedRecall: Double = multiclassMetrics.weightedRecall
--- End diff --

For all of these, can you please make sure to copy all of the info from the 
MulticlassMetrics Scala doc?  There is some info missing, and users cannot be 
expected to find it on their own.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133598792
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1324,90 +1350,130 @@ private[ml] class MultiClassSummarizer extends 
Serializable {
 }
 
 /**
- * Abstraction for multinomial Logistic Regression Training results.
- * Currently, the training summary ignores the training weights except
- * for the objective trace.
- */
-sealed trait LogisticRegressionTrainingSummary extends 
LogisticRegressionSummary {
-
-  /** objective function (scaled loss + regularization) at each iteration. 
*/
-  def objectiveHistory: Array[Double]
-
-  /** Number of training iterations until termination */
-  def totalIterations: Int = objectiveHistory.length
-
-}
-
-/**
- * Abstraction for Logistic Regression Results for a given model.
+ * Abstraction for logistic regression results for a given model.
  */
 sealed trait LogisticRegressionSummary extends Serializable {
 
   /**
* Dataframe output by the model's `transform` method.
*/
+  @Since("2.3.0")
   def predictions: DataFrame
 
   /** Field in "predictions" which gives the probability of each class as 
a vector. */
+  @Since("2.3.0")
   def probabilityCol: String
 
+  /** Field in "predictions" which gives the prediction of each class. */
+  @Since("2.3.0")
+  def predictionCol: String
+
   /** Field in "predictions" which gives the true label of each instance 
(if available). */
+  @Since("2.3.0")
   def labelCol: String
 
   /** Field in "predictions" which gives the features of each instance as 
a vector. */
+  @Since("2.3.0")
   def featuresCol: String
 
+  @transient private val multiclassMetrics = {
+new MulticlassMetrics(
+  predictions.select(
+col(predictionCol),
+col(labelCol).cast(DoubleType))
+.rdd.map { case Row(prediction: Double, label: Double) => 
(prediction, label) })
+  }
+
+  /** Returns true positive rate for each label. */
+  @Since("2.3.0")
+  def truePositiveRateByLabel: Array[Double] = recallByLabel
+
+  /** Returns false positive rate for each label. */
+  @Since("2.3.0")
+  def falsePositiveRateByLabel: Array[Double] = {
+multiclassMetrics.labels.map(label => 
multiclassMetrics.falsePositiveRate(label))
+  }
+
+  /** Returns precision for each label. */
+  @Since("2.3.0")
+  def precisionByLabel: Array[Double] = {
+multiclassMetrics.labels.map(label => 
multiclassMetrics.precision(label))
+  }
+
+  /** Returns recall for each label. */
+  @Since("2.3.0")
+  def recallByLabel: Array[Double] = {
+multiclassMetrics.labels.map(label => multiclassMetrics.recall(label))
+  }
+
+  /**
+   * Returns f-measure for each label.
+   */
+  @Since("2.3.0")
+  def fMeasureByLabel(beta: Double): Array[Double] = {
+multiclassMetrics.labels.map(label => 
multiclassMetrics.fMeasure(label, beta))
+  }
+
+  /** Returns f1-measure for each label. */
+  @Since("2.3.0")
+  def fMeasureByLabel: Array[Double] = fMeasureByLabel(1.0)
+
+  /** Returns accuracy. */
+  @Since("2.3.0")
+  def accuracy: Double = multiclassMetrics.accuracy
+
+  /** Returns weighted true positive rate. */
+  @Since("2.3.0")
+  def weightedTruePositiveRate: Double = weightedRecall
+
+  /** Returns weighted false positive rate. */
+  @Since("2.3.0")
+  def weightedFalsePositiveRate: Double = 
multiclassMetrics.weightedFalsePositiveRate
+
+  /** Returns weighted averaged recall. */
+  @Since("2.3.0")
+  def weightedRecall: Double = multiclassMetrics.weightedRecall
+
+  /** Returns weighted averaged precision. */
+  @Since("2.3.0")
+  def weightedPrecision: Double = multiclassMetrics.weightedPrecision
+
+  /**
+   * Returns weighted averaged f-measure.
+   */
+  @Since("2.3.0")
+  def weightedFMeasure(beta: Double): Double = 
multiclassMetrics.weightedFMeasure(beta)
+
+  /** Returns weighted averaged f1-measure. */
+  @Since("2.3.0")
+  def weightedFMeasure: Double = multiclassMetrics.weightedFMeasure(1.0)
 }
 
 /**
- * :: Experimental ::
- * Logistic regression training results.
- *
- * @param predictions dataframe output by the model's `transform` method.
- * @param probabilityCol field in "predictions" which gives the 
probability of
- *   each class as a vector.
- * @param labelCol field in "predictions" which gives the true label of 
each instance.
- * @param featuresCol field in "predictions" which gives the features of 
each instance 

[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-05-04 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r114818175
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -982,19 +989,33 @@ class LogisticRegressionModel private[spark] (
 throw new SparkException("No training summary available for this 
LogisticRegressionModel")
   }
 
+  @Since("2.2.0")
+  def binarySummary: BinaryLogisticRegressionTrainingSummary = summary 
match {
--- End diff --

For convenience. Otherwise, users need to call 
`model.summary.asInstanceOf[BinaryLogisticRegressionTrainingSummary]` in order 
to access the extra, binary-specific methods.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-05-04 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r114816222
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1449,6 +1513,107 @@ class BinaryLogisticRegressionSummary 
private[classification] (
   }
 }
 
+sealed trait BinaryLogisticRegressionTrainingSummary extends 
BinaryLogisticRegressionSummary
+  with LogisticRegressionTrainingSummary
+
+/**
+ * :: Experimental ::
+ * Multiclass logistic regression training results.
+ *
+ * @param predictions dataframe output by the model's `transform` method.
+ * @param probabilityCol field in "predictions" which gives the 
probability of
+ *   each class as a vector.
+ * @param predictionCol field in "predictions" which gives the prediction 
for a data instance as a
+ *  double.
+ * @param labelCol field in "predictions" which gives the true label of 
each instance.
+ * @param featuresCol field in "predictions" which gives the features of 
each instance as a vector.
+ * @param objectiveHistory objective function (scaled loss + 
regularization) at each iteration.
+ */
+@Experimental
+@Since("2.2.0")
--- End diff --

Remove ```Experimental``` and ```Since``` tag for private class.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-05-04 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r114816720
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -982,19 +989,33 @@ class LogisticRegressionModel private[spark] (
 throw new SparkException("No training summary available for this 
LogisticRegressionModel")
   }
 
+  @Since("2.2.0")
+  def binarySummary: BinaryLogisticRegressionTrainingSummary = summary 
match {
--- End diff --

Could I ask why we need a separate ```binarySummary```? If we get a binary 
classification model, the ```trainingSummary ``` should be the corresponding 
binary summary. Is that correct?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-05-04 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r114816288
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1449,6 +1513,107 @@ class BinaryLogisticRegressionSummary 
private[classification] (
   }
 }
 
+sealed trait BinaryLogisticRegressionTrainingSummary extends 
BinaryLogisticRegressionSummary
+  with LogisticRegressionTrainingSummary
+
+/**
+ * :: Experimental ::
+ * Multiclass logistic regression training results.
+ *
+ * @param predictions dataframe output by the model's `transform` method.
+ * @param probabilityCol field in "predictions" which gives the 
probability of
+ *   each class as a vector.
+ * @param predictionCol field in "predictions" which gives the prediction 
for a data instance as a
+ *  double.
+ * @param labelCol field in "predictions" which gives the true label of 
each instance.
+ * @param featuresCol field in "predictions" which gives the features of 
each instance as a vector.
+ * @param objectiveHistory objective function (scaled loss + 
regularization) at each iteration.
+ */
+@Experimental
+@Since("2.2.0")
+private class LogisticRegressionTrainingSummaryImpl(
+override val predictions: DataFrame,
+override val probabilityCol: String,
+override val predictionCol: String,
+override val labelCol: String,
+override val featuresCol: String,
+@Since("1.5.0") val objectiveHistory: Array[Double])
+  extends LogisticRegressionSummaryImpl(
+predictions, probabilityCol, predictionCol, labelCol, featuresCol)
+  with LogisticRegressionTrainingSummary
+
+/**
+ * :: Experimental ::
+ * Multiclass logistic regression results for a given model.
+ *
+ * @param predictions dataframe output by the model's `transform` method.
+ * @param probabilityCol field in "predictions" which gives the 
probability of
+ *   each class as a vector.
+ * @param predictionCol field in "predictions" which gives the prediction 
for a data instance as a
+ *  double.
+ * @param labelCol field in "predictions" which gives the true label of 
each instance.
+ * @param featuresCol field in "predictions" which gives the features of 
each instance as a vector.
+ */
+@Experimental
+@Since("2.2.0")
+private class LogisticRegressionSummaryImpl(
--- End diff --

Ditto.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-04-28 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r114043519
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1231,6 +1295,109 @@ class BinaryLogisticRegressionSummary 
private[classification] (
   }
 }
 
+sealed trait BinaryLogisticRegressionTrainingSummary extends 
BinaryLogisticRegressionSummary
+  with LogisticRegressionTrainingSummary
+
+/**
+ * :: Experimental ::
+ * Multiclass logistic regression training results.
+ *
+ * @param predictions dataframe output by the model's `transform` method.
+ * @param probabilityCol field in "predictions" which gives the 
probability of
+ *   each class as a vector.
+ * @param predictionCol field in "predictions" which gives the prediction 
for a data instance as a
+ *  double.
+ * @param labelCol field in "predictions" which gives the true label of 
each instance.
+ * @param featuresCol field in "predictions" which gives the features of 
each instance as a vector.
+ * @param objectiveHistory objective function (scaled loss + 
regularization) at each iteration.
+ */
+@Experimental
+@Since("2.2.0")
+private class LogisticRegressionTrainingSummaryImpl(
+override val predictions: DataFrame,
+override val probabilityCol: String,
+override val predictionCol: String,
+override val labelCol: String,
+override val featuresCol: String,
+@Since("1.5.0") val objectiveHistory: Array[Double])
+  extends LogisticRegressionSummaryImpl(
+predictions, probabilityCol, predictionCol, labelCol, featuresCol)
+  with LogisticRegressionTrainingSummary {
+
+}
+
+/**
+ * :: Experimental ::
+ * Multiclass Logistic regression results for a given model.
--- End diff --

yeah, I already rebase it. (If I do not squash the commit to fix conflicts 
, the rebase always fail, maybe there are some better way?)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-04-28 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r113950702
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1231,6 +1295,109 @@ class BinaryLogisticRegressionSummary 
private[classification] (
   }
 }
 
+sealed trait BinaryLogisticRegressionTrainingSummary extends 
BinaryLogisticRegressionSummary
+  with LogisticRegressionTrainingSummary
+
+/**
+ * :: Experimental ::
+ * Multiclass logistic regression training results.
+ *
+ * @param predictions dataframe output by the model's `transform` method.
+ * @param probabilityCol field in "predictions" which gives the 
probability of
+ *   each class as a vector.
+ * @param predictionCol field in "predictions" which gives the prediction 
for a data instance as a
+ *  double.
+ * @param labelCol field in "predictions" which gives the true label of 
each instance.
+ * @param featuresCol field in "predictions" which gives the features of 
each instance as a vector.
+ * @param objectiveHistory objective function (scaled loss + 
regularization) at each iteration.
+ */
+@Experimental
+@Since("2.2.0")
+private class LogisticRegressionTrainingSummaryImpl(
+override val predictions: DataFrame,
+override val probabilityCol: String,
+override val predictionCol: String,
+override val labelCol: String,
+override val featuresCol: String,
+@Since("1.5.0") val objectiveHistory: Array[Double])
+  extends LogisticRegressionSummaryImpl(
+predictions, probabilityCol, predictionCol, labelCol, featuresCol)
+  with LogisticRegressionTrainingSummary {
+
+}
+
+/**
+ * :: Experimental ::
+ * Multiclass Logistic regression results for a given model.
--- End diff --

have you tried just rebasing? `git pull --rebase upstream master`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-04-28 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r113867765
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1231,6 +1295,109 @@ class BinaryLogisticRegressionSummary 
private[classification] (
   }
 }
 
+sealed trait BinaryLogisticRegressionTrainingSummary extends 
BinaryLogisticRegressionSummary
+  with LogisticRegressionTrainingSummary
+
+/**
+ * :: Experimental ::
+ * Multiclass logistic regression training results.
+ *
+ * @param predictions dataframe output by the model's `transform` method.
+ * @param probabilityCol field in "predictions" which gives the 
probability of
+ *   each class as a vector.
+ * @param predictionCol field in "predictions" which gives the prediction 
for a data instance as a
+ *  double.
+ * @param labelCol field in "predictions" which gives the true label of 
each instance.
+ * @param featuresCol field in "predictions" which gives the features of 
each instance as a vector.
+ * @param objectiveHistory objective function (scaled loss + 
regularization) at each iteration.
+ */
+@Experimental
+@Since("2.2.0")
+private class LogisticRegressionTrainingSummaryImpl(
+override val predictions: DataFrame,
+override val probabilityCol: String,
+override val predictionCol: String,
+override val labelCol: String,
+override val featuresCol: String,
+@Since("1.5.0") val objectiveHistory: Array[Double])
+  extends LogisticRegressionSummaryImpl(
+predictions, probabilityCol, predictionCol, labelCol, featuresCol)
+  with LogisticRegressionTrainingSummary {
+
+}
+
+/**
+ * :: Experimental ::
+ * Multiclass Logistic regression results for a given model.
--- End diff --

@sethah I have a small problem, if I do not squash commits, it seems cannot 
solve the conflicts...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-04-28 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r113857158
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1231,6 +1295,109 @@ class BinaryLogisticRegressionSummary 
private[classification] (
   }
 }
 
+sealed trait BinaryLogisticRegressionTrainingSummary extends 
BinaryLogisticRegressionSummary
+  with LogisticRegressionTrainingSummary
+
+/**
+ * :: Experimental ::
+ * Multiclass logistic regression training results.
+ *
+ * @param predictions dataframe output by the model's `transform` method.
+ * @param probabilityCol field in "predictions" which gives the 
probability of
+ *   each class as a vector.
+ * @param predictionCol field in "predictions" which gives the prediction 
for a data instance as a
+ *  double.
+ * @param labelCol field in "predictions" which gives the true label of 
each instance.
+ * @param featuresCol field in "predictions" which gives the features of 
each instance as a vector.
+ * @param objectiveHistory objective function (scaled loss + 
regularization) at each iteration.
+ */
+@Experimental
+@Since("2.2.0")
+private class LogisticRegressionTrainingSummaryImpl(
+override val predictions: DataFrame,
+override val probabilityCol: String,
+override val predictionCol: String,
+override val labelCol: String,
+override val featuresCol: String,
+@Since("1.5.0") val objectiveHistory: Array[Double])
+  extends LogisticRegressionSummaryImpl(
+predictions, probabilityCol, predictionCol, labelCol, featuresCol)
+  with LogisticRegressionTrainingSummary {
+
+}
--- End diff --

remove braces


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-04-28 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r113857182
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1231,6 +1295,109 @@ class BinaryLogisticRegressionSummary 
private[classification] (
   }
 }
 
+sealed trait BinaryLogisticRegressionTrainingSummary extends 
BinaryLogisticRegressionSummary
+  with LogisticRegressionTrainingSummary
+
+/**
+ * :: Experimental ::
+ * Multiclass logistic regression training results.
+ *
+ * @param predictions dataframe output by the model's `transform` method.
+ * @param probabilityCol field in "predictions" which gives the 
probability of
+ *   each class as a vector.
+ * @param predictionCol field in "predictions" which gives the prediction 
for a data instance as a
+ *  double.
+ * @param labelCol field in "predictions" which gives the true label of 
each instance.
+ * @param featuresCol field in "predictions" which gives the features of 
each instance as a vector.
+ * @param objectiveHistory objective function (scaled loss + 
regularization) at each iteration.
+ */
+@Experimental
+@Since("2.2.0")
+private class LogisticRegressionTrainingSummaryImpl(
+override val predictions: DataFrame,
+override val probabilityCol: String,
+override val predictionCol: String,
+override val labelCol: String,
+override val featuresCol: String,
+@Since("1.5.0") val objectiveHistory: Array[Double])
+  extends LogisticRegressionSummaryImpl(
+predictions, probabilityCol, predictionCol, labelCol, featuresCol)
+  with LogisticRegressionTrainingSummary {
+
+}
+
+/**
+ * :: Experimental ::
+ * Multiclass Logistic regression results for a given model.
--- End diff --

"Multiclass Logistic" -> "Multiclass logistic"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-04-28 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r113856908
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1070,90 +1096,128 @@ private[classification] class MultiClassSummarizer 
extends Serializable {
 }
 
 /**
- * Abstraction for multinomial Logistic Regression Training results.
- * Currently, the training summary ignores the training weights except
- * for the objective trace.
- */
-sealed trait LogisticRegressionTrainingSummary extends 
LogisticRegressionSummary {
-
-  /** objective function (scaled loss + regularization) at each iteration. 
*/
-  def objectiveHistory: Array[Double]
-
-  /** Number of training iterations until termination */
-  def totalIterations: Int = objectiveHistory.length
-
-}
-
-/**
- * Abstraction for Logistic Regression Results for a given model.
+ * Abstraction for logistic regression results for a given model.
  */
 sealed trait LogisticRegressionSummary extends Serializable {
 
   /**
* Dataframe output by the model's `transform` method.
*/
+  @Since("2.2.0")
--- End diff --

We have avoided putting since tags on methods inside traits. But since this 
is sealed and I don't think we intend to add to it, it should be ok


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-03-24 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r107961224
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -644,21 +644,29 @@ class LogisticRegression @Since("1.2.0") (
 
 val model = copyValues(new LogisticRegressionModel(uid, 
coefficientMatrix, interceptVector,
   numClasses, isMultinomial))
-// TODO: implement summary model for multinomial case
-val m = if (!isMultinomial) {
-  val (summaryModel, probabilityColName) = 
model.findSummaryModelAndProbabilityCol()
-  val logRegSummary = new BinaryLogisticRegressionTrainingSummary(
+
+val (summaryModel, probabilityColName, predictionColName)
+  = model.findSummaryModel()
+val logRegSummary = if (!isMultinomial) {
--- End diff --

It doesn't appear you added any tests for this? Also, it is better for 
reviewers if you do not squash your commits since we can then see what has 
changed since the last pass was made.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-03-21 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r107326057
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -1786,51 +1793,98 @@ class LogisticRegressionSuite
   }
 
   test("evaluate on test set") {
-// TODO: add for multiclass when model summary becomes available
 // Evaluate on test set should be same as that of the transformed 
training data.
 val lr = new LogisticRegression()
   .setMaxIter(10)
   .setRegParam(1.0)
   .setThreshold(0.6)
-val model = lr.fit(smallBinaryDataset)
-val summary = 
model.summary.asInstanceOf[BinaryLogisticRegressionSummary]
-
-val sameSummary =
-  
model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary]
-assert(summary.areaUnderROC === sameSummary.areaUnderROC)
-assert(summary.roc.collect() === sameSummary.roc.collect())
-assert(summary.pr.collect === sameSummary.pr.collect())
+  .setFamily("binomial")
+val blorModel = lr.fit(smallBinaryDataset)
+val blorSummary = blorModel.binarySummary
+
+val sameBlorSummary =
+  
blorModel.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary]
+assert(blorSummary.areaUnderROC === sameBlorSummary.areaUnderROC)
+assert(blorSummary.roc.collect() === sameBlorSummary.roc.collect())
+assert(blorSummary.pr.collect === sameBlorSummary.pr.collect())
 assert(
-  summary.fMeasureByThreshold.collect() === 
sameSummary.fMeasureByThreshold.collect())
-assert(summary.recallByThreshold.collect() === 
sameSummary.recallByThreshold.collect())
+  blorSummary.fMeasureByThreshold.collect() === 
sameBlorSummary.fMeasureByThreshold.collect())
 assert(
-  summary.precisionByThreshold.collect() === 
sameSummary.precisionByThreshold.collect())
+  blorSummary.recallByThreshold.collect() === 
sameBlorSummary.recallByThreshold.collect())
+assert(
+  blorSummary.precisionByThreshold.collect()
+=== sameBlorSummary.precisionByThreshold.collect())
+
+lr.setFamily("multinomial")
+val mlorModel = lr.fit(smallMultinomialDataset)
+val mlorSummary = mlorModel.summary
+
+val mlorSameSummary = mlorModel.evaluate(smallMultinomialDataset)
+
+assert(mlorSummary.truePositiveRateByLabel === 
mlorSameSummary.truePositiveRateByLabel)
+assert(mlorSummary.falsePositiveRateByLabel === 
mlorSameSummary.falsePositiveRateByLabel)
+assert(mlorSummary.precisionByLabel === 
mlorSameSummary.precisionByLabel)
+assert(mlorSummary.recallByLabel === mlorSameSummary.recallByLabel)
+assert(mlorSummary.fMeasureByLabel === mlorSameSummary.fMeasureByLabel)
+assert(mlorSummary.accuracy === mlorSameSummary.accuracy)
+assert(mlorSummary.weightedTruePositiveRate === 
mlorSameSummary.weightedTruePositiveRate)
+assert(mlorSummary.weightedFalsePositiveRate === 
mlorSameSummary.weightedFalsePositiveRate)
+assert(mlorSummary.weightedPrecision === 
mlorSameSummary.weightedPrecision)
+assert(mlorSummary.weightedRecall === mlorSameSummary.weightedRecall)
+assert(mlorSummary.weightedFMeasure === 
mlorSameSummary.weightedFMeasure)
   }
 
   test("evaluate with labels that are not doubles") {
 // Evaluate a test set with Label that is a numeric type other than 
Double
-val lr = new LogisticRegression()
+val blor = new LogisticRegression()
   .setMaxIter(1)
   .setRegParam(1.0)
-val model = lr.fit(smallBinaryDataset)
-val summary = 
model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary]
+val blorModel = blor.fit(smallBinaryDataset)
+val blorSummary = blorModel.evaluate(smallBinaryDataset)
+  .asInstanceOf[BinaryLogisticRegressionSummary]
+
+val blorLongLabelData = 
smallBinaryDataset.select(col(blorModel.getLabelCol).cast(LongType),
+  col(blorModel.getFeaturesCol))
+val blorLongSummary = blorModel.evaluate(blorLongLabelData)
+  .asInstanceOf[BinaryLogisticRegressionSummary]
--- End diff --

OK. I agree to do it in separate PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-03-21 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r107325971
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -194,15 +207,9 @@ class LogisticRegressionSuite
 // thresholds and threshold must be consistent: values
 withClue("fit with ParamMap should throw error if threshold, 
thresholds do not match.") {
   intercept[IllegalArgumentException] {
-lr2.fit(smallBinaryDataset,
-  lr2.thresholds -> Array(0.3, 0.7), lr2.threshold -> 
(expectedThreshold / 2.0))
-  }
-}
-withClue("fit with ParamMap should throw error if threshold, 
thresholds do not match.") {
-  intercept[IllegalArgumentException] {
-val lr2model = lr2.fit(smallBinaryDataset,
-  lr2.thresholds -> Array(0.3, 0.7), lr2.threshold -> 
(expectedThreshold / 2.0))
-lr2model.getThreshold
+val lr3 = lr2.copy(
--- End diff --

I will revert this. (previously has a bug which cause this test case fail 
but now fixed)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-03-17 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r106725381
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -644,21 +644,29 @@ class LogisticRegression @Since("1.2.0") (
 
 val model = copyValues(new LogisticRegressionModel(uid, 
coefficientMatrix, interceptVector,
   numClasses, isMultinomial))
-// TODO: implement summary model for multinomial case
-val m = if (!isMultinomial) {
-  val (summaryModel, probabilityColName) = 
model.findSummaryModelAndProbabilityCol()
-  val logRegSummary = new BinaryLogisticRegressionTrainingSummary(
+
+val (summaryModel, probabilityColName, predictionColName)
+  = model.findSummaryModel()
+val logRegSummary = if (!isMultinomial) {
--- End diff --

We should change this to `if (numClasses <= 2)` since you can train a 
multinomial model on 2 classes, and we should still get a binary summary in 
that case. And of course, please add tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-03-17 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r106492335
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1086,83 +1115,124 @@ private[classification] class MultiClassSummarizer 
extends Serializable {
 }
 
 /**
- * Abstraction for multinomial Logistic Regression Training results.
- * Currently, the training summary ignores the training weights except
- * for the objective trace.
- */
-sealed trait LogisticRegressionTrainingSummary extends 
LogisticRegressionSummary {
-
-  /** objective function (scaled loss + regularization) at each iteration. 
*/
-  def objectiveHistory: Array[Double]
-
-  /** Number of training iterations until termination */
-  def totalIterations: Int = objectiveHistory.length
-
-}
-
-/**
  * Abstraction for Logistic Regression Results for a given model.
  */
 sealed trait LogisticRegressionSummary extends Serializable {
 
   /**
* Dataframe output by the model's `transform` method.
*/
+  @Since("2.2.0")
   def predictions: DataFrame
 
   /** Field in "predictions" which gives the probability of each class as 
a vector. */
+  @Since("2.2.0")
   def probabilityCol: String
 
+  /** Field in "predictions" which gives the prediction of each class as a 
vector. */
--- End diff --

"as a Double". Please check this, there are several cases where the 
prediction is said to be a vector, but predictions are actually double values


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-03-17 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r106726649
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -813,9 +835,16 @@ class LogisticRegressionModel private[spark] (
   @Since("2.0.0")
   def evaluate(dataset: Dataset[_]): LogisticRegressionSummary = {
 // Handle possible missing or invalid prediction columns
-val (summaryModel, probabilityColName) = 
findSummaryModelAndProbabilityCol()
-new BinaryLogisticRegressionSummary(summaryModel.transform(dataset),
-  probabilityColName, $(labelCol), $(featuresCol))
+val (summaryModel, probabilityColName, predictionColName)
--- End diff --

put this on a single line. also, it happens above so do it there as well


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-03-17 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r106727119
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -1786,51 +1793,98 @@ class LogisticRegressionSuite
   }
 
   test("evaluate on test set") {
-// TODO: add for multiclass when model summary becomes available
 // Evaluate on test set should be same as that of the transformed 
training data.
 val lr = new LogisticRegression()
   .setMaxIter(10)
   .setRegParam(1.0)
   .setThreshold(0.6)
-val model = lr.fit(smallBinaryDataset)
-val summary = 
model.summary.asInstanceOf[BinaryLogisticRegressionSummary]
-
-val sameSummary =
-  
model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary]
-assert(summary.areaUnderROC === sameSummary.areaUnderROC)
-assert(summary.roc.collect() === sameSummary.roc.collect())
-assert(summary.pr.collect === sameSummary.pr.collect())
+  .setFamily("binomial")
+val blorModel = lr.fit(smallBinaryDataset)
+val blorSummary = blorModel.binarySummary
+
+val sameBlorSummary =
+  
blorModel.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary]
+assert(blorSummary.areaUnderROC === sameBlorSummary.areaUnderROC)
+assert(blorSummary.roc.collect() === sameBlorSummary.roc.collect())
+assert(blorSummary.pr.collect === sameBlorSummary.pr.collect())
 assert(
-  summary.fMeasureByThreshold.collect() === 
sameSummary.fMeasureByThreshold.collect())
-assert(summary.recallByThreshold.collect() === 
sameSummary.recallByThreshold.collect())
+  blorSummary.fMeasureByThreshold.collect() === 
sameBlorSummary.fMeasureByThreshold.collect())
 assert(
-  summary.precisionByThreshold.collect() === 
sameSummary.precisionByThreshold.collect())
+  blorSummary.recallByThreshold.collect() === 
sameBlorSummary.recallByThreshold.collect())
+assert(
+  blorSummary.precisionByThreshold.collect()
+=== sameBlorSummary.precisionByThreshold.collect())
+
+lr.setFamily("multinomial")
+val mlorModel = lr.fit(smallMultinomialDataset)
+val mlorSummary = mlorModel.summary
+
+val mlorSameSummary = mlorModel.evaluate(smallMultinomialDataset)
+
+assert(mlorSummary.truePositiveRateByLabel === 
mlorSameSummary.truePositiveRateByLabel)
+assert(mlorSummary.falsePositiveRateByLabel === 
mlorSameSummary.falsePositiveRateByLabel)
+assert(mlorSummary.precisionByLabel === 
mlorSameSummary.precisionByLabel)
+assert(mlorSummary.recallByLabel === mlorSameSummary.recallByLabel)
+assert(mlorSummary.fMeasureByLabel === mlorSameSummary.fMeasureByLabel)
+assert(mlorSummary.accuracy === mlorSameSummary.accuracy)
+assert(mlorSummary.weightedTruePositiveRate === 
mlorSameSummary.weightedTruePositiveRate)
+assert(mlorSummary.weightedFalsePositiveRate === 
mlorSameSummary.weightedFalsePositiveRate)
+assert(mlorSummary.weightedPrecision === 
mlorSameSummary.weightedPrecision)
+assert(mlorSummary.weightedRecall === mlorSameSummary.weightedRecall)
+assert(mlorSummary.weightedFMeasure === 
mlorSameSummary.weightedFMeasure)
   }
 
   test("evaluate with labels that are not doubles") {
 // Evaluate a test set with Label that is a numeric type other than 
Double
-val lr = new LogisticRegression()
+val blor = new LogisticRegression()
   .setMaxIter(1)
   .setRegParam(1.0)
-val model = lr.fit(smallBinaryDataset)
-val summary = 
model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary]
+val blorModel = blor.fit(smallBinaryDataset)
+val blorSummary = blorModel.evaluate(smallBinaryDataset)
+  .asInstanceOf[BinaryLogisticRegressionSummary]
+
+val blorLongLabelData = 
smallBinaryDataset.select(col(blorModel.getLabelCol).cast(LongType),
+  col(blorModel.getFeaturesCol))
+val blorLongSummary = blorModel.evaluate(blorLongLabelData)
+  .asInstanceOf[BinaryLogisticRegressionSummary]
--- End diff --

This is quite annoying. I wonder if we should add a `asBinary` method to 
`LogisticRegressionSummary` that casts it or throws an error to make it easier. 
Probably to do in a separate PR anyway.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To 

[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-03-17 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r106522957
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -194,15 +207,9 @@ class LogisticRegressionSuite
 // thresholds and threshold must be consistent: values
 withClue("fit with ParamMap should throw error if threshold, 
thresholds do not match.") {
   intercept[IllegalArgumentException] {
-lr2.fit(smallBinaryDataset,
-  lr2.thresholds -> Array(0.3, 0.7), lr2.threshold -> 
(expectedThreshold / 2.0))
-  }
-}
-withClue("fit with ParamMap should throw error if threshold, 
thresholds do not match.") {
-  intercept[IllegalArgumentException] {
-val lr2model = lr2.fit(smallBinaryDataset,
-  lr2.thresholds -> Array(0.3, 0.7), lr2.threshold -> 
(expectedThreshold / 2.0))
-lr2model.getThreshold
+val lr3 = lr2.copy(
--- End diff --

why is this changed?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-03-17 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r106726508
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1394,13 +1575,13 @@ class BinaryLogisticRegressionSummary 
private[classification] (
  *$$
  * 
  *
+ *
--- End diff --

revert this one and the next one


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-03-17 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r106727965
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -780,19 +788,33 @@ class LogisticRegressionModel private[spark] (
 throw new SparkException("No training summary available for this 
LogisticRegressionModel")
   }
 
+  @Since("2.2.0")
+  def binarySummary: BinaryLogisticRegressionTrainingSummary = summary 
match {
+case b: BinaryLogisticRegressionTrainingSummary => b
+case _ =>
+  throw new RuntimeException("Cannot create a binarySummary for a 
non-binary model" +
+s"(numClasses=${numClasses}), use summary instead.")
+  }
+
   /**
-   * If the probability column is set returns the current model and 
probability column,
-   * otherwise generates a new column and sets it as the probability 
column on a new copy
-   * of the current model.
+   * If the probability column and prediction column is set returns the 
current model, otherwise
--- End diff --

"If the probability and prediction columns are set, this method returns the 
current model, otherwise it generates new columns for them and sets them as 
columns on a new copy of the current model."


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-03-17 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r106533356
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1242,11 +1312,122 @@ class BinaryLogisticRegressionSummary 
private[classification] (
* This will change in later Spark versions.
*/
   @Since("1.5.0")
-  @transient lazy val recallByThreshold: DataFrame = {
+  def recallByThreshold: DataFrame = {
 binaryMetrics.recallByThreshold().toDF("threshold", "recall")
   }
 }
 
+sealed trait BinaryLogisticRegressionTrainingSummary extends 
BinaryLogisticRegressionSummary
+  with LogisticRegressionTrainingSummary {
+
+}
+
+/**
+ * :: Experimental ::
+ * Multinomial logistic regression training results.
+ *
+ * @param predictions dataframe output by the model's `transform` method.
+ * @param probabilityCol field in "predictions" which gives the 
probability of
+ *   each class as a vector.
+ * @param predictionCol field in "predictions" which gives the prediction 
for a data instance as a
+ *  double.
+ * @param labelCol field in "predictions" which gives the true label of 
each instance.
+ * @param featuresCol field in "predictions" which gives the features of 
each instance as a vector.
+ * @param objectiveHistory objective function (scaled loss + 
regularization) at each iteration.
+ */
+@Experimental
+@Since("1.5.0")
+class LogisticRegressionTrainingSummaryImpl private[classification](
--- End diff --

These impl classes can be private I think.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-03-17 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r106726296
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -780,19 +788,33 @@ class LogisticRegressionModel private[spark] (
 throw new SparkException("No training summary available for this 
LogisticRegressionModel")
   }
 
+  @Since("2.2.0")
+  def binarySummary: BinaryLogisticRegressionTrainingSummary = summary 
match {
--- End diff --

The behavior we are implementing is non-trivial. We need to add tests to 
ensure that everything happens as expected. This is an example:

scala
test("binary and multiclass summary") {
val lr = new LogisticRegression()
  .setFamily("binomial")

val blorModel = lr.fit(smallBinaryDataset)

assert(blorModel.summary.isInstanceOf[BinaryLogisticRegressionTrainingSummaryImpl])

assert(blorModel.binarySummary.isInstanceOf[BinaryLogisticRegressionTrainingSummaryImpl])

val mlorModel = lr.setFamily("multinomial").fit(smallMultinomialDataset)

assert(mlorModel.summary.isInstanceOf[LogisticRegressionTrainingSummaryImpl])
withClue("cannot get binary summary for multiclass model") {
  intercept[RuntimeException] {
mlorModel.binarySummary
  }
}

val blorSummary = blorModel.evaluate(smallBinaryDataset)
val mlorSummary = mlorModel.evaluate(smallMultinomialDataset)
assert(blorSummary.isInstanceOf[BinaryLogisticRegressionSummaryImpl])
assert(mlorSummary.isInstanceOf[LogisticRegressionSummaryImpl])
  }



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-03-17 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r106727670
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1086,83 +1115,124 @@ private[classification] class MultiClassSummarizer 
extends Serializable {
 }
 
 /**
- * Abstraction for multinomial Logistic Regression Training results.
- * Currently, the training summary ignores the training weights except
- * for the objective trace.
- */
-sealed trait LogisticRegressionTrainingSummary extends 
LogisticRegressionSummary {
-
-  /** objective function (scaled loss + regularization) at each iteration. 
*/
-  def objectiveHistory: Array[Double]
-
-  /** Number of training iterations until termination */
-  def totalIterations: Int = objectiveHistory.length
-
-}
-
-/**
  * Abstraction for Logistic Regression Results for a given model.
  */
 sealed trait LogisticRegressionSummary extends Serializable {
 
   /**
* Dataframe output by the model's `transform` method.
*/
+  @Since("2.2.0")
   def predictions: DataFrame
 
   /** Field in "predictions" which gives the probability of each class as 
a vector. */
+  @Since("2.2.0")
   def probabilityCol: String
 
+  /** Field in "predictions" which gives the prediction of each class as a 
vector. */
+  @Since("2.2.0")
+  def predictionCol: String
+
   /** Field in "predictions" which gives the true label of each instance 
(if available). */
+  @Since("2.2.0")
   def labelCol: String
 
   /** Field in "predictions" which gives the features of each instance as 
a vector. */
+  @Since("2.2.0")
   def featuresCol: String
 
+  @transient private val multiclassMetrics = {
+new MulticlassMetrics(
+  predictions.select(
+col(predictionCol),
+col(labelCol).cast(DoubleType))
+.rdd.map { case Row(prediction: Double, label: Double) => 
(prediction, label) })
+  }
+
+  /** Returns true positive rate for each label. */
+  @Since("2.2.0")
+  def truePositiveRateByLabel: Array[Double] = recallByLabel
+
+  /** Returns false positive rate for each label. */
+  @Since("2.2.0")
+  def falsePositiveRateByLabel: Array[Double] = {
+multiclassMetrics.labels.map(label => 
multiclassMetrics.falsePositiveRate(label))
+  }
+
+  /** Returns precision for each label. */
+  @Since("2.2.0")
+  def precisionByLabel: Array[Double] = {
+multiclassMetrics.labels.map(label => 
multiclassMetrics.precision(label))
+  }
+
+  /** Returns recall for each label. */
+  @Since("2.2.0")
+  def recallByLabel: Array[Double] = {
+multiclassMetrics.labels.map(label => multiclassMetrics.recall(label))
+  }
+
+  /**
+   * Returns f-measure for each label.
+   */
+  @Since("2.2.0")
+  def fMeasureByLabel(beta: Double): Array[Double] = {
+multiclassMetrics.labels.map(label => 
multiclassMetrics.fMeasure(label, beta))
+  }
+
+  /** Returns f1-measure for each label. */
+  @Since("2.2.0")
+  def fMeasureByLabel: Array[Double] = fMeasureByLabel(1.0)
+
+  /** Returns accuracy. */
+  @Since("2.2.0")
+  def accuracy: Double = multiclassMetrics.accuracy
+
+  /** Returns weighted true positive rate. */
+  @Since("2.2.0")
+  def weightedTruePositiveRate: Double = weightedRecall
+
+  /** Returns weighted false positive rate. */
+  @Since("2.2.0")
+  def weightedFalsePositiveRate: Double = 
multiclassMetrics.weightedFalsePositiveRate
+
+  /** Returns weighted averaged recall. */
+  @Since("2.2.0")
+  def weightedRecall: Double = multiclassMetrics.weightedRecall
+
+  /** Returns weighted averaged precision. */
+  @Since("2.2.0")
+  def weightedPrecision: Double = multiclassMetrics.weightedPrecision
+
+  /**
+   * Returns weighted averaged f-measure.
+   */
+  @Since("2.2.0")
+  def weightedFMeasure(beta: Double): Double = 
multiclassMetrics.weightedFMeasure(beta)
+
+  /** Returns weighted averaged f1-measure. */
+  @Since("2.2.0")
+  def weightedFMeasure: Double = multiclassMetrics.weightedFMeasure(1.0)
 }
 
 /**
- * :: Experimental ::
- * Logistic regression training results.
- *
- * @param predictions dataframe output by the model's `transform` method.
- * @param probabilityCol field in "predictions" which gives the 
probability of
- *   each class as a vector.
- * @param labelCol field in "predictions" which gives the true label of 
each instance.
- * @param featuresCol field in "predictions" which gives the features of 
each instance as a vector.
- * @param objectiveHistory 

[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-03-17 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r106534308
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1197,7 +1267,7 @@ class BinaryLogisticRegressionSummary 
private[classification] (
* This will change in later Spark versions.
*/
   @Since("1.5.0")
-  lazy val areaUnderROC: Double = binaryMetrics.areaUnderROC()
+  def areaUnderROC: Double = binaryMetrics.areaUnderROC()
--- End diff --

Why are these defs now?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-03-17 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r106725472
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -813,9 +835,16 @@ class LogisticRegressionModel private[spark] (
   @Since("2.0.0")
   def evaluate(dataset: Dataset[_]): LogisticRegressionSummary = {
 // Handle possible missing or invalid prediction columns
-val (summaryModel, probabilityColName) = 
findSummaryModelAndProbabilityCol()
-new BinaryLogisticRegressionSummary(summaryModel.transform(dataset),
-  probabilityColName, $(labelCol), $(featuresCol))
+val (summaryModel, probabilityColName, predictionColName)
+  = findSummaryModel()
+if (isMultinomial) {
--- End diff --

same as above comment, check number of classes, not family


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-03-17 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r106728259
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -780,19 +788,33 @@ class LogisticRegressionModel private[spark] (
 throw new SparkException("No training summary available for this 
LogisticRegressionModel")
   }
 
+  @Since("2.2.0")
+  def binarySummary: BinaryLogisticRegressionTrainingSummary = summary 
match {
+case b: BinaryLogisticRegressionTrainingSummary => b
+case _ =>
+  throw new RuntimeException("Cannot create a binarySummary for a 
non-binary model" +
--- End diff --

"binarySummary" -> "binary summary"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-03-17 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r106726431
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1242,11 +1312,122 @@ class BinaryLogisticRegressionSummary 
private[classification] (
* This will change in later Spark versions.
*/
   @Since("1.5.0")
-  @transient lazy val recallByThreshold: DataFrame = {
+  def recallByThreshold: DataFrame = {
 binaryMetrics.recallByThreshold().toDF("threshold", "recall")
   }
 }
 
+sealed trait BinaryLogisticRegressionTrainingSummary extends 
BinaryLogisticRegressionSummary
+  with LogisticRegressionTrainingSummary {
+
+}
+
+/**
+ * :: Experimental ::
+ * Multinomial logistic regression training results.
+ *
+ * @param predictions dataframe output by the model's `transform` method.
+ * @param probabilityCol field in "predictions" which gives the 
probability of
+ *   each class as a vector.
+ * @param predictionCol field in "predictions" which gives the prediction 
for a data instance as a
+ *  double.
+ * @param labelCol field in "predictions" which gives the true label of 
each instance.
+ * @param featuresCol field in "predictions" which gives the features of 
each instance as a vector.
+ * @param objectiveHistory objective function (scaled loss + 
regularization) at each iteration.
+ */
+@Experimental
+@Since("1.5.0")
+class LogisticRegressionTrainingSummaryImpl private[classification](
+predictions: DataFrame,
+probabilityCol: String,
+predictionCol: String,
+labelCol: String,
+featuresCol: String,
+@Since("1.5.0") val objectiveHistory: Array[Double])
+  extends LogisticRegressionSummaryImpl(
+predictions, probabilityCol, predictionCol, labelCol, featuresCol)
+  with LogisticRegressionTrainingSummary {
+
+}
+
+/**
+ * :: Experimental ::
+ * Multinomial Logistic regression results for a given model.
+ *
+ * @param predictions dataframe output by the model's `transform` method.
+ * @param probabilityCol field in "predictions" which gives the 
probability of
+ *   each class as a vector.
+ * @param predictionCol field in "predictions" which gives the prediction 
for a data instance as a
+ *  double.
+ * @param labelCol field in "predictions" which gives the true label of 
each instance.
+ * @param featuresCol field in "predictions" which gives the features of 
each instance as a vector.
+ */
+@Experimental
+@Since("2.2.0")
+class LogisticRegressionSummaryImpl private[classification](
+@Since("2.2.0") @transient override val predictions: DataFrame,
+@Since("2.2.0") override val probabilityCol: String,
+@Since("2.2.0") override val predictionCol: String,
+@Since("2.2.0") override val labelCol: String,
+@Since("2.2.0") override val featuresCol: String)
+  extends LogisticRegressionSummary {
+
+}
+
+/**
+ * :: Experimental ::
+ * Binary Logistic regression training results.
+ *
+ * @param predictions dataframe output by the model's `transform` method.
+ * @param probabilityCol field in "predictions" which gives the 
probability of
+ *   each class as a vector.
+ * @param predictionCol field in "predictions" which gives the prediction 
for a data instance as a
+ *  double.
+ * @param labelCol field in "predictions" which gives the true label of 
each instance.
+ * @param featuresCol field in "predictions" which gives the features of 
each instance as a vector.
+ * @param objectiveHistory objective function (scaled loss + 
regularization) at each iteration.
+ */
+@Experimental
+@Since("2.2.0")
+class BinaryLogisticRegressionTrainingSummaryImpl private[classification] (
+predictions: DataFrame,
+probabilityCol: String,
+predictionCol: String,
+labelCol: String,
+featuresCol: String,
+@Since("1.5.0") override val objectiveHistory: Array[Double])
+  extends BinaryLogisticRegressionSummaryImpl(
+predictions, probabilityCol, predictionCol, labelCol, featuresCol)
+  with BinaryLogisticRegressionTrainingSummary {
+
+}
+
+/**
+ * :: Experimental ::
+ * Binary Logistic regression results for a given model.
+ *
+ * @param predictions dataframe output by the model's `transform` method.
+ * @param probabilityCol field in "predictions" which gives the 
probability of
+ *   each class as a vector.
+ * @param predictionCol field in "predictions" which gives the prediction 
of
+ *  each class as a 

[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-01-04 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r94643188
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -2006,7 +2075,7 @@ class LogisticRegressionSuite
 assert(prob === Vectors.dense(Array(0.0, 0.0, 0.0, 0.0, 1.0, 0.0)))
 assert(pred === 4.0)
 }
-// TODO: check num iters is zero when it become available in the model
+require(modelWithMetadata.summary.totalIterations == 0)
--- End diff --

Also, add `assert(model.summary.totalIterations === 0` after the first of 
these three tests and add `modelZeroLabel.summary.totalIterations > 0` after 
the second of these tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-01-04 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r94640294
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -1756,55 +1765,105 @@ class LogisticRegressionSuite
   }
 
   test("evaluate on test set") {
-// TODO: add for multiclass when model summary becomes available
 // Evaluate on test set should be same as that of the transformed 
training data.
-val lr = new LogisticRegression()
+val blor = new LogisticRegression()
   .setMaxIter(10)
   .setRegParam(1.0)
   .setThreshold(0.6)
-val model = lr.fit(smallBinaryDataset)
-val summary = 
model.summary.asInstanceOf[BinaryLogisticRegressionSummary]
-
-val sameSummary =
-  
model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary]
-assert(summary.areaUnderROC === sameSummary.areaUnderROC)
-assert(summary.roc.collect() === sameSummary.roc.collect())
-assert(summary.pr.collect === sameSummary.pr.collect())
+val blorModel = blor.fit(smallBinaryDataset)
+val blorSummary = blorModel.binarySummary
+
+val blorSameSummary =
+  
blorModel.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary]
+assert(blorSummary.areaUnderROC === blorSameSummary.areaUnderROC)
+assert(blorSummary.roc.collect() === blorSameSummary.roc.collect())
+assert(blorSummary.pr.collect === blorSameSummary.pr.collect())
 assert(
-  summary.fMeasureByThreshold.collect() === 
sameSummary.fMeasureByThreshold.collect())
-assert(summary.recallByThreshold.collect() === 
sameSummary.recallByThreshold.collect())
+  blorSummary.fMeasureByThreshold.collect() === 
blorSameSummary.fMeasureByThreshold.collect())
+assert(blorSummary.recallByThreshold.collect()
+  === blorSameSummary.recallByThreshold.collect())
 assert(
-  summary.precisionByThreshold.collect() === 
sameSummary.precisionByThreshold.collect())
+  blorSummary.precisionByThreshold.collect()
+=== blorSameSummary.precisionByThreshold.collect())
+
+val mlor = new LogisticRegression()
+  .setMaxIter(10)
+  .setRegParam(1.0)
+  .setFamily("multinomial")
+val mlorModel = mlor.fit(smallMultinomialDataset)
+val mlorSummary = mlorModel.multinomialSummary
+
+val mlorSameSummary = mlorModel.evaluate(smallMultinomialDataset)
+.asInstanceOf[MultinomialLogisticRegressionSummary]
+
+assert(mlorSummary.labels === mlorSameSummary.labels)
+assert(mlorSummary.falsePositiveRateByLabel === 
mlorSameSummary.falsePositiveRateByLabel)
+assert(mlorSummary.precisionByLabel === 
mlorSameSummary.precisionByLabel)
+assert(mlorSummary.recallByLabel === mlorSameSummary.recallByLabel)
+assert(mlorSummary.fMeasureByLabel === mlorSameSummary.fMeasureByLabel)
+assert(mlorSummary.accuracy === mlorSameSummary.accuracy)
+assert(mlorSummary.weightedFalsePositiveRate === 
mlorSameSummary.weightedFalsePositiveRate)
+assert(mlorSummary.weightedPrecision === 
mlorSameSummary.weightedPrecision)
+assert(mlorSummary.weightedRecall === mlorSameSummary.weightedRecall)
+assert(mlorSummary.weightedFMeasure === 
mlorSameSummary.weightedFMeasure)
   }
 
   test("evaluate with labels that are not doubles") {
 // Evaluate a test set with Label that is a numeric type other than 
Double
-val lr = new LogisticRegression()
+val blor = new LogisticRegression()
   .setMaxIter(1)
   .setRegParam(1.0)
-val model = lr.fit(smallBinaryDataset)
-val summary = 
model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary]
+val blorModel = blor.fit(smallBinaryDataset)
+val blorSummary = blorModel.evaluate(smallBinaryDataset)
+  .asInstanceOf[BinaryLogisticRegressionSummary]
 
-val longLabelData = 
smallBinaryDataset.select(col(model.getLabelCol).cast(LongType),
-  col(model.getFeaturesCol))
-val longSummary = 
model.evaluate(longLabelData).asInstanceOf[BinaryLogisticRegressionSummary]
+val blorLongLabelData = 
smallBinaryDataset.select(col(blorModel.getLabelCol).cast(LongType),
--- End diff --

Well, from a completeness standpoint I agree that it's better to test all 
the types that it's intended to work for. However, since it's just calling 
`cast` under the hood, it does seem a bit redundant. I'm ok leaving it as is, 
but I don't feel strongly about it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled 

[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-01-04 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r94614489
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -652,18 +652,27 @@ class LogisticRegression @Since("1.2.0") (
 
 val model = copyValues(new LogisticRegressionModel(uid, 
coefficientMatrix, interceptVector,
   numClasses, isMultinomial))
-// TODO: implement summary model for multinomial case
+
+val (summaryModel, probabilityColName, predictionColName)
+  = model.findSummaryModelWithProbabilityColAndPredictionCol()
 val m = if (!isMultinomial) {
--- End diff --

How about

scala
val logRegSummary = if (!isMultinomial) { new 
BinaryLogisticRegressionTrainingSummary(...) } else { ... }
model.setSummary(logRegSummary)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-01-04 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r94620487
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1120,21 +1239,129 @@ sealed trait LogisticRegressionSummary extends 
Serializable {
   /** Field in "predictions" which gives the probability of each class as 
a vector. */
   def probabilityCol: String
 
+  /** Field in "predictions" which gives the prediction of each class as a 
vector. */
+  def predictionCol: String
+
   /** Field in "predictions" which gives the true label of each instance 
(if available). */
   def labelCol: String
 
   /** Field in "predictions" which gives the features of each instance as 
a vector. */
   def featuresCol: String
 
+  /** Returns false positive rate for each label. */
+  @Since("2.1.0")
+  def falsePositiveRateByLabel: Array[Double]
+
+  /** Returns precision for each label. */
+  @Since("2.1.0")
+  def precisionByLabel: Array[Double]
+
+  /** Returns recall for each label. */
+  @Since("2.1.0")
+  def recallByLabel: Array[Double]
+
+  /**
+   * Returns f-measure for each label.
+   * @param beta the beta parameter.
+   */
+  @Since("2.1.0")
+  def fMeasureByLabel(beta: Double): Array[Double]
+
+  /** Returns f1-measure for each label. */
+  @Since("2.1.0")
+  def fMeasureByLabel: Array[Double]
+
+  /** Returns accuracy. */
+  @Since("2.1.0")
+  def accuracy: Double
+
+  /** Returns weighted false positive rate. */
+  @Since("2.1.0")
+  def weightedFalsePositiveRate: Double
+
+  /** Returns weighted averaged recall. */
+  @Since("2.1.0")
+  def weightedRecall: Double
+
+  /** Returns weighted averaged precision. */
+  @Since("2.1.0")
+  def weightedPrecision: Double
+
+  /**
+   * Returns weighted averaged f-measure.
+   * @param beta the beta parameter.
+   */
+  @Since("2.1.0")
+  def weightedFMeasure(beta: Double): Double
+
+  /** Returns weighted averaged f1-measure. */
+  @Since("2.1.0")
+  def weightedFMeasure: Double
 }
 
 /**
  * :: Experimental ::
- * Logistic regression training results.
+ * Multinomial Logistic regression training results.
  *
  * @param predictions dataframe output by the model's `transform` method.
  * @param probabilityCol field in "predictions" which gives the 
probability of
  *   each class as a vector.
+ * @param predictionCol field in "predictions" which gives the prediction 
of
+ *   each class as a vector.
+ * @param labelCol field in "predictions" which gives the true label of 
each instance.
+ * @param featuresCol field in "predictions" which gives the features of 
each instance as a vector.
+ * @param objectiveHistory objective function (scaled loss + 
regularization) at each iteration.
+ */
+@Experimental
+@Since("1.5.0")
+class MultinomialLogisticRegressionTrainingSummary private[classification] 
(
+predictions: DataFrame,
+probabilityCol: String,
+predictionCol: String,
+labelCol: String,
+featuresCol: String,
+@Since("1.5.0") val objectiveHistory: Array[Double])
+  extends MultinomialLogisticRegressionSummary(
+predictions, probabilityCol, predictionCol, labelCol, featuresCol)
+with LogisticRegressionTrainingSummary {
+
+}
+
+/**
+ * :: Experimental ::
+ * Multinomial Logistic regression results for a given model.
+ *
+ * @param predictions dataframe output by the model's `transform` method.
+ * @param probabilityCol field in "predictions" which gives the 
probability of
+ *   each class as a vector.
+ * @param predictionCol field in "predictions" which gives the prediction 
of
+ *   each class as a vector.
+ * @param labelCol field in "predictions" which gives the true label of 
each instance.
+ * @param featuresCol field in "predictions" which gives the features of 
each instance as a vector.
+ */
+@Experimental
+@Since("1.5.0")
+class MultinomialLogisticRegressionSummary private[classification](
+@Since("2.1.0") @transient override val predictions: DataFrame,
+@Since("2.1.0") override val probabilityCol: String,
+@Since("2.1.0") override val predictionCol: String,
+@Since("2.1.0") override val labelCol: String,
+@Since("2.1.0") override val featuresCol: String)
+  extends MulticlassSummary (
+predictions, predictionCol, labelCol)
+  with LogisticRegressionSummary {
+
+}
+
+/**
+ * :: Experimental ::
+ * Binary Logistic regression training results.
+ *
+ * @param 

[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-01-04 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r94520863
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1120,21 +1239,129 @@ sealed trait LogisticRegressionSummary extends 
Serializable {
   /** Field in "predictions" which gives the probability of each class as 
a vector. */
   def probabilityCol: String
 
+  /** Field in "predictions" which gives the prediction of each class as a 
vector. */
+  def predictionCol: String
+
   /** Field in "predictions" which gives the true label of each instance 
(if available). */
   def labelCol: String
 
   /** Field in "predictions" which gives the features of each instance as 
a vector. */
   def featuresCol: String
 
+  /** Returns false positive rate for each label. */
+  @Since("2.1.0")
+  def falsePositiveRateByLabel: Array[Double]
+
+  /** Returns precision for each label. */
+  @Since("2.1.0")
+  def precisionByLabel: Array[Double]
+
+  /** Returns recall for each label. */
+  @Since("2.1.0")
+  def recallByLabel: Array[Double]
+
+  /**
+   * Returns f-measure for each label.
+   * @param beta the beta parameter.
+   */
+  @Since("2.1.0")
+  def fMeasureByLabel(beta: Double): Array[Double]
+
+  /** Returns f1-measure for each label. */
+  @Since("2.1.0")
+  def fMeasureByLabel: Array[Double]
+
+  /** Returns accuracy. */
+  @Since("2.1.0")
+  def accuracy: Double
+
+  /** Returns weighted false positive rate. */
+  @Since("2.1.0")
+  def weightedFalsePositiveRate: Double
+
+  /** Returns weighted averaged recall. */
+  @Since("2.1.0")
+  def weightedRecall: Double
+
+  /** Returns weighted averaged precision. */
+  @Since("2.1.0")
+  def weightedPrecision: Double
+
+  /**
+   * Returns weighted averaged f-measure.
+   * @param beta the beta parameter.
+   */
+  @Since("2.1.0")
+  def weightedFMeasure(beta: Double): Double
+
+  /** Returns weighted averaged f1-measure. */
+  @Since("2.1.0")
+  def weightedFMeasure: Double
 }
 
 /**
  * :: Experimental ::
- * Logistic regression training results.
+ * Multinomial Logistic regression training results.
  *
  * @param predictions dataframe output by the model's `transform` method.
  * @param probabilityCol field in "predictions" which gives the 
probability of
  *   each class as a vector.
+ * @param predictionCol field in "predictions" which gives the prediction 
of
+ *   each class as a vector.
+ * @param labelCol field in "predictions" which gives the true label of 
each instance.
+ * @param featuresCol field in "predictions" which gives the features of 
each instance as a vector.
+ * @param objectiveHistory objective function (scaled loss + 
regularization) at each iteration.
+ */
+@Experimental
+@Since("1.5.0")
+class MultinomialLogisticRegressionTrainingSummary private[classification] 
(
--- End diff --

I prefer `MulticlassLogisticRegressionTrainingSummary`. "Multinomial" 
refers to the GLM family used to train the model, whereas "multiclass" refers 
to the type of classification. Since we already use "binary" and not "binomial" 
I think "multiclass" is best here and elsewhere.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-01-04 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r94640068
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -1762,51 +1781,101 @@ class LogisticRegressionSuite
   }
 
   test("evaluate on test set") {
-// TODO: add for multiclass when model summary becomes available
 // Evaluate on test set should be same as that of the transformed 
training data.
-val lr = new LogisticRegression()
+val blor = new LogisticRegression()
   .setMaxIter(10)
   .setRegParam(1.0)
   .setThreshold(0.6)
-val model = lr.fit(smallBinaryDataset)
-val summary = 
model.summary.asInstanceOf[BinaryLogisticRegressionSummary]
-
-val sameSummary =
-  
model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary]
-assert(summary.areaUnderROC === sameSummary.areaUnderROC)
-assert(summary.roc.collect() === sameSummary.roc.collect())
-assert(summary.pr.collect === sameSummary.pr.collect())
+val blorModel = blor.fit(smallBinaryDataset)
+val blorSummary = blorModel.binarySummary
+
+val sameBlorSummary =
+  
blorModel.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary]
+assert(blorSummary.areaUnderROC === sameBlorSummary.areaUnderROC)
+assert(blorSummary.roc.collect() === sameBlorSummary.roc.collect())
+assert(blorSummary.pr.collect === sameBlorSummary.pr.collect())
 assert(
-  summary.fMeasureByThreshold.collect() === 
sameSummary.fMeasureByThreshold.collect())
-assert(summary.recallByThreshold.collect() === 
sameSummary.recallByThreshold.collect())
+  blorSummary.fMeasureByThreshold.collect() === 
sameBlorSummary.fMeasureByThreshold.collect())
 assert(
-  summary.precisionByThreshold.collect() === 
sameSummary.precisionByThreshold.collect())
+  blorSummary.recallByThreshold.collect() === 
sameBlorSummary.recallByThreshold.collect())
+assert(
+  blorSummary.precisionByThreshold.collect()
+=== sameBlorSummary.precisionByThreshold.collect())
+
+val mlor = new LogisticRegression()
+  .setMaxIter(10)
+  .setRegParam(1.0)
+  .setFamily("multinomial")
+val mlorModel = blor.fit(smallMultinomialDataset)
+val mlorSummary = mlorModel.summary
+
+val mlorSameSummary = mlorModel.evaluate(smallMultinomialDataset)
+  .asInstanceOf[MultinomialLogisticRegressionSummary]
+
+assert(mlorSummary.falsePositiveRateByLabel === 
mlorSameSummary.falsePositiveRateByLabel)
+assert(mlorSummary.precisionByLabel === 
mlorSameSummary.precisionByLabel)
+assert(mlorSummary.recallByLabel === mlorSameSummary.recallByLabel)
+assert(mlorSummary.fMeasureByLabel === mlorSameSummary.fMeasureByLabel)
+assert(mlorSummary.accuracy === mlorSameSummary.accuracy)
+assert(mlorSummary.weightedFalsePositiveRate === 
mlorSameSummary.weightedFalsePositiveRate)
+assert(mlorSummary.weightedPrecision === 
mlorSameSummary.weightedPrecision)
+assert(mlorSummary.weightedRecall === mlorSameSummary.weightedRecall)
+assert(mlorSummary.weightedFMeasure === 
mlorSameSummary.weightedFMeasure)
   }
 
   test("evaluate with labels that are not doubles") {
 // Evaluate a test set with Label that is a numeric type other than 
Double
-val lr = new LogisticRegression()
+val blor = new LogisticRegression()
   .setMaxIter(1)
   .setRegParam(1.0)
--- End diff --

add `.setFamily("binomial")`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-01-04 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r94638575
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -1762,51 +1781,101 @@ class LogisticRegressionSuite
   }
 
   test("evaluate on test set") {
-// TODO: add for multiclass when model summary becomes available
 // Evaluate on test set should be same as that of the transformed 
training data.
-val lr = new LogisticRegression()
+val blor = new LogisticRegression()
   .setMaxIter(10)
   .setRegParam(1.0)
   .setThreshold(0.6)
-val model = lr.fit(smallBinaryDataset)
-val summary = 
model.summary.asInstanceOf[BinaryLogisticRegressionSummary]
-
-val sameSummary =
-  
model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary]
-assert(summary.areaUnderROC === sameSummary.areaUnderROC)
-assert(summary.roc.collect() === sameSummary.roc.collect())
-assert(summary.pr.collect === sameSummary.pr.collect())
+val blorModel = blor.fit(smallBinaryDataset)
+val blorSummary = blorModel.binarySummary
+
+val sameBlorSummary =
+  
blorModel.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary]
+assert(blorSummary.areaUnderROC === sameBlorSummary.areaUnderROC)
+assert(blorSummary.roc.collect() === sameBlorSummary.roc.collect())
+assert(blorSummary.pr.collect === sameBlorSummary.pr.collect())
 assert(
-  summary.fMeasureByThreshold.collect() === 
sameSummary.fMeasureByThreshold.collect())
-assert(summary.recallByThreshold.collect() === 
sameSummary.recallByThreshold.collect())
+  blorSummary.fMeasureByThreshold.collect() === 
sameBlorSummary.fMeasureByThreshold.collect())
 assert(
-  summary.precisionByThreshold.collect() === 
sameSummary.precisionByThreshold.collect())
+  blorSummary.recallByThreshold.collect() === 
sameBlorSummary.recallByThreshold.collect())
+assert(
+  blorSummary.precisionByThreshold.collect()
+=== sameBlorSummary.precisionByThreshold.collect())
+
+val mlor = new LogisticRegression()
--- End diff --

This is never used since you fit with `blor` below. We can probably make a 
single estimator called `lr` and then just reuse it for both tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-01-04 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r94616573
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1120,21 +1239,129 @@ sealed trait LogisticRegressionSummary extends 
Serializable {
   /** Field in "predictions" which gives the probability of each class as 
a vector. */
   def probabilityCol: String
 
+  /** Field in "predictions" which gives the prediction of each class as a 
vector. */
+  def predictionCol: String
+
   /** Field in "predictions" which gives the true label of each instance 
(if available). */
   def labelCol: String
 
   /** Field in "predictions" which gives the features of each instance as 
a vector. */
   def featuresCol: String
 
+  /** Returns false positive rate for each label. */
+  @Since("2.1.0")
+  def falsePositiveRateByLabel: Array[Double]
+
+  /** Returns precision for each label. */
+  @Since("2.1.0")
+  def precisionByLabel: Array[Double]
+
+  /** Returns recall for each label. */
+  @Since("2.1.0")
+  def recallByLabel: Array[Double]
+
+  /**
+   * Returns f-measure for each label.
+   * @param beta the beta parameter.
+   */
+  @Since("2.1.0")
+  def fMeasureByLabel(beta: Double): Array[Double]
+
+  /** Returns f1-measure for each label. */
+  @Since("2.1.0")
+  def fMeasureByLabel: Array[Double]
+
+  /** Returns accuracy. */
+  @Since("2.1.0")
+  def accuracy: Double
+
+  /** Returns weighted false positive rate. */
+  @Since("2.1.0")
+  def weightedFalsePositiveRate: Double
+
+  /** Returns weighted averaged recall. */
+  @Since("2.1.0")
+  def weightedRecall: Double
+
+  /** Returns weighted averaged precision. */
+  @Since("2.1.0")
+  def weightedPrecision: Double
+
+  /**
+   * Returns weighted averaged f-measure.
+   * @param beta the beta parameter.
+   */
+  @Since("2.1.0")
+  def weightedFMeasure(beta: Double): Double
+
+  /** Returns weighted averaged f1-measure. */
+  @Since("2.1.0")
+  def weightedFMeasure: Double
 }
 
 /**
  * :: Experimental ::
- * Logistic regression training results.
+ * Multinomial Logistic regression training results.
--- End diff --

nit: don't capitalize logistic here and elsewhere


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-01-04 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r94520673
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -788,19 +797,39 @@ class LogisticRegressionModel private[spark] (
 throw new SparkException("No training summary available for this 
LogisticRegressionModel")
   }
 
+  @Since("2.1.0")
+  def binarySummary: BinaryLogisticRegressionTrainingSummary = summary 
match {
+case b: BinaryLogisticRegressionTrainingSummary => b
+case _ =>
+  throw new RuntimeException("Cannot create a binarySummary for a 
non-binary model" +
+s"(numClasses=${numClasses}), use multinomialSummary instead.")
+  }
+
   /**
* If the probability column is set returns the current model and 
probability column,
--- End diff --

These lines should be deleted I think.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-01-04 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r94615507
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1095,6 +1131,89 @@ private[classification] class MultiClassSummarizer 
extends Serializable {
 }
 
 /**
+ * :: Experimental ::
+ * Summary of multi-classification algorithms.
+ *
+ * @param predictions  [[DataFrame]] produced by model.transform().
+ * @param predictionCol  Name for column of prediction in `predictions`.
+ * @param labelCol  Name for column of label in `predictions`.
+ */
+@Experimental
+@Since("2.1.0")
+class MulticlassSummary private[ml] (
+@transient val predictions: DataFrame,
+val predictionCol: String,
+val labelCol: String) extends Serializable {
+
+  @transient private val multinomialMetrics = {
--- End diff --

"multiclassMetrics"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-01-04 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r94613945
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -788,19 +797,39 @@ class LogisticRegressionModel private[spark] (
 throw new SparkException("No training summary available for this 
LogisticRegressionModel")
   }
 
+  @Since("2.1.0")
+  def binarySummary: BinaryLogisticRegressionTrainingSummary = summary 
match {
+case b: BinaryLogisticRegressionTrainingSummary => b
+case _ =>
+  throw new RuntimeException("Cannot create a binarySummary for a 
non-binary model" +
+s"(numClasses=${numClasses}), use multinomialSummary instead.")
--- End diff --

"multinomialSummary" => "summary"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-01-04 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r94616155
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1095,6 +1131,89 @@ private[classification] class MultiClassSummarizer 
extends Serializable {
 }
 
 /**
+ * :: Experimental ::
+ * Summary of multi-classification algorithms.
+ *
+ * @param predictions  [[DataFrame]] produced by model.transform().
+ * @param predictionCol  Name for column of prediction in `predictions`.
+ * @param labelCol  Name for column of label in `predictions`.
+ */
+@Experimental
+@Since("2.1.0")
+class MulticlassSummary private[ml] (
+@transient val predictions: DataFrame,
+val predictionCol: String,
+val labelCol: String) extends Serializable {
+
+  @transient private val multinomialMetrics = {
+new MulticlassMetrics(
+  predictions.select(
+col(predictionCol),
+col(labelCol).cast(DoubleType))
+.rdd.map { case Row(prediction: Double, label: Double) => 
(prediction, label) })
+  }
+
+  /** Returns false positive rate for each label. */
+  @Since("2.1.0")
+  @transient lazy val falsePositiveRateByLabel: Array[Double] = {
+multinomialMetrics.labels.map(label => 
multinomialMetrics.falsePositiveRate(label))
+  }
+
+  /** Returns precision for each label. */
+  @Since("2.1.0")
+  @transient lazy val precisionByLabel: Array[Double] = {
+multinomialMetrics.labels.map(label => 
multinomialMetrics.precision(label))
+  }
+
+  /** Returns recall for each label. */
+  @Since("2.1.0")
+  @transient lazy val recallByLabel: Array[Double] = {
+multinomialMetrics.labels.map(label => 
multinomialMetrics.recall(label))
+  }
+
+  /**
+   * Returns f-measure for each label.
+   * @param beta the beta parameter.
+   */
+  @Since("2.1.0")
+  def fMeasureByLabel(beta: Double): Array[Double] = {
+multinomialMetrics.labels.map(label => 
multinomialMetrics.fMeasure(label, beta))
+  }
+
+  /** Returns f1-measure for each label. */
+  @Since("2.1.0")
+  @transient lazy val fMeasureByLabel: Array[Double] = fMeasureByLabel(1.0)
+
+  /** Returns accuracy. */
+  @Since("2.1.0")
+  @transient lazy val accuracy: Double = multinomialMetrics.accuracy
+
+  /** Returns weighted false positive rate. */
+  @Since("2.1.0")
+  @transient lazy val weightedFalsePositiveRate: Double
+= multinomialMetrics.weightedFalsePositiveRate
+
+  /** Returns weighted averaged recall. */
+  @Since("2.1.0")
+  @transient lazy val weightedRecall: Double = 
multinomialMetrics.weightedRecall
+
+  /** Returns weighted averaged precision. */
+  @Since("2.1.0")
+  @transient lazy val weightedPrecision: Double = 
multinomialMetrics.weightedPrecision
+
+  /**
+   * Returns weighted averaged f-measure.
+   * @param beta the beta parameter.
--- End diff --

Same as above.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-01-04 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r94615417
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1095,6 +1131,89 @@ private[classification] class MultiClassSummarizer 
extends Serializable {
 }
 
 /**
+ * :: Experimental ::
+ * Summary of multi-classification algorithms.
--- End diff --

"Summary of multiclass classification results."


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-01-04 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r94617024
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1159,17 +1388,23 @@ class BinaryLogisticRegressionTrainingSummary 
private[classification] (
  * @param predictions dataframe output by the model's `transform` method.
  * @param probabilityCol field in "predictions" which gives the 
probability of
  *   each class as a vector.
+ * @param predictionCol field in "predictions" which gives the prediction 
of
+ *   each class as a vector.
--- End diff --

nit: indentation


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-01-04 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r94521336
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1095,6 +1131,89 @@ private[classification] class MultiClassSummarizer 
extends Serializable {
 }
 
 /**
+ * :: Experimental ::
+ * Summary of multi-classification algorithms.
+ *
+ * @param predictions  [[DataFrame]] produced by model.transform().
+ * @param predictionCol  Name for column of prediction in `predictions`.
+ * @param labelCol  Name for column of label in `predictions`.
+ */
+@Experimental
+@Since("2.1.0")
+class MulticlassSummary private[ml] (
+@transient val predictions: DataFrame,
+val predictionCol: String,
+val labelCol: String) extends Serializable {
+
+  @transient private val multinomialMetrics = {
+new MulticlassMetrics(
+  predictions.select(
+col(predictionCol),
+col(labelCol).cast(DoubleType))
+.rdd.map { case Row(prediction: Double, label: Double) => 
(prediction, label) })
+  }
+
+  /** Returns false positive rate for each label. */
+  @Since("2.1.0")
+  @transient lazy val falsePositiveRateByLabel: Array[Double] = {
+multinomialMetrics.labels.map(label => 
multinomialMetrics.falsePositiveRate(label))
+  }
+
+  /** Returns precision for each label. */
+  @Since("2.1.0")
+  @transient lazy val precisionByLabel: Array[Double] = {
+multinomialMetrics.labels.map(label => 
multinomialMetrics.precision(label))
+  }
+
+  /** Returns recall for each label. */
+  @Since("2.1.0")
+  @transient lazy val recallByLabel: Array[Double] = {
+multinomialMetrics.labels.map(label => 
multinomialMetrics.recall(label))
+  }
+
+  /**
+   * Returns f-measure for each label.
+   * @param beta the beta parameter.
--- End diff --

This description is not helpful. Let's either put nothing, or say something 
like "parameter which controls the balance between precision and recall in the 
f-measure"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-01-04 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r94520622
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -788,19 +797,39 @@ class LogisticRegressionModel private[spark] (
 throw new SparkException("No training summary available for this 
LogisticRegressionModel")
   }
 
+  @Since("2.1.0")
+  def binarySummary: BinaryLogisticRegressionTrainingSummary = summary 
match {
+case b: BinaryLogisticRegressionTrainingSummary => b
+case _ =>
+  throw new RuntimeException("Cannot create a binarySummary for a 
non-binary model" +
+s"(numClasses=${numClasses}), use multinomialSummary instead.")
+  }
+
   /**
* If the probability column is set returns the current model and 
probability column,
* otherwise generates a new column and sets it as the probability 
column on a new copy
* of the current model.
+   * If the probability column and prediction column is set returns the 
current model, otherwise
+   * generate new columns for them and set them as probability or 
prediction column on a new copy
+   * of the current model
*/
-  private[classification] def findSummaryModelAndProbabilityCol():
-  (LogisticRegressionModel, String) = {
-$(probabilityCol) match {
-  case "" =>
-val probabilityColName = "probability_" + 
java.util.UUID.randomUUID.toString
-(copy(ParamMap.empty).setProbabilityCol(probabilityColName), 
probabilityColName)
-  case p => (this, p)
+  private[classification] def 
findSummaryModelWithProbabilityColAndPredictionCol():
--- End diff --

Maybe we can just call it `findSummaryModel` and get rid of the vars:

scala
 private[classification] def findSummaryModel(): (LogisticRegressionModel, 
String, String) = {
val model = if ($(probabilityCol).isEmpty && $(predictionCol).isEmpty) {
  copy(ParamMap.empty)
.setProbabilityCol("probability_" + 
java.util.UUID.randomUUID.toString)
.setPredictionCol("prediction_" + 
java.util.UUID.randomUUID.toString)
} else if ($(probabilityCol).isEmpty) {
  copy(ParamMap.empty).setProbabilityCol("probability_" + 
java.util.UUID.randomUUID.toString)
} else if ($(predictionCol).isEmpty) {
  copy(ParamMap.empty).setPredictionCol("prediction_" + 
java.util.UUID.randomUUID.toString)
} else {
  this
}
(model, model.getProbabilityCol, model.getPredictionCol)
  }



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-01-04 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r94521006
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1095,6 +1131,89 @@ private[classification] class MultiClassSummarizer 
extends Serializable {
 }
 
 /**
+ * :: Experimental ::
+ * Summary of multi-classification algorithms.
+ *
+ * @param predictions  [[DataFrame]] produced by model.transform().
+ * @param predictionCol  Name for column of prediction in `predictions`.
+ * @param labelCol  Name for column of label in `predictions`.
+ */
+@Experimental
+@Since("2.1.0")
+class MulticlassSummary private[ml] (
--- End diff --

I prefer `MulticlassClassificationSummary` especially since we have a 
`MulticlassSummarizer` in the same file :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-01-04 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r94618477
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -2006,7 +2075,7 @@ class LogisticRegressionSuite
 assert(prob === Vectors.dense(Array(0.0, 0.0, 0.0, 0.0, 1.0, 0.0)))
 assert(pred === 4.0)
 }
-// TODO: check num iters is zero when it become available in the model
+require(modelWithMetadata.summary.totalIterations == 0)
--- End diff --

assert with `===`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-01-04 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r94638095
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -149,15 +149,34 @@ class LogisticRegressionSuite
 assert(copiedModel.hasSummary)
   }
 
-  test("empty probabilityCol") {
-val lr = new LogisticRegression().setProbabilityCol("")
-val model = lr.fit(smallBinaryDataset)
+  test("empty probabilityCol or predictionCol") {
+var lr = new LogisticRegression().setProbabilityCol("")
--- End diff --

I think we can get rid of the vars and shorten this up. Also, we should 
probably test both binary and multiclass. How does this look?:

scala
val lr = new LogisticRegression().setMaxIter(1)
val datasetFieldNames = smallBinaryDataset.schema.fieldNames.toSet
def checkSummarySchema(model: LogisticRegressionModel, columns: 
Seq[String]): Unit = {
  val fieldNames = model.summary.predictions.schema.fieldNames
  assert(model.hasSummary)
  assert(datasetFieldNames.subsetOf(fieldNames.toSet))
  columns.foreach { c => assert(fieldNames.exists(_.startsWith(c))) }
}
Seq("binomial", "multinomial").foreach { family =>
  lr.setProbabilityCol("")
  val modelNoProb = lr.fit(smallBinaryDataset)
  checkSummarySchema(modelNoProb, Seq("probability_"))

  lr.setProbabilityCol("probability").setPredictionCol("")
  val modelNoPred = lr.fit(smallBinaryDataset)
  checkSummarySchema(modelNoPred, Seq("prediction_"))

  lr.setProbabilityCol("").setPredictionCol("")
  val modelNoPredNoProb = lr.fit(smallBinaryDataset)
  checkSummarySchema(modelNoPredNoProb, Seq("prediction_", 
"probability_"))
}



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-01-04 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r94621426
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -1762,51 +1781,101 @@ class LogisticRegressionSuite
   }
 
   test("evaluate on test set") {
-// TODO: add for multiclass when model summary becomes available
 // Evaluate on test set should be same as that of the transformed 
training data.
-val lr = new LogisticRegression()
+val blor = new LogisticRegression()
   .setMaxIter(10)
   .setRegParam(1.0)
   .setThreshold(0.6)
-val model = lr.fit(smallBinaryDataset)
-val summary = 
model.summary.asInstanceOf[BinaryLogisticRegressionSummary]
-
-val sameSummary =
-  
model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary]
-assert(summary.areaUnderROC === sameSummary.areaUnderROC)
-assert(summary.roc.collect() === sameSummary.roc.collect())
-assert(summary.pr.collect === sameSummary.pr.collect())
+val blorModel = blor.fit(smallBinaryDataset)
+val blorSummary = blorModel.binarySummary
+
+val sameBlorSummary =
+  
blorModel.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary]
+assert(blorSummary.areaUnderROC === sameBlorSummary.areaUnderROC)
+assert(blorSummary.roc.collect() === sameBlorSummary.roc.collect())
+assert(blorSummary.pr.collect === sameBlorSummary.pr.collect())
 assert(
-  summary.fMeasureByThreshold.collect() === 
sameSummary.fMeasureByThreshold.collect())
-assert(summary.recallByThreshold.collect() === 
sameSummary.recallByThreshold.collect())
+  blorSummary.fMeasureByThreshold.collect() === 
sameBlorSummary.fMeasureByThreshold.collect())
 assert(
-  summary.precisionByThreshold.collect() === 
sameSummary.precisionByThreshold.collect())
+  blorSummary.recallByThreshold.collect() === 
sameBlorSummary.recallByThreshold.collect())
+assert(
+  blorSummary.precisionByThreshold.collect()
+=== sameBlorSummary.precisionByThreshold.collect())
+
+val mlor = new LogisticRegression()
+  .setMaxIter(10)
+  .setRegParam(1.0)
+  .setFamily("multinomial")
+val mlorModel = blor.fit(smallMultinomialDataset)
+val mlorSummary = mlorModel.summary
+
+val mlorSameSummary = mlorModel.evaluate(smallMultinomialDataset)
+  .asInstanceOf[MultinomialLogisticRegressionSummary]
+
+assert(mlorSummary.falsePositiveRateByLabel === 
mlorSameSummary.falsePositiveRateByLabel)
+assert(mlorSummary.precisionByLabel === 
mlorSameSummary.precisionByLabel)
+assert(mlorSummary.recallByLabel === mlorSameSummary.recallByLabel)
+assert(mlorSummary.fMeasureByLabel === mlorSameSummary.fMeasureByLabel)
+assert(mlorSummary.accuracy === mlorSameSummary.accuracy)
+assert(mlorSummary.weightedFalsePositiveRate === 
mlorSameSummary.weightedFalsePositiveRate)
+assert(mlorSummary.weightedPrecision === 
mlorSameSummary.weightedPrecision)
+assert(mlorSummary.weightedRecall === mlorSameSummary.weightedRecall)
+assert(mlorSummary.weightedFMeasure === 
mlorSameSummary.weightedFMeasure)
   }
 
   test("evaluate with labels that are not doubles") {
 // Evaluate a test set with Label that is a numeric type other than 
Double
-val lr = new LogisticRegression()
+val blor = new LogisticRegression()
   .setMaxIter(1)
   .setRegParam(1.0)
-val model = lr.fit(smallBinaryDataset)
-val summary = 
model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary]
+val blorModel = blor.fit(smallBinaryDataset)
+val blorSummary = blorModel.evaluate(smallBinaryDataset)
+  .asInstanceOf[BinaryLogisticRegressionSummary]
 
-val longLabelData = 
smallBinaryDataset.select(col(model.getLabelCol).cast(LongType),
-  col(model.getFeaturesCol))
-val longSummary = 
model.evaluate(longLabelData).asInstanceOf[BinaryLogisticRegressionSummary]
+val blorLongLabelData = 
smallBinaryDataset.select(col(blorModel.getLabelCol).cast(LongType),
+  col(blorModel.getFeaturesCol))
+val blorLongSummary = blorModel.evaluate(blorLongLabelData)
+  .asInstanceOf[BinaryLogisticRegressionSummary]
 
-assert(summary.areaUnderROC === longSummary.areaUnderROC)
+assert(blorSummary.areaUnderROC === blorLongSummary.areaUnderROC)
+
+val mlor = new LogisticRegression()
+  .setMaxIter(1)
+  .setRegParam(1.0)
+  .setFamily("multinomial")
+val mlorModel = mlor.fit(smallMultinomialDataset)
+val mlorSummary = 

[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-01-04 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r94615848
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1095,6 +1131,89 @@ private[classification] class MultiClassSummarizer 
extends Serializable {
 }
 
 /**
+ * :: Experimental ::
+ * Summary of multi-classification algorithms.
+ *
+ * @param predictions  [[DataFrame]] produced by model.transform().
+ * @param predictionCol  Name for column of prediction in `predictions`.
+ * @param labelCol  Name for column of label in `predictions`.
+ */
+@Experimental
+@Since("2.1.0")
+class MulticlassSummary private[ml] (
+@transient val predictions: DataFrame,
+val predictionCol: String,
+val labelCol: String) extends Serializable {
+
+  @transient private val multinomialMetrics = {
+new MulticlassMetrics(
+  predictions.select(
+col(predictionCol),
+col(labelCol).cast(DoubleType))
+.rdd.map { case Row(prediction: Double, label: Double) => 
(prediction, label) })
+  }
+
+  /** Returns false positive rate for each label. */
+  @Since("2.1.0")
+  @transient lazy val falsePositiveRateByLabel: Array[Double] = {
+multinomialMetrics.labels.map(label => 
multinomialMetrics.falsePositiveRate(label))
+  }
+
+  /** Returns precision for each label. */
+  @Since("2.1.0")
--- End diff --

Update all since tags


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2016-10-26 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r85220020
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -777,6 +787,20 @@ class LogisticRegressionModel private[spark] (
 throw new SparkException("No training summary available for this 
LogisticRegressionModel")
   }
 
+  @Since("2.1.0")
+  def binarySummary: BinaryLogisticRegressionTrainingSummary = summary 
match {
+case b: BinaryLogisticRegressionTrainingSummary => b
+case _ => throw new SparkException("Summary type dismatch, " +
+  "BinaryLogisticRegressionTrainingSummary expected.")
+  }
+
+  @Since("2.1.0")
+  def multinomialSummary: MultinomialLogisticRegressionTrainingSummary = 
summary match {
+case m: MultinomialLogisticRegressionTrainingSummary => m
+case _ => throw new SparkException("Summary type dismatch, " +
--- End diff --

We actually could return a valid multinomial summary, even for binary 
problems.  I'd recommend we do that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2016-10-26 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r85220068
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1141,6 +1174,21 @@ class BinaryLogisticRegressionTrainingSummary 
private[classification] (
 
 }
 
+@Experimental
--- End diff --

Add scala doc


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2016-10-26 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r85218915
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -777,6 +787,20 @@ class LogisticRegressionModel private[spark] (
 throw new SparkException("No training summary available for this 
LogisticRegressionModel")
   }
 
+  @Since("2.1.0")
+  def binarySummary: BinaryLogisticRegressionTrainingSummary = summary 
match {
+case b: BinaryLogisticRegressionTrainingSummary => b
+case _ => throw new SparkException("Summary type dismatch, " +
--- End diff --

Don't use SparkException here; that's primarily for failures within Spark 
jobs AFAIK.  I guess I'd use RuntimeException.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2016-10-26 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r85220866
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1390,7 +1505,8 @@ class BinaryLogisticRegressionSummary 
private[classification] (
  *$$
  * 
  *
- * @param bcCoefficients The broadcast coefficients corresponding to the 
features.
+  *
+  * @param bcCoefficients The broadcast coefficients corresponding to the 
features.
--- End diff --

fix indentation


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2016-10-26 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r85220059
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -919,6 +948,10 @@ class LogisticRegressionModel private[spark] (
 }
   }
 
+  private[classification] val probability2predictionUdf = udf {
--- End diff --

This UDF might pull this entire class into the closure and thus into the 
summaries.  I believe we can eliminate this UDF; I'll comment more below.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2016-10-26 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r85218235
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -721,7 +730,8 @@ class LogisticRegressionModel private[spark] (
   /**
* The model intercept for "binomial" logistic regression. If this model 
was fit with the
* "multinomial" family then an exception is thrown.
-   * @return Double
+*
--- End diff --

same here: remove these 2 lines since they don't say anything useful.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2016-10-26 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r85218200
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -698,7 +706,8 @@ class LogisticRegressionModel private[spark] (
   /**
* A vector of model coefficients for "binomial" logistic regression. If 
this model was trained
* using the "multinomial" family then an exception is thrown.
-   * @return Vector
+*
--- End diff --

I'd actually just remove these 2 lines since they don't say anything useful.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2016-10-26 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r85219325
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -777,6 +787,20 @@ class LogisticRegressionModel private[spark] (
 throw new SparkException("No training summary available for this 
LogisticRegressionModel")
   }
 
+  @Since("2.1.0")
+  def binarySummary: BinaryLogisticRegressionTrainingSummary = summary 
match {
+case b: BinaryLogisticRegressionTrainingSummary => b
+case _ => throw new SparkException("Summary type dismatch, " +
--- End diff --

Also, the current error doesn't really help the user.  How about "Cannot 
create a binarySummary for a non-binary model (with numClasses = $numClasses).  
Use multinomialSummary instead."


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2016-10-26 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r85218415
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -777,6 +787,20 @@ class LogisticRegressionModel private[spark] (
 throw new SparkException("No training summary available for this 
LogisticRegressionModel")
   }
 
+  @Since("2.1.0")
--- End diff --

Add Scala doc, especially saying that this will throw an exception when 
numClasses > 2.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2016-10-26 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r85220713
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1243,6 +1291,73 @@ class BinaryLogisticRegressionSummary 
private[classification] (
   }
 }
 
+@Experimental
+@Since("2.1.0")
+class MultinomialLogisticRegressionSummary private[classification] (
+@Since("1.5.0") @transient override val predictions: DataFrame,
+@Since("2.1.0") override val probabilityCol: String,
+@Since("1.5.0") override val labelCol: String,
+@Since("1.6.0") override val featuresCol: String,
+private val probability2PredictionUdf: UserDefinedFunction) extends 
LogisticRegressionSummary {
+
+  private val sparkSession = predictions.sparkSession
+  import sparkSession.implicits._
+
+  @transient private val multinomialMetrics = {
+new MulticlassMetrics(
+  predictions.select(
+probability2PredictionUdf(col(probabilityCol)),
+col(labelCol).cast(DoubleType))
+.rdd.map { case Row(prediction: Double, label: Double) => 
(prediction, label) })
+  }
+
+  @Since("2.1.0")
+  @transient lazy val labels: Array[Double] = multinomialMetrics.labels
--- End diff --

Add scala doc for all of these


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2016-10-26 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r85218947
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -777,6 +787,20 @@ class LogisticRegressionModel private[spark] (
 throw new SparkException("No training summary available for this 
LogisticRegressionModel")
   }
 
+  @Since("2.1.0")
+  def binarySummary: BinaryLogisticRegressionTrainingSummary = summary 
match {
+case b: BinaryLogisticRegressionTrainingSummary => b
+case _ => throw new SparkException("Summary type dismatch, " +
+  "BinaryLogisticRegressionTrainingSummary expected.")
+  }
+
+  @Since("2.1.0")
--- End diff --

Add doc here too


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2016-10-26 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r85220564
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1243,6 +1291,73 @@ class BinaryLogisticRegressionSummary 
private[classification] (
   }
 }
 
+@Experimental
+@Since("2.1.0")
+class MultinomialLogisticRegressionSummary private[classification] (
+@Since("1.5.0") @transient override val predictions: DataFrame,
+@Since("2.1.0") override val probabilityCol: String,
+@Since("1.5.0") override val labelCol: String,
+@Since("1.6.0") override val featuresCol: String,
+private val probability2PredictionUdf: UserDefinedFunction) extends 
LogisticRegressionSummary {
+
+  private val sparkSession = predictions.sparkSession
--- End diff --

Do you need these 2 lines?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2016-10-26 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r85220449
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1243,6 +1291,73 @@ class BinaryLogisticRegressionSummary 
private[classification] (
   }
 }
 
+@Experimental
+@Since("2.1.0")
+class MultinomialLogisticRegressionSummary private[classification] (
+@Since("1.5.0") @transient override val predictions: DataFrame,
--- End diff --

These since tags should all be 2.1.0.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2016-10-26 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r85220375
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1243,6 +1291,73 @@ class BinaryLogisticRegressionSummary 
private[classification] (
   }
 }
 
+@Experimental
+@Since("2.1.0")
+class MultinomialLogisticRegressionSummary private[classification] (
+@Since("1.5.0") @transient override val predictions: DataFrame,
+@Since("2.1.0") override val probabilityCol: String,
+@Since("1.5.0") override val labelCol: String,
+@Since("1.6.0") override val featuresCol: String,
+private val probability2PredictionUdf: UserDefinedFunction) extends 
LogisticRegressionSummary {
+
+  private val sparkSession = predictions.sparkSession
+  import sparkSession.implicits._
+
+  @transient private val multinomialMetrics = {
+new MulticlassMetrics(
+  predictions.select(
+probability2PredictionUdf(col(probabilityCol)),
--- End diff --

This is the only place where you need the probability2predictionUDF.  
Rather than passing a UDF or the model, I'd like us to modify 
findSummaryModelAndProbabilityCol to also set predictionCol if needed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2016-10-21 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r84468260
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -1756,55 +1765,105 @@ class LogisticRegressionSuite
   }
 
   test("evaluate on test set") {
-// TODO: add for multiclass when model summary becomes available
 // Evaluate on test set should be same as that of the transformed 
training data.
-val lr = new LogisticRegression()
+val blor = new LogisticRegression()
   .setMaxIter(10)
   .setRegParam(1.0)
   .setThreshold(0.6)
-val model = lr.fit(smallBinaryDataset)
-val summary = 
model.summary.asInstanceOf[BinaryLogisticRegressionSummary]
-
-val sameSummary =
-  
model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary]
-assert(summary.areaUnderROC === sameSummary.areaUnderROC)
-assert(summary.roc.collect() === sameSummary.roc.collect())
-assert(summary.pr.collect === sameSummary.pr.collect())
+val blorModel = blor.fit(smallBinaryDataset)
+val blorSummary = blorModel.binarySummary
+
+val blorSameSummary =
+  
blorModel.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary]
+assert(blorSummary.areaUnderROC === blorSameSummary.areaUnderROC)
+assert(blorSummary.roc.collect() === blorSameSummary.roc.collect())
+assert(blorSummary.pr.collect === blorSameSummary.pr.collect())
 assert(
-  summary.fMeasureByThreshold.collect() === 
sameSummary.fMeasureByThreshold.collect())
-assert(summary.recallByThreshold.collect() === 
sameSummary.recallByThreshold.collect())
+  blorSummary.fMeasureByThreshold.collect() === 
blorSameSummary.fMeasureByThreshold.collect())
+assert(blorSummary.recallByThreshold.collect()
+  === blorSameSummary.recallByThreshold.collect())
 assert(
-  summary.precisionByThreshold.collect() === 
sameSummary.precisionByThreshold.collect())
+  blorSummary.precisionByThreshold.collect()
+=== blorSameSummary.precisionByThreshold.collect())
+
+val mlor = new LogisticRegression()
+  .setMaxIter(10)
+  .setRegParam(1.0)
+  .setFamily("multinomial")
+val mlorModel = mlor.fit(smallMultinomialDataset)
+val mlorSummary = mlorModel.multinomialSummary
+
+val mlorSameSummary = mlorModel.evaluate(smallMultinomialDataset)
+.asInstanceOf[MultinomialLogisticRegressionSummary]
+
+assert(mlorSummary.labels === mlorSameSummary.labels)
+assert(mlorSummary.falsePositiveRateByLabel === 
mlorSameSummary.falsePositiveRateByLabel)
+assert(mlorSummary.precisionByLabel === 
mlorSameSummary.precisionByLabel)
+assert(mlorSummary.recallByLabel === mlorSameSummary.recallByLabel)
+assert(mlorSummary.fMeasureByLabel === mlorSameSummary.fMeasureByLabel)
+assert(mlorSummary.accuracy === mlorSameSummary.accuracy)
+assert(mlorSummary.weightedFalsePositiveRate === 
mlorSameSummary.weightedFalsePositiveRate)
+assert(mlorSummary.weightedPrecision === 
mlorSameSummary.weightedPrecision)
+assert(mlorSummary.weightedRecall === mlorSameSummary.weightedRecall)
+assert(mlorSummary.weightedFMeasure === 
mlorSameSummary.weightedFMeasure)
   }
 
   test("evaluate with labels that are not doubles") {
 // Evaluate a test set with Label that is a numeric type other than 
Double
-val lr = new LogisticRegression()
+val blor = new LogisticRegression()
   .setMaxIter(1)
   .setRegParam(1.0)
-val model = lr.fit(smallBinaryDataset)
-val summary = 
model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary]
+val blorModel = blor.fit(smallBinaryDataset)
+val blorSummary = blorModel.evaluate(smallBinaryDataset)
+  .asInstanceOf[BinaryLogisticRegressionSummary]
 
-val longLabelData = 
smallBinaryDataset.select(col(model.getLabelCol).cast(LongType),
-  col(model.getFeaturesCol))
-val longSummary = 
model.evaluate(longLabelData).asInstanceOf[BinaryLogisticRegressionSummary]
+val blorLongLabelData = 
smallBinaryDataset.select(col(blorModel.getLabelCol).cast(LongType),
--- End diff --

Non-double numeric datatypes other than LongType maybe also needed to test. 
Any thoughts? @sethah 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2016-10-11 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/15435

[SPARK-17139][ML] Add model summary for MultinomialLogisticRegression

## What changes were proposed in this pull request?

Add model summary for MultinomialLogisticRegression.
currently, using `MulticlassMetrics` to do the summary.

because multi-class case has some differences with binary case,
I just expose the `MulticlassMetrics` interface about summary for now.

And I add `findSummaryModelAndPredictionCol` method because the 
`MulticlassMetrics` need the `predictionCol` value. Although the 
`predictionCol` can be calculated from `probabilityCol`,  put some code here to 
do this thing will cause repeated code, so I do not use that way.

## How was this patch tested?

Existing tests.


(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark mlor_summary

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15435.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15435


commit e93740ead908251cebee22cc00059571a6e22a3f
Author: WeichenXu 
Date:   2016-10-11T00:25:45Z

update




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org