[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...

2017-09-21 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19122
  
@BryanCutler Do you have more comments?  I can check it out now but don't 
want to review at the same time.


---




[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-09-20 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18924#discussion_r140030582
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +462,46 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
 val expElogbetaBc = batch.sparkContext.broadcast(expElogbeta)
 val alpha = this.alpha.asBreeze
 val gammaShape = this.gammaShape
+val optimizeDocConcentration = this.optimizeDocConcentration
+val logphatPartOptionBase = () => if (optimizeDocConcentration) Some(BDV.zeros[Double](k))
--- End diff --

Add an inline doc note here:
"We calculate logphat in the same pass as other statistics, but we only 
need it if we are optimizing docConcentration."
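For illustration, a minimal self-contained sketch of where the requested note could live (the `else None` branch and the example values are my assumptions; the quoted diff is truncated):

    import breeze.linalg.{DenseVector => BDV}

    val k = 10                           // number of topics (example value)
    val optimizeDocConcentration = true  // whether docConcentration (alpha) is optimized

    // We calculate logphat in the same pass as other statistics, but we only
    // need it if we are optimizing docConcentration.
    val logphatPartOptionBase = () =>
      if (optimizeDocConcentration) Some(BDV.zeros[Double](k)) else None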


---




[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-09-20 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18924#discussion_r139832997
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -503,17 +518,15 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
   }
 
   /**
-   * Update alpha based on `gammat`, the inferred topic distributions for 
documents in the
-   * current mini-batch. Uses Newton-Raphson method.
+   * Update alpha based on `logphat`.
--- End diff --

Same with N


---




[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-09-20 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18924#discussion_r139832967
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -503,17 +518,15 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
   }
 
   /**
-   * Update alpha based on `gammat`, the inferred topic distributions for 
documents in the
-   * current mini-batch. Uses Newton-Raphson method.
+   * Update alpha based on `logphat`.
--- End diff --

Document what logphat is
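A possible shape for the requested doc, sketched from this discussion and the "Same with N" comment above (the parameter name `nonEmptyDocsN` and the wording are my guesses, not the merged text):

    import breeze.linalg.{DenseVector => BDV}

    /**
     * Update alpha based on `logphat`.
     * Uses Newton-Raphson method.
     *
     * @param logphat Expectation of the estimated log-posterior distribution of
     *                topics in a document, averaged over the mini-batch.
     * @param nonEmptyDocsN Number of non-empty documents in the mini-batch.
     */
    private def updateAlpha(logphat: BDV[Double], nonEmptyDocsN: Double): Unit = {
      // ... Newton-Raphson update of the Dirichlet parameter alpha ...
    }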


---




[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-09-20 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/18924
  
Taking a look


---




[GitHub] spark pull request #19020: [SPARK-3181] [ML] Implement huber loss for Linear...

2017-09-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19020#discussion_r139807816
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala 
---
@@ -998,6 +1047,172 @@ class LinearRegressionSuite
   }
 }
   }
+
+  test("linear regression (huber loss) with intercept without 
regularization") {
--- End diff --

Do you know if these integration tests are in the L1 penalty regime or in 
the L2 regime?  It'd be nice to make sure we're testing both.
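To make the two regimes concrete, a small self-contained sketch (my own illustration, not test code from the PR), using the H_m definition quoted later in this thread:

    // Huber's H_m from Owen (2006): quadratic near zero, linear in the tails.
    def huber(z: Double, eps: Double): Double =
      if (math.abs(z) < eps) z * z
      else 2 * eps * math.abs(z) - eps * eps

    val eps = 1.35
    println(huber(0.5, eps))   // 0.25    -> quadratic (L2-like) regime, small residual
    println(huber(10.0, eps))  // 25.1775 -> linear (L1-like) regime, outlier-sized residual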


---




[GitHub] spark pull request #19020: [SPARK-3181] [ML] Implement huber loss for Linear...

2017-09-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19020#discussion_r139765053
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -69,19 +69,57 @@ private[regression] trait LinearRegressionParams 
extends PredictorParams
 "The solver algorithm for optimization. Supported options: " +
   s"${supportedSolvers.mkString(", ")}. (Default auto)",
 ParamValidators.inArray[String](supportedSolvers))
+
+  /**
+   * The loss function to be optimized.
+   * Supported options: "leastSquares" and "huber".
+   * Default: "leastSquares"
+   *
+   * @group param
+   */
+  @Since("2.3.0")
+  final override val loss: Param[String] = new Param[String](this, "loss", "The loss function to" +
+    s" be optimized. Supported options: ${supportedLosses.mkString(", ")}. (Default leastSquares)",
+    ParamValidators.inArray[String](supportedLosses))
+
+  /**
+   * The shape parameter to control the amount of robustness. Must be > 1.0.
+   * At larger values of epsilon, the huber criterion becomes more similar to least squares
+   * regression; for small values of epsilon, the criterion is more similar to L1 regression.
+   * Default is 1.35 to get as much robustness as possible while retaining
+   * 95% statistical efficiency for normally distributed data.
+   * Only valid when "loss" is "huber".
+   */
+  @Since("2.3.0")
+  final val epsilon = new DoubleParam(this, "epsilon", "The shape parameter to control the " +
+    "amount of robustness. Must be > 1.0.", ParamValidators.gt(1.0))
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getEpsilon: Double = $(epsilon)
+
+  override protected def validateAndTransformSchema(
+      schema: StructType,
+      fitting: Boolean,
+      featuresDataType: DataType): StructType = {
+    if ($(loss) == Huber) {
+      require($(solver) != Normal, "LinearRegression with huber loss doesn't support " +
+        "normal solver, please change solver to auto or l-bfgs.")
+      require($(elasticNetParam) == 0.0, "LinearRegression with huber loss only supports " +
+        s"L2 regularization, but got elasticNetParam = $getElasticNetParam.")
+    }
+    super.validateAndTransformSchema(schema, fitting, featuresDataType)
+  }
 }
 
 /**
  * Linear regression.
  *
- * The learning objective is to minimize the squared error, with regularization.
- * The specific squared error loss function used is:
- *
- * <blockquote>
- *    $$
- *    L = 1/2n ||A coefficients - y||^2^
- *    $$
- * </blockquote>
+ * The learning objective is to minimize the specified loss function, with regularization.
+ * This supports two loss functions:
+ *  - leastSquares (a.k.a. squared loss)
--- End diff --

Let's keep exact specifications of the losses being used.  This is one of 
my big annoyances with many ML libraries: It's hard to tell exactly what loss 
is being used, which makes it hard to compare/validate results across different 
ML libraries.

It'd also be nice to make it clear what we mean by "huber," in particular 
that we estimate the scale parameter from data.
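For concreteness, one way to write the two objectives exactly, assembled from formulas quoted elsewhere in this thread (a sketch of the requested documentation, not the merged text):

$$
L_{squaredError} = \frac{1}{2n} ||X w - y||_2^2
$$

$$
L_{huber} = \min_{w, \sigma} \frac{1}{2n} \sum_{i=1}^n \left( \sigma + H_m\left(\frac{X_i w - y_i}{\sigma}\right) \sigma \right) + \frac{1}{2} \alpha ||w||_2^2
$$

where $H_m(z)$ is quadratic for $|z| < \epsilon$ and linear otherwise, and the scale $\sigma$ is estimated jointly with the coefficients (the "estimate the scale parameter from data" point above).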


---




[GitHub] spark pull request #19020: [SPARK-3181] [ML] Implement huber loss for Linear...

2017-09-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19020#discussion_r139772375
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -471,6 +574,15 @@ object LinearRegression extends 
DefaultParamsReadable[LinearRegression] {
 
   /** Set of solvers that LinearRegression supports. */
   private[regression] val supportedSolvers = Array(Auto, Normal, LBFGS)
+
+  /** String name for "leastSquares". */
+  private[regression] val LeastSquares = "leastSquares"
--- End diff --

How about calling this "squaredError" since the loss is "squared error," 
not "least squares."


---




[GitHub] spark pull request #19020: [SPARK-3181] [ML] Implement huber loss for Linear...

2017-09-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19020#discussion_r139765068
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -220,12 +283,12 @@ class LinearRegression @Since("1.3.0") 
(@Since("1.3.0") override val uid: String
 }
 
 val instr = Instrumentation.create(this, dataset)
-instr.logParams(labelCol, featuresCol, weightCol, predictionCol, 
solver, tol,
-  elasticNetParam, fitIntercept, maxIter, regParam, standardization, 
aggregationDepth)
+instr.logParams(labelCol, featuresCol, weightCol, predictionCol, 
solver, tol, elasticNetParam,
--- End diff --

Log epsilon as well
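A sketch of the requested change (parameter list copied from the diff above, with `loss` and `epsilon` appended; the exact ordering is a guess):

    instr.logParams(labelCol, featuresCol, weightCol, predictionCol, solver, tol, elasticNetParam,
      fitIntercept, maxIter, regParam, standardization, aggregationDepth, loss, epsilon)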


---




[GitHub] spark pull request #19020: [SPARK-3181] [ML] Implement huber loss for Linear...

2017-09-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19020#discussion_r139764932
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala 
---
@@ -0,0 +1,142 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.ml.optim.aggregator
+
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg.Vector
+
+/**
+ * HuberAggregator computes the gradient and loss for a huber loss function,
+ * as used in robust regression for samples in sparse or dense vectors in an online fashion.
+ *
+ * The huber loss function is based on:
+ * Art B. Owen (2006), A robust hybrid of lasso and ridge regression.
+ * (http://statweb.stanford.edu/~owen/reports/hhu.pdf)
+ *
+ * Two HuberAggregators can be merged together to have a summary of the loss and gradient of
+ * the corresponding joint dataset.
+ *
+ * The huber loss function is given by
+ *
+ * <blockquote>
+ *   $$
+ *   \begin{align}
+ *   \min_{w, \sigma}\frac{1}{2n}{\sum_{i=1}^n\left(\sigma +
+ *   H_m\left(\frac{X_{i}w - y_{i}}{\sigma}\right)\sigma\right) + \frac{1}{2}\alpha {||w||_2}^2}
+ *   \end{align}
+ *   $$
+ * </blockquote>
+ *
+ * where
+ *
+ * <blockquote>
+ *   $$
+ *   \begin{align}
+ *   H_m(z) = \begin{cases}
+ *            z^2, & \text {if } |z| < \epsilon, \\
+ *            2\epsilon|z| - \epsilon^2, & \text{otherwise}
+ *            \end{cases}
+ *   \end{align}
+ *   $$
+ * </blockquote>
+ *
+ * It is advised to set the parameter $\epsilon$ to 1.35 to achieve 95% statistical efficiency.
+ *
+ * @param fitIntercept Whether to fit an intercept term.
+ * @param epsilon The shape parameter to control the amount of robustness.
+ * @param bcFeaturesStd The broadcast standard deviation values of the features.
+ * @param bcParameters including three parts: the regression coefficients corresponding
+ *                     to the features, the intercept (if fitIntercept is true)
+ *                     and the scale parameter (sigma).
+ */
+private[ml] class HuberAggregator(
+    fitIntercept: Boolean,
+    epsilon: Double,
+    bcFeaturesStd: Broadcast[Array[Double]])(bcParameters: Broadcast[Vector])
+  extends DifferentiableLossAggregator[Instance, HuberAggregator] {
+
+  protected override val dim: Int = bcParameters.value.size
+  private val numFeatures: Int = if (fitIntercept) dim - 2 else dim - 1
+
+  @transient private lazy val coefficients: Array[Double] =
+bcParameters.value.toArray.slice(0, numFeatures)
+  private val sigma: Double = bcParameters.value(dim - 1)
+
+  @transient private lazy val featuresStd = bcFeaturesStd.value
+
+  /**
+   * Add a new training instance to this HuberAggregator, and update the 
loss and gradient
+   * of the objective function.
+   *
+   * @param instance The instance of data point to be added.
+   * @return This HuberAggregator object.
+   */
+  def add(instance: Instance): HuberAggregator = {
+instance match { case Instance(label, weight, features) =>
+  require(numFeatures == features.size, s"Dimensions mismatch when 
adding new sample." +
+s" Expecting $numFeatures but got ${features.size}.")
+  require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0")
+
+  if (weight == 0.0) return this
+
+  val margin = {
+var sum = 0.0
+features.foreachActive { (index, value) =>
+  if (featuresStd(index) != 0.0 && value != 0.0) {
+sum += coefficients(index) * (value / featuresStd(index))
+  }
+}
+if (fitIntercept) sum += bcParameters.value(dim - 2)
+sum
+  }
+  val linearLoss = label - margin
+
+  if (math.abs(l

[GitHub] spark pull request #19020: [SPARK-3181] [ML] Implement huber loss for Linear...

2017-09-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19020#discussion_r139765034
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -69,19 +69,57 @@ private[regression] trait LinearRegressionParams 
extends PredictorParams
 "The solver algorithm for optimization. Supported options: " +
   s"${supportedSolvers.mkString(", ")}. (Default auto)",
 ParamValidators.inArray[String](supportedSolvers))
+
+  /**
+   * The loss function to be optimized.
+   * Supported options: "leastSquares" and "huber".
+   * Default: "leastSquares"
+   *
+   * @group param
+   */
+  @Since("2.3.0")
+  final override val loss: Param[String] = new Param[String](this, "loss", "The loss function to" +
+    s" be optimized. Supported options: ${supportedLosses.mkString(", ")}. (Default leastSquares)",
+    ParamValidators.inArray[String](supportedLosses))
+
+  /**
+   * The shape parameter to control the amount of robustness. Must be > 1.0.
+   * At larger values of epsilon, the huber criterion becomes more similar to least squares
+   * regression; for small values of epsilon, the criterion is more similar to L1 regression.
+   * Default is 1.35 to get as much robustness as possible while retaining
+   * 95% statistical efficiency for normally distributed data.
+   * Only valid when "loss" is "huber".
+   */
+  @Since("2.3.0")
+  final val epsilon = new DoubleParam(this, "epsilon", "The shape parameter to control the " +
--- End diff --

Mark as expertParam (same for set/get)
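A sketch of the requested annotations, reusing the code from the diff above (the `@group expertParam` / `@group expertGetParam` tags are the usual MLlib doc groups):

    /** @group expertParam */
    @Since("2.3.0")
    final val epsilon = new DoubleParam(this, "epsilon", "The shape parameter to control the " +
      "amount of robustness. Must be > 1.0.", ParamValidators.gt(1.0))

    /** @group expertGetParam */
    @Since("2.3.0")
    def getEpsilon: Double = $(epsilon)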


---




[GitHub] spark pull request #19020: [SPARK-3181] [ML] Implement huber loss for Linear...

2017-09-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19020#discussion_r139764922
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala 
---
@@ -0,0 +1,142 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.ml.optim.aggregator
+
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg.Vector
+
+/**
+ * HuberAggregator computes the gradient and loss for a huber loss function,
+ * as used in robust regression for samples in sparse or dense vectors in an online fashion.
+ *
+ * The huber loss function is based on:
+ * Art B. Owen (2006), A robust hybrid of lasso and ridge regression.
+ * (http://statweb.stanford.edu/~owen/reports/hhu.pdf)
+ *
+ * Two HuberAggregators can be merged together to have a summary of the loss and gradient of
+ * the corresponding joint dataset.
+ *
+ * The huber loss function is given by
+ *
+ * <blockquote>
+ *   $$
+ *   \begin{align}
+ *   \min_{w, \sigma}\frac{1}{2n}{\sum_{i=1}^n\left(\sigma +
+ *   H_m\left(\frac{X_{i}w - y_{i}}{\sigma}\right)\sigma\right) + \frac{1}{2}\alpha {||w||_2}^2}
+ *   \end{align}
+ *   $$
+ * </blockquote>
+ *
+ * where
+ *
+ * <blockquote>
+ *   $$
+ *   \begin{align}
+ *   H_m(z) = \begin{cases}
+ *            z^2, & \text {if } |z| < \epsilon, \\
+ *            2\epsilon|z| - \epsilon^2, & \text{otherwise}
+ *            \end{cases}
+ *   \end{align}
+ *   $$
+ * </blockquote>
+ *
+ * It is advised to set the parameter $\epsilon$ to 1.35 to achieve 95% statistical efficiency.
--- End diff --

This description is misleadingly general since this claim only applies to 
normally distributed data.  How about referencing the part of the paper which 
talks about this so that people can look up what is meant here?


---




[GitHub] spark issue #19020: [SPARK-3181] [ML] Implement huber loss for LinearRegress...

2017-09-19 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19020
  
I'll check this out now


---




[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-18 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19186
  
This has ended up being more complex than we envisioned.  It would be 
valuable to describe the design succinctly so that people can debate it on 
JIRA.  Could you please describe your solution on the JIRA and ping people who 
have been discussing this?  (I also made a request on the JIRA which I'd like 
to see addressed.)  Thanks!


---




[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-18 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19186
  
@zhengruifeng Can you please update the PR description so it describes the 
actual functionality being added?


---




[GitHub] spark issue #19106: [SPARK-21770][ML] ProbabilisticClassificationModel fix c...

2017-09-18 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19106
  
These are fair arguments.  I guess it makes sense to throw an exception; 
that's fine with me.


---




[GitHub] spark issue #16774: [SPARK-19357][ML] Adding parallel model evaluation in ML...

2017-09-18 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/16774
  
@WeichenXu123 Thanks for finding that bug!  Can you please separate out 
your bugfix?  It's good to get fixes in, rather than attaching them to PRs 
which may require discussion, so that we make sure that bugs don't slip into 
the next release.


---




[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-09-13 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19208
  
Synced offline: I hadn't looked carefully and hadn't seen that the 2 issues had been merged.
@WeichenXu123 said he will split the work in 2, adding one parameter first.


---




[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-09-13 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19208
  
CC @hhbyyh and @MLnick  Does this look reasonable to you?

And @hhbyyh would you want to split off a new JIRA for your original 
solution of adding an option to dump models to disk?  Then we could revive your 
PR.


---




spark git commit: [SPARK-18608][ML] Fix double caching

2017-09-12 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 63098dc31 -> b606dc177


[SPARK-18608][ML] Fix double caching

## What changes were proposed in this pull request?
`df.rdd.getStorageLevel` => `df.storageLevel`

(`df.rdd` constructs a fresh, never-persisted RDD, so its storage level is always `NONE` even when
the Dataset itself is cached; checking it made algorithms re-persist input the caller had already
cached, hence the double caching.)

Used cmd `find . -name '*.scala' | xargs -i bash -c 'egrep -in "\.rdd\.getStorageLevel" {} && echo {}'` to make sure all algs involved in this issue are fixed.

Previous discussion in other PRs: https://github.com/apache/spark/pull/19107, 
https://github.com/apache/spark/pull/17014

## How was this patch tested?
existing tests

Author: Zheng RuiFeng 

Closes #19197 from zhengruifeng/double_caching.

(cherry picked from commit c5f9b89dda40ffaa4622a7ba2b3d0605dbe815c0)
Signed-off-by: Joseph K. Bradley 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b606dc17
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b606dc17
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b606dc17

Branch: refs/heads/branch-2.2
Commit: b606dc177e177bdbf99e72638eb8baec12e9fb53
Parents: 63098dc
Author: Zheng RuiFeng 
Authored: Tue Sep 12 11:37:05 2017 -0700
Committer: Joseph K. Bradley 
Committed: Tue Sep 12 11:37:22 2017 -0700

--
 .../org/apache/spark/ml/classification/LogisticRegression.scala  | 2 +-
 .../scala/org/apache/spark/ml/classification/OneVsRest.scala | 4 ++--
 mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala | 2 +-
 .../org/apache/spark/ml/regression/AFTSurvivalRegression.scala   | 2 +-
 .../org/apache/spark/ml/regression/IsotonicRegression.scala  | 2 +-
 .../scala/org/apache/spark/ml/regression/LinearRegression.scala  | 2 +-
 6 files changed, 7 insertions(+), 7 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b606dc17/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
index 1de2373..e7f99fc 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
@@ -483,7 +483,7 @@ class LogisticRegression @Since("1.2.0") (
   }
 
   override protected[spark] def train(dataset: Dataset[_]): 
LogisticRegressionModel = {
-val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE
+val handlePersistence = dataset.storageLevel == StorageLevel.NONE
 train(dataset, handlePersistence)
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/b606dc17/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala 
b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala
index 05b8c3a..f3aff4c 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala
@@ -164,7 +164,7 @@ final class OneVsRestModel private[ml] (
 val newDataset = dataset.withColumn(accColName, initUDF())
 
 // persist if underlying dataset is not persistent.
-val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE
+val handlePersistence = dataset.storageLevel == StorageLevel.NONE
 if (handlePersistence) {
   newDataset.persist(StorageLevel.MEMORY_AND_DISK)
 }
@@ -347,7 +347,7 @@ final class OneVsRest @Since("1.4.0") (
 }
 
 // persist if underlying dataset is not persistent.
-val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE
+val handlePersistence = dataset.storageLevel == StorageLevel.NONE
 if (handlePersistence) {
   multiclassLabeled.persist(StorageLevel.MEMORY_AND_DISK)
 }
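For context, a minimal sketch of the complete pattern these hunks belong to (`trainAll` is a placeholder for the per-class fitting that happens between the quoted hunks; only the persist half appears above):

    val handlePersistence = dataset.storageLevel == StorageLevel.NONE
    if (handlePersistence) {
      multiclassLabeled.persist(StorageLevel.MEMORY_AND_DISK)
    }
    val models = trainAll(multiclassLabeled)  // placeholder for the per-class fits
    if (handlePersistence) {
      multiclassLabeled.unpersist()           // release only what we persisted ourselves
    }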

http://git-wip-us.apache.org/repos/asf/spark/blob/b606dc17/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
b/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
index e02b532..f2af7fe 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
@@ -304,7 +304,7 @@ class KMeans @Since("1.5.0") (
   override def fit(dataset: Dataset[_]): KMeansModel = {
 transformSchema(dataset.schema, logging = true)
 
-val handlePersistence = dataset.rdd.getStorageLevel == 

spark git commit: [SPARK-18608][ML] Fix double caching

2017-09-12 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master b9b54b1c8 -> c5f9b89dd


[SPARK-18608][ML] Fix double caching

## What changes were proposed in this pull request?
`df.rdd.getStorageLevel` => `df.storageLevel`

Used cmd `find . -name '*.scala' | xargs -i bash -c 'egrep -in "\.rdd\.getStorageLevel" {} && echo {}'` to make sure all algs involved in this issue are fixed.

Previous discussion in other PRs: https://github.com/apache/spark/pull/19107, 
https://github.com/apache/spark/pull/17014

## How was this patch tested?
existing tests

Author: Zheng RuiFeng 

Closes #19197 from zhengruifeng/double_caching.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c5f9b89d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c5f9b89d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c5f9b89d

Branch: refs/heads/master
Commit: c5f9b89dda40ffaa4622a7ba2b3d0605dbe815c0
Parents: b9b54b1
Author: Zheng RuiFeng 
Authored: Tue Sep 12 11:37:05 2017 -0700
Committer: Joseph K. Bradley 
Committed: Tue Sep 12 11:37:05 2017 -0700

--
 .../org/apache/spark/ml/classification/LogisticRegression.scala  | 2 +-
 .../scala/org/apache/spark/ml/classification/OneVsRest.scala | 4 ++--
 mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala | 2 +-
 .../org/apache/spark/ml/regression/AFTSurvivalRegression.scala   | 2 +-
 .../org/apache/spark/ml/regression/IsotonicRegression.scala  | 2 +-
 .../scala/org/apache/spark/ml/regression/LinearRegression.scala  | 2 +-
 6 files changed, 7 insertions(+), 7 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c5f9b89d/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
index f491a67..cbc8f4a 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
@@ -484,7 +484,7 @@ class LogisticRegression @Since("1.2.0") (
   }
 
   override protected[spark] def train(dataset: Dataset[_]): 
LogisticRegressionModel = {
-val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE
+val handlePersistence = dataset.storageLevel == StorageLevel.NONE
 train(dataset, handlePersistence)
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/c5f9b89d/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala 
b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala
index 99bb234..942e981 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala
@@ -165,7 +165,7 @@ final class OneVsRestModel private[ml] (
 val newDataset = dataset.withColumn(accColName, initUDF())
 
 // persist if underlying dataset is not persistent.
-val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE
+val handlePersistence = dataset.storageLevel == StorageLevel.NONE
 if (handlePersistence) {
   newDataset.persist(StorageLevel.MEMORY_AND_DISK)
 }
@@ -358,7 +358,7 @@ final class OneVsRest @Since("1.4.0") (
 }
 
 // persist if underlying dataset is not persistent.
-val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE
+val handlePersistence = dataset.storageLevel == StorageLevel.NONE
 if (handlePersistence) {
   multiclassLabeled.persist(StorageLevel.MEMORY_AND_DISK)
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/c5f9b89d/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
b/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
index e02b532..f2af7fe 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
@@ -304,7 +304,7 @@ class KMeans @Since("1.5.0") (
   override def fit(dataset: Dataset[_]): KMeansModel = {
 transformSchema(dataset.schema, logging = true)
 
-val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE
+val handlePersistence = dataset.storageLevel == StorageLevel.NONE
 val instances: RDD[OldVector] = 

[GitHub] spark issue #19197: [SPARK-18608][ML] Fix double caching

2017-09-12 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19197
  
having trouble merging...will try again soon


---




[GitHub] spark issue #19197: [SPARK-18608][ML] Fix double caching

2017-09-12 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19197
  
LGTM
Merging with master and branch-2.2
Thanks!


---




spark git commit: [SPARK-21027][ML][PYTHON] Added tunable parallelism to one vs. rest in both Scala mllib and Pyspark

2017-09-12 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master 515910e9b -> 720c94fe7


[SPARK-21027][ML][PYTHON] Added tunable parallelism to one vs. rest in both 
Scala mllib and Pyspark

# What changes were proposed in this pull request?

Added tunable parallelism to the pyspark implementation of one vs. rest classification. Added a parallelism parameter to the Scala implementation of one vs. rest along with functionality for using the parameter to tune the level of parallelism.

I am taking over PR #18281 because the original author is busy and we need to merge this PR soon.
After this is merged, we can close #18281.

## How was this patch tested?

Test suite added.

Author: Ajay Saini 
Author: WeichenXu 

Closes #19110 from WeichenXu123/spark-21027.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/720c94fe
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/720c94fe
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/720c94fe

Branch: refs/heads/master
Commit: 720c94fe774431a8a40215757ded3dae9f267c7f
Parents: 515910e
Author: Ajay Saini 
Authored: Tue Sep 12 10:02:27 2017 -0700
Committer: Joseph K. Bradley 
Committed: Tue Sep 12 10:02:27 2017 -0700

--
 docs/ml-guide.md| 18 
 .../spark/ml/classification/OneVsRest.scala | 45 ++--
 .../ml/classification/OneVsRestSuite.scala  | 42 +-
 python/pyspark/ml/classification.py | 25 ++-
 .../pyspark/ml/param/_shared_params_code_gen.py |  4 +-
 python/pyspark/ml/param/shared.py   | 24 +++
 python/pyspark/ml/tests.py  | 16 ++-
 7 files changed, 146 insertions(+), 28 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/720c94fe/docs/ml-guide.md
--
diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index 7aec6a4..f6288e7 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -105,6 +105,24 @@ MLlib is under active development.
 The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
 and the migration guide below will explain all changes between releases.
 
+## From 2.2 to 2.3
+
+### Breaking changes
+
+There are no breaking changes.
+
+### Deprecations and changes of behavior
+
+**Deprecations**
+
+There are no deprecations.
+
+**Changes of behavior**
+
+* [SPARK-21027](https://issues.apache.org/jira/browse/SPARK-21027):
+ We are now setting the default parallelism used in `OneVsRest` to be 1 (i.e. serial). In 2.2
+ and earlier versions, the `OneVsRest` parallelism was the parallelism of the default
+ threadpool in Scala.
+
 ## From 2.1 to 2.2
 
 ### Breaking changes

http://git-wip-us.apache.org/repos/asf/spark/blob/720c94fe/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala 
b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala
index 05b8c3a..99bb234 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala
@@ -17,10 +17,10 @@
 
 package org.apache.spark.ml.classification
 
-import java.util.{List => JList}
 import java.util.UUID
 
-import scala.collection.JavaConverters._
+import scala.concurrent.Future
+import scala.concurrent.duration.Duration
 import scala.language.existentials
 
 import org.apache.hadoop.fs.Path
@@ -34,12 +34,13 @@ import org.apache.spark.ml._
 import org.apache.spark.ml.attribute._
 import org.apache.spark.ml.linalg.Vector
 import org.apache.spark.ml.param.{Param, ParamMap, ParamPair, Params}
-import org.apache.spark.ml.param.shared.HasWeightCol
+import org.apache.spark.ml.param.shared.{HasParallelism, HasWeightCol}
 import org.apache.spark.ml.util._
 import org.apache.spark.sql.{DataFrame, Dataset, Row}
 import org.apache.spark.sql.functions._
 import org.apache.spark.sql.types._
 import org.apache.spark.storage.StorageLevel
+import org.apache.spark.util.ThreadUtils
 
 private[ml] trait ClassifierTypeTrait {
   // scalastyle:off structural.type
@@ -273,7 +274,7 @@ object OneVsRestModel extends MLReadable[OneVsRestModel] {
 @Since("1.4.0")
 final class OneVsRest @Since("1.4.0") (
 @Since("1.4.0") override val uid: String)
-  extends Estimator[OneVsRestModel] with OneVsRestParams with MLWritable {
+  extends Estimator[OneVsRestModel] with OneVsRestParams with HasParallelism with MLWritable {
 
   @Since("1.4.0")
   def this() = this(Identifiable.randomUID("oneVsRest"))
@@ -297,6 +298,16 @@ final 
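The quoted diff cuts off above. A hedged sketch of the parallel-fit pattern the new imports (Future, Duration, ThreadUtils, HasParallelism) suggest — `getExecutionContext` comes from the HasParallelism mixin, sized by $(parallelism); the loop body is my paraphrase, not the exact PR code:

    val executionContext = getExecutionContext
    val modelFutures = Range(0, numClasses).map { index =>
      Future {
        getClassifier.fit(binaryLabeled(index))   // hypothetical per-class dataset helper
      }(executionContext)
    }
    val models = modelFutures.map(f => ThreadUtils.awaitResult(f, Duration.Inf))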

[GitHub] spark issue #19110: [SPARK-21027][ML][PYTHON] Added tunable parallelism to o...

2017-09-12 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19110
  
LGTM
Merging with master
Thanks!


---




[GitHub] spark issue #19107: [SPARK-21799][ML] Fix `KMeans` performance regression ca...

2017-09-11 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19107
  
@WeichenXu123 I just commented on 
https://issues.apache.org/jira/browse/SPARK-18608 to clarify our efforts here.  
Can you please either retarget this for SPARK-18608 and update it, or ask 
@zhengruifeng to submit his original PR as the fix?  Please coordinate, thanks!


---




[GitHub] spark issue #19110: [SPARK-21027][ML][PYTHON] Added tunable parallelism to o...

2017-09-11 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19110
  
Other than that 1 item, this looks ready


---




[GitHub] spark pull request #19110: [SPARK-21027][ML][PYTHON] Added tunable paralleli...

2017-09-11 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19110#discussion_r138180599
  
--- Diff: python/pyspark/ml/param/shared.py ---
@@ -608,6 +608,30 @@ def getAggregationDepth(self):
 return self.getOrDefault(self.aggregationDepth)
 
 
+class HasParallelism(Params):
+"""
+Mixin for param parallelism: number of threads to use when fitting models in parallel.
+"""
+
+parallelism = Param(Params._dummy(), "parallelism", "the number of threads to use when running parallel algorithms.", typeConverter=TypeConverters.toInt)
--- End diff --

nit: Is this out of date?  It's missing the "(>= 1)" from the code gen file.


---




[GitHub] spark issue #16774: [SPARK-19357][ML] Adding parallel model evaluation in ML...

2017-09-05 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/16774
  
I think https://github.com/apache/spark/pull/19110 is ready to merge now.  
@BryanCutler @WeichenXu123 do you have a preference for which gets merged first?


---




[GitHub] spark issue #19110: [SPARK-21027][ML][PYTHON] Added tunable parallelism to o...

2017-09-05 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19110
  
LGTM
Pinging on https://github.com/apache/spark/pull/16774 first to coordinate 
merging


---




[GitHub] spark issue #18281: [SPARK-21027][ML][PYTHON] Added tunable parallelism to o...

2017-09-05 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/18281
  
Thanks @ajaysaini725 for your work (and permission given offline to take 
this over)!  We can close this issue now.


---




spark git commit: [SPARK-21729][ML][TEST] Generic test for ProbabilisticClassifier to ensure consistent output columns

2017-09-01 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master aba9492d2 -> 900f14f6f


[SPARK-21729][ML][TEST] Generic test for ProbabilisticClassifier to ensure 
consistent output columns

## What changes were proposed in this pull request?

Add test for prediction using the model with all combinations of output columns turned on/off.
Make sure the output column values match, presumably by comparing vs. the case with all 3 output columns turned on.
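A minimal sketch of the idea (column handling and names are illustrative; the real helper landed as `ProbabilisticClassifierSuite.testPredictMethods`, visible in the diffs below):

    import org.apache.spark.ml.param.ParamMap

    val reference = model.transform(df)
      .select("rawPrediction", "probability", "prediction").collect()

    // An empty column name disables that output column. Try each combination and
    // check that every column still produced matches the all-columns-on reference.
    for (rawCol <- Seq("", "rawPrediction"); probCol <- Seq("", "probability")) {
      val output = model.copy(ParamMap.empty)
        .setRawPredictionCol(rawCol)
        .setProbabilityCol(probCol)
        .transform(df)
      // compare output's surviving columns against `reference` ...
    }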

## How was this patch tested?

Test updated.

Author: WeichenXu 
Author: WeichenXu 

Closes #19065 from WeichenXu123/generic_test_for_prob_classifier.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/900f14f6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/900f14f6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/900f14f6

Branch: refs/heads/master
Commit: 900f14f6fad50369aa849922447f60d7cf06cf2f
Parents: aba9492
Author: WeichenXu 
Authored: Fri Sep 1 17:32:33 2017 -0700
Committer: Joseph K. Bradley 
Committed: Fri Sep 1 17:32:33 2017 -0700

--
 .../DecisionTreeClassifierSuite.scala   |  3 +
 .../ml/classification/GBTClassifierSuite.scala  |  3 +
 .../LogisticRegressionSuite.scala   |  6 ++
 .../MultilayerPerceptronClassifierSuite.scala   |  2 +
 .../ml/classification/NaiveBayesSuite.scala |  6 ++
 .../ProbabilisticClassifierSuite.scala  | 60 
 .../RandomForestClassifierSuite.scala   |  2 +
 7 files changed, 82 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/900f14f6/mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
--
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
 
b/mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
index 918ab27..98c879e 100644
--- 
a/mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
+++ 
b/mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
@@ -262,6 +262,9 @@ class DecisionTreeClassifierSuite
   assert(Vectors.dense(rawPred.toArray.map(_ / sum)) === probPred,
 "probability prediction mismatch")
 }
+
+ProbabilisticClassifierSuite.testPredictMethods[
+  Vector, DecisionTreeClassificationModel](newTree, newData)
   }
 
   test("training with 1-category categorical feature") {

http://git-wip-us.apache.org/repos/asf/spark/blob/900f14f6/mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala
--
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala
 
b/mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala
index 1f79e0d..8000143 100644
--- 
a/mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala
+++ 
b/mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala
@@ -219,6 +219,9 @@ class GBTClassifierSuite extends SparkFunSuite with 
MLlibTestSparkContext
 
resultsUsingPredict.zip(results.select(predictionCol).as[Double].collect()).foreach
 {
   case (pred1, pred2) => assert(pred1 === pred2)
 }
+
+ProbabilisticClassifierSuite.testPredictMethods[
+  Vector, GBTClassificationModel](gbtModel, validationDataset)
   }
 
   test("GBT parameter stepSize should be in interval (0, 1]") {

http://git-wip-us.apache.org/repos/asf/spark/blob/900f14f6/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
--
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 
b/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
index 6bf1253..d43c7cd 100644
--- 
a/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
+++ 
b/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
@@ -502,6 +502,9 @@ class LogisticRegressionSuite
 
resultsUsingPredict.zip(results.select("prediction").as[Double].collect()).foreach
 {
   case (pred1, pred2) => assert(pred1 === pred2)
 }
+
+ProbabilisticClassifierSuite.testPredictMethods[
+  Vector, LogisticRegressionModel](model, smallMultinomialDataset)
   }
 
   test("binary logistic regression: Predictor, Classifier methods") {
@@ -556,6 +559,9 @@ class LogisticRegressionSuite
 

[GitHub] spark issue #19065: [SPARK-21729][ML][TEST] Generic test for ProbabilisticCl...

2017-09-01 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19065
  
Nice!  LGTM
Thanks @WeichenXu123 and @smurching !
Merging with master


---



[GitHub] spark issue #19065: [SPARK-21729][ML][TEST] Generic test for ProbabilisticCl...

2017-09-01 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19065
  
Taking a look now


---



[GitHub] spark issue #19072: [SPARK-17139][ML][FOLLOW-UP] Add convenient method `asBi...

2017-09-01 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19072
  
I merged this with master.


---



[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-09-01 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/17014
  
Thanks all for discussing this!  I'm just catching up now.

I'm OK with adding handlePersistence as a new Param, but please do so in a 
separate PR and JIRA issue.  I'd like to backport the KMeans fix since it's a 
bug causing a performance regression, but we should not backport the new Param 
API.


---



spark git commit: [SPARK-21862][ML] Add overflow check in PCA

2017-08-31 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master 96028e36b -> f5e10a34e


[SPARK-21862][ML] Add overflow check in PCA

## What changes were proposed in this pull request?

add overflow check in PCA, otherwise it is possible to throw 
`NegativeArraySizeException` when `k` and `numFeatures` are too large.
The overflow checking formula is here:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala#L87
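A rough sanity check of why the guard matters (my arithmetic, not from the patch): with k = 40000 and numFeatures = 60000, the first term of the formula alone is

$$
3 \cdot \min(k, n)^2 = 3 \times 40000^2 = 4.8 \times 10^9 > 2^{31} - 1 \approx 2.15 \times 10^9,
$$

so the dense SVD workspace size can no longer be represented as a Java Int.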

## How was this patch tested?

N/A

Author: WeichenXu 

Closes #19078 from WeichenXu123/SVD_overflow_check.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f5e10a34
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f5e10a34
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f5e10a34

Branch: refs/heads/master
Commit: f5e10a34e644edf3cbce9a7714d31bc433f3ccbd
Parents: 96028e3
Author: WeichenXu 
Authored: Thu Aug 31 16:25:10 2017 -0700
Committer: Joseph K. Bradley 
Committed: Thu Aug 31 16:25:10 2017 -0700

--
 .../org/apache/spark/mllib/feature/PCA.scala | 19 +++
 .../apache/spark/mllib/feature/PCASuite.scala|  6 ++
 2 files changed, 25 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f5e10a34/mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala
index aaecfa8..a01503f 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala
@@ -44,6 +44,11 @@ class PCA @Since("1.4.0") (@Since("1.4.0") val k: Int) {
 require(k <= numFeatures,
   s"source vector size $numFeatures must be no less than k=$k")
 
+require(PCAUtil.memoryCost(k, numFeatures) < Int.MaxValue,
+  "The param k and numFeatures is too large for SVD computation. " +
+  "Try reducing the parameter k for PCA, or reduce the input feature " +
+  "vector dimension to make this tractable.")
+
 val mat = new RowMatrix(sources)
 val (pc, explainedVariance) = 
mat.computePrincipalComponentsAndExplainedVariance(k)
 val densePC = pc match {
@@ -110,3 +115,17 @@ class PCAModel private[spark] (
 }
   }
 }
+
+private[feature] object PCAUtil {
+
+  // This memory cost formula is from breeze code:
+  // https://github.com/scalanlp/breeze/blob/
+  // 
6e541be066d547a097f5089165cd7c38c3ca276d/math/src/main/scala/breeze/linalg/
+  // functions/svd.scala#L87
+  def memoryCost(k: Int, numFeatures: Int): Long = {
+    3L * math.min(k, numFeatures) * math.min(k, numFeatures) +
+      math.max(math.max(k, numFeatures), 4L * math.min(k, numFeatures) *
+        math.min(k, numFeatures) + 4L * math.min(k, numFeatures))
+  }
+
+}

http://git-wip-us.apache.org/repos/asf/spark/blob/f5e10a34/mllib/src/test/scala/org/apache/spark/mllib/feature/PCASuite.scala
--
diff --git a/mllib/src/test/scala/org/apache/spark/mllib/feature/PCASuite.scala 
b/mllib/src/test/scala/org/apache/spark/mllib/feature/PCASuite.scala
index 2f90afd..8eab124 100644
--- a/mllib/src/test/scala/org/apache/spark/mllib/feature/PCASuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/mllib/feature/PCASuite.scala
@@ -48,4 +48,10 @@ class PCASuite extends SparkFunSuite with 
MLlibTestSparkContext {
 }
 assert(pca.explainedVariance ~== explainedVariance relTol 1e-8)
   }
+
+  test("memory cost computation") {
+assert(PCAUtil.memoryCost(10, 100) < Int.MaxValue)
+// check overflowing
+assert(PCAUtil.memoryCost(40000, 60000) > Int.MaxValue)
+  }
 }





[GitHub] spark issue #19078: [SPARK-21862][ML] Add overflow check in PCA

2017-08-31 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19078
  
LGTM
Merging with master
Thanks!


---



spark git commit: [SPARK-17139][ML][FOLLOW-UP] Add convenient method `asBinary` for casting to BinaryLogisticRegressionSummary

2017-08-31 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master cba69aeb4 -> 96028e36b


[SPARK-17139][ML][FOLLOW-UP] Add convenient method `asBinary` for casting to 
BinaryLogisticRegressionSummary

## What changes were proposed in this pull request?

add an "asBinary" method to LogisticRegressionSummary for convenient casting to 
BinaryLogisticRegressionSummary.

## How was this patch tested?

Testcase updated.

Author: WeichenXu 

Closes #19072 from WeichenXu123/mlor_summary_as_binary.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/96028e36
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/96028e36
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/96028e36

Branch: refs/heads/master
Commit: 96028e36b4d08427fdd94df55595849c2346ead4
Parents: cba69ae
Author: WeichenXu 
Authored: Thu Aug 31 16:22:40 2017 -0700
Committer: Joseph K. Bradley 
Committed: Thu Aug 31 16:22:40 2017 -0700

--
 .../spark/ml/classification/LogisticRegression.scala | 11 +++
 .../ml/classification/LogisticRegressionSuite.scala  |  6 ++
 project/MimaExcludes.scala   |  1 +
 3 files changed, 18 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/96028e36/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
index 1869d51..f491a67 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
@@ -1473,6 +1473,17 @@ sealed trait LogisticRegressionSummary extends 
Serializable {
   /** Returns weighted averaged f1-measure. */
   @Since("2.3.0")
   def weightedFMeasure: Double = multiclassMetrics.weightedFMeasure(1.0)
+
+  /**
+   * Convenient method for casting to binary logistic regression summary.
+   * This method will throws an Exception if the summary is not a binary 
summary.
+   */
+  @Since("2.3.0")
+  def asBinary: BinaryLogisticRegressionSummary = this match {
+case b: BinaryLogisticRegressionSummary => b
+case _ =>
+  throw new RuntimeException("Cannot cast to a binary summary.")
+  }
 }
 
 /**

http://git-wip-us.apache.org/repos/asf/spark/blob/96028e36/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
--
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 
b/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
index 6649fa4..6bf1253 100644
--- 
a/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
+++ 
b/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
@@ -256,6 +256,7 @@ class LogisticRegressionSuite
 
 val blorModel = lr.fit(smallBinaryDataset)
 
assert(blorModel.summary.isInstanceOf[BinaryLogisticRegressionTrainingSummary])
+
assert(blorModel.summary.asBinary.isInstanceOf[BinaryLogisticRegressionSummary])
 
assert(blorModel.binarySummary.isInstanceOf[BinaryLogisticRegressionTrainingSummary])
 
 val mlorModel = lr.setFamily("multinomial").fit(smallMultinomialDataset)
@@ -265,6 +266,11 @@ class LogisticRegressionSuite
 mlorModel.binarySummary
   }
 }
+withClue("cannot cast summary to binary summary multiclass model") {
+  intercept[RuntimeException] {
+mlorModel.summary.asBinary
+  }
+}
 
 val mlorBinaryModel = lr.setFamily("multinomial").fit(smallBinaryDataset)
 
assert(mlorBinaryModel.summary.isInstanceOf[BinaryLogisticRegressionTrainingSummary])

http://git-wip-us.apache.org/repos/asf/spark/blob/96028e36/project/MimaExcludes.scala
--
diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala
index eecda26..27e4183 100644
--- a/project/MimaExcludes.scala
+++ b/project/MimaExcludes.scala
@@ -62,6 +62,7 @@ object MimaExcludes {
 
ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.ml.classification.LogisticRegressionSummary.weightedRecall"),
 
ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.ml.classification.LogisticRegressionSummary.weightedPrecision"),
 
ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.ml.classification.LogisticRegressionSummary.weightedFMeasure"),
+

[GitHub] spark pull request #19072: [SPARK-17139][ML][FOLLOW-UP] Add convenient metho...

2017-08-31 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19072#discussion_r136470877
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1473,6 +1473,17 @@ sealed trait LogisticRegressionSummary extends 
Serializable {
   /** Returns weighted averaged f1-measure. */
   @Since("2.3.0")
   def weightedFMeasure: Double = multiclassMetrics.weightedFMeasure(1.0)
+
+  /**
+   * Convenient method for casting to binary logistic regression summary.
+   * This method will throws an Exception if the summary is not a binary 
summary.
--- End diff --

For future reference & being careful: This didn't correct "throws" -> 
"throw" but I'll just go ahead and merge this since it's a tiny nit.





[GitHub] spark pull request #19072: [SPARK-17139][ML][FOLLOW-UP] Add convenient metho...

2017-08-30 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19072#discussion_r136249376
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1471,6 +1471,17 @@ sealed trait LogisticRegressionSummary extends 
Serializable {
   /** Returns weighted averaged f1-measure. */
   @Since("2.3.0")
   def weightedFMeasure: Double = multiclassMetrics.weightedFMeasure(1.0)
+
+  /**
+   * Convenient method for casting to binary logistic regression summary,
--- End diff --

Tiny grammar nit:
"""
Convenient method for casting to binary logistic regression summary.
This method will throw an Exception if the summary is not a binary summary.
"""

(This will also make tests rerun.)





[GitHub] spark pull request #19078: [SPARK-21862][ML] Add overflow check in PCA

2017-08-30 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19078#discussion_r136248809
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala ---
@@ -110,3 +115,17 @@ class PCAModel private[spark] (
 }
   }
 }
+
+object PCAUtil {
--- End diff --

This should be made package private





[GitHub] spark pull request #19078: [SPARK-21862][ML] Add overflow check in PCA

2017-08-30 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19078#discussion_r136249083
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala ---
@@ -44,6 +44,11 @@ class PCA @Since("1.4.0") (@Since("1.4.0") val k: Int) {
 require(k <= numFeatures,
   s"source vector size $numFeatures must be no less than k=$k")
 
+require(PCAUtil.memoryCost(k, numFeatures) <= Int.MaxValue,
--- End diff --

As long as you're making updates...how about making this strict inequality? 
 (I could imagine boundary issues with indexing somewhere in Breeze or MLlib.)





[GitHub] spark pull request #19078: [SPARK-21862][ML] Add overflow check in PCA

2017-08-30 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19078#discussion_r136248974
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala ---
@@ -44,6 +44,11 @@ class PCA @Since("1.4.0") (@Since("1.4.0") val k: Int) {
 require(k <= numFeatures,
   s"source vector size $numFeatures must be no less than k=$k")
 
+require(PCAUtil.memoryCost(k, numFeatures) <= Int.MaxValue,
+  "The param k and numFeatures is too large for SVD computation." +
--- End diff --

Put space between sentences: "computation. "





[GitHub] spark issue #19072: [SPARK-17133][ML][FOLLOW-UP] Add convenient method `asBi...

2017-08-29 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19072
  
Actually, can you please tag this with SPARK-17139 instead of SPARK-17133?





[GitHub] spark pull request #19072: [SPARK-17133][ML][FOLLOW-UP] Add convenient metho...

2017-08-29 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19072#discussion_r135951191
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1471,6 +1471,14 @@ sealed trait LogisticRegressionSummary extends 
Serializable {
   /** Returns weighted averaged f1-measure. */
   @Since("2.3.0")
   def weightedFMeasure: Double = multiclassMetrics.weightedFMeasure(1.0)
+
+  /** Convenient method for casting to binary logistic regression summary 
*/
--- End diff --

Please document that this throws an Exception if the summary is not a 
binary summary.





[GitHub] spark issue #19078: [SPARK-21862] Add overflow check in PCA

2017-08-29 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19078
  
Can you please add the "[ML]" tag to the PR title?





[GitHub] spark pull request #19078: [SPARK-21862] Add overflow check in PCA

2017-08-29 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19078#discussion_r135950507
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala ---
@@ -44,6 +44,14 @@ class PCA @Since("1.4.0") (@Since("1.4.0") val k: Int) {
 require(k <= numFeatures,
   s"source vector size $numFeatures must be no less than k=$k")
 
+val workSize = ( 3
+  * math.min(k, numFeatures) * math.min(k, numFeatures)
+  + math.max(math.max(k, numFeatures), 4 * math.min(k, numFeatures)
+  * math.min(k, numFeatures) + 4 * math.min(k, numFeatures))
+  )
+require(workSize <= Int.MaxValue,
+  "The param K and numFeatures is too large for SVD computation.")
--- End diff --

"K" -> "k"
I'd also add this to be super-explicit: "Try reducing the parameter k for 
PCA, or reduce the input feature vector dimension to make this tractable."





[GitHub] spark pull request #19078: [SPARK-21862] Add overflow check in PCA

2017-08-29 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19078#discussion_r135950323
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala ---
@@ -44,6 +44,14 @@ class PCA @Since("1.4.0") (@Since("1.4.0") val k: Int) {
 require(k <= numFeatures,
   s"source vector size $numFeatures must be no less than k=$k")
 
+val workSize = ( 3
+  * math.min(k, numFeatures) * math.min(k, numFeatures)
+  + math.max(math.max(k, numFeatures), 4 * math.min(k, numFeatures)
+  * math.min(k, numFeatures) + 4 * math.min(k, numFeatures))
+  )
+require(workSize <= Int.MaxValue,
--- End diff --

I'm pretty sure workSize is an Int and can overflow.  Can you please put 
this computation into an object method and add a unit test for it?
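
A sketch of the suggested refactor, assuming the workspace formula quoted in
the diff above (the merged patch exposes this as `PCAUtil.memoryCost`): doing
the arithmetic in Long means the check itself cannot overflow, and an object
method is easy to unit test in isolation.

```scala
object PCAUtil {  // made private[feature] in the final patch, per the review

  // Workspace size (in Doubles) of the dense SVD for a k x numFeatures
  // problem, mirroring the bound quoted in the PR.
  def memoryCost(k: Int, numFeatures: Int): Long = {
    val m = math.min(k, numFeatures).toLong
    val n = math.max(k, numFeatures).toLong
    3L * m * m + math.max(n, 4L * m * m + 4L * m)
  }
}

// Caller-side guard (strict inequality, per the boundary-issue comment above):
// require(PCAUtil.memoryCost(k, numFeatures) < Int.MaxValue, "...")
```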





[GitHub] spark pull request #19078: [SPARK-21862] Add overflow check in PCA

2017-08-29 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19078#discussion_r135950242
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala ---
@@ -44,6 +44,13 @@ class PCA @Since("1.4.0") (@Since("1.4.0") val k: Int) {
 require(k <= numFeatures,
   s"source vector size $numFeatures must be no less than k=$k")
 
+val workSize = ( 3
--- End diff --

Definitely, the error message is much better.  How about adding a Github 
permalink to the Breeze code this is from, for future reference?





spark git commit: [MINOR][ML] Document treatment of instance weights in logreg summary

2017-08-29 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master 6077e3ef3 -> 840ba053b


[MINOR][ML] Document treatment of instance weights in logreg summary

## What changes were proposed in this pull request?

Add Scaladoc noting that instance weights are currently ignored in the logistic 
regression summary traits.

Author: Joseph K. Bradley <jos...@databricks.com>

Closes #19071 from jkbradley/lr-summary-minor.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/840ba053
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/840ba053
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/840ba053

Branch: refs/heads/master
Commit: 840ba053b982362dfe84c6faa59d2237994d591c
Parents: 6077e3e
Author: Joseph K. Bradley <jos...@databricks.com>
Authored: Tue Aug 29 13:01:37 2017 -0700
Committer: Joseph K. Bradley <jos...@databricks.com>
Committed: Tue Aug 29 13:01:37 2017 -0700

--
 .../org/apache/spark/ml/classification/LogisticRegression.scala  | 4 
 1 file changed, 4 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/840ba053/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
index ffe4b52..1869d51 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
@@ -1356,6 +1356,8 @@ private[ml] class MultiClassSummarizer extends 
Serializable {
 /**
  * :: Experimental ::
  * Abstraction for logistic regression results for a given model.
+ *
+ * Currently, the summary ignores the instance weights.
  */
 @Experimental
 sealed trait LogisticRegressionSummary extends Serializable {
@@ -1495,6 +1497,8 @@ sealed trait LogisticRegressionTrainingSummary extends 
LogisticRegressionSummary
 /**
  * :: Experimental ::
  * Abstraction for binary logistic regression results for a given model.
+ *
+ * Currently, the summary ignores the instance weights.
  */
 @Experimental
 sealed trait BinaryLogisticRegressionSummary extends LogisticRegressionSummary 
{





[GitHub] spark issue #19071: [MINOR][ML] Document treatment of instance weights in lo...

2017-08-29 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19071
  
Merging with master





[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...

2017-08-28 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/17862
  
Catching up here...  To make sure I caught the decisions made in the 
discussion above, is it correct that this PR will:
* Add support for squared hinge loss, and use that as the default (which I 
fully support)
* Switch from OWLQN to LBFGS (which is fine with me if reasonably 
large-scale tests support that choice)
  * On this note, has anyone tested on large-scale datasets (ideally with 
millions of rows and columns)?

If those are the decisions, can you please update the PR description and 
JIRA as such when you update the PR @hhbyyh ?  Thank you!
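
For readers following the loss discussion, a minimal sketch of the two hinge
variants on the margin m = y * (w · x + b), with labels y in {-1, +1} (plain
Scala, not Spark's internal aggregator API):

```scala
def hingeLoss(margin: Double): Double = math.max(0.0, 1.0 - margin)

def squaredHingeLoss(margin: Double): Double = {
  val h = math.max(0.0, 1.0 - margin)
  h * h  // smooth at margin = 1, so standard LBFGS applies directly
}
```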





[GitHub] spark issue #19071: [MINOR][ML] Document treatment of instance weights in lo...

2017-08-28 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19071
  
CC @WeichenXu123 





[GitHub] spark pull request #19071: [MINOR][ML] Document treatment of instance weight...

2017-08-28 Thread jkbradley
GitHub user jkbradley opened a pull request:

https://github.com/apache/spark/pull/19071

[MINOR][ML] Document treatment of instance weights in logreg summary

## What changes were proposed in this pull request?

Add Scaladoc noting that instance weights are currently ignored in the 
logistic regression summary traits.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jkbradley/spark lr-summary-minor

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19071.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19071


commit c3589c348147007887e6f4cc5ad88a9bce3d7804
Author: Joseph K. Bradley <jos...@databricks.com>
Date:   2017-08-28T20:35:48Z

doc about instance weights for lr summary







spark git commit: [SPARK-17139][ML] Add model summary for MultinomialLogisticRegression

2017-08-28 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master 73e64f7d5 -> c7270a46f


[SPARK-17139][ML] Add model summary for MultinomialLogisticRegression

## What changes were proposed in this pull request?

Add 4 traits, using the following hierarchy:
LogisticRegressionSummary
LogisticRegressionTrainingSummary: LogisticRegressionSummary
BinaryLogisticRegressionSummary: LogisticRegressionSummary
BinaryLogisticRegressionTrainingSummary: LogisticRegressionTrainingSummary, 
BinaryLogisticRegressionSummary

and the public methods such as `def summary` only return the trait types listed above.

and then implement 4 concrete classes:
LogisticRegressionSummaryImpl (multiclass case)
LogisticRegressionTrainingSummaryImpl (multiclass case)
BinaryLogisticRegressionSummaryImpl (binary case).
BinaryLogisticRegressionTrainingSummaryImpl (binary case).
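
(A compact Scala skeleton of that hierarchy, bodies elided:)

```scala
sealed trait LogisticRegressionSummary extends Serializable
sealed trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary
sealed trait BinaryLogisticRegressionSummary extends LogisticRegressionSummary
sealed trait BinaryLogisticRegressionTrainingSummary
  extends LogisticRegressionTrainingSummary with BinaryLogisticRegressionSummary

// Each private *Impl class extends the matching trait, e.g.
// private class BinaryLogisticRegressionTrainingSummaryImpl(/* ... */)
//   extends BinaryLogisticRegressionTrainingSummary
```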

## How was this patch tested?

Existing tests & added tests.

Author: WeichenXu 

Closes #15435 from WeichenXu123/mlor_summary.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c7270a46
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c7270a46
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c7270a46

Branch: refs/heads/master
Commit: c7270a46fc340db62c87ddfc6568603d0b832845
Parents: 73e64f7
Author: Weichen Xu 
Authored: Mon Aug 28 13:31:01 2017 -0700
Committer: Joseph K. Bradley 
Committed: Mon Aug 28 13:31:01 2017 -0700

--
 .../ml/classification/LogisticRegression.scala  | 340 +++
 .../LogisticRegressionSuite.scala   | 160 +++--
 .../ml/regression/LinearRegressionSuite.scala   |   2 +-
 project/MimaExcludes.scala  |  21 +-
 4 files changed, 412 insertions(+), 111 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c7270a46/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
index 21957d9..ffe4b52 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
@@ -22,7 +22,7 @@ import java.util.Locale
 import scala.collection.mutable
 
 import breeze.linalg.{DenseVector => BDV}
-import breeze.optimize.{CachedDiffFunction, DiffFunction, LBFGS => 
BreezeLBFGS, LBFGSB => BreezeLBFGSB, OWLQN => BreezeOWLQN}
+import breeze.optimize.{CachedDiffFunction, LBFGS => BreezeLBFGS, LBFGSB => 
BreezeLBFGSB, OWLQN => BreezeOWLQN}
 import org.apache.hadoop.fs.Path
 
 import org.apache.spark.SparkException
@@ -35,7 +35,7 @@ import org.apache.spark.ml.optim.loss.{L2Regularization, 
RDDLossFunction}
 import org.apache.spark.ml.param._
 import org.apache.spark.ml.param.shared._
 import org.apache.spark.ml.util._
-import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
+import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics, 
MulticlassMetrics}
 import org.apache.spark.mllib.linalg.VectorImplicits._
 import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
 import org.apache.spark.mllib.util.MLUtils
@@ -882,21 +882,28 @@ class LogisticRegression @Since("1.2.0") (
 
 val model = copyValues(new LogisticRegressionModel(uid, coefficientMatrix, 
interceptVector,
   numClasses, isMultinomial))
-// TODO: implement summary model for multinomial case
-val m = if (!isMultinomial) {
-  val (summaryModel, probabilityColName) = 
model.findSummaryModelAndProbabilityCol()
-  val logRegSummary = new BinaryLogisticRegressionTrainingSummary(
+
+val (summaryModel, probabilityColName, predictionColName) = 
model.findSummaryModel()
+val logRegSummary = if (numClasses <= 2) {
+  new BinaryLogisticRegressionTrainingSummaryImpl(
 summaryModel.transform(dataset),
 probabilityColName,
+predictionColName,
 $(labelCol),
 $(featuresCol),
 objectiveHistory)
-  model.setSummary(Some(logRegSummary))
 } else {
-  model
+  new LogisticRegressionTrainingSummaryImpl(
+summaryModel.transform(dataset),
+probabilityColName,
+predictionColName,
+$(labelCol),
+$(featuresCol),
+objectiveHistory)
 }
-instr.logSuccess(m)
-m
+model.setSummary(Some(logRegSummary))
+instr.logSuccess(model)
+model
   }
 
   @Since("1.4.0")
@@ -1010,8 +1017,8 @@ class LogisticRegressionModel private[spark] (
   private var trainingSummary: Option[LogisticRegressionTrainingSummary] = None
 
   /**
-   * Gets 

[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-08-28 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/15435
  
LGTM
Merging with master
Thanks a lot @WeichenXu123 and everyone who reviewed!





[GitHub] spark issue #19026: [SPARK-21681][ML] fix bug of MLOR do not work correctly ...

2017-08-28 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19026
  
@WeichenXu123 could you please close this manually?  Thanks!





spark git commit: [SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd contains zero (backport PR for 2.2)

2017-08-24 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 a58536741 -> 2b4bd7910


[SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd 
contains zero (backport PR for 2.2)

## What changes were proposed in this pull request?

This is backport PR of https://github.com/apache/spark/pull/18896

fix bug of MLOR do not work correctly when featureStd contains zero

We can reproduce the bug with a dataset like the following (features include
zero variance); it generates a wrong result (all coefficients become 0):
```
val multinomialDatasetWithZeroVar = {
  val nPoints = 100
  val coefficients = Array(
-0.57997, 0.912083, -0.371077,
-0.16624, -0.84355, -0.048509)

  val xMean = Array(5.843, 3.0)
  val xVariance = Array(0.6856, 0.0)  // including zero variance

  val testData = generateMultinomialLogisticInput(
coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)

  val df = sc.parallelize(testData, 4).toDF().withColumn("weight", lit(1.0))
  df.cache()
  df
}
```
## How was this patch tested?

Test case added.

Author: WeichenXu 

Closes #19026 from WeichenXu123/fix_mlor_zero_var_bug_2_2.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2b4bd791
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2b4bd791
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2b4bd791

Branch: refs/heads/branch-2.2
Commit: 2b4bd7910fecc8b7b41c7d4388d2a8204c1901e8
Parents: a585367
Author: Weichen Xu 
Authored: Thu Aug 24 10:18:56 2017 -0700
Committer: Joseph K. Bradley 
Committed: Thu Aug 24 10:18:56 2017 -0700

--
 .../ml/classification/LogisticRegression.scala  | 12 ++--
 .../LogisticRegressionSuite.scala   | 75 
 2 files changed, 82 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/2b4bd791/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
index 567af04..1de2373 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
@@ -1727,11 +1727,13 @@ private class LogisticAggregator(
 
 val margins = new Array[Double](numClasses)
 features.foreachActive { (index, value) =>
-  val stdValue = value / localFeaturesStd(index)
-  var j = 0
-  while (j < numClasses) {
-margins(j) += localCoefficients(index * numClasses + j) * stdValue
-j += 1
+  if (localFeaturesStd(index) != 0.0 && value != 0.0) {
+val stdValue = value / localFeaturesStd(index)
+var j = 0
+while (j < numClasses) {
+  margins(j) += localCoefficients(index * numClasses + j) * stdValue
+  j += 1
+}
   }
 }
 var i = 0

http://git-wip-us.apache.org/repos/asf/spark/blob/2b4bd791/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
--
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 
b/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
index 1ffd8dc..8461d64 100644
--- 
a/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
+++ 
b/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
@@ -45,6 +45,7 @@ class LogisticRegressionSuite
   @transient var smallMultinomialDataset: Dataset[_] = _
   @transient var binaryDataset: Dataset[_] = _
   @transient var multinomialDataset: Dataset[_] = _
+  @transient var multinomialDatasetWithZeroVar: Dataset[_] = _
   private val eps: Double = 1e-5
 
   override def beforeAll(): Unit = {
@@ -98,6 +99,23 @@ class LogisticRegressionSuite
   df.cache()
   df
 }
+
+multinomialDatasetWithZeroVar = {
+  val nPoints = 100
+  val coefficients = Array(
+-0.57997, 0.912083, -0.371077,
+-0.16624, -0.84355, -0.048509)
+
+  val xMean = Array(5.843, 3.0)
+  val xVariance = Array(0.6856, 0.0)
+
+  val testData = generateMultinomialLogisticInput(
+coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
+
+  val df = sc.parallelize(testData, 4).toDF().withColumn("weight", 
lit(1.0))
+  df.cache()
+  df
+}
   }
 
   /**
@@ -111,6 +129,11 @@ class LogisticRegressionSuite
 

[GitHub] spark issue #19026: [SPARK-21681][ML] fix bug of MLOR do not work correctly ...

2017-08-24 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19026
  
LGTM
Thanks @WeichenXu123 !
Merging with branch-2.2





[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-08-24 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/15435
  
OK, then let's reinstate the Experimental tags.  @weichen could you please 
mark the 4 public summary traits Experimental?  





[GitHub] spark issue #17373: [SPARK-12664][ML] Expose probability in mlp model

2017-08-22 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/17373
  
Merging with master
Thanks @WeichenXu123 !





spark git commit: [SPARK-12664][ML] Expose probability in mlp model

2017-08-22 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master d58a3507e -> d6b30edd4


[SPARK-12664][ML] Expose probability in mlp model

## What changes were proposed in this pull request?

Modify the MLP model to inherit `ProbabilisticClassificationModel` so that it
can expose the probability column when transforming data.
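
(A hedged usage sketch; `train` and `test` are hypothetical DataFrames with
the default "label"/"features" columns, 4-dimensional features, 3 classes:)

```scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 3))  // input, hidden, and output layer sizes
val model = mlp.fit(train)
// With this change the transformed output also carries probabilistic columns:
model.transform(test).select("prediction", "rawPrediction", "probability").show()
```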

## How was this patch tested?

Test added.

Author: WeichenXu 

Closes #17373 from WeichenXu123/expose_probability_in_mlp_model.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d6b30edd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d6b30edd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d6b30edd

Branch: refs/heads/master
Commit: d6b30edd4974b593cc8085f680ccb524c7722c85
Parents: d58a350
Author: Weichen Xu 
Authored: Tue Aug 22 21:16:34 2017 -0700
Committer: Joseph K. Bradley 
Committed: Tue Aug 22 21:16:34 2017 -0700

--
 .../scala/org/apache/spark/ml/ann/Layer.scala   | 53 +---
 .../MultilayerPerceptronClassifier.scala| 17 +--
 .../org/apache/spark/ml/ann/GradientSuite.scala |  2 +-
 .../MultilayerPerceptronClassifierSuite.scala   | 42 
 python/pyspark/ml/classification.py |  4 +-
 5 files changed, 103 insertions(+), 15 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d6b30edd/mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala 
b/mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala
index e7e0dae..014ff07 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala
@@ -361,17 +361,42 @@ private[ann] trait TopologyModel extends Serializable {
* Forward propagation
*
* @param data input data
+   * @param includeLastLayer Include the last layer in the output. In
+   * MultilayerPerceptronClassifier, the last layer is 
always softmax;
+   * the last layer of outputs is needed for class 
predictions, but not
+   * for rawPrediction.
+   *
* @return array of outputs for each of the layers
*/
-  def forward(data: BDM[Double]): Array[BDM[Double]]
+  def forward(data: BDM[Double], includeLastLayer: Boolean): Array[BDM[Double]]
 
   /**
-   * Prediction of the model
+   * Prediction of the model. See {@link ProbabilisticClassificationModel}
*
-   * @param data input data
+   * @param features input features
* @return prediction
*/
-  def predict(data: Vector): Vector
+  def predict(features: Vector): Vector
+
+  /**
+   * Raw prediction of the model. See {@link ProbabilisticClassificationModel}
+   *
+   * @param features input features
+   * @return raw prediction
+   *
+   * Note: This interface is only used for classification Model.
+   */
+  def predictRaw(features: Vector): Vector
+
+  /**
+   * Probability of the model. See {@link ProbabilisticClassificationModel}
+   *
+   * @param rawPrediction raw prediction vector
+   * @return probability
+   *
+   * Note: This interface is only used for classification Model.
+   */
+  def raw2ProbabilityInPlace(rawPrediction: Vector): Vector
 
   /**
* Computes gradient for the network
@@ -463,7 +488,7 @@ private[ml] class FeedForwardModel private(
   private var outputs: Array[BDM[Double]] = null
   private var deltas: Array[BDM[Double]] = null
 
-  override def forward(data: BDM[Double]): Array[BDM[Double]] = {
+  override def forward(data: BDM[Double], includeLastLayer: Boolean): 
Array[BDM[Double]] = {
 // Initialize output arrays for all layers. Special treatment for InPlace
 val currentBatchSize = data.cols
 // TODO: allocate outputs as one big array and then create BDMs from it
@@ -481,7 +506,8 @@ private[ml] class FeedForwardModel private(
   }
 }
 layerModels(0).eval(data, outputs(0))
-for (i <- 1 until layerModels.length) {
+val end = if (includeLastLayer) layerModels.length else layerModels.length 
- 1
+for (i <- 1 until end) {
   layerModels(i).eval(outputs(i - 1), outputs(i))
 }
 outputs
@@ -492,7 +518,7 @@ private[ml] class FeedForwardModel private(
 target: BDM[Double],
 cumGradient: Vector,
 realBatchSize: Int): Double = {
-val outputs = forward(data)
+val outputs = forward(data, true)
 val currentBatchSize = data.cols
 // TODO: allocate deltas as one big array and then create BDMs from it
 if (deltas == null || deltas(0).cols != currentBatchSize) {
@@ -527,9 +553,20 @@ private[ml] class FeedForwardModel private(
 
   override def predict(data: Vector): Vector = 

[GitHub] spark issue #17373: [SPARK-12664][ML] Expose probability in mlp model

2017-08-22 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/17373
  
Thanks!  Will merge after rerunning tests





[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-08-22 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/18313
  
+1 for merging https://github.com/apache/spark/pull/16774 before proceeding 
with the other work since it will affect everything else.

@MLnick I'd be OK with adding options for best/all/k models in the future,
but I'm not convinced it's needed, since the top-k could be computed post-hoc
and since the 2 approaches for keeping all models should handle big & small
use cases.
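
A sketch of that post-hoc computation (assumes a larger-is-better metric, a
fitted `cvModel: CrossValidatorModel`, and some cutoff `k`):

```scala
val topK = cvModel.getEstimatorParamMaps
  .zip(cvModel.avgMetrics)
  .sortBy(-_._2)  // flip the sort for smaller-is-better metrics
  .take(k)        // the k best (ParamMap, metric) pairs
```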





[GitHub] spark issue #18896: [SPARK-21681][ML] fix bug of MLOR do not work correctly ...

2017-08-22 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/18896
  
@WeichenXu123 would you mind sending a backport PR for 2.2?





spark git commit: [SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd contains zero

2017-08-22 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master 01a8e4627 -> d56c26210


[SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd 
contains zero

## What changes were proposed in this pull request?

fix bug of MLOR do not work correctly when featureStd contains zero

We can reproduce the bug with a dataset like the following (features include
zero variance); it generates a wrong result (all coefficients become 0):
```
val multinomialDatasetWithZeroVar = {
  val nPoints = 100
  val coefficients = Array(
-0.57997, 0.912083, -0.371077,
-0.16624, -0.84355, -0.048509)

  val xMean = Array(5.843, 3.0)
  val xVariance = Array(0.6856, 0.0)  // including zero variance

  val testData = generateMultinomialLogisticInput(
coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)

  val df = sc.parallelize(testData, 4).toDF().withColumn("weight", lit(1.0))
  df.cache()
  df
}
```
## How was this patch tested?

Test case added.

Author: WeichenXu 

Closes #18896 from WeichenXu123/fix_mlor_stdvalue_zero_bug.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d56c2621
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d56c2621
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d56c2621

Branch: refs/heads/master
Commit: d56c262109a5d94b46fffc04954c34671b14ee4f
Parents: 01a8e46
Author: Weichen Xu 
Authored: Tue Aug 22 16:55:34 2017 -0700
Committer: Joseph K. Bradley 
Committed: Tue Aug 22 16:55:34 2017 -0700

--
 .../optim/aggregator/LogisticAggregator.scala   | 12 +--
 .../LogisticRegressionSuite.scala   | 78 
 .../aggregator/LogisticAggregatorSuite.scala| 37 +-
 3 files changed, 118 insertions(+), 9 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d56c2621/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala
index 66a5294..272d36d 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala
@@ -270,11 +270,13 @@ private[ml] class LogisticAggregator(
 
 val margins = new Array[Double](numClasses)
 features.foreachActive { (index, value) =>
-  val stdValue = value / localFeaturesStd(index)
-  var j = 0
-  while (j < numClasses) {
-margins(j) += localCoefficients(index * numClasses + j) * stdValue
-j += 1
+  if (localFeaturesStd(index) != 0.0 && value != 0.0) {
+val stdValue = value / localFeaturesStd(index)
+var j = 0
+while (j < numClasses) {
+  margins(j) += localCoefficients(index * numClasses + j) * stdValue
+  j += 1
+}
   }
 }
 var i = 0

http://git-wip-us.apache.org/repos/asf/spark/blob/d56c2621/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
--
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 
b/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
index 0570499..542977a 100644
--- 
a/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
+++ 
b/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
@@ -46,6 +46,7 @@ class LogisticRegressionSuite
   @transient var smallMultinomialDataset: Dataset[_] = _
   @transient var binaryDataset: Dataset[_] = _
   @transient var multinomialDataset: Dataset[_] = _
+  @transient var multinomialDatasetWithZeroVar: Dataset[_] = _
   private val eps: Double = 1e-5
 
   override def beforeAll(): Unit = {
@@ -99,6 +100,23 @@ class LogisticRegressionSuite
   df.cache()
   df
 }
+
+multinomialDatasetWithZeroVar = {
+  val nPoints = 100
+  val coefficients = Array(
+-0.57997, 0.912083, -0.371077,
+-0.16624, -0.84355, -0.048509)
+
+  val xMean = Array(5.843, 3.0)
+  val xVariance = Array(0.6856, 0.0)
+
+  val testData = generateMultinomialLogisticInput(
+coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
+
+  val df = sc.parallelize(testData, 4).toDF().withColumn("weight", 
lit(1.0))
+  df.cache()
+  df
+}
   }
 
   /**
@@ -112,6 +130,11 @@ class LogisticRegressionSuite
 multinomialDataset.rdd.map { case 

[GitHub] spark issue #18896: [SPARK-21681][ML] fix bug of MLOR do not work correctly ...

2017-08-22 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/18896
  
Merging with master and branch-2.2
Thanks @WeichenXu123 @MLnick @sethah !





[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-22 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r134628661
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1324,90 +1350,136 @@ private[ml] class MultiClassSummarizer extends 
Serializable {
 }
 
 /**
- * Abstraction for multinomial Logistic Regression Training results.
- * Currently, the training summary ignores the training weights except
- * for the objective trace.
- */
-sealed trait LogisticRegressionTrainingSummary extends 
LogisticRegressionSummary {
-
-  /** objective function (scaled loss + regularization) at each iteration. 
*/
-  def objectiveHistory: Array[Double]
-
-  /** Number of training iterations until termination */
-  def totalIterations: Int = objectiveHistory.length
-
-}
-
-/**
- * Abstraction for Logistic Regression Results for a given model.
+ * Abstraction for logistic regression results for a given model.
  */
 sealed trait LogisticRegressionSummary extends Serializable {
 
   /**
* Dataframe output by the model's `transform` method.
*/
+  @Since("2.3.0")
--- End diff --

I see; I think this should work.





[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-22 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r134627545
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1324,90 +1354,147 @@ private[ml] class MultiClassSummarizer extends 
Serializable {
 }
 
 /**
- * Abstraction for multinomial Logistic Regression Training results.
- * Currently, the training summary ignores the training weights except
- * for the objective trace.
- */
-sealed trait LogisticRegressionTrainingSummary extends 
LogisticRegressionSummary {
-
-  /** objective function (scaled loss + regularization) at each iteration. 
*/
-  def objectiveHistory: Array[Double]
-
-  /** Number of training iterations until termination */
-  def totalIterations: Int = objectiveHistory.length
-
-}
-
-/**
- * Abstraction for Logistic Regression Results for a given model.
+ * Abstraction for logistic regression results for a given model.
  */
 sealed trait LogisticRegressionSummary extends Serializable {
 
   /**
* Dataframe output by the model's `transform` method.
*/
+  @Since("1.5.0")
   def predictions: DataFrame
 
   /** Field in "predictions" which gives the probability of each class as 
a vector. */
+  @Since("1.5.0")
   def probabilityCol: String
 
+  /** Field in "predictions" which gives the prediction of each class. */
+  @Since("2.3.0")
+  def predictionCol: String
+
   /** Field in "predictions" which gives the true label of each instance 
(if available). */
+  @Since("1.5.0")
   def labelCol: String
 
   /** Field in "predictions" which gives the features of each instance as 
a vector. */
+  @Since("1.6.0")
   def featuresCol: String
 
+  @transient private val multiclassMetrics = {
+new MulticlassMetrics(
+  predictions.select(
+col(predictionCol),
+col(labelCol).cast(DoubleType))
+.rdd.map { case Row(prediction: Double, label: Double) => 
(prediction, label) })
+  }
+
+  /**
+   * Returns the sequence of labels in ascending order
--- End diff --

Clarify: "Returns the sequence of labels in ascending order.  This order 
matches the order used in metrics which are specified as arrays over labels, 
e.g., truePositiveRateByLabel."
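
An illustration of that pairing (hypothetical `summary` of type
LogisticRegressionSummary; both arrays are ordered by ascending label):

```scala
summary.labels.zip(summary.truePositiveRateByLabel).foreach {
  case (label, tpr) => println(s"label=$label truePositiveRate=$tpr")
}
```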





[GitHub] spark issue #18896: [SPARK-21681][ML] fix bug of MLOR do not work correctly ...

2017-08-21 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/18896
  
Thanks @WeichenXu123 and @sethah !

@WeichenXu123 I'd leave the MLOR test since it's cheap and has a clear 
purpose, even if it overlaps a little.

LGTM
@sethah any more comments before I merge it?





[GitHub] spark issue #17373: [SPARK-12664][ML] Expose probability in mlp model

2017-08-21 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/17373
  
Oh OK makes sense.  @WeichenXu123 could you please open a JIRA (linked from 
this task's JIRA) and CC @felixcheung on it?  Thanks!

I'll rerun tests to be safe and merge this afterwards.





[GitHub] spark issue #18702: [SPARK-21485][SQL][DOCS] Spark SQL documentation generat...

2017-08-17 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/18702
  
I tested locally, and it's fine if I install mkdocs and run 
sql/create-docs.sh
Thanks for fixing it!





[GitHub] spark issue #18896: [SPARK-21681][ML] fix bug of MLOR do not work correctly ...

2017-08-17 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/18896
  
How about updating the LogisticAggregatorSuite so it catches this error:
```
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregatorSuite.scala
 
b/mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregatorSuite.scala
index 2b29c67..a2924ad 100644
--- 
a/mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregatorSuite.scala
+++ 
b/mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregatorSuite.scala
@@ -239,7 +239,14 @@ class LogisticAggregatorSuite extends SparkFunSuite 
with MLlibTestSparkContext {
   Vectors.dense(coefArray ++ interceptArray), fitIntercept = true, 
isMultinomial = true)
 instances.foreach(aggConstantFeature.add)
 // constant features should not affect gradient
-assert(aggConstantFeature.gradient(0) === 0.0)
+def validateGradient(grad: Vector): Unit = {
+  assert(grad(0) === 0.0)
+  grad.toArray.foreach { gradientValue =>
+assert(!gradientValue.isNaN &&
+  gradientValue > Double.NegativeInfinity && gradientValue < 
Double.PositiveInfinity)
+  }
+}
+validateGradient(aggConstantFeature.gradient)
 
 val binaryCoefArray = Array(1.0, 2.0)
 val intercept = 1.0
@@ -248,6 +255,6 @@ class LogisticAggregatorSuite extends SparkFunSuite 
with MLlibTestSparkContext {
   isMultinomial = false)
 instances.foreach(aggConstantFeatureBinary.add)
 // constant features should not affect gradient
-assert(aggConstantFeatureBinary.gradient(0) === 0.0)
+validateGradient(aggConstantFeatureBinary.gradient)
   }
 }
```
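
In miniature, the failure mode those assertions guard against (constants are
purely illustrative): dividing by a zero feature std sends Infinity through
the margins, and downstream differences become NaN.

```scala
val std = 0.0                     // a constant feature's standard deviation
val stdValue = 5.843 / std        // Double.PositiveInfinity
val margin = 0.912083 * stdValue  // Infinity propagates into the margins
println(margin - margin)          // NaN -- the kind of value validateGradient rejects
```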





[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133844018
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1324,90 +1350,136 @@ private[ml] class MultiClassSummarizer extends 
Serializable {
 }
 
 /**
- * Abstraction for multinomial Logistic Regression Training results.
- * Currently, the training summary ignores the training weights except
- * for the objective trace.
- */
-sealed trait LogisticRegressionTrainingSummary extends 
LogisticRegressionSummary {
-
-  /** objective function (scaled loss + regularization) at each iteration. 
*/
-  def objectiveHistory: Array[Double]
-
-  /** Number of training iterations until termination */
-  def totalIterations: Int = objectiveHistory.length
-
-}
-
-/**
- * Abstraction for Logistic Regression Results for a given model.
+ * Abstraction for logistic regression results for a given model.
  */
 sealed trait LogisticRegressionSummary extends Serializable {
 
   /**
* Dataframe output by the model's `transform` method.
*/
+  @Since("2.3.0")
--- End diff --

I confirmed: These annotations are inherited, so users may be confused 
about why some Since annotations have changed.  E.g., 
BinaryLogisticRegressionSummary.featuresCol will change from 1.5.0 to 2.3.0.  
I'd just remove the ones in the trait.





[GitHub] spark issue #18702: [SPARK-21485][SQL][DOCS] Spark SQL documentation generat...

2017-08-17 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/18702
  
@HyukjinKwon It looks like the Jenkins doc build was broken by this PR (and 
I can't build it locally either).  Is it just that the script needs to be 
updated to run sql/create-docs.sh ?





[GitHub] spark issue #17373: [SPARK-12664][ML] Expose probability in mlp model

2017-08-17 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/17373
  
@felixcheung How will this affect ```spark.mlp``` in R?





[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133828976
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1484,3 +1556,96 @@ class BinaryLogisticRegressionSummary 
private[classification] (
 binaryMetrics.recallByThreshold().toDF("threshold", "recall")
   }
 }
+
+sealed trait BinaryLogisticRegressionTrainingSummary extends 
BinaryLogisticRegressionSummary
+  with LogisticRegressionTrainingSummary
+
+/**
+ * :: Experimental ::
+ * Multiclass logistic regression training results.
+ *
+ * @param predictions dataframe output by the model's `transform` method.
+ * @param probabilityCol field in "predictions" which gives the 
probability of
+ *   each class as a vector.
+ * @param predictionCol field in "predictions" which gives the prediction 
for a data instance as a
+ *  double.
+ * @param labelCol field in "predictions" which gives the true label of 
each instance.
+ * @param featuresCol field in "predictions" which gives the features of 
each instance as a vector.
+ * @param objectiveHistory objective function (scaled loss + 
regularization) at each iteration.
+ */
+private class LogisticRegressionTrainingSummaryImpl(
+override val predictions: DataFrame,
--- End diff --

I think you can leave out the "override val" since the value is passed to 
LogisticRegressionSummaryImpl, which provides it as a val.  Same for the other 
vals provided by LogisticRegressionSummaryImpl.
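
A minimal sketch of the point, with shortened hypothetical names:

```scala
import org.apache.spark.sql.DataFrame

// The parent declares "predictions" as a val, so it is already a field.
private class SummaryImpl(val predictions: DataFrame) extends Serializable

// The child only forwards the constructor argument; writing
// "override val predictions" here would just redeclare a field the
// parent already provides.
private class TrainingSummaryImpl(
    predictions: DataFrame,
    val objectiveHistory: Array[Double])
  extends SummaryImpl(predictions)
```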


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133771590
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -882,21 +882,28 @@ class LogisticRegression @Since("1.2.0") (
 
 val model = copyValues(new LogisticRegressionModel(uid, 
coefficientMatrix, interceptVector,
   numClasses, isMultinomial))
-// TODO: implement summary model for multinomial case
-val m = if (!isMultinomial) {
-  val (summaryModel, probabilityColName) = 
model.findSummaryModelAndProbabilityCol()
-  val logRegSummary = new BinaryLogisticRegressionTrainingSummary(
+
+val (summaryModel, probabilityColName, predictionColName) = 
model.findSummaryModel()
+val logRegSummary = if (numClasses <=2) {
--- End diff --

nit: space before "2"


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133772559
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1324,90 +1350,136 @@ private[ml] class MultiClassSummarizer extends 
Serializable {
 }
 
 /**
- * Abstraction for multinomial Logistic Regression Training results.
- * Currently, the training summary ignores the training weights except
- * for the objective trace.
- */
-sealed trait LogisticRegressionTrainingSummary extends 
LogisticRegressionSummary {
-
-  /** objective function (scaled loss + regularization) at each iteration. 
*/
-  def objectiveHistory: Array[Double]
-
-  /** Number of training iterations until termination */
-  def totalIterations: Int = objectiveHistory.length
-
-}
-
-/**
- * Abstraction for Logistic Regression Results for a given model.
+ * Abstraction for logistic regression results for a given model.
  */
 sealed trait LogisticRegressionSummary extends Serializable {
 
   /**
* Dataframe output by the model's `transform` method.
*/
+  @Since("2.3.0")
--- End diff --

These are not new for this trait, so I'd remove the Since annotations.  
When in doubt, just put Since annotations in final or concrete classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133830867
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -199,15 +199,57 @@ class LogisticRegressionSuite
 }
   }
 
-  test("empty probabilityCol") {
-val lr = new LogisticRegression().setProbabilityCol("")
-val model = lr.fit(smallBinaryDataset)
-assert(model.hasSummary)
-// Validate that we re-insert a probability column for evaluation
-val fieldNames = model.summary.predictions.schema.fieldNames
-assert(smallBinaryDataset.schema.fieldNames.toSet.subsetOf(
-  fieldNames.toSet))
-assert(fieldNames.exists(s => s.startsWith("probability_")))
+  test("empty probabilityCol or predictionCol") {
+val lr = new LogisticRegression().setMaxIter(1)
+val datasetFieldNames = smallBinaryDataset.schema.fieldNames.toSet
+def checkSummarySchema(model: LogisticRegressionModel, columns: 
Seq[String]): Unit = {
+  val fieldNames = model.summary.predictions.schema.fieldNames
+  assert(model.hasSummary)
+  assert(datasetFieldNames.subsetOf(fieldNames.toSet))
+  columns.foreach { c => assert(fieldNames.exists(_.startsWith(c))) }
+}
+// check that the summary model adds the appropriate columns
+Seq(("binomial", smallBinaryDataset), ("multinomial", 
smallMultinomialDataset)).foreach {
+  case (family, dataset) =>
+lr.setFamily(family)
+lr.setProbabilityCol("").setPredictionCol("prediction")
+val modelNoProb = lr.fit(smallBinaryDataset)
+checkSummarySchema(modelNoProb, Seq("probability_"))
+
+lr.setProbabilityCol("probability").setPredictionCol("")
+val modelNoPred = lr.fit(smallBinaryDataset)
+checkSummarySchema(modelNoPred, Seq("prediction_"))
+
+lr.setProbabilityCol("").setPredictionCol("")
+val modelNoPredNoProb = lr.fit(smallBinaryDataset)
+checkSummarySchema(modelNoPredNoProb, Seq("prediction_", 
"probability_"))
+}
+  }
+
+  test("check summary types for binary and multiclass") {
+val lr = new LogisticRegression()
+  .setFamily("binomial")
+
+val blorModel = lr.fit(smallBinaryDataset)
+
assert(blorModel.summary.isInstanceOf[BinaryLogisticRegressionTrainingSummaryImpl])
--- End diff --

nit: You can test for the public trait 
BinaryLogisticRegressionTrainingSummary instead of the private implementation 
class.
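
That is, roughly (same assertion, public type):

```scala
// Asserting against the public trait keeps the test stable if the
// private Impl class is ever renamed or replaced.
assert(blorModel.summary.isInstanceOf[BinaryLogisticRegressionTrainingSummary])
```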


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133824486
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1324,90 +1350,136 @@ private[ml] class MultiClassSummarizer extends 
Serializable {
 }
 
 /**
- * Abstraction for multinomial Logistic Regression Training results.
- * Currently, the training summary ignores the training weights except
- * for the objective trace.
- */
-sealed trait LogisticRegressionTrainingSummary extends 
LogisticRegressionSummary {
-
-  /** objective function (scaled loss + regularization) at each iteration. 
*/
-  def objectiveHistory: Array[Double]
-
-  /** Number of training iterations until termination */
-  def totalIterations: Int = objectiveHistory.length
-
-}
-
-/**
- * Abstraction for Logistic Regression Results for a given model.
+ * Abstraction for logistic regression results for a given model.
  */
 sealed trait LogisticRegressionSummary extends Serializable {
 
   /**
* Dataframe output by the model's `transform` method.
*/
+  @Since("2.3.0")
   def predictions: DataFrame
 
   /** Field in "predictions" which gives the probability of each class as 
a vector. */
+  @Since("2.3.0")
   def probabilityCol: String
 
+  /** Field in "predictions" which gives the prediction of each class. */
+  @Since("2.3.0")
+  def predictionCol: String
+
   /** Field in "predictions" which gives the true label of each instance 
(if available). */
+  @Since("2.3.0")
   def labelCol: String
 
   /** Field in "predictions" which gives the features of each instance as 
a vector. */
+  @Since("2.3.0")
   def featuresCol: String
 
+  @transient private val multiclassMetrics = {
--- End diff --

MulticlassMetrics provides a ```labels``` field which returns the list of 
labels.  In most cases, this will be values ```{0.0, 1.0, ..., 
numClasses-1}```.  However, if the training set is missing a label, then all of 
the arrays over labels (e.g., from ```truePositiveRateByLabel```) will be of 
length numClasses-1 instead of the expected numClasses.  In the future, it'd be 
nice to fix this by having them always be of length numClasses.  For now, how 
about we provide the labels field with this kind of explanation?
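
A sketch of what the exposed field could look like (the doc wording here is only a suggestion):

```scala
/**
 * The sequence of labels, in ascending order. Note: per-label results
 * (e.g. truePositiveRateByLabel) are indexed by this sequence, so if a
 * label is missing from the training data, the per-label arrays have
 * length labels.length, which may be less than numClasses.
 */
@Since("2.3.0")
def labels: Array[Double] = multiclassMetrics.labels
```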


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133829306
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1484,3 +1556,96 @@ class BinaryLogisticRegressionSummary 
private[classification] (
 binaryMetrics.recallByThreshold().toDF("threshold", "recall")
   }
 }
+
+sealed trait BinaryLogisticRegressionTrainingSummary extends 
BinaryLogisticRegressionSummary
+  with LogisticRegressionTrainingSummary
+
+/**
+ * :: Experimental ::
+ * Multiclass logistic regression training results.
+ *
+ * @param predictions dataframe output by the model's `transform` method.
+ * @param probabilityCol field in "predictions" which gives the 
probability of
+ *   each class as a vector.
+ * @param predictionCol field in "predictions" which gives the prediction 
for a data instance as a
+ *  double.
+ * @param labelCol field in "predictions" which gives the true label of 
each instance.
+ * @param featuresCol field in "predictions" which gives the features of 
each instance as a vector.
+ * @param objectiveHistory objective function (scaled loss + 
regularization) at each iteration.
+ */
+private class LogisticRegressionTrainingSummaryImpl(
+override val predictions: DataFrame,
--- End diff --

Same for BinaryLogisticRegressionTrainingSummaryImpl


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133830662
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -199,15 +199,57 @@ class LogisticRegressionSuite
 }
   }
 
-  test("empty probabilityCol") {
-val lr = new LogisticRegression().setProbabilityCol("")
-val model = lr.fit(smallBinaryDataset)
-assert(model.hasSummary)
-// Validate that we re-insert a probability column for evaluation
-val fieldNames = model.summary.predictions.schema.fieldNames
-assert(smallBinaryDataset.schema.fieldNames.toSet.subsetOf(
-  fieldNames.toSet))
-assert(fieldNames.exists(s => s.startsWith("probability_")))
+  test("empty probabilityCol or predictionCol") {
+val lr = new LogisticRegression().setMaxIter(1)
+val datasetFieldNames = smallBinaryDataset.schema.fieldNames.toSet
+def checkSummarySchema(model: LogisticRegressionModel, columns: 
Seq[String]): Unit = {
+  val fieldNames = model.summary.predictions.schema.fieldNames
+  assert(model.hasSummary)
+  assert(datasetFieldNames.subsetOf(fieldNames.toSet))
+  columns.foreach { c => assert(fieldNames.exists(_.startsWith(c))) }
+}
+// check that the summary model adds the appropriate columns
+Seq(("binomial", smallBinaryDataset), ("multinomial", 
smallMultinomialDataset)).foreach {
+  case (family, dataset) =>
+lr.setFamily(family)
+lr.setProbabilityCol("").setPredictionCol("prediction")
+val modelNoProb = lr.fit(smallBinaryDataset)
+checkSummarySchema(modelNoProb, Seq("probability_"))
+
+lr.setProbabilityCol("probability").setPredictionCol("")
+val modelNoPred = lr.fit(smallBinaryDataset)
+checkSummarySchema(modelNoPred, Seq("prediction_"))
+
+lr.setProbabilityCol("").setPredictionCol("")
+val modelNoPredNoProb = lr.fit(smallBinaryDataset)
+checkSummarySchema(modelNoPredNoProb, Seq("prediction_", 
"probability_"))
+}
+  }
+
+  test("check summary types for binary and multiclass") {
+val lr = new LogisticRegression()
--- End diff --

Set maxIter = 1 for speed
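
i.e. (sketch):

```scala
// One iteration is enough: this test only checks the summary types,
// not the quality of the fit.
val lr = new LogisticRegression()
  .setFamily("binomial")
  .setMaxIter(1)
```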


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133771738
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1018,19 +1025,33 @@ class LogisticRegressionModel private[spark] (
 throw new SparkException("No training summary available for this 
LogisticRegressionModel")
   }
 
+  @Since("2.2.0")
--- End diff --

update to 2.3


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133828620
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1484,3 +1556,96 @@ class BinaryLogisticRegressionSummary 
private[classification] (
 binaryMetrics.recallByThreshold().toDF("threshold", "recall")
   }
 }
+
+sealed trait BinaryLogisticRegressionTrainingSummary extends 
BinaryLogisticRegressionSummary
+  with LogisticRegressionTrainingSummary
+
+/**
+ * :: Experimental ::
--- End diff --

No need for Experimental tags on private classes


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133772374
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1018,19 +1025,33 @@ class LogisticRegressionModel private[spark] (
 throw new SparkException("No training summary available for this 
LogisticRegressionModel")
   }
 
+  @Since("2.2.0")
--- End diff --

Also, add Scala doc indicating that this will throw an exception if it is a 
multiclass model.
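
For example (the method name, doc wording, and exception type below are illustrative, not prescribed):

```scala
/**
 * Gets summary of model on training set. An exception is thrown
 * if this is a multiclass model or if no summary is available.
 */
@Since("2.3.0")
def binarySummary: BinaryLogisticRegressionTrainingSummary = summary match {
  case b: BinaryLogisticRegressionTrainingSummary => b
  case _ => throw new RuntimeException(
    "Cannot create a binary summary for a non-binary model.")
}
```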


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133830499
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -199,15 +199,57 @@ class LogisticRegressionSuite
 }
   }
 
-  test("empty probabilityCol") {
-val lr = new LogisticRegression().setProbabilityCol("")
-val model = lr.fit(smallBinaryDataset)
-assert(model.hasSummary)
-// Validate that we re-insert a probability column for evaluation
-val fieldNames = model.summary.predictions.schema.fieldNames
-assert(smallBinaryDataset.schema.fieldNames.toSet.subsetOf(
-  fieldNames.toSet))
-assert(fieldNames.exists(s => s.startsWith("probability_")))
+  test("empty probabilityCol or predictionCol") {
+val lr = new LogisticRegression().setMaxIter(1)
+val datasetFieldNames = smallBinaryDataset.schema.fieldNames.toSet
+def checkSummarySchema(model: LogisticRegressionModel, columns: 
Seq[String]): Unit = {
+  val fieldNames = model.summary.predictions.schema.fieldNames
+  assert(model.hasSummary)
+  assert(datasetFieldNames.subsetOf(fieldNames.toSet))
+  columns.foreach { c => assert(fieldNames.exists(_.startsWith(c))) }
+}
+// check that the summary model adds the appropriate columns
+Seq(("binomial", smallBinaryDataset), ("multinomial", 
smallMultinomialDataset)).foreach {
+  case (family, dataset) =>
--- End diff --

You are not using "dataset" in this.  However, this logic should be the 
same for both binomial and multinomial families, so I'm OK if you just test the 
binomial case here.
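
Concretely, the loop body could fit on the per-family dataset (sketch):

```scala
Seq(("binomial", smallBinaryDataset), ("multinomial", smallMultinomialDataset))
  .foreach { case (family, dataset) =>
    lr.setFamily(family).setProbabilityCol("").setPredictionCol("prediction")
    // Fit on "dataset" so the multinomial branch is actually exercised.
    checkSummarySchema(lr.fit(dataset), Seq("probability_"))
  }
```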


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133636149
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -882,21 +882,28 @@ class LogisticRegression @Since("1.2.0") (
 
 val model = copyValues(new LogisticRegressionModel(uid, 
coefficientMatrix, interceptVector,
   numClasses, isMultinomial))
-// TODO: implement summary model for multinomial case
-val m = if (!isMultinomial) {
-  val (summaryModel, probabilityColName) = 
model.findSummaryModelAndProbabilityCol()
-  val logRegSummary = new BinaryLogisticRegressionTrainingSummary(
+
+val (summaryModel, probabilityColName, predictionColName) = 
model.findSummaryModel()
+val logRegSummary = if (!isMultinomial || (isMultinomial && numClasses 
<= 2)) {
--- End diff --

Oh sorry!  I didn't think about this case where a model is specified as 
multinomial but has 2 classes.  What you had before is better: ```numClasses 
<=2``` (without using ```isMultinomial```).  Feel free to question my 
suggestions :P 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133589180
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -882,21 +882,28 @@ class LogisticRegression @Since("1.2.0") (
 
 val model = copyValues(new LogisticRegressionModel(uid, 
coefficientMatrix, interceptVector,
   numClasses, isMultinomial))
-// TODO: implement summary model for multinomial case
-val m = if (!isMultinomial) {
-  val (summaryModel, probabilityColName) = 
model.findSummaryModelAndProbabilityCol()
-  val logRegSummary = new BinaryLogisticRegressionTrainingSummary(
+
+val (summaryModel, probabilityColName, predictionColName) = 
model.findSummaryModel()
+val logRegSummary = if (numClasses <= 2) {
--- End diff --

Use isMultinomial since it is easier to read.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133598714
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1324,90 +1350,130 @@ private[ml] class MultiClassSummarizer extends 
Serializable {
 }
 
 /**
- * Abstraction for multinomial Logistic Regression Training results.
- * Currently, the training summary ignores the training weights except
- * for the objective trace.
- */
-sealed trait LogisticRegressionTrainingSummary extends 
LogisticRegressionSummary {
-
-  /** objective function (scaled loss + regularization) at each iteration. 
*/
-  def objectiveHistory: Array[Double]
-
-  /** Number of training iterations until termination */
-  def totalIterations: Int = objectiveHistory.length
-
-}
-
-/**
- * Abstraction for Logistic Regression Results for a given model.
+ * Abstraction for logistic regression results for a given model.
  */
 sealed trait LogisticRegressionSummary extends Serializable {
 
   /**
* Dataframe output by the model's `transform` method.
*/
+  @Since("2.3.0")
   def predictions: DataFrame
 
   /** Field in "predictions" which gives the probability of each class as 
a vector. */
+  @Since("2.3.0")
   def probabilityCol: String
 
+  /** Field in "predictions" which gives the prediction of each class. */
+  @Since("2.3.0")
+  def predictionCol: String
+
   /** Field in "predictions" which gives the true label of each instance 
(if available). */
+  @Since("2.3.0")
   def labelCol: String
 
   /** Field in "predictions" which gives the features of each instance as 
a vector. */
+  @Since("2.3.0")
   def featuresCol: String
 
+  @transient private val multiclassMetrics = {
+new MulticlassMetrics(
+  predictions.select(
+col(predictionCol),
+col(labelCol).cast(DoubleType))
+.rdd.map { case Row(prediction: Double, label: Double) => 
(prediction, label) })
+  }
+
+  /** Returns true positive rate for each label. */
+  @Since("2.3.0")
+  def truePositiveRateByLabel: Array[Double] = recallByLabel
+
+  /** Returns false positive rate for each label. */
+  @Since("2.3.0")
+  def falsePositiveRateByLabel: Array[Double] = {
+multiclassMetrics.labels.map(label => 
multiclassMetrics.falsePositiveRate(label))
+  }
+
+  /** Returns precision for each label. */
+  @Since("2.3.0")
+  def precisionByLabel: Array[Double] = {
+multiclassMetrics.labels.map(label => 
multiclassMetrics.precision(label))
+  }
+
+  /** Returns recall for each label. */
+  @Since("2.3.0")
+  def recallByLabel: Array[Double] = {
+multiclassMetrics.labels.map(label => multiclassMetrics.recall(label))
+  }
+
+  /**
+   * Returns f-measure for each label.
+   */
+  @Since("2.3.0")
+  def fMeasureByLabel(beta: Double): Array[Double] = {
+multiclassMetrics.labels.map(label => 
multiclassMetrics.fMeasure(label, beta))
+  }
+
+  /** Returns f1-measure for each label. */
+  @Since("2.3.0")
+  def fMeasureByLabel: Array[Double] = fMeasureByLabel(1.0)
+
+  /** Returns accuracy. */
+  @Since("2.3.0")
+  def accuracy: Double = multiclassMetrics.accuracy
+
+  /** Returns weighted true positive rate. */
+  @Since("2.3.0")
+  def weightedTruePositiveRate: Double = weightedRecall
+
+  /** Returns weighted false positive rate. */
+  @Since("2.3.0")
+  def weightedFalsePositiveRate: Double = 
multiclassMetrics.weightedFalsePositiveRate
+
+  /** Returns weighted averaged recall. */
+  @Since("2.3.0")
+  def weightedRecall: Double = multiclassMetrics.weightedRecall
--- End diff --

For all of these, can you please make sure to copy all of the info from the 
MulticlassMetrics Scala doc?  There is some info missing, and users cannot be 
expected to find it on their own.
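
For instance, for the beta-parameterized f-measure, the carried-over doc might look like this (the wording should be checked against the MulticlassMetrics source rather than taken from here):

```scala
/**
 * Returns weighted averaged f-measure.
 *
 * @param beta the beta parameter of the f-measure: beta > 1 weights
 *             recall higher, beta < 1 weights precision higher.
 */
@Since("2.3.0")
def weightedFMeasure(beta: Double): Double =
  multiclassMetrics.weightedFMeasure(beta)
```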


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-08-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r133598792
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1324,90 +1350,130 @@ private[ml] class MultiClassSummarizer extends 
Serializable {
 }
 
 /**
- * Abstraction for multinomial Logistic Regression Training results.
- * Currently, the training summary ignores the training weights except
- * for the objective trace.
- */
-sealed trait LogisticRegressionTrainingSummary extends 
LogisticRegressionSummary {
-
-  /** objective function (scaled loss + regularization) at each iteration. 
*/
-  def objectiveHistory: Array[Double]
-
-  /** Number of training iterations until termination */
-  def totalIterations: Int = objectiveHistory.length
-
-}
-
-/**
- * Abstraction for Logistic Regression Results for a given model.
+ * Abstraction for logistic regression results for a given model.
  */
 sealed trait LogisticRegressionSummary extends Serializable {
 
   /**
* Dataframe output by the model's `transform` method.
*/
+  @Since("2.3.0")
   def predictions: DataFrame
 
   /** Field in "predictions" which gives the probability of each class as 
a vector. */
+  @Since("2.3.0")
   def probabilityCol: String
 
+  /** Field in "predictions" which gives the prediction of each class. */
+  @Since("2.3.0")
+  def predictionCol: String
+
   /** Field in "predictions" which gives the true label of each instance 
(if available). */
+  @Since("2.3.0")
   def labelCol: String
 
   /** Field in "predictions" which gives the features of each instance as 
a vector. */
+  @Since("2.3.0")
   def featuresCol: String
 
+  @transient private val multiclassMetrics = {
+new MulticlassMetrics(
+  predictions.select(
+col(predictionCol),
+col(labelCol).cast(DoubleType))
+.rdd.map { case Row(prediction: Double, label: Double) => 
(prediction, label) })
+  }
+
+  /** Returns true positive rate for each label. */
+  @Since("2.3.0")
+  def truePositiveRateByLabel: Array[Double] = recallByLabel
+
+  /** Returns false positive rate for each label. */
+  @Since("2.3.0")
+  def falsePositiveRateByLabel: Array[Double] = {
+multiclassMetrics.labels.map(label => 
multiclassMetrics.falsePositiveRate(label))
+  }
+
+  /** Returns precision for each label. */
+  @Since("2.3.0")
+  def precisionByLabel: Array[Double] = {
+multiclassMetrics.labels.map(label => 
multiclassMetrics.precision(label))
+  }
+
+  /** Returns recall for each label. */
+  @Since("2.3.0")
+  def recallByLabel: Array[Double] = {
+multiclassMetrics.labels.map(label => multiclassMetrics.recall(label))
+  }
+
+  /**
+   * Returns f-measure for each label.
+   */
+  @Since("2.3.0")
+  def fMeasureByLabel(beta: Double): Array[Double] = {
+multiclassMetrics.labels.map(label => 
multiclassMetrics.fMeasure(label, beta))
+  }
+
+  /** Returns f1-measure for each label. */
+  @Since("2.3.0")
+  def fMeasureByLabel: Array[Double] = fMeasureByLabel(1.0)
+
+  /** Returns accuracy. */
+  @Since("2.3.0")
+  def accuracy: Double = multiclassMetrics.accuracy
+
+  /** Returns weighted true positive rate. */
+  @Since("2.3.0")
+  def weightedTruePositiveRate: Double = weightedRecall
+
+  /** Returns weighted false positive rate. */
+  @Since("2.3.0")
+  def weightedFalsePositiveRate: Double = 
multiclassMetrics.weightedFalsePositiveRate
+
+  /** Returns weighted averaged recall. */
+  @Since("2.3.0")
+  def weightedRecall: Double = multiclassMetrics.weightedRecall
+
+  /** Returns weighted averaged precision. */
+  @Since("2.3.0")
+  def weightedPrecision: Double = multiclassMetrics.weightedPrecision
+
+  /**
+   * Returns weighted averaged f-measure.
+   */
+  @Since("2.3.0")
+  def weightedFMeasure(beta: Double): Double = 
multiclassMetrics.weightedFMeasure(beta)
+
+  /** Returns weighted averaged f1-measure. */
+  @Since("2.3.0")
+  def weightedFMeasure: Double = multiclassMetrics.weightedFMeasure(1.0)
 }
 
 /**
- * :: Experimental ::
- * Logistic regression training results.
- *
- * @param predictions dataframe output by the model's `transform` method.
- * @param probabilityCol field in "predictions" which gives the probability of
- *   each class as a vector.

[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-08-16 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/15435
  
I'll do a review pass now, despite the merge conflicts


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17373: [SPARK-12664][ML] Expose probability in mlp model

2017-08-16 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/17373
  
Thinking more about the proposal about separating the 
classification-specific logic out of the generic Topology, it's something we 
should definitely do at some point, but I'm OK with leaving it as is for now.  
Adding new, unused classes is probably not worth the trouble right now.  Can 
you please document very clearly, though, that predictRaw and 
raw2ProbabilityInPlace are only for classification?  Thanks!
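
For instance, the trait doc might call this out directly (suggested wording only; the parameter rename follows the earlier review comment):

```scala
/**
 * Raw prediction of the model. Note: this is defined only for
 * classification topologies; it is not part of a generic topology
 * abstraction that could later cover regression.
 *
 * @param features input features
 * @return raw prediction (scores before the output layer's activation)
 */
def predictRaw(features: Vector): Vector
```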



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18896: [SPARK-21681][ML] fix bug of MLOR do not work correctly ...

2017-08-15 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/18896
  
LGTM except for making the test's title more descriptive.  Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17373: [SPARK-12664][ML] Expose probability in mlp model

2017-08-15 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/17373#discussion_r133324363
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala ---
@@ -527,9 +550,21 @@ private[ml] class FeedForwardModel private(
 
   override def predict(data: Vector): Vector = {
 val size = data.size
-val result = forward(new BDM[Double](size, 1, data.toArray))
+val result = forward(new BDM[Double](size, 1, data.toArray), true)
 Vectors.dense(result.last.toArray)
   }
+
+  override def predictRaw(data: Vector): Vector = {
+val size = data.size
+val result = forward(new BDM[Double](size, 1, data.toArray), false)
+Vectors.dense(result(result.length - 2).toArray)
+  }
+
+  override def raw2ProbabilityInPlace(data: Vector): Vector = {
+val dataMatrix = new BDM[Double](data.size, 1, data.toArray)
+layerModels.last.eval(dataMatrix, dataMatrix)
--- End diff --

Ping: If this proposal sounds good, then can you please update accordingly?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17373: [SPARK-12664][ML] Expose probability in mlp model

2017-08-15 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/17373#discussion_r133322927
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala
 ---
@@ -82,6 +83,49 @@ class MultilayerPerceptronClassifierSuite
 }
   }
 
+  test("strong dataset test") {
+val layers = Array[Int](4, 5, 5, 2)
+
+val strongDataset = Seq(
+  (Vectors.dense(1, 2, 3, 4), 0d, Vectors.dense(1d, 0d)),
+  (Vectors.dense(4, 3, 2, 1), 1d, Vectors.dense(0d, 1d)),
+  (Vectors.dense(1, 1, 1, 1), 0d, Vectors.dense(.5, .5)),
+  (Vectors.dense(1, 1, 1, 1), 1d, Vectors.dense(.5, .5))
+).toDF("features", "label", "expectedProbability")
+val trainer = new MultilayerPerceptronClassifier()
+  .setLayers(layers)
+  .setBlockSize(1)
+  .setSeed(123L)
+  .setMaxIter(100)
+  .setSolver("l-bfgs")
+val model = trainer.fit(strongDataset)
+val result = model.transform(strongDataset)
+model.setProbabilityCol("probability")
+MLTestingUtils.checkCopyAndUids(trainer, model)
+// result.select("probability").show(false)
+result.select("probability", "expectedProbability").collect().foreach {
+  case Row(p: Vector, e: Vector) =>
+assert(p ~== e absTol 1e-3)
+}
+  }
+
+  test("test model probability") {
+val layers = Array[Int](2, 5, 2)
+val trainer = new MultilayerPerceptronClassifier()
+  .setLayers(layers)
+  .setBlockSize(1)
+  .setSeed(123L)
+  .setMaxIter(100)
+  .setSolver("l-bfgs")
+val model = trainer.fit(dataset)
+model.setProbabilityCol("probability")
--- End diff --

Ping --- this should not be necessary


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17373: [SPARK-12664][ML] Expose probability in mlp model

2017-08-15 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/17373#discussion_r133323889
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala ---
@@ -374,6 +380,22 @@ private[ann] trait TopologyModel extends Serializable {
   def predict(data: Vector): Vector
 
   /**
+   * Raw prediction of the model
+   *
+   * @param data input data
+   * @return raw prediction
+   */
+  def predictRaw(data: Vector): Vector
--- End diff --

Ping: rename data -> features


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17373: [SPARK-12664][ML] Expose probability in mlp model

2017-08-14 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/17373#discussion_r133081809
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala ---
@@ -527,9 +550,21 @@ private[ml] class FeedForwardModel private(
 
   override def predict(data: Vector): Vector = {
 val size = data.size
-val result = forward(new BDM[Double](size, 1, data.toArray))
+val result = forward(new BDM[Double](size, 1, data.toArray), true)
 Vectors.dense(result.last.toArray)
   }
+
+  override def predictRaw(data: Vector): Vector = {
+val size = data.size
+val result = forward(new BDM[Double](size, 1, data.toArray), false)
+Vectors.dense(result(result.length - 2).toArray)
+  }
+
+  override def raw2ProbabilityInPlace(data: Vector): Vector = {
+val dataMatrix = new BDM[Double](data.size, 1, data.toArray)
+layerModels.last.eval(dataMatrix, dataMatrix)
--- End diff --

This assumes that the ```eval``` method can operate in-place.  That is fine 
for the last layer for MLP (SoftmaxLayerModelWithCrossEntropyLoss), but not OK 
in general.  More generally, these methods for classifiers should not go in the 
very general TopologyModel abstraction; that abstraction may be used in the 
future for regression as well.  I'd be fine with putting this 
classification-specific logic in MLP itself; we do not need to generalize the 
logic until we add other Classifiers, which might take a long time.
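
If the in-place entry point stays on the MLP side, a hedged sketch of a variant that avoids aliasing input and output buffers (assuming a dense vector, whose backing storage toArray exposes):

```scala
override def raw2ProbabilityInPlace(data: Vector): Vector = {
  val values = data.toArray // backing storage for a DenseVector
  val raw = new BDM[Double](data.size, 1, values.clone()) // read-only copy
  val out = new BDM[Double](data.size, 1, values) // writes land in "data"
  // The last layer reads the copied raw scores and writes into the
  // vector's own storage, so eval need not support aliased buffers.
  layerModels.last.eval(raw, out)
  data
}
```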


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17373: [SPARK-12664][ML] Expose probability in mlp model

2017-08-14 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/17373#discussion_r133079098
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala ---
@@ -374,6 +380,22 @@ private[ann] trait TopologyModel extends Serializable {
   def predict(data: Vector): Vector
 
   /**
+   * Raw prediction of the model
+   *
+   * @param data input data
+   * @return raw prediction
+   */
+  def predictRaw(data: Vector): Vector
+
+  /**
+   * Probability of the model
--- End diff --

This documentation does not add any information.  Can you please link to 
ProbabilisticClassifier instead?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


