[GitHub] spark pull request #16441: [SPARK-14975][ML][WIP] Fixed GBTClassifier to pre...

2017-01-06 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r94957348
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -248,12 +269,38 @@ class GBTClassificationModel private[ml](
 if (prediction > 0.0) 1.0 else 0.0
   }
 
+  override protected def predictRaw(features: Vector): Vector = {
+val treePredictions = 
_trees.map(_.rootNode.predictImpl(features).prediction)
+val prediction = blas.ddot(numTrees, treePredictions, 1, _treeWeights, 
1)
--- End diff --

oh ok, thank you for confirming


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16441: [SPARK-14975][ML][WIP] Fixed GBTClassifier to pre...

2017-01-06 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r94957214
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -248,12 +269,38 @@ class GBTClassificationModel private[ml](
 if (prediction > 0.0) 1.0 else 0.0
   }
 
+  override protected def predictRaw(features: Vector): Vector = {
--- End diff --

sorry, I'm still a little confused, should I override 
probability2prediction and simplify, or should I keep the argmax as is?  The 
argmax seems better because it is more general anyway, but please let me know 
if you would prefer that I make any changes here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16441: [SPARK-14975][ML][WIP] Fixed GBTClassifier to pre...

2017-01-06 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r94957257
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -248,12 +269,38 @@ class GBTClassificationModel private[ml](
 if (prediction > 0.0) 1.0 else 0.0
   }
 
+  override protected def predictRaw(features: Vector): Vector = {
--- End diff --

sorry, I'm still a little confused, should I override 
probability2prediction and simplify, or should I keep the argmax as is?  The 
argmax seems better because it is more general anyway, but please let me know 
if you would prefer that I make any changes here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16441: [SPARK-14975][ML][WIP] Fixed GBTClassifier to pre...

2017-01-05 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r94902030
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -248,12 +269,38 @@ class GBTClassificationModel private[ml](
 if (prediction > 0.0) 1.0 else 0.0
   }
 
+  override protected def predictRaw(features: Vector): Vector = {
--- End diff --

I can see how my comment was confusing now :) Since GBT only supports two 
classes right now, we could override methods like `probability2prediction` 
which are by default calling what is implemented in `ProbabilisticClassifier`. 
That method loops through all the classes in a while loop, divides by the 
threshold and takes the largest one. Since we know we only have two classes 
here, we could instead do something faster, like

scala
if (probability(1) > getThreshold) 1 else 0



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16441: [SPARK-14975][ML][WIP] Fixed GBTClassifier to pre...

2017-01-05 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r94901794
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -248,12 +269,38 @@ class GBTClassificationModel private[ml](
 if (prediction > 0.0) 1.0 else 0.0
   }
 
+  override protected def predictRaw(features: Vector): Vector = {
+val treePredictions = 
_trees.map(_.rootNode.predictImpl(features).prediction)
+val prediction = blas.ddot(numTrees, treePredictions, 1, _treeWeights, 
1)
--- End diff --

Yeah, I see it's not quite the same as in other places. We can leave it


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16441: [SPARK-14975][ML][WIP] Fixed GBTClassifier to pre...

2017-01-05 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r94898917
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala
 ---
@@ -66,10 +66,39 @@ class GBTClassifierSuite extends SparkFunSuite with 
MLlibTestSparkContext
 ParamsSuite.checkParams(new GBTClassifier)
 val model = new GBTClassificationModel("gbtc",
   Array(new DecisionTreeRegressionModel("dtr", new LeafNode(0.0, 0.0, 
null), 1)),
-  Array(1.0), 1)
+  Array(1.0), 1, 2)
 ParamsSuite.checkParams(model)
   }
 
+  test("Verify raw scores correspond to labels") {
+val rawPredictionCol = "MyRawPrediction"
+val predictionCol = "MyPrediction"
+val labelCol = "label"
+val featuresCol = "features"
+val gbt = new GBTClassifier()
+  .setMaxDepth(2)
+  .setLossType("logistic")
+  .setMaxIter(5)
+  .setStepSize(0.1)
+  .setCheckpointInterval(2)
+  .setSeed(123)
+  .setRawPredictionCol(rawPredictionCol)
+  .setPredictionCol(predictionCol)
+  .setLabelCol(labelCol)
+  .setFeaturesCol(featuresCol)
+val gbtModel = gbt.fit(trainData.toDF(labelCol, featuresCol))
+val scoredData = gbtModel.transform(validationData.toDF(labelCol, 
featuresCol))
+scoredData.select(rawPredictionCol, predictionCol).collect()
+  .foreach(row => {
--- End diff --

done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16441: [SPARK-14975][ML][WIP] Fixed GBTClassifier to pre...

2017-01-05 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r94898909
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala
 ---
@@ -66,10 +66,39 @@ class GBTClassifierSuite extends SparkFunSuite with 
MLlibTestSparkContext
 ParamsSuite.checkParams(new GBTClassifier)
 val model = new GBTClassificationModel("gbtc",
   Array(new DecisionTreeRegressionModel("dtr", new LeafNode(0.0, 0.0, 
null), 1)),
-  Array(1.0), 1)
+  Array(1.0), 1, 2)
 ParamsSuite.checkParams(model)
   }
 
+  test("Verify raw scores correspond to labels") {
+val rawPredictionCol = "MyRawPrediction"
--- End diff --

done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16441: [SPARK-14975][ML][WIP] Fixed GBTClassifier to pre...

2017-01-05 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r94898798
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala
 ---
@@ -66,10 +66,39 @@ class GBTClassifierSuite extends SparkFunSuite with 
MLlibTestSparkContext
 ParamsSuite.checkParams(new GBTClassifier)
 val model = new GBTClassificationModel("gbtc",
   Array(new DecisionTreeRegressionModel("dtr", new LeafNode(0.0, 0.0, 
null), 1)),
-  Array(1.0), 1)
+  Array(1.0), 1, 2)
 ParamsSuite.checkParams(model)
   }
 
+  test("Verify raw scores correspond to labels") {
--- End diff --

done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16441: [SPARK-14975][ML][WIP] Fixed GBTClassifier to pre...

2017-01-05 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r94877056
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -248,12 +269,38 @@ class GBTClassificationModel private[ml](
 if (prediction > 0.0) 1.0 else 0.0
   }
 
+  override protected def predictRaw(features: Vector): Vector = {
+val treePredictions = 
_trees.map(_.rootNode.predictImpl(features).prediction)
+val prediction = blas.ddot(numTrees, treePredictions, 1, _treeWeights, 
1)
+Vectors.dense(Array(-prediction, prediction))
+  }
+
+  override protected def raw2probabilityInPlace(rawPrediction: Vector): 
Vector = {
+rawPrediction match {
+  // The probability can be calculated for positive result:
+  // p+(x) = 1 / (1 + e^(-2 * F(x)))
+  // and negative result:
+  // p-(x) = 1 / (1 + e^(2 * F(x)))
+  case dv: DenseVector =>
+var i = 0
+val size = dv.size
+while (i < size) {
+  dv.values(i) = 1 / (1 + math.exp(-2 * dv.values(i)))
--- End diff --

done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16441: [SPARK-14975][ML][WIP] Fixed GBTClassifier to pre...

2017-01-05 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r94875629
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -248,12 +269,38 @@ class GBTClassificationModel private[ml](
 if (prediction > 0.0) 1.0 else 0.0
   }
 
+  override protected def predictRaw(features: Vector): Vector = {
+val treePredictions = 
_trees.map(_.rootNode.predictImpl(features).prediction)
+val prediction = blas.ddot(numTrees, treePredictions, 1, _treeWeights, 
1)
--- End diff --

it looks like BLAS.dot is only for Vector, but these are both arrays.  I'm 
worried that this may degrade performance.  Is this specifically what you are 
looking for:
BLAS.dot(Vectors.dense(treePredictions), Vectors.dense(_treeWeights))
is the extra dense vector allocation worth it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16441: [SPARK-14975][ML][WIP] Fixed GBTClassifier to pre...

2017-01-05 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r94874968
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -248,12 +269,38 @@ class GBTClassificationModel private[ml](
 if (prediction > 0.0) 1.0 else 0.0
   }
 
+  override protected def predictRaw(features: Vector): Vector = {
--- End diff --

sorry I'm a bit confused, this classifier also only deals with two classes, 
it does not support multiclass data.  Instead of overriding, what is the 
alternative?  There is no default predictRaw or raw2probability implemented in 
probabilistic classifier, and it seems that this is the minimum required for 
GBTClassifier to use ProbabilisticClassifier.  Can you please give more 
information on this point?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16441: [SPARK-14975][ML][WIP] Fixed GBTClassifier to pre...

2017-01-05 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r94873411
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -215,10 +223,23 @@ class GBTClassificationModel private[ml](
*
* @param _trees  Decision trees in the ensemble.
* @param _treeWeights  Weights for the decision trees in the ensemble.
+   * @param numFeatures  The number of features.
+   */
+  @Since("1.6.0")
+  def this(uid: String, _trees: Array[DecisionTreeRegressionModel],
+   _treeWeights: Array[Double], numFeatures: Int) =
+this(uid, _trees, _treeWeights, numFeatures, 2)
+
+  /**
+   * Construct a GBTClassificationModel
+   *
+   * @param _trees  Decision trees in the ensemble.
+   * @param _treeWeights  Weights for the decision trees in the ensemble.
*/
   @Since("1.6.0")
-  def this(uid: String, _trees: Array[DecisionTreeRegressionModel], 
_treeWeights: Array[Double]) =
-this(uid, _trees, _treeWeights, -1)
+  def this(uid: String, _trees: Array[DecisionTreeRegressionModel],
+   _treeWeights: Array[Double]) =
--- End diff --

done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16441: [SPARK-14975][ML][WIP] Fixed GBTClassifier to pre...

2017-01-05 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r94873371
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -215,10 +223,23 @@ class GBTClassificationModel private[ml](
*
* @param _trees  Decision trees in the ensemble.
* @param _treeWeights  Weights for the decision trees in the ensemble.
+   * @param numFeatures  The number of features.
+   */
+  @Since("1.6.0")
--- End diff --

removed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16441: [SPARK-14975][ML][WIP] Fixed GBTClassifier to pre...

2017-01-05 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r94868309
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala
 ---
@@ -66,10 +66,39 @@ class GBTClassifierSuite extends SparkFunSuite with 
MLlibTestSparkContext
 ParamsSuite.checkParams(new GBTClassifier)
 val model = new GBTClassificationModel("gbtc",
   Array(new DecisionTreeRegressionModel("dtr", new LeafNode(0.0, 0.0, 
null), 1)),
-  Array(1.0), 1)
+  Array(1.0), 1, 2)
 ParamsSuite.checkParams(model)
   }
 
+  test("Verify raw scores correspond to labels") {
+val rawPredictionCol = "MyRawPrediction"
+val predictionCol = "MyPrediction"
+val labelCol = "label"
+val featuresCol = "features"
+val gbt = new GBTClassifier()
+  .setMaxDepth(2)
+  .setLossType("logistic")
+  .setMaxIter(5)
+  .setStepSize(0.1)
+  .setCheckpointInterval(2)
+  .setSeed(123)
+  .setRawPredictionCol(rawPredictionCol)
+  .setPredictionCol(predictionCol)
+  .setLabelCol(labelCol)
+  .setFeaturesCol(featuresCol)
+val gbtModel = gbt.fit(trainData.toDF(labelCol, featuresCol))
+val scoredData = gbtModel.transform(validationData.toDF(labelCol, 
featuresCol))
+scoredData.select(rawPredictionCol, predictionCol).collect()
+  .foreach(row => {
--- End diff --

you can use `.foreach { case Row(raw: DenseVector, pred: Double, prob: 
DenseVector) => ... }` here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16441: [SPARK-14975][ML][WIP] Fixed GBTClassifier to pre...

2017-01-05 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r94854198
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -215,10 +223,23 @@ class GBTClassificationModel private[ml](
*
* @param _trees  Decision trees in the ensemble.
* @param _treeWeights  Weights for the decision trees in the ensemble.
+   * @param numFeatures  The number of features.
+   */
+  @Since("1.6.0")
+  def this(uid: String, _trees: Array[DecisionTreeRegressionModel],
+   _treeWeights: Array[Double], numFeatures: Int) =
+this(uid, _trees, _treeWeights, numFeatures, 2)
+
+  /**
+   * Construct a GBTClassificationModel
+   *
+   * @param _trees  Decision trees in the ensemble.
+   * @param _treeWeights  Weights for the decision trees in the ensemble.
*/
   @Since("1.6.0")
-  def this(uid: String, _trees: Array[DecisionTreeRegressionModel], 
_treeWeights: Array[Double]) =
-this(uid, _trees, _treeWeights, -1)
+  def this(uid: String, _trees: Array[DecisionTreeRegressionModel],
+   _treeWeights: Array[Double]) =
--- End diff --

put this back on one line


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16441: [SPARK-14975][ML][WIP] Fixed GBTClassifier to pre...

2017-01-05 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r94857458
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -248,12 +269,38 @@ class GBTClassificationModel private[ml](
 if (prediction > 0.0) 1.0 else 0.0
   }
 
+  override protected def predictRaw(features: Vector): Vector = {
+val treePredictions = 
_trees.map(_.rootNode.predictImpl(features).prediction)
+val prediction = blas.ddot(numTrees, treePredictions, 1, _treeWeights, 
1)
--- End diff --

We should import `org.apache.spark.ml.linalg.BLAS` and call `BLAS.dot` here 
and in `predict`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16441: [SPARK-14975][ML][WIP] Fixed GBTClassifier to pre...

2017-01-05 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r94861226
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -248,12 +269,38 @@ class GBTClassificationModel private[ml](
 if (prediction > 0.0) 1.0 else 0.0
   }
 
+  override protected def predictRaw(features: Vector): Vector = {
+val treePredictions = 
_trees.map(_.rootNode.predictImpl(features).prediction)
+val prediction = blas.ddot(numTrees, treePredictions, 1, _treeWeights, 
1)
+Vectors.dense(Array(-prediction, prediction))
+  }
+
+  override protected def raw2probabilityInPlace(rawPrediction: Vector): 
Vector = {
+rawPrediction match {
+  // The probability can be calculated for positive result:
+  // p+(x) = 1 / (1 + e^(-2 * F(x)))
+  // and negative result:
+  // p-(x) = 1 / (1 + e^(2 * F(x)))
+  case dv: DenseVector =>
+var i = 0
+val size = dv.size
+while (i < size) {
+  dv.values(i) = 1 / (1 + math.exp(-2 * dv.values(i)))
--- End diff --

my concern is that this is hard coded to logistic loss. Maybe we can add a 
static method to GBTClassificationModel 

scala
private def classProbability(class: Int, loss: String, rawPrediction: 
Double): Double = {
  loss match {
case "logistic" => ...
case _ => throw new Exception("Only logistic loss is supported ...")
  }
}



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16441: [SPARK-14975][ML][WIP] Fixed GBTClassifier to pre...

2017-01-05 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r94867743
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala
 ---
@@ -66,10 +66,39 @@ class GBTClassifierSuite extends SparkFunSuite with 
MLlibTestSparkContext
 ParamsSuite.checkParams(new GBTClassifier)
 val model = new GBTClassificationModel("gbtc",
   Array(new DecisionTreeRegressionModel("dtr", new LeafNode(0.0, 0.0, 
null), 1)),
-  Array(1.0), 1)
+  Array(1.0), 1, 2)
 ParamsSuite.checkParams(model)
   }
 
+  test("Verify raw scores correspond to labels") {
--- End diff --

Could you take a look at [this 
test](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala#L314),
 and make it line up here? Specifically:

* compute probabilities manually from rawPrediction and ensure that it 
matches the probabilities column
* make sure that probabilities.argmax and rawPrediction.argmax equal the 
prediction
* make sure probabilities sum to one
* check the different code paths by unsetting some of the output columns


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16441: [SPARK-14975][ML][WIP] Fixed GBTClassifier to pre...

2017-01-05 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r94855997
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -215,10 +223,23 @@ class GBTClassificationModel private[ml](
*
* @param _trees  Decision trees in the ensemble.
* @param _treeWeights  Weights for the decision trees in the ensemble.
+   * @param numFeatures  The number of features.
+   */
+  @Since("1.6.0")
--- End diff --

This is actually not correct since the constructor was `private[ml]` 
before. Since this has always been private, and we aren't actually using it 
anywhere, I think we can remove this constructor entirely.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16441: [SPARK-14975][ML][WIP] Fixed GBTClassifier to pre...

2017-01-05 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r94861988
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala
 ---
@@ -66,10 +66,39 @@ class GBTClassifierSuite extends SparkFunSuite with 
MLlibTestSparkContext
 ParamsSuite.checkParams(new GBTClassifier)
 val model = new GBTClassificationModel("gbtc",
   Array(new DecisionTreeRegressionModel("dtr", new LeafNode(0.0, 0.0, 
null), 1)),
-  Array(1.0), 1)
+  Array(1.0), 1, 2)
 ParamsSuite.checkParams(model)
   }
 
+  test("Verify raw scores correspond to labels") {
+val rawPredictionCol = "MyRawPrediction"
--- End diff --

Just use defaults here. And I'm in favor of only setting parameters that 
matter for the given test, otherwise it may give the impression that the test 
depends on a certain, say checkpoint interval.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16441: [SPARK-14975][ML][WIP] Fixed GBTClassifier to pre...

2017-01-05 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r94868112
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -248,12 +269,38 @@ class GBTClassificationModel private[ml](
 if (prediction > 0.0) 1.0 else 0.0
   }
 
+  override protected def predictRaw(features: Vector): Vector = {
--- End diff --

In logistic regression we had previously overridden some of the methods in 
probabilistic classifier since we were only dealing with two classes, which 
makes those methods a bit faster (hard to say how much). We can do it here for 
now, but I'd be slightly in favor of _not_ doing it since I'm not sure how much 
we gain from it and it makes the code harder to follow. Thoughts?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16441: [SPARK-14975][ML][WIP] Fixed GBTClassifier to pre...

2016-12-30 Thread imatiach-msft
GitHub user imatiach-msft opened a pull request:

https://github.com/apache/spark/pull/16441

[SPARK-14975][ML][WIP] Fixed GBTClassifier to predict probability per 
training instance and fixed interfaces

## What changes were proposed in this pull request?

For all of the classifiers in MLLib we can predict probabilities except for 
GBTClassifier.  
Also, all classifiers inherit from ProbabilisticClassifier but 
GBTClassifier strangely inherits from Predictor, which is a bug.
This change corrects the interface and adds the ability for the classifier 
to give a probabilities vector.

## How was this patch tested?

The basic ML tests were run after making the changes.  I've marked this as 
WIP as I need to add more tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/imatiach-msft/spark ilmat/fix-GBT

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16441.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16441


commit 63a9574a0858ed9e4c27a4b698cb50d2475afc0b
Author: Ilya Matiach 
Date:   2016-12-30T20:15:12Z

[SPARK-14975][ML][WIP] Fixed GBTClassifier to predict probability per 
training instance and fixed interfaces

commit 4468891cd83760a2f97ef257ef176a34bc79e5cd
Author: Ilya Matiach 
Date:   2016-12-30T20:20:43Z

Fixed scala style empty line




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org