[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-02-09 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-182138566
  
Here it is: [https://github.com/viirya/spark-1/pull/1]


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-02-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-182143591
  
**[Test build #51012 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51012/consoleFull)**
 for PR 8734 at commit 
[`2bbe037`](https://github.com/apache/spark/commit/2bbe0373e9f3e2d04146bbade7279ca9fcadc6f1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-02-09 Thread sethah
Github user sethah commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-182135907
  
Yes, LGTM pending the improved test, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-02-09 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-182155032
  
LGTM!  Thanks @viirya and @sethah 

I'll merge with master and see how far back I can backport it easily.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-02-09 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-182142095
  
@jkbradley Great thanks. I've merged your PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-02-09 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-182126572
  
@viirya Thanks for the updates.

I think the code is correct now, but I'm going to send you a PR (to update 
this PR) in order to improve the test.  I agree with @sethah that the current 
test does not really test anything.

@sethah Does it look good to you, other than the test?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-02-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-182151869
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-02-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-182151871
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51012/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-02-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-182151701
  
**[Test build #51012 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51012/consoleFull)**
 for PR 8734 at commit 
[`2bbe037`](https://github.com/apache/spark/commit/2bbe0373e9f3e2d04146bbade7279ca9fcadc6f1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-02-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/8734


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-174160685
  
**[Test build #49934 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49934/consoleFull)**
 for PR 8734 at commit 
[`5c44e23`](https://github.com/apache/spark/commit/5c44e23cac423ad57e8e602c227dfa7a0f7b822c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-174165133
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49934/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-174165132
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-174165097
  
**[Test build #49934 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49934/consoleFull)**
 for PR 8734 at commit 
[`5c44e23`](https://github.com/apache/spark/commit/5c44e23cac423ad57e8e602c227dfa7a0f7b822c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-23 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/8734#discussion_r50617823
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
@@ -331,12 +336,62 @@ class DecisionTreeSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 assert(topNode.impurity !== -1.0)
 
 // set impurity and predict for child nodes
-assert(topNode.leftNode.get.predict.predict === 0.0)
-assert(topNode.rightNode.get.predict.predict === 1.0)
+if (topNode.leftNode.get.predict.predict === 0.0) {
+  assert(topNode.rightNode.get.predict.predict === 1.0)
+} else {
+  assert(topNode.leftNode.get.predict.predict === 1.0)
+  assert(topNode.rightNode.get.predict.predict === 0.0)
+}
 assert(topNode.leftNode.get.impurity === 0.0)
 assert(topNode.rightNode.get.impurity === 0.0)
   }
 
+  test("Use soft prediction for binary classification with ordered 
categorical features") {
--- End diff --

Hmm, I just want a test case to show it actually order the bins by soft 
prediction. Although @jkbradley suggested we should use directly 
`binsToBestSplit`, but in order to do that, we also need to expose many details 
of `findBestSplits` too, e.g., `binSeqOp`, `getNodeToFeatures` and 
`partitionAggregates`...etc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-22 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/8734#discussion_r50576539
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
@@ -294,8 +295,12 @@ class DecisionTreeSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 assert(topNode.impurity !== -1.0)
 
 // set impurity and predict for child nodes
-assert(topNode.leftNode.get.predict.predict === 0.0)
-assert(topNode.rightNode.get.predict.predict === 1.0)
+if (topNode.leftNode.get.predict.predict === 0.0) {
--- End diff --

I think now that we are ordering by the correct thing, this change is not 
necessary. Since we order by proportion falling in outcome class 1, the hard 
prediction 0 will always be before hard prediction 1.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-22 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/8734#discussion_r50572835
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -740,7 +740,11 @@ private[ml] object RandomForest extends Logging {
   val categoryStats =
 binAggregates.getImpurityCalculator(nodeFeatureOffset, 
featureValue)
   val centroid = if (categoryStats.count != 0) {
-categoryStats.predict
+if (categoryStats.count == 2) {
--- End diff --

I think you meant `categoryStats.stats.length == 2`. `categoryStats.count` 
is the count of data points falling into that particular bin. Since we are 
trying to determine here whether this is regression or binary classification, I 
think checking `if (binAggregates.metadata.isClassification)` is more clear. 

Additionally, the code under the if and else statements of 
`centroidForCategories` is identical except for a single line. It seems cleaner 
to restructure to something like:

```scala
val centroidForCategories = Range(0, numCategories).map { case featureValue 
=>
  val categoryStats =
binAggregates.getImpurityCalculator(nodeFeatureOffset, featureValue)
  val centroid = if (categoryStats.count != 0) {
if (binAggregates.metadata.isMulticlass) {
  // multiclass classification
  categoryStats.calculate()
} else if (binAggregates.metadata.isClassification) {
  // binary classification
  categoryStats.stats(1)
} else {
  // regression
  categoryStats.predict
}
  } else {
Double.MaxValue
  }
  (featureValue, centroid)
}
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-22 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/8734#discussion_r50601582
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
@@ -331,12 +336,62 @@ class DecisionTreeSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 assert(topNode.impurity !== -1.0)
 
 // set impurity and predict for child nodes
-assert(topNode.leftNode.get.predict.predict === 0.0)
-assert(topNode.rightNode.get.predict.predict === 1.0)
+if (topNode.leftNode.get.predict.predict === 0.0) {
+  assert(topNode.rightNode.get.predict.predict === 1.0)
+} else {
+  assert(topNode.leftNode.get.predict.predict === 1.0)
+  assert(topNode.rightNode.get.predict.predict === 0.0)
+}
 assert(topNode.leftNode.get.impurity === 0.0)
 assert(topNode.rightNode.get.impurity === 0.0)
   }
 
+  test("Use soft prediction for binary classification with ordered 
categorical features") {
--- End diff --

What is the goal of this test? I guessed that the goal would be to find a 
case where ordering by hard predictions produces a different (suboptimal) tree 
than ordering by soft predictions. However, I did a quick simulation for this 
dataset and the results I got were the same either way. Just wanted to clarify.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-22 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/8734#discussion_r50614934
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -740,7 +740,11 @@ private[ml] object RandomForest extends Logging {
   val categoryStats =
 binAggregates.getImpurityCalculator(nodeFeatureOffset, 
featureValue)
   val centroid = if (categoryStats.count != 0) {
-categoryStats.predict
+if (categoryStats.count == 2) {
--- End diff --

@sethah Thanks. You are right. I didn't read this part of codes thoroughly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-173506110
  
**[Test build #49874 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49874/consoleFull)**
 for PR 8734 at commit 
[`a37d3d8`](https://github.com/apache/spark/commit/a37d3d8fc026a7a42405b5a16814e23c6fcfa3be).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-173518885
  
**[Test build #49874 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49874/consoleFull)**
 for PR 8734 at commit 
[`a37d3d8`](https://github.com/apache/spark/commit/a37d3d8fc026a7a42405b5a16814e23c6fcfa3be).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-173519027
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-173519029
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49874/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-20 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/8734#discussion_r50361275
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -740,7 +740,7 @@ private[ml] object RandomForest extends Logging {
   val categoryStats =
 binAggregates.getImpurityCalculator(nodeFeatureOffset, 
featureValue)
   val centroid = if (categoryStats.count != 0) {
-categoryStats.predict
+categoryStats.prob(categoryStats.predict)
--- End diff --

Finding the _proportion_ falling in outcome class 1 simply requires 
division of the counts by a constant. Since we're just using that number for an 
ordering, constant division won't matter. They are the same. 

My initial comment has a typo. It should say for a "binary **outcome**", 
not "binary category". 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-20 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/8734#discussion_r50355886
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -740,7 +740,7 @@ private[ml] object RandomForest extends Logging {
   val categoryStats =
 binAggregates.getImpurityCalculator(nodeFeatureOffset, 
featureValue)
   val centroid = if (categoryStats.count != 0) {
-categoryStats.predict
+categoryStats.prob(categoryStats.predict)
--- End diff --

As I saw from the implementation, `categoryStats.stats(1)` is just the 
count of class 1. Are we going to order bins by that?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-20 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/8734#discussion_r50366332
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -740,7 +740,7 @@ private[ml] object RandomForest extends Logging {
   val categoryStats =
 binAggregates.getImpurityCalculator(nodeFeatureOffset, 
featureValue)
   val centroid = if (categoryStats.count != 0) {
-categoryStats.predict
+categoryStats.prob(categoryStats.predict)
--- End diff --

Yeah, I see. I was thinking we are going to order them by soft prediction 
of each bin. Actually what we want is to order them by soft prediction of 
certain class.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-173484334
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-173484336
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49865/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-173484306
  
**[Test build #49865 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49865/consoleFull)**
 for PR 8734 at commit 
[`cd25214`](https://github.com/apache/spark/commit/cd252142a29e6083648bdec6a0646f2b6921b603).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-173477452
  
**[Test build #49865 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49865/consoleFull)**
 for PR 8734 at commit 
[`cd25214`](https://github.com/apache/spark/commit/cd252142a29e6083648bdec6a0646f2b6921b603).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-20 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/8734#discussion_r50285840
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -740,7 +740,7 @@ private[ml] object RandomForest extends Logging {
   val categoryStats =
 binAggregates.getImpurityCalculator(nodeFeatureOffset, 
featureValue)
   val centroid = if (categoryStats.count != 0) {
-categoryStats.predict
+categoryStats.prob(categoryStats.predict)
--- End diff --

I don't believe this is correct. Ordering by the probability of the 
prediction is essentially the same as ordering by impurity. That's because when 
the impurity is low, the predicted value will have high probability and vice 
versa. 

From Hastie, Tibshirani, and Friedman:
"We order the predictor classes according to the proportion falling in 
outcome class 1. Then we split this predictor as if it were an ordered 
predictor."

For binary category I think it should be as @jkbradley suggested 
`categoryStats.stats(1)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-173088520
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-173080387
  
**[Test build #49757 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49757/consoleFull)**
 for PR 8734 at commit 
[`2c32350`](https://github.com/apache/spark/commit/2c3235054d2d2a98f9f88ae67243bd1ba89eca1e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-173088522
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49757/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-173088436
  
**[Test build #49757 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49757/consoleFull)**
 for PR 8734 at commit 
[`2c32350`](https://github.com/apache/spark/commit/2c3235054d2d2a98f9f88ae67243bd1ba89eca1e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-19 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/8734#discussion_r50212622
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
@@ -337,12 +342,62 @@ class DecisionTreeSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 assert(topNode.impurity !== -1.0)
 
 // set impurity and predict for child nodes
-assert(topNode.leftNode.get.predict.predict === 0.0)
-assert(topNode.rightNode.get.predict.predict === 1.0)
+if (topNode.leftNode.get.predict.predict === 0.0) {
+  assert(topNode.rightNode.get.predict.predict === 1.0)
+} else {
+  assert(topNode.leftNode.get.predict.predict === 1.0)
+  assert(topNode.rightNode.get.predict.predict === 0.0)
+}
 assert(topNode.leftNode.get.impurity === 0.0)
 assert(topNode.rightNode.get.impurity === 0.0)
   }
 
+  test("Use soft prediction for binary classification with ordered 
categorical features") {
+val arr = Array(
+  LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)), // left node
+  LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 1.0)), // right node
+  LabeledPoint(0.0, Vectors.dense(2.0, 0.0, 0.0)), // left node
+  LabeledPoint(1.0, Vectors.dense(0.0, 2.0, 1.0)), // right node
+  LabeledPoint(1.0, Vectors.dense(1.0, 1.0, 0.0)), // left node
+  LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 2.0))) // left node
+val input = sc.parallelize(arr)
+
+val strategy = new Strategy(algo = Classification, impurity = Gini, 
maxDepth = 1,
+  numClasses = 2, categoricalFeaturesInfo = Map(0 -> 3))
+val metadata = DecisionTreeMetadata.buildMetadata(input, strategy)
+val (splits, bins) = DecisionTree.findSplitsBins(input, metadata)
+
+val treeInput = TreePoint.convertToTreeRDD(input, bins, metadata)
+val baggedInput = BaggedPoint.convertToBaggedRDD(treeInput, 1.0, 1, 
false)
+
+val topNode = Node.emptyNode(nodeIndex = 1)
+assert(topNode.predict.predict === Double.MinValue)
+assert(topNode.impurity === -1.0)
+assert(topNode.isLeaf === false)
+
+val nodesForGroup = Map((0, Array(topNode)))
+val treeToNodeToIndexInfo = Map((0, Map(
+  (topNode.id, new RandomForest.NodeIndexInfo(0, None))
+)))
+val nodeQueue = new mutable.Queue[(Int, Node)]()
+DecisionTree.findBestSplits(baggedInput, metadata, Array(topNode),
--- End diff --

In order to call `binsToBestSplit` directly, we need to expose many details 
of `findBestSplits` too, e.g., `binSeqOp`, `getNodeToFeatures ` and 
`partitionAggregates`...etc., because `binsToBestSplit` needs `binAggregates` 
and `featuresForNode`..etc. as parameters.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-13 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-17141
  
@jkbradley Sorry for replying late. I will try to finish this soon. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-13 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-171491697
  
Ping!  Please let me know if you don't have time to work on this, and I can 
take it over.  Thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2016-01-13 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-171500434
  
OK thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2015-12-30 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-168035189
  
I'll have bandwidth to get this merged now, so I'll watch for updates.  
Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2015-12-30 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/8734#discussion_r48616278
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -776,7 +776,7 @@ private[ml] object RandomForest extends Logging {
   val categoryStats =
 binAggregates.getImpurityCalculator(nodeFeatureOffset, 
featureValue)
   val centroid = if (categoryStats.count != 0) {
-categoryStats.predict
+categoryStats.calculate()
--- End diff --

This should use the soft prediction, not simply the impurity.  For 
regression, that means using predict(), but for binary classification, that 
will require using categoryStats.stats(1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2015-12-30 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/8734#discussion_r48616306
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
@@ -337,12 +342,62 @@ class DecisionTreeSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 assert(topNode.impurity !== -1.0)
 
 // set impurity and predict for child nodes
-assert(topNode.leftNode.get.predict.predict === 0.0)
-assert(topNode.rightNode.get.predict.predict === 1.0)
+if (topNode.leftNode.get.predict.predict === 0.0) {
+  assert(topNode.rightNode.get.predict.predict === 1.0)
+} else {
+  assert(topNode.leftNode.get.predict.predict === 1.0)
+  assert(topNode.rightNode.get.predict.predict === 0.0)
+}
 assert(topNode.leftNode.get.impurity === 0.0)
 assert(topNode.rightNode.get.impurity === 0.0)
   }
 
+  test("Use soft prediction for binary classification with ordered 
categorical features") {
+val arr = Array(
+  LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)), // left node
+  LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 1.0)), // right node
+  LabeledPoint(0.0, Vectors.dense(2.0, 0.0, 0.0)), // left node
+  LabeledPoint(1.0, Vectors.dense(0.0, 2.0, 1.0)), // right node
+  LabeledPoint(1.0, Vectors.dense(1.0, 1.0, 0.0)), // left node
+  LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 2.0))) // left node
+val input = sc.parallelize(arr)
+
+val strategy = new Strategy(algo = Classification, impurity = Gini, 
maxDepth = 1,
+  numClasses = 2, categoricalFeaturesInfo = Map(0 -> 3))
+val metadata = DecisionTreeMetadata.buildMetadata(input, strategy)
+val (splits, bins) = DecisionTree.findSplitsBins(input, metadata)
+
+val treeInput = TreePoint.convertToTreeRDD(input, bins, metadata)
+val baggedInput = BaggedPoint.convertToBaggedRDD(treeInput, 1.0, 1, 
false)
+
+val topNode = Node.emptyNode(nodeIndex = 1)
+assert(topNode.predict.predict === Double.MinValue)
+assert(topNode.impurity === -1.0)
+assert(topNode.isLeaf === false)
+
+val nodesForGroup = Map((0, Array(topNode)))
+val treeToNodeToIndexInfo = Map((0, Map(
+  (topNode.id, new RandomForest.NodeIndexInfo(0, None))
+)))
+val nodeQueue = new mutable.Queue[(Int, Node)]()
+DecisionTree.findBestSplits(baggedInput, metadata, Array(topNode),
--- End diff --

Ping


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2015-10-26 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/8734#discussion_r42964746
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
@@ -294,8 +295,12 @@ class DecisionTreeSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 assert(topNode.impurity !== -1.0)
 
 // set impurity and predict for child nodes
-assert(topNode.leftNode.get.predict.predict === 0.0)
-assert(topNode.rightNode.get.predict.predict === 1.0)
+if (topNode.leftNode.get.predict.predict === 0.0) {
--- End diff --

Because previously we rank the bins according to hard prediction. So for 
example the bin of hard prediction 0 is always in front of hard prediction 1. 
Now the order of bins can be switched.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2015-10-06 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/8734#discussion_r41326380
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
@@ -337,12 +342,62 @@ class DecisionTreeSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 assert(topNode.impurity !== -1.0)
 
 // set impurity and predict for child nodes
-assert(topNode.leftNode.get.predict.predict === 0.0)
-assert(topNode.rightNode.get.predict.predict === 1.0)
+if (topNode.leftNode.get.predict.predict === 0.0) {
+  assert(topNode.rightNode.get.predict.predict === 1.0)
+} else {
+  assert(topNode.leftNode.get.predict.predict === 1.0)
+  assert(topNode.rightNode.get.predict.predict === 0.0)
+}
 assert(topNode.leftNode.get.impurity === 0.0)
 assert(topNode.rightNode.get.impurity === 0.0)
   }
 
+  test("Use soft prediction for binary classification with ordered 
categorical features") {
+val arr = Array(
+  LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)), // left node
+  LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 1.0)), // right node
+  LabeledPoint(0.0, Vectors.dense(2.0, 0.0, 0.0)), // left node
+  LabeledPoint(1.0, Vectors.dense(0.0, 2.0, 1.0)), // right node
+  LabeledPoint(1.0, Vectors.dense(1.0, 1.0, 0.0)), // left node
+  LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 2.0))) // left node
+val input = sc.parallelize(arr)
+
+val strategy = new Strategy(algo = Classification, impurity = Gini, 
maxDepth = 1,
+  numClasses = 2, categoricalFeaturesInfo = Map(0 -> 3))
+val metadata = DecisionTreeMetadata.buildMetadata(input, strategy)
+val (splits, bins) = DecisionTree.findSplitsBins(input, metadata)
+
+val treeInput = TreePoint.convertToTreeRDD(input, bins, metadata)
+val baggedInput = BaggedPoint.convertToBaggedRDD(treeInput, 1.0, 1, 
false)
+
+val topNode = Node.emptyNode(nodeIndex = 1)
+assert(topNode.predict.predict === Double.MinValue)
+assert(topNode.impurity === -1.0)
+assert(topNode.isLeaf === false)
+
+val nodesForGroup = Map((0, Array(topNode)))
+val treeToNodeToIndexInfo = Map((0, Map(
+  (topNode.id, new RandomForest.NodeIndexInfo(0, None))
+)))
+val nodeQueue = new mutable.Queue[(Int, Node)]()
+DecisionTree.findBestSplits(baggedInput, metadata, Array(topNode),
--- End diff --

Can you please update this test to call binsToBestSplit directly?  You can 
change it to be ```private[tree]``` so that it's callable from this test suite.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2015-10-06 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/8734#discussion_r41326387
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
@@ -294,8 +295,12 @@ class DecisionTreeSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 assert(topNode.impurity !== -1.0)
 
 // set impurity and predict for child nodes
-assert(topNode.leftNode.get.predict.predict === 0.0)
-assert(topNode.rightNode.get.predict.predict === 1.0)
+if (topNode.leftNode.get.predict.predict === 0.0) {
--- End diff --

What happened here?  Did the leaf nodes switch because of the internal 
change?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2015-09-13 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/8734

[SPARK-10524][ML] Use the soft prediction to order categories' bins

JIRA: https://issues.apache.org/jira/browse/SPARK-10524

Currently we use the hard prediction (`ImpurityCalculator.predict`) to 
order categories' bins. But we should use the soft prediction.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 dt-soft-centroids

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/8734.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #8734


commit 84260ca52bc037f2074bc062c7d5b88e1657314a
Author: Liang-Chi Hsieh 
Date:   2015-09-13T15:27:08Z

Use the soft prediction to order categories' bins.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2015-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-139891838
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2015-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-139887804
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2015-09-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-139892475
  
  [Test build #42382 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42382/consoleFull)
 for   PR 8734 at commit 
[`84260ca`](https://github.com/apache/spark/commit/84260ca52bc037f2074bc062c7d5b88e1657314a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2015-09-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-139896122
  
  [Test build #42382 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42382/console)
 for   PR 8734 at commit 
[`84260ca`](https://github.com/apache/spark/commit/84260ca52bc037f2074bc062c7d5b88e1657314a).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2015-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-139896144
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10524][ML] Use the soft prediction to o...

2015-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8734#issuecomment-139896145
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42382/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org