[GitHub] spark issue #14949: [SPARK-17057] [ML] ProbabilisticClassifierModels' predic...

2016-09-12 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14949
  
Oh, I get it now. That makes sense. If this were being applied to decision 
trees only, that would make sense and we could fix this up and document the 
meaning. I agree it only makes sense to return "no class" if actually 
thresholding.

The only problem here is that this is not being applied just to a random 
forest implementation but to all classifiers that output a probability. That's 
a little more of a stretch. I suppose the result here can be thought of as a 
likelihood ratio of class probability vs prior, not some hacky heuristic 
specific to the CRAN package. I think the name is unfortunate because I would 
not have guessed that's the meaning given the name (though to be fair the 
scaladoc does say what it means).

I'll close this but what's the best way forward?

Option 1.
Keep current behavior. Modify https://github.com/apache/spark/pull/14643 to 
include Nick's suggestions above, and add a bunch of documentation about what 
'thresholds' really means here.

Option 2.
As above but deprecate threshold and rename to 'cutoff' to be a little 
clearer.

Option 3.
As in Option 2 but also go back and actually implement thresholds.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14949: [SPARK-17057] [ML] ProbabilisticClassifierModels' predic...

2016-09-12 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/14949
  
The original JIRA 
[SPARK-8069](https://issues.apache.org/jira/browse/SPARK-8069) refers to 
https://cran.r-project.org/web/packages/randomForest/randomForest.pdf. 

That R package calls it "cutoff". Though it does indeed seem to act more 
like a "weight" or "scaling". I can't say I've come across it before, and it 
appears this is the only package that does it like this (at least that I've 
been able to find from some quick searching). I haven't found any theoretical 
background for it either.

In any case, now that we have it, I think it probably best to keep it as 
is. However, It appears that our implementation here is flawed since in the 
original R code, the `cutoff` vector sum must be in (0, 1) (and also be >0 
everywhere) - see 
https://github.com/cran/randomForest/blob/9208176df98d561aba6dae239472be8b124e2631/R/predict.randomForest.R#L47.
 If we're going to base something on another impl, probably best to actually 
follow it.

So:
* If `sum(thresholds)` > 1 or < 0, throw and error
* If each entry in `thresholds` not > 0, throw an error

I believe this takes care of the edge cases since no thresholds can be `0` 
or `1`. The tie breaker element is taken care of with `Vector.argmax` (if p/t 
is the same for 2 or more classes, then ties will effectively be broken by 
class index order).

I don't like returning `NaN`. Since the R impl is actually scaling things 
rather than actually "cutting off" or "thresholding", it should always return a 
prediction and I think we should too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14949: [SPARK-17057] [ML] ProbabilisticClassifierModels' predic...

2016-09-12 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14949
  
The problem is that it's called 'threshold' and not 'weight', and 
'threshold' means something different. Is anyone suggesting that it was always 
meant as a 'weight', and/or has a reference for this type of weighting? I've 
never seen it ... not sure what multiplying a probability by 1/weight would 
mean theoretically.

Note this change would also make threshold matter as a tie-breaker.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14949: [SPARK-17057] [ML] ProbabilisticClassifierModels' predic...

2016-09-12 Thread zhengruifeng
Github user zhengruifeng commented on the issue:

https://github.com/apache/spark/pull/14949
  
I think both this change and current design are reasonable. And I 
personally prefer to current one which treat threshould as a kind of weight.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14949: [SPARK-17057] [ML] ProbabilisticClassifierModels' predic...

2016-09-12 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14949
  
Trying @holdenk or @mengxr maybe. I think this behavior should be changed 
because it doesn't match the common meaning of 'threshold', but I feel like I'm 
missing context about why it was done this way.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14949: [SPARK-17057] [ML] ProbabilisticClassifierModels' predic...

2016-09-07 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14949
  
@jkbradley @zhengruifeng @MLnick I wonder if I could ask you for comments 
on this change? it's a behavior change, so not something I'd do lightly, but I 
do think it improves the semantics here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14949: [SPARK-17057] [ML] ProbabilisticClassifierModels' predic...

2016-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14949
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14949: [SPARK-17057] [ML] ProbabilisticClassifierModels' predic...

2016-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14949
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64903/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14949: [SPARK-17057] [ML] ProbabilisticClassifierModels' predic...

2016-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14949
  
**[Test build #64903 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64903/consoleFull)**
 for PR 14949 at commit 
[`2fa331e`](https://github.com/apache/spark/commit/2fa331ebdf3df31b08e3299c8410c29efa645bf1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14949: [SPARK-17057] [ML] ProbabilisticClassifierModels' predic...

2016-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14949
  
**[Test build #64903 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64903/consoleFull)**
 for PR 14949 at commit 
[`2fa331e`](https://github.com/apache/spark/commit/2fa331ebdf3df31b08e3299c8410c29efa645bf1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14949: [SPARK-17057] [ML] ProbabilisticClassifierModels' predic...

2016-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14949
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14949: [SPARK-17057] [ML] ProbabilisticClassifierModels' predic...

2016-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14949
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64900/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14949: [SPARK-17057] [ML] ProbabilisticClassifierModels' predic...

2016-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14949
  
**[Test build #64900 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64900/consoleFull)**
 for PR 14949 at commit 
[`08dbe43`](https://github.com/apache/spark/commit/08dbe43bbf961186ee94432a13f9f6cfc221e4a6).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14949: [SPARK-17057] [ML] ProbabilisticClassifierModels' predic...

2016-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14949
  
**[Test build #64900 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64900/consoleFull)**
 for PR 14949 at commit 
[`08dbe43`](https://github.com/apache/spark/commit/08dbe43bbf961186ee94432a13f9f6cfc221e4a6).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org