[ 
https://issues.apache.org/jira/browse/SPARK-8375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam updated SPARK-8375:
-----------------------
    Description: 
According to 
https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

The constructor takes `RDD[(Double, Double)]`, which does not make sense; it 
should take `RDD[(Double, T)]`, or at least `RDD[(Double, Int)]`.

In scikit-learn, I believe the number of unique scores determines the number 
of thresholds used to build the ROC curve.  I assume this is what 
BinaryClassificationMetrics is doing too, since its documentation makes no 
mention of buckets.  In a Big Data context this does not scale, as the number 
of unique scores may be huge.

Rather, the user should be able to specify either the number of buckets or the 
number of data points in each bucket.  E.g. `def roc(numPtsPerBucket: Int)`
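To illustrate the bucketing idea, here is a minimal sketch using plain Scala collections in place of an RDD. Everything here (`bucketedConfusions`, `Counts`, the fixed-size-bucket scheme) is hypothetical and not part of the MLlib API: scores are sorted descending, split into buckets of `numPtsPerBucket` points, and cumulative confusion counts are emitted at each bucket boundary.

```scala
// Hypothetical sketch of bucketed ROC points; not an existing MLlib method.
// `scoresAndLabels` pairs a classifier score with a 0/1 label.
case class Counts(tp: Long, fp: Long, fn: Long, tn: Long)

def bucketedConfusions(scoresAndLabels: Seq[(Double, Int)],
                       numPtsPerBucket: Int): Seq[(Double, Counts)] = {
  val sorted   = scoresAndLabels.sortBy(-_._1)         // descending by score
  val totalPos = sorted.count(_._2 == 1).toLong
  val totalNeg = sorted.size.toLong - totalPos
  sorted.grouped(numPtsPerBucket)
    .scanLeft((Double.MaxValue, 0L, 0L)) {             // (threshold, tp, fp)
      case ((_, tp, fp), bucket) =>
        val pos = bucket.count(_._2 == 1).toLong
        // threshold = lowest score in the bucket; counts are cumulative
        (bucket.last._1, tp + pos, fp + (bucket.size - pos))
    }
    .drop(1)                                           // drop the seed value
    .map { case (threshold, tp, fp) =>
      threshold -> Counts(tp, fp, totalPos - tp, totalNeg - fp)
    }
    .toSeq
}
```

One ROC point per bucket keeps the output size proportional to `n / numPtsPerBucket` rather than to the number of unique scores.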

Finally, it would then be good if either the ROC output type were changed, or 
another method were added that returns confusion matrices, so that the raw 
integer counts can be obtained.  E.g.

```
case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) {
  // methods for each of the derived metrics in the table at
  // https://en.wikipedia.org/wiki/Receiver_operating_characteristic
}

...
def confusions(numPtsPerBucket: Int): RDD[Confusion]
```
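For illustration, the derived metrics from that Wikipedia table could be computed directly from the raw counts. This is a hypothetical fleshing-out of the `Confusion` sketch above, not an existing MLlib class:

```scala
// Hypothetical: derived rates from raw confusion counts; not part of MLlib.
case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) {
  def tpr: Double       = tp.toDouble / (tp + fn)          // recall / sensitivity
  def fpr: Double       = fp.toDouble / (fp + tn)          // fall-out
  def precision: Double = tp.toDouble / (tp + fp)
  def accuracy: Double  = (tp + tn).toDouble / (tp + fp + fn + tn)
}
```

Returning the counts rather than precomputed rates lets callers derive any metric in the table without re-scanning the data.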





> BinaryClassificationMetrics in ML Lib has odd API
> -------------------------------------------------
>
>                 Key: SPARK-8375
>                 URL: https://issues.apache.org/jira/browse/SPARK-8375
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>            Reporter: sam
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
