sam created SPARK-8375:
--------------------------

             Summary: BinaryClassificationMetrics in MLlib has odd API
                 Key: SPARK-8375
                 URL: https://issues.apache.org/jira/browse/SPARK-8375
             Project: Spark
          Issue Type: Bug
          Components: MLlib
            Reporter: sam


According to 
https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

the constructor takes `RDD[(Double, Double)]`, which does not make sense: the 
second element is a class label, so the type should be `RDD[(Double, T)]`, or 
at least `RDD[(Double, Int)]`.
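
For concreteness, a typed wrapper might look something like the sketch below 
(`TypedBinaryClassificationMetrics` and `positive` are made-up names for 
illustration, not the actual MLlib API):

```scala
import org.apache.spark.rdd.RDD

// Rough sketch only, not the real MLlib API: a generic label type T is
// reduced to 0/1 once at the boundary, so a Double never stands in for a
// categorical label.
class TypedBinaryClassificationMetrics[T](
    scoreAndLabels: RDD[(Double, T)],
    positive: T) extends Serializable {

  val scoreAndBinaryLabel: RDD[(Double, Int)] =
    scoreAndLabels.map { case (score, label) =>
      (score, if (label == positive) 1 else 0)
    }
}
```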

In scikit-learn, I believe the number of unique scores determines the number 
of thresholds, and hence the resolution of the ROC curve.  I assume 
BinaryClassificationMetrics does the same, since its documentation makes no 
mention of buckets.  In a Big Data context this does not make sense, as the 
number of unique scores may be huge.  

Rather, the user should be able to specify either the number of buckets or 
the number of data points in each bucket, e.g. `def roc(numPtsPerBucket: 
Int)`; a sketch of such bucketing follows.
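
To illustrate (a rough sketch assuming equal-count buckets are acceptable; 
`bucketedScores` is a hypothetical helper, not proposed verbatim):

```scala
import org.apache.spark.rdd.RDD

// Hypothetical sketch: collapse scores into buckets of numPtsPerBucket
// points each, bounding the number of distinct thresholds regardless of
// how many unique scores the data contains.
def bucketedScores(scoreAndLabels: RDD[(Double, Double)],
                   numPtsPerBucket: Int): RDD[(Double, Double)] =
  scoreAndLabels
    .sortByKey(ascending = false)         // highest scores first
    .zipWithIndex()
    .map { case ((_, label), idx) =>
      val bucket = idx / numPtsPerBucket  // all points in a bucket share
      (bucket.toDouble, label)            // one synthetic threshold
    }
```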

Finally, it would be good if either the ROC output type were changed or 
another method were added that returns confusion matrices, so that the raw 
integer counts can be obtained.  E.g.

```
case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) {
  // Derived rates from the table at
  // https://en.wikipedia.org/wiki/Receiver_operating_characteristic
  def tpr: Double = tp.toDouble / (tp + fn)  // true positive rate (recall)
  def fpr: Double = fp.toDouble / (fp + tn)  // false positive rate
  def precision: Double = tp.toDouble / (tp + fp)
  // ... and so on for the remaining entries in the table
}

...
def confusions(numPtsPerBucket: Int): RDD[Confusion]
```
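
Usage would then be straightforward, e.g. (hypothetical, since `confusions` 
is only proposed here):

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Hypothetical usage of the proposed method: ROC points fall straight out
// of the integer counts, with nothing lost to premature division.
// scoreAndLabels: RDD[(Double, Double)] as in the existing constructor.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val rocPoints: RDD[(Double, Double)] =
  metrics.confusions(numPtsPerBucket = 1000).map(c => (c.fpr, c.tpr))
```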

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
