[ 
https://issues.apache.org/jira/browse/SPARK-8375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8375.
------------------------------
    Resolution: Invalid

@sam This is a discussion for the mailing list rather than a JIRA.
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

You're looking at an API from 4 versions ago, too.
https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

The inputs are scores and ground-truth labels. I agree that many distinct 
values are a problem, but this is already handled in the newer API.

> BinaryClassificationMetrics in ML Lib has odd API
> -------------------------------------------------
>
>                 Key: SPARK-8375
>                 URL: https://issues.apache.org/jira/browse/SPARK-8375
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>            Reporter: sam
>
> According to 
> https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
> the constructor takes `RDD[(Double, Double)]`, which does not make sense; it 
> should be `RDD[(Double, T)]` or at least `RDD[(Double, Int)]`.
> In scikit I believe they use the number of unique scores to determine the 
> number of thresholds and the ROC.  I assume this is what 
> BinaryClassificationMetrics is doing since it makes no mention of buckets.  
> In a Big Data context this does not make sense as the number of unique scores 
> may be huge.  
> Rather, the user should be able to specify either the number of buckets or 
> the number of data points in each bucket, e.g. `def roc(numPtsPerBucket: Int)`.
> Finally, it would be good if either the ROC output type were changed or 
> another method were added that returned confusion matrices, so that the hard 
> integer values can be obtained.  E.g.
> ```
> case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) {
>   // bunch of methods for each of the things in the table here:
>   // https://en.wikipedia.org/wiki/Receiver_operating_characteristic
> }
> ...
> def confusions(numPtsPerBucket: Int): RDD[Confusion]
> ```
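
For illustration only, the bucketed confusion matrices proposed above could be sketched in plain Scala along the following lines. This is not Spark's API: `BucketedConfusions`, `tpr`, and `fpr` are hypothetical names, a local `Seq` stands in for the `RDD`, the counts are `Long` rather than `Int` on the assumption that they may exceed the `Int` range on large data, and labels are assumed to be `1.0` for positive and anything else for negative. Tied scores that straddle a bucket boundary are not treated specially here.

```scala
// Hypothetical sketch; none of these names exist in Spark MLlib.
final case class Confusion(tp: Long, fp: Long, fn: Long, tn: Long) {
  // True positive rate (recall); defined as 0.0 when there are no positives.
  def tpr: Double = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
  // False positive rate; defined as 0.0 when there are no negatives.
  def fpr: Double = if (fp + tn == 0) 0.0 else fp.toDouble / (fp + tn)
}

object BucketedConfusions {
  // scoreAndLabels: (score, label) pairs, where label == 1.0 means positive.
  // Returns one Confusion per bucket, in order of descending score. The k-th
  // result predicts positive for every point in the first k buckets.
  def confusions(scoreAndLabels: Seq[(Double, Double)],
                 numPtsPerBucket: Int): Seq[Confusion] = {
    require(numPtsPerBucket > 0, "numPtsPerBucket must be positive")

    // Highest scores first, so each bucket boundary acts as a threshold.
    val sorted = scoreAndLabels.sortWith(_._1 > _._1)
    val totalPos = sorted.count(_._2 == 1.0).toLong
    val totalNeg = sorted.size.toLong - totalPos

    // (positives, negatives) inside each bucket of numPtsPerBucket points.
    val perBucket = sorted.grouped(numPtsPerBucket).map { bucket =>
      val pos = bucket.count(_._2 == 1.0).toLong
      (pos, bucket.size.toLong - pos)
    }.toSeq

    // Running totals give tp and fp at each bucket boundary.
    val cumulative = perBucket
      .scanLeft((0L, 0L)) { case ((tp, fp), (p, n)) => (tp + p, fp + n) }
      .tail

    cumulative.map { case (tp, fp) =>
      Confusion(tp, fp, totalPos - tp, totalNeg - fp)
    }
  }
}
```

Mapping each result to `(c.fpr, c.tpr)` would give one ROC point per bucket, so the number of points is bounded by the bucket count rather than by the number of distinct scores.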


