[ https://issues.apache.org/jira/browse/SPARK-8375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-8375.
------------------------------
    Resolution: Invalid

@sam This is a discussion for the mailing list rather than a JIRA.
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

You're looking at an API from 4 versions ago, too.
https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
The inputs are scores and ground-truth labels. I agree that many distinct values are a problem, but handling for this is part of the newer API.

> BinaryClassificationMetrics in ML Lib has odd API
> -------------------------------------------------
>
>                 Key: SPARK-8375
>                 URL: https://issues.apache.org/jira/browse/SPARK-8375
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>            Reporter: sam
>
> According to
> https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
> the constructor takes `RDD[(Double, Double)]`, which does not make sense; it
> should be `RDD[(Double, T)]` or at least `RDD[(Double, Int)]`.
> In scikit I believe they use the number of unique scores to determine the
> number of thresholds and the ROC. I assume this is what
> BinaryClassificationMetrics is doing, since it makes no mention of buckets.
> In a Big Data context this does not make sense, as the number of unique scores
> may be huge.
> Rather, the user should be able to specify either the number of buckets or the
> number of data points in each bucket. E.g. `def roc(numPtsPerBucket: Int)`
> Finally, it would then be good if either the ROC output type was changed or
> another method was added that returned confusion matrices, so that the hard
> integer values can be obtained. E.g.
> ```
> case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) {
>   // bunch of methods for each of the things in the table here
>   // https://en.wikipedia.org/wiki/Receiver_operating_characteristic
> }
> ...
> def confusions(numPtsPerBucket: Int): RDD[Confusion]
> ```

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
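
For illustration, the reporter's proposed `confusions(numPtsPerBucket)` could be sketched in plain Scala roughly as follows. This is a minimal sketch under stated assumptions, not Spark's implementation: a local `Seq` stands in for the `RDD`, counts are `Long` rather than `Int`, labels above 0.5 are treated as positive, the object name `BucketedConfusions` and the two rate methods are hypothetical, and tied scores that straddle a bucket boundary are not handled specially.

```scala
// Hypothetical sketch of the proposal in SPARK-8375; none of these names are Spark API.
final case class Confusion(tp: Long, fp: Long, fn: Long, tn: Long) {
  // Two of the quantities from the ROC table; 0.0 is returned when the denominator is empty.
  def truePositiveRate: Double = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
  def falsePositiveRate: Double = if (fp + tn == 0) 0.0 else fp.toDouble / (fp + tn)
}

object BucketedConfusions {
  /** Returns one (threshold, confusion matrix) pair per bucket of `numPtsPerBucket` points.
    * Points are visited in descending score order; the threshold of a bucket is its lowest
    * score, and every point seen so far is counted as predicted positive.
    */
  def confusions(scoreAndLabels: Seq[(Double, Double)],
                 numPtsPerBucket: Int): Seq[(Double, Confusion)] = {
    require(numPtsPerBucket > 0, "numPtsPerBucket must be positive")

    // Assumption: a label greater than 0.5 means "positive".
    def isPositive(label: Double): Boolean = label > 0.5

    val sorted = scoreAndLabels.sortBy { case (score, _) => -score }
    val totalPos = sorted.count { case (_, label) => isPositive(label) }.toLong
    val totalNeg = sorted.size.toLong - totalPos

    // Accumulate running true-positive and false-positive counts bucket by bucket.
    val running = sorted.grouped(numPtsPerBucket).toVector
      .scanLeft((Double.PositiveInfinity, 0L, 0L)) { case ((_, tp, fp), bucket) =>
        val pos = bucket.count { case (_, label) => isPositive(label) }.toLong
        (bucket.last._1, tp + pos, fp + (bucket.size - pos))
      }
      .tail // drop the initial seed element

    running.map { case (threshold, tp, fp) =>
      (threshold, Confusion(tp, fp, totalPos - tp, totalNeg - fp))
    }
  }
}
```

With six points and `numPtsPerBucket = 2` this yields three thresholds, so the number of ROC points is bounded by the bucket count rather than by the number of distinct scores.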