LabeledPoint is used for both classification and regression, so its label type is Double for simplicity. BinaryClassificationMetrics follows the same convention and uses Double for labels. We compute the confusion matrix at each threshold internally, but this is not exposed to users (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L127). Feel free to submit a PR to make it public.

-Xiangrui
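For readers following the thread, here is a rough sketch of what per-threshold confusion counts could look like. It is not the MLlib implementation: it runs on a plain Scala `Seq` instead of an RDD so that it stands alone, the `Confusion` name is borrowed from the proposal quoted below, `ConfusionSketch` is a hypothetical object, and treating `label > 0.5` as positive is an assumption made for illustration.

```scala
// Hypothetical helper types, not part of Spark MLlib.
// Counts are Long so that large datasets do not overflow Int.
final case class Confusion(tp: Long, fp: Long, fn: Long, tn: Long) {
  // Rates are derived on demand, so the raw integer counts stay available.
  def truePositiveRate: Double =
    if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
  def falsePositiveRate: Double =
    if (fp + tn == 0) 0.0 else fp.toDouble / (fp + tn)
}

object ConfusionSketch {
  // Input mirrors the (score, label) pairs in the RDD[(Double, Double)] constructor.
  // Assumption: a label greater than 0.5 counts as positive.
  // Returns one (threshold, Confusion) pair per distinct score, highest first,
  // where an example is predicted positive when score >= threshold.
  def confusions(scoreAndLabels: Seq[(Double, Double)]): Seq[(Double, Confusion)] = {
    val totalPos = scoreAndLabels.count(_._2 > 0.5).toLong
    val totalNeg = scoreAndLabels.size.toLong - totalPos

    // Positive and negative counts per distinct score, sorted by descending score.
    val byScore = scoreAndLabels
      .groupBy(_._1)
      .toSeq
      .map { case (score, pairs) =>
        val pos = pairs.count(_._2 > 0.5).toLong
        (score, pos, pairs.size.toLong - pos)
      }
      .sortBy(t => -t._1)

    // Running totals give tp and fp at each threshold; fn and tn follow from the totals.
    byScore
      .scanLeft((Double.NaN, 0L, 0L)) { case ((_, tp, fp), (score, pos, neg)) =>
        (score, tp + pos, fp + neg)
      }
      .tail
      .map { case (threshold, tp, fp) =>
        (threshold, Confusion(tp, fp, totalPos - tp, totalNeg - fp))
      }
  }
}
```

Calling `ConfusionSketch.confusions(Seq((0.9, 1.0), (0.8, 0.0), (0.8, 1.0), (0.3, 0.0)))` yields one entry per distinct score. An RDD-based version would need a distributed sort and cumulative sum in place of `scanLeft`.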
On Mon, Jun 15, 2015 at 7:13 AM, Sam <samthesav...@gmail.com> wrote:

> According to
> https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
>
> The constructor takes `RDD[(Double, Double)]`, meaning labels are Doubles.
> This seems odd; shouldn't it be Boolean? Similarly for MultilabelMetrics
> (i.e. it should be RDD[(Array[Double], Array[Boolean])]), and for
> MulticlassMetrics the type of both should be generic?
>
> Additionally, it would be good if either the ROC output type was changed or
> another method was added that returned confusion matrices, so that the
> hard integer values can be obtained before the divisions. E.g.
>
> ```
> case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int)
> {
>   // bunch of methods for each of the things in the table here
>   // https://en.wikipedia.org/wiki/Receiver_operating_characteristic
> }
> ...
> def confusions(): RDD[Confusion]
> ```