Firstly, apologies for the junk in the header of my email; I believe it was
caused by a copy-and-paste error on a smartphone.

Thanks for your response.  I will indeed make the PR you suggest, though
glancing at the code I realize it's not just a case of making these public,
since the types are also private. I will also be exposing certain
functionality that then ought to be tested, e.g. that every bin except
potentially the last has an equal number of data points in it*.  I'll
get round to it at some point.
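
A minimal sketch of the property such a test would check, in plain Scala
without Spark. `BinSizeSketch` and `binSizes` are hypothetical names used only
for illustration; this is not MLlib's binning code, and whether MLlib's bins
actually satisfy the property is exactly what the test would need to establish.

```scala
object BinSizeSketch {
  // Hypothetical helper, not part of MLlib: sorts scores in descending order,
  // cuts them into consecutive bins of `binSize`, and returns each bin's size.
  def binSizes(scores: Seq[Double], binSize: Int): List[Int] = {
    require(binSize > 0, "binSize must be positive")
    scores.sortBy(s => -s).grouped(binSize).map(_.size).toList
  }

  def main(args: Array[String]): Unit = {
    val sizes = binSizes(Seq(0.9, 0.1, 0.4, 0.8, 0.35, 0.7, 0.2), binSize = 3)
    // Every bin except possibly the last holds exactly `binSize` points.
    assert(sizes.init.forall(_ == 3))
    assert(sizes.last >= 1 && sizes.last <= 3)
  }
}
```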

As for BinaryClassificationMetrics using Double for labels, thanks for the
explanation.  If I were to make a PR to encapsulate the underlying
implementation (that uses LabeledPoint) and change the type to Boolean,
what would be the impact on versioning (since I'd be changing public API)?
An alternative would be to create a new wrapper class, say
BinaryClassificationMeasures, and deprecate the old with the intention of
migrating all the code into the new class.
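
A rough sketch of what such a wrapper could look like, assuming it simply
converts Boolean labels to 1.0/0.0 and delegates to the existing class. The
name BinaryClassificationMeasures is the hypothetical one suggested above, and
the choice of which methods to forward is an assumption for illustration only.

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.rdd.RDD

// Hypothetical wrapper, not an existing MLlib class: accepts Boolean labels
// and delegates to BinaryClassificationMetrics, which expects Double labels.
class BinaryClassificationMeasures(scoreAndLabels: RDD[(Double, Boolean)]) {

  // Convert each Boolean label to the 1.0 / 0.0 encoding used by MLlib.
  private val underlying = new BinaryClassificationMetrics(
    scoreAndLabels.map { case (score, label) =>
      (score, if (label) 1.0 else 0.0)
    })

  // Forward a few of the existing public methods unchanged.
  def roc(): RDD[(Double, Double)] = underlying.roc()
  def areaUnderROC(): Double = underlying.areaUnderROC()
  def pr(): RDD[(Double, Double)] = underlying.pr()
  def areaUnderPR(): Double = underlying.areaUnderPR()
}
```

With this approach the old class keeps its signature, so existing callers are
unaffected until it is deprecated.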

* Maybe some other part of the code base tests this, since this assumption
must hold in order to average across folds in cross-validation?

On Thu, Jun 18, 2015 at 1:02 AM, Xiangrui Meng <men...@gmail.com> wrote:

> LabeledPoint was used for both classification and regression, where label
> type is Double for simplicity. So in BinaryClassificationMetrics, we still
> use Double for labels. We compute the confusion matrix at each threshold
> internally, but this is not exposed to users (
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L127).
> Feel free to submit a PR to make it public. -Xiangrui
>
> On Mon, Jun 15, 2015 at 7:13 AM, Sam <samthesav...@gmail.com> wrote:
>
>>
>> According to
>> https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
>>
>> The constructor takes `RDD[(Double, Double)]`, meaning labels are Doubles.
>> This seems odd; shouldn't it be Boolean?  Similarly for MultilabelMetrics
>> (i.e. it should be RDD[(Array[Double], Array[Boolean])]), and for
>> MulticlassMetrics the type of both should be generic?
>>
>> Additionally, it would be good if either the ROC output type was changed
>> or another method was added that returned confusion matrices, so that the
>> hard integer values can be obtained before the divisions. E.g.
>>
>> ```
>> case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) {
>>   // bunch of methods for each of the things in the table here:
>>   // https://en.wikipedia.org/wiki/Receiver_operating_characteristic
>> }
>> ...
>> def confusions(): RDD[Confusion]
>> ```
>>
>
>
