Re: *Metrics API is odd in MLLib

2015-07-28 Thread Sam
Hi Xiangrui & Spark People,

I recently got round to writing an evaluation framework for Spark that I
was hoping to PR into MLlib, which would solve some of the aforementioned
issues.  I have put the code on GitHub in a separate repo for now, as I
would like to get some sandboxed feedback.  The repo, complete with detailed
documentation, can be found here: https://github.com/samthebest/sceval.

Many thanks,

Sam



On Thu, Jun 18, 2015 at 11:00 AM, Sam  wrote:

> Firstly apologies for the header of my email containing some junk, I
> believe it's due to a copy and paste error on a smart phone.
>
> Thanks for your response.  I will indeed make the PR you suggest, though
> glancing at the code I realize it's not just a case of making these public,
> since the types are also private. In addition, there is certain functionality
> I will be exposing which then ought to be tested, e.g. that every bin except
> potentially the last will have an equal number of data points in it*.  I'll
> get round to it at some point.
>
> As for BinaryClassificationMetrics using Double for labels, thanks for the
> explanation.  If I were to make a PR to encapsulate the underlying
> implementation (which uses LabeledPoint) and change the type to Boolean,
> what would be the impact on versioning (since I'd be changing the public API)?
> An alternative would be to create a new wrapper class, say
> BinaryClassificationMeasures, and deprecate the old with the intention of
> migrating all the code into the new class.
>
> * Maybe some other part of the code base tests this, since this assumption
> must hold in order to average across folds in x-validation?
>
> On Thu, Jun 18, 2015 at 1:02 AM, Xiangrui Meng  wrote:
>
>> LabeledPoint was used for both classification and regression, where label
>> type is Double for simplicity. So in BinaryClassificationMetrics, we still
>> use Double for labels. We compute the confusion matrix at each threshold
>> internally, but this is not exposed to users (
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L127).
>> Feel free to submit a PR to make it public. -Xiangrui
>>
>> On Mon, Jun 15, 2015 at 7:13 AM, Sam  wrote:
>>
>>>
>>> According to
>>> https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
>>>
>>> The constructor takes `RDD[(Double, Double)]`, meaning labels are
>>> Doubles. This seems odd; shouldn't it be Boolean?  Similarly for
>>> MultilabelMetrics (i.e. shouldn't it be RDD[(Array[Double], Array[Boolean])]),
>>> and for MulticlassMetrics shouldn't the type of both be generic?
>>>
>>> Additionally it would be good if either the ROC output type was changed
>>> or another method was added that returned confusion matrices, so that the
>>> hard integer values can be obtained before the divisions. E.g.
>>>
>>> ```
>>> case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) {
>>>   // methods for each of the measures in the table at
>>>   // https://en.wikipedia.org/wiki/Receiver_operating_characteristic
>>> }
>>> ...
>>> def confusions(): RDD[Confusion]
>>> ```
>>>
>>
>>
>


Re: *Metrics API is odd in MLLib

2015-06-18 Thread Sam
Firstly apologies for the header of my email containing some junk, I
believe it's due to a copy and paste error on a smart phone.

Thanks for your response.  I will indeed make the PR you suggest, though
glancing at the code I realize it's not just a case of making these public,
since the types are also private. In addition, there is certain functionality
I will be exposing which then ought to be tested, e.g. that every bin except
potentially the last will have an equal number of data points in it*.  I'll
get round to it at some point.

As for BinaryClassificationMetrics using Double for labels, thanks for the
explanation.  If I were to make a PR to encapsulate the underlying
implementation (which uses LabeledPoint) and change the type to Boolean,
what would be the impact on versioning (since I'd be changing the public API)?
An alternative would be to create a new wrapper class, say
BinaryClassificationMeasures, and deprecate the old with the intention of
migrating all the code into the new class.

* Maybe some other part of the code base tests this, since this assumption
must hold in order to average across folds in x-validation?
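To make the wrapper idea concrete, here is a minimal sketch (all names are
hypothetical, and plain Scala Seq stands in for RDD so the example is
self-contained): the public API takes Boolean labels and converts them at the
boundary to the 0.0/1.0 Doubles the existing constructor expects.

```scala
// Hypothetical sketch of the proposed wrapper, not an actual MLlib API.
// Seq stands in for RDD so the example runs without Spark.
case class ScoreAndLabel(score: Double, label: Boolean)

class BinaryClassificationMeasures(scoreAndLabels: Seq[ScoreAndLabel]) {
  // MLlib encodes binary labels as 0.0 / 1.0, so convert at the boundary
  // and keep the Boolean type in the public API.
  def asDoublePairs: Seq[(Double, Double)] =
    scoreAndLabels.map(sl => (sl.score, if (sl.label) 1.0 else 0.0))
}
```

The real class would build a BinaryClassificationMetrics from asDoublePairs
and delegate to it, so the deprecated class could be retired gradually.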

On Thu, Jun 18, 2015 at 1:02 AM, Xiangrui Meng  wrote:

> LabeledPoint was used for both classification and regression, where label
> type is Double for simplicity. So in BinaryClassificationMetrics, we still
> use Double for labels. We compute the confusion matrix at each threshold
> internally, but this is not exposed to users (
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L127).
> Feel free to submit a PR to make it public. -Xiangrui
>
> On Mon, Jun 15, 2015 at 7:13 AM, Sam  wrote:
>
>>
>> According to
>> https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
>>
>> The constructor takes `RDD[(Double, Double)]`, meaning labels are Doubles.
>> This seems odd; shouldn't it be Boolean?  Similarly for MultilabelMetrics
>> (i.e. shouldn't it be RDD[(Array[Double], Array[Boolean])]), and for
>> MulticlassMetrics shouldn't the type of both be generic?
>>
>> Additionally it would be good if either the ROC output type was changed
>> or another method was added that returned confusion matrices, so that the
>> hard integer values can be obtained before the divisions. E.g.
>>
>> ```
>> case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) {
>>   // methods for each of the measures in the table at
>>   // https://en.wikipedia.org/wiki/Receiver_operating_characteristic
>> }
>> ...
>> def confusions(): RDD[Confusion]
>> ```
>>
>
>


Re: *Metrics API is odd in MLLib

2015-06-17 Thread Xiangrui Meng
LabeledPoint was used for both classification and regression, where label
type is Double for simplicity. So in BinaryClassificationMetrics, we still
use Double for labels. We compute the confusion matrix at each threshold
internally, but this is not exposed to users (
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L127).
Feel free to submit a PR to make it public. -Xiangrui
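The idea behind that internal computation can be sketched RDD-free (plain
Scala, illustrative only, not MLlib's actual implementation): sort by
descending score and accumulate cumulative true/false positive counts, giving
one confusion matrix per threshold.

```scala
// Sketch of per-threshold confusion counts, for illustration only.
// Input: (score, label) pairs with labels encoded as 0.0 / 1.0, as MLlib does.
// Unlike MLlib, ties at the same score are not merged here.
def confusionsByThreshold(
    scoreAndLabels: Seq[(Double, Double)]): Seq[(Double, (Long, Long, Long, Long))] = {
  val totalPos = scoreAndLabels.count(_._2 == 1.0).toLong
  val totalNeg = scoreAndLabels.size - totalPos
  var tp = 0L
  var fp = 0L
  scoreAndLabels.sortBy(-_._1).map { case (score, label) =>
    if (label == 1.0) tp += 1 else fp += 1
    // At threshold = score: (tp, fp, fn, tn)
    (score, (tp, fp, totalPos - tp, totalNeg - fp))
  }
}
```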

On Mon, Jun 15, 2015 at 7:13 AM, Sam  wrote:

>
> According to
> https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
>
> The constructor takes `RDD[(Double, Double)]`, meaning labels are Doubles.
> This seems odd; shouldn't it be Boolean?  Similarly for MultilabelMetrics
> (i.e. shouldn't it be RDD[(Array[Double], Array[Boolean])]), and for
> MulticlassMetrics shouldn't the type of both be generic?
>
> Additionally it would be good if either the ROC output type was changed or
> another method was added that returned confusion matrices, so that the
> hard integer values can be obtained before the divisions. E.g.
>
> ```
> case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) {
>   // methods for each of the measures in the table at
>   // https://en.wikipedia.org/wiki/Receiver_operating_characteristic
> }
> ...
> def confusions(): RDD[Confusion]
> ```
>
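A self-contained sketch of the Confusion proposal above (plain Scala; the
Long counts and the particular method set are illustrative choices, not part
of the original proposal): integer counts are kept exact, and division only
happens in the derived measures.

```scala
// Illustrative fleshing-out of the proposed Confusion class. Long counts are
// used here because RDD counts in Spark are Long.
case class Confusion(tp: Long, fp: Long, fn: Long, tn: Long) {
  def precision: Double = tp.toDouble / (tp + fp)
  def recall: Double    = tp.toDouble / (tp + fn) // a.k.a. TPR
  def fpr: Double       = fp.toDouble / (fp + tn)
  def accuracy: Double  = (tp + tn).toDouble / (tp + fp + fn + tn)
}
```

For example, `Confusion(tp = 8, fp = 2, fn = 2, tn = 88).precision` gives 0.8,
with the raw counts still available for exact averaging across folds.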