Re: [Scikit-learn-general] Unexpected behavior in metrics. I don't understand util.check_arrays...

Robert Layton Wed, 23 Nov 2011 13:26:40 -0800

On 24 November 2011 01:39, Olivier Grisel <[email protected]> wrote:


> 2011/11/23 Andreas Müller <[email protected]>:
> > On 11/23/2011 03:08 PM, Olivier Grisel wrote:
> >> 2011/11/23 Andreas Müller<[email protected]>:
> >>> Hi everybody.
> >>> Me again. I was getting some unexpected behaviour from the error
> >>> metrics.
> >>> Consider the following:
> >>>
> >>> import numpy as np
> >>> from sklearn.datasets import load_digits
> >>> from sklearn.metrics import zero_one_score
> >>>
> >>> digits = load_digits()
> >>> zero_one_score(digits.target, np.vstack(digits.target))
> >>>
> >>>   >>>  0.10
> >>>
> >>> The shape of digits.target is (1797,), the shape
> >>> of the stacked version is (1797, 1).
> >>> That seems to cause broadcasting in "==".
> >> Good catch.
> >>
> >>> I thought utils.check_arrays was meant to
> >>> avoid such problems, but it does not change the shape
> >>> of these two arrays.
> >>>
> >>> What did I do wrong or what did I misunderstand here?
> >>>
> >>> Obviously I could reshape either array so that no broadcasting
> >>> happens. I feel the problem is somewhat subtle, though,
> >>> and it took me 3 hours to find.
> >>>
> >>> If you feel that is a problem, should it be addressed in
> >>> "check_arrays"?
> >> IMHO, we should have a specific check for 1D integer arrays used as
> >> targets in classification tasks and another specific check for
> >> regression tasks, each with an explicit docstring stating what we check
> >> and an explicit ValueError message stating what we were expecting and
> >> what we got instead.
> >>
> > That might be a good idea. Should the check for classifications tasks
> > then be performed for each call to "fit" and each classification metric?
>
> For the classification metrics: we need to check that the shapes are the
> same for the two integer arrays provided as arguments: the expected
> target and the predicted target.
>
> For the fit method of the classifiers we should already have good
> check coverage I think.
>
> > I am not sure whether you mean that we should check that the dtype is
> > int. Or would you rather check that the array contains integers?
>
> I think this is a good idea, something like:
>
> if y_true.dtype.kind not in ('i', 'u'):
>     raise ValueError("Expected integer true target values for "
>                      "classification, got: %s" % y_true.dtype)
>
> if y_pred.dtype.kind not in ('i', 'u'):
>     raise ValueError("Expected integer predicted target values for "
>                      "classification, got: %s" % y_pred.dtype)
>
> Also we need to accept lists or tuples by converting them to arrays
> using np.asarray() first.
>
> > Are there other requirements? I am not familiar enough with the
> > implementation of the classification algorithms to say what kind
> > of assumptions they make.
> > Do labels have to be 0..n or [-1, 1] ?
>
> They can be any integer for classification tasks. The convention is to
> use [-1, 1] for binary classification and  0..n  for multiclass but
> this should not be enforced.
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>
>
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure
> contains a definitive record of customers, application performance,
> security threats, fraudulent activity, and more. Splunk takes this
> data and makes sense of it. IT sense. And common sense.
> http://p.sf.net/sfu/splunk-novd2d
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
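
To make the checks discussed above concrete, here is a rough sketch that
combines them: convert with np.asarray(), require an integer dtype, and
require matching shapes. The helper name check_label_arrays is hypothetical
and is not an existing scikit-learn function; the error messages are only
illustrative.

```python
import numpy as np


def check_label_arrays(y_true, y_pred):
    """Hypothetical helper (not part of scikit-learn): validate two label arrays."""
    # Accept lists and tuples by converting them to arrays first.
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # Classification targets are expected to be signed or unsigned integers.
    for name, y in (("true", y_true), ("predicted", y_pred)):
        if y.dtype.kind not in ("i", "u"):
            raise ValueError("Expected integer %s target values for "
                             "classification, got: %s" % (name, y.dtype))

    # Reject mismatched shapes so that "==" cannot silently broadcast.
    if y_true.shape != y_pred.shape:
        raise ValueError("Expected arrays of the same shape, got: %s and %s"
                         % (y_true.shape, y_pred.shape))
    return y_true, y_pred
```

With such a check in place, passing a (1797,) array together with a
(1797, 1) array would raise a ValueError instead of returning a misleading
score.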


In general, it can be ambiguous what a (1797,) array actually means.
Is it one sample with 1797 features or 1797 samples with one feature each?

I think this is why the conversion to 2D happens: it resolves the ambiguity
by making the array either (1, 1797) or (1797, 1).
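
For what it is worth, here is a small sketch of how the two shapes interact
under "==". It uses a made-up four-element label array rather than the digits
data, and reshape(-1, 1) to build the column, which has the same (n, 1) shape
as the stacked version in the original report.

```python
import numpy as np

y_true = np.array([0, 1, 2, 1])     # shape (4,)
y_pred = y_true.reshape(-1, 1)      # shape (4, 1): same labels as a column

# Comparing (4,) with (4, 1) broadcasts to a (4, 4) table of all pairs,
# so the mean is the fraction of matching pairs, not the accuracy.
pairwise = (y_true == y_pred)
naive_score = np.mean(pairwise)

# Flattening the column back to 1D restores the element-wise comparison.
fixed_score = np.mean(y_true == y_pred.ravel())

print(pairwise.shape, naive_score, fixed_score)
```

Here the pairwise mean is 6/16 = 0.375 even though the labels are identical.
With ten roughly balanced classes, as in digits, the same effect would give a
value near 0.10, which matches the number reported above.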

Thoughts?

-- 

Public key at: http://pgp.mit.edu/ Search for this email address and select
the key from "2011-08-19" (key id: 54BA8735)
