Re: [Scikit-learn-general] Classification accuracy too high

Jason Williams Thu, 15 Aug 2013 04:10:18 -0700

I followed the sample at
http://blog.yhathq.com/posts/random-forests-in-python.html, which randomly
assigns True or False to each row of the dataset:


    np.random.uniform(0, 1, len(df)) <= .75

and then partitions the dataset into a train set and a test set. I create the
model the same way:

    rfc = RandomForestClassifier()

    ...
    rfc.fit(train_set, label)

and then verify against the test set:

    target = numpy.array(['Dog', 'Cat'])
    preds = target[rfc.predict(test_set[test_set.columns])]
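
For reference, the whole sequence can be sketched end to end as follows. This
is only a sketch under assumptions: the data is synthetic (the real dataset is
not shown here), plain NumPy arrays stand in for the DataFrame, and the labels
are assumed to be stored as integer codes 0 and 1, which is what makes the
target[...] lookup valid.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Synthetic stand-in for the real data (assumption): 200 rows, 4 features,
# labels stored as integer codes 0 ('Dog') and 1 ('Cat').
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Same idea as np.random.uniform(0, 1, len(df)) <= .75:
# each row lands in the training set with probability 0.75.
is_train = rng.uniform(0, 1, len(X)) <= 0.75
X_train, y_train = X[is_train], y[is_train]
X_test, y_test = X[~is_train], y[~is_train]

rfc = RandomForestClassifier(n_estimators=50, random_state=0)
rfc.fit(X_train, y_train)  # fit on the training rows only

target = np.array(['Dog', 'Cat'])
# Indexing target works only because the predictions are 0/1 integer codes.
preds = target[rfc.predict(X_test)]
test_accuracy = np.mean(preds == target[y_test])
print(len(X_train), len(X_test), test_accuracy)
```

Because the mask is random, the split is only roughly 75/25, and the accuracy
printed here is measured on rows the forest never saw during fit.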
 
I also verified that the test and train sets are different, as shown below.

Are these the right steps for cross-validation? I read the Wikipedia article
(http://en.wikipedia.org/wiki/Cross-validation_(statistics)#Common_types_of_cross-validation).
The basic steps look the same as what the sample page describes, except that the
sample code uses a .75 threshold, which differs from the methods Wikipedia
describes, such as k-fold and 2-fold cross-validation.
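
To make the difference concrete: the .75 threshold gives a single random
holdout split, whereas k-fold rotates the held-out part so that every row is
tested exactly once. A minimal sketch of 5-fold cross-validation, assuming
synthetic data and a current scikit-learn release where cross_val_score lives
in sklearn.model_selection (in 2013 releases it was in
sklearn.cross_validation):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)

# Synthetic stand-in for the real dataset (assumption).
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

rfc = RandomForestClassifier(n_estimators=50, random_state=0)

# cv=5: the data is cut into 5 folds; the forest is refit 5 times, each time
# trained on 4 folds and scored on the remaining one.
scores = cross_val_score(rfc, X, y, cv=5)
print(scores, scores.mean())
```

The mean of the five scores is usually a steadier estimate than one holdout
score, at the cost of fitting the model five times.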

Thanks for the help


$ ls /tmp/partitioned_data
total 6.8M
...  1 ... 1.7M ... test_set
...  1 ... 5.1M ... train_set

$ diff /tmp/partitioned_data/test_set /tmp/partitioned_data/train_set | wc -l
16033

$ cat /tmp/partitioned_data/test_set | wc -l # contain header
3969

$ cat /tmp/partitioned_data/train_set | wc -l # contain header
12064

$ echo $((12063+3968))
16031
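
The diff count above shows that the two files differ, but diff compares lines
in order, so it does not by itself show that no row appears in both files. A
more direct check is a set intersection over the rows. The sketch below uses
only the standard library; shared_rows is a hypothetical helper name, and it
assumes one record per line with an optional header line.

```python
def shared_rows(path_a, path_b, has_header=True):
    """Return the set of data rows that appear in both files."""
    def read_rows(path):
        with open(path) as handle:
            lines = [line.rstrip("\n") for line in handle]
        # Drop the header so that it is not reported as a shared row.
        return set(lines[1:] if has_header else lines)

    return read_rows(path_a) & read_rows(path_b)

# Example call against the files listed above (not run here):
# overlap = shared_rows('/tmp/partitioned_data/test_set',
#                       '/tmp/partitioned_data/train_set')
# print(len(overlap))
```

An empty result means no identical row is in both sets. A non-empty result is
not necessarily a bug in the split, since the source data may itself contain
duplicate rows.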
 





----- Original Message -----
From: Gilles Louppe <[email protected]>
To: "[email protected]" 
<[email protected]>
Cc: Jason Williams <[email protected]>
Sent: Thursday, 15 August 2013, 3:49
Subject: Re: [Scikit-learn-general] Classification accuracy too high

Hi Jason,

It looks like you are evaluating your error on your training data,
aren't you? That will give you a (very) poor estimate of the
generalization error of your model. Instead, try your model on an
independent part of your dataset (in particular, one which has not
been used to fit your forest); it should give you a better
estimate. You can also evaluate your model within a cross-validation
loop.
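
A minimal sketch of the difference, assuming synthetic data whose labels are
pure noise, so that there is nothing real to learn. Any accuracy well above
chance must then come from evaluating on rows the forest has already seen:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Synthetic data (assumption): features and labels are independent,
# so no classifier can genuinely do better than chance.
X = rng.normal(size=(600, 5))
y = rng.randint(0, 2, size=600)

is_train = rng.uniform(0, 1, len(X)) <= 0.75
rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc.fit(X[is_train], y[is_train])

# Accuracy on the rows used for fitting versus accuracy on held-out rows.
train_accuracy = rfc.score(X[is_train], y[is_train])
test_accuracy = rfc.score(X[~is_train], y[~is_train])
print(train_accuracy, test_accuracy)
```

With fully grown trees the training score comes out close to 1.0 even on
noise, while the held-out score stays near chance; only the second number says
anything about generalization.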

Best,

Gilles

On 15 August 2013 09:12, Robert Layton <[email protected]> wrote:
> The first thing I'd do is publish the result (just kidding!).
>
> Try it with another data set first, especially one that has an example in
> the docs.
> If you are still getting top marks, it may be your "framework" around the
> code. (are you doing proper test/train splits, etc)
> If it drops, consider that you may have a dataset that can get high
> accuracies. Random Forests are good methods...
>
>
> On 15 August 2013 17:03, Jason Williams <[email protected]> wrote:
>>
>> I ran a few tests based on the Random Forest Classifier
>> (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
>> with default settings. The classification (I repeated the classification
>> procedure several times) is nearly 100% correct. That looks like
>> overfitting. Are there any points (e.g. dataset, etc.) I can check to see if
>> I did something wrong?
>>
>> Thanks
>>
>>
>> ------------------------------------------------------------------------------
>> Get 100% visibility into Java/.NET code with AppDynamics Lite!
>> It's a free troubleshooting tool designed for production.
>> Get down to code-level detail for bottlenecks, with <2% overhead.
>> Download for free and get started troubleshooting in minutes.
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=48897031&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
>
>
> --
>
> Public key at: http://pgp.mit.edu/ Search for this email address and select
> the key from "2011-08-19" (key id: 54BA8735)
>

