Re: [Scikit-learn-general] Classification accuracy too high

2013-08-15 Thread Jason Williams
On 08/15/2013 01:08 PM, Jason Williams wrote: > I follow the sample at > http://blog.yhathq.com/posts/random-forests-in-python.html where it randomly > assigns true, false to the dataset > > np.rand

Re: [Scikit-learn-general] Classification accuracy too high

2013-08-15 Thread Andreas Mueller
On 08/15/2013 01:08 PM, Jason Williams wrote: > I follow the sample at > http://blog.yhathq.com/posts/random-forests-in-python.html where it randomly > assigns true, false to the dataset > > np.random.uniform(0, 1, len(df)) <= .75 > > then partition dataset into train set and test set. I use
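
The random split quoted above can be sketched roughly as follows. This is only an illustration: the row count and the seed are assumptions standing in for `len(df)` and are not taken from the thread.

```python
import numpy as np

# Assumptions for illustration: a fixed seed and 1000 rows standing in for len(df).
rng = np.random.default_rng(0)
n_rows = 1000

# Each row independently gets True with probability of about 0.75,
# so the training set is roughly (not exactly) 75% of the data.
is_train = rng.uniform(0, 1, n_rows) <= 0.75

train_idx = np.flatnonzero(is_train)   # rows used to fit the model
test_idx = np.flatnonzero(~is_train)   # held-out rows, never seen during fitting

print(len(train_idx), len(test_idx))
```

Because the mask is drawn at random, the two index sets are disjoint and together cover every row, but their sizes vary from run to run unless the seed is fixed.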

Re: [Scikit-learn-general] Classification accuracy too high

2013-08-15 Thread Jason Williams
Hi Jason, It looks like you are evaluating your error on your training data, aren't you? It will give you a (very) poor estimate of the generalization error of your model. Instead, try your model on an independent part of your datas

Re: [Scikit-learn-general] Classification accuracy too high

2013-08-15 Thread Gilles Louppe
Hi Jason, It looks like you are evaluating your error on your training data, aren't you? It will give you a (very) poor estimate of the generalization error of your model. Instead, try your model on an independent part of your dataset (in particular, one which has not been used to fit to your fo
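
The gap between training accuracy and held-out accuracy described here can be illustrated with a small sketch. Everything in it is an assumption made for illustration, not the data from the thread: the features are synthetic noise and the labels are drawn independently of them, so no model can genuinely do better than chance on unseen rows.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Assumption: synthetic data whose labels carry no signal at all.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)

# Random split in the style quoted earlier in the thread (about 75% training rows).
is_train = rng.uniform(0, 1, len(X)) <= 0.75
X_train, y_train = X[is_train], y[is_train]
X_test, y_test = X[~is_train], y[~is_train]

# n_estimators and random_state are set explicitly only to make the run repeatable.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Accuracy on the rows used for fitting versus rows the forest has never seen.
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print("training accuracy:", train_acc)
print("held-out accuracy:", test_acc)
```

Fully grown trees can memorize their training rows, so the first number tends to look excellent even on noise, while the second stays near chance. Only the held-out figure says anything about generalization.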

Re: [Scikit-learn-general] Classification accuracy too high

2013-08-15 Thread Robert Layton
The first thing I'd do is publish the result (just kidding!). Try it with another data set first, especially one that has an example in the docs. If you are still getting top marks, it may be your "framework" around the code. (Are you doing proper test/train splits, etc.?) If it drops, consider that

[Scikit-learn-general] Classification accuracy too high

2013-08-15 Thread Jason Williams
I ran a few tests based on the Random Forest Classifier (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) with the default settings. The classification is nearly 100% correct, even after repeating the procedure several times. That seems to be overfitting.