I am following the sample at http://blog.yhathq.com/posts/random-forests-in-python.html, which randomly assigns True or False to each row of the dataset with

    np.random.uniform(0, 1, len(df)) <= .75

and then partitions the dataset into a train set and a test set. I create the model the same way

    rfc = RandomForestClassifier()
    ...
    rfc.fit(train_set, label)

and then verify it against the test set

    target = numpy.array(['Dog', 'Cat'])
    preds = target[rfc.predict(test_set[test_set.columns])]

I have also verified that the test and train sets are different, as shown below.

Are these the right steps for cross-validation? I read the Wikipedia article (http://en.wikipedia.org/wiki/Cross-validation_(statistics)#Common_types_of_cross-validation). The basic steps look the same as what the sample page describes, except that the sample code uses a .75 threshold, which differs from the methods Wikipedia introduces, such as k-fold and 2-fold cross-validation.

Thanks for the help.

$ ls /tmp/partitioned_data
total 6.8M
... 1 ... 1.7M ... test_set
... 1 ... 5.1M ... train_set
$ diff /tmp/partitioned_data/test_set /tmp/partitioned_data/train_set | wc -l
16033
$ cat /tmp/partitioned_data/test_set | wc -l  # contains header
3969
$ cat /tmp/partitioned_data/train_set | wc -l  # contains header
12064
$ echo $((12063+3968))
16031

----- Original Message -----
From: Gilles Louppe <[email protected]>
To: "[email protected]" <[email protected]>
Cc: Jason Williams <[email protected]>
Sent: Thursday, 15 August 2013, 3:49
Subject: Re: [Scikit-learn-general] Classification accuracy too high

Hi Jason,

It looks like you are evaluating your error on your training data, aren't you? It will give you a (very) poor estimate of the generalization error of your model. Instead, try your model on an independent part of your dataset (in particular, one which has not been used to fit your forest); it should give you a better estimate. You can also evaluate your model within a cross-validation loop.

Best,
Gilles

On 15 August 2013 09:12, Robert Layton <[email protected]> wrote:
> The first thing I'd do is publish the result (just kidding!).
>
> Try it with another data set first, especially one that has an example in
> the docs.
> If you are still getting top marks, it may be your "framework" around the
> code. (are you doing proper test/train splits, etc)
> If it drops, consider that you may have a dataset that can get high
> accuracies. Random Forests are good methods...
>
>
> On 15 August 2013 17:03, Jason Williams <[email protected]> wrote:
>>
>> I ran a few test based on Random Forest Classifier
>> (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
>> with default setting. The classification (repeated the classification
>> procedure several times) is nearly 100% correct. That seems to be
>> overfitting. Is there any points (e.g. dataset, etc.) I can check to see if
>> I did something wrong?
>>
>> Thanks
>>
>>
>> ------------------------------------------------------------------------------
>> Get 100% visibility into Java/.NET code with AppDynamics Lite!
>> It's a free troubleshooting tool designed for production.
>> Get down to code-level detail for bottlenecks, with <2% overhead.
>> Download for free and get started troubleshooting in minutes.
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=48897031&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
> --
>
> Public key at: http://pgp.mit.edu/ Search for this email address and select
> the key from "2011-08-19" (key id: 54BA8735)
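
To illustrate the question above: the .75 threshold produces a single hold-out split, whereas k-fold cross-validation rotates the held-out part so that every row is tested exactly once. Below is a minimal sketch in plain Python, assuming the 16031 data rows counted above (12063 + 3968). The helper name k_fold_indices and the choice of k = 4 are hypothetical and not part of scikit-learn; the model fitting step is deliberately left out.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation.

    Hypothetical helper for illustration only; not a scikit-learn function.
    """
    indices = list(range(n_samples))
    # Shuffle once so that folds are not tied to the original row order.
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one row when n_samples % k != 0.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Assumption: 16031 rows, the header-free total of the two files above.
folds = list(k_fold_indices(16031, k=4))

for fold_number, (train_idx, test_idx) in enumerate(folds):
    # A model would be fitted on train_idx rows and scored on test_idx rows.
    print(fold_number, len(train_idx), len(test_idx))
```

With k = 4, each fold holds out roughly a quarter of the rows, which is comparable to one .75/.25 random split; the difference is that the procedure is repeated k times and the k scores are averaged.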
