On Thu, 09-Aug-2012 at 03:40PM -0700, Kirk Fleming wrote:

|> My data is 50,000 instances of about 200 predictor values, and for
|> all 50,000 examples I have the actual class labels (binary). The
|> data is quite unbalanced with about 10% or less of the examples
|> having a positive outcome and the remainder, of course,
|> negative. Nothing suggests the data has any order, and it doesn't
|> appear to have any, so I've pulled the first 30,000 examples to use
|> as training data, reserving the remainder for test data.
|>
|> There are actually 3 distinct sets of class labels associated with
|> the predictor data, and I've built 3 distinct models. When each
|> model is used in predict() with the training data and true class
|> labels, I get AUC values of 0.95, 0.98 and 0.98 for the 3
|> classifier problems.
I don't know which package your naiveBayes comes from, so I can't check it, but my experience with boosted regression trees might be useful. I had AUC values fairly similar to yours with only one tenth the number of instances you have. If naiveBayes can make use of a validation set, I think you'll find it makes a huge difference. In my case it brought the training AUC down to something like 0.85, but the test AUC was only slightly lower, say 0.81.

Try reserving about 20-25% of your training data as a validation set, then calculate your AUC on the combined training and validation data. That AUC will probably go down somewhat, but your test AUC will look much better. I'd be interested to know what you discover.

-- 
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
   ___    Patrick Connolly
  {~._.~}                   Great minds discuss ideas
  _( Y )_                 Average minds discuss events
 (:_~*~_:)                  Small minds discuss people
  (_)-(_)                             ..... Eleanor Roosevelt
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
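P.S. In case it's useful, here is a rough, untested sketch of the kind of split and AUC comparison I mean. It assumes your data sit in a data frame called dat with a binary factor column label (both names are mine, not yours), that naiveBayes() is the one from the e1071 package, and it uses roc()/auc() from pROC for the AUC; adjust to whatever you are actually using.

library(e1071)   # naiveBayes()
library(pROC)    # roc(), auc()

set.seed(1)

## Assumed layout: data frame 'dat' with a binary factor 'label' and
## ~200 predictor columns; the first 30,000 rows were your training set.
n     <- nrow(dat)
train <- dat[1:30000, ]
test  <- dat[30001:n, ]

## Hold back roughly 25% of the training rows as a validation set
val_id <- sample(nrow(train), size = round(0.25 * nrow(train)))
val    <- train[val_id, ]
train  <- train[-val_id, ]

## Fit on the reduced training set only
fit <- naiveBayes(label ~ ., data = train)

## Posterior probability of the positive class for each subset
## (column 2 assumes the positive class is the second factor level)
p_train <- predict(fit, train, type = "raw")[, 2]
p_val   <- predict(fit, val,   type = "raw")[, 2]
p_test  <- predict(fit, test,  type = "raw")[, 2]

## Compare AUCs: training alone, training + validation, and test
auc(roc(train$label, p_train))
trval <- rbind(train, val)
auc(roc(trval$label, c(p_train, p_val)))
auc(roc(test$label, p_test))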