A factor with 5000 levels looks like it may be a numeric variable that was accidentally coded as a factor (functions like read.table will do this if even one entry in the column contains a non-numeric character).
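A quick way to check for this is sketched below. It is only an illustration: the temporary file, its contents, and the "n/a" marker are made up for the example, and stringsAsFactors = TRUE is set explicitly so the behaviour matches older R versions, where it was the default. The idea is to convert the level labels (not the integer codes) back to numbers and look at which entries fail to convert.

```r
# Hypothetical data: a numeric column with one stray non-numeric entry
tmp <- tempfile(fileext = ".txt")
writeLines(c("var2", "1.5", "2.25", "n/a", "4"), tmp)

# stringsAsFactors = TRUE reproduces the old read.table default
x <- read.table(tmp, header = TRUE, stringsAsFactors = TRUE)
str(x)  # var2 comes in as a factor, not as a numeric vector

# Convert via the labels; as.numeric() on the factor itself
# would return the internal integer codes instead
vals <- suppressWarnings(as.numeric(as.character(x$var2)))

# Entries that could not be converted are the likely culprits
bad <- unique(as.character(x$var2)[is.na(vals) & !is.na(x$var2)])
bad

# If the offending entries are just missing-value markers,
# re-reading with na.strings gives a numeric column directly
x2 <- read.table(tmp, header = TRUE, na.strings = "n/a")
str(x2)
```

If `bad` contains only a handful of distinct strings, the variable was almost certainly meant to be numeric. If most labels fail to convert, it is more likely a genuine factor.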
If you really have a 5000-level factor, which levels can be discarded or combined is a question for the subject-matter scientist, not the statistician.

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111

> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Saeed Abu Nimeh
> Sent: Thursday, August 26, 2010 1:40 PM
> To: r-help@r-project.org
> Subject: [R] Importance of levels in a factor variable
>
> I have a dataset of multiple variables and a response. For example,
>
> str(x)
> 'data.frame': 3557238 obs. of 44 variables:
> $ response : Factor w/ 2 levels
> $ var2: Factor w/5000 levels
>
> If var2 for example is a factor with 5000 levels, what is the best
> approach to determine which of these levels is the most important to
> include in building the model, and which ones to discard. Assuming
> there is a way to do that, is it accurate to only include the
> important levels and discard the rest for that variable when building
> the model.
> Thanks,
> Saeed
>
> ---
>
> sessionInfo()
> R version 2.10.1 (2009-12-14)
> x86_64-pc-linux-gnu
> 32 GB RAM
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.