I have a dataset with multiple variables and a response. For example:

> str(x)
'data.frame': 3557238 obs. of 44 variables:
 $ response: Factor w/ 2 levels
 $ var2    : Factor w/ 5000 levels

If var2, for example, is a factor with 5000 levels, what is the best approach for determining which of these levels are most important to include when building the model, and which can be discarded? Assuming there is a way to do that, is it valid to include only the important levels of that variable and discard the rest when building the model?

Thanks,
Saeed

---
> sessionInfo()
R version 2.10.1 (2009-12-14)
x86_64-pc-linux-gnu

32 GB RAM

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.