I have a dataset of multiple variables and a response. For example,
> str(x)
'data.frame':   3557238 obs. of  44 variables:
 $ response :  Factor w/ 2 levels
 $ var2: Factor w/5000 levels
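
In case it helps, here is a scaled-down stand-in with the same shape as my data. The row count, the number of levels, the level names and the seed are all made up for illustration; only the two column names and their types match the real data frame.

```r
# Scaled-down stand-in for the real data; all sizes and names are illustrative.
set.seed(1)
n    <- 10000                       # the real data has 3557238 rows
levs <- sprintf("lev%02d", 1:50)    # the real var2 has 5000 levels

x <- data.frame(
  response = factor(sample(c("no", "yes"), n, replace = TRUE),
                    levels = c("no", "yes")),
  var2     = factor(sample(levs, n, replace = TRUE), levels = levs)
)

str(x)
```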


If var2, for example, is a factor with 5000 levels, what is the best
approach for determining which of these levels are the most important
to include when building the model, and which ones can be discarded?
Assuming there is a way to do that, is it valid to include only the
important levels of that variable and discard the rest when building
the model?
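
To make "include only the important levels" concrete, this is roughly the kind of reduction I have in mind. It is only a sketch on toy data: frequency is used here as a placeholder for "importance" (which is exactly the part I do not know how to decide), and the cutoff of 10 levels and the label "other" are arbitrary choices of mine.

```r
# Toy illustration only: frequency stands in for "importance" here.
set.seed(1)
levs <- sprintf("lev%02d", 1:50)

# Skewed sampling so that a handful of levels dominate
var2 <- factor(sample(levs, 10000, replace = TRUE, prob = (50:1)^2),
               levels = levs)

# Keep the 10 most frequent levels; lump everything else into "other"
keep <- names(sort(table(var2), decreasing = TRUE))[1:10]
var2_reduced <- factor(ifelse(var2 %in% keep, as.character(var2), "other"))

nlevels(var2_reduced)
table(var2_reduced)
```

The reduced factor would then replace var2 in the model formula, so the rows with discarded levels are still used but share one common level.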
Thanks,
Saeed

---
> sessionInfo()
R version 2.10.1 (2009-12-14)
x86_64-pc-linux-gnu
32 GB RAM

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
