On Tue, 2008-09-30 at 18:56 -0500, Frank E Harrell Jr wrote:
> Bernardo Rangel Tura wrote:
> > On Sat, 2008-09-27 at 10:51 -0700, milicic.marko wrote:
> >> I have a huge data set with thousands of variables and one binary
> >> variable. I know that most of the variables are correlated and are not
> >> good predictors... but...
> >>
> >> It is very hard to start modeling with such a huge dataset. What would
> >> be your suggestion? How to make a first cut... how to eliminate most
> >> of the variables but not ignore potential interactions... for
> >> example, maybe variable A is not a good predictor and variable B is not
> >> a good predictor either, but maybe A and B together are a good
> >> predictor...
> >>
> >> Any suggestion is welcome.
> >
> > milicic.marko,
> >
> > I think you could start with rpart("binary variable"~.)
> > This shows you a set of variables with which to start a model, and a
> > starting set of cutoffs for continuous variables.
>
> I cannot imagine a worse way to formulate a regression model. Reasons
> include
>
> 1. Results of recursive partitioning are not trustworthy unless the
> sample size exceeds 50,000 or the signal to noise ratio is extremely high.
>
> 2. The type I error of tests from the final regression model will be
> extraordinarily inflated.
>
> 3. False interactions will appear in the model.
>
> 4. The cutoffs so chosen will not replicate and in effect assume that
> covariate effects are discontinuous and piecewise flat. The use of
> cutoffs results in a huge loss of information and power and makes the
> analysis arbitrary and impossible to interpret (e.g., a high covariate
> value:low covariate value odds ratio or mean difference is a complex
> function of all the covariate values in the sample).
>
> 5. The model will not validate in new data.

Professor Frank,

Thank you for your explanation.

Well, if my first idea is wrong, what is your opinion of the following
approach?

1- Run a PCA on the data, excluding the binary variable
2- Put the principal components in a logistic model
3- Afterwards, convert the principal components back to the original
   variables (only if that is of interest to milicic.marko)

If this approach is also wrong, what approach would you suggest?

--
Bernardo Rangel Tura, M.D,MPH,Ph.D
National Institute of Cardiology
Brazil

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
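
For concreteness, the three PCA steps proposed above could be sketched in R roughly as follows. This is only an illustration under stated assumptions, not a recommendation: the data are simulated stand-ins (the real data set is not available here), the names X, y and k are hypothetical, and the number of components k = 5 is an arbitrary choice made without looking at the response.

```r
## Simulated stand-in data (hypothetical): 200 rows, 30 correlated predictors
set.seed(1)
n <- 200
p <- 30
latent   <- matrix(rnorm(n * 3), n, 3)            # three hidden factors
loadings <- matrix(rnorm(3 * p), 3, p)
X <- latent %*% loadings + matrix(rnorm(n * p, sd = 0.5), n, p)
colnames(X) <- paste0("x", seq_len(p))
y <- rbinom(n, 1, plogis(latent[, 1] - 0.5 * latent[, 2]))  # binary response

## Step 1: PCA on the predictors only (the binary variable is excluded)
pca <- prcomp(X, center = TRUE, scale. = TRUE)
k <- 5                                            # assumption: arbitrary choice

## Step 2: logistic model on the first k principal component scores
dat <- data.frame(y = y, pca$x[, 1:k, drop = FALSE])
fit <- glm(y ~ ., data = dat, family = binomial)

## Step 3: express the fitted model in terms of the original variables
gamma    <- coef(fit)[-1]                         # coefficients of PC1..PCk
beta_std <- drop(pca$rotation[, 1:k, drop = FALSE] %*% gamma)  # standardized x
beta_raw <- beta_std / pca$scale                  # per original unit of x
a_raw    <- coef(fit)[1] - sum(beta_raw * pca$center)          # intercept

## Linear predictor rebuilt from the original variables
lp_raw <- drop(a_raw + X %*% beta_raw)
```

Here lp_raw should agree with predict(fit, type = "link") up to rounding, since step 3 only re-expresses the same fitted model. Every original variable receives a nonzero coefficient, so this reduces dimension but does not eliminate variables.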