Re: [R] Logistic regression problem

2008-09-30 Thread Milicic B. Marko

The only solution I can see is fitting all possible 2-factor models with
interactions and then assessing whether the interaction term is significant...


any more ideas?




Milicic B. Marko wrote:
> 
> I have a huge data set with thousands of variables and one binary
> variable. I know that most of the variables are correlated and are not
> good predictors... but...
> 
> It is very hard to start modeling with such a huge dataset. What would
> be your suggestion? How do I make a first cut... how do I eliminate most
> of the variables without ignoring potential interactions... for
> example, maybe variable A is not a good predictor and variable B is not
> a good predictor either, but maybe A and B together are a good
> predictor...
> 
> Any suggestion is welcome
> 
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Logistic-regression-problem-tp19704948p19746846.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] Logistic regression problem

2008-09-30 Thread Frank E Harrell Jr

Milicic B. Marko wrote:

The only solution I can see is fitting all possible 2-factor models with
interactions and then assessing whether the interaction term is significant...


any more ideas?


Please don't suggest such a thing unless you do simulations to back up 
its predictive performance, type I error properties, and the impact of 
collinearities.  You'll find this approach works as well as the U.S. 
economy.


Frank Harrell







Milicic B. Marko wrote:

[original message snipped]

--
Frank E Harrell Jr   Professor and Chair   School of Medicine
 Department of Biostatistics   Vanderbilt University



Re: [R] Logistic regression problem

2008-09-30 Thread Bernardo Rangel Tura
On Sat, 2008-09-27 at 10:51 -0700, milicic.marko wrote:
> [original message snipped]


milicic.marko,

I think you could start with rpart("binary variable" ~ .). This will show
you a set of variables with which to start a model, and starting cutoffs
for the continuous variables.
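As a rough sketch of that suggestion (on a simulated toy data set, since the original data are not available; all variable names here are invented):

```r
## Sketch of screening with rpart on simulated data.
## The data-generating process below is invented purely for illustration.
library(rpart)

set.seed(1)
n <- 1000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
## y depends on x1 and x2; x3 is pure noise
d$y <- factor(rbinom(n, 1, plogis(d$x1 + 0.5 * d$x2)))

fit <- rpart(y ~ ., data = d, method = "class")

printcp(fit)             # which variables the tree actually used in splits
fit$variable.importance  # a rough ranking of the candidate predictors
```

The split points shown in the printed tree are the "starting cutoffs" referred to above.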
-- 
Bernardo Rangel Tura, M.D,MPH,Ph.D
National Institute of Cardiology
Brazil



Re: [R] Logistic regression problem

2008-09-30 Thread Jason Jones Medical Informatics
So... I wouldn't suggest the all-possible-logistic-models approach either, 
and I'm not sure exactly what your goals are in modeling.

However, I've been fiddling around with the variable importance (varimp) 
functions that come with the randomForest and party packages.  The idea is to 
identify which independent variables are likely to be useful, and then to give 
those variables (identified as being of high importance) more attention than 
you could spend on the whole set.

A general advantage of the recursive partitioning approach is that it deals 
fairly nicely with interactions and collinearity.

Theoretically, the recursive partitioning approaches should be able to deal 
with missing values (often a problem with large datasets), but I have been 
unable to apply this with the variable importance functions.
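That screening idea might be sketched as follows (simulated data, invented variable names; the party package's varimp() on a cforest fit is an alternative):

```r
## Sketch of variable-importance screening with randomForest on simulated data.
library(randomForest)

set.seed(1)
n <- 500
d <- data.frame(matrix(rnorm(n * 10), n, 10))   # columns X1..X10
## y depends on X1 and X2; the rest are noise
d$y <- factor(rbinom(n, 1, plogis(d$X1 + d$X2)))

rf <- randomForest(y ~ ., data = d, importance = TRUE, ntree = 500)

## Mean decrease in accuracy per variable; screen on the top few
imp <- importance(rf, type = 1)
head(imp[order(-imp[, 1]), , drop = FALSE])
```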

Let me know if you require more details.  You can check out 
http://www.biomedcentral.com/1471-2105/9/307 for a couple of examples of 
variable importance.


Jason Jones, PhD
Medical Informatics
[EMAIL PROTECTED]
801.707.6898


-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Frank E Harrell Jr
Sent: Tuesday, September 30, 2008 2:54 PM
To: Milicic B. Marko
Cc: r-help@r-project.org
Subject: Re: [R] Logistic regression problem

[quoted message snipped]



Re: [R] Logistic regression problem

2008-09-30 Thread Frank E Harrell Jr

Bernardo Rangel Tura wrote:

[quoted original message snipped]

I think you could start with rpart("binary variable" ~ .). This will show
you a set of variables with which to start a model, and starting cutoffs
for the continuous variables.


I cannot imagine a worse way to formulate a regression model.  Reasons 
include


1. Results of recursive partitioning are not trustworthy unless the 
sample size exceeds 50,000 or the signal to noise ratio is extremely high.


2. The type I error of tests from the final regression model will be 
extraordinarily inflated.


3. False interactions will appear in the model.

4. The cutoffs so chosen will not replicate and in effect assume that 
covariate effects are discontinuous and piecewise flat.  The use of 
cutoffs results in a huge loss of information and power and makes the 
analysis arbitrary and impossible to interpret (e.g., a high covariate 
value:low covariate value odds ratio or mean difference is a complex 
function of all the covariate values in the sample).


5. The model will not validate in new data.

Frank
--
Frank E Harrell Jr   Professor and Chair   School of Medicine
 Department of Biostatistics   Vanderbilt University



Re: [R] Logistic regression problem

2008-10-01 Thread Bernardo Rangel Tura
On Tue, 2008-09-30 at 18:56 -0500, Frank E Harrell Jr wrote:
> [quoted text snipped]

Professor Frank,

Thank you for your explanation.

Well, if my first idea is wrong, what is your opinion of the following
approach?

1. Run PCA on the data, excluding the binary variable
2. Put the principal components into a logistic model
3. Afterwards, map the principal components back onto the original
variables (only if that is of interest to milicic.marko)

If this approach is wrong too, what approach would you take?
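Steps 1 and 2 might look like this in base R (simulated data for illustration; the number of components kept, k, is an arbitrary choice here):

```r
## Sketch of PCA followed by a logistic model on the leading components.
## Data are simulated for illustration only.
set.seed(1)
n <- 300
X <- matrix(rnorm(n * 20), n, 20)
y <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))

pc <- prcomp(X, center = TRUE, scale. = TRUE)   # step 1: PCA on predictors
k  <- 5                                         # keep the first k components
fit <- glm(y ~ pc$x[, 1:k], family = binomial)  # step 2: logistic model
summary(fit)

## Step 3: the loadings in pc$rotation[, 1:k] relate each component
## back to the original variables.
```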
-- 
Bernardo Rangel Tura, M.D,MPH,Ph.D
National Institute of Cardiology
Brazil



Re: [R] Logistic regression problem

2008-10-01 Thread Frank E Harrell Jr

Bernardo Rangel Tura wrote:

> [quoted text snipped]

Hi Bernardo,

If there is a large number of potential predictors and no previous 
knowledge to guide the modeling, principal components (PC) is often an 
excellent way to proceed.  The first few PCs can be put into the model. 
The result is not always very interpretable, but you can "decode" the 
PCs by using stepwise regression or recursive partitioning (which are 
safer in this context because the stepwise methods are not exposed to 
the Y variable).  You can also add PCs in a stepwise fashion in the 
pre-specified order of variance explained.
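One possible sketch of that "decoding" step, on simulated data: regress a component's scores on the original predictors with rpart, which never touches Y.

```r
## Sketch: explain PC1 in terms of the original variables with a tree.
## Data simulated for illustration; the tree never sees the outcome Y.
library(rpart)

set.seed(1)
n <- 300
X <- as.data.frame(matrix(rnorm(n * 10), n, 10))  # columns V1..V10
pc <- prcomp(X, scale. = TRUE)

decode <- rpart(pc$x[, 1] ~ ., data = X)  # PC1 scores as the response
decode$variable.importance                # which variables drive PC1
```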


There are many variations on this theme including nonlinear principal 
components (e.g., the transcan function in the Hmisc package) which may 
explain more variance of the predictors.


Frank
--
Frank E Harrell Jr   Professor and Chair   School of Medicine
 Department of Biostatistics   Vanderbilt University



Re: [R] Logistic regression problem

2008-10-01 Thread Robert A LaBudde
It would not be possible to answer your original question until you
specify your goal.

Is it to develop a model with external validity that will generalize to
new data? (You are not likely to succeed if you are starting with a
"boil the ocean" approach with 44,000+ covariates and millions of
records.) This is the point Prof. Harrell is making.

Or is it to reduce a large dataset to a tractable predictor formula that
only interpolates your dataset?

If the former, you will need external modeling information to select the
"wheat from the chaff" in your excessive predictor set.

Assuming it is the latter, then almost any approach that ends up with a
tractable model (one that has no meaning other than interpolation of this
specific dataset) will be useful. For this, regression trees or even
stepwise regression would work. The algorithm must be very simple and
computationally efficient. This is the territory of data-mining approaches.

I would suggest you start by looking at covariate patterns to find out
where the scarcity lies. These will end up as high-leverage data points.

Another place to start is common sense: thousands of covariates cannot
all contain independent information of value. Try to cluster them and
pick the best representative from each cluster based on expert knowledge.
You may solve your problem quickly that way.
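A minimal sketch of that clustering idea in base R (simulated data; the cut height of 0.3 is an arbitrary choice, and the Hmisc package's varclus function offers a more polished version):

```r
## Sketch: cluster covariates by correlation, then keep one per cluster.
## Data simulated for illustration, with two deliberately redundant copies.
set.seed(1)
n <- 200
X <- as.data.frame(matrix(rnorm(n * 12), n, 12))  # columns V1..V12
X$V13 <- X$V1 + rnorm(n, sd = 0.1)  # near-duplicate of V1
X$V14 <- X$V2 + rnorm(n, sd = 0.1)  # near-duplicate of V2

## Distance = 1 - |correlation|: highly correlated variables sit close
d  <- as.dist(1 - abs(cor(X)))
cl <- cutree(hclust(d, method = "average"), h = 0.3)

## Variables grouped by cluster; pick one representative from each
split(names(cl), cl)
```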


At 05:34 AM 10/1/2008, Bernardo Rangel Tura wrote:

> [quoted thread snipped]



Robert A. LaBudde, PhD, PAS, Dpl. ACAFS  e-mail: [EMAIL PROTECTED]
Least Cost Formulations, Ltd.URL: http://lcfltd.com/
824 Timberlake Drive Tel: 757-467-0954
Virginia Beach, VA 23464-3239Fax: 757-467-2947

"Vere scire est per causas scire"




Re: [R] Logistic regression problem

2008-10-01 Thread Liaw, Andy
From: Frank E Harrell Jr

> [quoted text snipped]

While I agree with much of what Frank has said, I'd like to add some points.

Variable selection is a treacherous business whether one is interested in
prediction or inference.  If the goal is inference, Frank's book is a
must-read, IMHO.  (It's great for predictive model building, too.)

If interaction is of high interest, principal components are not going
to give you that.

Regarding cutpoint selection: the machine learners have found that
the `optimal' split point for a continuous predictor in tree algorithms
is extremely variable, so that interpreting it would be risky at best.
Breiman essentially gave up on interpretation of a single tree when he
went to random forests.

Best,
Andy

 


Re: [R] Logistic regression problem

2008-10-01 Thread Pedro.Rodriguez
Hi Bernardo,

Do you have to use logistic regression? If not, try random forests... They 
have worked for me in past situations where I had to analyze huge datasets.

Some want to understand the data-generating process with a simple linear 
equation; others want high generalization power. It is your call... See, e.g.,
www.cis.upenn.edu/group/datamining/ReadingGroup/papers/breiman2001.pdf.

Maybe you are also interested in AD-HOC, an algorithm for feature selection, 
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.99.9130


Regards,

Pedro

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Liaw, Andy
Sent: Wednesday, October 01, 2008 12:01 PM
To: Frank E Harrell Jr; [EMAIL PROTECTED]
Cc: r-help@r-project.org
Subject: Re: [R] Logistic regression problem

From: Frank E Harrell Jr

> [quoted text snipped]