Re: [R] Pre-model Variable Reduction

Frank E Harrell Jr Tue, 09 Dec 2008 09:22:03 -0800

Mark Difford wrote:

Hi All,


I beg to differ with Ravi Varadhan's perspective. While it is true that
principal component analysis does not itself do variable selection, it is an
important method for pointing the way to what to select. This is what the
methods in the subselect package rely on. (One of its authors was I believe
a student of Jolliffe's). For a modern perspective on this, see the
following paper:

Debashis Paul, Eric Bair, Trevor Hastie and Robert Tibshirani:
"Preconditioning" for feature selection and regression in high-dimensional
problems. We show that supervised principal components followed by a variable
selection procedure is an effective approach for variable selection in very
high dimension. Annals of Statistics 36(4), 2008, 1595-1618.

http://www-stat.stanford.edu/~hastie/Papers/Preconditioning_Annals.pdf

Regards, Mark.


Mark,

Slightly more relevant are the unsupervised sparse principal component methods described in the following references. If anyone knows of better references for this, please let me know. -Frank



@Article{zou06spa,
  author =               {Zou, Hui and Hastie, Trevor and Tibshirani, Robert},
  title =                {Sparse principal component analysis},
  journal =              {J Comp Graph Stat},
  year =                 2006,
  volume =               15,
  pages =                {265-286},
  annote =               {gene microarray;lasso/elastic net;multivariate
analysis;data reduction;singular value
decomposition;thresholding;principal components analysis that shrinks
some loadings to zero}
}
@Article{wit08tes,
  author =               {Witten, Daniela M. and Tibshirani, Robert},
  title =                {Testing significance of features by lassoed principal components},
  journal =              {Annals Appl Stat},
  year =                 2008,
  volume =               2,
  number =               3,
  pages =                {986-1012},
  annote =               {reduction in false discovery rates over using a vector of
t-statistics;borrowing strength across genes;``one would not expect a
single gene to be associated with the outcome, since, in practice, many
genes work together to effect a particular phenotype. LPC effectively
down-weights individual genes that are associated with the outcome but
that do not share an expression pattern with a larger group of genes,
and instead favors large groups of genes that appear to be
differentially-expressed.'';regress principal components on outcome}
}
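To make the idea of "principal components with some loadings shrunk to zero" concrete, here is a minimal base-R sketch. It is NOT the elastic-net algorithm of the first reference; it is a simpler rank-one scheme that alternates a power-iteration step with soft-thresholding of the loadings. The function names (soft_threshold, sparse_pc1), the frac tuning argument and the simulated data are all assumptions made for illustration.

```r
# Soft-thresholding operator: shrink towards zero, set small values to exactly 0.
soft_threshold <- function(a, thr) sign(a) * pmax(abs(a) - thr, 0)

# Hypothetical helper: first sparse loading vector via thresholded power iteration.
# `frac` (assumed tuning parameter, 0 <= frac < 1) sets the threshold as a
# fraction of the largest absolute unthresholded loading in each iteration.
sparse_pc1 <- function(X, frac = 0.5, n_iter = 200, tol = 1e-8) {
  stopifnot(frac >= 0, frac < 1)
  Xs <- scale(as.matrix(X))              # centre and scale; assumes no missing values
  v <- svd(Xs)$v[, 1]                    # start from the ordinary first PC
  for (i in seq_len(n_iter)) {
    u <- Xs %*% v
    u <- u / sqrt(sum(u^2))              # unit-length score vector
    a <- drop(crossprod(Xs, u))          # unthresholded loadings
    v_new <- soft_threshold(a, frac * max(abs(a)))
    v_new <- v_new / sqrt(sum(v_new^2))  # renormalise to unit length
    converged <- max(abs(v_new - v)) < tol
    v <- v_new
    if (converged) break
  }
  setNames(v, colnames(Xs))
}

# Simulated predictors: x1-x3 share a latent factor, x4-x5 are pure noise.
set.seed(4)
n <- 200
z <- rnorm(n)
X <- data.frame(
  x1 = z + rnorm(n, sd = 0.3),
  x2 = z + rnorm(n, sd = 0.3),
  x3 = z + rnorm(n, sd = 0.3),
  x4 = rnorm(n),
  x5 = rnorm(n)
)
v <- sparse_pc1(X, frac = 0.5)
round(v, 3)
```

Variables whose loading comes out exactly zero are candidates for dropping. The choice of frac is arbitrary here and would need tuning on real data.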



Ravi Varadhan wrote:

Principal components analysis does "dimensionality reduction" but NOT
"variable reduction".  However, Jolliffe's 2004 book on PCA does discuss the
problem of selecting a subset of variables, with the goal of representing
the internal variation of original multivariate vector as well as possible
(see Section 6.3 of that book).  I do not think that these methods can
handle missing data.  The most important issue is to think about the goal of
variable reduction and then choose an appropriate optimality criterion for
achieving that goal.  In most instances of variable selection, the criterion
that is optimized is never explicitly considered.
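As a rough illustration of using PCA to pick actual variables rather than components, the sketch below keeps, for each of the first k components, the not-yet-chosen variable with the largest absolute loading. This is one simple loading-based heuristic and is not claimed to reproduce any particular method from Section 6.3. The helper name select_by_loadings, the value of k and the simulated data are assumptions for illustration, and the code assumes complete data.

```r
# Hypothetical helper: for each of the first k principal components, keep the
# not-yet-chosen variable with the largest absolute loading.
select_by_loadings <- function(X, k) {
  stopifnot(k >= 1, k <= ncol(X))
  pc <- prcomp(X, center = TRUE, scale. = TRUE)   # assumes no missing values
  abs_load <- abs(pc$rotation)                    # rows = variables, cols = components
  chosen <- character(0)
  for (j in seq_len(k)) {
    candidates <- setdiff(rownames(abs_load), chosen)
    best <- candidates[which.max(abs_load[candidates, j])]
    chosen <- c(chosen, best)
  }
  chosen
}

# Simulated predictors: x1-x3 share one latent factor, x4-x5 share another.
set.seed(3)
n <- 300
z1 <- rnorm(n)
z2 <- rnorm(n)
X <- data.frame(
  x1 = z1 + rnorm(n, sd = 0.3),
  x2 = z1 + rnorm(n, sd = 0.3),
  x3 = z1 + rnorm(n, sd = 0.3),
  x4 = z2 + rnorm(n, sd = 0.3),
  x5 = z2 + rnorm(n, sd = 0.3)
)
chosen <- select_by_loadings(X, k = 2)
chosen
```

Here the implicit optimality criterion is "represent the leading components", which is exactly the kind of choice that should be made explicit before selecting variables.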

Ravi.

-----------------------------------------------------------------------------------

Ravi Varadhan, Ph.D.

Assistant Professor, The Center on Aging and Health

Division of Geriatric Medicine and Gerontology

Johns Hopkins University

Ph: (410) 502-2619

Fax: (410) 614-9625

Email: [EMAIL PROTECTED]

Webpage:  http://www.jhsph.edu/agingandhealth/People/Faculty/Varadhan.html

-----------------------------------------------------------------------------------


-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Gabor Grothendieck
Sent: Tuesday, December 09, 2008 8:00 AM
To: Harsh
Cc: r-help@r-project.org
Subject: Re: [R] Pre-model Variable Reduction

See:

?prcomp
?princomp
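For readers who have not used these functions, a minimal sketch of prcomp() on predictors only (no response) might look like the following. The simulated data and the 80% cumulative-variance threshold are arbitrary assumptions for illustration, not a recommendation.

```r
# Simulated predictors only (no response), purely for illustration.
set.seed(1)
n <- 200
z <- rnorm(n)
X <- data.frame(
  x1 = z + rnorm(n, sd = 0.3),
  x2 = z + rnorm(n, sd = 0.3),
  x3 = rnorm(n),
  x4 = rnorm(n)
)

# Centre and scale so that variables measured on different scales are comparable.
pc <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of total variance carried by each component.
prop_var <- pc$sdev^2 / sum(pc$sdev^2)
cum_var  <- cumsum(prop_var)

# Smallest number of components reaching the (arbitrary) 80% threshold.
k <- which(cum_var >= 0.80)[1]

# Component scores that could stand in for the original columns.
scores <- pc$x[, seq_len(k), drop = FALSE]
dim(scores)
```

Note that the scores are linear combinations of all the original columns, so this reduces dimension without removing any variable from the data collection step.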

On Tue, Dec 9, 2008 at 5:34 AM, Harsh <[EMAIL PROTECTED]> wrote:

Hello All,
I am trying to carry out variable reduction. I do not have information about the dependent variable, and have only the X variables as it were.
In selecting variables I wish to keep, I have considered the following criteria.

1) Percentage of missing value in each column/variable
2) Variance of each variable, with a cut-off value.
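A minimal base-R sketch of these two screening criteria is given below. The helper name filter_columns, the cut-offs (30% missing, variance of 1e-8) and the toy data are assumptions chosen for illustration. The variance rule is applied to numeric columns only, and since variance depends on the measurement scale, a single cut-off across differently scaled variables should be used with care.

```r
# Hypothetical helper: keep columns whose missing fraction is at most
# `max_missing` and, for numeric columns, whose variance is at least `min_var`.
filter_columns <- function(df, max_missing = 0.3, min_var = 1e-8) {
  miss_frac <- colMeans(is.na(df))                # fraction of NAs per column
  keep_missing <- miss_frac <= max_missing
  keep_var <- vapply(df, function(col) {
    if (!is.numeric(col)) return(TRUE)            # variance rule skipped for factors etc.
    v <- var(col, na.rm = TRUE)
    !is.na(v) && v >= min_var
  }, logical(1))
  df[, keep_missing & keep_var, drop = FALSE]
}

# Toy data: one ordinary column, one constant, one mostly missing, one factor.
set.seed(2)
toy <- data.frame(
  a = rnorm(50),
  b = rep(3, 50),                                 # zero variance
  c = c(rnorm(10), rep(NA, 40)),                  # 80% missing
  d = factor(sample(c("x", "y", "z"), 50, replace = TRUE))
)
kept <- filter_columns(toy)
names(kept)
```

In this toy example the constant column b and the mostly missing column c are dropped, while a and the factor d are kept.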
I recently came across Weka and found that there is an RWeka package which would allow me to make use of Weka through R. Weka provides a "Genetic search" variable reduction method, but I could not find its R code implementation in the RWeka PDF file on CRAN.
I looked for other R packages that allow me to do variable reduction without considering a dependent variable. I came across 'dprep'
package but it does not have a Windows implementation.
Moreover, I have a dataset that contains continuous and categorical variables, some categorical variables having 3 levels, 10 levels and so on, till a max 50 levels (E.g. States in the USA).
Any suggestions in this regard will be much appreciated.

Thank you

Harsh Singhal
Decision Systems,
Mu Sigma, Inc.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University
