[R] [R-pkgs] New version 0.9-7 of lars package

2007-05-16 Thread Trevor Hastie
I uploaded a new version of the lars package to CRAN,
which incorporates some nontrivial changes.

1) lars now has normalize and intercept options, both defaulting to
TRUE, which means the variables are scaled to have unit Euclidean norm
and an intercept is included in the model. Either or both can be set to
FALSE.

2) lars has an additional type = "stepwise" option;
the full list is now type = c("lasso", "lar", "forward.stagewise", "stepwise").
This was included because it is trivial to implement, and useful for
comparisons (see the sketch after this list). "stepwise" is a version
of forward stepwise regression, where the variable to enter is the one
most correlated with the residuals. This is not necessarily the same as
the forward stepwise implemented as part of step() in R, where the
variable entered is the one that, when included, reduces the RSS the
most.

3) A method for summary() has been included, which gives an anova-type
summary of the sequence of steps.

4) The plot method for lars defaults to plotting coefficients against
the relative L1 norm of the coefficients. This was not done correctly
in general for types "lar" and "forward.stagewise", since the L1 norm
does not change smoothly if coefficients pass through zero. This has
been fixed.

5) A small number of other changes have been made, some in response to
email messages from users.
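
For reference, a minimal sketch of the new options in use (simulated
data of my own, not from the package; assumes lars >= 0.9-7 from CRAN):

  library(lars)
  set.seed(1)
  x <- matrix(rnorm(100 * 10), 100, 10)  # 100 cases, 10 predictors
  y <- x[, 1] - 2 * x[, 2] + rnorm(100)
  # correlation-based forward stepwise; normalize and intercept default to TRUE
  fit <- lars(x, y, type = "stepwise", normalize = TRUE, intercept = TRUE)
  summary(fit)  # anova-type summary of the steps, as in (3)
  plot(fit)     # coefficients against the relative L1 norm, as in (4)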
 
Thanks to Yann-Ael Le Borgne for pointing out the problem in (4) and
proposing a solution, and to Lukas Meier for reporting some bugs.
Please let me know of any new problems, or old ones not yet repaired.

Trevor Hastie


  Trevor Hastie  [EMAIL PROTECTED]
  Professor & Chair, Department of Statistics, Stanford University
  Phone: (650) 725-2231 (Statistics) Fax: (650) 725-8977
 (650) 498-5233 (Biostatistics)  Fax: (650) 725-6951
  URL: http://www-stat.stanford.edu/~hastie
  address: room 104, Department of Statistics, Sequoia Hall
  390 Serra Mall, Stanford University, CA 94305-4065



Re: [R] contingency table analysis; generalized linear model

2007-01-10 Thread Trevor Hastie
 Date: Tue, 9 Jan 2007 11:13:41 + (GMT)
 From: Mark Difford [EMAIL PROTECTED]
 Subject: Re: [R] contingency table analysis; generalized linear model

 Dear List,

 I would appreciate help on the following matter:

 I am aware that higher-dimensional contingency tables can be
 analysed either with log-linear models or as a Poisson regression
 using a generalized linear model:

 log-linear:
 loglm(~ Age + Site, data = xtabs(~ Age + Site, data = SSites.Rev,
     drop.unused.levels = TRUE))

 GLM:
 glm.table <- as.data.frame(xtabs(~ Age + Site, data = SSites.Rev,
     drop.unused.levels = TRUE))
 glm(Freq ~ Age + Site, data = glm.table, family = "poisson")

 where Site is a factor and Age is cast as a factor by xtabs() and  
 treated as such.

 **Question**:
 Is it acceptable to step away from contingency table analysis by  
 recasting Age as a numerical variable, and redoing the analysis as:

 glm(Freq ~ as.numeric(Age) + Site, data=glm.table, family='poisson')

 My reasons for wanting to do this are to be able to include non-linear
 terms in the model, using say restricted or natural cubic splines.

 Thank you in advance for your help.
 Regards,
 Mark Difford.


 ---
 Mark Difford
 Ph.D. candidate, Botany Department,
 Nelson Mandela Metropolitan University,
 Port Elizabeth, SA.

Yes it is, and it is often the preferred way to view the analysis.
In this case it looks like Freq is measuring something like species
abundance, and it is natural to model this as a Poisson count via a
log-link glm. As such you are free to include any reasonable functions
of your predictors in modeling the mean.

Log-linear models are typically presented as ways of analyzing
dependence between categorical variables, when represented as
multi-way tables. The appropriate multinomial models, conditioning on
certain marginals, happen to be equivalent to Poisson glms with
appropriate terms included.

I would suggest in your data preparation that you
glm.table[, "Age"] <- as.numeric(glm.table[, "Age"])
at the start, so that now you can think of your data in the right way.
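
For instance, a minimal sketch of the recast analysis with a smooth
term in Age (my illustration; it assumes glm.table as constructed
above, and uses the splines package shipped with R):

  library(splines)
  glm.table[, "Age"] <- as.numeric(glm.table[, "Age"])
  # natural cubic spline in Age with 3 df; Site stays a factor
  fit <- glm(Freq ~ ns(Age, df = 3) + Site, data = glm.table,
             family = poisson)
  summary(fit)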

Trevor Hastie







[R] Statistical Learning and Datamining Course

2006-03-06 Thread Trevor Hastie
Short course: Statistical Learning and Data Mining II:
 tools for tall and wide data

Trevor Hastie and Robert Tibshirani, Stanford University

Sheraton Hotel,
Palo Alto, California,
April 3-4, 2006.

This two-day course gives a detailed overview of statistical models for
data mining, inference and prediction.  With the rapid developments in
internet technology, genomics, financial risk modeling, and other
high-tech industries, we rely increasingly on data analysis and
statistical models to exploit the vast amounts of data at our fingertips.

This course is the third in a series, and follows our popular past
offerings "Modern Regression and Classification" and "Statistical
Learning and Data Mining".

The two earlier courses are not a prerequisite for this new course.

In this course we emphasize the tools useful for tackling modern-day
data analysis problems. We focus on both tall data (N >> p, where
N = #cases and p = #features) and wide data (p >> N). The tools include
gradient boosting, SVMs and kernel methods, random forests, lasso and
LARS, ridge regression and GAMs, supervised principal components, and
cross-validation.  We also present some interesting case studies in a
variety of application areas. All our examples are developed using the
S language, and most of the procedures we discuss are implemented in
publicly available R packages.

Please visit the site
http://www-stat.stanford.edu/~hastie/sldm.html
for more information and registration details.






[R] Data Mining Course

2006-01-14 Thread Trevor Hastie
Short course: Statistical Learning and Data Mining II:
 tools for tall and wide data

Trevor Hastie and Robert Tibshirani, Stanford University

Sheraton Hotel,
Palo Alto, California,
April 3-4, 2006.

This two-day course gives a detailed overview of statistical models for
data mining, inference and prediction.  With the rapid developments in
internet technology, genomics, financial risk modeling, and other
high-tech industries, we rely increasingly on data analysis and
statistical models to exploit the vast amounts of data at our fingertips.

This course is the third in a series, and follows our popular past
offerings "Modern Regression and Classification" and "Statistical
Learning and Data Mining".

The two earlier courses are not a prerequisite for this new course.

In this course we emphasize the tools useful for tackling modern-day
data analysis problems. We focus on both tall data (N >> p, where
N = #cases and p = #features) and wide data (p >> N). The tools include
gradient boosting, SVMs and kernel methods, random forests, lasso and
LARS, ridge regression and GAMs, supervised principal components, and
cross-validation.  We also present some interesting case studies in a
variety of application areas. All our examples are developed using the
S language, and most of the procedures we discuss are implemented in
publicly available R packages.

Please visit the site
http://www-stat.stanford.edu/~hastie/sldm.html
for more information and registration details.






[R] Age of an object?

2005-12-13 Thread Trevor Hastie
It would be nice to have a date stamp on an object.
In S/Splus this was always available, because objects were files.

I have looked around, but I presume this information is not available.
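
One workaround, sketched here (my own habit, not a built-in feature:
you must remember to stamp the object yourself when you create it):

  fit <- lm(dist ~ speed, data = cars)  # any object
  attr(fit, "created") <- Sys.time()    # manual date stamp
  attr(fit, "created")                  # retrieve it later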





[R] [R-pkgs] glmpath: L1 regularization path for glms

2005-11-28 Thread Trevor Hastie
We have uploaded to CRAN the first version of glmpath,
which fits the L1 regularization path for generalized linear models.

The lars package fits the entire piecewise-linear L1 regularization
path for the lasso. The coefficient paths for L1-regularized glms,
however, are not piecewise linear. glmpath uses convex optimization --
in particular predictor-corrector methods -- to fit the coefficient
path at the important junctions: the knots in |beta| where variables
enter or leave the active set, i.e. where coefficients change between
zero and nonzero values. Users can request greater resolution, at a
cost of more computation, and compute values on a fine grid between
the knots.

The code is fast, and can handle largish problems efficiently. It
took just over 4 sec of system cpu time to fit the logistic regression
path for the spam data from UCI, with 3065 training observations and
57 predictors. A microarray example with 5000 variables and 100
observations took 11 seconds of cpu time.

Currently glmpath implements binomial, poisson and gaussian families.
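
A hypothetical minimal call, based on the description above (simulated
data; argument names assumed -- check ?glmpath for the actual
interface):

  library(glmpath)
  set.seed(1)
  x <- matrix(rnorm(200 * 10), 200, 10)
  y <- rbinom(200, 1, plogis(x[, 1] - x[, 2]))
  fit <- glmpath(x, y, family = binomial)  # logistic regularization path
  plot(fit)                                # coefficients along the path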

Mee Young Park and Trevor Hastie







Re: [R] step.gam- question

2005-10-12 Thread Trevor Hastie
This is covered in the help file, but perhaps not clearly enough.
The gam chapter in the white book has more details.

step.gam moves around the terms in the scope argument in an ordered
fashion. So if a scope element is

~ 1 + x + s(x,4) + s(x,8)

and the formula at some stage contains the term x, then with
direction = "both" the routine checks both 1 and s(x,4) (i.e. one move
up or down the hierarchy), and does not check s(x,8).

With direction = "forward", it will only look at s(x,4), and so on.

This ordered behaviour was imposed in order to put some structure on
the search, and to reduce the computational and variance overhead of a
complete search.
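
For concreteness, a minimal sketch of such an ordered scope
(hypothetical data and variable names, not from the original exchange):

  library(gam)
  set.seed(1)
  d <- data.frame(x = runif(100), z = runif(100))
  d$y <- rbinom(100, 1, plogis(2 * d$x))
  fit <- gam(y ~ x + z, family = binomial, data = d)
  # each scope element orders candidate terms from simplest to most complex
  sfit <- step.gam(fit,
                   scope = list("x" = ~ 1 + x + s(x, 4) + s(x, 8),
                                "z" = ~ 1 + z + s(z, 4)),
                   direction = "both")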

[EMAIL PROTECTED] wrote:

Dear Professor Hastie,


I asked a question on r-help@stat.math.ethz.ch and was told it would be
better to contact you about my problem.

I'm working with step.gam in the gam package. I'm interested in both
spline and loess functions, and when I define all the models I'm
interested in I get something like this:

  

gam.object.ALC <- gam(X143S ~ ALC, data = dane, family = binomial)



step.gam.ALC <- step.gam(gam.object.ALC, scope = list(ALC = ~ 1 + ALC +
    s(ALC,2) + s(ALC,3) + s(ALC,4) + s(ALC,6) + s(ALC,8) +
    lo(ALC, degree = 1, span = .5) + lo(ALC, degree = 2, span = .5) +
    lo(ALC, degree = 1, span = .25) + lo(ALC, degree = 2, span = .25)))
 Start:  X143S ~ ALC; AIC= 104.0815
 Trial:  X143S ~  1; AIC= 111.1054
 Trial:  X143S ~  s(ALC, 2); AIC= 103.3325
 Step :  X143S ~ s(ALC, 2) ; AIC= 103.3325

 Trial:  X143S ~  s(ALC, 3); AIC= 102.9598
 Step :  X143S ~ s(ALC, 3) ; AIC= 102.9598

 Trial:  X143S ~  s(ALC, 4); AIC= 102.2103
 Step :  X143S ~ s(ALC, 4) ; AIC= 102.2103

 Trial:  X143S ~  s(ALC, 6); AIC= 102.4548

I have the impression that the algorithm stops when the next trial
gives a higher AIC, without examining further functions. When I deleted
some of the spline functions that were worse than s(ALC,4), I got:

  

step.gam.ALC <- step.gam(gam.object.ALC, scope = list(ALC = ~ 1 + ALC +
    s(ALC,4) + lo(ALC, degree = 1, span = .5) + lo(ALC, degree = 2, span = .5) +
    lo(ALC, degree = 1, span = .25) + lo(ALC, degree = 2, span = .25)))
 Start:  X143S ~ ALC; AIC= 104.0815
 Trial:  X143S ~  1; AIC= 111.1054
 Trial:  X143S ~  s(ALC, 4); AIC= 102.2103
 Step :  X143S ~ s(ALC, 4) ; AIC= 102.2103

 Trial:  X143S ~  lo(ALC, degree = 1, span = 0.5); AIC= 99.8127
 Step :  X143S ~ lo(ALC, degree = 1, span = 0.5) ; AIC= 99.8127

 Trial:  X143S ~  lo(ALC, degree = 2, span = 0.5); AIC= 100.5275

Loess turned out to be better in this situation. Is there any way to
examine all the models, without stopping when the AIC is higher in the
next trial? How should I handle this problem?

I'd be grateful for any advice.

best regards

Agnieszka Strzelczak, MSC

PhD fellow
Ministry of the Environment
National Environmental Research Institute
Velsøvej 25
P.O. Box 314
DK-8600 Silkeborg
Denmark
Phone +45 89 20 14 00
Fax +45 89 20 14 14
e-mail: [EMAIL PROTECTED]

PhD student
Institute of Chemistry and Environmental Protection
Szczecin University of Technology
Aleja Piastow 42
71-065 Szczecin
Phone +48 91 449 45 35
e-mail: [EMAIL PROTECTED]
  






[R] Data Modeling Short Course

2005-09-25 Thread Trevor Hastie
Short course: Statistical Learning and Data Mining II:
 tools for tall and wide data

Trevor Hastie and Robert Tibshirani, Stanford University

The Conference Center at Harvard Medical School
Boston, MA,
Oct 31-Nov 1, 2005

This is a *new* two-day course on statistical models for data mining,
inference and prediction. It is the third in a series, and follows our
past offerings "Modern Regression and Classification" and "Statistical
Learning and Data Mining".

In this course we emphasize the tools useful for tackling modern-day
data analysis problems. We focus on both tall data (N >> p, where
N = #cases and p = #features) and wide data (p >> N). The tools include
gradient boosting, SVMs and kernel methods, random forests, lasso and
LARS, ridge regression and GAMs, supervised principal components, and
cross-validation.  We also present some interesting case studies in a
variety of application areas. All our examples are developed using the
S language, and most of the procedures we discuss are implemented in
publicly available R packages.

Please visit the site
http://www-stat.stanford.edu/~hastie/sldm.html
for more information and registration details.




[R] Attractive position at Stanford for statistician into computing

2005-08-23 Thread Trevor Hastie
The Stanford University Statistics Department is looking to hire a
computer systems specialist. We are targeting someone with an MS or
PhD in statistics who is adept at, and interested in, computing. We
are very active in R and the S language, have linux, pc and mac
platforms, and like to think we are at the cutting edge of technology.

For more details, see the link on the department web page:

http://www-stat.stanford.edu/cssad.html

Trevor Hastie





[R] new data mining course

2005-08-19 Thread Trevor Hastie
Short course: Statistical Learning and Data Mining II:
tools for tall and wide data

Trevor Hastie and Robert Tibshirani, Stanford University

The Conference Center at Harvard Medical School
Boston, MA,
Oct 31-Nov 1, 2005

This is a *new* two-day course on statistical models
for data mining, inference and prediction. It is the third
in a series, and follows our past offerings "Modern Regression
and Classification" and "Statistical Learning and Data Mining".

In this course we emphasize the tools useful for tackling modern-day
data analysis problems. We focus on both tall data (N >> p, where
N = #cases and p = #features) and wide data (p >> N). The tools include
gradient boosting, SVMs and kernel methods, random forests, lasso and
LARS, ridge regression and GAMs, supervised principal components, and
cross-validation.  We also present some interesting case studies in a
variety of application areas. All our examples are developed using the
S language, and most of the procedures we discuss are implemented in
publicly available R packages.

Please visit the site
http://www-stat.stanford.edu/~hastie/sldm.html
for more information on the course and registration details.




[R] Digest reading is tedious

2005-08-09 Thread Trevor Hastie
Like many, I am sure, I get R-help in digest form. It's easy enough to
browse the subject lines, but if an entry interests you, you then have
to embark on a tedious search or scroll to find it. It would be great
to have a clickable digest, where the topics list is a set of pointers
and clicking on a topic takes you to that entry. I can think of at
least one way to do this via web pages, but I bet those with more web
skills than I have can come up with an elegant solution.


[R] Statistical Learning and Data Mining Course

2005-01-04 Thread Trevor Hastie
Short course: Statistical Learning and Data Mining

Trevor Hastie and Robert Tibshirani, Stanford University

Sheraton Hotel,
Palo Alto, California
February 24 & 25, 2005


This two-day course gives a detailed overview of statistical models
for data mining, inference and prediction.  With the rapid
developments in internet technology, genomics and other high-tech
industries, we rely increasingly on data analysis and statistical
models to exploit the vast amounts of data at our fingertips.

This sequel to our popular "Modern Regression and Classification"
course covers many new areas of unsupervised learning and data mining,
and gives an in-depth treatment of some of the hottest tools in
supervised learning.

The first course is not a prerequisite for this new course.
Most of the techniques discussed in the course are implemented by the
authors and others in the S language (S-plus or R), and all of the
examples were developed in S.

Day one focuses on state-of-the-art methods for supervised
learning, including PRIM, boosting, support vector machines,
and very recent work on least angle regression and the lasso.

Day two covers unsupervised learning, including clustering, principal
components, principal curves and self-organizing maps.  Many
applications will be discussed, including the analysis of DNA
expression arrays - one of the hottest new areas in biology!

###
Much of the material is based on the book:

Elements of Statistical Learning: data mining, inference and prediction

Hastie, Tibshirani & Friedman, Springer-Verlag, 2001

http://www-stat.stanford.edu/ElemStatLearn/

A copy of this book will be given to all attendees.

###

For more information, and to register, visit the course homepage:

http://www-stat.stanford.edu/~hastie/mrc.html






[R] Statistical Learning and Data Mining course

2004-08-25 Thread Trevor Hastie
Short course: Statistical Learning and Data Mining
 
Trevor Hastie and Robert Tibshirani, Stanford University
 
Georgetown University Conference Center
Washington DC
September 20-21, 2004
 
This two-day course gives a detailed overview of statistical models
for data mining, inference and prediction.  With the rapid
developments in internet technology, genomics and other high-tech
industries, we rely increasingly on data analysis and statistical
models to exploit the vast amounts of data at our fingertips.

This sequel to our popular "Modern Regression and Classification"
course covers many new areas of unsupervised learning and data mining,
and gives an in-depth treatment of some of the hottest tools in
supervised learning.
 
The first course is not a prerequisite for this new course.
Most of the techniques discussed in the course are implemented by the
authors and others in the S language (S-plus or R), and all of the
examples were developed in S.
 
Day one focuses on state-of-the-art methods for supervised
learning, including PRIM, boosting, support vector machines,
and very recent work on least angle regression and the lasso.
 
Day two covers unsupervised learning, including clustering, principal
components, principal curves and self-organizing maps.  Many
applications will be discussed, including the analysis of DNA
expression arrays - one of the hottest new areas in biology!
 
###
Much of the material is based on the book:
 
Elements of Statistical Learning: data mining, inference and prediction
 
Hastie, Tibshirani & Friedman, Springer-Verlag, 2001
 
http://www-stat.stanford.edu/ElemStatLearn/
 
A copy of this book will be given to all attendees.
 
###
 
For more information, and to register, visit the course homepage:
 
http://www-stat.stanford.edu/~hastie/mrc.html
 





[R] [R-pkgs] gam --- a new contributed package

2004-08-06 Thread Trevor Hastie
I have contributed a gam library to CRAN,
which implements Generalized Additive Models.

This implementation follows closely the description in the GAM chapter
(Chapter 7) of the white book, Statistical Models in S (Chambers &
Hastie, eds, 1992, Wadsworth), as well as the philosophy in Generalized
Additive Models (Hastie & Tibshirani, 1990, Chapman and Hall). Hence it
behaves pretty much like the Splus version of gam.

Note: this gam library and the functions therein are different from
the gam function in package mgcv, and the two libraries should not be
used simultaneously.

The gam library allows both local regression (loess) and smoothing
spline smoothers, and uses backfitting and local scoring to fit gams.
It also allows users to supply their own smoothing methods, which can
then be included in gam fits.

The gam function in mgcv uses only smoothing spline smoothers, with a
focus on automatic smoothing-parameter selection via GCV.

Some of the features of the gam library:

* full compatibility with the R functions glm and lm - a fitted gam
  inherits from classes glm and lm

* print, summary, anova, predict and plot methods are provided, as
  well as the usual extractor methods like coefficients, residuals, etc.

* the method step.gam provides a flexible and customizable approach to
  model selection.

Some differences with the Splus version of gam:

* predictions with new data are improved, without the need for the
  safe.predict.gam function. This was partly facilitated by the
  improved prediction strategy used in R for GLMs and LMs.

* Currently the only backfitting algorithm is all.wam. In earlier
  versions of gam, dedicated fortran routines fit models that had only
  smoothing spline terms (s.wam) or only local regression terms
  (lo.wam); these made calls back to Splus to update the working
  response and weights, and were designed for efficiency. With much
  faster computers this efficiency is no longer needed, and all.wam is
  modular and visible. A minimal fit is sketched after this list.
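
A minimal sketch of such a fit, mixing the two smoother types
(simulated data of my own):

  library(gam)
  set.seed(1)
  d <- data.frame(x = runif(200), z = runif(200))
  d$y <- sin(2 * pi * d$x) + d$z + rnorm(200, sd = 0.3)
  fit <- gam(y ~ s(x, 4) + lo(z, span = 0.5), data = d)  # spline + loess terms
  summary(fit)          # anova-style summary
  plot(fit, se = TRUE)  # fitted smooths with standard-error bands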

 
This package is numbered 0.9 in anticipation of a few bug fixes and
glitches. I have tested many aspects of the functions, but there are
always a few that slip by. I will be happy to hear of any problems,
bugs and suggestions.

Plans for future versions:

* exact standard error calculations. gam employs approximations as
  described in the white book. With a bit more computing (now
  possible), we will have a function that computes exact standard
  errors along the lines described in the GAM book page 127. 

Trevor Hastie





[R] Statistical Learning and Data Mining Course

2004-07-12 Thread Trevor Hastie
Short course: Statistical Learning and Data Mining
 
Trevor Hastie and Robert Tibshirani, Stanford University
 
Georgetown University Conference Center
Washington DC
September 20-21, 2004
 
This two-day course gives a detailed overview of statistical models
for data mining, inference and prediction.  With the rapid
developments in internet technology, genomics and other high-tech
industries, we rely increasingly on data analysis and statistical
models to exploit the vast amounts of data at our fingertips.

This sequel to our popular "Modern Regression and Classification"
course covers many new areas of unsupervised learning and data mining,
and gives an in-depth treatment of some of the hottest tools in
supervised learning.
 
The first course is not a prerequisite for this new course.
Most of the techniques discussed in the course are implemented by the
authors and others in the S language (S-plus or R), and all of the
examples were developed in S.
 
Day one focuses on state-of-the-art methods for supervised
learning, including PRIM, boosting, support vector machines,
and very recent work on least angle regression and the lasso.
 
Day two covers unsupervised learning, including clustering, principal
components, principal curves and self-organizing maps.  Many
applications will be discussed, including the analysis of DNA
expression arrays - one of the hottest new areas in biology!
 
###
Much of the material is based on the book:
 
Elements of Statistical Learning: data mining, inference and prediction
 
Hastie, Tibshirani & Friedman, Springer-Verlag, 2001
 
http://www-stat.stanford.edu/ElemStatLearn/
 
A copy of this book will be given to all attendees.
 
###
 
For more information, and to register, visit the course homepage:
 
http://www-stat.stanford.edu/~hastie/mrc.html
 






[R] svmpath: fitting the entire SVM regularization path

2004-07-06 Thread Trevor Hastie
svmpath is a contributed package that fits the entire regularization path
for a two-class SVM model.

The SVM (with any kernel) has a regularization or cost parameter C,
which controls the amount of overlap at the soft margin. When the SVM
criterion is expressed as a hinge loss plus lambda times a quadratic
penalty, then lambda = 1/C. In many situations the choice of C can be
critical, and different regimes for C are called for as the other
kernel tuning parameters are changed.

Most software packages come with a default value for C (typically very
large), and the user is left to explore different values of C. It turns
out that the Lagrange multipliers which define the SVM solution for any
C are piecewise linear in C (and, more usefully, piecewise linear and
mostly piecewise constant in lambda). This means that we can compute
the entire sequence of solutions, for all values of C, exactly.
svmpath does this at essentially the same cost as fitting a single SVM
model with a specified value of C.

See the paper (joint work with Saharon Rosset, Ji Zhu and Rob Tibshirani)
http://www-stat.stanford.edu/~hastie/Papers/svmpath.pdf
for details.

This code has been tested on moderate-sized problems, with up to 1000
observations. The current version is not industry-ready; occasionally
it will run into situations where the steps are too small, leading to
machine-zero situations. Usually increasing the parameter eps from its
default 1e-10 will avoid this.
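
A hypothetical minimal call on simulated data (check ?svmpath for the
actual argument list):

  library(svmpath)
  set.seed(1)
  x <- matrix(rnorm(100 * 2), 100, 2)
  y <- rep(c(-1, 1), each = 50)      # two-class labels coded -1/+1
  x[y == 1, ] <- x[y == 1, ] + 1.5   # shift one class so they overlap a little
  fit <- svmpath(x, y)               # the whole path over the cost parameter
  plot(fit)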


Trevor Hastie






[R] problem with model.matrix

2004-06-24 Thread Trevor Hastie
This works:


 model.matrix(~ I(pos > 3), data = data.frame(pos = 1:5))
  (Intercept) I(pos > 3)TRUE
1           1              0
2           1              0
3           1              0
4           1              1
5           1              1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$`I(pos > 3)`
[1] "contr.treatment"


This does not:

 model.matrix(~ I(pos > 3), data = data.frame(pos = 1:2))
Error in `contrasts<-`(`*tmp*`, value = contr.treatment) :
  contrasts can be applied only to factors with 2 or more levels
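
The logical I(pos > 3) is treated as a factor, and with pos = 1:2 it
has the single level FALSE, hence the contrasts error. One workaround
(my note, a sketch) is to coerce the logical to numeric so that no
contrasts are involved:

 model.matrix(~ as.numeric(pos > 3), data = data.frame(pos = 1:2))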

 





[R] Re: missing values for mda package

2004-04-10 Thread Trevor Hastie
The mda package has no facilities for missing data.
Users are expected to supply clean data; i.e. any
missing value treatment should take place before using
any of the routines in the package.

In particular, our version of the mars function takes 
inputs x and y, which are assumed to have no missing values.
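
A minimal sketch of such pre-cleaning (my illustration; x and y stand
for the user's own predictor matrix and response):

  library(mda)
  ok <- complete.cases(x, y)                 # rows with no NAs in x or y
  fit <- mars(x[ok, , drop = FALSE], y[ok])  # mars on the clean subset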

The spam data were used to demonstrate mars in
Elements of Statistical Learning.
The spam data have no missing values, and can be obtained from
http://www-stat.stanford.edu/~tibs/ElemStatLearn/

Trevor Hastie





[R] Statistical Learning and Datamining course based on R/Splus tools

2004-01-07 Thread Trevor Hastie
Short course: Statistical Learning and Data Mining
 
Trevor Hastie and Robert Tibshirani, Stanford University
 
Sheraton Hotel
Palo Alto, CA 
Feb 26-27, 2004
 
This two-day course gives a detailed overview of statistical models
for data mining, inference and prediction.  With the rapid
developments in internet technology, genomics and other high-tech
industries, we rely increasingly on data analysis and statistical
models to exploit the vast amounts of data at our fingertips.

This sequel to our popular "Modern Regression and Classification"
course covers many new areas of unsupervised learning and data mining,
and gives an in-depth treatment of some of the hottest tools in
supervised learning.
 
The first course is not a prerequisite for this new course.
All of the techniques discussed in the course are implemented by the
authors and others in the S language (S-plus or R). 
 
Day one focuses on state-of-the-art methods for supervised
learning, including PRIM, boosting, support vector machines,
and very recent work on least angle regression and the lasso.
 
Day two covers unsupervised learning, including clustering, principal
components, principal curves and self-organizing maps.  Many
applications will be discussed, including the analysis of DNA
expression arrays - one of the hottest new areas in biology!
 
###
Much of the material is based on the book:
 
Elements of Statistical Learning: data mining, inference and prediction
 
Hastie, Tibshirani & Friedman, Springer-Verlag, 2001
 
http://www-stat.stanford.edu/ElemStatLearn/
 
A copy of this book will be given to all attendees.
 
###
 
For more information, and to register, visit the course homepage:
 
http://www-stat.stanford.edu/~hastie/mrc.html
 






[R] Re: Logistic Regression

2003-09-13 Thread Trevor Hastie
Christoph Lehman had problems with separated data in two-class logistic regression.

One useful little trick is to penalize the logistic regression using a
quadratic penalty on the coefficients. I am sure there are functions in
the R contributed libraries to do this; otherwise it is easy to achieve
via IRLS using ridge regressions. Then, even though the data are
separated, the penalized log-likelihood has a unique maximum. One
intriguing feature is that as the penalty parameter goes to zero, the
solution converges to the SVM solution, i.e. the optimal separating
hyperplane; see http://www-stat.stanford.edu/~hastie/Papers/margmax1.ps
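
A minimal sketch of that IRLS/ridge recipe (my own illustration in
base R; for simplicity the intercept is penalized along with the other
coefficients):

  # quadratically penalized logistic regression via iterative ridge fits
  ridge.logit <- function(X, y, lambda, niter = 25) {
    X <- cbind(1, X)                    # prepend an intercept column
    p <- ncol(X)
    beta <- rep(0, p)
    for (i in seq_len(niter)) {
      eta <- drop(X %*% beta)
      mu  <- 1 / (1 + exp(-eta))        # fitted probabilities
      w   <- pmax(mu * (1 - mu), 1e-6)  # IRLS weights, kept away from zero
      z   <- eta + (y - mu) / w         # working response
      # ridge-penalized weighted least squares update
      beta <- solve(crossprod(X, w * X) + lambda * diag(p),
                    crossprod(X, w * z))
    }
    drop(beta)
  }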



