[R] [R-pkgs] New version 0.9-7 of lars package
I uploaded a new version of the lars package to CRAN, which incorporates some nontrivial changes.

1) lars now has normalize and intercept options, both defaulting to TRUE, which means the variables are scaled to have unit Euclidean norm and an intercept is included in the model. Either or both can be set to FALSE.

2) lars has an additional type="stepwise" option; the list is now type=c("lasso", "lar", "forward.stagewise", "stepwise"). This was included because it is trivial to implement and useful for comparisons. Stepwise is a version of forward stepwise regression in which the variable to enter is the one most correlated with the residuals. This is not necessarily the same as the forward stepwise implemented as part of step() in R, where the variable entered is the one that, when included, reduces the RSS the most.

3) A method for summary() has been included, which gives an anova-type summary of the sequence of steps.

4) The plot method for lars defaults to plotting coefficients against the relative L1 norm of the coefficients. This was not done correctly in general for types "lar" and "forward.stagewise", since the L1 norm does not change smoothly if coefficients pass through zero. This has been fixed.

5) A small number of other changes have been made, some in response to email messages from users.

Thanks to Yann-Ael Le Borgne for pointing out the problem in (4) and proposing a solution, and to Lukas Meier for reporting some bugs. Please let me know of any new problems, or old ones not yet repaired.

Trevor Hastie
Professor & Chair, Department of Statistics, Stanford University
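A minimal usage sketch of the new options (not part of the original announcement; the diabetes data ship with lars):

    library(lars)
    data(diabetes)                            # example data included in the package
    fit <- lars(diabetes$x, diabetes$y, type = "stepwise",
                normalize = TRUE, intercept = TRUE)
    summary(fit)                              # anova-type summary of the steps
    plot(fit)                                 # coefficients vs relative L1 norm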
Re: [R] contingency table analysis; generalized linear model
Date: Tue, 9 Jan 2007 11:13:41 +0000 (GMT)
From: Mark Difford
Subject: Re: [R] contingency table analysis; generalized linear model

Dear List,

I would appreciate help on the following matter. I am aware that higher-dimensional contingency tables can be analysed using either log-linear models or as a Poisson regression using a generalized linear model:

log-linear:

    loglm(~Age+Site, data=xtabs(~Age+Site, data=SSites.Rev, drop.unused.levels=TRUE))

GLM:

    glm.table <- as.data.frame(xtabs(~Age+Site, data=SSites.Rev, drop.unused.levels=TRUE))
    glm(Freq ~ Age + Site, data=glm.table, family='poisson')

where Site is a factor and Age is cast as a factor by xtabs() and treated as such.

**Question**: Is it acceptable to step away from contingency table analysis by recasting Age as a numerical variable, and redoing the analysis as:

    glm(Freq ~ as.numeric(Age) + Site, data=glm.table, family='poisson')

My reasons for wanting to do this are to be able to include non-linear terms in the model, using say restricted or natural cubic splines. Thank you in advance for your help.

Regards, Mark Difford
Ph.D. candidate, Botany Department, Nelson Mandela Metropolitan University, Port Elizabeth, SA

---

Yes it is, and it is often the preferred way to view the analysis. In this case it looks like Freq is measuring something like species abundance, and it is natural to model this as a Poisson count via a log-link GLM. As such you are free to include any reasonable functions of your predictors in modeling the mean. Log-linear models are typically presented as ways of analyzing dependence between categorical variables represented as multi-way tables. The appropriate multinomial models, conditioning on certain marginals, happen to be equivalent to Poisson GLMs with appropriate terms included. I would suggest in your data preparation that you run

    glm.table[, "Age"] <- as.numeric(glm.table[, "Age"])

at the start, so that you can think of your data in the right way.

Trevor Hastie
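A hedged sketch of the spline extension the poster asks about, using ns() from the splines package (the 4 degrees of freedom are an arbitrary illustrative choice, and glm.table is assumed to be prepared as above):

    library(splines)
    glm.table$Age <- as.numeric(glm.table$Age)      # recast Age as numeric, as suggested
    fit <- glm(Freq ~ ns(Age, df = 4) + Site,       # natural cubic spline in Age
               data = glm.table, family = poisson)
    summary(fit)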
[R] Statistical Learning and Datamining Course
Short course: Statistical Learning and Data Mining II: tools for tall and wide data
Trevor Hastie and Robert Tibshirani, Stanford University
Sheraton Hotel, Palo Alto, California, April 3-4, 2006

This two-day course gives a detailed overview of statistical models for data mining, inference and prediction. With the rapid developments in internet technology, genomics, financial risk modeling, and other high-tech industries, we rely increasingly on data analysis and statistical models to exploit the vast amounts of data at our fingertips.

This course is the third in a series, and follows our popular past offerings Modern Regression and Classification, and Statistical Learning and Data Mining. The two earlier courses are not a prerequisite for this new course.

In this course we emphasize the tools useful for tackling modern-day data analysis problems. We focus on both tall data (N >> p, where N = #cases and p = #features) and wide data (p >> N). The tools include gradient boosting, SVMs and kernel methods, random forests, lasso and LARS, ridge regression and GAMs, supervised principal components, and cross-validation. We also present some interesting case studies in a variety of application areas.

All our examples are developed using the S language, and most of the procedures we discuss are implemented in publicly available R packages.

Please visit http://www-stat.stanford.edu/~hastie/sldm.html for more information and registration details.
[R] Age of an object?
It would be nice to have a date stamp on an object. In S/Splus this was always available, because objects were files. I have looked around, but I presume this information is not available.

-- Trevor Hastie, Professor, Department of Statistics, Stanford University
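One workaround, offered here as a sketch rather than a built-in facility (stamp, fit and dat are hypothetical names): attach a timestamp attribute at creation time.

    stamp <- function(x) { attr(x, "created") <- Sys.time(); x }
    fit <- stamp(lm(y ~ x, data = dat))   # dat is a hypothetical data frame
    attr(fit, "created")                  # retrieve the date stamp later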
[R] [R-pkgs] glmpath: L1 regularization path for glms
We have uploaded to CRAN the first version of glmpath, which fits the L1 regularization path for generalized linear models.

The lars package fits the entire piecewise-linear L1 regularization path for the lasso. The coefficient paths for L1-regularized GLMs, however, are not piecewise linear. glmpath uses convex optimization, in particular predictor-corrector methods, to fit the coefficient path at the important junctions: the knots in |beta| where variables enter or leave the active set, i.e. move between nonzero and zero values. Users can request greater resolution at the cost of more computation, and compute values on a fine grid between the knots.

The code is fast, and can handle largish problems efficiently. It took just over 4 seconds of CPU time to fit the logistic regression path for the spam data from UCI, with 3065 training observations and 57 predictors. A microarray example with 5000 variables and 100 observations took 11 seconds of CPU time.

Currently glmpath implements the binomial, poisson and gaussian families.

Mee Young Park and Trevor Hastie
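A minimal sketch on simulated data (not from the announcement; argument details may differ across glmpath versions):

    library(glmpath)
    set.seed(1)
    x <- matrix(rnorm(100 * 10), 100, 10)            # 100 observations, 10 predictors
    y <- rbinom(100, 1, plogis(x[, 1] - x[, 2]))     # binary response
    fit <- glmpath(x, y, family = binomial)
    plot(fit)                                        # coefficient paths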
Re: [R] step.gam- question
This is covered in the help file, but perhaps not clearly enough. The gam chapter (Chapter 7) in the white book has more details.

step.gam moves around the terms in the scope argument in an ordered fashion. So if a scope element is

    ~ 1 + x + s(x,4) + s(x,8)

and the formula at some stage is ~ x, then with direction="both" the routine checks both 1 and s(x,4) (i.e. one move up or down the hierarchy), and does not check s(x,8). With direction="forward", it will only look at s(x,4), and so on. This ordered behaviour was imposed in order to put some structure on the search, and to reduce the computational and variance overhead of a complete search.

[EMAIL PROTECTED] wrote:

Dear Professor Hastie,

I asked a question on r-help@stat.math.ethz.ch and I was told it would be better to contact you about my problem. I am working with step.gam in the gam package. I am interested in both spline and loess functions, and when I define all the models that I am interested in I get something like this:

    gam.object.ALC <- gam(X143S ~ ALC, data=dane, family=binomial)
    step.gam.ALC <- step.gam(gam.object.ALC,
      scope=list(ALC = ~1+ALC+s(ALC,2)+s(ALC,3)+s(ALC,4)+s(ALC,6)+s(ALC,8)+
        lo(ALC,degree=1,span=.5)+lo(ALC,degree=2,span=.5)+
        lo(ALC,degree=1,span=.25)+lo(ALC,degree=2,span=.25)))

    Start: X143S ~ ALC; AIC= 104.0815
    Trial: X143S ~ 1; AIC= 111.1054
    Trial: X143S ~ s(ALC, 2); AIC= 103.3325
    Step : X143S ~ s(ALC, 2); AIC= 103.3325
    Trial: X143S ~ s(ALC, 3); AIC= 102.9598
    Step : X143S ~ s(ALC, 3); AIC= 102.9598
    Trial: X143S ~ s(ALC, 4); AIC= 102.2103
    Step : X143S ~ s(ALC, 4); AIC= 102.2103
    Trial: X143S ~ s(ALC, 6); AIC= 102.4548

I have the impression that the algorithm stops when the next trial gives a higher AIC, without examining further functions. When I deleted some of the spline functions that were worse than s(ALC,4) I got:

    step.gam.ALC <- step.gam(gam.object.ALC,
      scope=list(ALC = ~1+ALC+s(ALC,4)+
        lo(ALC,degree=1,span=.5)+lo(ALC,degree=2,span=.5)+
        lo(ALC,degree=1,span=.25)+lo(ALC,degree=2,span=.25)))

    Start: X143S ~ ALC; AIC= 104.0815
    Trial: X143S ~ 1; AIC= 111.1054
    Trial: X143S ~ s(ALC, 4); AIC= 102.2103
    Step : X143S ~ s(ALC, 4); AIC= 102.2103
    Trial: X143S ~ lo(ALC, degree = 1, span = 0.5); AIC= 99.8127
    Step : X143S ~ lo(ALC, degree = 1, span = 0.5); AIC= 99.8127
    Trial: X143S ~ lo(ALC, degree = 2, span = 0.5); AIC= 100.5275

Loess turned out to be better in this situation. Is there any way to examine all the models without stopping when AIC is higher in the next trial? How should I handle this problem? I would be grateful for any advice.

Best regards,
Agnieszka Strzelczak, PhD fellow, National Environmental Research Institute, Silkeborg, Denmark, and Institute of Chemistry and Environmental Protection, Szczecin University of Technology, Poland

-- Trevor Hastie, Professor, Department of Statistics, Stanford University
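If an exhaustive comparison is wanted rather than the ordered search, one simple hedged alternative (not a step.gam feature) is to fit each candidate separately and compare AICs directly; the formula list below is illustrative and uses the poster's data names:

    library(gam)
    forms <- list(~ 1, ~ ALC, ~ s(ALC, 4),
                  ~ lo(ALC, degree = 1, span = 0.5),
                  ~ lo(ALC, degree = 2, span = 0.25))
    # update() grafts the response onto each right-hand side before fitting
    aics <- sapply(forms, function(f)
      AIC(gam(update(f, X143S ~ .), family = binomial, data = dane)))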
[R] Data Modeling Short Course
Short course: Statistical Learning and Data Mining II: tools for tall and wide data
Trevor Hastie and Robert Tibshirani, Stanford University
The Conference Center at Harvard Medical School, Boston, MA, Oct 31-Nov 1, 2005

This is a *new* two-day course on statistical models for data mining, inference and prediction. It is the third in a series, and follows our past offerings Modern Regression and Classification, and Statistical Learning and Data Mining.

In this course we emphasize the tools useful for tackling modern-day data analysis problems. We focus on both tall data (N >> p, where N = #cases and p = #features) and wide data (p >> N). The tools include gradient boosting, SVMs and kernel methods, random forests, lasso and LARS, ridge regression and GAMs, supervised principal components, and cross-validation. We also present some interesting case studies in a variety of application areas.

All our examples are developed using the S language, and most of the procedures we discuss are implemented in publicly available R packages.

Please visit http://www-stat.stanford.edu/~hastie/sldm.html for more information and registration details.
[R] Attractive position at Stanford for statistician into computing
Stanford University Statistics Department is looking to hire a computer systems specialist. We are targeting someone with an MS or Ph.D. in statistics who is adept at, and interested in, computing. We are very active in R and the S language, have Linux, PC and Mac platforms, and like to think we are at the cutting edge of technology.

For more details, see the link on the department web page: http://www-stat.stanford.edu/cssad.html

Trevor Hastie
[R] Digest reading is tedious
Like many, I am sure, I get R-help in digest form. It's easy enough to browse the subject lines, but then if an entry interests you, you have to embark on a tedious search or scroll to find it.

It would be great to have a clickable digest, where the topics list is a set of pointers, and clicking on a topic takes you to that entry. I can think of at least one way to do this via web pages, but I bet those with more web skills than me can come up with an elegant solution.

-- Trevor Hastie, Professor, Department of Statistics, Stanford University
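A toy sketch of the web-page idea (the file name and topics vector are hypothetical): build an HTML index whose entries link to anchors placed at each message in the digest.

    topics <- c("Topic A", "Topic B")      # would be parsed from the digest subject lines
    index  <- sprintf('<li><a href="#msg%d">%s</a></li>', seq_along(topics), topics)
    writeLines(c("<ul>", index, "</ul>"), "digest-index.html")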
[R] Statistical Learning and Data Mining Course
Short course: Statistical Learning and Data Mining
Trevor Hastie and Robert Tibshirani, Stanford University
Sheraton Hotel, Palo Alto, California, February 24-25, 2005

This two-day course gives a detailed overview of statistical models for data mining, inference and prediction. With the rapid developments in internet technology, genomics and other high-tech industries, we rely increasingly on data analysis and statistical models to exploit the vast amounts of data at our fingertips. This sequel to our popular Modern Regression and Classification course covers many new areas of unsupervised learning and data mining, and gives an in-depth treatment of some of the hottest tools in supervised learning. The first course is not a prerequisite for this new course.

Most of the techniques discussed in the course are implemented by the authors and others in the S language (S-plus or R), and all of the examples were developed in S.

Day one focuses on state-of-the-art methods for supervised learning, including PRIM, boosting, support vector machines, and very recent work on least angle regression and the lasso. Day two covers unsupervised learning, including clustering, principal components, principal curves and self-organizing maps. Many applications will be discussed, including the analysis of DNA expression arrays, one of the hottest new areas in biology!

###
Much of the material is based on the book:
Elements of Statistical Learning: data mining, inference and prediction
Hastie, Tibshirani & Friedman, Springer-Verlag, 2001
http://www-stat.stanford.edu/ElemStatLearn/
A copy of this book will be given to all attendees.
###

For more information, and to register, visit the course homepage:
http://www-stat.stanford.edu/~hastie/mrc.html
[R] Statistical Learning and Data Mining course
Short course: Statistical Learning and Data Mining
Trevor Hastie and Robert Tibshirani, Stanford University
Georgetown University Conference Center, Washington DC, September 20-21, 2004

This two-day course gives a detailed overview of statistical models for data mining, inference and prediction. With the rapid developments in internet technology, genomics and other high-tech industries, we rely increasingly on data analysis and statistical models to exploit the vast amounts of data at our fingertips. This sequel to our popular Modern Regression and Classification course covers many new areas of unsupervised learning and data mining, and gives an in-depth treatment of some of the hottest tools in supervised learning. The first course is not a prerequisite for this new course.

Most of the techniques discussed in the course are implemented by the authors and others in the S language (S-plus or R), and all of the examples were developed in S.

Day one focuses on state-of-the-art methods for supervised learning, including PRIM, boosting, support vector machines, and very recent work on least angle regression and the lasso. Day two covers unsupervised learning, including clustering, principal components, principal curves and self-organizing maps. Many applications will be discussed, including the analysis of DNA expression arrays, one of the hottest new areas in biology!

###
Much of the material is based on the book:
Elements of Statistical Learning: data mining, inference and prediction
Hastie, Tibshirani & Friedman, Springer-Verlag, 2001
http://www-stat.stanford.edu/ElemStatLearn/
A copy of this book will be given to all attendees.
###

For more information, and to register, visit the course homepage:
http://www-stat.stanford.edu/~hastie/mrc.html
[R] [R-pkgs] gam --- a new contributed package
I have contributed a gam library to CRAN, which implements Generalized Additive Models. This implementation follows closely the description in the GAM chapter (Chapter 7) of the white book, Statistical Models in S (Chambers & Hastie, eds., 1992, Wadsworth), as well as the philosophy in Generalized Additive Models (Hastie & Tibshirani, 1990, Chapman and Hall). Hence it behaves pretty much like the Splus version of GAM.

Note: this gam library and the functions therein are different from the gam function in package mgcv, and the two libraries should not be used simultaneously.

The gam library allows both local regression (loess) and smoothing-spline smoothers, and uses backfitting and local scoring to fit gams. It also allows users to supply their own smoothing methods, which can then be included in gam fits. The gam function in mgcv uses only smoothing-spline smoothers, with a focus on automatic parameter selection via GCV.

Some of the features of the gam library:

* full compatibility with the R functions glm and lm: a fitted gam inherits from class glm and lm
* print, summary, anova, predict and plot methods are provided, as well as the usual extractor methods like coefficients, residuals, etc.
* the method step.gam provides a flexible and customizable approach to model selection

Some differences from the Splus version of gam:

* predictions with new data are improved, without need for the safe.predict.gam function. This was partly facilitated by the improved prediction strategy used in R for GLMs and LMs.
* Currently the only backfitting algorithm is all.wam. In earlier versions of gam, dedicated Fortran routines fit models that had only smoothing-spline terms (s.wam) or only local regression terms (lo.wam), which in fact made calls back to Splus to update the working response and weights. These were designed for efficiency. With much faster computers this efficiency is no longer needed, and all.wam is modular and visible.

This package is numbered 0.9 in anticipation of a few bug fixes and glitches. I have tested many aspects of the functions, but there are always a few that slip by. I will be happy to hear of any problems, bugs and suggestions.

Plans for future versions:

* exact standard error calculations. gam employs approximations as described in the white book. With a bit more computing (now possible), we will have a function that computes exact standard errors along the lines described in the GAM book, page 127.

Trevor Hastie
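A minimal sketch mixing both smoother types (not from the announcement; the kyphosis data ship with the package, and the smoothing choices are illustrative):

    library(gam)
    data(kyphosis)
    fit <- gam(Kyphosis ~ s(Age, 4) + lo(Start),   # spline and loess terms together
               family = binomial, data = kyphosis)
    summary(fit)
    plot(fit, se = TRUE)                           # fitted terms with standard-error bands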
[R] svmpath: fitting the entire SVM regularization path
svmpath is a contributed package that fits the entire regularization path for a two-class SVM model.

The SVM (with any kernel) has a regularization or cost parameter C, which controls the amount of overlap at the soft margin. When the SVM criterion is expressed as a hinge loss plus lambda times a quadratic penalty, then lambda = 1/C. In many situations the choice of C can be critical, and different regimes for C are called for as the other kernel tuning parameters are changed. Most software packages come with a default value for C (typically very large), and the user is left to explore different values of C.

It turns out that the Lagrange multipliers which define the SVM solution for any C are piecewise linear in C (and, more usefully, piecewise linear and mostly piecewise constant in lambda). This means that we can compute the entire sequence of solutions for all values of C exactly. svmpath does this at essentially the same cost as fitting a single SVM model for a specified value of C. See the paper (joint work with Saharon Rosset, Ji Zhu and Rob Tibshirani) http://www-stat.stanford.edu/~hastie/Papers/svmpath.pdf for details.

This code has been tested on moderate-sized problems, with up to 1000 observations. The current version is not industry-ready; occasionally it will run into situations where the steps are too small, leading to machine-zero situations. Usually increasing the parameter eps from its default 1e-10 will avoid this.

Trevor Hastie
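A hedged sketch on simulated data (not from the announcement): the response must be coded -1/+1, and radial.kernel is one of the kernel functions supplied with the package.

    library(svmpath)
    set.seed(1)
    x <- matrix(rnorm(60 * 2), 60, 2)                    # two-dimensional inputs
    y <- ifelse(x[, 1] + x[, 2] + rnorm(60) > 0, 1, -1)  # -1/+1 class labels
    fit <- svmpath(x, y, kernel.function = radial.kernel)
    plot(fit)                                            # solutions along the path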
[R] problem with model.matrix
This works:

    > model.matrix(~ I(pos > 3), data = data.frame(pos = c(1:5)))
      (Intercept) I(pos > 3)TRUE
    1           1              0
    2           1              0
    3           1              0
    4           1              1
    5           1              1
    attr(,"assign")
    [1] 0 1
    attr(,"contrasts")
    attr(,"contrasts")$`I(pos > 3)`
    [1] "contr.treatment"

This does not:

    > model.matrix(~ I(pos > 3), data = data.frame(pos = c(1:2)))
    Error in `contrasts<-`(`*tmp*`, value = "contr.treatment") :
      contrasts can be applied only to factors with 2 or more levels

-- Trevor Hastie, Professor, Department of Statistics, Stanford University
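One workaround sketch (d and big are illustrative names): declare both logical levels explicitly, so the contrast can be formed even when the condition never varies in the data.

    d <- data.frame(pos = 1:2)
    d$big <- factor(d$pos > 3, levels = c(FALSE, TRUE))
    model.matrix(~ big, data = d)    # bigTRUE column is all zeros, but no error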
[R] Re: missing values for mda package
The mda package has no facilities for missing data. Users are expected to supply clean data; i.e. any missing-value treatment should take place before using any of the routines in the package. In particular, our version of the mars function takes inputs x and y, which are assumed to have no missing values.

The spam data were used to demonstrate mars in Elements of Statistical Learning. The spam data have no missing values, and can be obtained from http://www-stat.stanford.edu/~tibs/ElemStatLearn/

Trevor Hastie
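A minimal sketch of the kind of pre-cleaning assumed (x and y are illustrative names for a predictor matrix and response):

    library(mda)
    ok  <- complete.cases(x, y)                # rows with no missing values
    fit <- mars(x[ok, , drop = FALSE], y[ok])  # mars sees only clean data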
[R] Statistical Learning and Datamining course based on R/Splus tools
Short course: Statistical Learning and Data Mining
Trevor Hastie and Robert Tibshirani, Stanford University
Sheraton Hotel, Palo Alto, CA, February 26-27, 2004

This two-day course gives a detailed overview of statistical models for data mining, inference and prediction. With the rapid developments in internet technology, genomics and other high-tech industries, we rely increasingly on data analysis and statistical models to exploit the vast amounts of data at our fingertips. This sequel to our popular Modern Regression and Classification course covers many new areas of unsupervised learning and data mining, and gives an in-depth treatment of some of the hottest tools in supervised learning. The first course is not a prerequisite for this new course.

All of the techniques discussed in the course are implemented by the authors and others in the S language (S-plus or R).

Day one focuses on state-of-the-art methods for supervised learning, including PRIM, boosting, support vector machines, and very recent work on least angle regression and the lasso. Day two covers unsupervised learning, including clustering, principal components, principal curves and self-organizing maps. Many applications will be discussed, including the analysis of DNA expression arrays, one of the hottest new areas in biology!

###
Much of the material is based on the book:
Elements of Statistical Learning: data mining, inference and prediction
Hastie, Tibshirani & Friedman, Springer-Verlag, 2001
http://www-stat.stanford.edu/ElemStatLearn/
A copy of this book will be given to all attendees.
###

For more information, and to register, visit the course homepage:
http://www-stat.stanford.edu/~hastie/mrc.html
[R] Re: Logistic Regression
Christoph Lehman had problems with separated data in two-class logistic regression. One useful little trick is to penalize the logistic regression using a quadratic penalty on the coefficients. I am sure there are functions in the contributed R packages to do this; otherwise it is easy to achieve via IRLS using ridge regressions. Then even though the data are separated, the penalized log-likelihood has a unique maximum.

One intriguing feature is that as the penalty parameter goes to zero, the solution converges to the SVM solution, i.e. the optimal separating hyperplane; see http://www-stat.stanford.edu/~hastie/Papers/margmax1.ps

-- Trevor Hastie, Professor, Department of Statistics, Stanford University
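A hedged sketch of the IRLS-with-ridge trick described above (the function and argument names are illustrative; for brevity the intercept is penalized along with everything else):

    ridge.logistic <- function(X, y, lambda = 1e-4, niter = 25) {
      p <- ncol(X)
      beta <- rep(0, p)
      for (i in 1:niter) {
        eta <- drop(X %*% beta)
        mu  <- plogis(eta)                     # fitted probabilities
        w   <- mu * (1 - mu)                   # IRLS weights
        z   <- eta + (y - mu) / w              # working response
        beta <- solve(t(X) %*% (w * X) + lambda * diag(p),
                      t(X) %*% (w * z))        # penalized weighted LS step
      }
      drop(beta)
    }

Even when the classes are separated, the quadratic penalty keeps the coefficient estimates finite, and shrinking lambda toward zero approaches the margin-maximizing solution described above.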