Re: [R] speeding up regressions using ddply
Why do you want to do this? If there is just a small part of the logistic regression that you are interested in, then there may be a way to compute or approximate it more quickly than by doing a full glm fit on every pair. It seems unlikely that you would get much meaning out of that many full regressions, but there may be some piece that you are looking for, and extracting just that piece could lend itself to further graphing/analysis.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111

> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Alison Macalady
> Sent: Wednesday, September 22, 2010 5:05 AM
> To: r-help@r-project.org
> Subject: [R] speeding up regressions using ddply
>
> Hi,
>
> I have a data set that I'd like to run logistic regressions on, using ddply to speed up the computation of many models with different combinations of variables. I would like to run regressions on every unique two-variable combination in a portion of my data set, but I can't quite figure out how to do it using ddply. The data set looks like this, with "status" as the binary dependent variable and V1:V8 as potential independent variables in the logistic regression:
>
> m <- matrix(rnorm(288), nrow = 36)
> colnames(m) <- paste('V', 1:8, sep = '')
> x <- data.frame(status = factor(rep(rep(c('D','L'), each = 6), 3)),
>                 as.data.frame(m))
>
> I used melt to put my data frame into a more workable format:
>
> require(reshape)
> xm <- melt(x, id = 'status')
>
> Here is the basic shape of the function I'd like to apply to every combination of variables in the dataset:
>
> h <- function(df)
> {
>   attach(df)
>   # What I can't figure out is how to specify 2 different variables
>   # (I've put value1 and value2 as placeholders) from the xm to
>   # include in the model
>   log.glm <- glm(status ~ value1 + value2, family = binomial(link = logit),
>                  na.action = na.omit)
>
>   glm.summary <- summary(log.glm)
>   aic <- extractAIC(log.glm)
>   coef <- coef(glm.summary)
>   list(Est1 = coef[1,2], Est2 = coef[3,2], AIC = aic[2]) # or whatever other output here
> }
>
> And then I'd like to use ddply to speed up the computations.
>
> require(plyr)
> output <- ddply(xm, .(variable), as.data.frame.function(h))
> output
>
> I can easily do this using ddply when I only want to use 1 variable in the model, but can't figure out how to do it with two variables.
>
> Many thanks for any hints!
>
> Ali
>
> Alison Macalady
> Ph.D. Candidate
> University of Arizona
> School of Geography and Development
> & Laboratory of Tree Ring Research

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
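One possible reading of the suggestion above: if only a single quantity per pair is needed (the AIC, say), it may be cheaper to call glm.fit() on a prebuilt design matrix than to go through glm() and summary() each time. The sketch below is an illustration under that assumption and was not proposed in the thread; pair_aic and var_pairs are hypothetical names, the seed is added only for reproducibility, and whether this is noticeably faster would need to be timed on the real data.

```r
set.seed(42)  # assumption: fixed seed so the sketch is reproducible
m <- matrix(rnorm(288), nrow = 36)
colnames(m) <- paste('V', 1:8, sep = '')
x <- data.frame(status = factor(rep(rep(c('D','L'), each = 6), 3)),
                as.data.frame(m))

# glm() models the probability of the second factor level ("L"),
# so the response is coded the same way for glm.fit()
y <- as.numeric(x$status == 'L')

# pair_aic is a hypothetical helper: AIC of status ~ v1 + v2
pair_aic <- function(vars) {
  X <- cbind(1, as.matrix(x[, vars]))  # intercept plus the two predictors
  glm.fit(X, y, family = binomial())$aic
}

var_pairs <- combn(names(x)[-1], 2)    # 2-row matrix, one column per pair
aics <- apply(var_pairs, 2, pair_aic)
names(aics) <- apply(var_pairs, 2, paste, collapse = '_')
head(sort(aics))                       # pairs with the lowest AIC first
```

The result is a named numeric vector, which is easy to sort or plot without keeping the fitted model objects around.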
Re: [R] speeding up regressions using ddply
There has been a recent addition of parallel processing capabilities to plyr (I believe v1.2 and later), along with a dataframe iterator construct. Both have greatly improved the performance of ddply for multicore/cluster computing. So we now have the niceness of plyr's grammar with pretty good performance.

From the plyr NEWS file:

Version 1.2 (2010-09-09)
------------------------

NEW FEATURES

* l*ply, d*ply, a*ply and m*ply all gain a .parallel argument that when TRUE, applies functions in parallel using a parallel backend registered with the foreach package:

  x <- seq_len(20)
  wait <- function(i) Sys.sleep(0.1)
  system.time(llply(x, wait))
  #  user  system elapsed
  # 0.007   0.005   2.005

  library(doMC)
  registerDoMC(2)
  system.time(llply(x, wait, .parallel = TRUE))
  #  user  system elapsed
  # 0.020   0.011   1.038

On 9/22/10 10:41 AM, Ista Zahn wrote:
> Hi Alison,
>
> On Wed, Sep 22, 2010 at 11:05 AM, Alison Macalady wrote:
> > Hi,
> >
> > I have a data set that I'd like to run logistic regressions on, using ddply to speed up the computation of many models with different combinations of variables.
>
> In my experience ddply is not particularly fast. I use it a lot because it is flexible and has easy-to-understand syntax, not for its speed.
>
> > I would like to run regressions on every unique two-variable combination in a portion of my data set, but I can't quite figure out how to do it using ddply.
>
> I'm not sure ddply is the tool for this job.
>
> > The data set looks like this, with "status" as the binary dependent variable and V1:V8 as potential independent variables in the logistic regression:
> >
> > m <- matrix(rnorm(288), nrow = 36)
> > colnames(m) <- paste('V', 1:8, sep = '')
> > x <- data.frame(status = factor(rep(rep(c('D','L'), each = 6), 3)),
> >                 as.data.frame(m))
>
> You can use combn to determine the combinations you want:
>
> Varcombos <- combn(names(x)[-1], 2)
>
> From there you can do a loop, something like
>
> results <- list()
> for(i in 1:dim(Varcombos)[2])
> {
>   log.glm <- glm(as.formula(paste("status ~ ", Varcombos[1,i], " + ", Varcombos[2,i], sep="")),
>                  family=binomial(link=logit), na.action=na.omit, data=x)
>   glm.summary <- summary(log.glm)
>   aic <- extractAIC(log.glm)
>   coef <- coef(glm.summary)
>   results[[i]] <- list(Est1=coef[1,2], Est2=coef[3,2], AIC=aic[2]) #or whatever other output here
>   names(results)[i] <- paste(Varcombos[1,i], Varcombos[2,i], sep="_")
> }
>
> I'm sure you could replace the loop with something more elegant, but I'm not really sure how to go about it.
>
> > I used melt to put my data frame into a more workable format:
> >
> > require(reshape)
> > xm <- melt(x, id = 'status')
> >
> > Here is the basic shape of the function I'd like to apply to every combination of variables in the dataset:
> >
> > h <- function(df)
> > {
> >   attach(df)
> >   # What I can't figure out is how to specify 2 different variables
> >   # (I've put value1 and value2 as placeholders) from the xm to
> >   # include in the model
> >   log.glm <- glm(status ~ value1 + value2, family = binomial(link = logit),
> >                  na.action = na.omit)
> >
> >   glm.summary <- summary(log.glm)
> >   aic <- extractAIC(log.glm)
> >   coef <- coef(glm.summary)
> >   list(Est1 = coef[1,2], Est2 = coef[3,2], AIC = aic[2]) # or whatever other output here
> > }
> >
> > And then I'd like to use ddply to speed up the computations.
> >
> > require(plyr)
> > output <- ddply(xm, .(variable), as.data.frame.function(h))
> > output
> >
> > I can easily do this using ddply when I only want to use 1 variable in the model, but can't figure out how to do it with two variables.
>
> I don't think this approach can work. You are saying "split up xm by variable" and then expecting to be able to reference different levels of variable within each split, an impossible request.
>
> Hope this helps,
> Ista
>
> > Many thanks for any hints!
> >
> > Ali
> >
> > Alison Macalady
> > Ph.D. Candidate
> > University of Arizona
> > School of Geography and Development
> > & Laboratory of Tree Ring Research

--
Abhijit Dasgupta, PhD
Director and Principal Statistician
ARAASTAT
Ph: 301.385.3067
E: adasgu...@araastat.com
W: http://www.araastat.com
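As a rough sketch of how the .parallel argument from the NEWS excerpt might be combined with the combn idea quoted above, the example below runs one glm per variable pair through ldply. It assumes plyr is installed; fit_pair and use_parallel are hypothetical names, and the doMC backend is registered only if that package is available (otherwise the same call simply runs sequentially). The two-worker setting mirrors the NEWS example and is not a recommendation.

```r
library(plyr)

set.seed(42)  # assumption: fixed seed so the sketch is reproducible
m <- matrix(rnorm(288), nrow = 36)
colnames(m) <- paste('V', 1:8, sep = '')
x <- data.frame(status = factor(rep(rep(c('D','L'), each = 6), 3)),
                as.data.frame(m))

# list of character vectors, one per two-variable combination
combos <- combn(names(x)[-1], 2, simplify = FALSE)

# fit_pair is a hypothetical helper returning a one-row data frame
fit_pair <- function(vars) {
  fit <- glm(reformulate(vars, response = 'status'),
             family = binomial(link = 'logit'), data = x)
  data.frame(var1 = vars[1], var2 = vars[2], AIC = AIC(fit),
             stringsAsFactors = FALSE)
}

# use_parallel is a hypothetical flag: go parallel only if doMC is installed
use_parallel <- requireNamespace('doMC', quietly = TRUE)
if (use_parallel) doMC::registerDoMC(2)

out <- ldply(combos, fit_pair, .parallel = use_parallel)
head(out[order(out$AIC), ])
```

With only 28 small models the parallel overhead may well outweigh any gain; the pattern is more likely to pay off when there are many more combinations or larger data.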
Re: [R] speeding up regressions using ddply
Hi Alison,

On Wed, Sep 22, 2010 at 11:05 AM, Alison Macalady wrote:
> Hi,
>
> I have a data set that I'd like to run logistic regressions on, using ddply to speed up the computation of many models with different combinations of variables.

In my experience ddply is not particularly fast. I use it a lot because it is flexible and has easy-to-understand syntax, not for its speed.

> I would like to run regressions on every unique two-variable combination in a portion of my data set, but I can't quite figure out how to do it using ddply.

I'm not sure ddply is the tool for this job.

> The data set looks like this, with "status" as the binary dependent variable and V1:V8 as potential independent variables in the logistic regression:
>
> m <- matrix(rnorm(288), nrow = 36)
> colnames(m) <- paste('V', 1:8, sep = '')
> x <- data.frame(status = factor(rep(rep(c('D','L'), each = 6), 3)),
>                 as.data.frame(m))

You can use combn to determine the combinations you want:

Varcombos <- combn(names(x)[-1], 2)

From there you can do a loop, something like

results <- list()
for(i in 1:dim(Varcombos)[2])
{
  log.glm <- glm(as.formula(paste("status ~ ", Varcombos[1,i], " + ", Varcombos[2,i], sep="")),
                 family=binomial(link=logit), na.action=na.omit, data=x)
  glm.summary <- summary(log.glm)
  aic <- extractAIC(log.glm)
  coef <- coef(glm.summary)
  results[[i]] <- list(Est1=coef[1,2], Est2=coef[3,2], AIC=aic[2]) #or whatever other output here
  names(results)[i] <- paste(Varcombos[1,i], Varcombos[2,i], sep="_")
}

I'm sure you could replace the loop with something more elegant, but I'm not really sure how to go about it.

> I used melt to put my data frame into a more workable format:
>
> require(reshape)
> xm <- melt(x, id = 'status')
>
> Here is the basic shape of the function I'd like to apply to every combination of variables in the dataset:
>
> h <- function(df)
> {
>   attach(df)
>   # What I can't figure out is how to specify 2 different variables
>   # (I've put value1 and value2 as placeholders) from the xm to
>   # include in the model
>   log.glm <- glm(status ~ value1 + value2, family = binomial(link = logit),
>                  na.action = na.omit)
>
>   glm.summary <- summary(log.glm)
>   aic <- extractAIC(log.glm)
>   coef <- coef(glm.summary)
>   list(Est1 = coef[1,2], Est2 = coef[3,2], AIC = aic[2]) # or whatever other output here
> }
>
> And then I'd like to use ddply to speed up the computations.
>
> require(plyr)
> output <- ddply(xm, .(variable), as.data.frame.function(h))
> output
>
> I can easily do this using ddply when I only want to use 1 variable in the model, but can't figure out how to do it with two variables.

I don't think this approach can work. You are saying "split up xm by variable" and then expecting to be able to reference different levels of variable within each split, an impossible request.

Hope this helps,
Ista

> Many thanks for any hints!
>
> Ali
>
> Alison Macalady
> Ph.D. Candidate
> University of Arizona
> School of Geography and Development
> & Laboratory of Tree Ring Research

--
Ista Zahn
Graduate student
University of Rochester
Department of Clinical and Social Psychology
http://yourpsyche.org
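One way the loop above might be replaced, offered as a sketch rather than as anything suggested in the thread: combn() accepts a FUN argument, so each pair can be fitted as it is generated and the one-row results bound into a single data frame. fit_pair is a hypothetical helper name and the seed is added only for reproducibility. Note also that coef(summary(fit)) has the columns Estimate, Std. Error, z value and Pr(>|z|), so the coef[1,2] and coef[3,2] used in the thread pick out standard errors (of the intercept and of the second predictor); this sketch instead returns the two slope estimates, on the assumption that this is closer to what the names Est1 and Est2 intend.

```r
set.seed(42)  # assumption: fixed seed so the sketch is reproducible
m <- matrix(rnorm(288), nrow = 36)
colnames(m) <- paste('V', 1:8, sep = '')
x <- data.frame(status = factor(rep(rep(c('D','L'), each = 6), 3)),
                as.data.frame(m))

# fit_pair is a hypothetical helper: fits status ~ v1 + v2 and
# returns one row with the two slope estimates and the AIC
fit_pair <- function(vars) {
  fit <- glm(reformulate(vars, response = 'status'),
             family = binomial(link = 'logit'), data = x)
  est <- coef(summary(fit))  # rows: (Intercept), v1, v2
  data.frame(var1 = vars[1], var2 = vars[2],
             Est1 = est[2, 'Estimate'], Est2 = est[3, 'Estimate'],
             AIC = AIC(fit), stringsAsFactors = FALSE)
}

# combn() calls fit_pair on every two-variable combination
results <- do.call(rbind,
                   combn(names(x)[-1], 2, FUN = fit_pair, simplify = FALSE))
results[order(results$AIC), ][1:5, ]  # the five pairs with the lowest AIC
```

Using reformulate() avoids pasting the formula together by hand, and returning a data frame per pair makes the combined result straightforward to sort or plot.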
[R] speeding up regressions using ddply
Hi,

I have a data set that I'd like to run logistic regressions on, using ddply to speed up the computation of many models with different combinations of variables. I would like to run regressions on every unique two-variable combination in a portion of my data set, but I can't quite figure out how to do it using ddply. The data set looks like this, with "status" as the binary dependent variable and V1:V8 as potential independent variables in the logistic regression:

m <- matrix(rnorm(288), nrow = 36)
colnames(m) <- paste('V', 1:8, sep = '')
x <- data.frame(status = factor(rep(rep(c('D','L'), each = 6), 3)),
                as.data.frame(m))

I used melt to put my data frame into a more workable format:

require(reshape)
xm <- melt(x, id = 'status')

Here is the basic shape of the function I'd like to apply to every combination of variables in the dataset:

h <- function(df)
{
  attach(df)
  # What I can't figure out is how to specify 2 different variables
  # (I've put value1 and value2 as placeholders) from the xm to
  # include in the model
  log.glm <- glm(status ~ value1 + value2, family = binomial(link = logit),
                 na.action = na.omit)

  glm.summary <- summary(log.glm)
  aic <- extractAIC(log.glm)
  coef <- coef(glm.summary)
  list(Est1 = coef[1,2], Est2 = coef[3,2], AIC = aic[2]) # or whatever other output here
}

And then I'd like to use ddply to speed up the computations.

require(plyr)
output <- ddply(xm, .(variable), as.data.frame.function(h))
output

I can easily do this using ddply when I only want to use 1 variable in the model, but can't figure out how to do it with two variables.

Many thanks for any hints!

Ali

Alison Macalady
Ph.D. Candidate
University of Arizona
School of Geography and Development
& Laboratory of Tree Ring Research
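For reference, the one-variable case mentioned in the post might look roughly like the sketch below. It assumes the reshape and plyr packages are installed; h1 is a hypothetical one-predictor version of h that takes its data from the df argument instead of calling attach(), and the seed is added only for reproducibility. Each piece that ddply passes to h1 holds the values of a single variable in the value column, which is why a second predictor is not available inside the function when splitting on variable.

```r
library(reshape)  # melt(), as used in the post
library(plyr)     # ddply()

set.seed(42)  # assumption: fixed seed so the sketch is reproducible
m <- matrix(rnorm(288), nrow = 36)
colnames(m) <- paste('V', 1:8, sep = '')
x <- data.frame(status = factor(rep(rep(c('D','L'), each = 6), 3)),
                as.data.frame(m))

# long format with columns: status, variable, value
xm <- melt(x, id.vars = 'status')

# h1 is a hypothetical helper: one logistic regression per variable
h1 <- function(df) {
  fit <- glm(status ~ value, family = binomial(link = 'logit'), data = df)
  data.frame(Est = coef(fit)[['value']], AIC = AIC(fit))
}

# one row of output per level of 'variable' (V1 ... V8)
output <- ddply(xm, .(variable), h1)
output
```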