?scale
?ave

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111

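For anyone who has not used the two help pages named above: scale() centers a column and divides it by its standard deviation, and ave() applies a function within groups while keeping the original row order. Here is a minimal sketch on made-up data. The data frame d and the helper zs are hypothetical names, not anything from the original post.

```r
# Toy data (hypothetical): two grouping variables and one measurement.
d <- data.frame(
  g1 = c("a", "a", "a", "b", "b", "b"),
  g2 = c("x", "x", "x", "y", "y", "y"),
  y  = c(1, 2, 3, 10, 20, 30)
)

# Ungrouped: scale() returns a one-column matrix, so flatten it.
d$y_all <- as.numeric(scale(d$y))

# Grouped: combine the grouping variables into one factor first.
# drop = TRUE removes g1-by-g2 cells that contain no rows, so the
# helper is never called on an empty vector.
cell <- interaction(d$g1, d$g2, drop = TRUE)

# Helper: z-score a single numeric vector.
zs <- function(x) as.numeric(scale(x))

# ave() calls zs once per cell and returns a vector aligned with d$y.
d$y_grp <- ave(d$y, cell, FUN = zs)
```

In this toy case both cells have values one standard deviation apart, so d$y_grp comes out as -1, 0, 1 within each cell.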
> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Krzysztof Sakrejda-Leavitt
> Sent: Monday, June 01, 2009 7:12 AM
> To: r-help@r-project.org
> Subject: [R] Fast function for centering and standardizing variables
>
> Hi,
>
> I wrote a function to center variables I use in regression and
> standardize them by the standard deviation (below) within certain
> groupings (much like the aggregate function can apply a function to
> groups). This runs fast enough when I have about 50 groups and 50k
> records, but sometimes I end up with 1000 groups or so and it slows
> down considerably. The problem is probably the 'for' loops at the group
> level but I am having a hard time seeing if there is a good way to
> vectorize that step. Alternatively, is there a fast function already
> implemented for this sort of thing?
>
> If you want to run the function on a test data frame (from package
> MASS), here's the syntax:
>
> library(MASS)
> zscore(data = UScereal, columns = c("calories","protein","sugars"),
>        by = list(mfr = UScereal$mfr, vitamins = UScereal$vitamins))
>
> It returns a data frame with new columns appended.
>
> ------------------
> zscore <- function(data, columns, by) {
>   means <- aggregate(x = data[,columns], by = by, FUN = mean, na.rm=T)
>   sdevs <- aggregate(x = data[,columns], by = by, FUN = sd, na.rm=T)
>   # Efficient (?) index for 'na' in any 'by' column. NA => FALSE
>   noNA <- (rowSums(is.na(as.data.frame(by))) == 0)
>
>   for (col in columns) {
>     # Final name for the new column.
>     column <- paste(col,"CMS",sep="")
>     for (i in 1:nrow(means)) {
>       # Allocate objects for indexing on 'by' terms.
>       byTFmean <- by
>       byTFsd <- by
>       for (j in names(by)) {
>         # Construct index for each 'by' term
>         byTFmean[[j]] <- !(data[[j]] == means[[j]][[i]])
>         byTFsd[[j]] <- !(data[[j]] == sdevs[[j]][[i]])
>       }
>       # collapse indexes for 'by' using '&'
>       byTFmean <- (rowSums(as.data.frame(byTFmean)) == 0)
>       byTFsd <- (rowSums(as.data.frame(byTFsd)) == 0)
>       data[[column]][noNA & byTFmean & byTFsd] <-
>         ( data[[col]][noNA & byTFmean & byTFsd] - means[[col]][i] ) / sdevs[[col]][i]
>     }
>   }
>   return(data)
> }
> ------------------------
>
> Any suggestions are welcome and I'm happy to post back the final code.
>
> Best,
>
> Krzysztof
>
> -----------------------------------------------
> Krzysztof Sakrejda-Leavitt
>
> Organismic and Evolutionary Biology
> University of Massachusetts, Amherst
> 319 Morrill Science Center South
> 611 N. Pleasant Street
> Amherst, MA 01003
>
> work #: 413-325-6555
> email: sakre...@nsm.umass.edu
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
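
One way the nested group loops in the quoted zscore() might be replaced, following the ?ave pointer, is to build a single grouping factor with interaction() and let ave() do the per-group work. This is only a sketch under stated assumptions, not the poster's final code. zscore_ave is a hypothetical name. Giving NA to rows with NA in any 'by' term is an assumption about the intended behaviour, based on the noNA index in the original.

```r
# Sketch of a variant without the per-group loop (hypothetical name).
# Same arguments as the quoted zscore(): a data frame, a character vector
# of column names, and a named list of grouping vectors.
zscore_ave <- function(data, columns, by) {
  # One factor for all 'by' terms; drop = TRUE skips empty cells.
  # A row with NA in any 'by' term ends up with an NA group.
  g <- interaction(by, drop = TRUE)
  has_group <- !is.na(g)

  # Center and standardize one vector, ignoring NA values.
  zs <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

  # The remaining loop runs over columns only, not over groups.
  for (col in columns) {
    z <- rep(NA_real_, nrow(data))
    z[has_group] <- ave(as.numeric(data[[col]][has_group]),
                        g[has_group], FUN = zs)
    data[[paste(col, "CMS", sep = "")]] <- z
  }
  data
}
```

It takes the same arguments as the call shown in the question (data, columns, by), so it can be timed directly against the original on the UScereal example.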