Dimitri: If you follow the R posting guide you're more likely to get useful replies. In particular, it asks for **small** reproducible examples -- your example is far more code than I care to spend time on anyway (others may be more willing or more able to do so, of course). I suggest you try (if you haven't already):
1. Profiling the code using Rprof to isolate where the time is spent. And then...
2. Writing a **small** reproducible example to exercise that portion of the code and post it with your question to the list. If you need to...

Typically, if you do these things you'll figure out how to fix the situation on your own.

Cheers,
Bert Gunter
Genentech Nonclinical Statistics

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Dimitri Liakhovitski
Sent: Friday, March 26, 2010 2:06 PM
To: r-help
Subject: [R] Competing with SPSS and SAS: improving code that loops through rows (data manipulation)

Dear R-ers,

In my question there are no statistics involved - it's all about data manipulation in R. I am trying to write code that should replace what's currently being done in SAS and SPSS. Or, at least, I am trying to show my colleagues that R is not much worse than SAS/SPSS for the task at hand. I've written code that works, but it's too slow, probably because it's looping through a lot of things. But I am not seeing how to improve it. I've already written a different version, but it's 5 times slower than this one. The code below takes me slightly above 0.5 sec for the tiny data set. I've tried using it with a real one - it was not done after hours.

I need the list's help! Maybe someone will have an idea on how to increase the efficiency of my code (just one block of it - in the "DATA TRANSFORMATION" section below)?

Below, I am creating a data set whose structure is similar to the data sets the code should be applied to. I have also described what's actually being done in the comments.

Thanks a lot to anyone for any suggestion!
Dimitri

###### CREATING THE TEST DATA SET ################################
set.seed(123)
data<-data.frame(group=c(rep("first",10),rep("second",10)),week=c(1:10,1:10),a=abs(round(rnorm(20)*10,0)), b=abs(round(rnorm(20)*100,0)))
data
dim(data)[1] # !!! In real life I might have up to 150 (!)
# rows (weeks) within each subgroup

### Specifying parameters used in the code below:
vars<-names(data)[3:4] # names of variables to be transformed
nr.vars<-length(vars) # number of variables to be transformed; !!! in real life I'll have to deal with up to 50-60 variables, not 2.
group.var<-names(data)[1] # name of the grouping variable
# unique() is used instead of levels(): since R 4.0 data.frame() keeps strings as character, so levels() would return NULL
subgroups<-unique(as.character(data[[group.var]])) # names of subgroups; !!! in real life I'll have up to 20-25 subgroups, not 2.

# For EACH subgroup: indexing variables a and b to their maximum in that subgroup;
# Further, I'll have to use these indexed variables to build the new ones:
for(i in vars){
  new.name<-paste(i,".ind.to.max",sep="")
  data[[new.name]]<-NA
}
indexed.vars<-names(data)[grep("ind.to.max$", names(data))] # variables indexed to subgroup max
for(subgroup in subgroups){
  data[data[[group.var]] %in% subgroup,indexed.vars]<-lapply(data[data[[group.var]] %in% subgroup,vars],function(x){
    y<-x/max(x)
    return(y)
  })
}
data

############# DATA TRANSFORMATION #########################################
# Objective: Create new variables based on the old ones (a and b ind.to.max)
# For each new variable, the value in a given row is a function of (a) 2 constants (that have several levels each),
# (b) the corresponding value of the original variable (e.g., "a.ind.to.max"), and (c) the value in the previous row on the same new variable
# PLUS: - it has to be done by subgroup (variable "group")
constant1<-c(1:3) # constant 1 used for transformation - has 3 levels; !!! in real life it will have up to 7 levels
constant2<-seq(.15,.45,.15) # constant 2 used for transformation - has 3 levels; !!!
# in real life it will have up to 7 levels

# CODE THAT IS TOO SLOW (it uses parameters specified in the previous code section):
start1<-Sys.time()
for(var in indexed.vars){ # looping through variables
  for(c1 in 1:length(constant1)){ # looping through levels of constant1
    for(c2 in 1:length(constant2)){ # looping through levels of constant2
      d=log(0.5)/constant1[c1]
      l=-log(1-constant2[c2])
      name<-paste(strsplit(var,".ind.to.max"),constant1[c1],constant2[c2]*100,"..transf",sep=".")
      data[[name]]<-NA
      for(subgroup in subgroups){ # looping through subgroups
        # this is just the very first row of each subgroup
        data[data[[group.var]] %in% subgroup, name][1] = 1-((1-0*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup, var][1]*l*10)))
        for(case in 2:nrow(data[data[[group.var]] %in% subgroup, ])){ # looping through the remaining rows of the subgroup
          data[data[[group.var]] %in% subgroup, name][case]= 1-((1-data[data[[group.var]] %in% subgroup, name][case-1]*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup, var][case]*l*10)))
        }
      }
    }
  }
}
end1<-Sys.time()
print(end1-start1) # Takes me ~0.53 secs
names(data)
data

--
Dimitri Liakhovitski
Ninah.com
dimitri.liakhovit...@ninah.com

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
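
For readers of the archive, here is a minimal sketch of one way the "DATA TRANSFORMATION" block could be restructured. It is not from either poster. It assumes that most of the time goes into the repeated data-frame subsetting inside the innermost loop, which profiling with Rprof would have to confirm. The helper name `decay_transform` and the data frame name `dat` are hypothetical. The recurrence is the one in the original loop, run on a plain numeric vector and applied per subgroup with `ave()`.

```r
# Hypothetical helper (not from the thread): runs the row-to-row recurrence
# on a plain numeric vector, starting from a "previous value" of 0,
# exactly as the first-row special case in the original loop does.
decay_transform <- function(x, d, l) {
  y <- numeric(length(x))
  prev <- 0
  for (i in seq_along(x)) {
    prev <- 1 - (1 - prev * exp(d)) / exp(x[i] * l * 10)
    y[i] <- prev
  }
  y
}

# Same test data as in the original post (named dat here, an assumption)
set.seed(123)
dat <- data.frame(group = rep(c("first", "second"), each = 10),
                  week  = rep(1:10, 2),
                  a     = abs(round(rnorm(20) * 10, 0)),
                  b     = abs(round(rnorm(20) * 100, 0)))
vars <- c("a", "b")

# Index each variable to its subgroup maximum; ave() applies FUN within groups
for (v in vars) {
  dat[[paste0(v, ".ind.to.max")]] <-
    ave(dat[[v]], dat$group, FUN = function(x) x / max(x))
}
indexed.vars <- paste0(vars, ".ind.to.max")

constant1 <- 1:3
constant2 <- seq(.15, .45, .15)

# One new column per (variable, constant1, constant2) combination;
# the subgroup and row loops are replaced by ave() plus the vector helper
for (v in indexed.vars) {
  for (c1 in constant1) {
    for (c2 in constant2) {
      d <- log(0.5) / c1
      l <- -log(1 - c2)
      name <- paste(sub(".ind.to.max", "", v, fixed = TRUE),
                    c1, c2 * 100, "..transf", sep = ".")
      dat[[name]] <- ave(dat[[v]], dat$group,
                         FUN = function(x) decay_transform(x, d, l))
    }
  }
}
```

The row-by-row dependence means the recurrence itself still needs a loop, but the loop now touches only a numeric vector, and each column is assigned to the data frame once rather than once per row.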