Re: [R] Dataframe of factors transform speed?
The problem is in the way that 'as.data.frame' works. Use Rprof on a small list and you will see where it is spending its time.

Now, if you are really sure that all your data is consistent with being a data frame, you can create the data frame structure yourself. Not that I would advocate it, but if you look at the output of 'dput' on a data frame, you can see how to construct your own. Here it took 20 seconds to create the test data with a list of 50,000 and only 2 seconds to create the data frame from that.

> set.seed(123)
> n <- 50000
> system.time({
+ genoT <- lapply(1:n, function(i) factor(sample(c("AA",
+ "AB", "BB"), 1000, prob=c(1000, 1, 1), rep=T)))
+ })
   user  system elapsed
  20.85    0.12   22.83
> names(genoT) = paste("snp", 1:n, sep="")
>
> # create your own data frame structure -- if you are really sure of your data
>
> system.time(genoTz <- structure(genoT, .Names=names(genoT),
+ row.names=c(NA, -length(genoT[[1]])), class='data.frame'))
   user  system elapsed
   2.00    0.08    2.11
> str(genoTz)
'data.frame':   1000 obs. of  50000 variables:
 $ snp1 : Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp2 : Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp3 : Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp4 : Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp5 : Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp6 : Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp7 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp8 : Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp9 : Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp10: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp11: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
>
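The shortcut described in this message can be checked on a small scale. The following is a minimal sketch, not the poster's exact code: the list size (200) is an illustrative assumption, and the equal-length check is added here as a guard. It builds the same kind of list, converts it both ways, and compares the results.

```r
# Small stand-in data: a named list of equal-length factors (sizes are illustrative).
set.seed(123)
n <- 200
genoT <- lapply(seq_len(n), function(i)
  factor(sample(c("AA", "AB", "BB"), 1000, replace = TRUE, prob = c(1000, 1, 1))))
names(genoT) <- paste("snp", seq_len(n), sep = "")

# Reference result via the checked (slower) route.
slow <- as.data.frame(genoT)

# Unchecked route: attach the attributes a data frame needs.
# Only safe when every element has the same length and the names are valid and unique.
stopifnot(length(unique(sapply(genoT, length))) == 1)
fast <- structure(genoT,
                  row.names = c(NA_integer_, -length(genoT[[1]])),
                  class = "data.frame")

# Both routes should describe the same data frame.
same_dims <- identical(dim(slow), dim(fast))
same_content <- isTRUE(all.equal(slow, fast))
```

The guard matters because `structure()` performs none of the consistency checks that `as.data.frame` does; a list with unequal lengths would yield a malformed object rather than an error.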
Re: [R] Dataframe of factors transform speed?
Jim,

No, this is _not_ the problem. If you go to my 1st mail, I have a monster (at least it was when I purchased it) with 32GB (sic :-) of RAM and 4 dual-core AMD64 285 (the fastest at that time and still pretty fast now :-)

The machine starts paging when I run 2 copies of R working on two things like that :-). If you look at my last e-mail, I found a solution, but I still have no clue why the heck x<-as.data.frame(y), where y is a list of the same columns, takes really forever, and this is the thing that killed me before.

Thanks,
Latchezar
Re: [R] Dataframe of factors transform speed?
One of the problems is that you are probably paging on your system with an object that size (240,000 x 1000). This is about 1GB for a single object:

> set.seed(123)
> n <- 240000
> system.time({
+ genoT <- lapply(1:n, function(i) factor(sample(c("AA",
+ "AB", "BB"), 1000, prob=c(1000, 1, 1), rep=T)))
+ })
   user  system elapsed
  95.00    0.61  104.71
> names(genoT) = paste("snp", 1:n, sep="")
>
> object.size(genoT)
[1] 1045258752
>

I can create it on my 2GB machine as a list, but have problems converting it to a dataframe because I don't have enough memory.

So unless you have at least 4GB on your system, it might take a long time. Look at your performance measurements on your system and see if you have run out of physical memory and are paging.
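The "about 1GB" figure can be cross-checked with a rough estimate. This is only a sketch under stated assumptions: 240,000 factor columns of 1000 values each (a count inferred from the reported object.size and the "240K" mentioned elsewhere in the thread), with factor codes stored as 4-byte integers. Per-object overhead varies between R versions, so the measured per-factor size is taken from a single sample factor rather than hard-coded.

```r
n_cols <- 240000        # assumed number of SNP columns
n_rows <- 1000          # values per column
bytes_per_code <- 4     # factor codes are stored as integers

# Payload only: the integer codes, ignoring headers and attributes.
payload_bytes <- n_cols * n_rows * bytes_per_code
payload_gib <- payload_bytes / 2^30

# Measured size of one such factor on the current R build, scaled up.
one <- factor(sample(c("AA", "AB", "BB"), n_rows, replace = TRUE))
per_factor <- as.numeric(object.size(one))
estimated_gib <- per_factor * n_cols / 2^30
```

The payload alone comes to 960,000,000 bytes, which is in line with the reported total once per-factor overhead (levels, class attribute, vector header) is added.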
Re: [R] Dataframe of factors transform speed?
Hi,

Thanks for the help. My 1st question is still unanswered though :-) Please see below.

> -----Original Message-----
> From: Benilton Carvalho [mailto:[EMAIL PROTECTED]
> Sent: Friday, July 20, 2007 3:30 AM
> To: Latchezar Dimitrov
> Cc: r-help@stat.math.ethz.ch
> Subject: Re: [R] Dataframe of factors transform speed?
>
> set.seed(123)
> genoT = lapply(1:240000, function(i) factor(sample(c("AA",
> "AB", "BB"), 1000, prob=sample(c(1, 1000, 1000), 3), rep=T)))
> names(genoT) = paste("snp", 1:240000, sep="")
> genoT = as.data.frame(genoT)

Now this _is_ the problem. Everything before converting to data.frame worked almost instantaneously; however, as.data.frame runs forever. Obviously there is some scalability or memory management issue. When I tried my own method, but creating a new result dataframe (instead of modifying the old one), it worked like a charm for the 1st 100 cols: ~ .3s. I figured 300,000 cols should be ~1000s. Nope! It ran for about 50,000(!)s to finish about 42,000 cols only.

BTW, what ver. of R is yours?

Now here's what I "discovered" further.

#-- create a 1-col frame:
geno <- data.frame(c(geno.GASP[[1]],geno.JAG[[1]]),row.names=c(rownames(geno.GASP),rownames(geno.JAG)))

#-- main code; I repeated it w/ j in 1:1000, 2001:3000, and 3001:4000,
#-- i.e., adding 1000 cols to geno each time

system.time(
# for(j in 1:(ncol(geno.GASP ))){
  for(j in 3001:(4000 )){
    gt.GASP<-geno.GASP[[j]]
    for(l in 1:length(gt.GASP@levels)){
      levels(gt.GASP)[l] <- switch(gt.GASP@levels[l],AA="0",AB="1",BB="2")
    }
    gt.JAG <-geno.JAG [[j]]
#   for(l in 1:length(gt.JAG @levels)){
#     levels(gt.JAG )[l] <- switch(gt.JAG @levels[l],AA="0",AB="1",BB="2")
#   }
    geno[[j]]<-factor(c(as.numeric(factor(gt.GASP,levels=0:2))-1
###            factor(c(as.numeric(factor(gt.GASP,levels=0:2))-1
                       ,as.numeric(factor(gt.JAG, levels=0:2))-1
                       )
                     ,levels=0:2
                     )
  }
)

Times (each one is for 1000 cols!):
[1] 26.673  0.032 26.705  0.000  0.000
[1] 77.186  0.037 77.225  0.000  0.000
[1] 128.165   0.042 128.209   0.000   0.000
[1] 180.940   0.047 180.989   0.000   0.000

See the big diff and the scaling I mentioned above?

Furthermore, I removed the geno[[j]] assignment while leaving the operation itself, i.e., replaced it with the ### line above. Times:

[1] 0.857 0.008 0.865 0.000 0.000

Huh!? What the heck! That's my second question :-) Any ideas?

I still believe my method is near optimal. Of course I have to somehow get rid of the assignment bottleneck.

For now the lesson is: "God bless lists"

Here is my final solution:

> system.time({
+ geno.GASP.L<-lapply(geno.GASP
+   ,function(x){
+     for(l in 1:length(x@levels)){levels(x)[l] <- switch(x@levels[l],AA="0",AB="1",BB="2")}
+     factor(x,levels=0:2)
+   }
+ )
+ geno.JAG.L <-lapply(geno.JAG
+   ,function(x){
+ #   for(l in 1:length(x@levels)){levels(x)[l] <- switch(x@levels[l],AA="0",AB="1",BB="2")}
+     factor(x,levels=0:2)
+   }
+ )
+ })
[1] 192.800   1.566 194.413   0.000   0.000

! :-)

> system.time({
+ class(geno.GASP.L)<-"data.frame"
+ row.names(geno.GASP.L)<-row.names(geno.GASP)
+ class(geno.JAG.L )<-"data.frame"
+ row.names(geno.JAG.L )<-row.names(geno.JAG )
+ })
[1] 12.156  0.001 12.155  0.000  0.000

> system.time({
+ geno<-rbind(geno.GASP.L,geno.JAG.L)
+ })
[1] 1542.340    9.072 2066.310    0.000    0.000

I logged my notes here as I was trying various things. Part of the reason is my two questions, "What was wrong with me?" and "What the heck?!" (remember above? :-))), which still remain unanswered :-( I would have had a lot of fun if I did not have to have this done by ... yesterday :-))

Thanks a lot for the help,
Latchezar
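The "God bless lists" lesson can be illustrated on toy data. This is a minimal sketch under assumptions: the cohort below is a hypothetical stand-in (50 columns, 100 rows), and the recoding uses `match()` as suggested elsewhere in the thread rather than the level-by-level `switch()` loop. Columns are recoded inside a plain list and converted to a data frame once at the end, instead of assigning into a growing data frame column by column.

```r
set.seed(1)
lev <- c("AA", "AB", "BB")

# Hypothetical stand-in for one cohort: each column lacks one of the three levels.
cohort <- lapply(1:50, function(i)
  factor(sample(lev[-(i %% 3 + 1)], 100, replace = TRUE)))
names(cohort) <- paste0("snp", 1:50)

# Recode every column to the full level set, coded "0", "1", "2", inside a list.
recoded <- lapply(cohort, function(x)
  factor(match(as.character(x), lev) - 1, levels = 0:2))

# Convert once at the end rather than assigning geno[[j]] inside a loop.
geno <- as.data.frame(recoded)
```

Whether per-column assignment into a data frame is as costly on a current R build as the timings above suggest is not something this sketch measures; it only shows the list-first structure.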
Re: [R] Dataframe of factors transform speed?
On Thu, 19 Jul 2007, Latchezar Dimitrov wrote:

> Hello,
>
> This is a speed question. I have a dataframe genoT:
>
>> dim(genoT)
> [1] 1002 238304

It looks like these are all numeric originally. Handling these as a vector or matrix will speed things up a bit. You can then stitch together a data.frame:

# simulate:
# genoT.names <- scan('data.file', what='a', nlines=1)
# genoT <- scan('data.file',skip=1)
#
> genoT <- sample(0:2, 24*1002, repl=T)
> t1 <- proc.time()
> genoT <- factor(genoT,0:2,c("AA","AB","BB"))
> dim(genoT) <- c(1002,24)
> genoT.list <- lapply(1:24, function(x) genoT[,x])
> # simulate: names(genoT.list) <- genoT.names :
> names(genoT.list) <- make.names(1:24)
> class(genoT.list) <- "data.frame"
> row.names(genoT.list) <- 1:1002
> proc.time()-t1
   user  system elapsed
 20.978   2.036  49.714
>

Most of the _elapsed_ time is due to lags in copy-and-paste-ing in the commands.

HTH,

Chuck

>
>> str(genoT)
> 'data.frame': 1002 obs. of 238304 variables:
> $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
> $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 1 1 2 2 2 ...
> $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
> $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
> $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 3 2 3 3 1 ...
> $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1 2 1 ...
> $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
> $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2 ...
> $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
> $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2 1 2 1 1 3 ...
> $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1 2 2 3 ...
> $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3 3 3 3 ...
> $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2 2 2 2 ...
> $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
> $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2 1 1 2 ...
> $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
> $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1 1 1 1 ...
> $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
> $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1 1 1 2 ...
> $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 2 NA 1 NA 2 1 ...
> $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 1 1 3 1 1 1 ...
> $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 2 2 2 2 2 ...
> $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1 1 ...
> $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 2 2 1 ...
>
> Its columns are factors with different numbers of levels (from 1 to 3 -
> that's what I got from read.table, i.e., it dropped missing levels). I
> want to convert it to uniform factors with 3 levels. The 1st 10 rows
> above show already converted columns and the rest are not yet converted.
> Here's my attempt, which is a complete failure speed-wise:
>
>> system.time(
> + for(j in 1:(10 )){ #-- this is to try 1st 10 cols and measure the time, it otherwise is ncol(genoT) instead of 10
> +   gt<-genoT[[j]] #-- this is to avoid 2D indices
> +   for(l in 1:length(gt@levels)){
> +     levels(gt)[l] <- switch(gt@levels[l],AA="0",AB="1",BB="2") #-- convert levels to "0","1", or "2"
> +     genoT[[j]]<-factor(gt,levels=0:2) #-- make a 3-level factor and put it back
> +   }
> + }
> + )
> [1] 785.085 4.358 789.454 0.000 0.000
>
> 789s for 10 columns only!
>
> To me it seems like replacing 10 x 3 levels and then making a factor of a
> 1002-element vector x 10 is a "negligible" amount of operations.
>
> So, what's wrong with me? Any idea how to significantly accelerate the
> transformation or (to go to the very beginning) how to make read.table use a
> fixed set of levels ("AA", "AB", and "BB") and not drop any (missing)
> level?
>
> R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit
>
> The machine has 32G RAM and an AMD Opteron 285 (2.? GHz), so that's not
> it.
>
> Thank you very much for the help,
>
> Latchezar Dimitrov,
> Analyst/Programmer IV,
> Wake Forest University School of Medicine,
> Winston-Salem, North Carolina, USA
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Charles C. Berry                        (858) 534-2098
                                        Dept of Family/Preventive Medicine
E mailto:[EMAIL PROTECTED]              UC San Diego
http
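On the question of making read.table keep a fixed set of levels: one possible approach, sketched here under assumptions, is to read the genotype calls as plain character and impose the levels afterwards. The two-SNP input below is hypothetical, and `text=` is used only so the example runs without a file; real data would be read from a file path.

```r
# Hypothetical two-SNP input; neither column contains all three genotypes.
txt <- "snp1 snp2
AA AB
AA BB
AA AB"
lev <- c("AA", "AB", "BB")

# Read calls as character so that no per-column levels are inferred.
raw <- read.table(text = txt, header = TRUE, colClasses = "character")

# Impose the same three levels on every column; unused levels are retained.
geno <- as.data.frame(lapply(raw, factor, levels = lev))
str(geno)
```

Because every column carries the same level set, the integer codes mean the same thing in every column, including genotypes that never occur in a given column.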
Re: [R] Dataframe of factors transform speed?
set.seed(123)
genoT = lapply(1:240000, function(i) factor(sample(c("AA", "AB", "BB"), 1000, prob=sample(c(1, 1000, 1000), 3), rep=T)))
names(genoT) = paste("snp", 1:240000, sep="")
genoT = as.data.frame(genoT)
dim(genoT)
class(genoT)
system.time(out <- lapply(genoT, function(x) match(x, c("AA", "AB", "BB"))-1))
##
##
   user  system elapsed
119.288   0.004 119.339

(for all 240K)

best,
b

ps: note that "out" is a list.
Re: [R] Dataframe of factors transform speed?
Hi,

> -----Original Message-----
> From: Benilton Carvalho [mailto:[EMAIL PROTECTED]
> Sent: Friday, July 20, 2007 12:25 AM
> To: Latchezar Dimitrov
> Cc: r-help@stat.math.ethz.ch
> Subject: Re: [R] Dataframe of factors transform speed?
>
> it looks like that whatever method you used to genotype the
> 1002 samples on the STY array gave you a transposed matrix of
> genotype calls. :-)

It only looks like it :-)

Otherwise it is a correctly created dataframe of 1002 samples X (big number) of columns (SNP genotypes). It worked perfectly until I decided to put together two cohorts independently processed in R already. I got stuck with my lack of foresight. Otherwise I would have put 3 dummy lines w/ AA, AB, and BB on each one to make sure all 3 genotypes are present and that's it! Lesson for the future :-)

Maybe I am not using columns and rows appropriately here, but the dataframe is correct (I have not used FORTRAN since FORTRAN IV ;-) - as str says, 1002 observ. of (big number) vars.

>
> i'd use:
>
> genoT = read.table(yourFile, stringsAsFactors = FALSE)
>
> as a starting point... but I don't think that would be
> efficient (as you'd need to fix one column at a time - lapply).

No, it was not efficient at all. As a matter of fact, nothing is more efficient than loading already read data, alas :-(

>
> i'd preprocess yourFile before trying to load it:
>
> cat yourFile | sed -e 's/AA/1/g' | sed -e 's/AB/2/g' | sed -e
> 's/BB/3/g' > outFile
>
> and, now, in R:
>
> genoT = read.table(outFile, header=TRUE)

... Too late ;-) As must be clear by now, I have two dataframes I want to put together with rbind(geno1,geno2). The issue again is "uniformization" of factor variables w/ missing levels - they ended up with levels AA,BB on one of them and levels AB,BB on the other, which means as.numeric of AA is 1 on the 1st and as.numeric of AB is 1 on the second - a complete mess. That's why I tried to make both uniform, i.e., levels "AA", "AB", and "BB" for every SNP, and then rbind works.

In any case my 1st question remains: "What's wrong with me?" :-)

Thanks,
Latchezar
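The level mismatch described in this message can be reproduced on two tiny vectors. This is an illustrative sketch with made-up data, not the poster's cohorts: each factor is missing a different genotype, so the same integer code stands for different genotypes until both are rebuilt on a common level set.

```r
a <- factor(c("AA", "BB", "AA"))   # levels: AA, BB
b <- factor(c("AB", "BB", "AB"))   # levels: AB, BB

# Code 1 means AA in `a` but AB in `b`.
codes_a <- as.integer(a)
codes_b <- as.integer(b)

# Rebuild both factors on the same three levels before combining.
lev <- c("AA", "AB", "BB")
a3 <- factor(a, levels = lev)
b3 <- factor(b, levels = lev)

combined <- rbind(data.frame(snp = a3), data.frame(snp = b3))
codes_combined <- as.integer(combined$snp)
```

After the rebuild, the codes in `combined` follow one mapping (AA = 1, AB = 2, BB = 3) for rows from either source.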
Re: [R] Dataframe of factors transform speed?
Is this what you want? It took 0.01 seconds to convert 20 rows of the test data: > # create some data (20 rows with 1000 columns) > n <- 20 > result <- list() > vals <- c("AA", "AB", "BB") > for (i in 1:n){ + result[[as.character(i)]] <- sample(vals,1000, replace=TRUE, prob=c(9000,1,1)) + } > result.df <- do.call('data.frame', result) > > > str(result.df) 'data.frame': 1000 obs. of 20 variables: $ X1 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ... $ X2 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ... $ X3 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ... $ X4 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ... $ X5 : Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ... $ X6 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ... $ X7 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ... $ X8 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ... $ X9 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ... $ X10: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ... $ X11: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ... $ X12: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ... $ X13: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ... $ X14: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ... $ X15: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ... $ X16: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ... $ X17: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ... $ X18: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ... $ X19: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ... $ X20: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ... > > # go through each row and convert the factors according to 'vals' above > system.time({ # time to convert 20 rows + x <- lapply(result.df, function(facts){ + factor(match(as.character(facts), vals) - 1, levels=0:2) + }) + result.df <- do.call('data.frame', x) + }) user system elapsed 0.010.000.01 > > str(result.df) 'data.frame': 1000 obs. of 20 variables: $ X1 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ... $ X2 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ... 
$ X3 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ... $ X4 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ... $ X5 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ... $ X6 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ... $ X7 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ... $ X8 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ... $ X9 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ... $ X10: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ... $ X11: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ... $ X12: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ... $ X13: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ... $ X14: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ... $ X15: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ... $ X16: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ... $ X17: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ... $ X18: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ... $ X19: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ... $ X20: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ... > On 7/19/07, Latchezar Dimitrov <[EMAIL PROTECTED]> wrote: > Hello, > > This is a speed question. I have a dataframe genoT: > > > dim(genoT) > [1] 1002 238304 > > > str(genoT) > 'data.frame': 1002 obs. of 238304 variables: > $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 > ... > $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 1 1 2 2 2 > ... > $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 > ... > $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 > ... > $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 3 2 3 3 1 > ... > $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1 2 1 > ... > $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 > ... > $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2 > ... 
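A variation on the conversion above, as a minimal sketch rather than code from this thread: instead of subtracting 1 from the match() result, map the labels explicitly and assign back with geno[] <- lapply(...), which keeps the data frame in place rather than rebuilding it with do.call('data.frame', ...). The helper name recode_geno and the tiny geno data frame are made up for illustration. Anything that is neither an AA/AB/BB call nor already "0"/"1"/"2" ends up as NA.

```r
# Hypothetical helper: map genotype labels onto a fixed three-level factor.
recode_geno <- function(f, from = c("AA", "AB", "BB"), to = c("0", "1", "2")) {
  x <- as.character(f)
  idx <- match(x, from)                 # NA where x is not one of 'from'
  hit <- !is.na(idx)
  x[hit] <- to[idx[hit]]                # relabel AA/AB/BB, leave "0"/"1"/"2" as they are
  factor(x, levels = to)                # always three levels, even if some are unused
}

# Small made-up data frame standing in for genoT.
geno <- data.frame(
  snp1 = factor(c("AA", "AA", "AB", NA)),    # two levels present, one missing value
  snp2 = factor(c("BB", "BB", "BB", "BB")),  # one level present
  snp3 = factor(c("0", "2", "1", "0"))       # already converted
)

# Assigning through [] keeps the data.frame class, names and row names.
geno[] <- lapply(geno, recode_geno)
str(geno)
```

Whether this is any faster than the match() version on 238304 columns has not been measured here.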
Re: [R] Dataframe of factors transform speed?
it looks like whatever method you used to genotype the 1002 samples on the STY array gave you a transposed matrix of genotype calls. :-)

i'd use:

genoT = read.table(yourFile, stringsAsFactors = FALSE)

as a starting point... but I don't think that would be efficient (as you'd need to fix one column at a time - lapply).

i'd preprocess yourFile before trying to load it:

cat yourFile | sed -e 's/AA/1/g' | sed -e 's/AB/2/g' | sed -e 's/BB/3/g' > outFile

and, now, in R:

genoT = read.table(outFile, header=TRUE)

b

On Jul 19, 2007, at 11:51 PM, Latchezar Dimitrov wrote:

> Hello,
>
> This is a speed question. I have a dataframe genoT:
>
>> dim(genoT)
> [1]   1002 238304
>
>> str(genoT)
> 'data.frame': 1002 obs. of 238304 variables:
> $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
> $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 1 1 2 2 2 ...
> $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
> $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
> $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 3 2 3 3 1 ...
> $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1 2 1 ...
> $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
> $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2 ...
> $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
> $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2 1 2 1 1 3 ...
> $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1 2 2 3 ...
> $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3 3 3 3 ...
> $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2 2 2 2 ...
> $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
> $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2 1 1 2 ...
> $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
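For the second half of the original question (making read.table keep a fixed set of levels), one option that stays inside R is to read every column as character and build the factors afterwards with explicit levels. This is a minimal sketch, assuming a whitespace-separated file with a header row. The inline txt contents are made up, and the labels follow the 0/1/2 coding from the original post rather than the 1/2/3 written by the sed command.

```r
# Made-up three-SNP file contents; in practice this would be a file path.
txt <- "snp1 snp2 snp3
AA BB AA
AB BB AA
AA BB NA"

# colClasses = "character" stops read.table from building per-column factors,
# so no level can be dropped just because it is absent from a column.
raw <- read.table(text = txt, header = TRUE, colClasses = "character",
                  na.strings = "NA")

calls <- c("AA", "AB", "BB")

# Build every factor against the same level set and relabel in one step.
geno <- raw
geno[] <- lapply(raw, factor, levels = calls, labels = c("0", "1", "2"))
str(geno)
```

Every column ends up with the levels "0", "1", "2" regardless of which calls actually occur in it, and missing calls stay NA.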
[R] Dataframe of factors transform speed?
Hello,

This is a speed question. I have a dataframe genoT:

> dim(genoT)
[1]   1002 238304
> str(genoT)
'data.frame':   1002 obs. of  238304 variables:
 $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
 $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 1 1 2 2 2 ...
 $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
 $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 3 2 3 3 1 ...
 $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1 2 1 ...
 $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
 $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2 ...
 $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
 $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2 1 2 1 1 3 ...
 $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1 2 2 3 ...
 $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3 3 3 3 ...
 $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2 2 2 2 ...
 $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2 1 1 2 ...
 $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1 1 1 1 ...
 $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1 1 1 2 ...
 $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 2 NA 1 NA 2 1 ...
 $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 1 1 3 1 1 1 ...
 $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 2 2 2 2 2 ...
 $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 2 2 1 ...

Its columns are factors with different numbers of levels (from 1 to 3 - that's what I got from read.table, i.e., it dropped missing levels). I want to convert them to uniform factors with 3 levels. The first 10 rows of the str() output above show already converted columns and the rest are not yet converted. Here's my attempt, which is a complete failure in terms of speed:

> system.time(
+ for(j in 1:(10 )){   #-- this is to try 1st 10 cols and measure the time, it otherwise is ncol(genoT) instead of 10
+   gt<-genoT[[j]]   #-- this is to avoid 2D indices
+   for(l in 1:length([EMAIL PROTECTED])){
+     levels(gt)[l] <- switch([EMAIL PROTECTED],AA="0",AB="1",BB="2")   #-- convert levels to "0","1", or "2"
+     genoT[[j]]<-factor(gt,levels=0:2)   #-- make a 3-level factor and put it back
+   }
+ }
+ )
[1] 785.085   4.358 789.454   0.000   0.000

789s for 10 columns only!

To me it seems like replacing 10 x 3 levels and then making a factor of a 1002-element vector x 10 is a "negligible" amount of operations.

So, what's wrong with me? Any idea how to significantly accelerate the transformation or (to go to the very beginning) to make read.table use a fixed set of levels ("AA","AB", and "BB") and not drop any (missing) level?

R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit

The machine has 32G RAM and an AMD Opteron 285 (2.? GHz), so that's not it.

Thank you very much for the help,

Latchezar Dimitrov,
Analyst/Programmer IV,
Wake Forest University School of Medicine,
Winston-Salem, North Carolina, USA
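One possible contributor to the timing above, offered as an assumption rather than something measured in this thread: each genoT[[j]] <- ... goes through the data frame replacement method on a 238304-column object, and in the posted loop that assignment sits inside the inner loop, so it runs once per level rather than once per column. A minimal sketch of the same loop working on the bare list of columns, restoring the data.frame class only at the end; the small stand-in data, the map vector and the before copy are made up for illustration, and the sketch assumes every column still carries AA/AB/BB levels.

```r
# Made-up stand-in for genoT: 8 rows, 5 SNP columns.
set.seed(1)
cols <- lapply(1:5, function(i) {
  factor(sample(c("AA", "AB", "BB"), 8, replace = TRUE, prob = c(8, 1, 1)))
})
names(cols) <- paste0("snp", 1:5)
genoT <- as.data.frame(cols)
before <- genoT  # kept only so the result can be compared afterwards

map <- c(AA = "0", AB = "1", BB = "2")

# Work on the bare list of columns so the loop never calls `[[<-.data.frame`.
cols <- unclass(genoT)  # the names and row.names attributes stay attached
for (j in seq_along(cols)) {
  gt <- cols[[j]]
  levels(gt) <- unname(map[levels(gt)])               # rename only the levels present
  cols[[j]] <- factor(gt, levels = c("0", "1", "2"))  # pad to three levels, once per column
}
class(cols) <- "data.frame"  # valid only because no column changed length
genoT <- cols
str(genoT)
```

Renaming the levels touches at most three strings per column, so the 1002 values are only visited once, in the final factor() call.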