Re: [R] Quicker way of combining vectors into a data.frame
[ Resending to the list as I fell foul of the too many recipients rule ]

On Thu, 2006-11-30 at 11:34 -0600, Marc Schwartz wrote:

Thanks to Marc, Prof. Ripley, Sebastian and Sebastian (Luque - offline)
for your comments and suggestions.

I noticed that two of the vectors were named, so I removed the names
(names(vec) <- NULL). That pushed the execution time for the function
from c. 40 seconds to c. 115 seconds, and all of the time was taken
within the data.frame(...) call. So having names *on* some of the
vectors seemed to help things along, which was the opposite of what I
had expected.

If I use Marc's cbind method, the execution time for the function drops
to c. 1 second (most of which is in the calculation of one of the
vectors). So I guess I can work round this now.

What I find interesting is that:

    > test.dat <- rnorm(4471)
    > system.time(z <- data.frame(col1 = test.dat, col2 = test.dat, col3 = test.dat,
    + col4 = test.dat, col5 = test.dat, col6 = test.dat, col7 = test.dat,
    + col8 = test.dat, col9 = test.dat, col10 = test.dat))
    [1] 0.008 0.000 0.007 0.000 0.000

Whereas doing exactly the same thing with different data in the function
gives the following timings:

    > system.time(fab <- data.frame(lc.ratio, Q,
    + fNupt,
    + rho.n, rho.s,
    + net.Nimm,
    + net.Nden,
    + CLminN,
    + CLmaxN,
    + CLmaxS))
    [1] 173.415 0.260 192.192 0.000 0.000

Most of that was without a change in memory, but towards the end, for
c. 5 seconds, memory use by R increased by 200-300 MB.

and...

    > system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
    + fNupt = fNupt,
    + rho.n = rho.n, rho.s = rho.s,
    + net.Nimm = net.Nimm,
    + net.Nden = net.Nden,
    + CLminN = CLminN,
    + CLmaxN = CLmaxN,
    + CLmaxS = CLmaxS))
    [1] 99.966 0.140 114.091 0.000 0.000

Again with a slight increase in memory usage in the last 5 seconds.

So now, having stripped the names from two of the vectors (so that all
are un-named), the un-named version of the data.frame call is almost
twice as slow as the named data.frame call.

If I leave the names on the two vectors that had them, I get the
following timings for those same calls:

    > system.time(fab <- data.frame(lc.ratio, Q,
    + fNupt,
    + rho.n, rho.s,
    + net.Nimm,
    + net.Nden,
    + CLminN,
    + CLmaxN,
    + CLmaxS))
    [1] 96.234 0.244 101.706 0.000 0.000
    > system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
    + fNupt = fNupt,
    + rho.n = rho.n, rho.s = rho.s,
    + net.Nimm = net.Nimm,
    + net.Nden = net.Nden,
    + CLminN = CLminN,
    + CLmaxN = CLmaxN,
    + CLmaxS = CLmaxS))
    [1] 13.597 0.088 15.868 0.000 0.000

So having the 2 named vectors and using the named version of the
data.frame call is the fastest combination.

This is all done within the debugger at the point where I would be
generating fab, and if I do

    > system.time(z <- data.frame(col1 = test.dat, col2 = test.dat, col3 = test.dat,
    + col4 = test.dat, col5 = test.dat, col6 = test.dat, col7 = test.dat,
    + col8 = test.dat, col9 = test.dat, col10 = test.dat))
    [1] 0.008 0.000 0.007 0.000 0.000

(as above) at this point in the debugger, it is exceedingly quick.

I just don't understand what is going on with data.frame. I have yet to
try Prof. Ripley's suggestion of being a bit naughty with R - I'll see
if that is any quicker.

Once again, thanks to you all for your suggestions.

All the best,

G

> Gavin,
>
> One more note, which is that even timing the direct data frame
> creation on my system with colnames, again using the same 10 numeric
> columns, I get:
>
> > system.time(DF1 <- data.frame(lc.ratio = Col1, Q = Col2, fNupt = Col3,
> +                                rho.n = Col4, rho.s = Col5, net.Nimm = Col6,
> +                                net.Nden = Col7, CLminN = Col8, CLmaxN = Col9,
> +                                CLmaxS = Col10))
> [1] 0.012 0.000 0.028 0.000 0.000
>
> > str(DF1)
> 'data.frame': 4471 obs. of 10 variables:
>  $ lc.ratio: num 0.1423 0.1873 -1.8129 0.0255 -1.7650 ...
>  $ Q       : num 0.8340 -0.2387 -0.0864 -1.1184 -0.3368 ...
>  $ fNupt   : num -0.1718 -0.0549 1.5194 -1.6127 -1.2019 ...
>  $ rho.n   : num -0.740 0.240 0.522 -1.492 1.003 ...
>  $ rho.s   : num -0.2363 -1.6248 -0.3045 0.0294 0.1240 ...
>  $ net.Nimm: num -0.774 0.947 -1.098 0.809 1.216 ...
>  $ net.Nden: num -0.198 -0.135 -0.300 -0.618 -0.784 ...
>  $ CLminN  : num 0.924 -3.265 0.211 0.813 0.262 ...
>  $ CLmaxN  : num 0.3212 -0.0502 -0.9978 0.9005 -1.6535 ...
>  $ CLmaxS  : num
Re: [R] Quicker way of combining vectors into a data.frame
Gavin Simpson wrote:
> <snip />
> I just don't understand what is going on with data.frame.

I think there is something about the data you're not telling us...

Could you e.g. do something like

    str(data.frame(lc.ratio, Q, fNupt, rho.n, rho.s, net.Nimm,
                   net.Nden, CLminN, CLmaxN, CLmaxS))

and

    str(list(lc.ratio, Q, fNupt, rho.n, rho.s, net.Nimm,
             net.Nden, CLminN, CLmaxN, CLmaxS))

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - ([EMAIL PROTECTED])                  FAX: (+45) 35327907

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Quicker way of combining vectors into a data.frame
On Fri, 2006-12-01 at 12:13 +0100, Peter Dalgaard wrote:
> Gavin Simpson wrote:
<snip />
> > I just don't understand what is going on with data.frame.
>
> I think there is something about the data you're not telling us...

Yes, that I was doing something very, very silly that I thought would
work (produce a vector CLmaxN of the required length), but was in fact
blowing out to a huge named list. It was this that was causing the
massive increase in computation time in data.frame over cbind.

After correcting my mistake, timings for data.frame are:

    > system.time(fab <- data.frame(lc.ratio, Q,
    + fNupt,
    + rho.n, rho.s,
    + net.Nimm,
    + net.Nden,
    + CLminN,
    + CLmaxN,
    + CLmaxS))
    [1] 0.012 0.000 0.011 0.000 0.000
    Browse[1]> system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
    + fNupt = fNupt,
    + rho.n = rho.n, rho.s = rho.s,
    + net.Nimm = net.Nimm,
    + net.Nden = net.Nden,
    + CLminN = CLminN,
    + CLmaxN = CLmaxN,
    + CLmaxS = CLmaxS))
    [1] 0.008 0.000 0.018 0.000 0.000

One vector has names for some reason; removing them brings the un-named
data.frame version down to the named version's timing and makes no
difference to the named version:

    Browse[1]> names(CLmaxS) <- NULL
    Browse[1]> system.time(fab <- data.frame(lc.ratio, Q,
    + fNupt,
    + rho.n, rho.s,
    + net.Nimm,
    + net.Nden,
    + CLminN,
    + CLmaxN,
    + CLmaxS))
    [1] 0.008 0.000 0.016 0.000 0.000
    Browse[1]> system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
    + fNupt = fNupt,
    + rho.n = rho.n, rho.s = rho.s,
    + net.Nimm = net.Nimm,
    + net.Nden = net.Nden,
    + CLminN = CLminN,
    + CLmaxN = CLmaxN,
    + CLmaxS = CLmaxS))
    [1] 0.008 0.000 0.009 0.000 0.000

Apologies to the list for bothering you all with my stupidity, and thank
you again to everyone who replied. I knew it was I who was doing
something wrong but couldn't see it, and thanks to your comments,
suggestions and queries I was able to work out what that was.

All the best,

G

-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Gavin Simpson                 [t] +44 (0)20 7679 0522
 ECRC ENSIS, UCL Geography,    [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
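For readers following the thread, the kind of failure described above can be reproduced on a small scale. The sketch below is only an assumption about the general mechanism, not the actual code from the function under discussion; the names n, good, bad, wide and fixed are made up for illustration. It passes data.frame() a named list with one element per observation where a numeric vector was intended, so that each list element becomes its own column.

```r
# Hypothetical reproduction: 'bad' is a named list with one element per
# observation, standing in for a column that was meant to be a vector.
n <- 50L
good <- rnorm(n)
bad <- setNames(as.list(rnorm(n)), paste0("obs", seq_len(n)))

# Each list element is treated as a separate column, so the result has
# n rows and 1 + n columns instead of n rows and 2 columns.
wide <- data.frame(good = good, bad = bad)
dim(wide)

# Flattening the list back to a plain numeric vector restores the
# expected shape.
fixed <- data.frame(good = good, bad = unlist(bad, use.names = FALSE))
dim(fixed)
```

With thousands of observations the number of columns grows in the same way, which would be consistent with the long run times and the late jump in memory use reported earlier, although the thread does not confirm the exact shape of the offending object.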
Re: [R] Quicker way of combining vectors into a data.frame
On Thu, 2006-11-30 at 19:26 +0000, Gavin Simpson wrote:
> <snip />
> I just don't understand what is going on with data.frame. I have yet to
> try Prof. Ripley's suggestion of being a bit naughty with R - I'll see
> if that is any quicker.
>
> Once again, thanks to you all for your suggestions.

Gavin,

Can you post the results of:

    str(fab)

and

    str(lc.ratio)
    str(Q)
    str(fNupt)
    str(rho.n)
    str(rho.s)
    str(net.Nimm)
    str(net.Nden)
    str(CLminN)
    str(CLmaxN)
    str(CLmaxS)

This is taking way too long. There is either something about one or more
of these objects that is more complex than just being simple vectors, or
there is something corrupt in your R session/environment.

You might want to try running a new and clean R session using:

    R --vanilla

and then re-run your code to see if that changes anything. If so, it
suggests that my latter idea may be in play.

HTH,

Marc
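The object-by-object inspection suggested above can also be condensed into a single summary. The following is a minimal sketch under stated assumptions: describe_columns() is a hypothetical helper, not a base R function, and the three inputs are made-up stand-ins for the vectors in the thread.

```r
# Hypothetical helper: one row of diagnostics per candidate column.
describe_columns <- function(cols) {
  data.frame(
    class     = vapply(cols, function(x) class(x)[1], character(1)),
    length    = vapply(cols, length, integer(1)),
    atomic    = vapply(cols, is.atomic, logical(1)),
    has_names = vapply(cols, function(x) !is.null(names(x)), logical(1)),
    stringsAsFactors = FALSE
  )
}

# Made-up inputs: two plain vectors and one list posing as a vector.
cols <- list(lc.ratio = rnorm(10), Q = rnorm(10), CLmaxN = as.list(rnorm(10)))
info <- describe_columns(cols)
print(info)
```

Any row with atomic equal to FALSE, or with an unexpected class or length, points at an object that deserves a closer look with str() before it is handed to data.frame().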
Re: [R] Quicker way of combining vectors into a data.frame
On Thu, 2006-11-30 at 17:00 +0000, Gavin Simpson wrote:
> Hi,
>
> In a function, I compute 10 (un-named) vectors of reasonable length
> (4471 in the particular example I have to hand) that I want to combine
> into a data frame object, that the function will return.
>
> This is very slow, so *I'm* doing something wrong if I want it to be
> quick and efficient, though I'm not sure what the best way to do this
> would be.
>
> I know it is the combining into data frame bit that is slow, because
> I've Rprof'ed it:
>
> $by.self
>                       self.time self.pct total.time total.pct
> names<-.default           16.58     52.8      16.58      52.8
> unlist                     7.22     23.0       7.26      23.1
> data.frame                 1.72      5.5      29.38      93.6
> duplicated.default         1.66      5.3       1.66       5.3
> +                          1.20      3.8       1.20       3.8
> list                       0.40      1.3       0.40       1.3
> as.data.frame.numeric      0.28      0.9       3.32      10.6
> apply                      0.26      0.8       1.70       5.4
> pmatch                     0.22      0.7       0.22       0.7
> paste                      0.20      0.6       0.90       2.9
> deparse                    0.14      0.4       0.70       2.2
> eval                       0.12      0.4      31.28      99.7
> names<-                    0.12      0.4      16.70      53.2
> FUN                        0.12      0.4       1.32       4.2
> names                      0.12      0.4       0.14       0.4
> as.list.default            0.12      0.4       0.12       0.4
> duplicated                 0.10      0.3       1.76       5.6
> gc                         0.10      0.3       0.10       0.3
>
> And I stepped through it under debug() and all the calculations before
> are quick, and then this bit takes a little over 20 seconds to complete:
>
> fab <- data.frame(lc.ratio = lc.ratio, Q = Q, fNupt = fNupt,
>                   rho.n = rho.n, rho.s = rho.s, net.Nimm = net.Nimm,
>                   net.Nden = net.Nden, CLminN = CLminN, CLmaxN = CLmaxN,
>                   CLmaxS = CLmaxS)
>
> I can get it down to c. 5 seconds if I do (not Rprof'ed):
>
> fab <- data.frame(lc.ratio, Q, fNupt, rho.n, rho.s, net.Nimm,
>                   net.Nden, CLminN, CLmaxN, CLmaxS)
>
> But this still seems quite a long time, so I'm thinking that there must
> be a quicker way of doing what I want (end up with a data.frame with
> the 10 vectors in it).
>
> Can anyone enlighten me?

I am inferring from the above that the 10 columns are all numeric, as
there seems to be time spent in the column naming process (the lack of
which speeds up your second example), as well as the use of
as.data.frame.numeric() and related activities.

It is not clear, if this is correct, why you want a data frame as
opposed to a numeric matrix, but in either case:

If we have 10 vectors, named Colx, where x is 1:10 and each vector is:

    > str(Col1)
     num [1:4471] 0.1423 0.1873 -1.8129 0.0255 -1.7650 ...

Then:

    > system.time(Mat <- cbind(Col1, Col2, Col3, Col4, Col5, Col6,
    +                          Col7, Col8, Col9, Col10))
    [1] 0.002 0.000 0.001 0.000 0.000

Or:

    > system.time(DF <- as.data.frame(cbind(Col1, Col2, Col3, Col4, Col5,
    +                                       Col6, Col7, Col8, Col9, Col10)))
    [1] 0.005 0.000 0.005 0.000 0.000

You can then add colnames() subsequent to the cbind()ing:

    > system.time(colnames(Mat) <- c("lc.ratio", "Q", "fNupt", "rho.n",
    +                                "rho.s", "net.Nimm", "net.Nden",
    +                                "CLminN", "CLmaxN", "CLmaxS"))
    [1] 0.002 0.000 0.001 0.000 0.000

    > system.time(colnames(DF) <- c("lc.ratio", "Q", "fNupt", "rho.n",
    +                               "rho.s", "net.Nimm", "net.Nden",
    +                               "CLminN", "CLmaxN", "CLmaxS"))
    [1] 0.011 0.000 0.020 0.000 0.000

    > str(Mat)
     num [1:4471, 1:10] 0.1423 0.1873 -1.8129 0.0255 -1.7650 ...
     - attr(*, "dimnames")=List of 2
      ..$ : NULL
      ..$ : chr [1:10] "lc.ratio" "Q" "fNupt" "rho.n" ...

    > str(DF)
    'data.frame': 4471 obs. of 10 variables:
     $ lc.ratio: num 0.1423 0.1873 -1.8129 0.0255 -1.7650 ...
     $ Q       : num 0.8340 -0.2387 -0.0864 -1.1184 -0.3368 ...
     $ fNupt   : num -0.1718 -0.0549 1.5194 -1.6127 -1.2019 ...
     $ rho.n   : num -0.740 0.240 0.522 -1.492 1.003 ...
     $ rho.s   : num -0.2363 -1.6248 -0.3045 0.0294 0.1240 ...
     $ net.Nimm: num -0.774 0.947 -1.098 0.809 1.216 ...
     $ net.Nden: num -0.198 -0.135 -0.300 -0.618 -0.784 ...
     $ CLminN  : num 0.924 -3.265 0.211 0.813 0.262 ...
     $ CLmaxN  : num 0.3212 -0.0502 -0.9978 0.9005 -1.6535 ...
     $ CLmaxS  : num -0.520 0.278 -0.546 -0.925 1.507 ...

HTH,

Marc Schwartz
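A compact, self-contained version of the cbind() route described above might look like the following. It is a sketch only: the three vectors are random stand-ins for the ten in the thread, and it assumes every column is numeric, since cbind() coerces all inputs to a common type.

```r
# Made-up stand-ins for three of the numeric vectors in the thread.
n <- 4471
Col1 <- rnorm(n)
Col2 <- rnorm(n)
Col3 <- rnorm(n)

# cbind() on numeric vectors gives a numeric matrix; the column names
# are supplied as a character vector (note the quotes).
Mat <- cbind(Col1, Col2, Col3)
colnames(Mat) <- c("lc.ratio", "Q", "fNupt")

# Convert to a data frame only if data-frame semantics are needed.
DF <- as.data.frame(Mat)
str(DF)
```

If any column were a factor or character vector, the matrix step would silently convert everything to character, so in that case data.frame() remains the safer choice.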
Re: [R] Quicker way of combining vectors into a data.frame
If you are prepared to give up most of the sanity checks, see this at
the bottom of read.table:

    ## this is extremely underhanded
    ## we should use the constructor function ...
    ## don't try this at home kids

    class(data) <- "data.frame"
    row.names(data) <- row.names
    data

So create a (named?) list with your vectors in it, assign class
"data.frame" and then

    row.names(data) <- NULL

On Thu, 30 Nov 2006, Gavin Simpson wrote:

> <snip />
> Can anyone enlighten me?
>
> > version
>                _
> platform       i686-pc-linux-gnu
> arch           i686
> os             linux-gnu
> system         i686, linux-gnu
> status         Patched
> major          2
> minor          4.0
> year           2006
> month          10
> day            03
> svn rev        39576
> language       R
> version.string R version 2.4.0 Patched (2006-10-03 r39576)
>
> > sessionInfo()
> R version 2.4.0 Patched (2006-10-03 r39576)
> i686-pc-linux-gnu
>
> locale:
> LC_CTYPE=en_GB.UTF-8;LC_NUMERIC=C;LC_TIME=en_GB.UTF-8;LC_COLLATE=en_GB.UTF-8;LC_MONETARY=en_GB.UTF-8;LC_MESSAGES=en_GB.UTF-8;LC_PAPER=en_GB.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_GB.UTF-8;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] methods stats graphics grDevices utils datasets
> [7] base
>
> Thanks in advance,
>
> G

-- 
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
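As a rough illustration of the low-level approach suggested above, the sketch below builds a data frame from a named list without calling data.frame(). It is a variant rather than the exact code quoted from read.table: it uses structure() to attach the class and row names in one step, and the column names and lengths are placeholders.

```r
# Illustrative columns; the names and values are placeholders.
n <- 5L
cols <- list(lc.ratio = rnorm(n), Q = rnorm(n))

# Attach the class and integer row names directly, skipping the checks
# that data.frame() performs. Nothing here verifies that the columns
# have equal length, so the caller has to guarantee that.
fab <- structure(cols, class = "data.frame", row.names = seq_len(n))

str(fab)
```

Because all validation is bypassed, a list with columns of unequal length would yield a corrupt object, which is why this route is only worth considering when construction time has been shown to matter.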
Re: [R] Quicker way of combining vectors into a data.frame
Hi!

I don't know for sure - and I have not tried it yet - but how about
allocating a matrix which will hold everything, then putting all the
vectors in it and finally assigning some dimnames to it:

    data <- matrix(0, ncol=5, nrow=length(vec1))
    data[,1] <- vec1
    ...
    dimnames(data) <- list(NULL, c("1","2","3","4","5"))
    as.data.frame(data)

I forgot, I of course assume all of your vectors to be numeric ...

Hope that helps!

Greetings,

Sebastian

On Thu, 2006-11-30 at 17:00 +0000, Gavin Simpson wrote:
> <snip />
> Can anyone enlighten me?
>
> Thanks in advance,
>
> G
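The preallocation idea above can be written out in full as follows. This is a minimal sketch, assuming three numeric vectors of equal length; vec1, vec2, vec3 and the column names are invented for the example.

```r
# Placeholder vectors standing in for the real results.
vec1 <- rnorm(100)
vec2 <- rnorm(100)
vec3 <- rnorm(100)

# Allocate the matrix once, then fill it one column at a time.
data <- matrix(0, nrow = length(vec1), ncol = 3)
data[, 1] <- vec1
data[, 2] <- vec2
data[, 3] <- vec3

# Row names are left as NULL; the second element names the columns.
dimnames(data) <- list(NULL, c("a", "b", "c"))
fab <- as.data.frame(data)
head(fab)
```

Note that the vectors go into columns (data[, j]) because the matrix has one row per observation, and that the column names belong in the second element of the dimnames list.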
Re: [R] Quicker way of combining vectors into a data.frame
Gavin,

One more note, which is that even timing the direct data frame creation
on my system with colnames, again using the same 10 numeric columns, I
get:

    > system.time(DF1 <- data.frame(lc.ratio = Col1, Q = Col2, fNupt = Col3,
    +                                rho.n = Col4, rho.s = Col5, net.Nimm = Col6,
    +                                net.Nden = Col7, CLminN = Col8, CLmaxN = Col9,
    +                                CLmaxS = Col10))
    [1] 0.012 0.000 0.028 0.000 0.000

    > str(DF1)
    'data.frame': 4471 obs. of 10 variables:
     $ lc.ratio: num 0.1423 0.1873 -1.8129 0.0255 -1.7650 ...
     $ Q       : num 0.8340 -0.2387 -0.0864 -1.1184 -0.3368 ...
     $ fNupt   : num -0.1718 -0.0549 1.5194 -1.6127 -1.2019 ...
     $ rho.n   : num -0.740 0.240 0.522 -1.492 1.003 ...
     $ rho.s   : num -0.2363 -1.6248 -0.3045 0.0294 0.1240 ...
     $ net.Nimm: num -0.774 0.947 -1.098 0.809 1.216 ...
     $ net.Nden: num -0.198 -0.135 -0.300 -0.618 -0.784 ...
     $ CLminN  : num 0.924 -3.265 0.211 0.813 0.262 ...
     $ CLmaxN  : num 0.3212 -0.0502 -0.9978 0.9005 -1.6535 ...
     $ CLmaxS  : num -0.520 0.278 -0.546 -0.925 1.507 ...

So there is something else going on, either with your code or some other
conflict, unless my assumptions about your data are incorrect.

HTH,

Marc