Marie Pierre Sylvestre wrote:
> Dear R users,
>
> I am analysing a very large data set and I need to perform several data
> manipulations. The dataset is so big that the only way I can play with it
> without having memory problems (e.g. "cannot allocate vectors of size...")
> is to write a batch script to:
>
> 1. cut the data into pieces
> 2. save the pieces in separate .RData files
> 3. Remove everything from the environment
> 4. load one of the pieces
> 5. perform the manipulations on it
> 6. save it and remove it from the environment
> 7. Redo 4-6 for every piece
> 8. Merge everything together at the end
>
> It works if coded line by line but since I'll have to perform these tasks
> on other data sets, I am trying to automate this as much as I can.

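For comparison, the eight steps quoted above can also be sketched in plain
base R. This is only a sketch under assumptions: 'big', 'my.fun' and the
temporary directory 'piece.dir' are made-up stand-ins, not anything from the
original post. The one thing it relies on is that save() accepts object names
as a character vector through its list= argument, so a name built with
paste() can be passed directly.

```r
# Hypothetical stand-ins (not from the original post): 'big' plays the
# large data frame, 'my.fun' the per-piece manipulation.
big <- data.frame(g = rep(1:12, each = 5), y = seq_len(60))
my.fun <- function(d) transform(d, y2 = y * 2)

piece.dir <- tempfile("pieces")   # assumed scratch directory
dir.create(piece.dir)

# steps 1-3: cut into pieces, save each under a constructed name, clean up
pieces <- split(big, big$g)
n <- length(pieces)
for (i in seq_len(n)) {
  nm <- paste("data", i, sep = "")
  assign(nm, pieces[[i]])
  # save() takes object *names* via list=, so the pasted name works here
  save(list = nm, file = file.path(piece.dir, paste(nm, "RData", sep = ".")))
  rm(list = nm)
}
rm(pieces)

# steps 4-7: load one piece, transform it, save it back, remove it
files <- file.path(piece.dir, paste("data", seq_len(n), ".RData", sep = ""))
for (f in files) {
  nm <- load(f)                    # load() returns the names it restored
  assign(nm, my.fun(get(nm)))
  save(list = nm, file = f)
  rm(list = nm)
}

# step 8: merge everything, loading each piece into a throwaway environment
result <- do.call(rbind, lapply(files, function(f) {
  e <- new.env()
  get(load(f, envir = e), envir = e)
}))

unlink(piece.dir, recursive = TRUE)
```

Nothing is echoed by get() here because its value is always passed on to
another call; at the top level, a bare get(...) is auto-printed like any
other expression.
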
The trackObjs package is designed to make it easy to work in approximately
this manner -- it saves objects automatically to disk but they are still
accessible as normal.

Here's how you could do the above - this example works with ten 8Mb objects
in an R session with a limit of 40Mb.

# allow R only 40Mb of vector memory
mem.limits(vsize=40e6)
mem.limits()/1e6
library(trackObjs)
# start tracking to store data objects in the directory 'data'
# each object is 8Mb, and we store 10 of them
track.start("data")
n <- 10
m <- 1e6
constructObject <- function(i) i+rnorm(m)
# steps 1, 2 & 3:
for (i in 1:n) {
   xname <- paste("x", i, sep="")
   cat("", xname)
   assign(xname, constructObject(i))
   # store in a file, accessible by name:
   track(list=xname)
}
cat("\n")
gc(TRUE)
# accessing object by name
object.size(x1)/2^20 # In Mb
mean(x1)
mean(x2)
gc(TRUE)
# steps 4:6
# accessing object through a constructed name
result <- sapply(1:n, function(i) mean(get(paste("x", i, sep=""))))
result
# remove the data objects
track.remove(list=paste("x", 1:n, sep=""))
track.stop()

Here's a full transcript of the above - note how whenever gc() is called
there is hardly any vector memory in use.

> # allow R only 40Mb of vector memory
> mem.limits(vsize=40e6)
   nsize    vsize
      NA 40000000
> mem.limits()/1e6
nsize vsize
   NA    40
> library(trackObjs)
> # start tracking to store data objects in the directory 'data'
> # each object is 8Mb, and we store 10 of them
> track.start("data")
> n <- 10
> m <- 1e6
> constructObject <- function(i) i+rnorm(m)
> # steps 1, 2 & 3:
> for (i in 1:n) {
+    xname <- paste("x", i, sep="")
+    cat("", xname)
+    assign(xname, constructObject(i))
+    # store in a file, accessible by name:
+    track(list=xname)
+ }
 x1 x2 x3 x4 x5 x6 x7 x8 x9 x10> cat("\n")

> gc(TRUE)
Garbage collection 19 = 6+0+13 (level 2) ...
4.0 Mbytes of cons cells used (42%)
0.7 Mbytes of vectors used (5%)
         used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells 148362  4.0     350000  9.4         NA   350000  9.4
Vcells  89973  0.7    1950935 14.9       38.2  2112735 16.2
> # accessing object by name
> object.size(x1)/2^20 # In Mb
[1] 7.629417
> mean(x1)
[1] 0.998635
> mean(x2)
[1] 1.999656
> gc(TRUE)
Garbage collection 22 = 7+1+14 (level 2) ...
4.0 Mbytes of cons cells used (43%)
0.7 Mbytes of vectors used (6%)
         used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells 149264  4.0     350000  9.4         NA   350000  9.4
Vcells  90160  0.7    1560747 12.0       38.2  2112735 16.2
> # steps 4:6
> result <- sapply(1:n, function(i) mean(get(paste("x", i, sep=""))))
> result
 [1]  0.998635  1.999656  2.997368  4.000197  5.000159  6.001216  6.999552
 [8]  7.999743  8.999982 10.001355
> # remove the data objects
> track.remove(list=paste("x", 1:n, sep=""))
 [1] "x1"  "x2"  "x3"  "x4"  "x5"  "x6"  "x7"  "x8"  "x9"  "x10"
> track.stop()
>

>
> I am using a loop in which I used 'assign' and 'get' (pseudo code below).
> My problem is when I use 'get', it prints the whole object on the screen.
> I am wondering whether there is a more efficient way to do what I need to
> do. Any help would be appreciated. Please keep in mind that the whole
> process is quite computer-intensive, so I can't keep everything in the
> environment while R performs calculations.
>
> Say I have 1 big dataframe called data. I use 'split' to divide it into a
> list of 12 dataframes (call this list my.list)
>
> my.fun is a function that takes a dataframe, performs several
> manipulations on it and returns a dataframe.
>
>
> for (i in 1:12){
> assign( paste( "data", i, sep=""), my.fun(my.list[i])) # this works
> # now I need to save this new object as a RData.
>
> # The following line does not work
> save(paste("data", i, sep = ""), file = paste( paste("data", i, sep =
> ""), "RData", sep="."))
> }
>
> # This works but it is a bit convoluted!!!
> temp <- get(paste("data", i, sep = ""))
> save(temp, file = "lala.RData")
> }
>
>
> I am *sure* there is something more clever to do but I can't find it. Any
> help would be appreciated.
>
> best regards,
>
> MP
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.