> From: Huiqin Yang [mailto:[EMAIL PROTECTED]
>
> Hi all,
>
> I am a beginner trying to use R to work with large amounts of
> oceanographic data, and I find that computations can be VERY
> slow. In particular, computational speed seems to depend
> strongly on the number and size of the objects that are
> loaded (when R starts up). The same computations are
> significantly faster when all but the essential objects are
> removed. I am running R on a machine with 16 GB of RAM, and
> our unix system manager assures me that there is memory
> available to my R process that has not been used.
>
> 1. Is the problem associated with how R uses memory? If so,
> is there some way to increase the amount of memory used by my
> R process to get better performance?

Is R compiled as 64-bit? If not, it won't be able to use more than 4GB of
RAM (that's my understanding, anyway). R keeps objects in memory, so if you
are working with large amounts of data, it's a good habit to keep only the
absolutely essential objects in the workspace, and to save() and rm()
anything you don't need for the computation.

>
> The computations that are particularly slow involve looping
> with by(). The data are measurements of vertical profiles of
> pressure, temperature, and salinity at a number of stations,
> which are organized into a dataframe p.1 (1925930 rows, 8
> columns: id, p, t, and s, etc.), and the objective is to get
> a much smaller dataframe (there are 1409 unique values of id)
> with the minimum and maximum pressure for each profile.
> The slow part is:
>
> h.maxmin <- by(p.1,p.1$id,function(x){
>     data.frame(id=x$id[1],
>                maxp=max(x$p),
>                minp=min(x$p))})
>
> 2. Even with unneeded data objects removed, this is very
> slow. Is there a faster way to get the maximum and minimum values?

Why do you need to use by(), and why have the function return a data frame
containing only one row? Building a one-row data frame for every group is
expensive. Here's an experiment on my 900MHz PIII laptop:

> n <- 1e5
> dat <- data.frame(id = sort(sample(LETTERS, n, replace=TRUE)),
+                   p = rnorm(n))
>
> system.time(h.maxmin <- by(dat, dat$id, function(x) {
+     data.frame(id=x$id[1], maxp=max(x$p), minp=min(x$p))}))
[1] 2.75 0.01 2.78   NA   NA
> system.time(junk <- tapply(dat$p, dat$id, function(x) range(x)))
[1] 0.12 0.01 0.13   NA   NA

If you want to coerce the result to a data frame with id as row names and
min and max as the two variables, you can do:

junk.dat <- as.data.frame(do.call("rbind", junk))

HTH,
Andy

> platform sparc-sun-solaris2.9
> arch     sparc
> os       solaris2.9
> system   sparc, solaris2.9
> status
> major    1
> minor    7.0
> year     2003
> month    04
> day      16
> language R
>
> Thank you for your time.
>
> Helen

______________________________________________
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
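
For anyone who wants named min/max columns directly, here is a minimal sketch of the tapply() idea from the reply above. It is only an illustration on simulated data: the column names id and p follow the original post, while dat, minp, maxp and the simulated values are assumptions and do not come from the real p.1.

```r
# Simulated stand-in for p.1 (assumption: only 'id' and 'p' matter here).
set.seed(1)
n <- 1e5
dat <- data.frame(id = sample(LETTERS, n, replace = TRUE),
                  p  = rnorm(n))

# One tapply() call per statistic; each returns a vector named by id,
# ordered by the sorted unique values of id.
minp <- tapply(dat$p, dat$id, min)
maxp <- tapply(dat$p, dat$id, max)

# Assemble one row per id, with explicit column names.
h.maxmin <- data.frame(id   = names(minp),
                       minp = as.vector(minp),
                       maxp = as.vector(maxp))
```

Two calls to tapply() scan the data twice, but each call works on a plain numeric vector and avoids constructing a data frame per group, which is likely where most of the by() time goes.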