[R] Computations slow in spite of large amounts of RAM.
Hi all,

I am a beginner trying to use R to work with large amounts of oceanographic data, and I find that computations can be VERY slow. In particular, computational speed seems to depend strongly on the number and size of the objects that are loaded when R starts up: the same computations are significantly faster when all but the essential objects are removed. I am running R on a machine with 16 GB of RAM, and our unix system manager assures me that there is memory available to my R process that has not been used.

1. Is the problem associated with how R uses memory? If so, is there some way to increase the amount of memory used by my R process to get better performance?

The computations that are particularly slow involve looping with by(). The data are measurements of vertical profiles of pressure, temperature, and salinity at a number of stations, organized into a data frame p.1 (1925930 rows, 8 columns: id, p, t, s, etc.). There are 1409 unique values of id, and the objective is to get a much smaller data frame containing the minimum and maximum pressure for each profile. The slow part is:

  h.maxmin <- by(p.1, p.1$id, function(x) {
      data.frame(id = x$id[1], maxp = max(x$p), minp = min(x$p))
  })

2. Even with unneeded data objects removed, this is very slow. Is there a faster way to get the maximum and minimum values?

platform  sparc-sun-solaris2.9
arch      sparc
os        solaris2.9
system    sparc, solaris2.9
status
major     1
minor     7.0
year      2003
month     04
day       16
language  R

Thank you for your time.
Helen
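For readers who want to try the computation themselves, a minimal sketch of a simulated stand-in for p.1 is below. The column names and sizes follow the description in the post, but the values (and the pressure, temperature, and salinity ranges) are invented, since the actual data are not available:

  ## Simulated stand-in for p.1: 1409 profiles, ~1.9 million rows (values invented)
  set.seed(1)
  n.rows <- 1925930
  n.ids  <- 1409
  p.1 <- data.frame(id = sample(1:n.ids, n.rows, replace = TRUE),
                    p  = runif(n.rows, 0, 5000),  # "pressure", invented range
                    t  = rnorm(n.rows, 10),       # "temperature", invented
                    s  = rnorm(n.rows, 35))       # "salinity", invented

  ## The slow per-profile min/max computation from the post:
  h.maxmin <- by(p.1, p.1$id, function(x)
      data.frame(id = x$id[1], maxp = max(x$p), minp = min(x$p)))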
RE: [R] Computations slow in spite of large amounts of RAM.
From: Huiqin Yang [mailto:[EMAIL PROTECTED]]
>
> Hi all,
>
> I am a beginner trying to use R to work with large amounts of
> oceanographic data, and I find that computations can be VERY slow.
> In particular, computational speed seems to depend strongly on the
> number and size of the objects that are loaded when R starts up:
> the same computations are significantly faster when all but the
> essential objects are removed. I am running R on a machine with
> 16 GB of RAM, and our unix system manager assures me that there is
> memory available to my R process that has not been used.
>
> 1. Is the problem associated with how R uses memory? If so, is there
> some way to increase the amount of memory used by my R process to
> get better performance?

Is R compiled as 64-bit? If not, it won't be able to use more than 4GB of RAM (that's my understanding, anyway). R keeps objects in memory, so if you are working with large amounts of data, it's a good habit to keep only the absolutely essential objects in the workspace, and save() and rm() things you don't need for the computation.

> The computations that are particularly slow involve looping with by().
> The data are measurements of vertical profiles of pressure,
> temperature, and salinity at a number of stations, organized into a
> data frame p.1 (1925930 rows, 8 columns: id, p, t, s, etc.). There
> are 1409 unique values of id, and the objective is to get a much
> smaller data frame containing the minimum and maximum pressure for
> each profile. The slow part is:
>
>   h.maxmin <- by(p.1, p.1$id, function(x) {
>       data.frame(id = x$id[1], maxp = max(x$p), minp = min(x$p))
>   })
>
> 2. Even with unneeded data objects removed, this is very slow. Is
> there a faster way to get the maximum and minimum values?

Why do you need to use by(), and why have the function return a data frame containing only one row? Here's an experiment on my 900MHz PIII laptop:

> n <- 1e5
> dat <- data.frame(id = sort(sample(LETTERS, n, replace = TRUE)),
+                   p = rnorm(n))
> system.time(h.maxmin <- by(dat, dat$id, function(x) {
+     data.frame(id = x$id[1], maxp = max(x$p), minp = min(x$p))}))
[1] 2.75 0.01 2.78   NA   NA
> system.time(junk <- tapply(dat$p, dat$id, function(x) range(x)))
[1] 0.12 0.01 0.13   NA   NA

If you want to coerce the result to a data frame with id as row names and min and max as the two variables, you can do:

  junk.dat <- as.data.frame(do.call(rbind, junk))

HTH,
Andy
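Putting the tapply()/do.call() suggestion above into the shape asked for originally (a small data frame with one row per id and its minimum and maximum pressure), one possible sketch is below; the object names junk, mm, and h.maxmin are just placeholders:

  ## One range() per id, then bind the results into a small data frame
  junk <- tapply(p.1$p, p.1$id, range)   # one c(min, max) per id
  mm   <- do.call(rbind, junk)           # 1409 x 2 matrix, ids as row names
  h.maxmin <- data.frame(id   = names(junk),
                         minp = mm[, 1],
                         maxp = mm[, 2],
                         row.names = NULL)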
Re: [R] Computations slow in spite of large amounts of RAM.
Huiqin Yang <[EMAIL PROTECTED]> writes:

> Hi all,
>
> I am a beginner trying to use R to work with large amounts of
> oceanographic data, and I find that computations can be VERY slow.
> In particular, computational speed seems to depend strongly on the
> number and size of the objects that are loaded when R starts up:
> the same computations are significantly faster when all but the
> essential objects are removed. I am running R on a machine with
> 16 GB of RAM, and our unix system manager assures me that there is
> memory available to my R process that has not been used.
>
> 1. Is the problem associated with how R uses memory? If so, is there
> some way to increase the amount of memory used by my R process to
> get better performance?

You could try setting a larger nsize and vsize using mem.limits(). See the description in ?Memory.

> The computations that are particularly slow involve looping with by().
> The data are measurements of vertical profiles of pressure,
> temperature, and salinity at a number of stations, organized into a
> data frame p.1 (1925930 rows, 8 columns: id, p, t, s, etc.). There
> are 1409 unique values of id, and the objective is to get a much
> smaller data frame containing the minimum and maximum pressure for
> each profile. The slow part is:
>
>   h.maxmin <- by(p.1, p.1$id, function(x) {
>       data.frame(id = x$id[1], maxp = max(x$p), minp = min(x$p))
>   })

I think it would be faster to use

  h.maxmin <- tapply(p.1$p, p.1$id, range)

In the call to by() you are subsetting the entire data frame, and that probably means taking at least one copy of that frame. If you use tapply() on only the relevant columns you will use much less space.

> 2. Even with unneeded data objects removed, this is very slow. Is
> there a faster way to get the maximum and minimum values?

See above.

--
Douglas Bates                      [EMAIL PROTECTED]
Statistics Department              608/262-2598
University of Wisconsin - Madison  http://www.stat.wisc.edu/~bates/
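On the memory side, a hedged sketch of the kind of thing ?Memory describes is below. The sizes are placeholders, and whether mem.limits() can set (rather than just report) the limits depends on the R version, so check ?Memory on the installation in question:

  ## From the shell, R can be started with larger initial heap sizes, e.g.
  ##   R --min-nsize=1M --min-vsize=256M
  ## (option names and accepted suffixes are documented in ?Memory).

  ## Inside R, one can check what the workspace is actually consuming:
  gc()              # cons cells and vector heap in use, plus trigger levels
  object.size(p.1)  # approximate size of the large data frame, in bytes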