From: Huiqin Yang [mailto:[EMAIL PROTECTED]
Hi all,
I am a beginner trying to use R to work with large amounts of
oceanographic data, and I find that computations can be VERY
slow. In particular, computational speed seems to depend
strongly on the number and size of the objects that are
loaded (when R starts up). The same computations are
significantly faster when all but the essential objects are
removed. I am running R on a machine with 16 GB of RAM, and
our unix system manager assures me that there is memory
available to my R process that has not been used.
1. Is the problem associated with how R uses memory? If so,
is there some way to increase the amount of memory used by my
R process to get better performance?
Is R compiled as 64-bit? If not, it won't be able to use more than 4GB of
RAM (that's my understanding, anyway).
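One way to check this from inside R is sketched below. It assumes a version of R that provides `.Machine$sizeof.pointer`; the variable name `ptr.bytes` is made up for the example.

```r
# Pointer size in bytes: 8 indicates a 64-bit build, 4 a 32-bit build.
ptr.bytes <- .Machine$sizeof.pointer
cat("Pointer size:", ptr.bytes, "bytes\n")

# The architecture R was built for, as also reported by R.version.
cat("Architecture:", R.version$arch, "\n")
```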
R keeps objects in memory, so if you are working with large amounts of data,
it's a good habit to keep only the absolutely essential objects in the
workspace, and to save() and rm() things you don't need for the computation.
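A minimal sketch of that habit follows. The object `big.tmp` and the file name are hypothetical stand-ins for whatever is not needed in the current computation.

```r
# Hypothetical large object that is not needed for the next computation.
big.tmp <- matrix(rnorm(1e6), ncol = 10)

# Sizes (in bytes) of the objects in the workspace, largest first.
sizes <- vapply(ls(), function(nm) as.numeric(object.size(get(nm))), numeric(1))
print(sort(sizes, decreasing = TRUE))

# Save the object to disk, remove it, and let R reclaim the memory.
f <- file.path(tempdir(), "big.tmp.RData")  # hypothetical file location
save(big.tmp, file = f)
rm(big.tmp)
gc()

# Later, when the object is needed again:
load(f)
```

load() restores the object under its original name, so nothing else in the code has to change.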
The computations that are particularly slow involve looping
with by(). The data are measurements of vertical profiles of
pressure, temperature, and salinity at a number of stations,
which are organized into a dataframe p.1 (1925930 rows, 8
columns: id, p, t, s, etc.). The objective is to get a much
smaller dataframe with the minimum and maximum pressure for
each profile (there are 1409 unique values of id). The slow
part is:
h.maxmin <- by(p.1, p.1$id, function(x) {
  data.frame(id = x$id[1],
             maxp = max(x$p),
             minp = min(x$p))})
2. Even with unneeded data objects removed, this is very
slow. Is there a faster way to get the maximum and minimum values?
Why do you need to use by(), and why have the function return a data frame
containing only one row? Here's an experiment on my 900MHz PIII laptop:
> n <- 1e5
> dat <- data.frame(id = sort(sample(LETTERS, n, replace=TRUE)),
+                   p = rnorm(n))
> system.time(h.maxmin <- by(dat, dat$id, function(x) {
+     data.frame(id=x$id[1], maxp=max(x$p), minp=min(x$p))}))
[1] 2.75 0.01 2.78 NA NA
> system.time(junk <- tapply(dat$p, dat$id, function(x) range(x)))
[1] 0.12 0.01 0.13 NA NA
If you want to coerce the result to a data frame with id as row names and
min and max as the two variables, you can do:
> junk.dat <- as.data.frame(do.call(rbind, junk))
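The result above has default column names (V1 and V2). If the columns should be named as in the original by() call, a variation along these lines may be more convenient. This is only a sketch on simulated data of the same shape as the experiment, not the code that was timed above; set.seed() is added so the example is repeatable.

```r
set.seed(1)
n <- 1e5
dat <- data.frame(id = sort(sample(LETTERS, n, replace = TRUE)),
                  p  = rnorm(n))

# One tapply() call per statistic; each result is named by id.
minp <- tapply(dat$p, dat$id, min)
maxp <- tapply(dat$p, dat$id, max)

# Assemble a data frame with one row per id.
h.maxmin <- data.frame(id   = names(minp),
                       minp = as.vector(minp),
                       maxp = as.vector(maxp))
head(h.maxmin)
```

For the real data, dat$p and dat$id would be replaced by p.1$p and p.1$id.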
HTH,
Andy
platform sparc-sun-solaris2.9
arch     sparc
os       solaris2.9
system   sparc, solaris2.9
status
major    1
minor    7.0
year     2003
month    04
day      16
language R
Thank you for your time.
Helen
__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help