[R] Computations slow in spite of large amounts of RAM.

2003-07-01 Thread Huiqin Yang
Hi all,

I am a beginner trying to use R to work with large amounts of
oceanographic data, and I find that computations can be VERY slow.  In
particular, computational speed seems to depend strongly on the number
and size of the objects that are loaded (when R starts up).  The same
computations are significantly faster when all but the essential
objects are removed.  I am running R on a machine with 16 GB of RAM,
and our unix system manager assures me that there is memory available
to my R process that has not been used.

1.  Is the problem associated with how R uses memory?  If so, is there
some way to increase the amount of memory used by my R process to get
better performance?

The computations that are particularly slow involve looping with
by().  The data are measurements of vertical profiles of pressure,
temperature, and salinity at a number of stations, which are organized
into a data frame p.1 (1925930 rows, 8 columns: id, p, t, s, etc.). The
objective is to get a much smaller data frame with the minimum and maximum
pressure for each of the 1409 unique values of id (one row per profile).
The slow part is:

h.maxmin <- by(p.1, p.1$id, function(x) {
  data.frame(id   = x$id[1],
             maxp = max(x$p),
             minp = min(x$p))
})

2.  Even with unneeded data objects removed, this is very slow.  Is
there a faster way to get the maximum and minimum values?

platform sparc-sun-solaris2.9
arch     sparc
os       solaris2.9
system   sparc, solaris2.9
status
major    1
minor    7.0
year     2003
month    04
day      16
language R

Thank you for your time.

Helen

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


RE: [R] Computations slow in spite of large amounts of RAM.

2003-07-01 Thread Liaw, Andy
 From: Huiqin Yang [mailto:[EMAIL PROTECTED] 
 
 Hi all,
 
 I am a beginner trying to use R to work with large amounts of 
 oceanographic data, and I find that computations can be VERY 
 slow.  In particular, computational speed seems to depend 
 strongly on the number and size of the objects that are 
 loaded (when R starts up).  The same computations are 
 significantly faster when all but the essential objects are 
 removed.  I am running R on a machine with 16 GB of RAM, and 
 our unix system manager assures me that there is memory 
 available to my R process that has not been used.
 
 1.  Is the problem associated with how R uses memory?  If so, 
 is there some way to increase the amount of memory used by my 
 R process to get better performance?

Is R compiled as 64-bit?  If not, it won't be able to use more than 4GB of
RAM (that's my understanding, anyway).
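A rough way to check (assuming a build of R whose .Machine reports the
pointer size) is:

8 * .Machine$sizeof.pointer   # 64 on a 64-bit build of R, 32 on a 32-bit one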

R keeps objects in memory, so if you are working with large amounts of data,
it's a good habit to keep only the absolutely essential objects in the
workspace, and to save() and rm() the things you don't need for the
computation.
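A minimal sketch of that workflow (the object name big.raw below is a
hypothetical placeholder, not one of your objects):

big.raw <- rnorm(1e6)                  # stand-in for a large raw object
save(big.raw, file = "big_raw.RData")  # park it on disk
rm(big.raw)                            # drop it from the workspace
gc()                                   # let R release the freed memory
## ... run the computation on the remaining objects ...
load("big_raw.RData")                  # restore it later when needed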

 
 The computations that are particularly slow involve looping 
 with by().  The data are measurements of vertical profiles of 
 pressure, temperature, and salinity at a number of stations, 
 which are organized into a dataframe p.1 (1925930 rows, 8 
 columns: id, p, t, s, etc.). The objective is to get a much smaller 
 data frame with the minimum and maximum pressure for each of the 1409 
 unique values of id (one row per profile).  The slow part is:
 
 h.maxmin <- by(p.1, p.1$id, function(x) {
   data.frame(id   = x$id[1],
              maxp = max(x$p),
              minp = min(x$p))
 })
 
 2.  Even with unneeded data objects removed, this is very 
 slow.  Is there a faster way to get the maximum and minimum values?

Why do you need to use by(), and why have the function return a data frame
containing only one row?  Here's an experiment on my 900MHz PIII laptop:

> n <- 1e5
> dat <- data.frame(id = sort(sample(LETTERS, n, replace = TRUE)),
+                   p = rnorm(n))
> system.time(h.maxmin <- by(dat, dat$id, function(x) {
+   data.frame(id = x$id[1], maxp = max(x$p), minp = min(x$p))}))
[1] 2.75 0.01 2.78   NA   NA
> system.time(junk <- tapply(dat$p, dat$id, function(x) range(x)))
[1] 0.12 0.01 0.13   NA   NA

If you want to coerce the result to a data frame with id as row names and
min and max as the two variables, you can do:

  junk.dat <- as.data.frame(do.call(rbind, junk))
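To go one step further (a hedged sketch, assuming junk holds the ranges
computed above), the two columns can be labelled and the id recovered from
the row names; range() returns c(min, max), so column 1 is the minimum and
column 2 the maximum:

names(junk.dat) <- c("minp", "maxp")
junk.dat$id <- rownames(junk.dat)   # tapply() kept the ids as names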

HTH,
Andy


 
 platform sparc-sun-solaris2.9
 arch     sparc
 os       solaris2.9
 system   sparc, solaris2.9
 status
 major    1
 minor    7.0
 year     2003
 month    04
 day      16
 language R
 
 Thank you for your time.
 
 Helen
 


__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] Computations slow in spite of large amounts of RAM.

2003-07-01 Thread Douglas Bates
Huiqin Yang <[EMAIL PROTECTED]> writes:

 Hi all,
 
 I am a beginner trying to use R to work with large amounts of
 oceanographic data, and I find that computations can be VERY slow.  In
 particular, computational speed seems to depend strongly on the number
 and size of the objects that are loaded (when R starts up).  The same
 computations are significantly faster when all but the essential
 objects are removed.  I am running R on a machine with 16 GB of RAM,
 and our unix system manager assures me that there is memory available
 to my R process that has not been used.
 
 1.  Is the problem associated with how R uses memory?  If so, is there
 some way to increase the amount of memory used by my R process to get
 better performance?

You could try setting larger nsize and vsize limits using

 mem.limits()

See the description in ?Memory.
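For what it's worth, a hedged sketch of the sort of thing ?Memory describes
(argument names as documented around R 1.7.x; the numbers are arbitrary
placeholders, and the units are explained in ?Memory):

mem.limits()                 # report the current nsize/vsize ceilings
mem.limits(vsize = 1e9)      # e.g. raise the maximum vector-heap limit
## The limits can also be set when starting R, along the lines of
##   R --min-vsize=10M --max-vsize=1G --min-nsize=500k --max-nsize=10M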

 The computations that are particularly slow involve looping with
 by().  The data are measurements of vertical profiles of pressure,
 temperature, and salinity at a number of stations, which are organized
 into a data frame p.1 (1925930 rows, 8 columns: id, p, t, s, etc.). The
 objective is to get a much smaller data frame with the minimum and maximum
 pressure for each of the 1409 unique values of id (one row per profile).
 The slow part is:
 
 h.maxmin <- by(p.1, p.1$id, function(x) {
   data.frame(id   = x$id[1],
              maxp = max(x$p),
              minp = min(x$p))
 })

I think it would be faster to use

h.maxmin <- tapply(p.1$p, p.1$id, range)

In the call to by you are subsetting the entire data frame and that
probably means taking at least one copy of that frame.  If you use
tapply on only the relevant columns you will use much less space.
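Put together, a minimal sketch (assuming p.1 has the columns described above)
might look like:

rng <- tapply(p.1$p, p.1$id, range)        # list of c(min, max) per id
h.maxmin <- data.frame(id   = names(rng),
                       minp = sapply(rng, "[", 1),
                       maxp = sapply(rng, "[", 2))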

 2.  Even with unneeded data objects removed, this is very slow.  Is
 there a faster way to get the maximum and minimum values?

See above.


-- 
Douglas Bates                     [EMAIL PROTECTED]
Statistics Department             608/262-2598
University of Wisconsin - Madison http://www.stat.wisc.edu/~bates/

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help