Re: [R] Possible memory leak with R v.2.5.0
Hi Peter --

Here's my guess. Ironically, adding things to broken code reduces the
signal-to-noise ratio, so I stripped your function down to

    get.vars.for.cluster = function( cluster,
                                     genes=get.global( "gene.ids" ),
                                     ratios=get.global( "ratios" ) )
    {
        cluster <<- cluster
        rows <- cluster$rows
        cols <- cluster$cols

        r <- ratios[ rows, cols ]
        avg.rows <- apply( r, 2, mean, na.rm=TRUE )
        r.all <- ratios[ genes, cols ]
        devs <- apply( r.all, 1, "-", avg.rows )

        apply( devs, 2, var, na.rm=TRUE )
    }

as something that might reproduce your problem (though I can't be
sure!). The unusual bit is

    cluster <<- cluster

At first I thought this would be a no-op (assigning cluster to
itself), but '<<-' never assigns in the function's own frame: it
starts its search in the function's enclosing environments, so
'cluster' gets assigned outside the function.

So my guess is that get.vars.for.cluster is part of a package, and the
package has a variable called 'cluster'. get.vars.for.cluster then
assigns its first argument to the package variable 'cluster' (the
first variable called 'cluster' that '<<-' encounters).
rm(list=ls(all=TRUE)) removes everything from the global environment,
but (fortunately!) not from the package environment. You might end up
storing more than 'just' the cluster, depending on what it is.
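A minimal sketch of that lookup, pasted into a fresh session (toy code,
not your actual function):

    f <- function(cluster) cluster <<- cluster
    exists("cluster")            # FALSE: no 'cluster' anywhere yet
    f(list(rows=1:3, cols=1:5))  # '<<-' searches f's enclosing environments
    exists("cluster")            # TRUE: the binding was created outside f()

In a fresh session the assignment lands in the global environment,
because '<<-' finds no existing 'cluster' on its way up the search
path; in your session the first 'cluster' it finds presumably lives in
the package environment, so that copy is the one that gets (and keeps)
the value.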
So I think the solution is to rethink the use of '<<-' (and also of
get.global(), which is either there for convenience -- in which case it
would probably be better to specify a default for the function
argument -- or there out of a sense that copying is bad. The latter is
probably mistaken, since R's semantics are 'copy on change', so passing
a 'big' object into a function does not usually trigger a copy). You
could also try detach()'ing the package that get.vars.for.cluster is
defined in.

Hope that points in the right direction,

Martin

Peter Waltman <[EMAIL PROTECTED]> writes:

> I'm working with a very large matrix (22k rows x 2k cols) of RNA
> expression data with R v.2.5.0 on a RedHat Enterprise machine, x86_64
> architecture.
>
> The relevant code is below, but I call a function that takes a
> cluster of this data (a list structure that contains a $rows elt
> listing the rows (genes) in the cluster by ID, but not the actual
> data itself).
>
> The function creates two copies of the matrix, one containing the
> rows in the cluster, and one with the rest of the rows in the matrix.
> After doing some statistical massaging, the function returns a
> statistical score for each row/gene in the matrix, producing a vector
> of 22k elts.
>
> When I run 'top', I see that the memory stamp of R after loading the
> matrix is ~750M. However, after calling this function on 10 clusters,
> this jumps to > 3.7 gig (at least by 'top's measurement), and this
> will not be reduced by any subsequent calls to gc().
>
> Output from gc() is:
>
>     > gc()
>                used  (Mb) gc trigger   (Mb) max used  (Mb)
>     Ncells   377925  20.2    6819934  364.3   604878  32.4
>     Vcells 88857341 678.0  240204174 1832.7 90689707 692.0
>
> Output from top is:
>
>      PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
>     1199 waltman  17  0 3844m 3.7g 3216 S  0.0 23.6 29:58.74 R
>
> Note, the relevant call that invoked my function is:
>
>     test <- sapply( c(1:10), function(x) get.vars.for.cluster(
>                         clusterStack[[x]], opt="rows" ) )
>
> Finally, for fun, I rm()'d all variables with the rm( list=ls() )
> command and then called gc(). The memory stamp of this "empty"
> instance of R is still 3.4 gig, i.e.
>
> R console:
>
>     > rm( list=ls() )
>     > ls()
>     character(0)
>     > gc()
>                used  (Mb) gc trigger   (Mb) max used  (Mb)
>     Ncells   363023  19.4    5455947  291.4   604878  32.4
>     Vcells 44434871 339.1  192163339 1466.1 90689707 692.0
>
> Subsequent output from top:
>
>      PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
>     1199 waltman  16  0 3507m 3.4g 3216 S  0.0 21.5 29:58.92 R
>
> Thanks for any help or suggestions,
>
> Peter Waltman
>
> p.s. code snippet follows. Note that I've added extra rm() and gc()
> calls w/in the function to try to reduce the memory stamp, to no
> avail.
>
>     get.vars.for.cluster = function( cluster,
>                                      genes=get.global( "gene.ids" ),
>                                      opt=c("rows","cols"),
>                                      ratios=get.global( "ratios" ),
>                                      var.norm=T,
>                                      r.sig=get.global( "r.sig" ),
>                                      allow.anticor=get.global( "allow.anticor" ) ) {
>         cat( "phw dbg msg\n" )
>         cluster <<- cluster
>         opt <- match.arg( opt )
>         rows <- cluster$rows
>         cols <- cluster$cols
>         if ( opt == "rows" ) {
>             cat( "phw dbg msg: if opt == rows\n" )
>             r <- ratios[ rows, co
Re: [R] Possible memory leak with R v.2.5.0
Hi Martin -

Thanks for the feedback. Right after I sent the email to the list last
night, I realized that I'd forgotten to clear all the vars we attach()
to the environment before rm()'ing everything. Whoops. However, I
found that while doing this *does* reduce the memory stamp by about a
gig (down to 2.6 from 3.4), subsequent calls to gc() still can't
reduce the memory stamp any further.

Also, I probably should have added some add'l explanation of our code.
I'm working on legacy code that doesn't implement packages per se (or
per the definition that R uses, at least). Instead, they are list()s
of functions, i.e.

    try( detach( "gain.funcs" ), silent=T )
    gain.funcs = list(
        func1 = function() { ... },
        func2 = function() { ... }
    )
    attach( gain.funcs )

In this framework, we can source a given .R file and have access to
these functions without cluttering up the global namespace.
Additionally, the 'ratios' var contains the expression matrix, and is
actually an elt of another variable called 'global.data' that is also
attach()'d.
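Since forgetting those attached vars was what skewed my first
measurement, here's roughly the cleanup I now run before re-measuring
(just a sketch; 'gain.funcs' and 'global.data' are the two lists our
framework attach()es):

    ## detach our framework's attached environments, then clear and collect
    for ( nm in c( "gain.funcs", "global.data" ) )
        try( detach( nm, character.only=TRUE ), silent=TRUE )
    rm( list=ls( all=TRUE ) )
    gc()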
Similar to what you suggested, I pulled the get.global( "ratios" )
parameter out of the function (since it's already available globally),
and found that that had no effect on reducing the memory stamp.
Frustrating.

Likewise, after checking, I discovered that the

    cluster <<- cluster

command was just for debugging purposes, so I've since commented it
out in both your version and mine. Additionally, there's no need to
pass the cluster in as an argument, since we keep the clusters in a
stack (implemented as a list) that's available in the global
namespace, so I re-wrote your version (and mine) to take the index of
the cluster rather than the cluster itself.

Running your stripped-down version of the get.vars.for.cluster() fn on
5 clusters caused R's memory stamp to jump as high as 5.7 gig, ending
with a final stamp of 4.9 gig. detach()'ing all the vars we add and
then gc()'ing allowed it to drop to 3.4 gig. rm()'ing the var that
stored the results of these 5 get.vars.for.cluster() calls and then
gc()'ing did not reduce the memory stamp any further. rm()'ing
everything from the global namespace and gc()'ing dropped it further,
to 3.1 gig, but no more.

Adding lines to remove the r and r.all vars and gc() w/in the
get.vars.for.cluster() function reduced the running memory footprint
to a range of 3.5-4.4 gig, and detach()'ing, rm()'ing and gc()'ing
dropped it down to 2.8.

The odd thing is that calling gc() at this point reports that R is
using far less memory than 'top' reports. For example, from one of my
tests:

    > gc()
               used  (Mb) gc trigger   (Mb) max used  (Mb)
    Ncells   380537  20.4    6366476  340.1   380701  20.4
    Vcells 88715615 676.9  277437132 2116.7 88715745 676.9

versus top:

     PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
    1199 waltman  17  0 3844m 3.7g 3216 S  0.0 23.6 29:58.74 R

Given this behavior, I can only assume that there's a memory leak in
R, at least in v.2.5.0. I'll try to get a version of v.2.4.x installed
somewhere to see whether I see similar behavior with it.

Peter

Martin Morgan wrote:
> Hi Peter --
>
> Here's my guess.
>
> Ironically, adding things to broken code reduces the signal-to-noise
> ratio, so I stripped your function down to
>
>     get.vars.for.cluster = function( cluster,
>                                      genes=get.global( "gene.ids" ),
>                                      ratios=get.global( "ratios" ) )
>     {
>         cluster <<- cluster
>         rows <- cluster$rows
>         cols <- cluster$cols
>
>         r <- ratios[ rows, cols ]
>         avg.rows <- apply( r, 2, mean, na.rm=TRUE )
>         r.all <- ratios[ genes, cols ]
>         devs <- apply( r.all, 1, "-", avg.rows )
>
>         apply( devs, 2, var, na.rm=TRUE )
>     }
>
> as something that might reproduce your problem (though I can't be
> sure!). The unusual bit is
>
>     cluster <<- cluster
>
> At first I thought this would be a no-op (assigning cluster to
> itself), but '<<-' never assigns in the function's own frame: it
> starts its search in the function's enclosing environments, so
> 'cluster' gets assigned outside the function.
>
> So my guess is that get.vars.for.cluster is part of a package, and
> the package has a variable called 'cluster'. get.vars.for.cluster
> then assigns its first argument to the package variable 'cluster'
> (the first variable called 'cluster' that '<<-' encounters).
> rm(list=ls(all=TRUE)) removes everything from the global environment,
> but (fortunately!) not from the package environment.
>
> You might end up storing more than 'just' the cluster, depending on
> what it is.
>
> So I think the solution is to rethink the use of '<<-' (and also of
> get.global(), which is either there for convenience -- in which case
> it would probably be better to specify a default for the function
> argument -- or there out of a sense that copying is bad. The latter
> is probably mistaken, since R's semantics are 'copy on cha
Re: [R] Possible memory leak with R v.2.5.0
Hi Martin (and others, hopefully) -

Upon your suggestion, I tried your code and did not see a leak, and
this was reflected by the 'top' command. However, I modified your code
snippet to make it a little closer to what I'm actually doing, and can
now reproducibly cause the leak that I'm seeing (I've put my code
snippet at the bottom of this message). Note, I'm using R v.2.5.1 on a
MacOS box with 4 gigs of memory, and the following output from
sessionInfo():

    > sessionInfo()
    R version 2.5.1 (2007-06-27)
    powerpc-apple-darwin8.9.1

    locale:
    C

    attached base packages:
    [1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"
    [7] "base"

As you'll see, to make it closer to what I'm doing, I changed the
'val' matrix to a 20k x 2k matrix containing random vals. You'll also
see that I added dimnames to the matrix, though they're commented out
in the code I sent you; when using the dimnames, R will start throwing
malloc errors, hence the reason I commented them out.

Using the f.inds() function with sapply, i.e.

    res <- sapply( 1:10, function(i) f.inds() )

I will get a memory leak even if I've detached both 'foo' and 'bar'
and have removed everything ( rm( list=ls() ) ). When I garbage
collect, R reports:

    > gc()
             used (Mb) gc trigger  (Mb)  max used   (Mb)
    Ncells 237802  6.4     531268  14.2    531268   14.2
    Vcells 105865  0.9  104088594 794.2 161014887 1228.5

However, top still reports that R is using 284 Megs of memory. I know
you don't think that 'top' is the best way of gauging memory usage,
but could you try out my code on your machine and let me know what
you're seeing? If you don't recommend 'top', then what other method
should I try?

Thanks,

Peter

    foo <- list(
        getbar = function() get( "val", "bar" ),
        ## variant that subsets by dimnames (triggers malloc errors for me)
        f.names = function() {
            my.cols <- sample( colnames( val ), 750 )
            my.r <- val[ sample( rownames( val ), 15 ), my.cols ]
            avg.rows <- apply( my.r, 2, mean, na.rm=TRUE )
            rm( my.r )
            gc()
            my.r.all <- val[ , my.cols ]
            devs <- apply( my.r.all, 1, "-", avg.rows )
            rm( my.r.all )
            gc()
            apply( devs, 2, var, na.rm=TRUE )
        },
        ## variant that subsets by integer indices
        f.inds = function() {
            my.cols <- sample( ncol( val ), 750 )
            my.r <- val[ sample( nrow( val ), 15 ), my.cols ]
            avg.rows <- apply( my.r, 2, mean, na.rm=TRUE )
            rm( my.r )
            gc()
            my.r.all <- val[ , my.cols ]
            devs <- apply( my.r.all, 1, "-", avg.rows )
            rm( my.r.all )
            gc()
            apply( devs, 2, var, na.rm=TRUE )
        }
    )
    attach( foo )

    bar <- list( val=matrix( rnorm( 20000*2000 ), 20000, 2000 ) )
    ## with dimnames (commented out because they trigger malloc errors):
    # bar <- list( val=matrix( rnorm( 20000*2000 ), 20000, 2000,
    #                  dimnames=list( paste( "AT2G", 1:20000, sep="" ),
    #                                 paste( "AT2Gcol", 1:2000, sep="" ) ) ) )
    attach( bar )

    # res <- sapply( 1:10, function(i) f.inds() )
    # gc()
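In case it helps with reproducing this, here's roughly how I've been
driving the test while comparing R's own accounting against top (a
sketch; it just reads the first "(Mb)" column of the matrix that gc()
returns):

    ## run f.inds() repeatedly, recording the Vcells Mb in use after
    ## each call; compare the trend against RES/VIRT for R's PID in top
    vcells.mb <- sapply( 1:10, function(i) {
        res <- f.inds()
        gc()[ "Vcells", "(Mb)" ]   # first "(Mb)" match = Mb currently used
    } )
    print( vcells.mb )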
Martin Morgan wrote:
> Hi Peter --
>
> I guess I think that the people who need to know are already reading
> the list, but they're not jumping in because you haven't stated the
> problem in a way that makes it clear it really is a bug in R, as
> opposed to a bug in your code.
>
> The two things 'they' need are:
>
> (1) an easy way to reproduce the problem. Here's my best guess at
> where you're at now, pasted into a new R session:
>
>     > gc()
>              used (Mb) gc trigger (Mb) max used (Mb)
>     Ncells 140809  7.6     350000 18.7   350000 18.7
>     Vcells 126374  1.0     786432  6.0   576498  4.4
>     > foo <- list( getbar=function() get( "val", "bar" ),
>     +              f=function() apply( getbar(), 2, var ) )
>     > attach( foo )
>     > bar <- list( val=matrix( 0, 1000, 1000 ) )
>     > attach( bar )
>     > res <- sapply( 1:10, function(i) f() )
>     > gc()
>               used (Mb) gc trigger (Mb) max used (Mb)
>     Ncells  141150  7.6     350000 18.7   350000 18.7
>     Vcells 2136495 16.4    4944736 37.8  4944700 37.8
>     > rm( list=ls( all=TRUE ) )
>     > detach( "foo" )
>     > detach( "bar" )
>     > gc()
>              used (Mb) gc trigger (Mb) max used (Mb)
>     Ncells 141533  7.6     350000 18.7   350000 18.7
>     Vcells 126472  1.0    3955788 30.2  4944700 37.8
>
> Except of course that it tidies up after itself nicely! I'm pretty
> dubious about 'top' as a source of reliable memory-usage info, but
> don't have enough experience to know of an easy better way (just
> harder better ways!).
>
> (2) 'They' will also want to see the output of sessionInfo():
>
>     > sessionInfo()
>     R version 2.6.0 Under development (unstable) (2007-08-14 r42505)
>     x86_64-unknown-linux-gnu
>
>     locale:
>     LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C
>
>     attached base packages:
>     [1] stats graphics grDevices utils datasets methods base
>
> Probably 'the