Re: [R] memory management
* William Dunlap jqha...@gvopb.pbz [2012-02-28 23:06:54 +]:
> You need to walk through the objects, checking for environments on each
> component or attribute of an object.

so why doesn't object.size do that?

> f <- function(n) {
>   d <- data.frame(y = rnorm(n), x = rnorm(n))
>   lm(y ~ poly(x, 4), data=d)
> }

I am not doing any modeling. No ~. No formulas. The whole thing is just a
bunch of data frames. I do a lot of strsplit, unlist, and subsetting, so I
could imagine the RSS being triple the total size of my data if the
intermediate results are not released.

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric)

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] memory management
> I do a lot of strsplit, unlist, subsetting, so I could imagine why the
> RSS is triple the total size of my data if all the intermediate results
> are not released.

I can only give some generalities about that.

Using lots of small chunks of memory (like short strings) may cause
fragmentation (wasted space between blocks of memory). Depending on your
operating system, calling free(pointerToMemoryBlock) may or may not reduce
the virtual memory size of the process, so something like
'/bin/ps -o vsize,size' or Process Explorer may only show the high-water
mark of memory usage.

Another way to gauge the total size of the visible data and the
environments associated with it is to call

    save(list = objects(all = TRUE), compress = FALSE, file = someFile)

and look at the size of the file. Headers probably have a different size
in the file than in the process, but it can give some hints about how much
hidden environments are adding to things.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
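Bill's save()-based measurement can be scripted directly. A hedged sketch (the temp-file handling is mine; uncompressed serialization only roughly tracks in-memory size, but it does include hidden objects and any environments they reference):

```r
# Serialize every visible and hidden object in the workspace, then use the
# file size as a rough total of live data, environments included.
tmp <- tempfile()
save(list = objects(all.names = TRUE), compress = FALSE, file = tmp)
file.info(tmp)$size   # size in bytes, serialization headers included
unlink(tmp)
```

If this number is far below the process RSS, the gap is fragmentation or allocator overhead rather than hidden R objects.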
Re: [R] memory management
On Wednesday 29 February 2012 at 11:42 -0500, Sam Steingold wrote:
> I am not doing any modeling. No ~. No formulas. The whole thing is just
> a bunch of data frames. I do a lot of strsplit, unlist, subsetting, so I
> could imagine why the RSS is triple the total size of my data if all the
> intermediate results are not released.

I think you're simply hitting a (terrible) OS limitation. Linux is very
often not able to reclaim the memory R has used because it's fragmented.
The OS can only get the pages back if nothing is above them, and most of
the time there is data after the object you remove. I'm not able to give
you a more precise explanation, but apparently this is a known problem
that is hard to fix.

At least, I can confirm that after doing a lot of merges on big data
frames, R can keep using 3GB of shared memory on my box even if gc() only
reports 500MB currently used. Restarting R makes memory use go down to the
normal expectations.

Regards
Re: [R] memory management
* Milan Bouchet-Valat anyvzv...@pyho.se [2012-02-29 18:18:50 +0100]:
> I think you're simply hitting a (terrible) OS limitation. Linux is very
> often not able to reclaim the memory R has used because it's fragmented.
> The OS can only get the pages back if nothing is above them, and most of
> the time there is data after the object you remove.

A compacting garbage collector is our best friend!

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric)
Re: [R] memory management
On Wed, 29 Feb 2012, Sam Steingold wrote:
> compacting garbage collector is our best friend!

Which R does not use, because of the problems it would create for the
external C/Fortran code on which R heavily relies.

--
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa
Department of Statistics and Actuarial Science
241 Schaeffer Hall, Iowa City, IA 52242
Phone: 319-335-3386   Fax: 319-335-3017
email: luke-tier...@uiowa.edu   WWW: http://www.stat.uiowa.edu
Re: [R] memory management
* yhxr-gvre...@hvbjn.rqh [2012-02-29 13:55:25 -0600]:
> On Wed, 29 Feb 2012, Sam Steingold wrote:
>> compacting garbage collector is our best friend!
> Which R does not use because of the problems it would create for
> external C/Fortran code on which R heavily relies.

Well, you know better, of course. However, I cannot stop wondering whether
this really is absolutely necessary. If you do not call the GC while the
external C/Fortran code is running, you should be fine with a compacting
garbage collector. And if you access the C/Fortran data (managed by the
C/Fortran code), then it should live in a separate universe from the one
managed by the R GC.

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric)
Re: [R] memory management
My basic worry is that the GC does not work properly, i.e., that
unreachable data is never collected.

* Bert Gunter thagre.ore...@trar.pbz [2012-02-27 14:35:14 -0800]:
> R generally passes by value into function calls (but not *always*), so
> often multiple copies of objects are made during the course of calls. I
> would speculate that this is what might be going on below -- maybe even
> that's what you meant. Just a guess on my part, of course, so treat
> accordingly.
> On Mon, Feb 27, 2012 at 1:03 PM, Sam Steingold s...@gnu.org wrote:
>> It appears that the intermediate data in functions is never GCed even
>> after the return from the function call. R's RSS is 4 Gb (after a
>> gc()) and sum(unlist(lapply(lapply(ls(),get),object.size))) is
>> 1009496520 (less than 1 GB). How do I figure out where the 3GB of
>> uncollected garbage is hiding?

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric)
Re: [R] memory management
On Tue, Feb 28, 2012 at 11:57 AM, Sam Steingold s...@gnu.org wrote:
> My basic worry is that the GC does not work properly, i.e., the
> unreachable data is never collected.

Highly unlikely. Such basic inner R code has been well tested over 20
years. I believe that you merely don't understand the inner guts of what R
is doing here, which is the essence of my response. (Clearly, I make no
claim that I do either). I suggest you move on.

-- Bert

--
Bert Gunter
Genentech Nonclinical Biostatistics
Re: [R] memory management
Look into environments that may be stored with your data.
object.size(obj) does not report on the size of the environment(s)
associated with obj. E.g.,

> f <- function(n) {
+    d <- data.frame(y=rnorm(n), x1=rnorm(n), x2=rnorm(n))
+    terms(data=d, y~.)
+ }
> z <- f(1e6)
> object.size(z)
1760 bytes
> eapply(environment(z), object.size)
$d
24000520 bytes

$n
32 bytes

That happens because formula objects (like function objects) contain a
reference to the environment in which they were created, and that
environment will not be destroyed until the last reference to it is gone.

You might be able to write code using, e.g., the codetools package to walk
through your objects looking for all distinct environments that they
reference (directly and indirectly, via ancestors of environments directly
referenced). Then you can add up the sizes of things in those
environments.

Another possible reason for your problem is that by using ls() instead of
ls(all=TRUE) you are not looking at datasets whose names start with a dot.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
Re: [R] memory management
* William Dunlap jqha...@gvopb.pbz [2012-02-28 20:19:06 +]:
> Look into environments that may be stored with your data.

thanks, but I see nothing like that:

for (n in ls(all.names = TRUE)) {
  o <- get(n)
  print(object.size(o), units="Kb")
  e <- environment(o)
  if (!identical(e, NULL) && !identical(e, .GlobalEnv)) {
    print(e)
    print(eapply(e, object.size))
  }
}

25.8 Kb 0.5 Kb 49.1 Kb 0.1 Kb 30.8 Kb 13.6 Kb 17.4 Kb 59.4 Kb 52.2 Kb
0.1 Kb 3.9 Kb 49.1 Kb 21.2 Kb 0.1 Kb 0.1 Kb 51 Kb 13.2 Kb 53.5 Kb 18.1 Kb
64.3 Kb 25.8 Kb 33.5 Kb 0.1 Kb 0.1 Kb 8 Kb 10 Kb 15.7 Kb 15.6 Kb 9.9 Kb
401672.7 Kb 19.1 Kb 76 Kb 12 Kb 32.4 Kb 156.3 Kb 13.1 Kb 20.5 Kb 21.8 Kb
10.8 Kb

> sum(unlist(lapply(lapply(ls(all.names = TRUE), get), object.size)))
[1] 412351928

i.e., the total size of the data is about 400MB. Why does the process take
in excess of 1GB? top says:

  1235m 1.1g 4452 S    0 14.6   7:12.27 R

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric)
Re: [R] memory management
You need to walk through the objects, checking for environments on each
component or attribute of an object. You also have to look at the
parent.env of each environment found. E.g.,

> f <- function(n) {
+   d <- data.frame(y = rnorm(n), x = rnorm(n))
+   lm(y ~ poly(x, 4), data=d)
+ }
> z <- f(1e5)
> environment(z)
NULL
> object.size(z)
21610708 bytes
> sapply(z, object.size)
 coefficients     residuals       effects
          384       4400104       1200336
         rank fitted.values        assign
           32       4400104            56
           qr   df.residual       xlevels
      7601232            32           104
         call         terms         model
          508          2804       4004276
> environment(z$terms)
<environment: 0x0abb86e4>
> eapply(environment(z$terms), object.size)
$d
1600448 bytes

$n
32 bytes

Coding this is tedious; the codetools package may make it easier. Summing
the sizes may well give an overestimate of the memory actually used, since
several objects may share the same memory.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
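To make the walk Bill describes concrete, here is a rough, hedged sketch. The function name find_envs and its stopping rules are my own, and it is deliberately not exhaustive (it skips namespaces and does not follow pairlists or unevaluated promises), but it finds environments hiding behind list components, attributes, closures, and formulas, plus their ancestors:

```r
# Collect the distinct non-standard environments reachable from an object.
find_envs <- function(x, seen = list()) {
  if (is.null(x)) return(seen)
  if (is.environment(x)) {
    # stop at the well-known shared environments, namespaces, and repeats
    if (identical(x, globalenv()) || identical(x, baseenv()) ||
        identical(x, emptyenv()) || isNamespace(x)) return(seen)
    for (e in seen) if (identical(e, x)) return(seen)
    seen <- c(seen, list(x))
    seen <- find_envs(parent.env(x), seen)   # ancestors count too
    for (n in ls(x, all.names = TRUE)) {
      v <- tryCatch(get(n, envir = x, inherits = FALSE),
                    error = function(e) NULL)  # skip unevaluated promises
      seen <- find_envs(v, seen)
    }
  } else {
    e <- environment(x)                # closures and formulas carry one
    if (!is.null(e)) seen <- find_envs(e, seen)
    if (is.list(x)) for (el in x) seen <- find_envs(el, seen)
    for (a in attributes(x)) seen <- find_envs(a, seen)
  }
  seen
}
```

Applied to Bill's lm example, find_envs(z) locates the function-call environment (holding d and n) behind z$terms, so summing eapply(e, object.size) over the results approximates the hidden cost, with the same caveat about shared memory.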
Re: [R] memory management
It appears that the intermediate data in functions is never GCed even
after the return from the function call. R's RSS is 4 Gb (after a gc())
and

> sum(unlist(lapply(lapply(ls(), get), object.size)))
[1] 1009496520

(less than 1 GB). How do I figure out where the 3GB of uncollected garbage
is hiding?

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric)
Re: [R] memory management
This appears to be the sort of query that (with apologies to other R
gurus) only Brian Ripley or Luke Tierney could figure out. R generally
passes by value into function calls (but not *always*), so often multiple
copies of objects are made during the course of calls. I would speculate
that this is what might be going on below -- maybe even that's what you
meant. Just a guess on my part, of course, so treat accordingly.

-- Bert

On Mon, Feb 27, 2012 at 1:03 PM, Sam Steingold s...@gnu.org wrote:
> It appears that the intermediate data in functions is never GCed even
> after the return from the function call. R's RSS is 4 Gb (after a gc())
> and sum(unlist(lapply(lapply(ls(),get),object.size))) is 1009496520
> (less than 1 GB). How do I figure out where the 3GB of uncollected
> garbage is hiding?

--
Bert Gunter
Genentech Nonclinical Biostatistics
[R] memory management
> zz <- data.frame(a=c(1,2,3), b=c(4,5,6))
> zz
  a b
1 1 4
2 2 5
3 3 6
> a <- zz$a
> a
[1] 1 2 3
> a[2] <- 100
> a
[1]   1 100   3
> zz
  a b
1 1 4
2 2 5
3 3 6

clearly a is a _copy_ of its namesake column in zz. when was the copy
made? when a was modified? at assignment? is there a way to find out how
much memory an object takes? gc() appears not to reclaim all memory after
rm() - can anyone confirm? thanks!

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric)
Re: [R] memory management
This should help:

> invisible(gc()); m0 <- memory.size()
> mem.usage <- function() { invisible(gc()); memory.size() - m0 }
> Mb.size <- function(x) print(object.size(x), units="Mb")
> zz <- data.frame(a=runif(1e6), b=runif(1e6))
> mem.usage()
[1] 15.26
> Mb.size(zz)
15.3 Mb
> a <- zz$a
> mem.usage()
[1] 15.26
> Mb.size(a)
7.6 Mb
> a[2] <- 100
> mem.usage()
[1] 22.89
> Mb.size(a)
7.6 Mb

You can see that a <- zz$a really has no impact on your memory usage. It
is when you start modifying it that R needs to store a whole new object in
memory.

On Thu, Feb 9, 2012 at 5:17 PM, Sam Steingold s...@gnu.org wrote:
> clearly a is a _copy_ of its namesake column in zz. when was the copy
> made? when a was modified? at assignment? is there a way to find out how
> much memory an object takes? gc() appears not to reclaim all memory
> after rm() - can anyone confirm? thanks!
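A direct way to see *when* the copy happens (at assignment or at modification) is tracemem(), which prints a message each time R duplicates the traced object. This needs an R build with memory profiling enabled (true of standard CRAN binaries; the capabilities() guard below is a hedge for other builds):

```r
zz <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
a <- zz$a                  # no copy yet: a and zz$a share the same vector
if (capabilities("profmem")) {
  tracemem(a)              # announce whenever this object is duplicated
}
a[2] <- 100                # the duplication happens here, on first write
zz$a                       # unchanged: the copy protected the original
```

This is R's copy-on-modify semantics: extracting a column creates only a new reference, and the real copy is deferred until one of the sharing bindings is written to.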
Re: [R] memory management
* Florent D. syb...@tznvy.pbz [2012-02-09 19:26:59 -0500]:
> m0 <- memory.size()
> Mb.size <- function(x) print(object.size(x), units="Mb")

indeed, these are very useful, thanks. ls reports these objects larger
than 100k:

behavior : 390.1 Mb
mydf     : 115.3 Mb
nb       :   0.2 Mb
pl       :   1.2 Mb

however, top reports that R uses 1.7Gb of RAM (RSS) - even after gc().
what part of R is using the 1GB of RAM?

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric)
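A generic way to produce such a listing (hedged: this counts only what object.size sees for each binding, not attached environments or memory shared between objects) is to sort the workspace by size:

```r
# Report workspace objects from largest to smallest, in bytes.
sizes <- sapply(ls(all.names = TRUE), function(n) object.size(get(n)))
sort(sizes, decreasing = TRUE)
```

Using all.names = TRUE matters here, as Bill noted earlier in the thread: plain ls() silently skips objects whose names start with a dot.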
[R] Memory management
I am trying to run a very large Bradley-Terry model using the
BradleyTerry2 package. (There are 288 players in the BT model.) My problem
is that I ran the model below successfully, where WLMat is a win-loss
matrix that is 288 by 288:

WLdf <- countsToBinomial(WLMat)
mod1 <- BTm(cbind(win1, win2), player1, player2, ~ player, id = "player",
            data = WLdf)

Then I needed to run the same model with a subset of the observations that
went into the win-loss matrix. So I created my new win-loss matrix and
tried to run a new model. Now I get:

Error: cannot allocate vector of size 90.5 Mb

I found this particularly puzzling because the actual input data is the
same size as in the original model, just with different values. I tried
increasing the memory size, and I tried running it in a clean workspace,
but the error message is always the same (sometimes the vector it is
trying to allocate is 181.0 MB, twice as large), and it is always one of
those two numbers no matter what I have done to the available memory. To
further complicate things, I cannot get the system to re-run my first
model either: same errors. traceback() indicates that the error occurs
when the program is trying to do a QR decomposition.

R 2.13.0, Windows XP. Any suggestions?

W. Michael Conklin
Chief Methodologist
MarketTools, Inc. | www.markettools.com
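The size of the failing allocation is consistent with a single dense double-precision model matrix for this problem. A hedged back-of-the-envelope check (assuming one row per pair from countsToBinomial and one dummy column per player minus a reference level, which is not stated in the post):

```r
# choose(288, 2) pairwise comparisons, one row each after countsToBinomial();
# the 288-level player factor expands to 287 dummy columns.
rows <- choose(288, 2)      # 41328 pairs
cols <- 288 - 1             # 287 columns
rows * cols * 8 / 2^20      # ~90.5 MiB for one matrix of doubles
```

If that reading is right, the failure is not about the input data at all: the QR decomposition needs one or two full copies of this ~90 MB model matrix, which a fragmented 32-bit address space can easily fail to provide in a single contiguous block.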
Re: [R] Memory Management under Linux
It would be very useful if you would post some information about what
exactly you are doing: something about the size of the data object you are
processing ('str' would help us understand it), and then a portion of the
script (both before and after the error message) so we can understand the
transformation that you are doing. It is very easy to generate a similar
message:

> x <- matrix(0, 20000, 20000)
Error: cannot allocate vector of size 3.0 Gb

but unless you know the context, it is almost impossible to give advice.
It also depends on whether you are in some function calls where copies of
objects may have been made, etc.

On Thu, Nov 4, 2010 at 7:52 PM, ricardo souza ricsouz...@yahoo.com.br wrote:
> Dear all,
>
> I am using ubuntu linux 32 with 4 Gb. I am running a very small script
> and I always get the same error message: CANNOT ALLOCATE A VECTOR OF
> SIZE 231.8 Mb. I have read carefully the instructions in ?Memory. Using
> the function gc() I got very low numbers of memory (please see below). I
> know that this has been posted several times at r-help
> (http://tolstoy.newcastle.edu.au/R/help/05/06/7565.html#7627qlink2).
> However, I have not yet found the solution to my memory issue in Linux.
> Could somebody please give some instructions on how to improve my memory
> under Linux?
>
> > gc()
>          used (Mb) gc trigger (Mb) max used (Mb)
> Ncells 170934  4.6     350000  9.4   350000  9.4
> Vcells 195920  1.5     786432  6.0   781384  6.0
>
> INCREASING THE R MEMORY FOLLOWING THE INSTRUCTIONS IN ?Memory, I started
> R with:
>
> R --min-vsize=10M --max-vsize=4G --min-nsize=500k --max-nsize=900M
>
> > gc()
>          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
> Ncells 130433  3.5     500000 13.4      25200   500000 13.4
> Vcells  81138  0.7    1310720 10.0         NA   499143  3.9
>
> It increased, but not by much! Please let me know. I have read all of
> r-help about this matter, but found no solution. Thanks for your
> attention!
>
> Ricardo

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
Re: [R] Memory Management under Linux
Dear Jim,

Thanks for your attention. I am running a geostatistical analysis with
geoR that is computationally intense. At the end of my analysis I call the
functions krige.control and krige.conv. Do you have any idea how to
improve the memory allocation in Linux?

Thanks,
Ricardo

From: jim holtman jholt...@gmail.com
Subject: Re: [R] Memory Management under Linux
To: ricardo souza ricsouz...@yahoo.com.br
Cc: r-help@r-project.org
Date: Friday, 5 November 2010, 10:21
> It would be very useful if you would post some information about what
> exactly you are doing: something about the size of the data object you
> are processing ('str' would help us understand it), and then a portion
> of the script (both before and after the error message) so we can
> understand the transformation that you are doing. [...] but unless you
> know the context, it is almost impossible to give advice.
Re: [R] Memory Management under Linux
I would do some monitoring (debugging) of the script by placing some 'gc()' calls in the sequence of statements leading to the problem, to see what the memory usage is at that point. Take a close look at the sizes of your objects. If it is happening in some function you have called, you may have to take a look and understand whether multiple copies are being made. Most problems of this type may require that you put hooks in your code (most of the stuff that I write has them, so I can isolate performance problems) to gain an understanding of what is happening when. To improve memory allocation, you first have to understand what is causing the problem, and not enough information has been provided for me to comment on it. There are lots of rules of thumb that can be used, but many depend on exactly what you are trying to do.

On Fri, Nov 5, 2010 at 2:59 PM, ricardo souza ricsouz...@yahoo.com.br wrote:

Dear Jim,

Thanks for your attention. I am running a geostatistical analysis with geoR that is computationally intense. At the end of my analysis I call the functions krige.control and krige.conv. Do you have any idea how to improve the memory allocation in Linux?

Thanks,
Ricardo

From: jim holtman jholt...@gmail.com
Subject: Re: [R] Memory Management under Linux
To: ricardo souza ricsouz...@yahoo.com.br
Cc: r-help@r-project.org
Date: Friday, 5 November 2010, 10:21

It would be very useful if you would post some information about what exactly you are doing. Tell us something about the size of the data object you are processing ('str' would help us understand it), and then show a portion of the script (both before and after the error message) so we can understand the transformation that you are doing. It is very easy to generate a similar message:

x <- matrix(0,2, 2)
Error: cannot allocate vector of size 3.0 Gb

but unless you know the context, it is almost impossible to give advice.
It also depends on whether you are in some function calls where copies of objects may have been made, etc.

On Thu, Nov 4, 2010 at 7:52 PM, ricardo souza ricsouz...@yahoo.com.br wrote:

Dear all,

I am using ubuntu linux 32 with 4 Gb. I am running a very small script and I always get the same error message: CANNOT ALLOCATE A VECTOR OF SIZE 231.8 Mb. I have been reading carefully the instructions in ?Memory. Using the function gc() I got very low numbers for memory (please see below). I know that this has been posted several times at r-help (http://tolstoy.newcastle.edu.au/R/help/05/06/7565.html#7627qlink2). However I have not yet found the solution to my memory issue in Linux. Could somebody please give some instructions on how to improve my memory under linux?

> gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 170934  4.6         35  9.4       35  9.4
Vcells 195920  1.5     786432  6.0   781384  6.0

INCREASING THE R MEMORY FOLLOWING THE INSTRUCTIONS IN ?Memory

I started R with:

R --min-vsize=10M --max-vsize=4G --min-nsize=500k --max-nsize=900M

> gc()
         used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells 130433  3.5         50 13.4      25200       50 13.4
Vcells  81138  0.7    1310720 10.0         NA   499143  3.9

It increased, but not by much! Please, please let me know. I have read all of r-help on this matter, but found no solution. Thanks for your attention!

Ricardo

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
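Jim's advice in this thread (place gc() calls in the sequence of statements leading to the problem) can be sketched as a small helper. mem_checkpoint() is a made-up name for illustration, not a function from any package:

```r
# Hedged sketch of the gc()-based monitoring described above: log how
# much memory R reports in use at named checkpoints in a script.
mem_checkpoint <- function(label) {
  g <- gc()          # force a collection; gc() returns a usage matrix
  mb <- sum(g[, 2])  # column 2 of the matrix is megabytes currently used
  message(sprintf("[%s] %.1f Mb in use", label, mb))
  invisible(mb)
}

x  <- rnorm(1e6)                          # ~8 Mb of doubles
m1 <- mem_checkpoint("after allocating x")
rm(x)
m2 <- mem_checkpoint("after rm(x)")       # should report roughly 8 Mb less
```

Dropping such calls before and after each large transformation narrows down which step owns the growth.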
[R] Memory Management under Linux
Dear all,

I am using ubuntu linux 32 with 4 Gb. I am running a very small script and I always get the same error message: CANNOT ALLOCATE A VECTOR OF SIZE 231.8 Mb. I have been reading carefully the instructions in ?Memory. Using the function gc() I got very low numbers for memory (please see below). I know that this has been posted several times at r-help (http://tolstoy.newcastle.edu.au/R/help/05/06/7565.html#7627qlink2). However I have not yet found the solution to my memory issue in Linux. Could somebody please give some instructions on how to improve my memory under linux?

> gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 170934  4.6         35  9.4       35  9.4
Vcells 195920  1.5     786432  6.0   781384  6.0

INCREASING THE R MEMORY FOLLOWING THE INSTRUCTIONS IN ?Memory

I started R with:

R --min-vsize=10M --max-vsize=4G --min-nsize=500k --max-nsize=900M

> gc()
         used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells 130433  3.5         50 13.4      25200       50 13.4
Vcells  81138  0.7    1310720 10.0         NA   499143  3.9

It increased, but not by much! Please, please let me know. I have read all of r-help on this matter, but found no solution. Thanks for your attention!

Ricardo
Re: [R] Memory management in R
I already offered the Biostrings package. It provides more robust methods for string matching than does grepl. Is there a reason that you choose not to?

Indeed that is the way I should go, and I have installed the package after some struggling. Since Biostrings is a fairly complex package and I need only a way to check whether a certain string A is a substring of string B, do you know the Biostrings functions to achieve this? I see a lot of methods for biological (DNA, RNA) sequences, and they may not apply to my series (which are definitely not from biology).

Cheers
Lorenzo
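For the plain "is string A contained in string B" test asked about above, base R is already enough if the pattern is treated as a literal string rather than a regular expression; passing fixed = TRUE also sidesteps the regcomp out-of-memory failures reported elsewhere in this thread. A minimal sketch with made-up strings:

```r
a <- "ab#cd"        # candidate substring
b <- "xx#ab#cd#yy"  # longer string

# fixed = TRUE makes grepl() treat the pattern as a literal substring:
# no regular expression is compiled, and characters that would otherwise
# be regex metacharacters are matched verbatim.
grepl(a, b, fixed = TRUE)     # TRUE
grepl("zz", b, fixed = TRUE)  # FALSE
```

regexpr(a, b, fixed = TRUE) works the same way when the match position, not just a yes/no answer, is needed.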
Re: [R] Memory management in R
Date: Sun, 10 Oct 2010 15:27:11 +0200
From: lorenzo.ise...@gmail.com
To: dwinsem...@comcast.net
CC: r-help@r-project.org
Subject: Re: [R] Memory management in R

I already offered the Biostrings package. It provides more robust methods for string matching than does grepl. Is there a reason that you choose not to?

Indeed that is the way I should go, and I have installed the package after some struggling. Since Biostrings is a fairly complex package and I need only a way to check whether a certain string A is a substring of string B, do you know the Biostrings functions to achieve this? I see a lot of methods for biological (DNA, RNA) sequences, and they may not apply to my series (which are definitely not from biology).

Generally the differences relate to the alphabet and the things you may want to know about it. Unless you are looking for reverse-complement text strings, there will be a lot of stuff you don't need. Offhand, I'd be looking for things like computational-linguistics packages, as you are looking to find patterns or predictability in human-readable character sequences. Now, humans can probably write hairpin text (look at what RNA can do, LOL), but this is probably not what you care about. However, as I mentioned earlier, I had to write my own regex compiler (coincidentally for bio apps) to get the required performance. Your application and understanding may benefit from things like building dictionaries, which aren't really part of regex and can easily be done in a few lines of C++ code using STL containers. To get statistically meaningful samples, you will almost certainly need faster code.

Cheers
Lorenzo
Re: [R] Memory management in R
Hi David,

I am replying to you and to the other people who provided some insight into my problems with grepl. Well, at least we now know that the bug is reproducible. Indeed it is a strange sequence that I am postprocessing, probably pathological to some extent; nevertheless the problem (grepl crashing when a long, but not huge, chunk of repeated data is loaded) has to be acknowledged. Now, my problem is the following: given a potentially long string (or, before that, a sequence where every element has been generated via the hash function, algo='crc32', of the digest package), how can I, starting from an arbitrary position i along the list, calculate the shortest substring in the future of i (i.e. the interval i:end of the series) that has not occurred in the past of i (i.e. [1:i-1])? Efficiency is not the main point here; I need to run this code only once to get what I need, but it cannot crash on a 2000-entry string.

Cheers
Lorenzo

On 10/09/2010 01:30 AM, David Winsemius wrote:

What puzzles me is that the list is not really long (less than 2000 entries) and I have not experienced the same problem even with longer lists.

But maybe your loop terminated in them earlier. Someplace between 11*225 and 11*240 the grepping machine gives up:

eprs <- paste(rep("aa", 225), collapse="#")
grepl(eprs, eprs)
[1] TRUE
eprs <- paste(rep("aa", 240), collapse="#")
grepl(eprs, eprs)
Error in grepl(eprs, eprs) : invalid regular expression 'aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#a
In addition: Warning message:
In grepl(eprs, eprs) : regcomp error: 'Out of memory'

The complexity of the problem may depend on the distribution of values.
You have a very skewed distribution, with the vast majority being in the same value as appeared in your error message:

> table(x)
x
 12653a6 202fbcc4 48bef8c3 4e084ddc 51f342a4 5d64d58a 78087f5e abddf3d1
    1419      299        1        1        1        3        1        1
ac76183b b955be36 c600173a e96f6bbd e9c56275
       1       30        5        1        9

And you have 1159 of them in one clump (which would seem to be somewhat improbable under a random null hypothesis):

> max(rle(x)$lengths)
[1] 1159
> which(rle(x)$lengths == 1159)
[1] 123
> rle(x)$values[123]
[1] 12653a6

HTH (although I think it means you need to construct a different implementation strategy);

David.

Many thanks
Lorenzo
Re: [R] Memory management in R
On Oct 9, 2010, at 9:45 AM, Lorenzo Isella wrote:

Hi David,

I am replying to you and to the other people who provided some insight into my problems with grepl. Well, at least we now know that the bug is reproducible. Indeed it is a strange sequence that I am postprocessing, probably pathological to some extent; nevertheless the problem (grepl crashing when a long, but not huge, chunk of repeated data is loaded) has to be acknowledged. Now, my problem is the following: given a potentially long string (or, before that, a sequence where every element has been generated via the hash function, algo='crc32', of the digest package), how can I, starting from an arbitrary position i along the list, calculate the shortest substring in the future of i (i.e. the interval i:end of the series) that has not occurred in the past of i (i.e. [1:i-1])?

Maybe you should work on a less convoluted explanation of the test? Or perhaps a couple of compact examples, preferably in R copy-paste format?

Efficiency is not the main point here; I need to run this code only once to get what I need, but it cannot crash on a 2000-entry string.

My suggestion is to explore other alternatives. (I will admit that I don't yet fully understand the test that you are applying.) The two that have occurred to me are Biostrings, which I have already mentioned, and rle(), which I have illustrated the use of but not referenced as an avenue. The Biostrings package is part of Bioconductor (part of the R universe), although you should be prepared for a coffee break when you install it if you haven't already installed at least biocLite. When I installed it last night, it downloaded and installed 54 other package dependencies. It seems to me that taking advantage of the coding resources in the molecular-biology domain that are currently directed at decoding the information storage mechanism of life might be a smart strategy.
You have not described the domain you are working in but I would guess that the digest package might be biological in primary application? So forgive me if I am preaching to the choir. The rle option also occurred to me but it might take a smarter coder than I to fully implement it. (But maybe Holtman would be up to it. He's a _lot_ smarter than I.) In your example the long x string is faithfully represented by two aligned vectors, each 197 characters in length. The long repeat sequence that broke the grepl mechanism are just one pair of values. rle(x) Run Length Encoding lengths: int [1:197] 1 1 2 1 1 4 1 9 1 1 ... values : chr [1:197] 5d64d58a ac76183b 202fbcc4 78087f5e ... So maybe as soon as you got to a bundle that was greater than 1/2 the overall length (as happened in the x case) you could stop, since it could not have occurred before. -- David. Cheers Lorenzo On 10/09/2010 01:30 AM, David Winsemius wrote: What puzzles me is that the list is not really long (less than 2000 entries) and I have not experienced the same problem even with longer lists. But maybe your loop terminated in them eaarlier/ Someplace between 11*225 and 11*240 the grepping machine gives up: eprs - paste(rep(aa, 225), collapse=#) grepl(eprs, eprs) [1] TRUE eprs - paste(rep(aa, 240), collapse=#) grepl(eprs, eprs) Error in grepl(eprs, eprs) : invalid regular expression 'aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#a In addition: Warning message: In grepl(eprs, eprs) : regcomp error: 'Out of memory' The complexity of the problem may depend on the distribution of values. 
You have a very skewed distribution, with the vast majority being in the same value as appeared in your error message:

> table(x)
x
 12653a6 202fbcc4 48bef8c3 4e084ddc 51f342a4 5d64d58a 78087f5e abddf3d1
    1419      299        1        1        1        3        1        1
ac76183b b955be36 c600173a e96f6bbd e9c56275
       1       30        5        1        9

And you have 1159 of them in one clump (which would seem to be somewhat improbable under a random null hypothesis):

> max(rle(x)$lengths)
[1] 1159
> which(rle(x)$lengths ==
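David's early-exit idea from this thread (stop once a single run of identical values covers more than half the series, since a block starting inside it cannot have a full earlier occurrence) can be sketched with rle() on a toy vector:

```r
# Hedged sketch of the run-length early exit suggested above, using
# made-up data rather than the hash series from the thread.
x <- c(rep("a", 5), "b", rep("c", 7))
r <- rle(x)

# Is any single run longer than half the series? If so, the scan over
# that stretch can stop early instead of feeding grepl ever-longer patterns.
dominant <- any(r$lengths > length(x) / 2)
dominant   # TRUE: the run of "c" (length 7) exceeds half of 13
```

In the poster's data the 1159-long clump of "12653a6" values is exactly such a run, which is why the patterns handed to grepl grew so large.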
Re: [R] Memory management in R
My suggestion is to explore other alternatives. (I will admit that I don't yet fully understand the test that you are applying.)

Hi,

I am trying to partially implement the Lempel-Ziv compression algorithm. The point is that the compressibility and the entropy of a time series are related, hence my final goal is to evaluate the entropy of a time series. You can find more at

http://bit.ly/93zX4T
http://en.wikipedia.org/wiki/LZ77_and_LZ78
http://bit.ly/9NgIFt

The two that have occurred to me are Biostrings, which I have already mentioned, and rle(), which I have illustrated the use of but not referenced as an avenue. The Biostrings package is part of Bioconductor (part of the R universe), although you should be prepared for a coffee break when you install it if you haven't already installed at least biocLite. When I installed it last night, it downloaded and installed 54 other package dependencies. It seems to me that taking advantage of the coding resources in the molecular-biology domain that are currently directed at decoding the information storage mechanism of life might be a smart strategy. You have not described the domain you are working in, but I would guess that the digest package might be biological in primary application? So forgive me if I am preaching to the choir. The rle option also occurred to me, but it might take a smarter coder than I to fully implement it. (But maybe Holtman would be up to it. He's a _lot_ smarter than I.) In your example the long x string is faithfully represented by two aligned vectors, each 197 elements in length. The long repeat sequence that broke the grepl mechanism is just one pair of values.

> rle(x)
Run Length Encoding
  lengths: int [1:197] 1 1 2 1 1 4 1 9 1 1 ...
  values : chr [1:197] "5d64d58a" "ac76183b" "202fbcc4" "78087f5e" ...

So maybe as soon as you got to a bundle that was greater than 1/2 the overall length (as happened in the x case) you could stop, since it could not have occurred before.
I doubt that rle() can be deployed to replace the Lempel-Ziv (LZ) algorithm in a trivial way. As a less convoluted example, consider the series

x <- c("d", "a", "b", "d", "a", "b", "e", "z")

If i=4, and therefore the i-th element is the second 'd' in the series, the shortest series starting from i=4 that I do not see in the past of 'd' is d,a,b,e, whose length is equal to 4, and that is the value returned by the function below. The frustrating thing is that I already have the tools I need; they just crash, for reasons beyond my control, on relatively short series. If anyone can make the function below more robust, that is really a big help for me.

Cheers
Lorenzo

###
entropy_lz <- function(x, i) {
  past <- x[1:(i - 1)]
  n <- length(x)
  lp <- length(past)
  future <- x[i:n]
  go_on <- 1
  count_len <- 0
  past_string <- paste(past, collapse = "#")
  while (go_on > 0) {
    new_seq <- x[i:(i + count_len)]
    fut_string <- paste(new_seq, collapse = "#")
    count_len <- count_len + 1
    if (grepl(fut_string, past_string) != 1) {
      go_on <- -1
    }
  }
  return(count_len)
}

x <- c("c", "a", "b", "c", "a", "b", "e", "z")
S <- entropy_lz(x, 4)
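A sketch of the same function with the grepl() call switched to literal matching (fixed = TRUE), which avoids compiling the pattern as a regular expression and is the likely cure for the regcomp 'Out of memory' error reported in this thread. The bounds check on i + count_len is an added guard, not in the original:

```r
# Hedged rewrite of the poster's entropy_lz(): length of the shortest
# block starting at position i that has not occurred in x[1:(i-1)].
entropy_lz <- function(x, i) {
  n <- length(x)
  past_string <- paste(x[seq_len(i - 1)], collapse = "#")
  count_len <- 0
  repeat {
    count_len <- count_len + 1
    if (i + count_len - 1 > n) break   # added guard: ran off the series
    fut_string <- paste(x[i:(i + count_len - 1)], collapse = "#")
    # fixed = TRUE: literal substring test, no regex is compiled, so
    # long repetitive hash strings cannot blow up regcomp.
    if (!grepl(fut_string, past_string, fixed = TRUE)) break
  }
  count_len
}

x <- c("d", "a", "b", "d", "a", "b", "e", "z")
entropy_lz(x, 4)   # 4: the shortest unseen future block is d,a,b,e
```

The "#" separator is kept from the original so that element boundaries cannot create spurious matches across values.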
Re: [R] Memory management in R
On Oct 9, 2010, at 4:23 PM, Lorenzo Isella wrote:

My suggestion is to explore other alternatives. (I will admit that I don't yet fully understand the test that you are applying.)

Hi,

I am trying to partially implement the Lempel-Ziv compression algorithm. The point is that the compressibility and the entropy of a time series are related, hence my final goal is to evaluate the entropy of a time series. You can find more at

http://bit.ly/93zX4T
http://en.wikipedia.org/wiki/LZ77_and_LZ78
http://bit.ly/9NgIFt

The two that have occurred to me are Biostrings, which I have already mentioned, and rle(), which I have illustrated the use of but not referenced as an avenue. The Biostrings package is part of Bioconductor (part of the R universe), although you should be prepared for a coffee break when you install it if you haven't already installed at least biocLite. When I installed it last night, it downloaded and installed 54 other package dependencies. It seems to me that taking advantage of the coding resources in the molecular-biology domain that are currently directed at decoding the information storage mechanism of life might be a smart strategy. You have not described the domain you are working in, but I would guess that the digest package might be biological in primary application? So forgive me if I am preaching to the choir. The rle option also occurred to me, but it might take a smarter coder than I to fully implement it. (But maybe Holtman would be up to it. He's a _lot_ smarter than I.) In your example the long x string is faithfully represented by two aligned vectors, each 197 elements in length. The long repeat sequence that broke the grepl mechanism is just one pair of values.

> rle(x)
Run Length Encoding
  lengths: int [1:197] 1 1 2 1 1 4 1 9 1 1 ...
  values : chr [1:197] "5d64d58a" "ac76183b" "202fbcc4" "78087f5e" ...
So maybe as soon as you got to a bundle that was greater than 1/2 the overall length (as happened in the x case) you could stop, since it could not have occurred before.

I doubt that rle() can be deployed to replace the Lempel-Ziv (LZ) algorithm in a trivial way. As a less convoluted example, consider the series

x <- c("d", "a", "b", "d", "a", "b", "e", "z")

If i=4, and therefore the i-th element is the second 'd' in the series, the shortest series starting from i=4 that I do not see in the past of 'd' is d,a,b,e, whose length is equal to 4, and that is the value returned by the function below. The frustrating thing is that I already have the tools I need; they just crash, for reasons beyond my control, on relatively short series. If anyone can make the function below more robust, that is really a big help for me.

I already offered the Biostrings package. It provides more robust methods for string matching than does grepl. Is there a reason that you choose not to?

-- David.

Cheers
Lorenzo

###
entropy_lz <- function(x, i) {
  past <- x[1:(i - 1)]
  n <- length(x)
  lp <- length(past)
  future <- x[i:n]
  go_on <- 1
  count_len <- 0
  past_string <- paste(past, collapse = "#")
  while (go_on > 0) {
    new_seq <- x[i:(i + count_len)]
    fut_string <- paste(new_seq, collapse = "#")
    count_len <- count_len + 1
    if (grepl(fut_string, past_string) != 1) {
      go_on <- -1
    }
  }
  return(count_len)
}

x <- c("c", "a", "b", "c", "a", "b", "e", "z")
S <- entropy_lz(x, 4)

David Winsemius, MD
West Hartford, CT
[R] Memory management in R
Dear All, I am experiencing some problems with a script of mine. It crashes with this message Error in grepl(fut_string, past_string) : invalid regular expression '12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12 Calls: entropy_estimate_hash - total_entropy_lz - entropy_lz - grepl In addition: Warning message: In grepl(fut_string, past_string) : regcomp error: 'Out of memory' Execution halted To make a long story short, I use some functions which eventually call grepl on very long strings to check whether a certain substring is part of a longer string. Now, the script technically works (it never crashes when I run it on a smaller dataset) and the problem does not seem to be RAM memory (I have several GB of RAM on my machine and its consumption never shoots up so my machine never resorts to swap memory). So (though I am not an expert) it looks like the problem is some limitation of grepl or R memory management. 
Any idea about how I could tackle this problem, or how I can profile my code to fix it (though it really seems to me that I have to find a way to allow R to process longer strings)? Any suggestion is appreciated.

Cheers
Lorenzo
Re: [R] Memory management in R
On 10/08/2010 07:25 PM, Doran, Harold wrote:
These questions are OS-specific. Please provide sessionInfo() or other details as needed.

I see. I am running R on a 64-bit machine running Ubuntu 10.04.

> sessionInfo()
R version 2.11.1 (2010-05-31)
x86_64-pc-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

and in case it matters, this is the output of my top command:

$ top
top - 19:28:21 up 8:04, 8 users, load average: 0.60, 0.72, 1.33
Tasks: 220 total, 1 running, 219 sleeping, 0 stopped, 0 zombie
Cpu(s): 10.3%us, 0.6%sy, 0.0%ni, 87.2%id, 1.9%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 6110484k total, 3847008k used, 2263476k free, 72748k buffers
Swap: 2929656k total, 0k used, 2929656k free, 2621420k cached

Cheers
Lorenzo

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
Sent: Friday, October 08, 2010 1:12 PM
To: r-help
Subject: [R] Memory management in R

Dear All,

I am experiencing some problems with a script of mine.
It crashes with this message Error in grepl(fut_string, past_string) : invalid regular expression '12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12 Calls: entropy_estimate_hash - total_entropy_lz - entropy_lz - grepl In addition: Warning message: In grepl(fut_string, past_string) : regcomp error: 'Out of memory' Execution halted To make a long story short, I use some functions which eventually call grepl on very long strings to check whether a certain substring is part of a longer string. Now, the script technically works (it never crashes when I run it on a smaller dataset) and the problem does not seem to be RAM memory (I have several GB of RAM on my machine and its consumption never shoots up so my machine never resorts to swap memory). So (though I am not an expert) it looks like the problem is some limitation of grepl or R memory management. Any idea about how I could tackle this problem or how I can profile my code to fix it (though it really seems to me that I have to find a way to allow R to process longer strings). Any suggestion is appreciated. 
Cheers
Lorenzo
Re: [R] Memory management in R
These questions are OS-specific. Please provide sessionInfo() or other details as needed -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Lorenzo Isella Sent: Friday, October 08, 2010 1:12 PM To: r-help Subject: [R] Memory management in R Dear All, I am experiencing some problems with a script of mine. It crashes with this message Error in grepl(fut_string, past_string) : invalid regular expression '12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12 Calls: entropy_estimate_hash - total_entropy_lz - entropy_lz - grepl In addition: Warning message: In grepl(fut_string, past_string) : regcomp error: 'Out of memory' Execution halted To make a long story short, I use some functions which eventually call grepl on very long strings to check whether a certain substring is part of a longer string. 
Now, the script technically works (it never crashes when I run it on a smaller dataset), and the problem does not seem to be RAM (I have several GB of RAM on my machine and its consumption never shoots up, so my machine never resorts to swap). So (though I am not an expert) it looks like the problem is some limitation of grepl or of R's memory management. Any idea about how I could tackle this problem, or how I can profile my code to fix it (though it really seems to me that I have to find a way to allow R to process longer strings)? Any suggestion is appreciated.

Cheers
Lorenzo
Re: [R] Memory management in R
More specificity: how long is the string, and what is the pattern you are matching against? It sounds like you might have a complex pattern that, in trying to match the string, is doing a lot of backtracking and such. The O'Reilly book "Mastering Regular Expressions" might help you understand what might be happening. So if you can provide a better example than just the error message, it would be helpful.

On Fri, Oct 8, 2010 at 1:11 PM, Lorenzo Isella lorenzo.ise...@gmail.com wrote:

Dear All,
I am experiencing some problems with a script of mine. It crashes with this message:

Error in grepl(fut_string, past_string) : invalid regular expression
'12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6# ... [long repeated pattern truncated] ... 12653a6#12
Calls: entropy_estimate_hash -> total_entropy_lz -> entropy_lz -> grepl
In addition: Warning message:
In grepl(fut_string, past_string) : regcomp error: 'Out of memory'
Execution halted

To make a long story short, I use some functions which eventually call grepl on very long strings to check whether a certain substring is part of a longer string. Now, the script technically works (it never crashes when I run it on a smaller dataset), and the problem does not seem to be RAM: I have several GB on my machine, consumption never shoots up, and the machine never resorts to swap. So (though I am not an expert) it looks like the problem is some limitation of grepl or of R's memory management. Any idea about how I could tackle this problem, or how I can profile my code to fix it (though it really seems that I have to find a way to allow R to process longer strings)? Any suggestion is appreciated.
Cheers

Lorenzo

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?
Re: [R] Memory management in R
Date: Fri, 8 Oct 2010 13:30:59 -0400
From: jholt...@gmail.com
To: lorenzo.ise...@gmail.com
CC: r-help@r-project.org
Subject: Re: [R] Memory management in R

[Jim Holtman's reply, quoted in full above, is repeated here.]

This is possibly a stack issue. Error messages are often not literal; I have seen "out of memory" for graphic device objects :) The regex suggests a stack issue, but that would be a guess at the mechanism of death. What you probably really want is a simpler regex :)

[Lorenzo Isella's original post, quoted in full above, is repeated here.]
Re: [R] Memory management in R
Thanks for lending a helping hand. I put together a self-contained example. Basically, it all relies on a couple of functions, where one function simply iterates the application of the other. I am trying to implement the so-called Lempel-Ziv entropy estimator. The idea is to choose a position i along a string x (standing for a time series) and find the length of the shortest string starting from i which has never occurred before i.

Please find below the R snippet, which requires an input file (a simple text file) you can download from http://dl.dropbox.com/u/5685598/time_series25_.dat

What puzzles me is that the list is not really long (fewer than 2000 entries) and I have not experienced the same problem even with longer lists.
Many thanks

Lorenzo

##

total_entropy_lz <- function(x) {
  if (length(x) == 1) {
    print("sequence too short")
    return("error")
  } else {
    n <- length(x)
    prefactor <- 1/(n*log(n)/log(2))
    n_seq <- seq(n)
    entropy_list <- n_seq
    for (i in n_seq) {
      entropy_list[i] <- entropy_lz(x, i)
    }
  }
  total_entropy <- 1/(prefactor*sum(entropy_list))
  return(total_entropy)
}

entropy_lz <- function(x, i) {
  past <- x[1:i-1]   # note: 1:i-1 is (1:i)-1, i.e. 0:(i-1); the 0 index is silently dropped
  n <- length(x)
  lp <- length(past)
  future <- x[i:n]
  go_on <- 1
  count_len <- 0
  past_string <- paste(past, collapse="#")
  while (go_on > 0) {
    new_seq <- x[i:(i+count_len)]
    fut_string <- paste(new_seq, collapse="#")
    count_len <- count_len+1
    if (grepl(fut_string, past_string) != 1) {
      go_on <- -1
    }
  }
  return(count_len)
}

x <- scan("time_series25_.dat", what="")
S <- total_entropy_lz(x)

On 10/08/2010 07:30 PM, jim holtman wrote:
[Jim Holtman's reply and Lorenzo Isella's original post, quoted in full above, are repeated here.]
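Since entropy_lz only ever tests whether one literal string occurs inside another, one workaround worth noting here (not suggested in the thread itself, but standard R) is grepl's fixed = TRUE argument: it performs a plain substring search and skips regex compilation entirely, so the regex engine's pattern-length limit never comes into play. A minimal sketch:

```r
# fixed = TRUE makes grepl() do a literal substring search, bypassing
# regcomp entirely; patterns far longer than the length at which the
# regex compiler gave up are then unproblematic.
fut_string  <- paste(rep("12653a6", 3000), collapse = "#")  # long literal pattern
past_string <- paste(rep("12653a6", 6000), collapse = "#")  # even longer target
grepl(fut_string, past_string, fixed = TRUE)   # TRUE
```

In the entropy_lz loop above, this would mean replacing grepl(fut_string, past_string) with grepl(fut_string, past_string, fixed = TRUE).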
Re: [R] Memory management in R
[Lorenzo Isella's original post, quoted in full above, is repeated here.]

David Winsemius, MD
West Hartford, CT
Re: [R] Memory management in R
From: dwinsem...@comcast.net
To: lorenzo.ise...@gmail.com
Date: Fri, 8 Oct 2010 19:30:45 -0400
CC: r-help@r-project.org
Subject: Re: [R] Memory management in R

On Oct 8, 2010, at 6:42 PM, Lorenzo Isella wrote:

Please find below the R snippet, which requires an input file (a simple text file) you can download from http://dl.dropbox.com/u/5685598/time_series25_.dat
What puzzles me is that the list is not really long (fewer than 2000 entries) and I have not experienced the same problem even with longer lists.

But maybe your loop terminated earlier in those cases? Someplace between 11*225 and 11*240 the grepping machine gives up:

eprs <- paste(rep("aa", 225), collapse="#")
grepl(eprs, eprs)
[1] TRUE
eprs <- paste(rep("aa", 240), collapse="#")
grepl(eprs, eprs)
Error in grepl(eprs, eprs) : invalid regular expression
'aa#aa#aa#aa#aa#aa#aa#aa# ... [long repeated pattern truncated] ... #aa#a
In addition: Warning message:
In grepl(eprs, eprs) : regcomp error: 'Out of memory'

The complexity of the problem may depend on the distribution of values. You have a very skewed distribution, with the vast majority being the same value that appeared in your error message. HTH (although I think it means you need to construct a different implementation strategy).

You really need to look at the question posed by your regex and consider the complexity of what you are asking and what likely implementations would do with it. Something like this probably needs to be implemented in dedicated code to handle the more general case, or you need to determine whether the input data is pathological given your regex. Being able to write something concisely doesn't mean the execution of that something is simple. Even if it does manage to return a result, it will likely get very slow.

In the past I have had to write my own simple regex compilers to handle a limited class of expressions to make the speed reasonable. In this case, depending on your objectives, dedicated code may even be helpful to you in understanding the algorithm.

David.
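The "dedicated code" suggestion can be taken quite literally here: since x is a vector of tokens, the substring question can be answered on the vector itself with a sliding window, with no paste() and no regex at all. A sketch under that assumption (subvec_found is a hypothetical helper, not from the thread):

```r
# Literal search for a sub-vector inside a vector via a sliding window;
# O(n*m) worst case, but no strings are built and no pattern is compiled.
subvec_found <- function(needle, haystack) {
  n <- length(needle)
  m <- length(haystack)
  if (n == 0 || n > m) return(FALSE)
  for (start in 1:(m - n + 1)) {
    if (all(haystack[start:(start + n - 1)] == needle)) return(TRUE)
  }
  FALSE
}
subvec_found(c("a", "b"), c("x", "a", "b", "y"))  # TRUE
subvec_found(c("b", "a"), c("x", "a", "b", "y"))  # FALSE
```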
Re: [R] Memory management in R
On Oct 8, 2010, at 9:19 PM, Mike Marchywka wrote:

[Mike Marchywka's message, quoted in full above, is repeated here.]

You really need to look at the question posed by your regex and consider the complexity of what you are asking and what likely implementations would do with your regex.

The R regex machine (at least on a Mac with R 2.11.1) breaks when the length of the pattern argument exceeds 2559 characters. There is no complexity for the regex parser here; no metacharacters were in the string.

Something like this probably needs to be implemented in dedicated code to handle the more general case or you need to determine if input data is pathological given your regex.

There is a Biostrings package in BioC that may provide more robust treatment of long strings.

--
David Winsemius, MD
West Hartford, CT
[R] memory management in R
I have volunteered to give a short talk on memory management in R to my local R user group, mainly to motivate myself to learn about it. The focus will be on what a typical R coder might want to know (e.g. how objects are created, call by value, basics of garbage collection), but I want to go a little deeper just in case there are some advanced users in the crowd. Here are the resources I am using right now:

- Chambers' book "Software for Data Analysis"
- Manuals such as "R Internals" and "Writing R Extensions"

Any suggestions on other sources of information? There are still some things that are not clear to me, such as:

- how to make sense of the output from various memory diagnostics such as memory.profile() ... are these counts? How to get the amount of memory used: gc() and memory.size() seem to differ
- what gets allocated on the heap versus the stack
- why the name "cons cells" for the stack allocation

Any help with these would be greatly appreciated.
Thanks greatly,
John Muller
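On the diagnostics question: memory.profile() does return counts (the number of cons cells currently allocated for each SEXP type), while gc() reports R's own heap usage (cons cells and vector memory), which is generally smaller than what memory.size() or the OS reports for the whole process. A quick illustrative session (exact numbers vary):

```r
# memory.profile(): a named integer vector, one count per SEXPTYPE
mp <- memory.profile()
head(mp)

# gc(): a matrix with rows "Ncells" (cons cells) and "Vcells" (vector
# heap) and columns for current use and the collection trigger
g <- gc()
g["Ncells", "used"]   # cons cells currently in use
g["Vcells", "used"]   # vector cells currently in use
```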
Re: [R] memory management in R
You might want to mention/talk about packages that enhance R's ability to work with less RAM / more data, such as package SOAR (transparently moving objects between RAM and disk) and ff (which allows vectors and data frames larger than RAM and which supports dense data types like true booleans, short integers, etc.).

Jens Oehlschlägel

-----Original Message-----
From: john mull...@fastmail.fm
Sent: Jun 16, 2010 12:20:17 PM
To: r-help@r-project.org
Subject: [R] memory management in R

[John Muller's question, quoted in full above, is repeated here.]
[R] About R memory management?
I'm wondering where I can find detailed descriptions of R memory management. Understanding this could help me understand the runtime of an R program. For example, depending on how memory is allocated (either allocating a chunk of memory that is larger than necessary for the current use, or allocating just enough for the current use), the performance of the following program could be very different. Could somebody point me to some good references?

unsorted_index = NULL
for (i in 1:100) {
  unsorted_index = c(unsorted_index, i)
}
unsorted_index
Re: [R] About R memory management?
Related... Rule of thumb: pre-allocate your object of the *correct* data type, if you know the final dimensions.
/Henrik

On Thu, Dec 10, 2009 at 8:26 AM, Peng Yu pengyu...@gmail.com wrote:
[Peng Yu's question, quoted in full above, is repeated here.]
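The rule of thumb above is worth quantifying: growing a vector with c() copies every existing element on each iteration (O(n^2) total work), while filling a preallocated vector writes in place (O(n)). A small sketch:

```r
n <- 10000

# the pattern from the question: quadratic total work
grown <- NULL
for (i in 1:n) grown <- c(grown, i)

# preallocated at the correct type and length: linear total work
pre <- integer(n)
for (i in 1:n) pre[i] <- i

identical(grown, pre)   # TRUE -- same result, very different cost
```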
Re: [R] About R memory management?
For the case below, you don't need to know anything about how R manages memory, but you do need to understand the basic concepts of algorithmic complexity. You might find "The Algorithm Design Manual", http://www.amazon.com/dp/1848000693, a good start.
Hadley

On Thu, Dec 10, 2009 at 10:26 AM, Peng Yu pengyu...@gmail.com wrote:
[Peng Yu's question, quoted in full above, is repeated here.]

--
http://had.co.nz/
Re: [R] About R memory management?
I have a situation in which I cannot predict the final result's dimension. In C++, I believe that the class valarray preallocates more memory than is actually needed (maybe 2 times more). The runtime for a C++ equivalent of the R code (using append) would still be C*n, where C is a constant and n is the length of the vector. However, if it allocated only just enough memory, the runtime would be C*n^2. Based on your reply, I suspect that R doesn't allocate more memory than is currently needed, right?

On Fri, Dec 11, 2009 at 11:22 AM, Henrik Bengtsson h...@stat.berkeley.edu wrote:
[Henrik's rule of thumb and Peng Yu's original question, quoted in full above, are repeated here.]
Re: [R] About R memory management?
If you really want to code like a C++ coder in R, then create your own object and extend it when necessary:

# take a variation of this; preallocate and then extend when you reach a limit
x <- numeric(2)
for (i in 1:100) {
  if (i > length(x)) {
    # double the length (or whatever you want)
    length(x) <- length(x) * 2
  }
  x[i] <- i
}

On Thu, Dec 10, 2009 at 11:30 AM, Peng Yu pengyu...@gmail.com wrote:
[Peng Yu's follow-up and the earlier messages, quoted in full above, are repeated here.]

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?
Re: [R] About R memory management?
That was not my original question. My original question was: how is memory managed/allocated in R?

On Thu, Dec 10, 2009 at 6:08 PM, jim holtman jholt...@gmail.com wrote:
[Jim Holtman's preallocate-and-extend example and the earlier messages, quoted in full above, are repeated here.]
[R] FW: R memory management
Hi,

I'm using R to collect data for a number of exchanges through a socket connection, and I am constantly running into memory problems even though the task is, I believe, not that memory-consuming. I guess there is a miscommunication between R and WinXP about freeing up memory. This is the code:

for (x in 1:length(exchanges.to.get)) {
  tickers <- sqlQuery(channel, paste("SELECT Symbol FROM symbols_list WHERE Exchange='",
                                     exchanges.to.get[x], "';", sep=''))[,1]
  dir.create(paste(Working.dir, exchanges.to.get[x], '/', sep=''))
  for (y in 1:length(tickers)) {
    # open socket connection to get data
    con2 <- socketConnection(Sys.info()["nodename"], port = )
    writeLines(paste(command, ',', tickers[y], ',', interval, ';', sep=''), con2)
    data. <- readLines(con2)
    end.of.data <- sum(c(data. == "!ENDMSG!", data. == "!SYNTAX_ERROR!"))
    while (end.of.data != 1) {
      new.data <- readLines(con2)
      end.of.data <- sum(new.data == "!ENDMSG!")
      data. <- c(data., new.data)
    }
    if (length(data.) > 3)
      write.table(data.[1:(length(data.)-2)],
                  paste(Working.dir, exchanges.to.get[x], '/',
                        sub('\\*', '+', tickers[y]), '_.csv', sep=''),
                  quote = F, col.names = F, row.names = F)
    close(con2)
  }
  rm(tickers)
  gc()
}

With gcinfo(TRUE) I got the following info (some examples):

Garbage collection 16362 = 15411+754+197 (level 0) ...
6.3 Mbytes of cons cells used (22%)
2.2 Mbytes of vectors used (8%)
Garbage collection 16407 = 15454+756+197 (level 0) ...
13.1 Mbytes of cons cells used (46%)
10.4 Mbytes of vectors used (39%)
Garbage collection 16410 = 15456+756+198 (level 2) ...
4.9 Mbytes of cons cells used (21%)
0.9 Mbytes of vectors used (4%)
Garbage collection 16679 = 15634+796+249 (level 0) ...
150.7 Mbytes of cons cells used (95%)
203.9 Mbytes of vectors used (75%)
Garbage collection 16680 = 15634+796+250 (level 2) ...
4.9 Mbytes of cons cells used (4%)
0.9 Mbytes of vectors used (0%)
Garbage collection 16808 = 15754+802+252 (level 0) ...
6.1 Mbytes of cons cells used (7%)
1.8 Mbytes of vectors used (1%)

But the end result in Task Manager is:

RGui.exe  Mem Usage 470,472K  VM Size 541,988K

even though R reports:

Garbage collection 16808 = 15754+802+252 (level 0) ...
6.1 Mbytes of cons cells used (7%)
1.8 Mbytes of vectors used (1%)

Has anybody encountered this problem, and how do you deal with it? It seems like a memory leak to me, as the tasks are not memory-demanding; the biggest amount of data in a single file is about 40MB.

Thanks
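The numbers gcinfo(TRUE) prints are also available programmatically from gc(), which makes it easy to log usage inside the loop. A minimal sketch (row and column names are those of base R's gc() matrix):

```r
g <- gc()                 # returns a matrix of memory statistics
g["Ncells", "used"]       # cons cells currently in use
g["Vcells", "used"]       # vector cells currently in use
g[, "max used"]           # high-water marks since the last gc(reset = TRUE)
```

Note also that Task Manager's Mem Usage shows the process high-water mark: like most programs, R rarely returns freed heap pages to the OS, so a large figure after a temporary spike does not by itself indicate a leak.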
Re: [R] FW: R memory management
The line:

  data. <- c(data., new.data)

will eat both memory and time voraciously. You should change it by creating 'data.' at the final size it will have and then subscripting into it. If you don't know the final size, then you can grow it a lot a few times instead of growing it a little lots of times.

Patrick Burns
[EMAIL PROTECTED]
+44 (0)20 8525 0696
http://www.burns-stat.com
(home of S Poetry and A Guide for the Unwilling S User)

Yuri Volchik wrote:
> Hi, I'm using R to collect data for a number of exchanges through a socket
> connection and constantly running into memory problems even though the task
> I believe is not that memory consuming. [...]
>     while (end.of.data != 1) {
>       new.data <- readLines(con2)
>       end.of.data <- sum(new.data == "!ENDMSG!")
>       data. <- c(data., new.data)
>     }
> [...] Has anybody encountered this problem and how do you deal with it?
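One way to apply Patrick's advice when the final size is unknown is to accumulate the chunks in a list and concatenate once at the end; appending to a list does not copy the data already collected. A sketch (read.chunk here is a stand-in for the readLines(con2) call, not part of the original code):

```r
collect.lines <- function(read.chunk) {
  chunks <- list()
  i <- 0
  repeat {
    chunk <- read.chunk()
    if (length(chunk) == 0) break     # stand-in for the !ENDMSG! test
    i <- i + 1
    chunks[[i]] <- chunk              # cheap: earlier chunks are not copied
  }
  unlist(chunks, use.names = FALSE)   # one allocation for the final vector
}
```

In the original loop this replaces the repeated data. <- c(data., new.data).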
Re: [R] Memory management
Hi,

I apologize again for posting something not suitable on this list. Basically, it sounds like I should put this large dataset into a database...

The dataset I have had trouble with is the transportation network of the Chicago Consolidated Metropolitan Statistical Area. The number of sample sites is about 7,200 points, and every point has outbound and inbound traffic flows: volumes, times, distances, etc. So a quick approximation of the number of rows would be 49,000,000 (and 249 columns). This is a text file. I could work with a portion of the data at a time, like nearest neighbors or pairs of points.

I used read.table('filename', header=F). I should probably read some of the data at a time instead of reading it all at once. I am learning RSQLite and RMySQL. As Mr. Wan suggests, I will learn C a bit more.

Thank you very much.

TK

Jim Holtman wrote:
> When you say you cannot import 4.8GB, is this the size of the text file that
> you are reading in? If so, what is the structure of the file? How are you
> reading in the file ('read.table', 'scan', etc.)? Do you really need all the
> data, or can you work with a portion at a time? If so, then consider putting
> the data in a database and retrieving the data as needed. If all the data is
> in an object, how big do you think this object will be? (# rows, # columns,
> mode of the data). You need to provide some more information as to the
> problem that you are trying to solve.
>
> On 9/15/07, [EMAIL PROTECTED] wrote:
>> Hi, Let me apologize for this simple question. I use 64-bit R on my Fedora
>> Core 6 Linux workstation. A 64-bit R has saved a lot of time. I am sure
>> this has a lot to do with my memory limit, but I cannot import 4.8GB. My
>> workstation has 8GB RAM, an Athlon X2 5600, and a 1200W PSU. This PC
>> configuration is the best I could get. I know a bit of C and Perl. Should
>> I use C or Perl to manage this large dataset? Or should I even go to 16GB
>> RAM? Sorry for this silly question. But I appreciate it if anyone could
>> give me advice. Thank you very much. TK
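Reading the text file in chunks through a connection keeps only one piece in memory at a time. A hedged sketch (the chunk size and the summarising function are made up for illustration; read.table(text=) needs a reasonably recent R):

```r
process.in.chunks <- function(path, chunk.lines = 100000, fun = nrow) {
  con <- file(path, open = "r")
  on.exit(close(con))
  results <- list()
  k <- 0
  while (length(lines <- readLines(con, n = chunk.lines)) > 0) {
    df <- read.table(text = lines, header = FALSE)
    k <- k + 1
    results[[k]] <- fun(df)   # process the chunk, keep only the summary
  }
  results
}
```

The same pattern works for loading the file into an SQLite table chunk by chunk, so that later queries pull out only the columns and rows needed.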
Re: [R] Memory management
If your data file has 49M rows and 249 columns, then even if each column had only 5 characters you would be looking at a text file of roughly 60GB. If these were all numerics (8 bytes per number), then you are looking at an R object that would be almost 100GB. If this is your data, then it is definitely a candidate for a database, since you would otherwise need a fairly large machine (at least 300GB of real memory). You probably need to give some serious thought to how you want to store your data, and then to what type of processing you need to do on it. BTW, do you need all 249 columns, or could you work with just 3-4 columns at a time? (That at least makes an R object of about 1.5GB, which might be easier to handle.)

On 9/16/07, Takatsugu Kobayashi [EMAIL PROTECTED] wrote:
> Hi, I apologize again for posting something not suitable on this list.
> Basically, it sounds like I should put this large dataset into a
> database... [...] I am learning RSQLite and RMySQL. As Mr. Wan suggests,
> I will learn C a bit more. Thank you very much. TK

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
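Jim's estimates can be checked with a couple of lines of arithmetic (8 bytes per double, 2^30 bytes per GiB):

```r
rows <- 49e6; cols <- 249
all.cols  <- rows * cols * 8 / 2^30  # ~90.9 GiB: the "almost 100GB" object
four.cols <- rows * 4   * 8 / 2^30   # ~1.46 GiB: the "about 1.5GB" subset
text.file <- rows * cols * 5 / 2^30  # ~56.8 GiB of raw characters: the ~60GB text file
```

These are lower bounds: the text file also holds separators, and R objects carry per-object headers on top of the raw data.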