Re: [R] memory management

2012-02-29 Thread Sam Steingold
 * William Dunlap jqha...@gvopb.pbz [2012-02-28 23:06:54 +0000]:

 You need to walk through the objects, checking for environments on
 each component or attribute of an object.

so why doesn't object.size do that?

> f <- function(n) {
+   d <- data.frame(y = rnorm(n), x = rnorm(n))
+   lm(y ~ poly(x, 4), data=d)
+ }

I am not doing any modeling. No ~. No formulas.
The whole thing is just a bunch of data frames.
I do a lot of strsplit, unlist, & subsetting, so I could imagine why
the RSS is triple the total size of my data if all the intermediate
results are not released.

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
http://www.childpsy.net/ http://honestreporting.com http://memri.org
http://jihadwatch.org http://pmw.org.il http://camera.org http://ffii.org
To be popular with ladies one has to be smart, handsome & rich. Or to be a cat.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] memory management

2012-02-29 Thread William Dunlap
 I do a lot of strsplit, unlist, & subsetting, so I could imagine why
 the RSS is triple the total size of my data if all the intermediate
 results are not released.

I can only give some generalities about that.  Using lots of
small chunks of memory (like short strings) may cause fragmentation
(wasted space between blocks of memory).  Depending on your operating
system, calling free(pointerToMemoryBlock) may or may not reduce the
virtual memory size of the process, so something like '/bin/ps -o vsize,size'
or Process Explorer may only show the high water mark of memory usage.
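
One can watch that high-water-mark effect from within R; a minimal
sketch (it assumes /proc/self/status is available, so Linux-only):

rss_kb <- function() {
  line <- grep("^VmRSS", readLines("/proc/self/status"), value = TRUE)
  as.integer(strsplit(line, "[[:space:]]+")[[1]][2])
}
x <- replicate(1e5, paste(sample(letters, 8), collapse = ""))  # many small strings
rss_kb()               # resident size while x is alive
rm(x); invisible(gc())
rss_kb()               # often barely smaller: freed pages stay with the process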

Another way to gauge the total size of the visible data and the
environments associated with it is to call save(list=objects(all=TRUE),
compress=FALSE, file="someFile") and look at the size of the file.
Headers probably have a different size in the file than in the process,
but it can give some hints about how much hidden environments are
adding to things.
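
Spelled out as a minimal sketch (the temporary file name is arbitrary):

tmp <- tempfile()
save(list = objects(all.names = TRUE), file = tmp, compress = FALSE)
file.info(tmp)$size   # bytes on disk: a rough gauge of data plus environments
unlink(tmp)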

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

 -----Original Message-----
 From: Sam Steingold [mailto:sam.steing...@gmail.com] On Behalf Of Sam 
 Steingold
 Sent: Wednesday, February 29, 2012 8:42 AM
 To: William Dunlap
 Cc: r-help@r-project.org
 Subject: Re: memory management
 
  * William Dunlap jqha...@gvopb.pbz [2012-02-28 23:06:54 +0000]:
 
  You need to walk through the objects, checking for environments on
  each component or attribute of an object.
 
 so why doesn't object.size do that?
 
 > f <- function(n) {
 +   d <- data.frame(y = rnorm(n), x = rnorm(n))
 +   lm(y ~ poly(x, 4), data=d)
 + }
 
 I am not doing any modeling. No ~. No formulas.
 The whole thing is just a bunch of data frames.
  I do a lot of strsplit, unlist, & subsetting, so I could imagine why
 the RSS is triple the total size of my data if all the intermediate
 results are not released.
 
 --
 Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 
 11.0.11004000
 http://www.childpsy.net/ http://honestreporting.com http://memri.org
 http://jihadwatch.org http://pmw.org.il http://camera.org http://ffii.org
  To be popular with ladies one has to be smart, handsome & rich. Or to be a 
 cat.



Re: [R] memory management

2012-02-29 Thread Milan Bouchet-Valat
On Wednesday, February 29, 2012 at 11:42 -0500, Sam Steingold wrote:
  * William Dunlap jqha...@gvopb.pbz [2012-02-28 23:06:54 +0000]:
 
  You need to walk through the objects, checking for environments on
  each component or attribute of an object.
 
 so why doesn't object.size do that?
 
 > f <- function(n) {
 +   d <- data.frame(y = rnorm(n), x = rnorm(n))
 +   lm(y ~ poly(x, 4), data=d)
 + }
 
 I am not doing any modeling. No ~. No formulas.
 The whole thing is just a bunch of data frames.
  I do a lot of strsplit, unlist, & subsetting, so I could imagine why
 the RSS is triple the total size of my data if all the intermediate
 results are not released.
I think you're simply hitting a (terrible) OS limitation. Linux is very
often not able to reclaim the memory R has used because the heap is
fragmented: the OS can only take pages back from the top of the heap, and
most of the time there is live data above the object you remove. I'm not
able to give you a more precise explanation, but that's apparently a
known problem, and one that's hard to fix.

At least, I can confirm that after doing a lot of merges on big data
frames, R can keep using 3GB of shared memory on my box even if gc()
only reports 500MB currently used. Restarting R makes memory use go down
to the normal expectations.


Regards



Re: [R] memory management

2012-02-29 Thread Sam Steingold
 * Milan Bouchet-Valat anyvzv...@pyho.se [2012-02-29 18:18:50 +0100]:

 I think you're simply hitting a (terrible) OS limitation. Linux is
 very often not able to reclaim the memory R has used because the heap
 is fragmented: the OS can only take pages back from the top of the
 heap, and most of the time there is live data above the object you
 remove. I'm not able to give you a more precise explanation, but
 that's apparently a known problem, and one that's hard to fix.

compacting garbage collector is our best friend!

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
http://www.childpsy.net/ http://iris.org.il http://www.memritv.org
http://ffii.org http://honestreporting.com http://jihadwatch.org
To a Lisp hacker, XML is S-expressions with extra cruft.



Re: [R] memory management

2012-02-29 Thread luke-tierney

On Wed, 29 Feb 2012, Sam Steingold wrote:


* Milan Bouchet-Valat anyvzv...@pyho.se [2012-02-29 18:18:50 +0100]:

I think you're simply hitting a (terrible) OS limitation. Linux is
very often not able to reclaim the memory R has used because the heap
is fragmented: the OS can only take pages back from the top of the
heap, and most of the time there is live data above the object you
remove. I'm not able to give you a more precise explanation, but
that's apparently a known problem, and one that's hard to fix.


compacting garbage collector is our best friend!


Which R does not use because of the problems it would create for
external C/Fortran code on which R heavily relies.


--
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                    Phone: 319-335-3386
Department of Statistics and          Fax:   319-335-3017
   Actuarial Science
241 Schaeffer Hall                    email: luke-tier...@uiowa.edu
Iowa City, IA 52242                   WWW:   http://www.stat.uiowa.edu



Re: [R] memory management

2012-02-29 Thread Sam Steingold
 *  yhxr-gvre...@hvbjn.rqh [2012-02-29 13:55:25 -0600]:
 On Wed, 29 Feb 2012, Sam Steingold wrote:
 compacting garbage collector is our best friend!

 Which R does not use because of the problems it would create for
 external C/Fortran code on which R heavily relies.

Well, you know better, of course.

However, I cannot stop wondering if this really is absolutely necessary.
If you do not call GC while the external C/Fortran code is running, you
should be fine with a compacting garbage collector.
If you access the C/Fortran data (managed by the C/Fortran code), then
it should live in a separate universe from the one managed by the R GC.

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
http://www.childpsy.net/ http://dhimmi.com http://camera.org
http://iris.org.il http://truepeace.org http://mideasttruth.com
Lisp: it's here to save your butt.



Re: [R] memory management

2012-02-28 Thread Sam Steingold
My basic worry is that the GC does not work properly,
i.e., the unreachable data is never collected.

 * Bert Gunter thagre.ore...@trar.pbz [2012-02-27 14:35:14 -0800]:

 This appears to be the sort of query that (with apologies to other R
 gurus) only Brian Ripley or Luke Tierney could figure out. R generally
 passes by value into function calls (but not *always*), so often
 multiple copies of objects are made during the course of calls. I
 would speculate that this is what might be going on below -- maybe
 even that's what you meant.

 Just a guess on my part, of course, so treat accordingly.

 -- Bert

 On Mon, Feb 27, 2012 at 1:03 PM, Sam Steingold s...@gnu.org wrote:
 It appears that the intermediate data in functions is never GCed even
 after the return from the function call.
 R's RSS is 4 Gb (after a gc()) and

 sum(unlist(lapply(lapply(ls(),get),object.size)))
 [1] 1009496520

 (less than 1 GB)

 how do I figure out where the 3GB of uncollected garbage is hiding?

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
http://www.childpsy.net/ http://jihadwatch.org http://memri.org
http://palestinefacts.org http://truepeace.org http://iris.org.il
I may be getting older, but I refuse to grow up!



Re: [R] memory management

2012-02-28 Thread Bert Gunter
On Tue, Feb 28, 2012 at 11:57 AM, Sam Steingold s...@gnu.org wrote:
 My basic worry is that the GC does not work properly,
 i.e., the unreachable data is never collected.

Highly unlikely. Such basic inner R code has been well tested over 20
years.  I believe that you merely don't understand the inner guts of
what R is doing here, which is the essence of my response. (Clearly, I
make no claim that I do either).

I suggest you move on.

-- Bert


 * Bert Gunter thagre.ore...@trar.pbz [2012-02-27 14:35:14 -0800]:

 This appears to be the sort of query that (with apologies to other R
 gurus) only Brian Ripley or Luke Tierney could figure out. R generally
 passes by value into function calls (but not *always*), so often
 multiple copies of objects are made during the course of calls. I
 would speculate that this is what might be going on below -- maybe
 even that's what you meant.

 Just a guess on my part, of course, so treat accordingly.

 -- Bert

 On Mon, Feb 27, 2012 at 1:03 PM, Sam Steingold s...@gnu.org wrote:
 It appears that the intermediate data in functions is never GCed even
 after the return from the function call.
 R's RSS is 4 Gb (after a gc()) and

 sum(unlist(lapply(lapply(ls(),get),object.size)))
 [1] 1009496520

 (less than 1 GB)

 how do I figure out where the 3GB of uncollected garbage is hiding?

 --
 Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 
 11.0.11004000
 http://www.childpsy.net/ http://jihadwatch.org http://memri.org
 http://palestinefacts.org http://truepeace.org http://iris.org.il
 I may be getting older, but I refuse to grow up!



-- 

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm



Re: [R] memory management

2012-02-28 Thread William Dunlap
Look into environments that may be stored
with your data.  object.size(obj) does not
report on the size of the environment(s)
associated with obj.  E.g.,

> f <- function(n) {
+    d <- data.frame(y=rnorm(n), x1=rnorm(n), x2=rnorm(n))
+    terms(data=d, y~.)
+ }
> z <- f(1e6)
> object.size(z)
1760 bytes
> eapply(environment(z), object.size)
$d
24000520 bytes

$n
32 bytes
That happens because formula objects (like function
objects) contain a reference to the environment in
which they were created, and that environment will not
be destroyed until the last reference to it is gone.
You might be able to write code using, e.g., the codetools
package to walk through your objects looking for all
distinct environments that they reference (directly
and indirectly, via ancestors of environments directly
referenced).  Then you can add up the sizes of things
in those environments.

Another possible reason for your problem is that by using ls()
instead of ls(all=TRUE) you are not looking at datasets
whose names start with a dot.
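
A quick illustration of that blind spot (".hidden" is a made-up name):

.hidden <- rnorm(1e6)                 # invisible to a bare ls()
setdiff(ls(all.names = TRUE), ls())   # e.g. ".hidden" (and ".Random.seed")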

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

 -----Original Message-----
 From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
 Behalf Of Sam Steingold
 Sent: Tuesday, February 28, 2012 11:58 AM
 To: r-help@r-project.org; Bert Gunter
 Subject: Re: [R] memory management
 
 My basic worry is that the GC does not work properly,
 i.e., the unreachable data is never collected.
 
  * Bert Gunter thagre.ore...@trar.pbz [2012-02-27 14:35:14 -0800]:
 
  This appears to be the sort of query that (with apologies to other R
  gurus) only Brian Ripley or Luke Tierney could figure out. R generally
  passes by value into function calls (but not *always*), so often
  multiple copies of objects are made during the course of calls. I
  would speculate that this is what might be going on below -- maybe
  even that's what you meant.
 
  Just a guess on my part, of course, so treat accordingly.
 
  -- Bert
 
  On Mon, Feb 27, 2012 at 1:03 PM, Sam Steingold s...@gnu.org wrote:
  It appears that the intermediate data in functions is never GCed even
  after the return from the function call.
  R's RSS is 4 Gb (after a gc()) and
 
  sum(unlist(lapply(lapply(ls(),get),object.size)))
  [1] 1009496520
 
  (less than 1 GB)
 
  how do I figure out where the 3GB of uncollected garbage is hiding?
 
 --
 Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 
 11.0.11004000
 http://www.childpsy.net/ http://jihadwatch.org http://memri.org
 http://palestinefacts.org http://truepeace.org http://iris.org.il
 I may be getting older, but I refuse to grow up!
 


Re: [R] memory management

2012-02-28 Thread Sam Steingold
 * William Dunlap jqha...@gvopb.pbz [2012-02-28 20:19:06 +0000]:

 Look into environments that may be stored with your data.

thanks, but I see nothing like that:

for (n in ls(all.names = TRUE)) {
  o <- get(n)
  print(object.size(o), units="Kb")
  e <- environment(o)
  if (!identical(e,NULL) && !identical(e,.GlobalEnv)) {
    print(e)
    print(eapply(e,object.size))
  }
}
25.8 Kb
0.5 Kb
49.1 Kb
0.1 Kb
30.8 Kb
13.6 Kb
17.4 Kb
59.4 Kb
52.2 Kb
0.1 Kb
3.9 Kb
49.1 Kb
21.2 Kb
0.1 Kb
0.1 Kb
51 Kb
13.2 Kb
53.5 Kb
18.1 Kb
64.3 Kb
25.8 Kb
33.5 Kb
0.1 Kb
0.1 Kb
8 Kb
10 Kb
15.7 Kb
15.6 Kb
9.9 Kb
401672.7 Kb
19.1 Kb
76 Kb
12 Kb
32.4 Kb
156.3 Kb
13.1 Kb
20.5 Kb
21.8 Kb
10.8 Kb

sum(unlist(lapply(lapply(ls(all.names = TRUE),get),object.size)))
[1] 412351928

i.e., the total size of the data is about 400MB.
why does the process take in excess of 1GB?

top: 1235m 1.1g 4452 S0 14.6   7:12.27 R

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
http://www.childpsy.net/ http://pmw.org.il http://camera.org
http://dhimmi.com http://palestinefacts.org http://ffii.org
Fighting for peace is like screwing for virginity.



Re: [R] memory management

2012-02-28 Thread William Dunlap
You need to walk through the objects, checking for
environments on each component or attribute of an
object.  You also have to look at the parent.env
of each environment found.  E.g.,
> f <- function(n) {
+   d <- data.frame(y = rnorm(n), x = rnorm(n))
+   lm(y ~ poly(x, 4), data=d)
+ }
> z <- f(1e5)
> environment(z)
NULL
> object.size(z)
21610708 bytes
> sapply(z, object.size)
 coefficients     residuals       effects 
          384       4400104       1200336 
         rank fitted.values        assign 
           32       4400104            56 
           qr   df.residual       xlevels 
      7601232            32           104 
         call         terms         model 
          508          2804       4004276 
> environment(z$terms)
<environment: 0x0abb86e4>
> eapply(environment(z$terms), object.size)
$d
1600448 bytes

$n
32 bytes

Coding this is tedious; the codetools package may make it
easier.  Summing the sizes may well give an overestimate
of the memory actually used, since several objects may
share the same memory.
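
A rough sketch of such a walk (a sketch only, with stated assumptions:
the environments hold plain data, no promises or active bindings, and
the walk stops at the global, base and empty environments):

collect_envs <- function(x, found = list()) {
  seen <- function(e) any(vapply(found, identical, logical(1), e))
  if (is.environment(x)) {
    e <- x
    while (is.environment(e) && !identical(e, globalenv()) &&
           !identical(e, baseenv()) && !identical(e, emptyenv()) &&
           !seen(e)) {
      found[[length(found) + 1L]] <- e
      e <- parent.env(e)              # also collect the ancestors
    }
    return(found)
  }
  if (!is.null(environment(x)))       # functions, formulas, terms, ...
    found <- collect_envs(environment(x), found)
  if (is.list(x))
    for (el in unclass(x)) found <- collect_envs(el, found)
  for (a in attributes(x)) found <- collect_envs(a, found)
  found
}

# e.g. for the lm fit z above:
# envs <- collect_envs(z)
# sum(unlist(lapply(envs, function(e) sum(unlist(eapply(e, object.size))))))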

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

 -----Original Message-----
 From: Sam Steingold [mailto:sam.steing...@gmail.com] On Behalf Of Sam 
 Steingold
 Sent: Tuesday, February 28, 2012 2:56 PM
 To: r-help@r-project.org; William Dunlap
 Subject: Re: memory management
 
  * William Dunlap jqha...@gvopb.pbz [2012-02-28 20:19:06 +0000]:
 
  Look into environments that may be stored with your data.
 
 thanks, but I see nothing like that:
 
 for (n in ls(all.names = TRUE)) {
   o <- get(n)
   print(object.size(o), units="Kb")
   e <- environment(o)
   if (!identical(e,NULL) && !identical(e,.GlobalEnv)) {
     print(e)
     print(eapply(e,object.size))
   }
 }
 25.8 Kb
 0.5 Kb
 49.1 Kb
 0.1 Kb
 30.8 Kb
 13.6 Kb
 17.4 Kb
 59.4 Kb
 52.2 Kb
 0.1 Kb
 3.9 Kb
 49.1 Kb
 21.2 Kb
 0.1 Kb
 0.1 Kb
 51 Kb
 13.2 Kb
 53.5 Kb
 18.1 Kb
 64.3 Kb
 25.8 Kb
 33.5 Kb
 0.1 Kb
 0.1 Kb
 8 Kb
 10 Kb
 15.7 Kb
 15.6 Kb
 9.9 Kb
 401672.7 Kb
 19.1 Kb
 76 Kb
 12 Kb
 32.4 Kb
 156.3 Kb
 13.1 Kb
 20.5 Kb
 21.8 Kb
 10.8 Kb
 
 sum(unlist(lapply(lapply(ls(all.names = TRUE),get),object.size)))
 [1] 412351928
 
  i.e., the total size of the data is about 400MB.
  why does the process take in excess of 1GB?
 
 top: 1235m 1.1g 4452 S0 14.6   7:12.27 R
 
 --
 Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 
 11.0.11004000
 http://www.childpsy.net/ http://pmw.org.il http://camera.org
 http://dhimmi.com http://palestinefacts.org http://ffii.org
 Fighting for peace is like screwing for virginity.



Re: [R] memory management

2012-02-27 Thread Sam Steingold
It appears that the intermediate data in functions is never GCed even
after the return from the function call.
R's RSS is 4 Gb (after a gc()) and

sum(unlist(lapply(lapply(ls(),get),object.size)))
[1] 1009496520

(less than 1 GB)

how do I figure out where the 3GB of uncollected garbage is hiding?

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
http://www.childpsy.net/ http://camera.org http://truepeace.org
http://www.PetitionOnline.com/tap12009/ http://thereligionofpeace.com
Modern man is the missing link between apes and human beings.



Re: [R] memory management

2012-02-27 Thread Bert Gunter
This appears to be the sort of query that (with apologies to other R
gurus) only Brian Ripley or Luke Tierney could figure out. R generally
passes by value into function calls (but not *always*), so often
multiple copies of objects are made during the course of calls. I
would speculate that this is what might be going on below -- maybe
even that's what you meant.

Just a guess on my part, of course, so treat accordingly.

-- Bert

On Mon, Feb 27, 2012 at 1:03 PM, Sam Steingold s...@gnu.org wrote:
 It appears that the intermediate data in functions is never GCed even
 after the return from the function call.
 R's RSS is 4 Gb (after a gc()) and

 sum(unlist(lapply(lapply(ls(),get),object.size)))
 [1] 1009496520

 (less than 1 GB)

 how do I figure out where the 3GB of uncollected garbage is hiding?

 --
 Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 
 11.0.11004000
 http://www.childpsy.net/ http://camera.org http://truepeace.org
 http://www.PetitionOnline.com/tap12009/ http://thereligionofpeace.com
 Modern man is the missing link between apes and human beings.




-- 

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm



[R] memory management

2012-02-09 Thread Sam Steingold
> zz <- data.frame(a=c(1,2,3),b=c(4,5,6))
> zz
  a b
1 1 4
2 2 5
3 3 6
> a <- zz$a
> a
[1] 1 2 3
> a[2] <- 100
> a
[1]   1 100   3
> zz
  a b
1 1 4
2 2 5
3 3 6


clearly a is a _copy_ of its namesake column in zz.

when was the copy made? when a was modified? at assignment?

is there a way to find out how much memory an object takes?

gc() appears not to reclaim all memory after rm() - anyone can confirm?

thanks!

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
http://www.childpsy.net/ http://mideasttruth.com http://americancensorship.org
http://www.memritv.org http://jihadwatch.org http://ffii.org
C combines the power of assembler with the portability of assembler.



Re: [R] memory management

2012-02-09 Thread Florent D.
This should help:

> invisible(gc())

> m0 <- memory.size()
> mem.usage <- function(){invisible(gc()); memory.size() - m0}
> Mb.size  <- function(x)print(object.size(x), units="Mb")

> zz <- data.frame(a=runif(1000000), b=runif(1000000))
> mem.usage()
[1] 15.26
> Mb.size(zz)
15.3 Mb
> a <- zz$a
> mem.usage()
[1] 15.26
> Mb.size(a)
7.6 Mb
> a[2] <- 100
> mem.usage()
[1] 22.89
> Mb.size(a)
7.6 Mb

You can see that a <- zz$a really has no impact on your memory usage.
It is when you start modifying it that R needs to store a whole new
object in memory.
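
tracemem() pins down the exact moment of the copy; a small sketch (it
needs an R build with memory profiling enabled, which the CRAN binary
builds are):

> zz <- data.frame(a = runif(10), b = runif(10))
> a <- zz$a       # no copy yet: a still shares memory with zz$a
> tracemem(a)     # prints the address being traced
> a[2] <- 100     # tracemem reports the duplication at this write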



On Thu, Feb 9, 2012 at 5:17 PM, Sam Steingold s...@gnu.org wrote:
 > zz <- data.frame(a=c(1,2,3),b=c(4,5,6))
 > zz
   a b
 1 1 4
 2 2 5
 3 3 6
 > a <- zz$a
 > a
 [1] 1 2 3
 > a[2] <- 100
 > a
 [1]   1 100   3
 > zz
   a b
 1 1 4
 2 2 5
 3 3 6


 clearly a is a _copy_ of its namesake column in zz.

 when was the copy made? when a was modified? at assignment?

 is there a way to find out how much memory an object takes?

 gc() appears not to reclaim all memory after rm() - anyone can confirm?

 thanks!

 --
 Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 
 11.0.11004000
 http://www.childpsy.net/ http://mideasttruth.com http://americancensorship.org
 http://www.memritv.org http://jihadwatch.org http://ffii.org
 C combines the power of assembler with the portability of assembler.



Re: [R] memory management

2012-02-09 Thread Sam Steingold
 * Florent D. syb...@tznvy.pbz [2012-02-09 19:26:59 -0500]:

 m0 <- memory.size()
 Mb.size  <- function(x)print(object.size(x), units="Mb")

indeed, these are very useful, thanks.

ls reports these objects larger than 100k:

behavior : 390.1 Mb
mydf : 115.3 Mb
nb : 0.2 Mb
pl : 1.2 Mb

however, top reports that R uses 1.7GB of RAM (RSS) - even after gc().
what part of R is using the extra 1GB of RAM?
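
For reference, a sketch of one way to produce such a listing:

sizes <- sapply(ls(all.names = TRUE), function(n) object.size(get(n)))
sort(sizes[sizes > 100e3], decreasing = TRUE)   # objects above ~100 Kb, in bytes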

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
http://www.childpsy.net/ http://honestreporting.com http://dhimmi.com
http://jihadwatch.org http://americancensorship.org http://camera.org
Money does not buy happiness, but it helps to make unhappiness comfortable.



[R] Memory management

2011-06-01 Thread Michael Conklin
I am trying to run a very large Bradley-Terry model using the BradleyTerry2 
package.  (There are 288 players in the BT model).

My problem is that I ran the model below successfully.
WLMat is a win-loss matrix that is 288 by 288
WLdf <- countsToBinomial(WLMat)
mod1 <- BTm(cbind(win1,win2), player1, player2, ~player, id="player", data=WLdf)

Then I needed to run the same model with a subset of the observations that went 
into the win-loss matrix.  So I created my new win-loss matrix and tried to run 
a new model.

Now I get:  Error: cannot allocate vector of size 90.5 Mb

I found this particularly puzzling because the actual input data is the same 
size as the original model, just different values.

I tried increasing the memory size, and I tried running it in a clean workspace, 
but the error message is always the same (sometimes the vector it is trying to 
allocate is 181.0MB, twice as large), and it is always one of those two numbers 
no matter what I have done to the available memory.

To further complicate this... I cannot get the system to re-run my first model 
either. Same errors.

traceback() indicates that the error occurs when the program is trying to do a QR 
decomposition.

R 2.13.0
Windows XP

Any suggestions?

W. Michael Conklin
Chief Methodologist
Google Voice: (612) 56STATS

MarketTools, Inc. | www.markettools.com
6465 Wayzata Blvd | Suite 170 |  St. Louis Park, MN 55426.  PHONE: 952.417.4719 
| CELL: 612.201.8978
This email and attachment(s) may contain confidential and/or proprietary 
information and is intended only for the intended addressee(s) or its 
authorized agent(s). Any disclosure, printing, copying or use of such 
information is strictly prohibited. If this email and/or attachment(s) were 
received in error, please immediately notify the sender and delete all copies




Re: [R] Memory Management under Linux

2010-11-05 Thread jim holtman
It would be very useful if you would post some information about what
exactly you are doing.  There is something with the size of the data
object you are processing ('str' would help us understand it) and then
a portion of the script (both before and after the error message) so
we can understand the transformation that you are doing.  It is very
easy to generate a similar message:

> x <- matrix(0, 20000, 20000)
Error: cannot allocate vector of size 3.0 Gb

but unless you know the context, it is almost impossible to give
advice.  It also depends on whether you are in some function calls where
copies of objects may have been made, etc.

On Thu, Nov 4, 2010 at 7:52 PM, ricardo souza ricsouz...@yahoo.com.br wrote:
 Dear all,

 I am using Ubuntu Linux 32-bit with 4 GB.  I am running a very small script and I 
 always get the same error message:  "cannot allocate vector of size 231.8 
 Mb".

 I have read carefully the instructions in ?Memory.  Using the function gc() 
 I got very low numbers for memory (please see below).  I know that it has been 
 posted several times at r-help 
 (http://tolstoy.newcastle.edu.au/R/help/05/06/7565.html#7627qlink2).  However 
 I have not yet found a solution to improve my memory issue on Linux.  
 Could somebody please give some instructions on how to improve my memory under 
 Linux?

 > gc()
          used (Mb) gc trigger (Mb) max used (Mb)
 Ncells 170934  4.6     350000  9.4   350000  9.4
 Vcells 195920  1.5     786432  6.0   781384  6.0

 INCREASING THE R MEMORY FOLLOWING THE INSTRUCTIONS IN ?Memory

 I started R with:

 R --min-vsize=10M --max-vsize=4G --min-nsize=500k --max-nsize=900M
 > gc()
          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
 Ncells 130433  3.5     500000 13.4      25200   500000 13.4
 Vcells  81138  0.7    1310720 10.0         NA   499143  3.9

 It increased, but not by much!

 Please, please let me know.  I have read all of r-help about this matter, but 
 found no solution. Thanks for your attention!

 Ricardo












-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?



Re: [R] Memory Management under Linux

2010-11-05 Thread ricardo souza
Dear Jim,

Thanks for your attention. I am running a geostatistical analysis with geoR that 
is computationally intense. At the end of my analysis I call the functions 
krige.control and krige.conv.  Do you have any idea how to improve the memory 
allocation on Linux?

Thanks,
Ricardo



From: jim holtman jholt...@gmail.com
Subject: Re: [R] Memory Management under Linux
To: ricardo souza ricsouz...@yahoo.com.br
Cc: r-help@r-project.org
Date: Friday, November 5, 2010, 10:21

It would be very useful if you would post some information about what
exactly you are doing.  There is something with the size of the data
object you are processing ('str' would help us understand it) and then
a portion of the script (both before and after the error message) so
we can understand the transformation that you are doing.  It is very
easy to generate a similar message:

> x <- matrix(0, 20000, 20000)
Error: cannot allocate vector of size 3.0 Gb

but unless you know the context, it is almost impossible to give
advice.  It also depends on whether you are in some function calls where
copies of objects may have been made, etc.

On Thu, Nov 4, 2010 at 7:52 PM, ricardo souza ricsouz...@yahoo.com.br wrote:
 Dear all,

 I am using Ubuntu Linux 32-bit with 4 GB.  I am running a very small script and I 
 always get the same error message:  "cannot allocate vector of size 231.8 
 Mb".

 I have read carefully the instructions in ?Memory.  Using the function gc() 
 I got very low numbers for memory (please see below).  I know that it has been 
 posted several times at r-help 
 (http://tolstoy.newcastle.edu.au/R/help/05/06/7565.html#7627qlink2).  However 
 I have not yet found a solution to improve my memory issue on Linux.  
 Could somebody please give some instructions on how to improve my memory under 
 Linux?

 > gc()
          used (Mb) gc trigger (Mb) max used (Mb)
 Ncells 170934  4.6     350000  9.4   350000  9.4
 Vcells 195920  1.5     786432  6.0   781384  6.0

 INCREASING THE R MEMORY FOLLOWING THE INSTRUCTIONS IN ?Memory

 I started R with:

 R --min-vsize=10M --max-vsize=4G --min-nsize=500k --max-nsize=900M
 > gc()
          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
 Ncells 130433  3.5     500000 13.4      25200   500000 13.4
 Vcells  81138  0.7    1310720 10.0         NA   499143  3.9

 It increased, but not by much!

 Please, please let me know.  I have read all of r-help about this matter, but 
 found no solution. Thanks for your attention!

 Ricardo












-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?



  


Re: [R] Memory Management under Linux

2010-11-05 Thread jim holtman
I would do some monitoring (debugging) of the script by placing some 'gc()'
calls in the sequence of statements leading to the problem to see what the
memory usage is at that point.  Take a close look at the sizes of your
objects.  If it is happening in some function you have called, you may have
to take a look and understand if multiple copies are being made.  Most
problems of this type may require that you put hooks in your code (most of
the stuff that I write has them in so I can isolate performance problems) to
gain an understanding of what is happening when.  To improve memory
allocation, you first have to understand what is causing the problem, and
enough information has not been provided so that I could make a comment on
it.  There are lots of rules of thumb that can be used, but many depend on
exactly what you are trying to do.
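
A sketch of such a hook (the step labels and the merge() line are
hypothetical; drop calls like these between the steps of a script to
see where usage jumps):

mem_checkpoint <- function(label) {
  g <- gc()                                  # force a collection first
  cat(sprintf("%-20s %8.1f Mb in use\n", label, sum(g[, 2])))
}

mem_checkpoint("before merge")
# merged <- merge(big1, big2, by = "id")     # some memory-hungry step
mem_checkpoint("after merge")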

On Fri, Nov 5, 2010 at 2:59 PM, ricardo souza ricsouz...@yahoo.com.brwrote:

   Dear Jim,

  Thanks for your attention. I am running a geostatistical analysis with geoR
  that is computationally intense. At the end of my analysis I call the functions
  krige.control and krige.conv.  Do you have any idea how to improve the
  memory allocation on Linux?

 Thanks,
 Ricardo



  From: jim holtman jholt...@gmail.com
  Subject: Re: [R] Memory Management under Linux
  To: ricardo souza ricsouz...@yahoo.com.br
  Cc: r-help@r-project.org
  Date: Friday, November 5, 2010, 10:21


 It would be very useful if you would post some information about what
 exactly you are doing.  There is something with the size of the data
 object you are processing ('str' would help us understand it) and then
 a portion of the script (both before and after the error message) so
 we can understand the transformation that you are doing.  It is very
 easy to generate a similar message:

 > x <- matrix(0, 20000, 20000)
 Error: cannot allocate vector of size 3.0 Gb

 but unless you know the context, it is almost impossible to give
 advice.  It also depends on whether you are in some function calls where
 copies of objects may have been made, etc.

 On Thu, Nov 4, 2010 at 7:52 PM, ricardo souza ricsouz...@yahoo.com.br wrote:
  Dear all,
 
  I am using Ubuntu Linux 32-bit with 4 GB.  I am running a very small script
 and I always get the same error message:  "cannot allocate vector of size
 231.8 Mb".
 
  I have read carefully the instructions in ?Memory.  Using the function
 gc() I got very low numbers for memory (please see below).  I know that it
 has been posted several times at r-help (
 http://tolstoy.newcastle.edu.au/R/help/05/06/7565.html#7627qlink2).
 However I have not yet found a solution to improve my memory issue on
 Linux.  Could somebody please give some instructions on how to improve my
 memory under Linux?
 
  > gc()
           used (Mb) gc trigger (Mb) max used (Mb)
  Ncells 170934  4.6     350000  9.4   350000  9.4
  Vcells 195920  1.5     786432  6.0   781384  6.0
 
  INCREASING THE R MEMORY FOLLOWING THE INSTRUCTIONS IN ?Memory
 
  I started R with:
 
  R --min-vsize=10M --max-vsize=4G --min-nsize=500k --max-nsize=900M
  > gc()
           used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
  Ncells 130433  3.5     500000 13.4      25200   500000 13.4
  Vcells  81138  0.7    1310720 10.0         NA   499143  3.9
 
  It increased, but not by much!
 
  Please, please let me know.  I have read all of r-help about this matter,
 but found no solution. Thanks for your attention!
 
  Ricardo
 
 
 
 
 
 
 
 
 



 --
 Jim Holtman
 Cincinnati, OH
 +1 513 646 9390

 What is the problem that you are trying to solve?







-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?



[R] Memory Management under Linux

2010-11-04 Thread ricardo souza
Dear all, 

I am using Ubuntu Linux 32-bit with 4 GB.  I am running a very small script and I 
always get the same error message:  "cannot allocate vector of size 231.8 Mb". 

I have read carefully the instructions in ?Memory.  Using the function gc() I 
got very low numbers for memory (please see below).  I know that it has been 
posted several times at r-help 
(http://tolstoy.newcastle.edu.au/R/help/05/06/7565.html#7627qlink2).  However I 
have not yet found a solution to improve my memory issue on Linux.  Could 
somebody please give some instructions on how to improve my memory under Linux?  

> gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 170934  4.6     350000  9.4   350000  9.4
Vcells 195920  1.5     786432  6.0   781384  6.0

INCREASING THE R MEMORY FOLLOWING THE INSTRUCTIONS IN ?Memory

I started R with: 

R --min-vsize=10M --max-vsize=4G --min-nsize=500k --max-nsize=900M 
> gc()
         used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells 130433  3.5     500000 13.4      25200   500000 13.4
Vcells  81138  0.7    1310720 10.0         NA   499143  3.9

It increased, but not by much!  

Please, please let me know.  I have read all of r-help about this matter, but 
found no solution. Thanks for your attention! 

Ricardo






  


Re: [R] Memory management in R

2010-10-10 Thread Lorenzo Isella



I already offered the Biostrings package. It provides more robust
methods for string matching than does grepl. Is there a reason that you
choose not to?



Indeed that is the way I should go, and I have installed the package 
after some struggling. Since Biostrings is a fairly complex package and I 
need only a way to check whether a certain string A is a substring of string B, 
do you know the Biostrings functions to achieve this?
I see a lot of methods for biological (DNA, RNA) sequences, and they may 
not apply to my series (which are definitely not from biology).

Cheers

Lorenzo



Re: [R] Memory management in R

2010-10-10 Thread Mike Marchywka








 Date: Sun, 10 Oct 2010 15:27:11 +0200
 From: lorenzo.ise...@gmail.com
 To: dwinsem...@comcast.net
 CC: r-help@r-project.org
 Subject: Re: [R] Memory management in R


  I already offered the Biostrings package. It provides more robust
  methods for string matching than does grepl. Is there a reason that you
  choose not to?
 

 Indeed that is the way I should go, and I have installed the package
 after some struggling. Since Biostrings is a fairly complex package and I
 need only a way to check whether a certain string A is a substring of string B,
 do you know the Biostrings functions to achieve this?
 I see a lot of methods for biological (DNA, RNA) sequences, and they may
 not apply to my series (which are definitely not from biology).

Generally the differences relate to the alphabet and the things you may want
to know about it. Unless you are looking for reverse-complement
text strings, there will be a lot of stuff you don't need. Offhand,
I'd be looking for things like computational linguistics packages,
as you are looking to find patterns or predictability in human-readable 
character sequences. Now, humans can probably write hairpin text (look
at what RNA can do LOL) but this is probably not what you care about. 

However, as I mentioned earlier, I had to write my own regex compiler
(coincidentally for bio apps) to get the required performance. Your
application and understanding may benefit from things like building
dictionaries, which aren't really part of regex and can easily be done
in a few lines of C++ code using STL containers. To get statistically
meaningful samples, you will almost certainly need faster code.




 Cheers

 Lorenzo



Re: [R] Memory management in R

2010-10-09 Thread Lorenzo Isella

Hi David,
I am replying to you and to the other people who provided some insight 
into my problems with grepl.

Well, at least we now know that the bug is reproducible.
Indeed the sequence I am postprocessing is a strange one, probably 
pathological to some extent; nevertheless, the problem of grepl 
crashing when a long (but not huge) chunk of repeated data is loaded has 
to be acknowledged.
Now, my problem is the following: given a potentially long string (or 
before that a sequence, where every element has been generated via the 
hash function, algo='crc32' of the digest package), how can I, starting 
from an arbitrary position i along the list, calculate the shortest 
substring in the future of i (i.e. the interval i:end of the series) 
that has not occurred in the past of i (i.e. [1:i-1])?
Efficiency is not the main point here, I need to run this code only once 
to get what I need, but it cannot crash on a 2000-entry string.

Cheers

Lorenzo


On 10/09/2010 01:30 AM, David Winsemius wrote:


What puzzles me is that the list is not really long (less than 2000
entries) and I have not experienced the same problem even with longer
lists.


But maybe your loop terminated in them earlier? Someplace between
11*225 and 11*240 the grepping machine gives up:

> eprs <- paste(rep("aa", 225), collapse="#")
> grepl(eprs, eprs)
[1] TRUE

> eprs <- paste(rep("aa", 240), collapse="#")
> grepl(eprs, eprs)
Error in grepl(eprs, eprs) :
invalid regular expression
'aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#a

In addition: Warning message:
In grepl(eprs, eprs) : regcomp error: 'Out of memory'

The complexity of the problem may depend on the distribution of values.
You have a very skewed distribution with the vast majority being in the
same value as appeared in your error message:

> table(x)
x
 12653a6 202fbcc4 48bef8c3 4e084ddc 51f342a4 5d64d58a 78087f5e abddf3d1
    1419      299        1        1        1        3        1        1
ac76183b b955be36 c600173a e96f6bbd e9c56275
       1       30        5        1        9

And you have 1159 of them in one clump (which would seem to be somewhat
improbable under a random null hypothesis):

> max(rle(x)$lengths)
[1] 1159
> which(rle(x)$lengths == 1159)
[1] 123
> rle(x)$values[123]
[1] "12653a6"

HTH (although I think it means you need to construct a different
implementation strategy);

David.



Many thanks

Lorenzo






Re: [R] Memory management in R

2010-10-09 Thread David Winsemius


On Oct 9, 2010, at 9:45 AM, Lorenzo Isella wrote:


Hi David,
I am replying to you and to the other people who provided some  
insight into my problems with grepl.

Well, at least we now know that the bug is reproducible.
Indeed the sequence I am postprocessing is a strange one, probably  
pathological to some extent; nevertheless, the problem of grepl  
crashing when a long (but not huge) chunk of repeated data is loaded  
has to be acknowledged.
Now, my problem is the following: given a potentially long string  
(or before that a sequence, where every element has been generated  
via the hash function, algo='crc32' of the digest package), how can  
I, starting from an arbitrary position i along the list, calculate  
the shortest substring in the future of i (i.e. the interval i:end  
of the series) that has not occurred in the past of i (i.e. [1:i-1])?


Maybe you should work on a less convoluted explanation of the test? Or  
perhaps a couple of compact examples, preferably in R-copy-paste format?


Efficiency is not the main point here, I need to run this code only  
once to get what I need, but it cannot crash on a 2000-entry string.


My suggestion is to explore other alternatives. (I will admit that I  
don't yet fully understand the test that you are applying.) The two  
that have occurred to me are Biostrings which I have already mentioned  
and rle() which I have illustrated the use of but not referenced as an  
avenue. The Biostrings package is part of Bioconductor (part of the R  
universe) although you should be prepared for a coffee break when you  
install it if you haven't gotten at least biocLite already installed.  
When I installed it last night it had 54 other package dependents also  
downloaded and installed. It seems to me that taking advantage of the  
coding resources in the molecular biology domain that are currently  
directed at decoding the information storage mechanism of life might  
be a smart strategy. You have not described the domain you are working  
in but I would guess that the digest package might be biological in  
primary application? So forgive me if I am preaching to the choir.
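
What the Biostrings route might look like for plain (non-biological)
text, as a minimal sketch (BString holds arbitrary strings, and
matchPattern() does exact substring search with no regex machinery):

library(Biostrings)
past <- BString("12653a6#202fbcc4#12653a6#b955be36")
length(matchPattern("202fbcc4", past)) > 0   # TRUE: the substring occurs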


The rle option also occurred to me but it might take a smarter coder  
than I to fully implement it. (But maybe Holtman would be up to it.  
He's a _lot_ smarter than I.)  In your example the long x string is  
faithfully represented by two aligned vectors, each 197 elements  
long. The long repeat sequence that broke the grepl mechanism is  
just one pair of values.

> rle(x)
Run Length Encoding
  lengths: int [1:197] 1 1 2 1 1 4 1 9 1 1 ...
  values : chr [1:197] "5d64d58a" "ac76183b" "202fbcc4" "78087f5e" ...

So maybe as soon as you got to a bundle that was greater than 1/2 the  
overall length (as happened in the x case) you could stop, since it  
could not have occurred before.
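
That stopping rule as a one-line check (a sketch; x is the hash series
discussed above):

r <- rle(x)
any(r$lengths > length(x) / 2)   # TRUE here: the 1159-run exceeds half of ~2000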


--
David.



Cheers

Lorenzo


On 10/09/2010 01:30 AM, David Winsemius wrote:


What puzzles me is that the list is not really long (less than 2000
entries) and I have not experienced the same problem even with longer
lists.


But maybe your loop terminated in them earlier? Someplace between
11*225 and 11*240 the grepping machine gives up:

> eprs <- paste(rep("aa", 225), collapse="#")
> grepl(eprs, eprs)
[1] TRUE

> eprs <- paste(rep("aa", 240), collapse="#")
> grepl(eprs, eprs)
Error in grepl(eprs, eprs) :
invalid regular expression
'aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#a

In addition: Warning message:
In grepl(eprs, eprs) : regcomp error: 'Out of memory'

The complexity of the problem may depend on the distribution of values.
You have a very skewed distribution with the vast majority being in the
same value as appeared in your error message:

> table(x)
x
 12653a6 202fbcc4 48bef8c3 4e084ddc 51f342a4 5d64d58a 78087f5e abddf3d1
    1419      299        1        1        1        3        1        1
ac76183b b955be36 c600173a e96f6bbd e9c56275
       1       30        5        1        9

And you have 1159 of them in one clump (which would seem to be somewhat
improbable under a random null hypothesis):

> max(rle(x)$lengths)
[1] 1159
> which(rle(x)$lengths == 1159)
[1] 123
> rle(x)$values[123]
[1] "12653a6"

Re: [R] Memory management in R

2010-10-09 Thread Lorenzo Isella



My suggestion is to explore other alternatives. (I will admit that I
don't yet fully understand the test that you are applying.)


Hi,
I am trying to partially implement the Lempel-Ziv compression algorithm.
The point is that compressibility and entropy of a time series are 
related, hence my final goal is to evaluate the entropy of a time series.

You can find more at

http://bit.ly/93zX4T
http://en.wikipedia.org/wiki/LZ77_and_LZ78
http://bit.ly/9NgIFt




The two that

have occurred to me are Biostrings which I have already mentioned and
rle() which I have illustrated the use of but not referenced as an
avenue. The Biostrings package is part of Bioconductor (part of the R
universe) although you should be prepared for a coffee break when you
install it if you haven't gotten at least biocLite already installed.
When I installed it last night it had 54 other package dependents also
downloaded and installed. It seems to me that taking advantage of the
coding resources in the molecular biology domain that are currently
directed at decoding the information storage mechanism of life might be
a smart strategy. You have not described the domain you are working in
but I would guess that the digest package might be biological in
primary application? So forgive me if I am preaching to the choir.

The rle option also occurred to me but it might take a smarter coder
than I to fully implement it. (But maybe Holtman would be up to it. He's
a _lot_ smarter than I.) In your example the long x string is
faithfully represented by two aligned vectors, each 197 elements
long. The long repeat sequence that broke the grepl mechanism is just
one pair of values.

> rle(x)
Run Length Encoding
  lengths: int [1:197] 1 1 2 1 1 4 1 9 1 1 ...
  values : chr [1:197] "5d64d58a" "ac76183b" "202fbcc4" "78087f5e" ...

So maybe as soon as you got to a bundle that was greater than 1/2 the
overall length (as happened in the x case) you could stop, since it
could not have occurred before.



I doubt that rle() can be deployed to replace the Lempel-Ziv (LZ) algorithm 
in a trivial way. As a less convoluted example, consider the series


x <- c("d","a","b","d","a","b","e","z")

If i=4 and therefore the i-th element is the second 'd' in the series, 
the shortest series starting from i=4 that I do not see in the past of 
'd' is


"d","a","b","e", whose length is equal to 4 and that is the value 
returned by the function below.
The frustrating thing is that I already have the tools I need; they just 
crash, for reasons beyond my control, on relatively short series.
If anyone can make the function below more robust, that would really be a big 
help for me.

Cheers

Lorenzo

###
entropy_lz <- function(x,i){

past <- x[1:(i-1)]

n <- length(x)

lp <- length(past)

future <- x[i:n]

go_on <- 1

count_len <- 0

past_string <- paste(past, collapse="#")

while (go_on > 0){

new_seq <- x[i:(i+count_len)]

fut_string <- paste(new_seq, collapse="#")

count_len <- count_len+1

if (grepl(fut_string, past_string)!=1){

go_on <- -1

}
}
return(count_len)

}

x <- c("c","a","b","c","a","b","e","z")

S <- entropy_lz(x,4)
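
A regex-free variant of the same idea, as a sketch (my rewrite rather
than a drop-in fix): it compares raw vectors instead of '#'-joined
strings, so pathological repeats cannot overwhelm the regex compiler
and a match can never straddle a '#' boundary.  Like the original, it
assumes an unseen substring exists before the series runs out.

contains <- function(vec, pat) {
  np <- length(pat); nv <- length(vec)
  if (np > nv) return(FALSE)
  for (s in 1:(nv - np + 1))
    if (all(vec[s:(s + np - 1)] == pat)) return(TRUE)
  FALSE
}

entropy_lz2 <- function(x, i) {
  past <- x[seq_len(i - 1)]
  len <- 1
  while (contains(past, x[i:(i + len - 1)])) len <- len + 1
  len                   # length of the shortest unseen future substring
}

entropy_lz2(c("c","a","b","c","a","b","e","z"), 4)   # 4, as above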



Re: [R] Memory management in R

2010-10-09 Thread David Winsemius


On Oct 9, 2010, at 4:23 PM, Lorenzo Isella wrote:




My suggestion is to explore other alternatives. (I will admit that I
don't yet fully understand the test that you are applying.)


Hi,
I am trying to partially implement the Lempel-Ziv compression  
algorithm.
The point is that compressibility and entropy of a time series are  
related, hence my final goal is to evaluate the entropy of a time  
series.

You can find more at

http://bit.ly/93zX4T
http://en.wikipedia.org/wiki/LZ77_and_LZ78
http://bit.ly/9NgIFt




The two that

have occurred to me are Biostrings which I have already mentioned and
rle() which I have illustrated the use of but not referenced as an
avenue. The Biostrings package is part of Bioconductor (part of the R
universe) although you should be prepared for a coffee break when you
install it if you haven't gotten at least biocLite already installed.
When I installed it last night it had 54 other package dependents  
also

downloaded and installed. It seems to me that taking advantage of the
coding resources in the molecular biology domain that are currently
directed at decoding the information storage mechanism of life  
might be
a smart strategy. You have not described the domain you are working  
in

but I would guess that the digest package might be biological in
primary application? So forgive me if I am preaching to the choir.

The rle option also occurred to me but it might take a smarter coder
than I to fully implement it. (But maybe Holtman would be up to it.  
He's

a _lot_ smarter than I.) In your example the long x string is
faithfully represented by two aligned vectors, each 197 elements
long. The long repeat sequence that broke the grepl mechanism is
just one pair of values.

> rle(x)
Run Length Encoding
  lengths: int [1:197] 1 1 2 1 1 4 1 9 1 1 ...
  values : chr [1:197] "5d64d58a" "ac76183b" "202fbcc4" "78087f5e" ...

So maybe as soon as you got to a bundle that was greater than 1/2 the
overall length (as happened in the x case) you could stop, since it
could not have occurred before.



I doubt that rle() can be deployed to replace the Lempel-Ziv (LZ)  
algorithm in a trivial way. As a less convoluted example, consider  
the series


x <- c("d","a","b","d","a","b","e","z")

If i=4 and therefore the i-th element is the second 'd' in the  
series, the shortest series starting from i=4 that I do not see in  
the past of 'd' is


"d","a","b","e", whose length is equal to 4 and that is the value  
returned by the function below.
The frustrating thing is that I already have the tools I need; they  
just crash, for reasons beyond my control, on relatively short series.
If anyone can make the function below more robust, that would really be a  
big help for me.


I already offered the Biostrings package. It provides more robust  
methods for string matching than does grepl. Is there a reason that  
you choose not to?


--
David.

Cheers

Lorenzo

###
entropy_lz <- function(x,i){

past <- x[1:(i-1)]

n <- length(x)

lp <- length(past)

future <- x[i:n]

go_on <- 1

count_len <- 0

past_string <- paste(past, collapse="#")

while (go_on > 0){

new_seq <- x[i:(i+count_len)]

fut_string <- paste(new_seq, collapse="#")

count_len <- count_len+1

if (grepl(fut_string, past_string)!=1){

go_on <- -1

}
}
return(count_len)

}

x <- c("c","a","b","c","a","b","e","z")

S <- entropy_lz(x,4)


David Winsemius, MD
West Hartford, CT



[R] Memory management in R

2010-10-08 Thread Lorenzo Isella

Dear All,
I am experiencing some problems with a script of mine.
It crashes with this message

Error in grepl(fut_string, past_string) :
  invalid regular expression 
'12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12

Calls: entropy_estimate_hash -> total_entropy_lz -> entropy_lz -> grepl
In addition: Warning message:
In grepl(fut_string, past_string) : regcomp error:  'Out of memory'
Execution halted

To make a long story short, I use some functions which eventually call 
grepl on very long strings to check whether a certain substring is part 
of a longer string.
Now, the script technically works (it never crashes when I run it on a 
smaller dataset) and the problem does not seem to be RAM memory (I have 
several GB of RAM on my machine and its consumption never shoots up so 
my machine never resorts to swap memory).
So (though I am not an expert) it looks like the problem is some 
limitation of grepl or R memory management.
Any idea about how I could tackle this problem or how I can profile my 
code to fix it (though it really seems to me that I have to find a way 
to allow R to process longer strings).

Any suggestion is appreciated.
Cheers

Lorenzo

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Memory management in R

2010-10-08 Thread Lorenzo Isella

On 10/08/2010 07:25 PM, Doran, Harold wrote:

These questions are OS-specific. Please provide sessionInfo() or other details 
as needed




I see. I am running R on a 64 bit machine running Ubuntu 10.04

> sessionInfo()
R version 2.11.1 (2010-05-31)
x86_64-pc-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C  LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8   LC_NAME=C
 [9] LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base


and in case it matters, this is the output of my top command

$ top

top - 19:28:21 up  8:04,  8 users,  load average: 0.60, 0.72, 1.33
Tasks: 220 total,   1 running, 219 sleeping,   0 stopped,   0 zombie
Cpu(s): 10.3%us,  0.6%sy,  0.0%ni, 87.2%id,  1.9%wa,  0.0%hi,  0.0%si, 
0.0%st

Mem:   6110484k total,  3847008k used,  2263476k free,72748k buffers
Swap:  2929656k total,0k used,  2929656k free,  2621420k cached

Cheers

Lorenzo


-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
Sent: Friday, October 08, 2010 1:12 PM
To: r-help
Subject: [R] Memory management in R

Dear All,
I am experiencing some problems with a script of mine.
It crashes with this message

Error in grepl(fut_string, past_string) :
invalid regular expression
'12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12
Calls: entropy_estimate_hash -> total_entropy_lz -> entropy_lz -> grepl
In addition: Warning message:
In grepl(fut_string, past_string) : regcomp error:  'Out of memory'
Execution halted

To make a long story short, I use some functions which eventually call
grepl on very long strings to check whether a certain substring is part
of a longer string.
Now, the script technically works (it never crashes when I run it on a
smaller dataset) and the problem does not seem to be RAM memory (I have
several GB of RAM on my machine and its consumption never shoots up so
my machine never resorts to swap memory).
So (though I am not an expert) it looks like the problem is some
limitation of grepl or R memory management.
Any idea about how I could tackle this problem or how I can profile my
code to fix it (though it really seems to me that I have to find a way
to allow R to process longer strings).
Any suggestion is appreciated.
Cheers

Lorenzo

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Memory management in R

2010-10-08 Thread Doran, Harold
These questions are OS-specific. Please provide sessionInfo() or other details 
as needed

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of Lorenzo Isella
Sent: Friday, October 08, 2010 1:12 PM
To: r-help
Subject: [R] Memory management in R

Dear All,
I am experiencing some problems with a script of mine.
It crashes with this message

Error in grepl(fut_string, past_string) :
   invalid regular expression 
'12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12
Calls: entropy_estimate_hash -> total_entropy_lz -> entropy_lz -> grepl
In addition: Warning message:
In grepl(fut_string, past_string) : regcomp error:  'Out of memory'
Execution halted

To make a long story short, I use some functions which eventually call 
grepl on very long strings to check whether a certain substring is part 
of a longer string.
Now, the script technically works (it never crashes when I run it on a 
smaller dataset) and the problem does not seem to be RAM memory (I have 
several GB of RAM on my machine and its consumption never shoots up so 
my machine never resorts to swap memory).
So (though I am not an expert) it looks like the problem is some 
limitation of grepl or R memory management.
Any idea about how I could tackle this problem or how I can profile my 
code to fix it (though it really seems to me that I have to find a way 
to allow R to process longer strings).
Any suggestion is appreciated.
Cheers

Lorenzo

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Memory management in R

2010-10-08 Thread jim holtman
More specificity: how long is the string, and what is the pattern you are
matching against?  It sounds like you might have a complex pattern that,
in trying to match the string, might be doing a lot of backtracking and
such.  There is an O'Reilly book, Mastering Regular Expressions, that
might help you understand what might be happening.  So if you can provide
a better example than just the error message, it would be helpful.

On Fri, Oct 8, 2010 at 1:11 PM, Lorenzo Isella lorenzo.ise...@gmail.com wrote:
 Dear All,
 I am experiencing some problems with a script of mine.
 It crashes with this message

 Error in grepl(fut_string, past_string) :
  invalid regular expression
 '12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12
 Calls: entropy_estimate_hash -> total_entropy_lz -> entropy_lz -> grepl
 In addition: Warning message:
 In grepl(fut_string, past_string) : regcomp error:  'Out of memory'
 Execution halted

 To make a long story short, I use some functions which eventually call grepl
 on very long strings to check whether a certain substring is part of a
 longer string.
 Now, the script technically works (it never crashes when I run it on a
 smaller dataset) and the problem does not seem to be RAM memory (I have
 several GB of RAM on my machine and its consumption never shoots up so my
 machine never resorts to swap memory).
 So (though I am not an expert) it looks like the problem is some limitation
 of grepl or R memory management.
 Any idea about how I could tackle this problem or how I can profile my code
 to fix it (though it really seems to me that I have to find a way to allow R
 to process longer strings).
 Any suggestion is appreciated.
 Cheers

 Lorenzo

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.




-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Memory management in R

2010-10-08 Thread Mike Marchywka







 Date: Fri, 8 Oct 2010 13:30:59 -0400
 From: jholt...@gmail.com
 To: lorenzo.ise...@gmail.com
 CC: r-help@r-project.org
 Subject: Re: [R] Memory management in R

 More specificity: how long is the string, and what is the pattern you are
 matching against? It sounds like you might have a complex pattern that,
 in trying to match the string, might be doing a lot of backtracking and
 such. There is an O'Reilly book, Mastering Regular Expressions, that
 might help you understand what might be happening. So if you can provide
 a better example than just the error message, it would be helpful.


This is possibly a stack issue. Error messages are often not literal; I
have seen "out of memory" for graphic device objects :) The regex suggests
a stack issue, but that would be a guess at the mechanism of death; what
you probably really want is a simpler regex :)





 On Fri, Oct 8, 2010 at 1:11 PM, Lorenzo Isella  wrote:
  Dear All,
  I am experiencing some problems with a script of mine.
  It crashes with this message
 
  Error in grepl(fut_string, past_string) :
   invalid regular expression
  '12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12
  Calls: entropy_estimate_hash -> total_entropy_lz -> entropy_lz -> grepl
  In addition: Warning message:
  In grepl(fut_string, past_string) : regcomp error:  'Out of memory'
  Execution halted
 
  To make a long story short, I use some functions which eventually call grepl
  on very long strings to check whether a certain substring is part of a
  longer string.
  Now, the script technically works (it never crashes when I run it on a
  smaller dataset) and the problem does not seem to be RAM memory (I have
  several GB of RAM on my machine and its consumption never shoots up so my
  machine never resorts to swap memory).
  So (though I am not an expert) it looks like the problem is some limitation
  of grepl or R memory management.
  Any idea about how I could tackle this problem or how I can profile my code
  to fix it (though it really seems to me that I have to find a way to allow R
  to process longer strings).
  Any suggestion is appreciated.
  Cheers
 
  Lorenzo
 
  __
  R-help@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
  and provide commented, minimal, self-contained, reproducible code.
 



 --
 Jim Holtman
 Cincinnati, OH
 +1 513 646 9390

 What is the problem that you are trying to solve?

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
  
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Memory management in R

2010-10-08 Thread Lorenzo Isella

Thanks for lending a helping hand.
I put together a self-contained example. Basically, it all relies on a 
couple of functions, where one function simply iterates the application 
of the other function.
I am trying to implement the so-called Lempel-Ziv entropy estimator. The 
idea is to choose a position i along a string x (standing for a time 
series) and find the length of the shortest string starting from i which 
has never occurred before i.
Please find below the R snippet which requires an input file (a simple 
text file) you can download from


http://dl.dropbox.com/u/5685598/time_series25_.dat

What puzzles me is that the list is not really long (less than 2000 
entries) and I have not experienced the same problem even with longer lists.

Many thanks

Lorenzo

##


total_entropy_lz <- function(x) {
    if (length(x) == 1) {
        print("sequence too short")
        return("error")
    } else {
        n <- length(x)
        prefactor <- 1/(n*log(n)/log(2))     # 1/(n*log2(n))
        n_seq <- seq(n)
        entropy_list <- n_seq
        for (i in n_seq) {
            entropy_list[i] <- entropy_lz(x, i)
        }
    }
    total_entropy <- 1/(prefactor*sum(entropy_list))
    return(total_entropy)
}


entropy_lz <- function(x, i) {
    past <- x[seq_len(i - 1)]                # everything strictly before position i
    n <- length(x)
    lp <- length(past)
    future <- x[i:n]
    go_on <- 1
    count_len <- 0
    past_string <- paste(past, collapse = "#")
    while (go_on > 0) {
        new_seq <- x[i:(i + count_len)]
        fut_string <- paste(new_seq, collapse = "#")
        count_len <- count_len + 1
        if (!grepl(fut_string, past_string)) {   # candidate string never seen in the past
            go_on <- -1
        }
    }
    return(count_len)
}

x <- scan("time_series25_.dat", what = "")

S <- total_entropy_lz(x)
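
(For comparison, a minimal sketch of the same match-length computation
done directly on the token vector, with no paste()/grepl() and hence no
regex at all; entropy_lz_tokens is a hypothetical name, not part of the
script above. Note it matches whole tokens, which is arguably what the
estimator intends, whereas substring search on the '#'-joined string can
also match inside tokens:)

entropy_lz_tokens <- function(x, i) {
  n <- length(x)
  len <- 0
  repeat {
    len <- len + 1
    if (i + len - 1 > n) break                 # ran off the end of the series
    pat <- x[i:(i + len - 1)]
    found <- FALSE
    if (i > len) {                             # the past must hold a window this long
      for (j in 1:(i - len)) {
        if (all(x[j:(j + len - 1)] == pat)) { found <- TRUE; break }
      }
    }
    if (!found) break                          # shortest never-seen string reached
  }
  len
}

entropy_lz_tokens(c("c","a","b","c","a","b","e","z"), 4)   # 4, as in the earlier example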






On 10/08/2010 07:30 PM, jim holtman wrote:

More specificity: how long is the string, and what is the pattern you are
matching against?  It sounds like you might have a complex pattern that,
in trying to match the string, might be doing a lot of backtracking and
such.  There is an O'Reilly book, Mastering Regular Expressions, that
might help you understand what might be happening.  So if you can provide
a better example than just the error message, it would be helpful.

On Fri, Oct 8, 2010 at 1:11 PM, Lorenzo Isellalorenzo.ise...@gmail.com  wrote:

Dear All,
I am experiencing some problems with a script of mine.
It crashes with this message

Error in grepl(fut_string, past_string) :
  invalid regular expression
'12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12
Calls: entropy_estimate_hash -> total_entropy_lz -> entropy_lz -> grepl
In addition: Warning message:
In grepl(fut_string, past_string) : regcomp error:  'Out of memory'
Execution halted

To make a long story short, I use some functions which eventually call grepl
on very long strings to check whether a certain substring is part of a
longer string.
Now, the script technically works (it never crashes when I run it on a
smaller dataset) and the problem does not seem to be RAM memory (I have
several GB of RAM on my machine and its consumption never shoots up so my
machine never resorts to swap memory).
So (though I am not an expert) it looks like the problem is some limitation
of grepl or R memory management.
Any idea about how I could tackle this problem or how I can profile my code
to fix it (though it really seems to me that I have to find a way to allow R
to process longer strings).
Any suggestion is appreciated.
Cheers

Lorenzo

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.







__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Memory management in R

2010-10-08 Thread David Winsemius
#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12
Calls: entropy_estimate_hash -> total_entropy_lz -> entropy_lz -> grepl

In addition: Warning message:
In grepl(fut_string, past_string) : regcomp error:  'Out of memory'
Execution halted

To make a long story short, I use some functions which eventually call
grepl on very long strings to check whether a certain substring is part of
a longer string.
Now, the script technically works (it never crashes when I run it on a
smaller dataset) and the problem does not seem to be RAM memory (I have
several GB of RAM on my machine and its consumption never shoots up so my
machine never resorts to swap memory).
So (though I am not an expert) it looks like the problem is some
limitation of grepl or R memory management.
Any idea about how I could tackle this problem or how I can profile my
code to fix it (though it really seems to me that I have to find a way to
allow R to process longer strings).
Any suggestion is appreciated.
Cheers

Lorenzo

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.







__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


David Winsemius, MD
West Hartford, CT

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Memory management in R

2010-10-08 Thread Mike Marchywka








 From: dwinsem...@comcast.net
 To: lorenzo.ise...@gmail.com
 Date: Fri, 8 Oct 2010 19:30:45 -0400
 CC: r-help@r-project.org
 Subject: Re: [R] Memory management in R


 On Oct 8, 2010, at 6:42 PM, Lorenzo Isella wrote:


  Please find below the R snippet which requires an input file (a
  simple text file) you can download from
 
  http://dl.dropbox.com/u/5685598/time_series25_.dat
 
  What puzzles me is that the list is not really long (less than 2000
  entries) and I have not experienced the same problem even with
  longer lists.

 But maybe your loop terminated on them earlier? Someplace between
 11*225 and 11*240 the grepping machine gives up:

 > eprs <- paste(rep("aa", 225), collapse="#")
 > grepl(eprs, eprs)
 [1] TRUE

 > eprs <- paste(rep("aa", 240), collapse="#")
 > grepl(eprs, eprs)
 Error in grepl(eprs, eprs) :
 invalid regular expression
 'aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#a
 In addition: Warning message:
 In grepl(eprs, eprs) : regcomp error: 'Out of memory'

 The complexity of the problem may depend on the distribution of
 values. You have a very skewed distribution with the vast majority
 being in the same value as appeared in your error message :



 HTH (although I think it means you need to construct a different
 implementation strategy);

You really need to look at the question posed by your regex and consider 
the complexity of what you are asking and what likely implementations
would do with your regex. Something like this probably needs to be implemented
in dedicated code to handle the more general case or you need to determine
if input data is pathological given your regex. Being able to write something
concisely doesn't mean the execution of that something is simple. Even if
it does manage to return a result, it likely will get very slow. In the
past I have had to write my own simple regex compilers to handle a limited
class of expressions to make the speed reasonable. In this case, depending
on your objectives, dedicated code may even be helpful to you in understanding
the algorithm. 


 David.


  Many thanks
 
  Lorenzo
 

  
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Memory management in R

2010-10-08 Thread David Winsemius


On Oct 8, 2010, at 9:19 PM, Mike Marchywka wrote:



From: dwinsem...@comcast.net
To: lorenzo.ise...@gmail.com
Date: Fri, 8 Oct 2010 19:30:45 -0400
CC: r-help@r-project.org
Subject: Re: [R] Memory management in R


On Oct 8, 2010, at 6:42 PM, Lorenzo Isella wrote:




Please find below the R snippet which requires an input file (a
simple text file) you can download from

http://dl.dropbox.com/u/5685598/time_series25_.dat

What puzzles me is that the list is not really long (less than 2000
entries) and I have not experienced the same problem even with
longer lists.


But maybe your loop terminated on them earlier? Someplace between
11*225 and 11*240 the grepping machine gives up:

> eprs <- paste(rep("aa", 225), collapse="#")
> grepl(eprs, eprs)
[1] TRUE

> eprs <- paste(rep("aa", 240), collapse="#")
> grepl(eprs, eprs)

Error in grepl(eprs, eprs) :
invalid regular expression
'aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#a
In addition: Warning message:
In grepl(eprs, eprs) : regcomp error: 'Out of memory'

The complexity of the problem may depend on the distribution of
values. You have a very skewed distribution with the vast majority
being in the same value as appeared in your error message :





HTH (although I think it means you need to construct a different
implementation strategy);


You really need to look at the question posed by your regex and consider
the complexity of what you are asking and what likely implementations
would do with your regex.


The R regex machine (at least on a Mac with R 2.11.1) breaks when the
length of the pattern argument exceeds 2559 characters. There is no
complexity for the regex parser here. No metacharacters were in the
string.
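
(A minimal sketch for locating that limit on a given build, assuming the
failure is monotone in pattern length; other builds may report a
different number, or not fail at all:)

compiles <- function(k) {                # does a k-character literal pattern compile?
  p <- paste(rep("a", k), collapse = "")
  isTRUE(tryCatch(grepl(p, p), error = identity, warning = identity))
}
lo <- 1; hi <- 100000                    # assumed bracket around the breaking point
while (hi - lo > 1) {                    # binary search
  mid <- (lo + hi) %/% 2
  if (compiles(mid)) lo <- mid else hi <- mid
}
lo                                       # longest literal pattern that still compiles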



Something like this probably needs to be implemented in dedicated code to
handle the more general case, or you need to determine if input data is
pathological given your regex.


There is a Biostrings package in BioC that may provide more robust  
treatment of long strings.
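
(A minimal sketch, assuming Biostrings is installed; matchPattern() and
countPattern() do exact matching, so no regex is compiled no matter how
long the strings get:)

library(Biostrings)
past <- BString(paste(rep("12653a6", 5000), collapse = "#"))
countPattern("12653a6#12653a6", past) > 0   # TRUE: the substring is present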


--
David.



Being able to write something concisely doesn't mean the execution of
that something is simple. Even if it does manage to return a result, it
likely will get very slow. In the past I have had to write my own simple
regex compilers to handle a limited class of expressions to make the
speed reasonable. In this case, depending on your objectives, dedicated
code may even be helpful to you in understanding the algorithm.



David.



Many thanks

Lorenzo






David Winsemius, MD
West Hartford, CT

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] memory management in R

2010-06-16 Thread john


I have volunteered to give a short talk on memory management in R to my
local R user group, mainly to motivate myself to learn about it.

The focus will be on what a typical R coder might want to know (e.g. how
objects are created, call by value, basics of garbage collection) but I
want to go a little deeper just in case there are some advanced users in
the crowd.

Here are the resources I am using right now:
  Chambers' book "Software for Data Analysis"
  Manuals such as "R Internals" and "Writing R Extensions"

Any suggestions on other sources of information?

There are still some things that are not clear to me, such as
 - how to make sense of the output from various memory diagnostics such as
memory.profile ... are these counts?
 - how to get the amount of memory used: gc() and memory.size() seem to
differ
 - what gets allocated on the heap versus the stack
 - why the name "cons cells" for the stack allocation

Any help with these would be greatly appreciated.

Thanks greatly, 

John Muller

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] memory management in R

2010-06-16 Thread Jens Oehlschlägel
You might want to mention/talk about packages that enhance R's ability to work 
with less RAM / more data, such as package SOAR (transparently moving objects 
between RAM and disk) and ff (which allows vectors and dataframes larger than 
RAM and which supports dense datatypes like true boolean, short integers etc.). 
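
(A minimal sketch of the ff idea, assuming the package is installed: the
vector lives in a memory-mapped file, so only the pages actually touched
occupy RAM:)

library(ff)
x <- ff(vmode = "double", length = 1e8)  # ~800 MB on disk, very little in RAM
x[1:3] <- rnorm(3)                       # read and write by ordinary subscripting
x[1:3]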

Jens Oehlschlägel



-Original Message-
From: john mull...@fastmail.fm
Sent: Jun 16, 2010 12:20:17 PM
To: r-help@r-project.org
Subject: [R] memory management in R



I have volunteered to give a short talk on memory management in R to my
local R user group, mainly to motivate myself to learn about it.

The focus will be on what a typical R coder might want to know (e.g. how
objects are created, call by value, basics of garbage collection) but I
want to go a little deeper just in case there are some advanced users in
the crowd.

Here are the resources I am using right now:
  Chambers' book "Software for Data Analysis"
  Manuals such as "R Internals" and "Writing R Extensions"

Any suggestions on other sources of information?

There are still some things that are not clear to me, such as
 - how to make sense of the output from various memory diagnostics such as
memory.profile ... are these counts?
 - how to get the amount of memory used: gc() and memory.size() seem to
differ
 - what gets allocated on the heap versus the stack
 - why the name "cons cells" for the stack allocation

Any help with these would be greatly appreciated.

Thanks greatly, 

John Muller

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] About R memory management?

2009-12-10 Thread Peng Yu
I'm wondering where I can find the detailed descriptions on R memory
management. Understanding this could help me understand the runtime of
R program. For example, depending on how memory is allocated (either
allocate a chunk of memory that is more than necessary for the current
use, or allocate the memory that is just enough for the current use),
the performance of the following program could be very different.
Could somebody let me know some good references?

unsorted_index=NULL
for(i in 1:100) {
  unsorted_index=c(unsorted_index, i)
}
unsorted_index

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] About R memory management?

2009-12-10 Thread Henrik Bengtsson
Related...

Rule of thumb:
Pre-allocate your object of the *correct* data type, if you know the
final dimensions.
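
(A minimal sketch of the difference, with n kept small enough that the
slow version still finishes:)

n <- 1e5
system.time({ v <- NULL;       for (i in 1:n) v <- c(v, i) })  # grows every pass: ~quadratic copying
system.time({ v <- integer(n); for (i in 1:n) v[i] <- i })     # pre-allocated: linear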

/Henrik

On Thu, Dec 10, 2009 at 8:26 AM, Peng Yu pengyu...@gmail.com wrote:
 I'm wondering where I can find the detailed descriptions on R memory
 management. Understanding this could help me understand the runtime of
 R program. For example, depending on how memory is allocated (either
 allocate a chunk of memory that is more than necessary for the current
 use, or allocate the memory that is just enough for the current use),
 the performance of the following program could be very different.
 Could somebody let me know some good references?

 unsorted_index=NULL
 for(i in 1:100) {
  unsorted_index=c(unsorted_index, i)
 }
 unsorted_index

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] About R memory management?

2009-12-10 Thread hadley wickham
For the case below, you don't need to know anything about how R
manages memory, but you do need to understand basic concepts of
algorithmic complexity.  You might find The Algorithm Design Manual,
http://www.amazon.com/dp/1848000693, a good start.

Hadley

On Thu, Dec 10, 2009 at 10:26 AM, Peng Yu pengyu...@gmail.com wrote:
 I'm wondering where I can find the detailed descriptions on R memory
 management. Understanding this could help me understand the runtime of
 R program. For example, depending on how memory is allocated (either
 allocate a chunk of memory that is more than necessary for the current
 use, or allocate the memory that is just enough for the current use),
 the performance of the following program could be very different.
 Could somebody let me know some good references?

 unsorted_index=NULL
 for(i in 1:100) {
  unsorted_index=c(unsorted_index, i)
 }
 unsorted_index

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.




-- 
http://had.co.nz/

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] About R memory management?

2009-12-10 Thread Peng Yu
I have a situation where I cannot predict the final result's dimension.

In C++, I believe that the valarray class can preallocate more memory
than is actually needed (maybe 2 times more). The runtime of a C++
equivalent (using append) of the R code would then still be C*n, where C
is a constant and n is the length of the vector, since doubling makes the
total copying cost 1 + 2 + 4 + ... + n < 2n. However, if it allocated
just enough memory each time, the run time would be C*n^2.

Based on your reply, I suspect that R doesn't allocate more memory than
is currently needed, right?

On Fri, Dec 11, 2009 at 11:22 AM, Henrik Bengtsson h...@stat.berkeley.edu 
wrote:
 Related...

 Rule of thumb:
 Pre-allocate your object of the *correct* data type, if you know the
 final dimensions.

 /Henrik

 On Thu, Dec 10, 2009 at 8:26 AM, Peng Yu pengyu...@gmail.com wrote:
 I'm wondering where I can find the detailed descriptions on R memory
 management. Understanding this could help me understand the runtime of
 R program. For example, depending on how memory is allocated (either
  allocate a chunk of memory that is more than necessary for the current
 use, or allocate the memory that is just enough for the current use),
 the performance of the following program could be very different.
 Could somebody let me know some good references?

 unsorted_index=NULL
 for(i in 1:100) {
  unsorted_index=c(unsorted_index, i)
 }
 unsorted_index

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] About R memory management?

2009-12-10 Thread jim holtman
If you really want to code like a C++ coder in R, then create your own
object and extend it when necessary:

# take a variation of this; preallocate and then extend when you reach a
# limit
x <- numeric(2)
for (i in 1:100){
    if (i > length(x)){
        # double the length (or whatever you want); each doubling copies x
        # once, so total copying stays proportional to the final length
        length(x) <- length(x) * 2
    }
    x[i] <- i
}

On Thu, Dec 10, 2009 at 11:30 AM, Peng Yu pengyu...@gmail.com wrote:

 I have a situation that I can not predict the final result's dimension.

 In C++, I believe that the class valarray could preallocate some
 memory than it is actually needed (maybe 2 times more). The runtime
 for a C++ equivalent (using append) to the R code would still be C*n,
 where C is a constant and n is the length of the vector. However, if
 it just allocate enough memory, the run time will be C*n^2.

 Based on your reply, I suspect that R doesn't allocate some memory
 than it is currently needed, right?

 On Fri, Dec 11, 2009 at 11:22 AM, Henrik Bengtsson h...@stat.berkeley.edu
 wrote:
  Related...
 
  Rule of thumb:
  Pre-allocate your object of the *correct* data type, if you know the
  final dimensions.
 
  /Henrik
 
  On Thu, Dec 10, 2009 at 8:26 AM, Peng Yu pengyu...@gmail.com wrote:
  I'm wondering where I can find the detailed descriptions on R memory
  management. Understanding this could help me understand the runtime of
  R program. For example, depending on how memory is allocated (either
   allocate a chunk of memory that is more than necessary for the current
  use, or allocate the memory that is just enough for the current use),
  the performance of the following program could be very different.
  Could somebody let me know some good references?
 
  unsorted_index=NULL
  for(i in 1:100) {
   unsorted_index=c(unsorted_index, i)
  }
  unsorted_index
 
  __
  R-help@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.htmlhttp://www.r-project.org/posting-guide.html
  and provide commented, minimal, self-contained, reproducible code.
 
 

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.htmlhttp://www.r-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.




-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] About R memory management?

2009-12-10 Thread Peng Yu
That was not my original question. My original question was how memory
is managed/allocated in R.

On Thu, Dec 10, 2009 at 6:08 PM, jim holtman jholt...@gmail.com wrote:
 If you really want to code like a C++ coder in R, then create your own
 object and extend it when necessary:

 # take a variation of this; preallocate and then extend when you reach a
 # limit
 x <- numeric(2)
 for (i in 1:100){
     if (i > length(x)){
         # double the length (or whatever you want)
         length(x) <- length(x) * 2
     }
     x[i] <- i
 }

 On Thu, Dec 10, 2009 at 11:30 AM, Peng Yu pengyu...@gmail.com wrote:

 I have a situation that I can not predict the final result's dimension.

 In C++, I believe that the class valarray could preallocate some
 memory than it is actually needed (maybe 2 times more). The runtime
 for a C++ equivalent (using append) to the R code would still be C*n,
 where C is a constant and n is the length of the vector. However, if
 it just allocate enough memory, the run time will be C*n^2.

 Based on your reply, I suspect that R doesn't allocate some memory
 than it is currently needed, right?

 On Fri, Dec 11, 2009 at 11:22 AM, Henrik Bengtsson h...@stat.berkeley.edu
 wrote:
  Related...
 
  Rule of thumb:
  Pre-allocate your object of the *correct* data type, if you know the
  final dimensions.
 
  /Henrik
 
  On Thu, Dec 10, 2009 at 8:26 AM, Peng Yu pengyu...@gmail.com wrote:
  I'm wondering where I can find the detailed descriptions on R memory
  management. Understanding this could help me understand the runtime of
  R program. For example, depending on how memory is allocated (either
   allocate a chunk of memory that is more than necessary for the current
  use, or allocate the memory that is just enough for the current use),
  the performance of the following program could be very different.
  Could somebody let me know some good references?
 
  unsorted_index=NULL
  for(i in 1:100) {
   unsorted_index=c(unsorted_index, i)
  }
  unsorted_index
 
  __
  R-help@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide
  http://www.R-project.org/posting-guide.html
  and provide commented, minimal, self-contained, reproducible code.
 
 

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.



 --
 Jim Holtman
 Cincinnati, OH
 +1 513 646 9390

 What is the problem that you are trying to solve?


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] FW: R memory management

2007-12-08 Thread Yuri Volchik
Hi,

 

I'm using R to collect data for a number of exchanges through a socket
connection, and am constantly running into memory problems even though the
task, I believe, is not that memory consuming. I guess there is a
miscommunication between R and WinXP about freeing up memory.

So this is the code:

 

for (x in 1:length(exchanges.to.get)) {

   tickers <- sqlQuery(channel,
                       paste("SELECT Symbol FROM symbols_list WHERE Exchange='",
                             exchanges.to.get[x], "';", sep=''))[,1]

   dir.create(paste(Working.dir, exchanges.to.get[x], '/', sep=''))

   for (y in 1:length(tickers)) {

     # open socket connection to get data
     con2 <- socketConnection(Sys.info()["nodename"], port = )

     writeLines(paste(command, ',', tickers[y], ',', interval, ';', sep=''), con2)

     data. <- readLines(con2)

     end.of.data <- sum(c(data. == "!ENDMSG!", data. == "!SYNTAX_ERROR!"))

     while (end.of.data != 1) {
       new.data <- readLines(con2)
       end.of.data <- sum(new.data == "!ENDMSG!")
       data. <- c(data., new.data)
     }

     if (length(data.) > 3)
       write.table(data.[1:(length(data.)-2)],
                   paste(Working.dir, exchanges.to.get[x], '/',
                         sub('\\*', '+', tickers[y]), '_.csv', sep=''),
                   quote=F, col.names=F, row.names=F)

     close(con2)
   }

   rm(tickers)
   gc()
}

 

 

With the command gcinfo(TRUE) I got the following info (some examples):

Garbage collection 16362 = 15411+754+197 (level 0) ...
6.3 Mbytes of cons cells used (22%)
2.2 Mbytes of vectors used (8%)

Garbage collection 16407 = 15454+756+197 (level 0) ...
13.1 Mbytes of cons cells used (46%)
10.4 Mbytes of vectors used (39%)

Garbage collection 16410 = 15456+756+198 (level 2) ...
4.9 Mbytes of cons cells used (21%)
0.9 Mbytes of vectors used (4%)

Garbage collection 16679 = 15634+796+249 (level 0) ...
150.7 Mbytes of cons cells used (95%)
203.9 Mbytes of vectors used (75%)

Garbage collection 16680 = 15634+796+250 (level 2) ...
4.9 Mbytes of cons cells used (4%)
0.9 Mbytes of vectors used (0%)

Garbage collection 16808 = 15754+802+252 (level 0) ...
6.1 Mbytes of cons cells used (7%)
1.8 Mbytes of vectors used (1%)

 

But the end result in Task Manager is:

RGui.exe  Mem Usage 470,472K  VM Size 541,988K

even though R reports

Garbage collection 16808 = 15754+802+252 (level 0) ...
6.1 Mbytes of cons cells used (7%)
1.8 Mbytes of vectors used (1%)

Has anybody encountered this problem, and how do you guys deal with it?  It
seems like a memory leak to me, as the tasks are not memory demanding; the
biggest amount of data in a single file is about 40MB.

 

Thanks


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] FW: R memory management

2007-12-08 Thread Patrick Burns
The line:

  data. <- c(data., new.data)

will eat both memory and time voraciously.

You should change it by creating 'data.' at the final size it will be and
then subscripting into it.  If you don't know the final size, then you can
grow it a lot a few times instead of growing it a little lots of times.
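
(An equivalent way to get that effect in this particular loop, sketched
with the original variable names: accumulate the chunks in a list and
flatten once at the end, so nothing is recopied on every read:)

chunks <- list(data.)                        # data. as first read from con2
while (end.of.data != 1) {
  new.data <- readLines(con2)
  end.of.data <- sum(new.data == "!ENDMSG!")
  chunks[[length(chunks) + 1L]] <- new.data  # appending to a list is cheap
}
data. <- unlist(chunks, use.names = FALSE)   # one final copy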


Patrick Burns
[EMAIL PROTECTED]
+44 (0)20 8525 0696
http://www.burns-stat.com
(home of S Poetry and A Guide for the Unwilling S User)

Yuri Volchik wrote:

Hi,

 

I'm using R to collect data for a number of exchanges through a socket
connection, and am constantly running into memory problems even though the
task, I believe, is not that memory consuming. I guess there is a
miscommunication between R and WinXP about freeing up memory.

So this is the code:

 

for (x in 1:length(exchanges.to.get)) {

   tickers <- sqlQuery(channel,
                       paste("SELECT Symbol FROM symbols_list WHERE Exchange='",
                             exchanges.to.get[x], "';", sep=''))[,1]

   dir.create(paste(Working.dir, exchanges.to.get[x], '/', sep=''))

   for (y in 1:length(tickers)) {

     # open socket connection to get data
     con2 <- socketConnection(Sys.info()["nodename"], port = )

     writeLines(paste(command, ',', tickers[y], ',', interval, ';', sep=''), con2)

     data. <- readLines(con2)

     end.of.data <- sum(c(data. == "!ENDMSG!", data. == "!SYNTAX_ERROR!"))

     while (end.of.data != 1) {
       new.data <- readLines(con2)
       end.of.data <- sum(new.data == "!ENDMSG!")
       data. <- c(data., new.data)
     }

     if (length(data.) > 3)
       write.table(data.[1:(length(data.)-2)],
                   paste(Working.dir, exchanges.to.get[x], '/',
                         sub('\\*', '+', tickers[y]), '_.csv', sep=''),
                   quote=F, col.names=F, row.names=F)

     close(con2)
   }

   rm(tickers)
   gc()
}

 

 

With the command gcinfo(TRUE) I got the following info (some examples):

Garbage collection 16362 = 15411+754+197 (level 0) ...
6.3 Mbytes of cons cells used (22%)
2.2 Mbytes of vectors used (8%)

Garbage collection 16407 = 15454+756+197 (level 0) ...
13.1 Mbytes of cons cells used (46%)
10.4 Mbytes of vectors used (39%)

Garbage collection 16410 = 15456+756+198 (level 2) ...
4.9 Mbytes of cons cells used (21%)
0.9 Mbytes of vectors used (4%)

Garbage collection 16679 = 15634+796+249 (level 0) ...
150.7 Mbytes of cons cells used (95%)
203.9 Mbytes of vectors used (75%)

Garbage collection 16680 = 15634+796+250 (level 2) ...
4.9 Mbytes of cons cells used (4%)
0.9 Mbytes of vectors used (0%)

Garbage collection 16808 = 15754+802+252 (level 0) ...
6.1 Mbytes of cons cells used (7%)
1.8 Mbytes of vectors used (1%)

 

But the end result in Task Manager is:

RGui.exe  Mem Usage 470,472K  VM Size 541,988K

even though R reports

Garbage collection 16808 = 15754+802+252 (level 0) ...
6.1 Mbytes of cons cells used (7%)
1.8 Mbytes of vectors used (1%)

Has anybody encountered this problem, and how do you guys deal with it?  It
seems like a memory leak to me, as the tasks are not memory demanding; the
biggest amount of data in a single file is about 40MB.

 

Thanks


   [[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


  


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Memory management

2007-09-15 Thread Takatsugu Kobayashi
Hi,

I apologize again for posting something not suitable on this list.

Basically, it sounds like I should put this large dataset into a
database... The dataset I have had trouble with is the transportation
network of the Chicago Consolidated Metropolitan Statistical Area. The
number of samples is about 7,200 points, and every point has outbound
and inbound traffic flows: volumes, times, distances, etc. So a quick
approximation of the number of rows would be
49,000,000 rows (and 249 columns).

This is a text file. I could work with a portion of the data at a time,
like nearest neighbors or pairs of points.

I used read.table('filename', header=F).. I should probably read some
bits of the data at a time instead of reading it all at once...

I am learning RSQLite and RMySQL. As Mr. Wan suggests, I will learn C a 
bit more.

Thank you very much.

TK

jim holtman wrote:
 When you say you can not import 4.8GB, is this the size of the text
 file that you are reading in?  If so, what is the structure of the
 file?  How are you reading in the file ('read.table', 'scan', etc).

 Do you really need all the data or can you work with a portion at a
 time?  If so, then consider putting the data in a database and
 retrieving the data as needed.  If all the data is in an object, how
 big do you think this object will be? (# rows, # columns, mode of the
 data).

 So you need to provide some more information as to the problem that
 you are trying to solve.

 On 9/15/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
   
 Hi,

 Let me apologize for this simple question.

 I use 64 bit R on my Fedora Core 6 Linux workstation. A 64 bit R has
 saved a lot of time. I am sure this is a lot to do with my memory
 limit, but I cannot import 4.8GB. My workstation has 8GB of RAM, an Athlon
 X2 5600, and a 1200W PSU. This PC configuration is the best I could get.

 I know a bit of C and Perl. Should I use C or Perl to manage this large
 dataset? or should I even go to 16GB RAM.

 Sorry for this silly question. But I appreciate if anyone could give me
 advice.

 Thank you very much.

 TK

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.

 




__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Memory management

2007-09-15 Thread jim holtman
If your data file has 49M rows and 249 columns, then if each column had
5 characters, you are looking at a text file of about 60GB.  If these
were all numerics (8 bytes per number), then you are looking at an R
object that would be almost 100GB.  If this is your data, then this is
definitely a candidate for a database, since you would need a fairly
large machine (at least 300GB of real memory).

You probably need to give some serious thought to how you want to
store your data and then what type of processing you need to do on it.
BTW, do you need all 249 columns, or could you work with just 3-4
columns at a time (this at least makes an R object of about 1.5GB,
which might be easier to handle)?
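
(Checking those back-of-the-envelope numbers:)

rows <- 49e6; cols <- 249
rows * cols * 5 / 1e9   # ~61 GB as text, at 5 characters per field
rows * cols * 8 / 1e9   # ~98 GB as a numeric R object
rows * 4 * 8 / 1e9      # ~1.6 GB holding only 4 columns at a time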

On 9/16/07, Takatsugu Kobayashi [EMAIL PROTECTED] wrote:
 Hi,

 I apologize again for posting something not suitable on this list.

 Basically, it sounds like I should put this large dataset into a
 database... The dataset I have had trouble with is the transportation
 network of the Chicago Consolidated Metropolitan Statistical Area. The
 number of samples is about 7,200 points, and every point has outbound
 and inbound traffic flows: volumes, times, distances, etc. So a quick
 approximation of the number of rows would be
 49,000,000 rows (and 249 columns).

 This is a text file. I could work with a portion of the data at a time,
 like nearest neighbors or pairs of points.

 I used read.table('filename', header=F).. I should probably read some
 bits of the data at a time instead of reading it all at once...

 I am learning RSQLite and RMySQL. As Mr. Wan suggests, I will learn C a
 bit more.

 Thank you very much.

 TK

 jim holtman wrote:
  When you say you can not import 4.8GB, is this the size of the text
  file that you are reading in?  If so, what is the structure of the
  file?  How are you reading in the file ('read.table', 'scan', etc).
 
  Do you really need all the data or can you work with a portion at a
  time?  If so, then consider putting the data in a database and
  retrieving the data as needed.  If all the data is in an object, how
  big do you think this object will be? (# rows, # columns, mode of the
  data).
 
  So you need to provide some more information as to the problem that
  you are trying to solve.
 
  On 9/15/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 
  Hi,
 
  Let me apologize for this simple question.
 
  I use 64 bit R on my Fedora Core 6 Linux workstation. A 64 bit R has
  saved a lot of time. I am sure this is a lot to do with my memory
  limit, but I cannot import 4.8GB. My workstation has 8GB of RAM, an Athlon
  X2 5600, and a 1200W PSU. This PC configuration is the best I could get.
 
  I know a bit of C and Perl. Should I use C or Perl to manage this large
  dataset? or should I even go to 16GB RAM.
 
  Sorry for this silly question. But I appreciate if anyone could give me
  advice.
 
  Thank you very much.
 
  TK
 
  __
  R-help@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide 
  http://www.R-project.org/posting-guide.html
  and provide commented, minimal, self-contained, reproducible code.
 
 
 
 
 




-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.