Re: [Rd] Large discrepancies in the same object being saved to .RData

2010-07-12 Thread Martin Maechler
>>>>> "DM" == Duncan Murdoch <murdoch.dun...@gmail.com>
>>>>>     on Sun, 11 Jul 2010 17:25:45 -0400 writes:

DM On 11/07/2010 1:30 PM, Prof Brian Ripley wrote:

[]

    >> On 7/10/2010 10:10 PM, bill.venab...@csiro.au wrote:
    >>> Well, I have answered one of my questions below.  The hidden
    >>> environment is attached to the 'terms' component of v1.

    >> Well, not really hidden.  A terms component is a formula
    >> (see ?terms.object), and a formula has an environment
    >> just as a closure does.  In neither case does the print()
    >> method tell you about it -- but ?formula does.

DM> I've just changed the default print method for formulas to display the
DM> environment if it is not globalenv(), which is the rule used for
DM> closures as well.  So now in R-devel:

DM> > as.formula("y ~ x")
DM> y ~ x

DM> as before, but

DM> > as.formula("y ~ x", env=new.env())
DM> y ~ x
DM> <environment: 01f83400>

I see that our print.formula() actually has not truly fulfilled
our own rule about print methods:

?print   has
  Description:
  
   ‘print’ prints its argument and returns it _invisibly_ 
   ..

Further, I completely agree that it's good to mention the
environment; however, it can be a nuisance when it's part of a
larger print(.) method, so I'd like to allow suppressing it,
and hence I've committed the current

print.formula <- function(x, showEnv = !identical(e, .GlobalEnv), ...)
{
    e <- environment(.x <- x) ## return(.) original x
    attr(x, ".Environment") <- NULL
    print.default(unclass(x), ...)
    if (showEnv) print(e)
    invisible(.x)
}
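
As a rough illustration of the showEnv argument described above, the following sketch prints the same formula with and without its environment line. It assumes an R version whose print method for formulas has this argument; the formula is wrapped in local() only to give it an environment other than the global one, and the variable names are made up for the example.

```r
# Build a formula whose environment is not the global environment.
fm <- local(y ~ x)

# Default: the environment line is printed because it is not .GlobalEnv.
with_env <- capture.output(print(fm))

# Suppressed: only the formula itself is printed.
without_env <- capture.output(print(fm, showEnv = FALSE))

print(with_env)
print(without_env)
```

A print method for a larger object could call print(x$terms, showEnv = FALSE) in the same way to keep its own output compact.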

--
Martin Maechler, ETH Zurich

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Large discrepancies in the same object being saved to .RData

2010-07-12 Thread Terry Therneau
 I only wish to add a request for further documentation of hidden
environments, their consequences, and how to turn them off.  Perhaps a
page in the Extending R guide, and a suggestion for book authors.

  I was bitten by this with the coxph frailty functions.  They are
called during the model frame creation and create a matrix object with
various small attached functions as attributes.  In creating the 'x'
columns they have to deal with factors and can create a huge transient
temporary matrix while doing so; something that will never be needed
again.  A user was exceeding disk quotas when he saved a model fit.

  As someone with years of experience with functional languages (which S
once was), I wasn't used to the idea that one would have to take
explicit --- and mysterious --- steps to make local variables go away.
This discussion has revealed that the hidden rules causing local
variables to be kept are more complex than I thought.  Perhaps a "don't
save environments" option to save could be added to help mere mortals
get rid of all this stuff in the attic (with its secret staircase)?
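
No such option to save is shown anywhere in this thread, so the following is only a sketch of a manual workaround: re-point the environments attached to a fitted model's formulas at the global environment before saving. The function names make_fit and serialized_bytes are hypothetical, and serialize() is used here merely as a stand-in for the size of what save() would write.

```r
# A model fitted inside a function that also holds a large, unrelated local.
make_fit <- function() {
  junk <- rnorm(1e6)   # large local that the fit does not need
  x <- 1:10
  y <- rnorm(10)
  lm(y ~ x)
}

# Hypothetical helper: number of bytes in the serialized object.
serialized_bytes <- function(obj) length(serialize(obj, NULL))

fit <- make_fit()
before <- serialized_bytes(fit)

# An lm object carries the formula environment in two places:
# the 'terms' component and the 'terms' attribute of the model frame.
environment(fit$terms) <- globalenv()
environment(attr(fit$model, "terms")) <- globalenv()

after <- serialized_bytes(fit)
print(c(before = before, after = after))
```

The trade-off is that anything which later looks up variables through the formula's environment (for example update() without a data argument) will now search the global environment instead of the function's frame.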

[[Soapbox on]] Environments have proven useful for many things, and
certainly aren't going away.  But to quote the bard: "Oh what tangled
webs we weave, when first we practice to deceive."

Terry Therneau



Re: [Rd] Large discrepancies in the same object being saved to .RData

2010-07-11 Thread Prof Brian Ripley

On Sun, 11 Jul 2010, Tony Plate wrote:

> Another way of seeing the environments referenced in an object is using
> str(), e.g.:
>
> > f1 <- function() {
> +   junk <- rnorm(10000000)
> +   x <- 1:3
> +   y <- rnorm(3)
> +   lm(y ~ x)
> + }
> > v1 <- f1()
> > object.size(f1)
> 1636 bytes
> > grep("Environment", capture.output(str(v1)), value=TRUE)
> [1] "  .. ..- attr(*, \".Environment\")=<environment: 0x01f11a30> "
> [2] "  .. .. ..- attr(*, \".Environment\")=<environment: 0x01f11a30> "



'Some of the environments in a few cases': remember environments have 
environments (and so on), and that namespaces and packages are also 
environments.  So we need to know about the environment of 
environment(v1$terms), which also gets saved (either as a reference or 
as an environment, depending on what it is).


And this approach does not work for many of the commonest cases:


> f <- function() {
+   x <- pi
+   g <- function() print(x)
+   return(g)
+ }
> g <- f()
> str(g)
function ()
 - attr(*, "source")= chr "function() print(x)"
> ls(environment(g))
[1] "g" "x"

In fact I think it works only for formulae.
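
For closures, where str() gives no hint, the enclosing environment can still be inspected directly. The sketch below lists what a closure's environment holds and how large each entry is; the variable names are illustrative only, and object.size() is applied to each entry individually rather than to the closure.

```r
f <- function() {
  x <- pi
  g <- function() print(x)
  g
}
g <- f()

# The frame of the call to f() survives as the environment of g.
env <- environment(g)
vars <- ls(env)

# Size in bytes of each object kept alive by that environment.
sizes <- vapply(vars,
                function(nm) as.numeric(object.size(get(nm, envir = env))),
                numeric(1))
print(sizes)
```

The same two steps work for a formula, since environment() accepts either kind of object.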


> -- Tony Plate
>
> On 7/10/2010 10:10 PM, bill.venab...@csiro.au wrote:
>> Well, I have answered one of my questions below.  The hidden
>> environment is attached to the 'terms' component of v1.

Well, not really hidden.  A terms component is a formula (see
?terms.object), and a formula has an environment just as a closure
does.  In neither case does the print() method tell you about it --
but ?formula does.



Re: [Rd] Large discrepancies in the same object being saved to .RData

2010-07-11 Thread Duncan Murdoch

On 11/07/2010 1:30 PM, Prof Brian Ripley wrote:


>> On 7/10/2010 10:10 PM, bill.venab...@csiro.au wrote:
>>> Well, I have answered one of my questions below.  The hidden
>>> environment is attached to the 'terms' component of v1.
>
> Well, not really hidden.  A terms component is a formula (see
> ?terms.object), and a formula has an environment just as a closure
> does.  In neither case does the print() method tell you about it --
> but ?formula does.




I've just changed the default print method for formulas to display the
environment if it is not globalenv(), which is the rule used for
closures as well.  So now in R-devel:

> as.formula("y ~ x")
y ~ x

as before, but

> as.formula("y ~ x", env=new.env())
y ~ x
<environment: 01f83400>

Duncan Murdoch



Re: [Rd] Large discrepancies in the same object being saved to .RData

2010-07-10 Thread Paul Johnson
On Wed, Jul 7, 2010 at 7:12 AM, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:
> On 06/07/2010 9:04 PM, julian.tay...@csiro.au wrote:
>> Hi developers,
>>
>> After some investigation I have found there can be large discrepancies in
>> the same object being saved as an external xx.RData file. The immediate
>> repercussion of this is the possible increased size of your .RData workspace
>> for no apparent reason.
>
> I haven't worked through your example, but in general the way that local
> objects get captured is when part of the return value includes an
> environment.

Hi, can I ask a follow up question?

Is there a tool to browse *.Rdata files without loading them into R?

In HDF5 (a data storage format we use sometimes), there is a CLI
program h5dump that will spit out line-by-line all the contents of a
storage entity.  It will literally track through all the metadata, all
the vectors of scores, etc.  I've found that handy to see what's
really  in there in cases like the one that OP asked about.
Sometimes, we find that there are things that are in there by
mistake, as Duncan describes, and then we can try to figure why they
are in there.

pj


-- 
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas



Re: [Rd] Large discrepancies in the same object being saved to .RData

2010-07-10 Thread Duncan Murdoch

On 10/07/2010 2:33 PM, Paul Johnson wrote:




Hi, can I ask a follow up question?

Is there a tool to browse *.Rdata files without loading them into R?
  


I don't know of one.  You can load the whole file into an empty
environment, but then you lose information about "where did it come from?"
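
A minimal sketch of the load-into-an-empty-environment approach follows. It still loads the file into R, so it is not the external browser being asked about; the file and object names are invented for the example, and the summary table is just one possible way to look at the result.

```r
# Create a small example file (object names are hypothetical).
rdata_file <- tempfile(fileext = ".RData")
scores <- c(1.5, 2.5, 3.5)
groups <- c("a", "b", "c")
save(scores, groups, file = rdata_file)

# Load into a fresh environment so nothing in the workspace is overwritten.
contents <- new.env()
loaded_names <- load(rdata_file, envir = contents)

# Summarise what the file contained: class and in-memory size of each object.
summary_table <- data.frame(
  name  = loaded_names,
  class = vapply(loaded_names,
                 function(nm) class(get(nm, envir = contents))[1],
                 character(1)),
  bytes = vapply(loaded_names,
                 function(nm) as.numeric(object.size(get(nm, envir = contents))),
                 numeric(1)),
  row.names = NULL
)
print(summary_table)
unlink(rdata_file)
```

Note that the bytes column comes from object.size(), so it shares the limitation discussed in this thread: contents of attached environments are not counted.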


Duncan Murdoch



Re: [Rd] Large discrepancies in the same object being saved to .RData

2010-07-10 Thread Bill.Venables
I'm still a bit puzzled by the original question.  I don't think it
has much to do with .RData files and their sizes.  For me the puzzle
comes much earlier.  Here is an example of what I mean using a little
session

> usedVcells <- function() gc()["Vcells", "used"]
> usedVcells()        ### the base load
[1] 96345

### Now look at what happens when a function returns a formula as the
### value, with a big item floating around in the function closure:

> f0 <- function() {
+   junk <- rnorm(10000000)
+   y ~ x
+ }
> v0 <- f0()
> usedVcells()   ### much bigger than base, why?
[1] 10096355
> v0             ### no obvious environment
y ~ x
> object.size(v0)  ### so far, no clue given where
                   ### the extra Vcells are located.
372 bytes

### Does v0 have an enclosing environment?

> environment(v0) ### yep.
<environment: 0x021cc538>
> ls(envir = environment(v0)) ### as expected, there's the junk
[1] "junk"
> rm(junk, envir = environment(v0))  ### this does the trick.
> usedVcells()
[1] 96355

### Now consider a second example where the object
### is not a formula, but contains one.

> f1 <- function() {
+   junk <- rnorm(10000000)
+   x <- 1:3
+   y <- rnorm(3)
+   lm(y ~ x)
+ }

> v1 <- f1()
> usedVcells()  ### as might have been expected.
[1] 10096455

### in this case, though, there is no
### (obvious) enclosing environment

> environment(v1)
NULL
> object.size(v1)  ### so where are the junk Vcells located?
7744 bytes
> ls(envir = environment(v1))  ### clearly will not work
Error in ls(envir = environment(v1)) : invalid 'envir' argument

> rm(v1) ### removing the object does clear out the junk.
> usedVcells()
[1] 96366
 

And in this second case, as noted by Julian Taylor, if you save() the
object the .RData file is also huge.  There is an environment attached
to the object somewhere, but it appears to be occluded and entirely
inaccessible.  (I have poked around the object components trying to
find the thing but without success.)

Have I missed something?

Bill Venables.



Re: [Rd] Large discrepancies in the same object being saved to .RData

2010-07-10 Thread Bill.Venables
Well, I have answered one of my questions below.  The hidden
environment is attached to the 'terms' component of v1.

To see this 

> lapply(v1, environment)
$coefficients
NULL

$residuals
NULL

$effects
NULL

$rank
NULL

$fitted.values
NULL

$assign
NULL

$qr
NULL

$df.residual
NULL

$xlevels
NULL

$call
NULL

$terms
<environment: 0x021b9e18>

$model
NULL

> rm(junk, envir = with(v1, environment(terms)))
> usedVcells()
[1] 96532
  

This is still a bit of a trap for young (and old!) players...

I think the main point in my mind is why is it that object.size()
excludes enclosing environments in its reckonings?
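
Since lapply(v1, environment) only looks at top-level components, a recursive search can help locate environments buried in attributes. The helper below, find_envs, is a hypothetical sketch and not part of R: it walks list components, attributes and closure enclosures, and reports the access path of each environment it meets without descending into the environment itself.

```r
# Hypothetical helper: return the access paths of every environment reachable
# through list components, attributes, or a closure's enclosure.
find_envs <- function(obj, path = "obj") {
  if (is.environment(obj)) return(path)
  found <- character()
  attrs <- attributes(obj)
  for (nm in names(attrs)) {
    found <- c(found,
               find_envs(attrs[[nm]], sprintf('attr(%s, "%s")', path, nm)))
  }
  if (is.function(obj)) {
    found <- c(found,
               find_envs(environment(obj), sprintf("environment(%s)", path)))
  }
  if (is.list(obj)) {
    items <- unclass(obj)   # avoid class-specific [[ and length methods
    nms <- names(items)
    for (i in seq_along(items)) {
      label <- if (!is.null(nms) && nzchar(nms[i])) {
        paste0(path, "$", nms[i])
      } else {
        sprintf("%s[[%d]]", path, i)
      }
      found <- c(found, find_envs(items[[i]], label))
    }
  }
  found
}

f1 <- function() {
  junk <- rnorm(1000)
  x <- 1:3
  y <- rnorm(3)
  lm(y ~ x)
}
v1 <- f1()
env_paths <- find_envs(v1, "v1")
print(env_paths)
```

Each reported path can be pasted back at the prompt, e.g. ls(attr(v1$terms, ".Environment")), to see what the environment is holding.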

Bill Venables.



Re: [Rd] Large discrepancies in the same object being saved to .RData

2010-07-10 Thread Duncan Murdoch

On 10/07/2010 10:10 PM, bill.venab...@csiro.au wrote:

  
 



This is still a bit of a trap for young (and old!) players...

I think the main point in my mind is why is it that object.size()
excludes enclosing environments in its reckonings?
  


I think the idea is that the environment is not part of the object, it 
is just referenced by the object. In fact, there are at least two 
references to the environment in your second example:


environment(v1$terms)

and

attr(v1$terms, ".Environment")

both refer to it. So you can't just add the size of an environment every 
time you come across it, you would need to keep track of whether it had 
already been counted or not. So as ?object.size says,


Associated space (e.g. the environment of a function and what the
pointer in a ‘EXTPTRSXP’ points to) is not included in the
calculation.

If you really want to know how much space an object will take when saved, 
probably the only reliable way is to save the object and look at how much space 
the file takes.
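
The save-and-look approach can be wrapped in a small function. This is only a sketch: saved_size and make_formula are hypothetical names, the file is written to a temporary location, and the result reflects save()'s default compression.

```r
# Hypothetical helper: size in bytes of the file that save() would write.
saved_size <- function(obj) {
  tmp <- tempfile(fileext = ".RData")
  on.exit(unlink(tmp))
  save(obj, file = tmp)
  file.size(tmp)
}

make_formula <- function() {
  junk <- rnorm(100000)   # large local that the formula does not need
  y ~ x
}

captured <- make_formula()            # environment still holds 'junk'
plain <- captured
environment(plain) <- globalenv()     # same formula, global environment

sizes <- c(object_size    = as.numeric(object.size(captured)),
           saved_captured = saved_size(captured),
           saved_plain    = saved_size(plain))
print(sizes)
```

Comparing object_size with saved_captured makes the discrepancy discussed in this thread visible for a single object.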
  


Duncan Murdoch



[Rd] Large discrepancies in the same object being saved to .RData

2010-07-07 Thread Julian.Taylor
Hi developers,



After some investigation I have found there can be large discrepancies in the 
same object being saved as an external xx.RData file. The immediate 
repercussion of this is the possible increased size of your .RData workspace 
for no apparent reason.



The function and its three scenarios below highlight these discrepancies. Note 
that the object being returned is exactly the same in each circumstance. The 
first scenario simply loops over a set of lm() models from a simulated set of 
data. The second adds a reasonably large matrix calculation within the loop. 
The third highlights exactly where the discrepancy lies. It appears that when 
the object is saved to an xx.RData it is still burdened, in some capacity, 
with the objects created in the function. Only deleting these objects at the 
end of the function ensures the realistic size of the returned object. 
Performing gc() after each of these short simulations shows that the Vcells 
that are accumulated in the function environment appear to remain after the 
function returns. These cached remains are then transferred to the .RData upon 
saving of the object(s). This is occurring quite broadly across the Windows 7 
(R 2.10.1) and 64 Bit Ubuntu Linux (R 2.9.0) systems that I use.



A similar problem was partially pointed out four years ago



http://tolstoy.newcastle.edu.au/R/help/06/03/24060.html



and has been made more obvious in the scenarios given below.



Admittedly I have had many problems with workspace .RData sizes over the years 
and it has taken me some time to realise what is actually occurring. Can 
someone enlighten myself and my colleagues as to why the objects created and 
evaluated in a function call stack are saved, in some capacity, with the 
returned object?



Cheers,

Julian



### small simulation from a clean directory

lmfunc <- function(loop = 20, add = FALSE, gr = FALSE){
  lmlist <- rmlist <- list()
  set.seed(100)
  dat <- data.frame(matrix(rnorm(100*100), ncol = 100))
  rm <- matrix(rnorm(100000), ncol = 1000)
  names(dat)[1] <- "y"
  i <- 1
  for(i in 1:loop) {
    lmlist[[i]] <- lm(y ~ ., data = dat)
    if(add)
      rmlist[[i]] <- rm
  }
  fm <- lmlist[[loop]]
  if(gr) {
    print(what <- ls(envir = sys.frame(which = 1)))
    remove(list = setdiff(what, "fm"))
  }
  fm
}

# baseline gc()

> gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 153325  4.1     350000  9.4   350000  9.4
Vcells  99228  0.8     786432  6.0   386446  3.0

## 1. simple lm() simulation

> lmtest1 <- lmfunc()
> gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 184470  5.0     407500 10.9   350000  9.4
Vcells 842169  6.5    1300721 10.0  1162577  8.9

> save(lmtest1, file = "lm1.RData")
> system("ls -s lm1.RData")
4312 lm1.RData

## A moderate increase in Vcells; .RData object around 4.5 Mb

## 2. add matrix calculation to loop

> lmtest2 <- lmfunc(add = TRUE)
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  209316  5.6     407500 10.9   405340 10.9
Vcells 3584244 27.4    4175939 31.9  3900869 29.8

> save(lmtest2, file = "lm2.RData")
> system("ls -s lm2.RData")
19324 lm2.RData

## An enormous increase in Vcells; .RData object is now 19Mb+

## 3. delete all objects in function call stack

> lmtest3 <- lmfunc(add = TRUE, gr = TRUE)
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  210766  5.7     467875 12.5   467875 12.5
Vcells 3615863 27.6    6933688 52.9  6898609 52.7

> save(lmtest3, file = "lm3.RData")
> system("ls -s lm3.RData")
320 lm3.RData

## A minimal increase in Vcells; .RData object is now 320Kb

> sapply(ls(pattern = "lmtest*"), function(x) object.size(get(x, envir =
+ .GlobalEnv)))
lmtest1 lmtest2 lmtest3
 358428  358428  358428

## all objects are deemed the same size by object.size()

# End sim

--
---
Dr. Julian Taylor phone: +61 8 8303 8792
Postdoctoral Fellow fax: +61 8 8303 8763
CMIS, CSIRO  mobile: +61 4 1638 8180
Private Mail Bag 2email: julian.tay...@csiro.au
Glen Osmond, SA, 5064
---




Re: [Rd] Large discrepancies in the same object being saved to .RData

2010-07-07 Thread Duncan Murdoch

On 06/07/2010 9:04 PM, julian.tay...@csiro.au wrote:




Admittedly I have had many problems with workspace .RData sizes over the years 
and it has taken me some time to realise what is actually occurring. Can 
someone enlighten myself and my colleagues as to why the objects created and 
evaluated in a function call stack are saved, in some capacity, with the 
returned object?
  


I haven't worked through your example, but in general the way that local 
objects get captured is when part of the return value includes an 
environment.  Examples of things that include an environment are locally 
created functions and formulas.  It's probably the latter that you're 
seeing.  When R computes the result of y ~ . or a similar formula, it 
attaches a pointer to the environment in which the calculation took 
place, so that later when the formula is used, it can look up y there.  
For example, in your line


lm(y ~ ., data = dat)


from your code, the formula y ~ . needs to be computed before R knows 
that you've explicitly listed a dataframe holding the data, and before 
it knows whether the variable y is in that dataframe or is just a local 
variable in the current function.


Since these are just pointers to the environment, this doesn't take up
much space in memory, but when you save the object to disk, a copy of
the whole environment will be made, and that can end up wasting a lot
of space if the environment contains a lot of things that aren't needed
by the formula.
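
The pointer-versus-copy distinction can be seen in a small sketch. The names make_formula and big are invented for the example, and the length of serialize()'s output is used as an approximation of what ends up on disk before compression.

```r
make_formula <- function() {
  big <- numeric(1e6)   # local object unrelated to the formula
  y ~ .
}
fm <- make_formula()

# The formula only points at the frame in which it was created ...
captured_names <- ls(environment(fm))

# ... so its in-memory size, as reported by object.size(), stays small,
in_memory <- as.numeric(object.size(fm))

# while serializing it has to write out the whole environment as well.
serialized <- length(serialize(fm, NULL))

print(captured_names)
print(c(in_memory = in_memory, serialized = serialized))
```

Removing big from environment(fm) before saving, as in the third scenario of the original post, brings the serialized size back down.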


Duncan Murdoch

