[Bioc-devel] writeVcf performance

2014-08-05 Thread Michael Lawrence
Hi guys (Val, Martin, Herve):

Anyone have an itch for optimization? The writeVcf function is currently a
bottleneck in our WGS genotyping pipeline. For a typical 50 million row
gVCF, it was taking 2.25 hours prior to yesterday's improvements
(pasteCollapseRows) that brought it down to about 1 hour, which is still
too long by my standards ( 0). Only takes 3 minutes to call the genotypes
(and associated likelihoods etc) from the variant calls (using 80 cores and
450 GB RAM on one node), so the output is an issue. Profiling suggests that
the running time scales non-linearly in the number of rows.

Digging a little deeper, it seems to be something with R's string/memory
allocation. Below, pasting 1 million strings takes 6 seconds, but 10
million strings takes over 2 minutes. It gets way worse with 50 million. I
suspect it has something to do with R's string hash table.

set.seed(1000)
end - sample(1e8, 1e6)
system.time(paste0(END, =, end))
   user  system elapsed
  6.396   0.028   6.420

end - sample(1e8, 1e7)
system.time(paste0(END, =, end))
   user  system elapsed
134.714   0.352 134.978

Indeed, even this takes a long time (in a fresh session):

set.seed(1000)
end - sample(1e8, 1e6)
end - sample(1e8, 1e7)
system.time(as.character(end))
   user  system elapsed
 57.224   0.156  57.366

But running it a second time is faster (about what one would expect?):

system.time(levels - as.character(end))
   user  system elapsed
 23.582   0.021  23.589

I did some simple profiling of R to find that the resizing of the string
hash table is not a significant component of the time. So maybe something
to do with the R heap/gc? No time right now to go deeper. But I know Martin
likes this sort of thing ;)

Michael

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Additional summarizeOverlaps counting modes for ChIP-Seq

2014-08-05 Thread Ryan C. Thompson

Hi Valerie,

I got really busy around May and never got a chance to thank you for 
adding this option to summarizeOverlaps! So thank you!


-Ryan

On Thu 01 May 2014 04:25:33 PM PDT, Valerie Obenchain wrote:

GenomicAlignments 1.1.8 has a 'preprocess.reads' argument. This should
be a function where the first argument is 'reads' and the return value
is compatible with 'reads' in the pre-defined count modes.

I've used your ResizeReads() as an example in the man page. I think
the ability to pre-filter and used a pre-defined mode will be useful.
Thanks for the suggestion.

Valerie


On 05/01/2014 02:29 PM, Valerie Obenchain wrote:

On 05/01/2014 02:05 PM, Ryan wrote:

Hi Valerie,

On Thu May 1 13:27:16 2014, Valerie Obenchain wrote:


I have some concerns about the *ExtraArgs() functions. Passing
flexible args to findOverlaps in the existing mode functions
fundamentally changes the documented behavior. The modes were created
to capture specific overlap situations pertaining to gene features
which are graphically depicted in the vignette. Changing 'maxgap' or
'minoverlap' will produce a variety of results inconsistent with past
behavior and difficult to document (e.g., under what circumstances
will IntersectionNotEmpty now register a hit).

Well, I wasn't so sure about those functions either. Obviously you can
pass arguments that break things. They were mostly designed to be
constructors for specific counting modes involving the
minoverlap/maxgap
arguments, but I decided I didn't need those modes after all. They're
certainly not designed to be exposed to the user. I haven't carefully
considered the interaction between the counting mode and
maxgap/minoverlap, but I believe that it would be roughly equivalent to
extending/shrinking the features/reads by the specified amount (with
some differences for e.g. a feature/read smaller than 2*minoverlap).
For
example, with a read length of 100 and a minoverlap of 10 in Union
counting mode, this would be the same as truncating the first and last
10 (or mabe 9?) bases and operating in normal Union mode. As I said,
though, there may be edge cases that I haven't thought of where
unexpected things happen.


I agree that controlling the overlap args is appealing and I like the
added ability to resize. I've created a 'chipseq' mode that combines
these ideas and gives what ResizeReads() was doing but now in 'mode'
form. If this implementation gives you the flexibility you were
looking for I'll check it into devel.


This sounds nice, but if I use the 'chipseq' mode, how do I specify
whether I want Union, IntersectionNotEmpty, or IntersectionStrict? It
looks like it just does Union? IntersectionStrict would be useful for
specifying that the read has to occur entirely within the bounds of a
called peak, for example. This is why I implemented it as a wrapper
that takes another mode as an argument, so that the resizing logic and
the counting logic were independent.


'chipseq' didn't implement the standard modes b/c I wanted to avoid the
clash of passing unconventional findOverlaps() args to those modes. The
assumption was that if the user specified a mode they would expect a
certain behavior ...

Maybe summarizeOverlaps could

accept an optional read modification function, and if this is
provided, it will pass the reads through this before passing them to
the
counting function. The read modification function would have to take
any
valid reads argument and return another valid reads argument. It could
be used for modifying the reads as well as filtering them. This would
allow resizing without the awkward nesting method that I've used.


Good idea. Maybe a 'preprocess' or 'prefilter' arg to allow massaging
before counting. I'll post back when it's done.

Valerie



A couple of questions:

- Do you want to handle paired-end reads? You coerce to a GRanges to
resize but don't coerce back.

For paired end reads, there is no need to estimate the fragment length,
because the pair gives you both ends of the fragment. So if I had
paired-end ChIP-Seq data, I would use it as is with no resizing. I
can't
personally think of a reason to resize a paired-end fragment, but I
don't know if others might need that.

I corece to GRanges because I know how GRanges work, but I'm not as
familiar with GAlignments so I don't know how the resize function works
on GAlignments and other classes. I'm sure you know better than I do
how
these work. If the coercion is superfluous, feel free to eliminate it.


- Do you want to require strand info for all reads? Is this because of
how resize() anchors * to 'start'?

Yes, I require strand info for all reads because the reads must be
directionally extended, which requires strand info. Ditto for counting
the 5-prime and 3-prime ends.

-Ryan





chipseq - function(features, reads, ignore.strand=FALSE,
inter.feature=TRUE,
type=any, maxgap=0L, minoverlap=1L,
width=NULL, fix=start, use.names=TRUE)
{
reads - as(reads, GRanges)
if (any(strand(reads) == *))
stop(all reads 

Re: [Bioc-devel] Additional summarizeOverlaps counting modes for ChIP-Seq

2014-08-05 Thread Ryan C. Thompson

Hi again,

I'm looking at the examples in the summarizeOverlaps help page here: 
http://www.bioconductor.org/packages/devel/bioc/manuals/GenomicAlignments/man/GenomicAlignments.pdf


And the examples fod preprocess.reads are a little confusing. One of 
the examples passes some additional ... options to summarizeOverlaps, 
and implies that these will be passed along as the ... arguments to 
the proprocess.reads function:


   summarizeOverlaps(grl, reads, mode=Union, 
preprocess.reads=ResizeReads, width=1, fix=end)


The width and fix arguments are implied to be passed through to 
ResizeReads, but I don't see it documented anywhere how this would be 
done. The summarizeOverlaps documentation for ... says Additional 
arguments for BAM file methods such as singleEnd, fragments or param 
that apply to the reading of records from a file (see below). I don't 
see anything about passing through to preprocess.reads.


Incidentally, this is why my original example used a function that 
constructed a closure with these arguments already bound. I would write 
the example like this to ensure no ambiguity in argument passing (pay 
attention to parens):


   ResizeReads - function(mode, ...) {
   resize.args - list(...)
   function(reads) {
   reads - as(reads, GRanges)
   ## Need strandedness
   stopifnot(all(strand(reads) != *))
   do.call(resize, c(list(x=reads), resize.args))
   }
   }

   ## By default ResizeReads() counts reads that overlap on the 5 end:
   summarizeOverlaps(grl, reads, mode=Union, 
preprocess.reads=ResizeReads())


   ## Count reads that overlap on the 3 end by passing new values
   ## for width and fix:
   summarizeOverlaps(grl, reads, mode=Union, 
preprocess.reads=ResizeReads(width=1, fix=end))


Anyway, I don't have the devel version of R handy to test this out, so 
I don't know if what I've described is a problem in practice. But I 
think that either the preprocess.reads function should be required to 
only take one argument, or else the method of passing through 
additional arguments to it should be documented.


-Ryan

On Tue 05 Aug 2014 05:12:41 PM PDT, Ryan C. Thompson wrote:

Hi Valerie,

I got really busy around May and never got a chance to thank you for
adding this option to summarizeOverlaps! So thank you!

-Ryan

On Thu 01 May 2014 04:25:33 PM PDT, Valerie Obenchain wrote:

GenomicAlignments 1.1.8 has a 'preprocess.reads' argument. This should
be a function where the first argument is 'reads' and the return value
is compatible with 'reads' in the pre-defined count modes.

I've used your ResizeReads() as an example in the man page. I think
the ability to pre-filter and used a pre-defined mode will be useful.
Thanks for the suggestion.

Valerie


On 05/01/2014 02:29 PM, Valerie Obenchain wrote:

On 05/01/2014 02:05 PM, Ryan wrote:

Hi Valerie,

On Thu May 1 13:27:16 2014, Valerie Obenchain wrote:


I have some concerns about the *ExtraArgs() functions. Passing
flexible args to findOverlaps in the existing mode functions
fundamentally changes the documented behavior. The modes were created
to capture specific overlap situations pertaining to gene features
which are graphically depicted in the vignette. Changing 'maxgap' or
'minoverlap' will produce a variety of results inconsistent with past
behavior and difficult to document (e.g., under what circumstances
will IntersectionNotEmpty now register a hit).

Well, I wasn't so sure about those functions either. Obviously you can
pass arguments that break things. They were mostly designed to be
constructors for specific counting modes involving the
minoverlap/maxgap
arguments, but I decided I didn't need those modes after all. They're
certainly not designed to be exposed to the user. I haven't carefully
considered the interaction between the counting mode and
maxgap/minoverlap, but I believe that it would be roughly
equivalent to
extending/shrinking the features/reads by the specified amount (with
some differences for e.g. a feature/read smaller than 2*minoverlap).
For
example, with a read length of 100 and a minoverlap of 10 in Union
counting mode, this would be the same as truncating the first and last
10 (or mabe 9?) bases and operating in normal Union mode. As I said,
though, there may be edge cases that I haven't thought of where
unexpected things happen.


I agree that controlling the overlap args is appealing and I like the
added ability to resize. I've created a 'chipseq' mode that combines
these ideas and gives what ResizeReads() was doing but now in 'mode'
form. If this implementation gives you the flexibility you were
looking for I'll check it into devel.


This sounds nice, but if I use the 'chipseq' mode, how do I specify
whether I want Union, IntersectionNotEmpty, or IntersectionStrict? It
looks like it just does Union? IntersectionStrict would be useful for
specifying that the read has to occur entirely within the bounds of a
called peak, for example. This is why I implemented it 

[Rd] Minor documentation change for assign

2014-08-05 Thread Grant Rettke
Good afternoon,

Today I was reading the documentation

,
| help('-');
`
.

The passage (I added the asterisks)

  The operators '' are normally only used in functions, and
  cause a search *to* made through parent environments for an
  existing definition of the variable being assigned.  If such
  a

looks like it might have wanted to have been phrased as

  The operators '' are normally only used in functions, and
  cause a search *to be* made through parent environments for
  an existing definition of the variable being assigned.  If
  such a
.

Wanted to share that with the devs, and if that is a wanted change do
people usually submit patches
or what is the preferred mechanism?

,
| sessionInfo()
`

,
| R version 3.1.1 (2014-07-10)
| Platform: x86_64-apple-darwin13.2.0 (64-bit)
|
| locale:
| [1] en_US/en_US/en_US/C/en_US/en_US
|
| attached base packages:
| [1] stats graphics  grDevices utils datasets  methods   base
|
| loaded via a namespace (and not attached):
| [1] compiler_3.1.1 tools_3.1.1
`

Kind regards,

Grant Rettke | ACM, ASA, FSF, IEEE, SIAM
g...@wisdomandwonder.com | http://www.wisdomandwonder.com/
“Wisdom begins in wonder.” --Socrates
((λ (x) (x x)) (λ (x) (x x)))
“Life has become immeasurably better since I have been forced to stop
taking it seriously.” --Thompson

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Is it a good idea or even possible to redefine attach?

2014-08-05 Thread Grant Rettke
Hi,

Today I got curious about whether or not I /could/ remove `attach' from
my system so:
- Backed it up
- Implemented a new one
- Like this

,
| attach.old - attach
| attach - function(...) {stop(NEVER USE ATTACH)}
`

I got the error:

,
| Error: cannot change value of locked binding for 'attach'
`

If I unlock `attach' I assume that I could stomp on it... however is
that even a good idea?

What will I break?

My goal was never to allow `attach' in my system, but it was just an
idea.

Kind regards,

Grant Rettke | ACM, ASA, FSF, IEEE, SIAM
g...@wisdomandwonder.com | http://www.wisdomandwonder.com/
“Wisdom begins in wonder.” --Socrates
((λ (x) (x x)) (λ (x) (x x)))
“Life has become immeasurably better since I have been forced to stop
taking it seriously.” --Thompson

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Is it a good idea or even possible to redefine attach?

2014-08-05 Thread Ista Zahn
On Tue, Aug 5, 2014 at 2:49 PM, Grant Rettke g...@wisdomandwonder.com wrote:
 Hi,

 Today I got curious about whether or not I /could/ remove `attach' from
 my system so:
 - Backed it up
 - Implemented a new one
 - Like this

 ,
 | attach.old - attach
 | attach - function(...) {stop(NEVER USE ATTACH)}
 `

Just masking it with your own function, e.g.,

attach - function(...) {stop(NEVER USE ATTACH)}

should be enough to discourage you from using it.


 I got the error:

 ,
 | Error: cannot change value of locked binding for 'attach'
 `

 If I unlock `attach' I assume that I could stomp on it... however is
 that even a good idea?

 What will I break?

Anything that uses attach.


 My goal was never to allow `attach' in my system, but it was just an
 idea.

 Kind regards,

 Grant Rettke | ACM, ASA, FSF, IEEE, SIAM
 g...@wisdomandwonder.com | http://www.wisdomandwonder.com/
 “Wisdom begins in wonder.” --Socrates
 ((λ (x) (x x)) (λ (x) (x x)))
 “Life has become immeasurably better since I have been forced to stop
 taking it seriously.” --Thompson

 __
 R-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Is it a good idea or even possible to redefine attach?

2014-08-05 Thread Winston Chang
On Tue, Aug 5, 2014 at 1:49 PM, Grant Rettke g...@wisdomandwonder.com wrote:

 Hi,

 Today I got curious about whether or not I /could/ remove `attach' from
 my system so:
 - Backed it up
 - Implemented a new one
 - Like this

 ,
 | attach.old - attach
 | attach - function(...) {stop(NEVER USE ATTACH)}
 `

 I got the error:

 ,
 | Error: cannot change value of locked binding for 'attach'
 `

 If I unlock `attach' I assume that I could stomp on it... however is
 that even a good idea?

 What will I break?


If you change the base package environment's copy of `attach` (via
`as.environment('package:base')`) , probably not much will break,
except for your own code. If, on the other hand, you change the base
namespace's copy of `attach` (via `asNamespace('base')`, any package
that's subsequently loaded and uses `attach` would run into problems.
Still, either way is probably not a good idea.

I agree with Ista: assigning it yourself in the the global environment
is a better idea. If you want to keep your global env clear of stuff
like this, you can put it in an environment and attach that
environment as a parent of global:

e - new.env()
e$attach -  function(...) {stop(NEVER USE ATTACH)}
base::attach(e, name = my_stuff, warn.conflicts = FALSE)

# Test it out:
attach()

# You can see that the my_stuff env is the parent of global env
search()


-Winston

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Is it a good idea or even possible to redefine attach?

2014-08-05 Thread Grant Rettke
That is delightful.

When I run it like this:
• Start R
• Nothing in .Rprofile
• Paste in your code
╭
│ gcrenv - new.env()
│ gcrenv$attach.old - attach
│ gcrenv$attach - function(...){stop(NEVER USE ATTACH)}
│ base::attach(gcrenv, name=gcr, warn.conflicts = FALSE)
╰
• I get exactly what is expected, I think
╭
│ search()
╰
╭
│  [1] .GlobalEnvgcr   ESSR
│  [4] package:stats package:graphics  package:grDevices
│  [7] package:utils package:datasets  package:methods
│ [10] Autoloads package:base
╰

Just to be sure:
• Is that what is expected?
• I am surprised because I thought that `gcr' would come first before
  `.GlobalEnv'
  • Perhaps I mis understand, as `.GlobalEnv' is actually the REPL?

My goal is to move that to my .Rprofile so that it is always run and I
can forget about it more or less.

Reading [this] I felt like `.First' would be the right place to put it,
but then read further to find that packages are only loaded /after/
`.First' has completed.  Curious, I tried it just to be sure. I am now
:).

This is the .Rprofile file:

╭
│ cat(.Rprofile: Setting CMU repository\n)
│ r = getOption(repos)
│ r[CRAN] = http://lib.stat.cmu.edu/R/CRAN/;
│ options(repos = r)
│ rm(r)
│
│ .First - function() {
│«same code as above»
│ }
╰

(I included the repository load, and understand it should not impact
things here)

This is run after normal startup of R:

╭
│ search()
╰
╭
│  [1] .GlobalEnvpackage:stats package:graphics
│  [4] package:grDevices package:utils package:datasets
│  [7] gcr   package:methods   Autoloads
│ [10] package:base
╰

When I read this, I read it as:
• My rebind of `attach' occurs
• Then all of the packages are loaded and they are referring to
  my-rebound `attach'
• That is a problem because it *will* break package code
• Clearly me putting that code in `.Rprofile' is the wrong place.

What I am looking for now is the right way to achieve what is
demonstarted but automatically via a startup file.

My ideas:
• Manually paste the step each time
• Always use Emacs and ESS to run R and add a hook so that the code will
  be run after ESS loads
• Something I am missing


[this]
http://stat.ethz.ch/R-manual/R-devel/library/base/html/Startup.html
Grant Rettke | ACM, ASA, FSF, IEEE, SIAM
g...@wisdomandwonder.com | http://www.wisdomandwonder.com/
“Wisdom begins in wonder.” --Socrates
((λ (x) (x x)) (λ (x) (x x)))
“Life has become immeasurably better since I have been forced to stop
taking it seriously.” --Thompson


On Tue, Aug 5, 2014 at 2:47 PM, Winston Chang winstoncha...@gmail.com wrote:
 On Tue, Aug 5, 2014 at 1:49 PM, Grant Rettke g...@wisdomandwonder.com wrote:

 Hi,

 Today I got curious about whether or not I /could/ remove `attach' from
 my system so:
 - Backed it up
 - Implemented a new one
 - Like this

 ,
 | attach.old - attach
 | attach - function(...) {stop(NEVER USE ATTACH)}
 `

 I got the error:

 ,
 | Error: cannot change value of locked binding for 'attach'
 `

 If I unlock `attach' I assume that I could stomp on it... however is
 that even a good idea?

 What will I break?


 If you change the base package environment's copy of `attach` (via
 `as.environment('package:base')`) , probably not much will break,
 except for your own code. If, on the other hand, you change the base
 namespace's copy of `attach` (via `asNamespace('base')`, any package
 that's subsequently loaded and uses `attach` would run into problems.
 Still, either way is probably not a good idea.

 I agree with Ista: assigning it yourself in the the global environment
 is a better idea. If you want to keep your global env clear of stuff
 like this, you can put it in an environment and attach that
 environment as a parent of global:

 e - new.env()
 e$attach -  function(...) {stop(NEVER USE ATTACH)}
 base::attach(e, name = my_stuff, warn.conflicts = FALSE)

 # Test it out:
 attach()

 # You can see that the my_stuff env is the parent of global env
 search()


 -Winston

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] More than one package document with the same name

2014-08-05 Thread Davison, Jerry
Hi, 

I sent this to the Bioconductor mailing list (q.v.), replies recommended this 
forum. Here it is: 
A suggestion. Typically a package provides documentation of two types, one or 
more vignettes and a reference manual; and they have the same file name, 
PackageName.pdf.  When downloaded some file name manipulation is required, or 
accept PackageName(1).pdf.

I suggest the documents be given different names, for example the reference 
manual could be named PackageNameRefMan.pdf. Package checking could enforce the 
rule. 
Jerry 


[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] More than one package document with the same name

2014-08-05 Thread John McKown
On Tue, Aug 5, 2014 at 5:47 PM, Davison, Jerry jdavi...@fhcrc.org wrote:
 Hi,

 I sent this to the Bioconductor mailing list (q.v.), replies recommended this 
 forum. Here it is:
 A suggestion. Typically a package provides documentation of two types, one or 
 more vignettes and a reference manual; and they have the same file name, 
 PackageName.pdf.  When downloaded some file name manipulation is required, or 
 accept PackageName(1).pdf.

 I suggest the documents be given different names, for example the reference 
 manual could be named PackageNameRefMan.pdf. Package checking could enforce 
 the rule.
 Jerry


Sounds like you want to put all the pdf files in a single directory. I
like that too. I cheat. I run Linux. I have a ~/Documents/R-pdfs in
which I keep symlinks to all the pdf files. And I do it similar to
the way that you indicate. I make the name of the pdf be
${enclosing_directory}_${original_pdf_name.pdf}. I do something like:

cd ~/Documents/R-pdf
find /usr/lib64/R -name '*.pdf'|\
while read i;do
  file=${i##*/};
  dir=${i%%/doc/*.pdf};
  package=${dir##*/};
  ln -s $i ${package}_${file};
done

Above won't work if anything has a blank in it. Windows people tend to
do this. Most UNIX people are better trained.

It would be nice if the packagers did this. But the above works for
me. On Linux and other UNIX like systems. Won't work for the poor,
benighted Windows people. But I think something similar is possible.
But I don't know Windows well enough.

Maranatha! 
John McKown

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Is it a good idea or even possible to redefine attach?

2014-08-05 Thread Winston Chang
On Tue, Aug 5, 2014 at 4:37 PM, Grant Rettke g...@wisdomandwonder.com wrote:

 That is delightful.

 When I run it like this:
 • Start R
 • Nothing in .Rprofile
 • Paste in your code
 ╭
 │ gcrenv - new.env()
 │ gcrenv$attach.old - attach
 │ gcrenv$attach - function(...){stop(NEVER USE ATTACH)}
 │ base::attach(gcrenv, name=gcr, warn.conflicts = FALSE)
 ╰
 • I get exactly what is expected, I think
 ╭
 │ search()
 ╰
 ╭
 │  [1] .GlobalEnvgcr   ESSR
 │  [4] package:stats package:graphics  package:grDevices
 │  [7] package:utils package:datasets  package:methods
 │ [10] Autoloads package:base
 ╰

 Just to be sure:
 • Is that what is expected?
 • I am surprised because I thought that `gcr' would come first before
   `.GlobalEnv'
   • Perhaps I mis understand, as `.GlobalEnv' is actually the REPL?

 My goal is to move that to my .Rprofile so that it is always run and I
 can forget about it more or less.

 Reading [this] I felt like `.First' would be the right place to put it,
 but then read further to find that packages are only loaded /after/
 `.First' has completed.  Curious, I tried it just to be sure. I am now
 :).

 This is the .Rprofile file:

 ╭
 │ cat(.Rprofile: Setting CMU repository\n)
 │ r = getOption(repos)
 │ r[CRAN] = http://lib.stat.cmu.edu/R/CRAN/;
 │ options(repos = r)
 │ rm(r)
 │
 │ .First - function() {
 │«same code as above»
 │ }
 ╰

 (I included the repository load, and understand it should not impact
 things here)

 This is run after normal startup of R:

 ╭
 │ search()
 ╰
 ╭
 │  [1] .GlobalEnvpackage:stats package:graphics
 │  [4] package:grDevices package:utils package:datasets
 │  [7] gcr   package:methods   Autoloads
 │ [10] package:base
 ╰

 When I read this, I read it as:
 • My rebind of `attach' occurs
 • Then all of the packages are loaded and they are referring to
   my-rebound `attach'
 • That is a problem because it *will* break package code
 • Clearly me putting that code in `.Rprofile' is the wrong place.


That order for search path should actually be fine. To understand why,
you first have to know the difference between the _binding_
environment for an object, and the _enclosing_ environment for a
function.

The binding environment is where you can find an object. For example,
in the global env, you have a bunch bindings (we often call them
variables), that point to various objects - vectors, data frames,
other environments, etc.

The enclosing environment for a function is where the function runs
in when it's called.

Most R objects have just a binding environment (a variable or
reference that points to the object); functions also have an enclosing
environment. These two environments aren't necessarily the same.

When you run search(), it shows the set of environments where R will
look for an object of a given name, when you run stuff at the console
(and are in the global env). The trick is that, although you can find
a function (they are bound bound) in one of these _package_
environments, those functions run in (are enclosed by) a different
environment: the a corresponding _namespace_ environment.

The way that a namespace environment is set up with the arrangement of
its ancestor environments, it will find the base namespace version of
`attach` before it finds yours, even if your personal gcr environment
comes early in the search path.

=
# Here's an example to illustrate. The `utils::alarm` function calls
`cat`, which is in base.

alarm
# function ()
# {
# cat(\a)
# flush.console()
# }
# environment: namespace:utils


# Running it makes the screen flash or beep
alarm()
# [screen flashes]


# We'll put a replacement version of cat early in the search path,
between utils and base
my_stuff - new.env()
my_stuff$cat - function(...) stop(Tried to call cat)
base::attach(my_stuff, pos=length(search()) - 1, name=my_stuff)

search()
#  [1] .GlobalEnvtools:rstudio package:stats
package:graphics
#  [5] package:grDevices package:utils package:datasets
package:methods
#  [9] my_stuff  Autoloads package:base

# Calling cat from the console gives the error, as expected
cat()
# Error in cat() : Tried to call cat

# But when we run alarm(), it still gets the real version of `cat()`,
# because it finds the the original base namespace version of cat
# before it finds yours.
alarm()
# [screen flashes]

==

You can even alter package environments without affecting the
corresponding namespace environment. The exception to the package and
namespace environments being distinct is the base environment; change
one and you change the other. (I just realized this and have to
retract my earlier statement about the behavior being different if
change attach in the base package env vs. the base namespace env.)

-Winston

__
R-devel@r-project.org mailing list