Re: [Rd] we need an exists/get hybrid

2014-12-04 Thread Lorenz, David
All,
  So that suggests that .GlobalEnv[[X]] is more efficient than get(X,
pos=1L). What about .GlobalEnv[[X]] <- value, compared to assign(X,
value)?
Dave


Re: [Rd] we need an exists/get hybrid

2014-12-04 Thread Sven E. Templer
David, 'assign' is slower than '<-':

##   median expr
## 1 0.1440 X <- letters
## 2 0.4420 .Internal(assign("X", letters, e, F))
## 3 1.1820 e[["X"]] <- letters
## 4 1.2570 e$X <- letters
## 5 1.8380 assign("X", letters, envir = e, inherits = F)
## 6 1.9415 assign("X", letters, e, inherits = F)

(medians in microseconds, 500 runs each; see http://rpubs.com/setempler/46568)
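
For anyone who wants to check that the assignment forms in the table really are interchangeable before timing them, here is a minimal sketch in base R. The environment names e1, e2, e3 are assumptions for illustration only; the timings themselves would still need microbenchmark.

```r
# Three ways to bind a value to the name "X" inside an environment.
e1 <- new.env()
e2 <- new.env()
e3 <- new.env()

assign("X", letters, envir = e1, inherits = FALSE)  # ordinary function call
e2[["X"]] <- letters                                # `[[<-` on an environment
e3$X <- letters                                     # `$<-` on an environment

# All three leave the same binding behind.
same <- identical(e1$X, e2$X) && identical(e2$X, e3$X)
print(same)
```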

---

Two questions:

Is 'X <- letters' the fastest because it does not need to switch from the
benchmark's environment to 'e'?
Why is the call to '.Internal' faster than '[[<-' here, whereas in
Winston's get/'[[' benchmark '[[' was faster than '.Internal'?

thanks,
s


Re: [Rd] we need an exists/get hybrid

2014-12-03 Thread Winston Chang
I've looked at related speed issues in the past, and have a couple
related points to add. (I've put the info below at
http://rpubs.com/wch/46428.)

There’s a significant amount of overhead just from calling the R
function get(). This is true even when you skip the pos argument and
provide envir. For example, if you call get(), it takes much more time
than .Internal(get()), which is what get() does.

If you already know that the object exists in an environment, it's
faster to use e$x, and slightly faster still to use e[[x]]:

library(microbenchmark)  # provides microbenchmark()

e <- new.env()
e$a <- 1

# Accessing objects in environments
microbenchmark(
  get("a", e, inherits = FALSE),
  get("a", envir = e, inherits = FALSE),
  .Internal(get("a", e, "any", FALSE)),
  e$a,
  e[["a"]],
  .Primitive("[[")(e, "a"),

  unit = "us"
)
#   median name
# 1 1.0300 get("a", e, inherits = FALSE)
# 2 0.9425 get("a", envir = e, inherits = FALSE)
# 3 0.3080 .Internal(get("a", e, "any", FALSE))
# 4 0.2305 e$a
# 5 0.1740 e[["a"]]
# 6 0.2905 .Primitive("[[")(e, "a")


A similar thing happens with exists(): the R function wrapper adds
significant overhead on top of .Internal(exists()). It’s also faster
to use $ and [[, then test for NULL, but of course this won’t
distinguish between objects that don’t exist, and those that do exist
but have a NULL value:

# Test for existence of `a` (which exists), and `c` (which doesn't)
microbenchmark(
  exists('a', e, inherits = FALSE),
  exists('a', envir = e, inherits = FALSE),
  .Internal(exists('a', e, 'any', FALSE)),
  'a' %in% ls(e, all.names = TRUE),
  is.null(e[['a']]),
  is.null(e$a),

  exists('c', e, inherits = FALSE),
  exists('c', envir = e, inherits = FALSE),
  .Internal(exists('c', e, 'any', FALSE)),
  'c' %in% ls(e, all.names = TRUE),
  is.null(e[['c']]),
  is.null(e$c),

  unit = "us"
)
#    median name
# 1  1.2015 exists("a", e, inherits = FALSE)
# 2  1.0545 exists("a", envir = e, inherits = FALSE)
# 3  0.3615 .Internal(exists("a", e, "any", FALSE))
# 4  7.6345 "a" %in% ls(e, all.names = TRUE)
# 5  0.3055 is.null(e[["a"]])
# 6  0.3270 is.null(e$a)
# 7  1.1890 exists("c", e, inherits = FALSE)
# 8  1.0370 exists("c", envir = e, inherits = FALSE)
# 9  0.3465 .Internal(exists("c", e, "any", FALSE))
# 10 7.5475 "c" %in% ls(e, all.names = TRUE)
# 11 0.2675 is.null(e[["c"]])
# 12 0.3010 is.null(e$c)
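
To make the caveat about NULL concrete, here is a small sketch in base R (the binding names b and c are just for illustration). It shows that the NULL test cannot tell a missing binding from one whose value is NULL, whereas exists() can.

```r
e <- new.env()
assign("b", NULL, envir = e)   # `b` exists, but its value is NULL

# The fast test gives the same answer for both cases...
null_b <- is.null(e[["b"]])    # TRUE: bound, value is NULL
null_c <- is.null(e[["c"]])    # TRUE: not bound at all

# ...while exists() tells them apart.
has_b <- exists("b", envir = e, inherits = FALSE)  # TRUE
has_c <- exists("c", envir = e, inherits = FALSE)  # FALSE
```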


-Winston

On Tue, Dec 2, 2014 at 8:46 PM, Peter Haverty haverty.pe...@gene.com wrote:
 Hi All,

 I've been looking into speeding up the loading of packages that use a lot
 of S4.  After profiling I noticed the exists function accounts for a
 surprising fraction of the time.  I have some thoughts about speeding up
 exists (below). More to the point of this post, Martin Mächler noted that
 'exists' and 'get' are often used in conjunction.  Both functions are
 different usages of the do_get C function, so it's a pity to run that twice.

 get gives an error when a symbol is not found, so you can't just do a
 'get'.  With R's C library, one might do

 SEXP x = findVarInFrame3(env, symbol, TRUE);
 if (x != R_UnboundValue) {
 // do stuff with x
 }

 It would be very convenient to have something like this at the R level. We
 don't want to do any tryCatch stuff or to add args to get (That would kill
 any speed advantage. The overhead for handling redundant args accounts for
 30% of the time used by exists).  Michael Lawrence and I worked out that
 we need a function that returns either the desired object, or something
 that represents R_UnboundValue. We also need a very cheap way to check if
 something equals this new R_UnboundValue. This might look like

 if (defined(x <- fetch(symbol, env))) {
   do_stuff_with_x(x)
 }
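
A rough sketch of how this could look at the R level. The names fetch and defined come from the pseudocode above, and the .unbound sentinel is invented here for illustration; none of them are existing R functions. The sketch leans on get0(), which is only available in R 3.2.0 and later, so treat it as one possible shape rather than the proposed implementation.

```r
# Hypothetical sentinel standing in for R_UnboundValue: a fresh environment
# is identical() only to itself, so no ordinary value can collide with it.
.unbound <- new.env()

# Hypothetical fetch(): a single lookup returning the value or the sentinel.
fetch <- function(symbol, env) {
  get0(symbol, envir = env, inherits = FALSE, ifnotfound = .unbound)
}

# Hypothetical defined(): an identity check against the sentinel.
defined <- function(x) !identical(x, .unbound)

env <- new.env()
assign("a", 1, envir = env)
assign("b", NULL, envir = env)

if (defined(x <- fetch("a", env))) print(x)
defined(fetch("b", env))  # TRUE: a binding to NULL still counts as defined
defined(fetch("c", env))  # FALSE: no such binding
```

Using an environment as the sentinel keeps the check cheap and lets NULL remain an ordinary value.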

 A few more thoughts about exists:

 Moving the bit of R in the exists function to C saves 10% of the time.
 Dropping the redundant pos and frame args entirely saves 30% of the time
 used by this function. I suggest that the arguments of both get and exists
 should be simplified to (x, envir, mode, inherits). The existing C code handles
 numeric, character, and environment input for where. The arg frame is
 rarely used (0/128 exists calls in the methods package). Users that need to
 can call sys.frame themselves. get already lacks a frame argument and the
 manpage for exists notes that envir is only there for backwards
 compatibility. Let's deprecate the extra args in exists and get and perhaps
 move the extra argument handling to C in the interim.  Similarly, the
 assign function does nothing with the immediate argument.
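
To illustrate the redundancy, here is a small sketch that assumes the exists() signature with where, envir, frame, mode and inherits. An environment passed positionally lands in where and is then converted to envir, so both spellings perform the same lookup.

```r
e <- new.env()
assign("a", 1, envir = e)

# The positional second argument is `where`; exists() turns it into `envir`.
via_where <- exists("a", e, inherits = FALSE)
via_envir <- exists("a", envir = e, inherits = FALSE)

# The simplified interface suggested above would need only these arguments.
via_full <- exists("a", envir = e, mode = "any", inherits = FALSE)
```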

 I'd be interested to hear if there is any support for a fetch-like
 function (and/or deprecating some unused arguments).

 All the best,
 Pete



 Pete

 
 Peter M. Haverty, Ph.D.
 Genentech, Inc.
 phave...@gene.com


Re: [Rd] we need an exists/get hybrid

2014-12-03 Thread Peter Haverty
Thanks Winston!  I'm amazed that [[ beats calling the .Internal
directly.  I guess the difference between .Primitive vs. .Internal is
pretty significant for things on this time scale.

NULL meaning NULL and NULL meaning undefined would lead to the same path
for much of my code.  I'll be swapping out many exists and get calls later
today.  Thanks!

I do still think it would be very useful to have some way to discriminate
the two NULL cases.  I'm reminded of how Perl does the same thing.  It's
been a while, but it was something like

# This is still two lookups, but it has the defined concept.
if (defined($x{'c'})) { print $x{'c'}; }

or maybe even

if (defined($foo = $x{'c'})) { print $foo; }


Thanks again for the timings!


Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com
