Re: [Rd] Assign in place and reference counting

Jiří Moravec Tue, 06 Jan 2026 17:31:12 -0800

Hi Simon (Šimon?),

> You are re-inventing the reference classes (or R6 if you prefer package space). By definition, environments are the only mutable objects in R, no other objects are, so what you are doing is adding a reference wrapper around an immutable object.

Yes, what I am doing is essentially a very bare-bones RC or R6, but in pure base R and in fewer lines of code.

I stole it from the R6 benchmarks: https://r6.r-lib.org/articles/Performance.html#environment-created-by-a-function-call-without-class-attribute


and from https://github.com/r-lib/testthat/blob/main/R/stack.R

In the actual code, I am using an environment as a hidden state object, specifically to track test results:


https://github.com/J-Moravec/mutr/blob/master/mutr.r#L6-L13

so I need an environment object (or a bunch of global variables).

> Clearly, for the example above the entire discussion is irrelevant, since copies are much cheaper than function calls so unless you have a stack of billions it makes no difference. In addition, your get() will always return a copy regardless, because you are forcing it by the subsetting, so, paradoxically, the naïve (yet more readable) implementation


I guess I was so scared of the Second Circle that I fell into the Eighth.

https://www.burns-stat.com/pages/Tutor/R_inferno.pdf

I didn't realize that `get` would force a copy, and after testing, the performance is better for the growing variant (I guess since R is now internally pre-allocating for short vectors, but as you said, internal details :) ).

I guess I can keep it simple then and don't need to be as worried about re-allocating small vectors.


https://gist.github.com/J-Moravec/07bde03068ece71495976b0388c4b519

Thanks, this was a nice learning experience,


-- Jirka

On 7/01/26 11:12, Simon Urbanek wrote:

On 7/01/2026, at 09:29, Jiří Moravec <[email protected]> wrote:

Hi Ivan,

I can't say that I fully understand the described mechanism yet,
especially in light of what you described at the end,
which is something I had stumbled upon myself:

   env = new.env() # doesn't work with parent = emptyenv()
   env$vec = c("a","b")
   .Internal(address(env$vec))
   env2 = env
   with(env, {vec[1] = "foo"})

Where `with` runs `eval(substitute())` internally.
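
As a small illustration of that point (a sketch only; `env_a` and `env_b` are hypothetical names, and the environments are assumed to be created with `new.env()`), evaluating an assignment through `with()` or through `eval()` changes the binding inside the environment itself:

```r
env_a <- new.env()
env_a$vec <- c("a", "b")
with(env_a, {vec[1] = "foo"})        # with() evaluates the expression inside env_a

env_b <- new.env()
env_b$vec <- c("a", "b")
eval(quote(vec[1] <- "foo"), env_b)  # the same assignment, spelled out with eval()

identical(env_a$vec, env_b$vec)      # both environments now hold the modified vector
```

Whether either form avoids a copy of `vec` is a separate question and an implementation detail; the sketch only shows where the assignment takes effect.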

---

I am just playing with fixed buffers or stacks. I was previously able to build a
stack with:

     new_stack = function(){
         size = 0
         items = vector("character", 8)

         add = function(x){
             size <<- size + 1
             items[size] <<- x
             }

         get = function(){
             items[seq_len(size)]
             }

         environment()
         }

     stack = new_stack()
     tracemem(stack$items)
     stack2 = stack
     .Internal(address(stack$items))
     stack$add("foo")
     stack$add("bar")
     # Memory is the same
     .Internal(address(stack$items))
     # stack2 is the same as stack
     stack2$get() # [1] "foo" "bar"

Which works, is really cool, and allows memory-efficient (or so I hope) shared
resources with reference-like semantics for types other than environments.
I just hoped that further simplification would be possible.


You are re-inventing the reference classes (or R6 if you prefer package space). 
By definition, environments are the only mutable objects in R, no other objects 
are, so what you are doing is adding a reference wrapper around an immutable 
object.

The fact that `items` can be modified in place is orthogonal to that: as Ivan 
said, that's just an under-the-hood optimization. The language definition says 
that the `items` before and after the subassignment are two different objects - 
both immutable. From the user's perspective there is no difference, but R is smart 
enough to optimize away the copy if it is safe, i.e., when it knows for sure 
that no one can access the original object so it can cheat and re-use the 
original object instead, but that fact is intended to be entirely invisible to 
the user.
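
A minimal sketch of the user-visible semantics described above (the variable names are illustrative and not from the thread): ordinary vectors behave as values, while environments behave as references, regardless of whether R copies anything internally.

```r
# Vectors have value semantics: modifying `b` never changes `a`.
a <- c(1, 2, 3)
b <- a
b[1] <- 99
a[1]    # still 1

# Environments are mutable: `e1` and `e2` are two names for one object.
e1 <- new.env()
e2 <- e1
e2$x <- 1
e1$x    # 1, visible through both names
```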

Clearly, for the example above the entire discussion is irrelevant, since 
copies are much cheaper than function calls so unless you have a stack of 
billions it makes no difference. In addition, your get() will always return a 
copy regardless, because you are forcing it by the subsetting, so, 
paradoxically, the naïve (yet more readable) implementation

         items = character()
         add = function(x) items <<- c(items, x)
         get = function() items

is actually faster if you have a comparable number of gets and adds since gets 
don't need to copy (and uses less memory). My recommendation would be to not 
worry about internal optimisations because a) they change all the time so your 
assumptions about undefined behavior may be broken and backfire, b) the time 
you spend on it is many orders of magnitude more than any potential savings and 
c) trying to exploit specific behavior makes code less readable and thus more 
error-prone.
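
For reference, the naive variant can be wrapped in the same closure pattern as the original stack. This is only a sketch: `new_simple_stack` is a hypothetical name that does not appear in the thread, and no claim is made here about its relative speed.

```r
new_simple_stack <- function() {
  items <- character()
  add <- function(x) items <<- c(items, x)  # grow by concatenation
  get <- function() items                   # return the vector as-is, no subsetting
  environment()
}

stack <- new_simple_stack()
stack2 <- stack      # a second name for the same environment
stack$add("foo")
stack$add("bar")
stack2$get()         # sees both items, because the environment is shared
```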

Cheers,
Šimon

I believe this works because the function `add` is evaluated in the same 
environment (as with `with`), but I don't fully get _why_.

I will spend some time reading the subset assignment section.


On 6/01/26 23:39, Ivan Krylov wrote:

On Mon, 5 Jan 2026 16:30:43 +1300,
Jiří Moravec <[email protected]> wrote:

1. Is there documentation of `reference counting`?

There is a short description at
<https://developer.r-project.org/Refcnt.html>. The general rule for
package developers is "Except in very special and well understood
circumstances, an argument passed down to C code should not be modified
if it has a positive reference count, even if that count is equal to
one".

For an example of when a reference count of 1 is not safe, consider:

foo <- bar <- baz <- list(x = 42+0) # make a fresh numeric vector
.Call(modify_me, foo$x)

foo$x has a reference count of only 1, so NOT_SHARED() is true. On the
other hand, since the bindings 'foo', 'bar', 'baz' all share the same
list (whose reference count is 3), altering foo$x by reference from C
code would also change the values of 'bar' and 'baz', which violates
the value semantics of lists in R.
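
For contrast, a sketch of what value semantics guarantee when the modification is made at R level rather than from C (the hypothetical `modify_me` routine is not reproduced here): only the binding that is assigned to changes.

```r
foo <- bar <- baz <- list(x = 42 + 0)
foo$x <- foo$x + 1   # ordinary R-level modification of foo only

foo$x   # 43
bar$x   # 42, unaffected
baz$x   # 42, unaffected
```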

2. Is the demonstrated behaviour a bug?

In this particular case, you've shown the duplication could have been
avoided, so at the very least you've got a feature request to make
complex assignment more efficient. Now the question is, why does the
duplication happen, and how hard is it to avoid performing it without
breaking anything?

The complex assignment rules are described here:

https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Subset-assignment-1

If you call tracemem(env$vec) and set a breakpoint in
memtrace_report(), you can see that env$vec is duplicated in eval.c,
function evalseq():

(gdb) l evalseq
(gdb) b 3201
Breakpoint 2 at 0x55555569dacb: file eval.c, line 3201.
(gdb) commands 2

call Rf_PrintValue(nexpr)
c
end

(gdb) b 3209
Breakpoint 3 at 0x55555569dad3: file eval.c, line 3209.
(gdb) commands 3

call R_inspect(nval)
call R_inspect(val)
c
end

In both cases, the expression being evaluated is `*tmp*`$vec, with
`*tmp*` aliased to `env` without incrementing its reference count. When
evaluating the first assignment, `env$vec[1] <- 5`, `nval` is the
vector being updated, and `val` is a special, non-reference-counting
pairlist containing `env` and `as.name("env")`:

Breakpoint 3, evalseq <...> at eval.c:3209
# first the 'nval', note REF(1)
@55555615f588 14 REALSXP g0c4 [REF(1)] (len=8, tl=0) 5,0,0,0,0,...
# next the 'val', note REF(1) for its first element
@555557df5fb0 02 LISTSXP g0c0 [STP]
   @555557d0a3b8 04 ENVSXP g0c0 [REF(1)] <0x555557d0a3b8>
<...>
   @555555a2ae88 01 SYMSXP g0c0 [MARK,REF(1785)] "env"

Next, after `env2 <- env`, we attempt an assignment again:

Breakpoint 3, evalseq <...> at eval.c:3209
# again, 'nval' has a reference count of 1
@55555615f588 14 REALSXP g0c4 [REF(1)] (len=8, tl=0) 5,0,0,0,0,...
# but now 'env' has a reference count of 2
@555557dfd108 02 LISTSXP g0c0 [STP]
   @555557d0a3b8 04 ENVSXP g0c0 [REF(2)] <0x555557d0a3b8> # <-- here
<...>
   @555555a2ae88 01 SYMSXP g0c0 [MARK,REF(1787)] "env"

Since `env` is referenced twice, it's MAYBE_SHARED, so the condition

if (MAYBE_REFERENCED(nval) &&
     (MAYBE_SHARED(nval) || MAYBE_SHARED(CAR(val))))

is true, and `nval` (env$vec) is duplicated before the assignment.

This would've been necessary if 'env' was a list (or another
value-semantics object; see the first example above).

3. I would guess that assign in place in this case is
implementation-specific detail and not specified behaviour, so one
shouldn't rely on it.

True. R's copy-on-write is an optimisation, although a very useful one.

4. Is there way how to do this (i.e., fixed buffer) in base R without
relying on C with .Call?

This is a kludge, but if you allow your environment to be enclosed by
the base environment, you can perform the sub-assignment directly
inside it, without invoking complex assignment:

env3 <- new.env(parent = baseenv())
env3$vec <- vector("numeric", 8)
tracemem(env3$vec)
eval(substitute(vec[i] <- v, list(i = 1, v = 5)), env3)
env4 <- env3
eval(substitute(vec[i] <- v, list(i = 2, v = 6)), env3)
# still not duplicated

(I've also tried substitute(..., list(`<-` = base::`<-`)) for use in an
empty environment, but that breaks when it tries to invoke `[<-`.)
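
The kludge above could be wrapped in small helpers, roughly as follows. This is a sketch under assumptions: `new_buffer` and `buffer_set` are hypothetical names, and whether the duplication is actually avoided depends on R internals that may change, so the example only checks the resulting values.

```r
new_buffer <- function(n = 8) {
  buf <- new.env(parent = baseenv())  # baseenv() so that `<-` and `[<-` can be found
  buf$vec <- vector("numeric", n)
  buf
}

buffer_set <- function(buf, i, v) {
  # Build the call `vec[i] <- v` with the values filled in, then evaluate it in buf
  eval(substitute(vec[i] <- v, list(i = i, v = v)), buf)
  invisible(buf)
}

buf <- new_buffer()
alias <- buf                 # second reference to the same environment
buffer_set(buf, 1, 5)
buffer_set(buf, 2, 6)
alias$vec[1:2]               # the alias sees both writes
```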

What is the overall problem you would like to solve?

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

