Re: [Rd] Invisible names problem

Simon Urbanek Wed, 22 Jul 2020 14:00:14 -0700

Very interesting:

> .Internal(inspect(k[i]))
@10a4bc000 14 REALSXP g0c7 [ATT] (len=20000, tl=0) 1,2,3,4,1,...
ATTRIB:
  @7fa24f07fa58 02 LISTSXP g0c0 [REF(1)] 
    TAG: @7fa24b803e90 01 SYMSXP g0c0 [MARK,REF(5814),LCK,gp=0x6000] "names" (has value)
    @10a4e4000 16 STRSXP g0c7 [REF(1)] (len=20000, tl=0)
      @7fa24ba575c8 09 CHARSXP g0c1 [MARK,REF(35005),gp=0x61] [ASCII] [cached] "a"
      @7fa24be24428 09 CHARSXP g0c1 [MARK,REF(35010),gp=0x61] [ASCII] [cached] "b"
      @7fa24b806ec0 09 CHARSXP g0c1 [MARK,REF(35082),gp=0x61] [ASCII] [cached] "c"
      @7fa24bcc6af0 09 CHARSXP g0c1 [MARK,REF(35003),gp=0x61] [ASCII] [cached] "d"
      @7fa24ba575c8 09 CHARSXP g0c1 [MARK,REF(35005),gp=0x61] [ASCII] [cached] "a"
      ...

> .Internal(inspect(unname(k[i])))
@10a50c000 14 REALSXP g0c7 [] (len=20000, tl=0) 1,2,3,4,1,...

> .Internal(inspect(x2))
@7fa24fc692d8 14 REALSXP g0c0 [REF(1)]  wrapper [srt=-2147483648,no_na=0]
  @10a228000 14 REALSXP g0c7 [REF(1),ATT] (len=20000, tl=0) 1,2,3,4,1,...
  ATTRIB:
    @7fa24fc69850 02 LISTSXP g0c0 [REF(1)] 
      TAG: @7fa24b803e90 01 SYMSXP g0c0 [MARK,REF(5797),LCK,gp=0x4000] "names" (has value)
      @10a250000 16 STRSXP g0c7 [REF(65535)] (len=20000, tl=0)
        @7fa24ba575c8 09 CHARSXP g0c1 [MARK,REF(10005),gp=0x61] [ASCII] [cached] "a"
        @7fa24be24428 09 CHARSXP g0c1 [MARK,REF(10010),gp=0x61] [ASCII] [cached] "b"
        @7fa24b806ec0 09 CHARSXP g0c1 [MARK,REF(10077),gp=0x61] [ASCII] [cached] "c"
        @7fa24bcc6af0 09 CHARSXP g0c1 [MARK,REF(10003),gp=0x61] [ASCII] [cached] "d"
        @7fa24ba575c8 09 CHARSXP g0c1 [MARK,REF(10005),gp=0x61] [ASCII] [cached] "a"
        ...

If you don't assign the intermediate result, things are simple: R knows there 
are no other references, so the names can simply be removed. However, if you 
assign the result first, that is not possible, because x2 still holds a 
reference at the time when unname() creates its own local variable obj and 
does what most of us would probably write directly, namely names(obj) <- NULL. 
(Calling names(x2) <- NULL yourself avoids the problem, since you then don't 
need both x2 and obj.)
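
A minimal sketch of the two forms side by side, reusing k and i from the 
original report. The size comparison is only stated as "no larger than", 
because whether the extra names actually end up in the serialized output 
depends on the R version.

```r
# Setup taken from the original report
i <- rep(1:4, length.out = 20000)
k <- c(a = 1, b = 2, c = 3, d = 4)

# Replacement form: modifies the object bound to x4 directly,
# so no second variable refers to the named vector
x4 <- k[i]
names(x4) <- NULL

# unname() on an assigned object: x2 and unname()'s local obj
# both refer to the named vector when the names are dropped
x2 <- k[i]
x2 <- unname(x2)

# Both look the same from R code
identical(x4, x2)
is.null(names(x2))

# Serialized sizes in bytes; on affected R versions the second is larger
size_x4 <- length(serialize(x4, NULL))
size_x2 <- length(serialize(x2, NULL))
c(names_assign = size_x4, unname = size_x2)
```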

To be precise, when you use unname() on an assigned object, R technically has 
to keep two copies: one for the existing x2 and a second one in unname() for 
obj, so that it can call names(obj) <- NULL for the modification. To avoid 
that, R instead creates a wrapper around the original x2 which says "like x2, 
but the names are NULL". The rationale is that for large vectors it is better 
to keep a record of metadata changes than to duplicate the object. This way 
the vector is stored only once. However, as you blow away the original x2, all 
that is left is k[i] with the extra information "don't use the names". 
Unfortunately, R cannot know that you will eventually keep only the version 
without the names, at which point it could strip the names since they are no 
longer referenced.
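
One way to check from R code whether an object is such a wrapper is to look at 
the first line of the inspect output. This is only a sketch: is_wrapped() is a 
hypothetical helper, not part of R, and it relies on the undocumented output 
format of .Internal(inspect()), which may change between versions.

```r
# Hypothetical helper (not part of R): TRUE if the first line of the
# inspect() output describes x as a wrapper. Depends on the undocumented
# text format printed by .Internal(inspect()).
is_wrapped <- function(x) {
  out <- utils::capture.output(.Internal(inspect(x)))
  grepl("wrapper", out[1], fixed = TRUE)
}

# Setup taken from the original report
i <- rep(1:4, length.out = 20000)
k <- c(a = 1, b = 2, c = 3, d = 4)

x1 <- unname(k[i])              # no intermediate assignment
x2 <- k[i]; x2 <- unname(x2)    # intermediate assignment

is_wrapped(x1)
is_wrapped(x2)
```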

I'm not sure what the best solution is here. In theory, if the wrapper found 
out that the object it is wrapping has no more references, it could remove the 
names, but I'm sure that would solve only some cases (what if you duplicated 
the wrapper, so that multiple wrappers reference the same object?), and I'm 
not sure whether it has a way to find out. The other option would be to deal 
with it at serialization time, if the situation can be detected there, so that 
the wrapper is removed. 
Since the intersection of serialization experts and ALTREP experts is exactly 
one, I'll leave it to that set to comment further ;).
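
For anyone who wants to measure the effect, here is a small sketch that 
compares the serialized sizes under both formats, again using the example from 
the original report. Format version 2 predates ALTREP, so no wrapper is 
written there; whether version 3 shows a difference depends on the R version 
in use.

```r
# Setup taken from the original report
i <- rep(1:4, length.out = 20000)
k <- c(a = 1, b = 2, c = 3, d = 4)

x1 <- unname(k[i])
x2 <- k[i]; x2 <- unname(x2)

# Bytes written for each object under both serialization formats;
# the result is a matrix with rows v2, v3 and columns x1, x2
sizes <- sapply(
  list(x1 = x1, x2 = x2),
  function(x) c(v2 = length(serialize(x, NULL, version = 2)),
                v3 = length(serialize(x, NULL, version = 3)))
)
sizes
```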

Cheers,
Simon

> On Jul 23, 2020, at 07:29, Pan Domu <konto7628845...@gmail.com> wrote:
> 
> I ran into strange behavior when removing names.
> 
> Two ways of removing names:
> 
>    i <- rep(1:4, length.out=20000)
>    k <- c(a=1, b=2, c=3, d=4)
> 
>    x1 <- unname(k[i])
>    x2 <- k[i]
>    x2 <- unname(x2)
> 
> Are they identical?
> 
>    identical(x1,x2) # TRUE
> 
> but no
> 
>    identical(serialize(x1,NULL),serialize(x2,NULL)) # FALSE
> 
> But the problem is only with serialization version 3, because:
> 
>    identical(serialize(x1,NULL,version = 2),
>              serialize(x2,NULL,version = 2)) # TRUE
> 
> It seems that the second one keeps names somewhere invisibly.
> 
> Some functions can lose them, e.g. head:
> 
>    identical(serialize(head(x1, 20001),NULL),
>              serialize(head(x2, 20001),NULL)) # TRUE
> 
> But saveRDS does not (so files are bigger); the tibble family keeps them,
> while base data.frame seems to drop them.
> 
> From my tests, invisible names are present in the following cases:
> 
>   x1 <- k[i] %>% unname()
>   x3 <- k[i]; x3 <- unname(x3)
>   x5 <- k[i]; x5 <- `names<-`(x5, NULL)
>   x6 <- k[i]; x6 <- unname(x6)
> 
> but not in this one
>   x2 <- unname(k[i])
>   x4 <- k[i]; names(x4) <- NULL
> 
> What kind of magick is that?
> 
> It hit us when we upgraded from 3.5 (when serialization changed) and had an
> impact on parallelization (because serialized objects were bigger).
> 
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
