On Fri, 12 Jan 2024 00:11:45 -0500, Dipterix Wang <dipterix.w...@gmail.com> wrote:

> I wonder how hard it would be to have options to discard source when
> serializing R objects?

> Currently my analyses heavily depend on digest function to generate
> file caches and automatically schedule pipelines (to update cache)
> when changes are detected.

Source references may be the main problem here, but not the only one.
There are also string encodings and function bytecode (which may or may
not be present and probably changes between R versions). I've been
collecting the ways in which objects that are identical() to each other
can serialize() differently in my package 'depcache'; I'm sure I missed
a few.

Admittedly, string encodings are less important nowadays (except on
older Windows and weirdly set up Unix-like systems). Thankfully, the
digest package already knows to skip the serialization header (which
contains the current version of R).

serialize() only knows about basic types [*], and source references are
implemented on top of these as objects of class 'srcref'. Sometimes
they are attached as attributes to other objects, other times (e.g. in
quote(function(){}), [**]) just sitting there as arguments to a call.

Sometimes you can hash the output of deparse(x) instead of serialize(x)
[***]. Text representations aren't without their own problems (e.g.
IEEE floating-point numbers not being representable as decimal
fractions), but at least deparsing both ignores the source references
and punts the encoding problem to the abstraction layer above it:
deparse() is the same for both '\uff' and
iconv('\uff', 'UTF-8', 'latin1'): just "ÿ".

Unfortunately, this doesn't solve the environment problem. For these,
you really need a way to canonicalize the reference-semantics objects
before serializing them without changing the originals, even in cases
like a <- new.env(); b <- new.env(); a$x <- b; b$x <- a. I'm not sure
that reference hooks can help with that.
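
To make the source reference and encoding points concrete, here is a minimal base-R sketch. The names f1, f2, s_utf8 and s_latin1 are invented for this example, and I am assuming that iconv() on your platform can convert to latin1. It only compares the raw serialize() output directly; a real cache would hash it, and this is not how digest or 'depcache' do it internally.

```r
# Same code, different whitespace in the source text. keep.source = TRUE
# makes parse() record source references, as an interactive session would.
f1 <- eval(parse(text = "function(x) x + 1", keep.source = TRUE))
f2 <- eval(parse(text = "function(x)   x  +  1", keep.source = TRUE))

# identical() ignores srcrefs on closures by default (ignore.srcref = TRUE) ...
same_object <- identical(f1, f2)

# ... but serialize() writes the 'srcref' attribute like any other attribute,
# so the byte streams differ.
same_bytes <- identical(serialize(f1, NULL), serialize(f2, NULL))

# deparse() does not use the stored source unless asked to ("useSource"),
# so the text representations agree.
same_text <- identical(deparse(f1), deparse(f2))

# String encodings: the strings compare equal, but the stored bytes and
# the encoding marks differ, and serialize() records both.
s_utf8 <- "\u00ff"
s_latin1 <- iconv(s_utf8, "UTF-8", "latin1")  # assumes latin1 is supported
same_string <- identical(s_utf8, s_latin1)
same_string_bytes <- identical(serialize(s_utf8, NULL),
                               serialize(s_latin1, NULL))

print(c(same_object = same_object, same_bytes = same_bytes,
        same_text = same_text, same_string = same_string,
        same_string_bytes = same_string_bytes))
```

The function bodies deliberately avoid braces, because a `{` call carries additional srcref-related attributes of its own and I did not want the example to depend on how those are treated.
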
To canonicalize environments properly, the fixup process will have to
rely on global state and keep weak references to the environments it
visits and creates shadow copies of.

I think it's not impossible to implement
serialize_to_canonical_representation() for an R package, but it will
be a lot of work to decide which parts are canonical and which should
be discarded.

-- 
Best regards,
Ivan

[*]
https://cran.r-project.org/doc/manuals/R-ints.html#Serialization-Formats

[**]
https://bugs.r-project.org/show_bug.cgi?id=18638

[***]
https://stat.ethz.ch/pipermail/r-devel/2023-March/082505.html

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel