Re: [Rd] A demonstrated shortcoming of the R package management system
Hi Hadley,

On 8 August 2023 at 08:34, Hadley Wickham wrote:
| Do you think it's worth also/instead considering a fix to S4 to avoid
| this caching issue in future R versions?

That is somewhat orthogonal to my point of "'some uses' of the 20 year old
S4 system (which as we know is fairly widely used 'out there') break
deployments" and the related "this is also a PITA for binary distributors".
The existing body of code seems to need some help.

| (This is top of mind for me as we consider the design of S7, and I
| recently made a note to ensure we avoid similar problems there:
| https://github.com/RConsortium/OOP-WG/issues/317)

I haven't followed the S7 repo closely but peek every couple of months. It
seems sensible to avoid repeating shortcomings identified elsewhere.

Best, Dirk

--
dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Vectorize library() to improve handling of versioned deps [sprint project proposal]
Hi Gabriel,

Nice idea! I have encountered this problem several times; while better
management of libraries could probably have avoided the issue, this is an
elegant solution.

How would this proposal handle the arguments mask.ok, include.only and
exclude? For example, in the (edge) case where two packages export a
function with the same name but we want to mask one of them (pkgA::fun,
pkgC::fun), how would that work, for instance, in
library(c("pkgA", "pkgC"), mask.ok = c(pkgC = "fun"))? Would the names of
the character vector be used to decide which one is ok to mask? (In the
example I would expect pkgC::fun to be masked by pkgA::fun, in the inverse
order of loading.)

If this doesn't go ahead, perhaps a function could be implemented to check
for these situations in the .libPaths() (in a package if not in base)?

Best,
Lluís

On Mon, 7 Aug 2023 at 22:34, Gabriel Becker wrote:
> Hi All,
>
> This is a proposal for a project which could be worked on during the R
> development sprint at the end of this month; it was requested that we
> start a discussion here to see what R-core's thoughts on it were before
> we officially add it to the docket.
>
> AFAIK, R officially supports both versioned dependencies (almost
> exclusively of the >= version variety) and library paths with more than
> one directory. Further, I believe it at least de facto supports the same
> package being installed in different directories along the lib path. The
> most common of these, I'd bet, would be different versions of the same
> package being installed in a site library and in a user's personal
> library, though that's not the only way this can happen.
>
> The combination of these two features, however, can give rise to
> packages which are all correctly installed and all loadable
> individually, but which must be loaded in a particular order when used
> together, or the loading of some of them will fail.
>
> Consider the following dependency structure between packages:
>
> pkgA: pkgB (>= 0.5.0)
> pkgC: pkgB (>= 0.6.0)
>
> Consider the following multi-libpath setup:
>
> ~/pth1/: pkgA, pkgB [0.5.1]
> ~/pth2/: pkgC, pkgB [0.6.5]
>
> And consider that we have the libpath c("~/pth1/", "~/pth2").
>
> If we do
>
> library(pkgA)
>
> things will work great. Same if we do
>
> library(pkgC)
>
> BUT, if we do
>
> library(pkgA)
> library(pkgC)
>
> pkgC will not be able to be loaded, because an insufficient version of
> pkgB will already be loaded.
>
> I propose that library() be modified to be able to take a character
> vector of package names; when it does, it performs the dependency
> calculations to determine how all packages in the vector can be loaded
> (in the order they appear). In the example above, this would mean that
> if we did
>
> library(c("pkgA", "pkgC"))
>
> it would determine that pkgB version 0.6.5 was needed (or alternatively,
> that version 0.5.1 was insufficient) and use that *when loading the
> dependencies of pkgA*.
>
> The proposal issue for the sprint itself is here:
> https://github.com/r-devel/r-project-sprint-2023/discussions/15
>
> Thoughts?
>
> ~G

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
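The version-resolution step in Gabriel's proposal can be sketched in a few lines. This is an illustrative sketch only, not how library() works today: the helper names resolve_min_version() and find_suitable_lib() are hypothetical, and the installed-package table is mocked rather than read from .libPaths().

```r
## Sketch: given several '>=' requirements on one dependency, the
## strictest (largest) minimum version must win; then we look for a
## libpath whose copy of the dependency satisfies it.
resolve_min_version <- function(reqs) {
  ## reqs: character vector of minimum versions, e.g. c("0.5.0", "0.6.0")
  as.character(max(numeric_version(reqs)))
}

## Which library path holds a copy of 'pkg' satisfying 'minver'?
## 'installed' mocks what installed.packages() would report per libpath.
find_suitable_lib <- function(pkg, minver, installed) {
  ok <- vapply(installed, function(libs)
    pkg %in% names(libs) && numeric_version(libs[[pkg]]) >= minver,
    logical(1))
  names(installed)[ok][1]  # first match in .libPaths() order, NA if none
}

## Gabriel's example: pkgA needs pkgB >= 0.5.0, pkgC needs pkgB >= 0.6.0
need <- resolve_min_version(c("0.5.0", "0.6.0"))
installed <- list("~/pth1" = c(pkgA = "1.0", pkgB = "0.5.1"),
                  "~/pth2" = c(pkgC = "1.0", pkgB = "0.6.5"))
need                                        # "0.6.0"
find_suitable_lib("pkgB", need, installed)  # "~/pth2"
```

The key point the sketch makes concrete: resolving across the whole vector of packages up front selects the pkgB in ~/pth2, whereas sequential library() calls would latch onto the 0.5.1 copy in ~/pth1 first.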
Re: [Rd] A demonstrated shortcoming of the R package management system
Hi Dirk,

Do you think it's worth also/instead considering a fix to S4 to avoid this
caching issue in future R versions?

(This is top of mind for me as we consider the design of S7, and I
recently made a note to ensure we avoid similar problems there:
https://github.com/RConsortium/OOP-WG/issues/317)

Hadley

On Sun, Aug 6, 2023 at 4:05 PM Dirk Eddelbuettel wrote:
>
> CRAN, by relying on the powerful package management system that is part
> of R, provides an unparalleled framework for extending R with nearly 20k
> packages.
>
> We recently encountered an issue that highlights a missing element in
> the otherwise outstanding package management system. So we would like to
> start a discussion about enhancing its feature set. As shown below, a
> mechanism to force reinstallation of packages may be needed.
>
> A demo is included below; it is reproducible in a container. We find the
> easiest/fastest reproduction is saving the code snippet below in the
> current directory as e.g. 'matrixIssue.R' and having it run in a
> container as
>
>    docker run --rm -ti -v `pwd`:/mnt rocker/r2u Rscript /mnt/matrixIssue.R
>
> This runs in under two minutes, first installing the older Matrix, next
> installing SeuratObject, and then, by removing the older Matrix, making
> the (already installed) current Matrix version the default. This
> simulates a package update for Matrix. Which, as the final snippet
> demonstrates, silently breaks SeuratObject as the cached S4 method
> Csparse_validate is now missing. So while SeuratObject was installed
> under Matrix 1.5.1, it becomes unusable under Matrix 1.6.0.
>
> What this shows is that a call to update.packages() will silently
> corrupt an existing installation. We understand that this was known and
> addressed at CRAN by rebuilding all binary packages (for macOS and
> Windows).
>
> But it leaves both users relying on source installation as well as
> distributors of source packages in a dire situation. It hurt me three
> times: my default R installation was affected, with unit tests
> (involving SeuratObject) silently failing. It similarly broke our CI
> setup at work. And it created a fairly bad headache for the Debian
> packaging I am involved with (and I surmise it affects other distros
> similarly).
>
> It would be good to have a mechanism where a package, when being
> upgraded, could flag that 'more actions are required' by the system
> (administrator). We think this example demonstrates that we need such a
> mechanism to avoid (silently !!) breaking existing installations,
> possibly by forcing reinstallation of other packages. R knows the
> package dependency graph and could trigger this, possibly after an
> 'opt-in' variable the user / admin sets.
>
> One possibility may be to add a new (versioned) field 'Breaks:'. Matrix
> could then have added 'Breaks: SeuratObject (<= 4.1.3)', preventing an
> installation of Matrix 1.6.0 when SeuratObject 4.1.3 (or earlier) is
> present, but permitting an update to Matrix 1.6.0 alongside a new
> version, say, 4.1.4 of SeuratObject, which could itself have a versioned
> Depends: Matrix (>= 1.6.0).
>
> Regards, Dirk
>
> ## Code example follows. Recommended to run in the rocker/r2u container.
> ## Could also run 'apt update -qq; apt upgrade -y' but not required
> ## Thanks to my colleague Paul Hoffman for the core of this example
>
> ## we now have Matrix 1.6.0 because r2u and CRAN remain current, but we
> ## can install an older Matrix
> remotes::install_version('Matrix', '1.5.1')
>
> ## we can confirm that we have Matrix 1.5.1
> packageVersion("Matrix")
>
> ## we now install SeuratObject from source; to speed things up we first
> ## install the binary
> install.packages("SeuratObject")  # in this container via bspm/r2u as binary
> ## and then force a source installation (turning bspm off) _while Matrix
> ## is at 1.5.1_
> if (requireNamespace("bspm", quietly = TRUE)) bspm::disable()
> Sys.setenv(PKG_CXXFLAGS = '-Wno-ignored-attributes')  # Eigen compilation noise silencer
> install.packages('SeuratObject')
>
> ## we now remove the Matrix 1.5.1 we installed into /usr/local, leaving 1.6.0
> remove.packages("Matrix")
> packageVersion("Matrix")
>
> ## and we now run a bit of SeuratObject code that is now broken as
> ## Csparse_validate is gone
> suppressMessages(library(SeuratObject))
> data('pbmc_small')
> graph <- pbmc_small[['RNA_snn']]
> class(graph)
> getClass('Graph')
> show(graph)  # this fails
>
> --
> dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org

--
http://hadley.nz

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
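The 'Breaks:' field Dirk proposes does not exist in R's DESCRIPTION format today, but its install-time check is easy to sketch. The helper names parse_breaks() and breaks_conflict() below are purely illustrative, and only the simple "pkg (op version)" form is parsed:

```r
## Sketch of enforcing a hypothetical 'Breaks:' field at install time.
## parse_breaks() handles a single constraint like "SeuratObject (<= 4.1.3)".
parse_breaks <- function(field) {
  m <- regmatches(field,
                  regexec("^\\s*(\\S+)\\s*\\((<=|<|==)\\s*([0-9.-]+)\\)\\s*$",
                          field))[[1]]
  list(pkg = m[2], op = m[3], ver = numeric_version(m[4]))
}

## TRUE if an installed package falls in the broken range, meaning the
## upgrade should be refused (or the broken package reinstalled first).
breaks_conflict <- function(field, installed) {
  ## installed: named character vector of installed package versions
  b <- parse_breaks(field)
  if (!b$pkg %in% names(installed)) return(FALSE)
  do.call(b$op, list(numeric_version(installed[[b$pkg]]), b$ver))
}

## Matrix 1.6.0 declaring 'Breaks: SeuratObject (<= 4.1.3)':
breaks_conflict("SeuratObject (<= 4.1.3)",
                c(SeuratObject = "4.1.3"))  # TRUE: block the Matrix upgrade
breaks_conflict("SeuratObject (<= 4.1.3)",
                c(SeuratObject = "4.1.4"))  # FALSE: safe to upgrade Matrix
```

This mirrors the semantics of Debian's own Breaks: field that the proposal alludes to: the check is against what is currently installed, not against declared dependencies.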
Re: [Rd] feature request: optim() iteration of functions that return multiple values
But why time methods that the author (me!) has been telling the community
for years have updates? Especially as optimx::optimr() uses the same
syntax as optim() and gives access to a number of solvers, both production
and didactic. This set of solvers is being improved or added to regularly,
with a major renewal almost complete (for the adventurous, the code is on
https://github.com/nashjc/optimx).

Note also that the default Nelder-Mead is good for exploring a function
surface and is quite robust at getting quickly into the region of a
minimum, but can be quite poor at "finishing" the process. Tools have
different strengths and weaknesses. optim() was more or less state of the
art a couple of decades ago, but there are other choices now.

JN

On 2023-08-08 05:14, Sami Tuomivaara wrote:

Thank you all very much for the suggestions; after testing, each of them
would be a viable solution in certain contexts.

Code for benchmarking:

# preliminaries
install.packages("microbenchmark")
library(microbenchmark)

data <- new.env()
data$ans2 <- 0
data$ans3 <- 0
data$i <- 0
data$fun.value <- numeric(1000)

# define functions
rosenbrock_env <- function(x, data) {
  x1 <- x[1]
  x2 <- x[2]
  ans <- 100 * (x2 - x1 * x1)^2 + (1 - x1)^2
  ans2 <- ans^2
  ans3 <- sqrt(abs(ans))
  data$i <- data$i + 1
  data$fun.value[data$i] <- ans
  ans
}

rosenbrock_env2 <- function(x, data) {
  x1 <- x[1]
  x2 <- x[2]
  ans <- 100 * (x2 - x1 * x1)^2 + (1 - x1)^2
  ans2 <- ans^2
  ans3 <- sqrt(abs(ans))
  data$ans2 <- ans2
  data$ans3 <- ans3
  ans
}

rosenbrock_attr <- function(x) {
  x1 <- x[1]
  x2 <- x[2]
  ans <- 100 * (x2 - x1 * x1)^2 + (1 - x1)^2
  ans2 <- ans^2
  ans3 <- sqrt(abs(ans))
  attr(ans, "ans2") <- ans2
  attr(ans, "ans3") <- ans3
  ans
}

rosenbrock_extra <- function(x, extraInfo = FALSE) {
  x1 <- x[1]
  x2 <- x[2]
  ans <- 100 * (x2 - x1 * x1)^2 + (1 - x1)^2
  ans2 <- ans^2
  ans3 <- sqrt(abs(ans))
  if (extraInfo) list(ans = ans, ans2 = ans2, ans3 = ans3) else ans
}

rosenbrock_all <- function(x) {
  x1 <- x[1]
  x2 <- x[2]
  ans <- 100 * (x2 - x1 * x1)^2 + (1 - x1)^2
  ans2 <- ans^2
  ans3 <- sqrt(abs(ans))
  list(ans = ans, ans2 = ans2, ans3 = ans3)
}

returnFirst <- function(fun) function(...) do.call(fun, list(...))[[1]]
rosenbrock_all2 <- returnFirst(rosenbrock_all)

# benchmark all functions
set.seed(100)
microbenchmark(env   = optim(c(-1, 2), rosenbrock_env, data = data),
               env2  = optim(c(-1, 2), rosenbrock_env2, data = data),
               attr  = optim(c(-1, 2), rosenbrock_attr),
               extra = optim(c(-1, 2), rosenbrock_extra, extraInfo = FALSE),
               all2  = optim(c(-1, 2), rosenbrock_all2),
               times = 100)

# correct parameters and return values?
env   <- optim(c(-1, 2), rosenbrock_env, data = data)
env2  <- optim(c(-1, 2), rosenbrock_env2, data = data)
attr  <- optim(c(-1, 2), rosenbrock_attr)
extra <- optim(c(-1, 2), rosenbrock_extra, extraInfo = FALSE)
all2  <- optim(c(-1, 2), rosenbrock_all2)

# correct return values with optimized parameters?
env.   <- rosenbrock_env(env$par, data)
env2.  <- rosenbrock_env(env2$par, data)
attr.  <- rosenbrock_attr(attr$par)
extra. <- rosenbrock_extra(extra$par, extraInfo = FALSE)
all2.  <- rosenbrock_all2(all2$par)

# functions that return more than one value
all.    <- rosenbrock_all(all2$par)
extra2. <- rosenbrock_extra(extra$par, extraInfo = TRUE)

# environment values correct?
data$ans2
data$ans3
data$i
data$fun.value

microbenchmarking results:

Unit: microseconds
  expr     min        lq      mean    median         uq       max neval
   env 644.102 3919.6010 9598.3971 7950.0005 15582.8515 42210.900   100
  env2 337.001  351.5510  479.2900  391.7505   460.3520  6900.800   100
  attr 350.201  367.3010  502.0319  409.7510   483.6505  6772.800   100
 extra 276.800  287.2010  402.4231  302.6510   371.5015  6457.201   100
  all2 630.801  646.9015  785.9880  678.0010   808.9510  6411.102   100

The rosenbrock_env and _env2 functions differ in that _env accesses
vectors in the supplied environment by indexing, whereas _env2 doesn't
(hope I interpreted this right?). This appears to be an expensive
operation, but it allows saving values during the steps of the optim
iteration, rather than just at convergence. Overall, _extra consistently
has the lowest median execution time!

My earlier workaround was to write two separate functions, one of which
returns the extra values; all the suggested approaches simplify that
considerably. I am also now more educated about attributes and
environments, which I did not know how to utilize before and which proved
to be very useful concepts. Again, thank you everyone for your input!

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
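The environment-based pattern discussed in the thread reduces to a minimal sketch: optim() only ever sees the scalar objective value, but because the function also writes into an environment passed as an extra argument, every evaluation along the optimization path can be inspected afterwards. Names here (trace, rosenbrock) are chosen for illustration.

```r
## Minimal sketch of the environment trick: a counter and a trace vector
## stored in an environment survive optim()'s repeated calls.
trace <- new.env()
trace$i <- 0
trace$fun.value <- numeric(0)

rosenbrock <- function(x, env) {
  ans <- 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
  env$i <- env$i + 1
  env$fun.value[env$i] <- ans  # grows the trace by one entry per call
  ans
}

## extra named arguments to optim() are forwarded to the objective
fit <- optim(c(-1, 2), rosenbrock, env = trace)

fit$par                    # near c(1, 1), the Rosenbrock minimum
trace$i                    # number of objective evaluations optim() made
tail(trace$fun.value, 3)   # last few objective values on the path
stopifnot(trace$i == fit$counts[["function"]])
```

The same mechanism underlies rosenbrock_env above; the per-call trace is what the attribute- and list-returning variants cannot give, since their extra values are only available for whichever x you call them with after convergence.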