Re: [Rd] A demonstrated shortcoming of the R package management system

2023-08-08 Thread Dirk Eddelbuettel


Hi Hadley,

On 8 August 2023 at 08:34, Hadley Wickham wrote:
| Do you think it's worth also/instead considering a fix to S4 to avoid
| this caching issue in future R versions?

That is somewhat orthogonal to my point that "'some uses' of the 20-year-old S4
system (which, as we know, is fairly widely used 'out there') break
deployments", and to the related point that "this is also a PITA for binary distributors".

The existing body of code seems to need some help.

| (This is top of mind for me as we consider the design of S7, and I
| recently made a note to ensure we avoid similar problems there:
| https://github.com/RConsortium/OOP-WG/issues/317)

I haven't followed the S7 repo closely but peek at it every couple of months. It
seems sensible to avoid repeating shortcomings identified elsewhere.

Best,  Dirk

-- 
dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Vectorize library() to improve handling of versioned deps [sprint project proposal]

2023-08-08 Thread Lluís Revilla
Hi Gabriel,

Nice idea! I have encountered this problem several times; while better
management of libraries could probably have avoided the issue, this is an
elegant solution.

How would this proposal handle the arguments mask.ok, include.only and exclude?
For example, in the (edge) case where two packages export a function with the
same name but we want to mask one of them (pkgA::fun, pkgC::fun):
how would that work, for instance, in library(c("pkgA", "pkgC"), mask.ok = c(
pkgC = "fun"))?
Would the names of the character vector be used to decide which one is OK to
mask? (In the example I would expect pkgC::fun to be masked by pkgA::fun, in
the inverse order of loading.)
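For reference, the single-package forms of these controls already exist in
library() (added, if I recall correctly, around R 3.6.0); a minimal runnable
illustration using the recommended package MASS (assumed installed), which a
vectorized library() would need to generalize to several packages at once:

```r
## Attach only one object from a package; the rest of MASS stays unattached.
library(MASS, include.only = "ginv")

exists("ginv")  # TRUE: ginv() is on the search path
exists("lda")   # FALSE: the rest of MASS was not attached
```

mask.ok behaves analogously for a single package: it marks the named objects
as acceptable maskings, which is why the semantics of a named vector such as
c(pkgC = "fun") would need to be pinned down for the vectorized case.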

If this doesn't go ahead, perhaps a function could be implemented (in a package,
if not in base) to check for these situations across the .libPaths()?

Best,

Lluís

On Mon, 7 Aug 2023 at 22:34, Gabriel Becker  wrote:

> Hi All,
>
> This is a proposal for a project which could be worked on during the R
> development Sprint at the end of this month; it was requested that we start
> a discussion here to see what R-core's thoughts on it were before we
> officially add it to the docket.
>
>
> AFAIK, R officially supports both versioned dependencies (almost
> exclusively of the >= version variety) and library paths with more than one
> directory. Further, I believe it at least de facto supports the same
> package being installed in different directories along the lib path. The
> most common case, I'd bet, would be different versions of the same
> library being installed in a site library and in a user's personal library,
> though that's not the only way this can happen.
>
> The combination of these two features, however, can give rise to
> packages which are all correctly installed and all loadable individually,
> but which must be loaded in a particular order when used together, or the
> loading of some of them will fail.
>
> Consider the following dependency structure between packages
>
> pkgA: pkgB (>= 0.5.0)
>
> pkgC: pkgB (>= 0.6.0)
>
> Consider the following multi-libpath setup:
>
> ~/pth1/: pkgA, pkgB [0.5.1]
> ~/pth2/: pkgC, pkgB [0.6.5]
>
> And consider that we have the libpath c("~/pth1/", "~/pth2/").
>
> If we do
>
> library(pkgA)
>
> Things will work great.
>
> Same if we do
>
> library(pkgC)
>
> BUT, if we do
>
> library(pkgA)
> library(pkgC)
>
> pkgC will not be able to be loaded, because an insufficient version of
> pkgB will already be loaded.
>
> I propose that library() be modified to accept a character vector of
> package names; when given one, it would perform the dependency calculations to
> determine how all packages in the vector can be loaded (in the order they
> appear). In the example above, this would mean that if we did
>
> library(c("pkgA", "pkgC"))
>
> It would determine that pkgB version 0.6.5 was needed (or alternatively,
> that version 0.5.1 was insufficient) and use that *when loading the
> dependencies of pkgA*.
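A rough sketch of that resolution step (the function name and structure are
mine, not part of the proposal): for each shared dependency, take the
strictest >= requirement across the packages being loaded, then pick the
first library path whose installed copy satisfies it.

```r
## Sketch only: resolve which installed copy of a dependency satisfies the
## combined version requirements of all packages being loaded together.
resolve_dep <- function(dep, min_versions, lib_paths = .libPaths()) {
  ## min_versions: minimum versions required by each loading package,
  ## e.g. c(pkgA = "0.5.0", pkgC = "0.6.0")
  needed <- max(package_version(min_versions))
  for (lib in lib_paths) {
    desc <- file.path(lib, dep, "DESCRIPTION")
    if (file.exists(desc)) {
      v <- package_version(read.dcf(desc, fields = "Version")[1, 1])
      if (v >= needed) return(list(lib = lib, version = as.character(v)))
    }
  }
  stop("no installed version of ", dep, " satisfies >= ", format(needed))
}
```

In the example above this would select pkgB 0.6.5 from ~/pth2/ even though
~/pth1/ comes first in the lib path.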
>
> The proposal issue for the sprint itself is here:
> https://github.com/r-devel/r-project-sprint-2023/discussions/15
>
> Thoughts?
>
> ~G
>



Re: [Rd] A demonstrated shortcoming of the R package management system

2023-08-08 Thread Hadley Wickham
Hi Dirk,

Do you think it's worth also/instead considering a fix to S4 to avoid
this caching issue in future R versions?

(This is top of mind for me as we consider the design of S7, and I
recently made a note to ensure we avoid similar problems there:
https://github.com/RConsortium/OOP-WG/issues/317)

Hadley

On Sun, Aug 6, 2023 at 4:05 PM Dirk Eddelbuettel  wrote:
>
>
> CRAN, by relying on the powerful package management system that is part of R,
> provides an unparalleled framework for extending R with nearly 20k packages.
>
> We recently encountered an issue that highlights a missing element in the
> otherwise outstanding package management system. So we would like to start a
> discussion about enhancing its feature set. As shown below, a mechanism to
> force reinstallation of packages may be needed.
>
> A demo is included below; it is reproducible in a container. We find the
> easiest/fastest reproduction is to save the code snippet below in the
> current directory as e.g. 'matrixIssue.R' and have it run in a container as
>
>docker run --rm -ti -v `pwd`:/mnt rocker/r2u Rscript /mnt/matrixIssue.R
>
> This runs in under two minutes: it first installs the older Matrix, next
> installs SeuratObject, and then, by removing the older Matrix, makes the
> (already installed) current Matrix version the default. This simulates a
> package update for Matrix. Which, as the final snippet demonstrates, silently
> breaks SeuratObject, as the cached S4 method Csparse_validate is now missing.
> So SeuratObject, installed under Matrix 1.5.1, becomes unusable under
> Matrix 1.6.0.
>
> What this shows is that a call to update.packages() will silently corrupt an
> existing installation.  We understand that this was known and addressed at
> CRAN by rebuilding all binary packages (for macOS and Windows).
>
> But it leaves both users relying on source installation as well as
> distributors of source packages in a dire situation. It hurt me three times:
> my default R installation was affected, with unit tests (involving
> SeuratObject) silently failing. It similarly broke our CI setup at work.  And
> it created a fairly bad headache for the Debian packaging I am involved with
> (and I surmise it affects other distros similarly).
>
> It would be good to have a mechanism where a package, when being upgraded,
> could flag that 'more actions are required' by the system (administrator).
> We think this example demonstrates that we need such a mechanism to avoid
> (silently !!) breaking existing installations, possibly by forcing
> reinstallation of other packages.  R knows the package dependency graph and
> could trigger this, possibly after an 'opt-in' variable the user / admin
> sets.
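As a sketch of what such a trigger could do (an assumption on my part, not
part of the proposal above): R can already enumerate the installed reverse
dependencies of an upgraded package, which would be the candidates for forced
reinstallation.

```r
## Sketch: after upgrading a package such as Matrix, list installed packages
## that depend on it and could therefore carry stale caches; reinstalling
## them would be the 'more actions are required' step.
stale_candidates <- function(pkg, lib.loc = NULL) {
  inst <- utils::installed.packages(lib.loc = lib.loc)
  revdeps <- tools::dependsOnPkgs(pkg, recursive = FALSE, installed = inst)
  intersect(revdeps, rownames(inst))
}

## install.packages(stale_candidates("Matrix"))  # forced reinstallation
```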
>
> One possibility may be to add a new (versioned) field 'Breaks:'. Matrix could
> then have added 'Breaks: SeuratObject (<= 4.1.3)' preventing an installation
> of Matrix 1.6.0 when SeuratObject 4.1.3 (or earlier) is present, but
> permitting an update to Matrix 1.6.0 alongside a new version, say, 4.1.4 of
> SeuratObject which could itself have a versioned Depends: Matrix (>= 1.6.0).
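The check itself would be straightforward; a sketch follows, where the
'Breaks:' field and its parsed form are hypothetical (no such field exists in
R's DESCRIPTION format today):

```r
## Hypothetical check for a versioned 'Breaks:' field against the set of
## installed packages.
breaks_violated <- function(breaks_spec, installed = utils::installed.packages()) {
  ## breaks_spec: parsed constraints, e.g.
  ## list(SeuratObject = list(op = "<=", version = "4.1.3"))
  vapply(names(breaks_spec), function(pkg) {
    if (!pkg %in% rownames(installed)) return(FALSE)
    have <- package_version(installed[pkg, "Version"])
    spec <- breaks_spec[[pkg]]
    do.call(spec$op, list(have, package_version(spec$version)))
  }, logical(1))
}

## An installer would refuse (or warn) when any entry comes back TRUE.
```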
>
> Regards,  Dirk
>
>
> ## Code example follows. Recommended to run the rocker/r2u container.
> ## Could also run 'apt update -qq; apt upgrade -y' but not required
> ## Thanks to my colleague Paul Hoffman for the core of this example
>
> ## now have Matrix 1.6.0 because r2u and CRAN remain current, but we can
> ## install an older Matrix
> remotes::install_version('Matrix', '1.5.1')
>
> ## we can confirm that we have Matrix 1.5.1
> packageVersion("Matrix")
>
> ## we now install SeuratObject from source; to speed things up we first
> ## install the binary
> install.packages("SeuratObject")   # in this container via bspm/r2u as binary
> ## and then force a source installation (turning bspm off) _while Matrix is
> ## at 1.5.1_
> if (requireNamespace("bspm", quietly = TRUE)) bspm::disable()
> Sys.setenv(PKG_CXXFLAGS='-Wno-ignored-attributes')  # Eigen compilation noise silencer
> install.packages('SeuratObject')
>
> ## we now remove the Matrix package version 1.5.1 we installed into
> ## /usr/local, leaving 1.6.0
> remove.packages("Matrix")
> packageVersion("Matrix")
>
> ## and we now run a bit of SeuratObject code that is now broken as
> ## Csparse_validate is gone
> suppressMessages(library(SeuratObject))
> data('pbmc_small')
> graph <- pbmc_small[['RNA_snn']]
> class(graph)
> getClass('Graph')
> show(graph) # this fails
>
>
> --
> dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org
>



-- 
http://hadley.nz



Re: [Rd] feature request: optim() iteration of functions that return multiple values

2023-08-08 Thread J C Nash

But why time methods that the author (me!) has been telling the community for
years have updated replacements? Especially as optimx::optimr() uses the same
syntax as optim() and gives access to a number of solvers, both production and
didactic. This set of solvers is being improved or added to regularly, with a
major renewal almost complete (for the adventurous, the code is on
https://github.com/nashjc/optimx).

Note also that the default Nelder-Mead is good for exploring a function surface
and is quite robust at getting quickly into the region of a minimum, but can be
quite poor in "finishing" the process. Tools have different strengths and
weaknesses. optim() was more or less state of the art a couple of decades ago,
but there are other choices now.
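That division of labour can be shown with base optim() alone (a sketch; with
optimx installed, optimr() could be substituted line for line, given the
same-syntax note above):

```r
## Explore with Nelder-Mead, then "finish" from its answer with a
## gradient-based method.
fr <- function(x) 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2  # Rosenbrock

rough  <- optim(c(-1.2, 1), fr, method = "Nelder-Mead")  # robust exploration
polish <- optim(rough$par, fr, method = "BFGS")          # accurate finishing

## polish$par is now very close to the true minimum at c(1, 1)
```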

JN

On 2023-08-08 05:14, Sami Tuomivaara wrote:

Thank you all very much for the suggestions; after testing, each of them would
be a viable solution in certain contexts.  Code for benchmarking:

# preliminaries
install.packages("microbenchmark")
library(microbenchmark)


data <- new.env()
data$ans2 <- 0
data$ans3 <- 0
data$i <- 0
data$fun.value <- numeric(1000)

# define functions

rosenbrock_env <- function(x, data)
{
   x1 <- x[1]
   x2 <- x[2]
   ans <- 100 * (x2 - x1 * x1)^2 + (1 - x1)^2
   ans2 <- ans^2
   ans3 <- sqrt(abs(ans))
   data$i <- data$i + 1
   data$fun.value[data$i] <- ans
   ans
}


rosenbrock_env2 <- function(x, data)
{
   x1 <- x[1]
   x2 <- x[2]
   ans <- 100 * (x2 - x1 * x1)^2 + (1 - x1)^2
   ans2 <- ans^2
   ans3 <- sqrt(abs(ans))
   data$ans2 <- ans2
   data$ans3 <- ans3
   ans
}

rosenbrock_attr <- function(x)
{
   x1 <- x[1]
   x2 <- x[2]
   ans <- 100 * (x2 - x1 * x1)^2 + (1 - x1)^2
   ans2 <- ans^2
   ans3 <- sqrt(abs(ans))
   attr(ans, "ans2") <- ans2
   attr(ans, "ans3") <- ans3
   ans
}


rosenbrock_extra <- function(x, extraInfo = FALSE)
{
   x1 <- x[1]
   x2 <- x[2]
   ans <- 100 * (x2 - x1 * x1)^2 + (1 - x1)^2
   ans2 <- ans^2
   ans3 <- sqrt(abs(ans))
   if (extraInfo) list(ans = ans, ans2 = ans2, ans3 = ans3)
   else ans
}


rosenbrock_all <- function(x)
{
   x1 <- x[1]
   x2 <- x[2]
   ans <- 100 * (x2 - x1 * x1)^2 + (1 - x1)^2
   ans2 <- ans^2
   ans3 <- sqrt(abs(ans))
   list(ans = ans, ans2 = ans2, ans3 = ans3)
}

returnFirst <- function(fun) function(...) do.call(fun,list(...))[[1]]
rosenbrock_all2 <- returnFirst(rosenbrock_all)


# benchmark all functions
set.seed(100)

microbenchmark(env = optim(c(-1,2), rosenbrock_env, data = data),
env2 = optim(c(-1,2), rosenbrock_env2, data = data),
attr = optim(c(-1,2), rosenbrock_attr),
extra = optim(c(-1,2), rosenbrock_extra, extraInfo = FALSE),
all2 = optim(c(-1,2), rosenbrock_all2),
times = 100)


# correct parameters and return values?
env <- optim(c(-1,2), rosenbrock_env, data = data)
env2 <- optim(c(-1,2), rosenbrock_env2, data = data)
attr <- optim(c(-1,2), rosenbrock_attr)
extra <- optim(c(-1,2), rosenbrock_extra, extraInfo = FALSE)
all2 <- optim(c(-1,2), rosenbrock_all2)

# correct return values with optimized parameters?
env. <- rosenbrock_env(env$par, data)
env2. <- rosenbrock_env2(env2$par, data)
attr. <- rosenbrock_attr(attr$par)
extra. <- rosenbrock_extra(extra$par, extraInfo = FALSE)
all2. <- rosenbrock_all2(all2$par)

# functions that return more than one value
all. <- rosenbrock_all(all2$par)
extra2. <- rosenbrock_extra(extra$par, extraInfo = TRUE)

# environment values correct?
data$ans2
data$ans3
data$i
data$fun.value


microbenchmarking results:

Unit: microseconds
  expr     min        lq      mean    median         uq       max neval
   env 644.102 3919.6010 9598.3971 7950.0005 15582.8515 42210.900   100
  env2 337.001  351.5510  479.2900  391.7505   460.3520  6900.800   100
  attr 350.201  367.3010  502.0319  409.7510   483.6505  6772.800   100
 extra 276.800  287.2010  402.4231  302.6510   371.5015  6457.201   100
  all2 630.801  646.9015  785.9880  678.0010   808.9510  6411.102   100

The rosenbrock_env and _env2 functions differ in that _env accesses vectors in
the defined environment by indexing, whereas _env2 doesn't (hope I interpreted
this right?).  This appears to be an expensive operation, but it allows saving
values during the steps of the optim iteration, rather than just at
convergence.  Overall, _extra has consistently the lowest median execution time!

My earlier workaround was to write two separate functions, one of which returns
the extra values; all suggested approaches simplify that considerably.  I am
also now more educated about attributes and environments, which I did not know
how to utilize before and which proved to be very useful concepts.  Again,
thank you everyone for your input!


