Interesting and strategic topic indeed.

One other point is that reproducibility (and backwards compatibility) is
also very important in the industry. To get acceptance it can really help
if you can easily reproduce results.

Concerning the arguments that I read in this discussion:

- "do it yourself"
The point is to discuss and find the best way forward for the community;
thinking collectively about these general problems can never hurt.
Once a consensus is reached we can think about the resources.

- "don't think the effort is worth it, instead install a specific version
of package" + "new sessionInfoPlus()":
This could work, in the sense of achieving the same result, but not at the
same cost for users: every script writer would have to record the
sessionInfo() output and store it alongside the scripts in repositories.
And before running a script, you would have to install that snapshot of
packages, not to mention installation problems and so on.
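
To make that cost concrete, here is a rough sketch of what each script
author would have to do under this approach. It records the versions of all
loaded namespaces (the same kind of information sessionInfo() prints) in a
file kept next to the script. The function name snapshot_versions() and the
file location are made up for illustration.

```r
# Record name/version pairs for every namespace loaded in this session.
# snapshot_versions() is a hypothetical helper, not part of R.
snapshot_versions <- function(pkgs = loadedNamespaces()) {
  pkgs <- sort(pkgs)
  data.frame(
    package = pkgs,
    version = vapply(pkgs,
                     function(p) as.character(utils::packageVersion(p)),
                     character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

snap <- snapshot_versions()

# Assumption: the snapshot is stored as a CSV next to the script; a
# temporary directory stands in for the script's repository here.
snapshot_file <- file.path(tempdir(), "package-versions.csv")
write.csv(snap, snapshot_file, row.names = FALSE)
```

Restoring such a snapshot later is the harder part: every listed version
would still have to be located and installed before the script can run.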

- "versions automatically at package build time (in DESCRIPTION)":
This does not really solve the problem: if package A is submitted with
dependency B-1.0 and package C with dependency B-2, what do you do?
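
The conflict can be sketched in a few lines. A library holds only one
version of a given package, so exact pins on B must all agree before A and
C can be installed side by side. The pinned list below is hypothetical and
only mirrors the A/B/C example above.

```r
# Hypothetical exact pins: which version of B each package was built against.
pinned_B <- list(A = "1.0", C = "2.0")

# Normalise the version strings, then check whether all pins agree.
required <- unique(vapply(pinned_B,
                          function(v) as.character(package_version(v)),
                          character(1)))

# More than one distinct required version means no single installed B
# can satisfy every package.
conflict <- length(required) > 1L
if (conflict) {
  message("Conflicting pins on B: ", paste(required, collapse = " vs "))
}
```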

- "exact deps versions":
This would put a lot of burden on the developer.

- "I do not want to wait a year to get a new (or updated package)", "access
to bug fixes":

Installed packages are already organized into libraries. By default you have
the library inside the R installation, which contains the base packages plus
those installed by install.packages() if you have the proper permissions, or
the personal library otherwise.
Why not organize these libraries so that:
  - normal CRAN versions associated with the R version get installed
alongside the base packages
  - "critical updates", meaning fixes for important bugs found in the normal
CRAN versions, get installed in a critical/ library
  - additional packages and updated packages go in another library.
This way, using the existing .libPaths() mechanism, or equivalently the
lib.loc argument of library(), one could easily switch between the library
that ensures full compatibility and reproducibility with the R version,
add critical updates, or use the newer or updated packages.
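
A minimal sketch of how the switching could look with .libPaths(). The
directory layout, the mode names and the use_libraries() helper are all
assumptions made for illustration; in particular, the "release" library is
a separate directory here rather than the library inside the R
installation.

```r
# Hypothetical library layout under a temporary root.
root <- file.path(tempdir(), "R-libs")
libs <- c(
  release  = file.path(root, "release"),   # CRAN versions tied to the R version
  critical = file.path(root, "critical"),  # critical bug-fix updates
  updates  = file.path(root, "updates")    # additional and updated packages
)
for (d in libs) dir.create(d, recursive = TRUE, showWarnings = FALSE)

# Libraries listed first are searched first, so the order sets priority.
use_libraries <- function(mode = c("reproducible", "critical", "latest")) {
  mode <- match.arg(mode)
  paths <- switch(mode,
    reproducible = libs[["release"]],
    critical     = libs[c("critical", "release")],
    latest       = libs[c("updates", "critical", "release")]
  )
  .libPaths(unname(paths))   # the system library is always appended by R
  invisible(.libPaths())
}

old_paths <- .libPaths()
active <- use_libraries("latest")
.libPaths(old_paths)         # restore the original search path
```

The same choice can be made per call, without touching the global search
path, by passing the wanted directories as lib.loc to library().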

- new use case.
Here at Quartz Bio we have two architectures, so two R installations for
each R version. It is quite cumbersome to keep them consistent because the
installed version depends on the moment you run install.packages().
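
As an illustration of the bookkeeping involved, this sketch compares the
package inventories of two installations and reports the ones that differ.
The diff_versions() helper, the inventories and the package name pkgX are
made up; on a real machine each inventory could be built from
installed.packages(), for example with
setNames(inv[, "Version"], inv[, "Package"]).

```r
# Compare two named character vectors (package name -> installed version)
# and return the packages that are missing on one side or differ in version.
diff_versions <- function(a, b) {
  pkgs <- sort(union(names(a), names(b)))
  out <- data.frame(
    package   = pkgs,
    version_a = unname(a[pkgs]),   # NA when the package is absent in a
    version_b = unname(b[pkgs]),   # NA when the package is absent in b
    stringsAsFactors = FALSE
  )
  differs <- is.na(out$version_a) | is.na(out$version_b) |
    out$version_a != out$version_b
  out[differs, , drop = FALSE]
}

# Illustrative inventories for the two architectures.
arch1 <- c(XML = "3.98", Rcpp = "0.11.0", pkgX = "1.2")
arch2 <- c(XML = "3.95", Rcpp = "0.11.0")

d <- diff_versions(arch1, arch2)
print(d)
```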

So I second Jeroen's proposal to have a snapshot of package versions
tied to a given R version, well tested together. This implies, as stated
by Herve, keeping all package source versions, and will solve the BioC
reproducibility issue.

Best,
Karl Forner

On Tue, Mar 18, 2014 at 9:24 PM, Jeroen Ooms <jeroen.o...@stat.ucla.edu> wrote:

> This came up again recently with an irreproducible paper. Below is an
> attempt to make a case for extending the r-devel/r-release cycle to
> CRAN packages. These suggestions are not in any way intended as
> criticism of anyone or the status quo.
>
> The proposal described in [1] is to freeze a snapshot of CRAN along
> with every release of R. In this design, updates for contributed
> packages are treated the same as updates for base packages in the sense
> that they are only published to the r-devel branch of CRAN and do not
> affect users of "released" versions of R. Thereby all users, stacks
> and applications using a particular version of R will by default be
> using the identical version of each CRAN package. The bioconductor
> project uses similar policies.
>
> This system has several important advantages:
>
> ## Reproducibility
>
> Currently r/sweave/knitr scripts are unstable because of ambiguity
> introduced by constantly changing cran packages. This causes scripts
> to break or change behavior when upstream packages are updated, which
> makes reproducing old results extremely difficult.
>
> A common counter-argument is that script authors should document
> package versions used in the script using sessionInfo(). However even
> if authors would manually do this, reconstructing the author's
> environment from this information is cumbersome and often nearly
> impossible, because binary packages might no longer be available,
> dependency conflicts, etc. See [1] for a worked example. In practice,
> the current system causes many results or documents generated with R
> not to be reproducible, sometimes already after a few months.
>
> In a system where contributed packages inherit the r-base release
> cycle, scripts will behave the same across users/systems/time within a
> given version of R. This severely reduces ambiguity of R behavior, and
> has the potential of making reproducibility a natural part of the
> language, rather than a tedious exercise.
>
> ## Repository Management
>
> Just like scripts suffer from upstream changes, so do packages
> depending on other packages. A particular package that has been
> developed and tested against the current version of a particular
> dependency is not guaranteed to work against *any future version* of
> that dependency. Therefore, packages inevitably break over time as
> their dependencies are updated.
>
> One recent example is the Rcpp 0.11 release, which required all
> reverse dependencies to be rebuilt/modified. This update caused some
> serious disruption on our production servers. Initially we refrained
> from updating Rcpp on these servers to keep currently installed
> packages depending on Rcpp from breaking. However soon after the
> Rcpp 0.11 release, many other cran packages started to require Rcpp >=
> 0.11, and our users started complaining about not being able to
> install those packages. This resulted in the impossible situation
> where currently installed packages would not work with the new Rcpp,
> but newly installed packages would not work with the old Rcpp.
>
> Current CRAN policies blame this problem on package authors. However
> as is explained in [1], this policy does not solve anything, is
> unsustainable with growing repository size, and sets completely the
> wrong incentives for contributing code. Progress comes with breaking
> changes, and the system should be able to accommodate this. Much of
> the trouble could have been prevented by a system that does not push
> bleeding edge updates straight to end-users, but has a devel branch
> where conflicts are resolved before publishing them in the next
> r-release.
>
> ## Reliability
>
> Another example, this time on a very small scale. We recently
> discovered that R code plotting medal counts from the Sochi Olympics
> generated different results for users on OSX than it did on
> Linux/Windows. After some debugging, we narrowed it down to the XML
> package. The application used the following code to scrape results
> from the Sochi website:
>
> XML::readHTMLTable("http://www.sochi2014.com/en/speed-skating", which=2,
> skip=1)
>
> This code was developed and tested on mac, but results in a different
> winner on windows/linux. This happens because the current version of
> the XML package on CRAN is 3.98, but the latest mac binary is 3.95.
> Apparently this new version of XML introduces a tiny change that
> causes html-table-headers to become colnames, rather than a row in the
> matrix, resulting in different medal counts.
>
> This example illustrates that we should never assume package versions
> to be interchangeable. Any small bugfix release can have side effects
> altering results. It is impossible to protect code against such
> upstream changes using CMD check or unit testing. All R scripts and
> packages are really only developed and tested for a single version of
> their dependencies. Assuming anything else makes results
> untrustworthy, and code unreliable.
>
> ## Summary
>
> Extending the r-release cycle to CRAN seems like a solution that would
> be easy to implement. Package updates simply only get pushed to the
> r-devel branches of cran, rather than r-release and r-release-old.
> This separates development from production/use in a way that is common
> sense in most open source communities. Benefits for R include:
>
> - Regular R users (statisticians, researchers, students, teachers) can
> share their homemade scripts/documents/packages and rely on them to
> work and produce the same results within a given version of R, without
> manual efforts to manage package versions.
>
> - Package authors can publish breaking changes to the devel branch
> without causing major disruption or affecting users and/or
> maintainers. Authors of depending packages have a timeframe to sync
> their package with upstream changes before the next release.
>
> - CRAN maintainers can focus quality control and testing efforts on
> the devel branch around the time of the code freeze. No need for
> crisis management when a package update introduces some severe
> breaking changes. Users of released versions are unaffected.
>
>
> [1] http://journal.r-project.org/archive/2013-1/ooms.pdf
>
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
