Re: [R-pkg-devel] How to store large data to be used in an R package?

2024-03-26 Thread Dirk Eddelbuettel


On 25 March 2024 at 11:12, Jairo Hidalgo Migueles wrote:
| I'm reaching out to seek some guidance regarding the storage of relatively
| large data, ranging from 10-40 MB, intended for use within an R package.
| Specifically, this data consists of regression and random forest models
| crucial for making predictions within our R package.
| 
| Initially, I attempted to save these models as internal data within the
| package. While this approach maintains functionality, it has led to a
| package size exceeding 20 MB. I'm concerned that this would complicate
| submitting the package to CRAN in the future.
| 
| I would greatly appreciate any suggestions or insights you may have on
| alternative methods or best practices for efficiently storing and accessing
| this data within our R package.

Brooke and I wrote a paper on one way of addressing it via a 'data' package
accessible via an Additional_repositories: entry supported by a drat repo.

See https://journal.r-project.org/archive/2017/RJ-2017-026/index.html for the
paper which contains a nice slow walkthrough of all the details.
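
As a rough illustration of the idea (not code taken from the paper), the main package could list the data package in Suggests and check for it at run time. The package name "mypkgdata" and the repository URL below are hypothetical placeholders, and predict_with_models() is an invented example function.

```r
# Hypothetical names: "mypkgdata" is the data-only package and
# "https://example.github.io/drat" is the drat repository hosting it.
data_pkg  <- "mypkgdata"
data_repo <- "https://example.github.io/drat"

# TRUE if the optional data package is installed and can be loaded.
has_data_pkg <- function(pkg = data_pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Build the hint shown to users when the data package is missing.
install_hint <- function(pkg = data_pkg, repo = data_repo) {
  sprintf(
    paste0("Package '%s' is needed for this function. Install it with:\n",
           "  install.packages('%s', repos = '%s', type = 'source')"),
    pkg, pkg, repo
  )
}

# A function in the main package that needs the large models.
predict_with_models <- function(newdata) {
  if (!has_data_pkg()) {
    stop(install_hint(), call. = FALSE)
  }
  # The data package would supply the fitted models here (assumption).
  invisible(NULL)
}
```

With this pattern the main package still installs and passes its checks when the data package is absent; only the functions that need the models stop with an installation hint.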

Dirk

-- 
dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org

__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel


Re: [R-pkg-devel] How to store large data to be used in an R package?

2024-03-25 Thread Ivan Krylov via R-package-devel
On Mon, 25 Mar 2024 11:12:57 +0100
Jairo Hidalgo Migueles wrote:

> Specifically, this data consists of regression and random forest
> models crucial for making predictions within our R package.

Apologies for asking a silly question, but is there a chance that these
models are large by accident (e.g. because an object references a large
environment containing multiple copies of the training dataset)? Or
are there really more than a million weights required to make
predictions?
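
One way to check, sketched here with a toy lm() fit rather than your actual models: compare what object.size() reports with the length of the serialized object, and look at what the environment attached to the model terms contains. The function fit_in_function() and the object `big` are made up for the demonstration.

```r
# Fit a small model inside a function that also holds a large object.
fit_in_function <- function() {
  big <- rnorm(1e6)   # unrelated large object living in the same frame
  d <- data.frame(x = 1:10, y = (1:10) + rnorm(10))
  lm(y ~ x, data = d)
}
fit <- fit_in_function()

# object.size() does not count the contents of referenced environments,
# so the model looks small here.
print(object.size(fit), units = "Kb")

# serialize() does follow them; this is roughly what save() or saveRDS()
# would write before compression.
n_bytes <- length(serialize(fit, NULL))
print(n_bytes)

# The environment captured by the model terms is the function frame,
# which still contains `big`.
captured <- ls(attr(terms(fit), ".Environment"))
print(captured)
```

If the serialized size is far larger than object.size() suggests, a captured environment is a likely culprit.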

> Initially, I attempted to save these models as internal data within
> the package. While this approach maintains functionality, it has led
> to a package size exceeding 20 MB. I'm concerned that this would
> complicate submitting the package to CRAN in the future.

The policy mentions the possibility of having a separate large
data-only package. Since CRAN strives to archive all package versions,
this data-only package will have to be updated as rarely as possible.
You will need to ask CRAN for approval.

If there is a significant amount of core functionality inside your
package that does *not* require the large data (so that it can still
be installed and used without the data), you can publish the data-only
package yourself (e.g. using the 'drat' package), put it in Suggests
and link to it in the Additional_repositories field of your DESCRIPTION.
Alternatively, you can publish the data on Zenodo and offer to download
it on first use. Make sure to (1) use tools::R_user_dir to determine
where to put the files, (2) only download the files after the user
explicitly agrees to it and (3) test as much of your package
functionality as possible without requiring the data to be downloaded.
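
A minimal sketch of points (1) and (2), assuming a package called "mypkg" and a single file "models.rds" (both names are hypothetical, as is the URL you would pass in). The `confirm` argument exists only so that tests can replace the interactive prompt.

```r
# Directory for user-specific package data, as determined by R itself.
model_cache_dir <- function(pkg = "mypkg") {
  tools::R_user_dir(pkg, which = "data")
}

# Ask for explicit consent; never assume "yes" in non-interactive sessions.
ask_user <- function(question) {
  interactive() && isTRUE(utils::askYesNo(question, default = FALSE))
}

# Return the path to the model file, downloading it on first use only
# after the user has agreed.
get_model_file <- function(url, file = "models.rds", pkg = "mypkg",
                           dir = model_cache_dir(pkg), confirm = ask_user) {
  path <- file.path(dir, file)
  if (file.exists(path)) {
    return(path)
  }
  if (!confirm(sprintf("Download model data to '%s'?", dir))) {
    stop("Model data not available: download was declined.", call. = FALSE)
  }
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  utils::download.file(url, path, mode = "wb")
  path
}
```

For point (3), tests and examples can pass a temporary `dir` and a `confirm` function returning FALSE, so nothing is downloaded or written to the user's directories during R CMD check.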

-- 
Best regards,
Ivan



[R-pkg-devel] How to store large data to be used in an R package?

2024-03-25 Thread Jairo Hidalgo Migueles
Dear all,

I'm reaching out to seek some guidance regarding the storage of relatively
large data, ranging from 10-40 MB, intended for use within an R package.
Specifically, this data consists of regression and random forest models
crucial for making predictions within our R package.

Initially, I attempted to save these models as internal data within the
package. While this approach maintains functionality, it has led to a
package size exceeding 20 MB. I'm concerned that this would complicate
submitting the package to CRAN in the future.

I would greatly appreciate any suggestions or insights you may have on
alternative methods or best practices for efficiently storing and accessing
this data within our R package.

Jairo

