Dear Rafa, 

AWS is a good option, and we are very satisfied with Zenodo for data that can 
be made publicly available.

See the corpus_install() function in the 'cwbtools' package I maintain 
(https://github.com/PolMine/cwbtools/blob/master/R/corpus.R). It offers 
download options from both AWS and Zenodo, e.g. to download / install the 
(~ 1 GB) GermaParl corpus: https://doi.org/10.5281/zenodo.3742113
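
In practice, installing GermaParl from Zenodo boils down to something like 
this (please check ?corpus_install for the exact arguments of the current 
release):

    install.packages("cwbtools")
    # fetch and install the corpus via its Zenodo DOI
    cwbtools::corpus_install(doi = "10.5281/zenodo.3742113")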

Zenodo is easy to use and has the great advantage that a DOI is assigned 
automatically. AWS is our option for restricted data. Yet managing access 
rights appropriately with AWS is not easy. Users often need assistance to 
create credential files. 
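
Typically this means helping them set up a ~/.aws/credentials file along 
these lines, with the placeholder values replaced by the keys we hand out:

    [default]
    aws_access_key_id = YOUR_ACCESS_KEY
    aws_secret_access_key = YOUR_SECRET_KEY

or, equivalently, setting the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY 
environment variables, which clients such as the aws.s3 package pick up.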

Kind regards
Andreas 




On 16.02.22, 03:55, "R-package-devel on behalf of Ayala Hernandez, 
Rafael" <r-package-devel-boun...@r-project.org on behalf of 
r.ayal...@imperial.ac.uk> wrote:

    Dear all,

    I am currently trying to think of the best way to distribute large sets of 
coefficients required by my package asteRisk.

    At the moment, I am using an accessory data package, asteRiskData, 
available from a drat repository, that bundles all of the required coefficients 
already parsed and stored as R objects.
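
    (For reference, updating that repository is essentially one call to 
drat::insertPackage() on the built source tarball, roughly as follows, with 
the file name and repository path being placeholders:

        drat::insertPackage("asteRiskData_1.0.0.tar.gz", repodir = "path/to/drat")
    )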

    However, as my package grows, the amount of data required is also growing. 
This has made the size of asteRiskData grow larger, reaching 99.99 MB at the 
moment, which is at the limit of what can be uploaded to GitHub. Since the 
source package must be uploaded as a single .tar.gz file for the drat 
repository, I see no easy workaround other than splitting it into multiple 
accessory data packages.

    I believe this option could become rather troublesome in the future if the 
number of accessory data packages grows too large.

    So I would like to ask: is there any recommended procedure for distributing 
such large data files? 

    Another option that has been suggested to me is not to use an accessory 
data package at all, but instead to download and parse the required data on 
demand from the corresponding internet resources, store them locally, and have 
future sessions load the local copies. Download and parsing would then happen 
only once (or perhaps once in a while, if the associated resource is updated), 
rather than in every R session. However, this would leave files of relatively 
large size (several tens of MB) scattered in the local environment of users, 
instead of having them all centralized in the accessory data package. Is this 
option acceptable as well?
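
    In sketch form, what I have in mind is roughly the following, with 
parse_coefficients() standing in for the actual parser and tools::R_user_dir() 
requiring R >= 4.0.0:

        get_coefficients <- function(url, force = FALSE) {
            # per-user cache directory for the package
            cache_dir <- tools::R_user_dir("asteRisk", which = "cache")
            if (!dir.exists(cache_dir)) dir.create(cache_dir, recursive = TRUE)
            cache_file <- file.path(cache_dir, paste0(basename(url), ".rds"))
            if (force || !file.exists(cache_file)) {
                raw <- tempfile()
                download.file(url, destfile = raw, mode = "wb")
                saveRDS(parse_coefficients(raw), cache_file)  # hypothetical parser
            }
            readRDS(cache_file)
        }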

    Thanks a lot in advance for any insights

    Best wishes,

    Rafa

______________________________________________
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel
