Disk space is unquestionably an issue for repositories like CRAN or Bioconductor that host all R packages. I would be surprised if it matters to individual users of a single package in a world where a 4 TB external hard drive costs about US$100.

On 1/23/2023 7:23 AM, Kern, Lori wrote:
I would also argue that it is Bioconductor's current policy not to store such 
large data directly in a package and to use the hub interface instead.  Large 
data files often aren't necessary for an end user and, as you say, are only used 
for examples.  Smaller files are often sufficient for a proof-of-principle use of 
a package, and users may not want to run the examples, in which case the data 
would be wasted space on their local machines.




Lori Shepherd - Kern
Bioconductor Core Team
Roswell Park Comprehensive Cancer Center
Department of Biostatistics & Bioinformatics
Elm & Carlton Streets
Buffalo, New York 14263

________________________________
From: Bioc-devel <bioc-devel-boun...@r-project.org> on behalf of Park, Adam Keebum 
<sein.p...@psu.edu>
Sent: Saturday, January 21, 2023 3:24 PM
To: bioc-devel@r-project.org <bioc-devel@r-project.org>
Subject: [Bioc-devel] 5mb limit for packfiles in .git is too harsh

Dear community,

First, I want to thank Nathan for his amazing help with my two previous inquiries. 
His answers led me to pinpoint the issue.

  After hours of analysis, I decided to remove all data files larger than 50k 
from the git history. However, this practice is not sustainable and is 
actually harmful, because it invalidates virtually all previous data files 
and hence hampers the reproducibility of previous commits, especially their 
unit tests. I am therefore leaving this message here in the hope of 
reaching the Bioconductor administrators.
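
As an illustration of the clean-up step described above, here is a minimal Python sketch that lists the blobs in a repository's history above a size threshold. It is not the procedure actually used in this thread. The function names and the reading of "50k" as 50,000 bytes are assumptions; the git commands themselves (`git rev-list --objects --all` and `git cat-file --batch-check` with a format string) are standard. The sizes reported are uncompressed object sizes, not their contribution to the packfile.

```python
import subprocess

def parse_batch_check(output, min_bytes):
    """Keep blobs of at least min_bytes from '<type> <sha> <size> <path>' lines."""
    large = []
    for line in output.splitlines():
        parts = line.split(" ", 3)  # the path may itself contain spaces
        if len(parts) < 3 or parts[0] != "blob":
            continue
        size = int(parts[2])
        path = parts[3] if len(parts) == 4 else ""
        if size >= min_bytes:
            large.append((size, parts[1], path))
    return sorted(large, reverse=True)  # largest first

def list_large_blobs(repo=".", min_bytes=50_000):
    """Scan every object reachable from any ref (requires git on PATH)."""
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    checked = subprocess.run(
        ["git", "-C", repo, "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return parse_batch_check(checked, min_bytes)
```

Calling `list_large_blobs(".")` from a package checkout would return `(size, sha, path)` tuples for the candidates; actually rewriting history to drop them is a separate and destructive step.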

  I would argue that this policy should be relaxed, at least for the git packfile. Most 
of us know that the .pack file residing in .git/objects/pack is frequently flagged 
by BiocCheck() for its large size (as in 
<https://stat.ethz.ch/pipermail/bioc-devel/2019-February/014703.html> 
or 
<https://stat.ethz.ch/pipermail/bioc-devel/2020-October/017273.html>), 
which is natural given the purpose of packfiles: storing the entire history, 
including removed files, in a single compact file 
<https://git-scm.com/book/en/v2/Git-Internals-Packfiles#:~:text=The%20packfile%20is%20a%20single,seek%20to%20a%20specific%20object.>.
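
For readers who want to see where their own repository stands, the following Python sketch reports the size of each .pack file under .git/objects/pack and flags those above a limit. It is only an approximation of the idea, not how BiocCheck implements its check. The function names are hypothetical, and reading the "5mb" limit as 5 * 1024 * 1024 bytes is an assumption.

```python
from pathlib import Path

# Assumed interpretation of the "5mb" limit discussed in this thread.
LIMIT_BYTES = 5 * 1024 * 1024

def packfile_sizes(repo):
    """Map each .pack file name under .git/objects/pack to its size in bytes."""
    pack_dir = Path(repo) / ".git" / "objects" / "pack"
    if not pack_dir.is_dir():
        return {}
    return {p.name: p.stat().st_size for p in sorted(pack_dir.glob("*.pack"))}

def oversized_packfiles(repo, limit=LIMIT_BYTES):
    """Return only the packfiles whose size exceeds the limit."""
    return {name: size
            for name, size in packfile_sizes(repo).items()
            if size > limit}
```

Running `oversized_packfiles(".")` in a package checkout returns an empty dict when every packfile is within the assumed limit.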
  Compressing the whole git history into one file is effective only as long as 
the majority of deltas are, for example, line-based changes in text source files. 
In my experience, however, modifications to binary data files contributed much 
more, because the delta is inflated when a dataset is stored compressed and a 
modification scrambles its bit pattern. Such changes were still 
kilobyte-level, but they gradually grew the whole packfile, so I had to 
remove those files. The current policy therefore forces the deletion of 
kilobyte-sized files from the git history, not just 'large' files.
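
The effect described above can be sketched in Python under some loud assumptions. The table below is synthetic. LZMA (xz) stands in for whatever compression the data files use. The delta cost is approximated by deflating the new version with the old version as a preset dictionary, which is only a rough proxy for git's own delta encoding. All names are hypothetical.

```python
import lzma
import zlib

def make_table(changed_rows=()):
    """Build a small synthetic CSV-like table; a few rows can be altered."""
    rows = []
    for i in range(2000):
        value = (i * 7919) % 1000
        if i in changed_rows:
            value += 12345  # simulate an edited measurement
        rows.append(f"{i},{value}\n")
    return "".join(rows).encode()

def delta_size(base, new):
    """Bytes needed to encode `new` when `base` is already known (proxy)."""
    comp = zlib.compressobj(level=9, zdict=base)
    return len(comp.compress(new) + comp.flush())

def compare_deltas():
    old = make_table()
    new = make_table(changed_rows={10, 700, 1500})
    return {
        # versions stored as plain text: most of `new` is copied from `old`
        "raw_delta": delta_size(old, new),
        # versions stored compressed: the byte streams diverge after the edit
        "compressed_delta": delta_size(lzma.compress(old), lzma.compress(new)),
        "compressed_size": len(lzma.compress(new)),
    }
```

In this sketch, `compare_deltas()["compressed_delta"]` should come out close to the full compressed size, while `"raw_delta"` stays small. That matches the observation that each small edit to a compressed dataset costs the packfile roughly a whole new copy.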
  I might not be the only one using multiple 100 KB experimental data files in 
unit tests and vignettes. Including dozens of such files in a 5 MB package 
might be acceptable. I believe the same should hold for the packfile, because it 
just represents a collection of previous files, each of which is still smaller 
than 5 MB. I suggest that the policy relax this file size limit to allow safer 
and more reproducible developer practices.

Sincerely,
Adam.



_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


