You might like to listen to the DebConf20 talk on DataLad ;-)

At some point I even started to put together some kind of dh-datalad
helper, so that a .deb package would contain a DataLad dataset (a
git/git-annex repo) and would just `get` the data files upon
installation... So -- yes, such packages would not be "self contained",
but being self-contained is infeasible for any sizeable data package in
Debian anyway. But they could be versioned, point to a specific git
state of the corresponding datasets, provide lightweight and efficient
upgrades (only changed/new files would need to be fetched), etc. They
could be partitioned into smaller subdatasets, or custom views could be
provided, like we have
https://github.com/datalad-datasets/hcp-structural-preprocessed
which is a selection from the larger
https://github.com/datalad-datasets/human-connectome-project-openaccess

I never finished that helper though -- we just (develop and) use datalad
directly, and had no Debian packages which would need a strict
dependency on the datasets.

More sample datasets can be found at https://datasets.datalad.org/ --
the data primarily comes from the original repositories, and now covers
more than 200TB.

We had started to collect resources relevant to bioinformatics which
someone might like to datalad'ify:
https://github.com/datalad/datalad/milestone/14?closed=1
but since we are not in the bioinformatics field, we never actually
addressed them. I also know that https://github.com/notestaff is
actively using git-annex for bioinformatics (not sure about datalad --
but he did submit some issues, so he might). It might be worth checking
with him if it is decided to use git-annex/datalad.

On Thu, 03 Sep 2020, Steffen Möller wrote:

> Hello,
>
> We are closing in on the workflows. What is kind of missing are the
> mostly invariant inputs like the genomes of pathogens and very much so
> the reference genomes of the human, mouse, rat, worm, fly, .... you name
> them.
>
> Other than a few years ago, hard drives are now big enough to
> accommodate the one or other genome and derivative indexes. Just - I
> don't think we want to organize in our regular Debian infrastructure
> something as variant as public genome (yes, they are still regularly
> updated, very much so) and that is so very security-irrelevant (just
> some data). Also, different sites will vary a lot in where this data
> shall be organized and all those scripts should likely be
> executed/initiated as/by non-root. There are public sites for this from
> where this data can be downloaded. Any redundancy to these sites imho
> mostly hurts us.
> The other side is that to just get something up quickly
> and for reproducibility tests, our infrastructure is difficult to beat.
> Please kindly throw your ideas at me how you would like whole genomes to
> be presented by Debian to the average user and to professionals. Just
> reply to this thread and/or send me "+1"s a PM and I summarize this up
> in a document which I suggest we then talk about in a jitsi meeting.

-- 
Yaroslav O. Halchenko
Center for Open Neuroscience     http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
WWW:   http://www.linkedin.com/in/yarik

