I definitely like the name "datalad" (https://www.thesaurus.com/browse/lad).
I suggest collecting more input/use cases over the weekend and then seeing how this goes. My immediate thought is that we describe our alternatives and try them all on a machine that we share.

On 03.09.20 21:53, Yaroslav Halchenko wrote:
> On Thu, 03 Sep 2020, Steffen Möller wrote:
>
>> I looked at datasets.datalad.org. I could well imagine using your
>> technology for other (larger) databases like Pfam or UniProt or PDB. For
>> cute little genomes my initial reaction was that I felt overwhelmed.
>> Your pointer will certainly help to define what we want. Many thanks!
> FWIW, a few more notes since you seem to be interested ;): we do
> have an elderly https://github.com/datalad/datalad-crawler/ "origin of
> it all" but now just an extension to datalad which allows for efficient
> "updates" and crawling of external resources. See e.g. this
> asciinema/script: https://www.datalad.org/for/data-consumers
>
> But in many use cases a straight "datalad addurls" command
> (http://docs.datalad.org/en/stable/generated/man/datalad-addurls.html
> part of handbook with an example:
> http://handbook.datalad.org/en/latest/usecases/HCP_dataset.html?highlight=addurls#dataset-creation-with-datalad-addurls)
> could be sufficient to "quickly" (depending on bandwidth and/or whether
> you use the --fast option) populate a datalad dataset with files specified
> in a spreadsheet/structured records.
>
> So if you have some kind of .json or .csv/.tsv with records -- you could
> try it quickly. addurls also automagically adds columns as git-annex
> metadata per each file so someone could "toy around" (so far I
> underused the feature) with "git annex views":
> https://git-annex.branchable.com/git-annex-view/
> or later to facilitate metadata extraction/aggregation/search.
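A minimal sketch of the addurls workflow described above, assuming a table with the hypothetical columns `name`, `db` and `url` (the records and example.org URLs are placeholders, not real download locations). It only writes the table and assembles the `datalad addurls URL-FILE URL-FORMAT FILENAME-FORMAT` call; it does not run datalad, and it assumes the command would later be executed inside an existing dataset.

```python
import csv
import shlex
from pathlib import Path

# Hypothetical records; the URLs and column names are placeholders.
RECORDS = [
    {"name": "P12345", "db": "uniprot", "url": "https://example.org/uniprot/P12345.fasta"},
    {"name": "PF00001", "db": "pfam", "url": "https://example.org/pfam/PF00001.hmm"},
]


def write_table(path, records):
    """Write the records as CSV; the header row supplies the column names."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(records[0]))
        writer.writeheader()
        writer.writerows(records)
    return Path(path)


def build_addurls_command(table, url_format="{url}",
                          filename_format="{db}/{name}", fast=False):
    """Return the argument list for a `datalad addurls` call.

    The {placeholders} in both format strings refer to column names
    of the table.
    """
    command = ["datalad", "addurls"]
    if fast:
        # --fast registers the URLs without downloading the content.
        command.append("--fast")
    command += [str(table), url_format, filename_format]
    return command


if __name__ == "__main__":
    table = write_table("records.csv", RECORDS)
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_addurls_command(table, fast=True)))
```

Printing the command rather than executing it keeps the sketch usable on a machine without datalad installed; passing the returned list to `subprocess.run` would be the obvious next step once the table looks right.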
>
> A sample dataset (original cause for addurls to be written) is available
> from http://datasets.datalad.org/?dir=/labs/openneurolab/metasearch if
> you decide to explore (that data is open so no authorization for
> access would be needed).
>
> The problem you might encounter in your cases is (not that great)
> scalability of git/git-annex to contain hundreds of files in a
> single repo. So you might want to split them into subdatasets (git
> submodules) or provide custom views as I had mentioned before.
>
> addurls makes it easy by establishing a subdataset whenever it
> encounters // (instead of /) for path separation in the provided
> filename.
>
> PS I shut up now ;) Sorry for the flood of info. We are just very
> excited for DataLad even though we have been working on it for over 6
> years and should be sick of it and git-annex by now ;-) but we are
> not!
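To illustrate the `//` convention mentioned at the end of the quoted message, here is a small pure-Python sketch of how a filename could be split into subdataset boundaries and a final file path. This is only an illustration of the idea, not datalad's implementation; the function names and example filenames are hypothetical, and treating repeated `//` as nested subdatasets is an assumption.

```python
from collections import defaultdict


def split_subdatasets(filename):
    """Split a filename at each '//' into subdataset paths and a file path.

    Returns (subdatasets, path): every '//' marks the end of a subdataset,
    and 'path' is the filename with '//' collapsed to '/'.
    """
    parts = filename.split("//")
    subdatasets = ["/".join(parts[:i]) for i in range(1, len(parts))]
    return subdatasets, "/".join(parts)


def group_by_subdataset(filenames):
    """Group file paths by the innermost subdataset that would hold them."""
    groups = defaultdict(list)
    for filename in filenames:
        subdatasets, path = split_subdatasets(filename)
        # Files without any '//' stay in the top-level dataset ('.').
        innermost = subdatasets[-1] if subdatasets else "."
        groups[innermost].append(path)
    return dict(groups)


if __name__ == "__main__":
    names = [
        "uniprot//human/P12345.fasta",
        "pfam//PF00001.hmm",
        "README.txt",
    ]
    for dataset, files in group_by_subdataset(names).items():
        print(dataset, files)
```

With a filename format such as `{db}//{name}`, each database would end up in its own subdataset, which is one way to keep the number of files per repository small.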

