Re: How to package human, mouse and viral genomes?

Steffen Möller Thu, 03 Sep 2020 13:34:00 -0700

I definitely like the name "datalad" (https://www.thesaurus.com/browse/lad).


I suggest we collect more input/use cases over the weekend and then see
how this goes. My immediate thought is that we describe our alternatives
and try them all on a machine that we share.

On 03.09.20 21:53, Yaroslav Halchenko wrote:
> On Thu, 03 Sep 2020, Steffen Möller wrote:
>
>> I looked at datasets.datalad.org. I could well imagine using your
>> technology for other (larger) databases like Pfam or UniProt or PDB. For
>> cute little genomes my initial reaction was that I felt overwhelmed.
>> Your pointer will certainly help to define what we want. Many thanks!
> FWIW, a few more notes since you seem to be interested ;):  we do
> have the elderly https://github.com/datalad/datalad-crawler/, the
> "origin of it all" but now just an extension to datalad, which allows
> for efficient "updates" and crawling of external resources.  See e.g.
> this asciinema/script: https://www.datalad.org/for/data-consumers
>
> But in many use cases a straight "datalad addurls" command
> (http://docs.datalad.org/en/stable/generated/man/datalad-addurls.html;
> the handbook has an example:
> http://handbook.datalad.org/en/latest/usecases/HCP_dataset.html?highlight=addurls#dataset-creation-with-datalad-addurls)
> could be sufficient to "quickly" (depending on bandwidth and/or whether
> you use the --fast option) populate a datalad dataset with files
> specified in a spreadsheet/structured records.
>
> So if you have some kind of .json or .csv/.tsv with records -- you could
> try it quickly.  addurls also automagically adds columns as git-annex
> metadata for each file, so someone could "toy around" (so far I have
> underused the feature) with "git annex views":
> https://git-annex.branchable.com/git-annex-view/
> or later use it to facilitate metadata extraction/aggregation/search.
>
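
A minimal sketch of the spreadsheet route described above, assuming we
are inside an existing dataset (made with "datalad create") and have a
table with one row per genome. The column names, file names and
example.org URLs are placeholders I made up for illustration; only
"datalad addurls", its three positional arguments (table, URL format,
filename format) and the --fast option come from the documentation
linked above. The script only writes the table and assembles the
command; it does not run datalad.

```python
import csv

# Hypothetical records: species, names and URLs are placeholders, not real downloads.
RECORDS = [
    {"species": "human", "name": "genome_a", "url": "https://example.org/human/genome_a.fa.gz"},
    {"species": "mouse", "name": "genome_b", "url": "https://example.org/mouse/genome_b.fa.gz"},
    {"species": "virus", "name": "genome_c", "url": "https://example.org/virus/genome_c.fa.gz"},
]


def write_url_table(path, records):
    """Write the records to a CSV file; the header names become addurls placeholders."""
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["species", "name", "url"])
        writer.writeheader()
        writer.writerows(records)
    return path


def addurls_command(table, fast=True):
    """Assemble (but do not execute) a `datalad addurls` call as an argv list."""
    cmd = ["datalad", "addurls"]
    if fast:
        # --fast registers the URLs without downloading the content first.
        cmd.append("--fast")
    # Positional arguments: URL-FILE, URL-FORMAT, FILENAME-FORMAT.
    cmd += [str(table), "{url}", "{species}/{name}.fa.gz"]
    return cmd
```

Calling write_url_table("genomes.csv", RECORDS) and passing the result
of addurls_command("genomes.csv") to subprocess.run would then populate
the dataset, provided datalad is installed.
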
> A sample dataset (original cause for addurls to be written) is available
> from http://datasets.datalad.org/?dir=/labs/openneurolab/metasearch  if
> you decide to explore (that data is open so no authorization for
> access would be needed).
>
> The problem you might encounter in your case is the (not that great)
> scalability of git/git-annex to contain hundreds of files in a
> single repo.  So you might like splitting them into subdatasets (git
> submodules) or providing custom views as I had mentioned before.
>
> addurls makes it easy by establishing a subdataset whenever it
> encounters // (instead of /) for path separation in the provided
> filename.
>
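
To make the "//" convention concrete for our case, here is a small
sketch of how I read it: each "//" in the filename marks a subdataset
boundary, and the file itself ends up under the ordinary "/" path. This
is only an illustration of the convention as described above, not
datalad's own code; the function name and the filename format are
hypothetical.

```python
def split_subdatasets(filename):
    """Split an addurls-style filename on '//' into (subdataset paths, file path)."""
    parts = filename.split("//")
    # Every prefix that ends at a '//' is a dataset boundary.
    boundaries = ["/".join(parts[:i]) for i in range(1, len(parts))]
    # The file's final location uses plain '/' separators throughout.
    return boundaries, "/".join(parts)


# Hypothetical filename format: one subdataset per species.
FILENAME_FORMAT = "{species}//{name}.fa.gz"
record = {"species": "mouse", "name": "genome_b"}
subdatasets, path = split_subdatasets(FILENAME_FORMAT.format(**record))
```

With such a format each species would live in its own git submodule,
which keeps the number of files per repository down.
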
> PS I shut up now ;) Sorry for the flood of info.  We are just very
> excited about DataLad even though we have been working on it for over 6
> years and should be sick of it and git-annex by now ;-)  but we are
> not!
>
