Hi EasyBuilders,
recently I stumbled upon the task of downloading Google's Open Images Dataset V4 [1]. It's a dataset created for training and validating image recognition machine learning models. Pretty much everyone in the ML field downloads at least one such dataset, e.g. ImageNet [2]. There are many other datasets that are shared among virtually every scientist in a given field; another example is the Copernicus datasets for Earth sciences [3].

In that sense, a dataset is a tool: it's no different from Boost or NumPy or GROMACS. Given that everyone uses them the same way, and that they tend to be massive ([1] is 19 TB, and it's on the small side), it makes a lot of sense, administratively, to install them once, byte-identical, and share them among all users. On the supercomputers of research institutions, that would mean exposing them as a module one can load.

Does that make sense to you? To have easyconfigs which "install" (i.e. download and unpack) datasets in standard locations on the system, reproducible across systems? What do you think?

Thanks for the attention, and merry xmas :-)

[1] https://storage.googleapis.com/openimages/web/index.html
[2] http://www.image-net.org
[3] https://www.copernicus.eu/en
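For what it's worth, here's a rough sketch of what such a dataset easyconfig could look like, assuming the generic Tarball easyblock and the system toolchain. The name, version, URLs, checksums, and directory layout below are all illustrative placeholders, not a working recipe:

```python
# Hypothetical easyconfig sketch for a dataset "installation".
# Everything below is illustrative; a real recipe would need actual
# download URLs, checksums, and the dataset's true directory layout.
easyblock = 'Tarball'

name = 'OpenImages'
version = '4'

homepage = 'https://storage.googleapis.com/openimages/web/index.html'
description = "Google's Open Images Dataset V4, unpacked into a shared, module-loadable location"

# datasets are compiler-independent, so no real toolchain is needed
toolchain = SYSTEM

source_urls = ['https://example.org/openimages/v4/']  # placeholder
sources = ['openimages-v4.tar.gz']                    # placeholder
checksums = ['0' * 64]                                # placeholder SHA-256, for reproducibility

# sanity check: verify the unpacked data actually landed where expected
sanity_check_paths = {
    'files': [],
    'dirs': ['train', 'validation', 'test'],  # illustrative layout
}

moduleclass = 'data'
```

Loading the resulting module would then just set the usual environment variables pointing at the shared, read-only copy of the data.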

