Hi,

Thank you for this food for thought.
I agree that the frontier between code and data is arbitrary. However, I am not sure I get the full picture about data management in the context of Reproducible Science. What is the issue? So, I take up your invitation to explore your idea. :-)

Let us think about the old lab experiment. On one hand, you have your protocol and the description of all the steps. On the other hand, you have measurements and results. I can imagine some bit-to-bit mechanism for the protocol part; I am not sure about the measurements part. Well, the protocol is code or workflow; the measurements are data. And I agree that, e.g., information about electronic orbitals or the weights of a trained neural network is sometimes part of the protocol. :-)

For me, just talking about code, it is not a straightforward task to define the properties of a reproducible and fully controlled computational environment. It is -- I guess -- what Guix is defining (transactional, user profiles, hackable, etc.). Then, it appears to me even more difficult for data. What are such properties for data management? In other words, on paper, what are the benefits of managing some piece of data in the store? For example, the weights of a trained neural network, or the positions of the atoms in a protein structure.

For me -- maybe I am wrong -- the way to go is to define a package (or workflow) that fetches the data from some external source, cleans it if needed, does some checks, and then puts it in /path/to/somewhere/ outside the store; see the sketch in the P.S. below. In parallel computing, this /path/to/somewhere/ is accessible by all the nodes. Moreover, this /path/to/somewhere/ contains something hash-based in the folder name. Is that not enough? Why do you need the history of changes, as Git provides? Secrets are another story than the reproducible science toolchain, I guess.

Thank you again.

All the best,
simon
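P.S. To be concrete about what I mean by "fetch, clean, check, then put under a hash-based path outside the store", here is a minimal Python sketch. Everything in it is a placeholder of my own (the URL, the expected checksum, the cleaning step); the only point is that the target folder name encodes the content hash, so any node can tell whether it already holds the right data.

import hashlib
import pathlib
import urllib.request

DATA_URL = "https://example.org/dataset.csv"        # hypothetical source
EXPECTED_SHA256 = "0" * 64                          # known checksum of the raw file (placeholder)
SHARED_BASE = pathlib.Path("/path/to/somewhere")    # directory visible from all compute nodes

def fetch(url: str) -> bytes:
    """Download the raw data."""
    with urllib.request.urlopen(url) as response:
        return response.read()

def check(raw: bytes) -> None:
    """Fail loudly if the raw data is not bit-for-bit what we expect."""
    digest = hashlib.sha256(raw).hexdigest()
    if digest != EXPECTED_SHA256:
        raise ValueError(f"checksum mismatch: {digest}")

def clean(raw: bytes) -> bytes:
    """Placeholder cleaning step: normalise line endings."""
    return raw.replace(b"\r\n", b"\n")

def store(cleaned: bytes) -> pathlib.Path:
    """Put the cleaned data under a folder named after its own hash."""
    digest = hashlib.sha256(cleaned).hexdigest()
    target_dir = SHARED_BASE / f"dataset-{digest[:16]}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "dataset.csv"
    target.write_bytes(cleaned)
    return target

if __name__ == "__main__":
    raw = fetch(DATA_URL)
    check(raw)
    print("stored at", store(clean(raw)))

Once the data sits there, the workflow only needs the hash-named path, not a Git history of changes.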