Currently the daemon implements a simple yet efficient way to deduplicate files that are identical across store items. The /gnu/store/.links directory contains hard links to files in the store; each link is named after the base32-encoded SHA256 hash of the file's contents. When the daemon adds a new file to the store, it checks /gnu/store/.links to see whether an identical file is already in the store, and if so makes a hard link to it instead of storing another copy.
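The mechanism can be sketched roughly as follows (a Python sketch, not the daemon's actual C++ code; note that the real daemon uses Nix's own base32 alphabet rather than the standard RFC 4648 one used here, and the links directory path is hard-coded for illustration):

```python
import base64
import hashlib
import os

LINKS_DIR = "/gnu/store/.links"  # illustrative constant

def content_hash(path):
    """SHA256 of the file's contents, base32-encoded.  (Standard RFC 4648
    alphabet here; the daemon actually uses Nix's base32 variant.)"""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return base64.b32encode(h.digest()).decode().lower().rstrip("=")

def deduplicate(path, links_dir=LINKS_DIR):
    """Replace PATH with a hard link to its content-addressed copy under
    LINKS_DIR, registering that copy first if it is not already there."""
    link = os.path.join(links_dir, content_hash(path))
    if not os.path.exists(link):
        os.link(path, link)       # first occurrence: register its contents
    else:
        tmp = path + ".tmp"
        os.link(link, tmp)        # identical contents already known:
        os.replace(tmp, path)     # atomically swap in the hard link
```

After running this over two bit-identical files, both end up sharing a single inode with the entry in the links directory, so the data is stored only once.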
When installing, say, two different variants of texlive, which in practice are 90% bit-identical, a lot of deduplication happens. However, we still end up downloading the whole texlive archive just to realize that we already have most of its files in the store.

A solution to this would be to change the HTTP substitute protocol: ‘guix publish’ could serve content-addressed files. For instance, http://example.org/1ghws12lrp62vvxxxqmxp7jgxv2p18ihiyq420ag77nh9bw5qsfg.file would serve the contents of the store file that has the given hash. The archive format would have to be different from the one currently implemented by ‘write-file’: for regular files, ‘write-contents’ would simply write the hash of the contents, and it would be up to the substituter to go fetch that file if it's not already in the store (which it can determine by looking the hash up in /gnu/store/.links).

This is not very sophisticated, but it has the advantage of being relatively easy to implement in Guix itself. The downside is that Hydra would most likely not implement this new protocol (which would give us another incentive to move away from it).

Thoughts? Patches? :-)

Ludo’.

PS: Title inspired by <http://www.sansbullshitsans.com/>.
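For concreteness, the substituter side of the protocol sketched above might look roughly like this (a hypothetical Python sketch, not the Guix API; the function name, the manifest shape, and the links-directory parameter are all assumptions of this illustration):

```python
import os
import shutil
import urllib.request

def substitute_file(store_path, hash_str, links_dir, server_url):
    """Materialize STORE_PATH whose archive entry carried only HASH_STR
    instead of the file's contents: fetch SERVER_URL/HASH_STR.file only
    if no identical file is already registered under LINKS_DIR, then
    hard-link it into place.  Illustrative sketch, not real Guix code."""
    link = os.path.join(links_dir, hash_str)
    if not os.path.exists(link):
        # Not in the store yet: download the content-addressed file
        # and register it in the links directory.
        url = "{}/{}.file".format(server_url, hash_str)
        with urllib.request.urlopen(url) as response:
            with open(link, "wb") as out:
                shutil.copyfileobj(response, out)
    os.link(link, store_path)  # create the store file as a hard link
```

The point of the sketch is the `os.path.exists` check: for the 90% of texlive files already present, no HTTP request is made at all, only a hard link.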