Hi Ludo,

On Thu, 16 Mar 2023 at 18:45, Ludovic Courtès <l...@gnu.org> wrote:
> Thanks for starting this discussion!

I feel this discussion is still pending, so I am resuming. :-)  If
context is missing, the thread starts here:

    intrinsic vs extrinsic identifier: toward more robustness?
    Simon Tournier <zimon.touto...@gmail.com>
    Fri, 03 Mar 2023 19:07:23 +0100
    id:87jzzxd7z8....@gmail.com
    https://lists.gnu.org/archive/html/guix-devel/2023-03
    https://yhetil.org/guix/87jzzxd7z8....@gmail.com

> Sources (fixed-output derivations) are already content-addressed, by
> definition (I prefer “content addressing” over “intrinsic
> identification” because that’s a more widely recognized term).

From my understanding, this is correct only when the sources live on
the Guix project infrastructure.  I agree that if the source is
substitutable (i.e., the source exists on one of the substitute
servers run by the Guix project), then the fixed-output derivation is
content-addressed.  For instance, let us consider this fixed-output
derivation:

--8<---------------cut here---------------start------------->8---
Derive
([("out","/gnu/store/n1k6jppyasn20zr6m8sfyv5ll07ibyf1-asciidoc-8.6.10.tar.gz",
   "sha256","9e52f8578d891beaef25730a92a6e723596ddbd07bfe0d2a56486fcf63a0b983")]
,[]
,["/gnu/store/5iw2ivjw5njyyvi7avyphfcibgbqdbsc-mirrors",
  "/gnu/store/vwyxp1dq4lb97n6b20w5cqxasy2dai79-content-addressed-mirrors"]
,"x86_64-linux","builtin:download",[]
,[("content-addressed-mirrors",
   "/gnu/store/vwyxp1dq4lb97n6b20w5cqxasy2dai79-content-addressed-mirrors")
,("impureEnvVars","http_proxy https_proxy LC_ALL LC_MESSAGES LANG COLUMNS")
,("mirrors","/gnu/store/5iw2ivjw5njyyvi7avyphfcibgbqdbsc-mirrors")
,("out","/gnu/store/n1k6jppyasn20zr6m8sfyv5ll07ibyf1-asciidoc-8.6.10.tar.gz")
,("preferLocalBuild","1")
,("url","\"https://github.com/asciidoc/asciidoc/archive/8.6.10.tar.gz\"")])
--8<---------------cut here---------------end--------------->8---

I agree that the “url” field is unneeded as long as the content is
available via the “content-addressed-mirrors” list.
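To make the distinction concrete, here is a hypothetical Python sketch
(not Guix code; the helper name is made up) of what “content-addressed”
means for such a source: the identifier is derived from the bytes
themselves, so any server holding the same bytes can serve them,
regardless of the original URL.

```python
import hashlib

def content_address(data: bytes, algo: str = "sha256") -> str:
    """Derive an identifier from the bytes themselves (hypothetical
    helper; Guix actually encodes the digest in nix-base32, hex is
    used here only for readability)."""
    return f"{algo}:{hashlib.new(algo, data).hexdigest()}"

tarball = b"pretend these are the bytes of asciidoc-8.6.10.tar.gz"

# Two servers holding the *same* bytes yield the same address, no
# matter which URL the bytes were fetched from:
addr_from_github = content_address(tarball)
addr_from_mirror = content_address(tarball)
assert addr_from_github == addr_from_mirror

# Different bytes (e.g., a regenerated tarball) yield a different
# address, even when served from the very same "url" field:
assert content_address(b"regenerated, different bytes") != addr_from_github
```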
If one opens that file, the code reads:

--8<---------------cut here---------------start------------->8---
(begin
  (use-modules (guix base32))
  (define (guix-publish host)
    (lambda (file algo hash)
      (string-append "https://" host "/file/" file "/"
                     (symbol->string algo) "/"
                     (bytevector->nix-base32-string hash))))
  (module-autoload! (current-module)
                    (quote (guix base16))
                    (quote (bytevector->base16-string)))
  (list (guix-publish "ci.guix.gnu.org")
        (lambda (file algo hash)
          (string-append "https://tarballs.nixos.org/"
                         (symbol->string algo) "/"
                         (bytevector->nix-base32-string hash)))
        (lambda (file algo hash)
          (string-append
           "https://archive.softwareheritage.org/api/1/content/"
           (symbol->string algo) ":"
           (bytevector->base16-string hash) "/raw/"))))
--8<---------------cut here---------------end--------------->8---

Therefore, the look-up is content-addressed, via these 3 servers.

> In a way, like Maxime way saying, the URL/URI is just a hint; what
> matters it the content hash that appears in the origin.

However, from my understanding, it is incorrect to speak about content
addressing when the source (fixed-output derivation) does not exist,
for whatever reason, on any of the substitute servers.  The URL/URI is
then not “just a hint”: it *is* the location from which the data are
fetched, and it is not content-addressed.  If I am incorrect, could
you please explain?

Please note that if only one source is missing, then the whole castle
falls down.  Said otherwise, robustness means hunting down the corner
cases. :-)

If I want to time-machine to d63ee94d63c667e0c63651d6b775460f4c67497d
from Sat Jan 4 2020, and I need Git, then it fails because:

    sha256 hash mismatch for
    /gnu/store/n1k6jppyasn20zr6m8sfyv5ll07ibyf1-asciidoc-8.6.10.tar.gz:
      expected hash: 10xrl1iwyvs8aqm0vzkvs3dnsn93wyk942kk4ppyl6w9imbzhlly
      actual hash:   1sh341j7ripkdb2wn6yf3rciln8ll89351b3d55gpkj89wypkmi2

Game over. )-:

Do we share the same understanding?
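For what it is worth, the content-addressed-mirrors file quoted above
can be rendered in Python as follows.  This is a hypothetical sketch:
it assumes the hash strings are already encoded (nix-base32 for the
first two servers, hexadecimal for Software Heritage), whereas the
real Guile code encodes raw bytevectors itself.

```python
def guix_publish(host):
    """URLs in the style of 'guix publish':
    https://<host>/file/<name>/<algo>/<nix-base32-hash>"""
    return lambda file, algo, nix32: (
        f"https://{host}/file/{file}/{algo}/{nix32}")

def nixos_tarballs(file, algo, nix32):
    # Note: the file name plays no role; only the hash does.
    return f"https://tarballs.nixos.org/{algo}/{nix32}"

def swh_content(file, algo, hex_hash):
    # Software Heritage addresses raw content by algo:hex-digest.
    return ("https://archive.softwareheritage.org/api/1/content/"
            f"{algo}:{hex_hash}/raw/")

CONTENT_ADDRESSED_MIRRORS = [
    guix_publish("ci.guix.gnu.org"),
    nixos_tarballs,
    swh_content,
]

# The downloader walks this list and tries each candidate URL in
# turn; the original "url" field of the derivation never enters the
# computation ("0placeholder0" stands in for a real encoded hash):
candidates = [mirror("asciidoc-8.6.10.tar.gz", "sha256", "0placeholder0")
              for mirror in CONTENT_ADDRESSED_MIRRORS]
```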
> What’s missing, both in SWH and in Guix, is the ability to store
> multiple hashes.  SWH could certainly store several hashes, computed
> using different serialization and hash algorithm combinations.

[...]

> The other option—storing multiple hashes for each origin in Guix—doesn’t
> sound practical: I can’t imagine packages storing and updating more than
> one content hash per package.  That doesn’t sound reasonable.  Plus it
> would be a long-term solution and wouldn’t help today.

Yes, the core question is where to store the database mapping these
multiple hashes.  Software Heritage (SWH) is one option, although:

 1. it has not been discussed yet how the Nar hashes would be publicly
    exposed, if at all; and
 2. it is unclear whether SWH will implement a resolver from Nar hash
    to SWHID.

On the other hand, on the Guix side, we are already building a
database mapping multiple hashes: the Disarchive database. :-)  The
question with the Disarchive database is its redundancy, IMHO.
Concretely, if disarchive.guix.gnu.org is down, game over.  I wish a
long life to the Guix project :-) but it seems more robust to me to
have a counter-measure.

The big picture is: if I publish a paper detailing some numerical
processing done with Guix, then having a Guix installation at hand
should be the only requirement for redoing that processing.

Last, please note that Guix already stores multiple hashes for some
origins.  This is the case for the ’git-fetch’ method, for example:
all the packages pinned to a plain Git commit are effectively storing
two content-addressed hashes (Git and Nar).  If one needs recent
examples of how upstreams can mishandle their mutable Git tags:

    bug#66015: Removal of python-pyxel
    Simon Tournier <zimon.touto...@gmail.com>
    Fri, 15 Sep 2023 21:09:59 +0200
    id:874jjv9rso....@gmail.com
    https://issues.guix.gnu.org/66015
    https://issues.guix.gnu.org/msgid/874jjv9rso....@gmail.com
    https://yhetil.org/guix/874jjv9rso....@gmail.com

and

    [bug#66013] [PATCH 0/4] gnu: bap, python-glcontext: Fix hash and update.
    Simon Tournier <zimon.touto...@gmail.com>
    Fri, 15 Sep 2023 20:38:34 +0200
    id:cover.1694800551.git.zimon.touto...@gmail.com
    https://issues.guix.gnu.org/66013
    https://issues.guix.gnu.org/msgid/cover.1694800551.git.zimon.touto...@gmail.com
    https://yhetil.org/guix/cover.1694800551.git.zimon.touto...@gmail.com

All in all, I think we would gain robustness if the Guix I am running
implemented some builtin content-addressing features of its own,
instead of relying on external databases.  It is not yet clear to me
how exactly, hence this discussion. :-)  Another angle on the
multiple-hashes problem is enabling the use of IPFS, GNUnet, and
friends.

(I leave aside “long-term vs today” because the time frame I am
interested in is: “guarantees” that I will be able to redo, 3 years
from now, what I am doing in the very near future.  So right now I am
trying to redo something from 3 years back, in order to spot the
potential problems and fix or improve things.  I do not really care
about redoing the Guix of 3 years ago for its own sake, because
almost no one published papers using Guix 3 years ago. ;-)  Guix is
becoming popular in scientific contexts, yeah!  So my interest in
this robustness is for when Guix is just a bit more popular.)

Cheers,
simon
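P.S: To illustrate the point about ’git-fetch’ origins carrying two
content-addressed identifiers, here is a hypothetical Python sketch.
The Git object-ID formula below is the real one; the Nar hash is
replaced by a plain SHA-256 stand-in, since the real one serializes a
whole file tree into the nar format before hashing.

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    # Git's actual content address for a file: SHA-1 over a
    # "blob <size>\0" header followed by the raw bytes.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

def nar_like_hash(data: bytes) -> str:
    # Stand-in for Guix's nar hash: here, just SHA-256 of the bytes.
    return hashlib.sha256(data).hexdigest()

source = b"(define answer 42)\n"

# The same content ends up pinned by two independent content-addressed
# identifiers; if one scheme or database becomes unusable, the other
# can still vouch for the content.
print("git: ", git_blob_id(source))
print("nar-ish:", nar_like_hash(source))
```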